GitOps — Audit Trail Platform (ATP)¶
Declarative deployment with Git as the source of truth — ATP GitOps ensures infrastructure and application state are versioned, auditable, and continuously reconciled across all Azure environments using Azure DevOps, AKS, and FluxCD.
Purpose & Scope¶
This document defines the GitOps deployment model for the ConnectSoft Audit Trail Platform (ATP), establishing how infrastructure and application manifests are version-controlled in Azure Repos, automatically reconciled to AKS clusters, and continuously monitored for drift with full traceability and compliance evidence using Azure-native tools and services.
What This Document Covers¶
GitOps Fundamentals:
- GitOps philosophy and core principles (declarative, versioned, pulled, reconciled)
- Comparison with traditional CI/CD (push-based) deployments
- Benefits for audit trail requirements (immutable history, compliance, security)
- Azure-native GitOps implementation patterns with FluxCD
Infrastructure & Repository Structure:
- Azure Repos structure for GitOps manifests (monorepo pattern for Kubernetes manifests)
- Branching strategies per environment (main, staging, test, dev, feature/, hotfix/)
- Access control and RBAC for Git repositories (Azure AD integration, branch policies)
- Naming conventions and versioning strategies (SemVer, Git tags, commit SHA)
FluxCD on Azure Kubernetes Service (AKS):
- FluxCD installation, bootstrap, and multi-cluster setup
- GitRepository and Kustomization custom resources for continuous reconciliation
- Azure Repos integration (SSH keys, PAT, Azure AD Workload Identity)
- Drift detection, self-healing, and automatic reconciliation loops
Declarative Manifests & Configuration:
- Kubernetes manifests (Deployments, Services, ConfigMaps, Secrets, Ingress)
- Helm charts for ATP microservices (templates, values files, dependencies, versioning)
- Kustomize overlays for environment-specific configurations (base + overlays pattern)
- Manifest validation, linting, and security policy enforcement
CI/CD Integration:
- Azure Pipelines to GitOps handoff (build → test → publish → commit manifest update)
- Automated manifest updates (image tag bumping after successful CI builds)
- Multi-service coordination and orchestration (atomic updates across services)
- Artifact metadata and provenance (SBOM, vulnerability scans, build attestations)
Secrets Management:
- Azure Key Vault integration (External Secrets Operator or CSI Driver)
- Azure AD Workload Identity for pod authentication (no credentials in Git)
- Secret rotation procedures and zero-downtime updates
- Compliance controls (SOC 2, GDPR, HIPAA) for secret handling and audit logging
Multi-Environment Deployment:
- Environment-specific configurations (dev, test, staging, production, preview, hotfix)
- Kustomize overlays and Helm values files per environment
- Resource quotas, limits, and autoscaling policies per environment
- Promotion workflows and approval gates (manual for staging/production)
Advanced Deployment Strategies:
- Rolling updates (default Kubernetes strategy with maxSurge/maxUnavailable)
- Blue-green deployments (namespace switching with traffic routing)
- Canary releases (progressive traffic shifting with Flagger)
- Preview environments (ephemeral namespaces per pull request)
- Zero-downtime deployments and rollback procedures
Security & Compliance:
- Azure Policy for Kubernetes (Pod Security Standards, network policies, resource quotas)
- Image signing and verification (Cosign, Notary, admission controllers)
- SBOM generation and vulnerability scanning (integrated with ACR)
- Audit logging and compliance evidence collection (immutable Git history)
Multi-Tenancy:
- Namespace-per-tenant isolation strategy
- Dynamic tenant provisioning and offboarding workflows
- Tenant-specific configurations and resource quotas
- Cost allocation and compliance enforcement per tenant
Observability & Monitoring:
- Azure Monitor Container Insights integration for AKS metrics
- FluxCD metrics export to Prometheus/Grafana for reconciliation monitoring
- Deployment tracking and DORA metrics (deployment frequency, lead time, MTTR, change failure rate)
- Alerting on sync failures, drift detection, and health check failures
Day 2 Operations:
- Troubleshooting GitOps issues (sync failures, drift, image pull errors, secret access failures)
- Routine maintenance tasks (FluxCD upgrades, AKS patching, certificate renewals)
- Disaster recovery and rollback procedures (Git revert, cluster recreation from IaC)
- On-call runbooks and escalation paths
Governance & Training:
- GitOps workflow ownership and change management processes
- Developer and operations onboarding guides (Git workflow, manifest authoring)
- Best practices catalog and reference architectures
- Compliance reporting and audit evidence collection automation
Out of Scope¶
This document does NOT cover:
- Kubernetes fundamentals — See `infrastructure/kubernetes.md` for AKS cluster architecture, pod design, container orchestration basics, and Kubernetes API concepts
- Azure Pipelines (CI stage) — See `ci-cd/azure-pipelines.md` for build, test, security scanning, artifact publishing, and quality gate enforcement
- Quality gates — See `ci-cd/quality-gates.md` for test coverage thresholds, security scanning policies, and compliance gate enforcement
- Infrastructure provisioning (non-Kubernetes) — See `infrastructure/pulumi.md` for Azure SQL, Service Bus, Storage, Key Vault, and other PaaS resource provisioning
- Application architecture — See `architecture/hld.md` for ATP service design, domain models, business logic, and system architecture
- Service-specific deployment details — See individual service documentation in `planning/core-services/` for service-specific configuration, dependencies, and operational characteristics
- Observability implementation — See `operations/observability.md` for OpenTelemetry instrumentation, metrics collection, distributed tracing, and log aggregation
- Backup and restore procedures — See `operations/backups-restore-ediscovery.md` for data backup strategies, disaster recovery, and eDiscovery procedures
Readers & Ownership¶
Primary Readers:
- Platform Engineers: Implement GitOps workflows, configure FluxCD, author Kubernetes manifests, manage GitOps repository structure
- DevOps Engineers: Integrate Azure Pipelines with GitOps, automate manifest updates, implement promotion workflows, troubleshoot CI/CD handoff
- SRE Team: Monitor FluxCD reconciliation, respond to drift alerts, execute rollback procedures, perform incident response, conduct DR drills
- Security Team: Review security policies, validate RBAC configurations, enforce Pod Security Standards, audit secret management, ensure compliance
- Developers: Understand GitOps workflow, submit manifest changes via pull requests, test changes in preview environments, troubleshoot deployment issues
- Compliance Officers: Validate audit trail completeness, review deployment approvals, ensure evidence collection for SOC 2/GDPR/HIPAA audits
Document Owner: Platform Engineering Team
Technical Reviewers: SRE Lead, Cloud Architect, Security Officer
Compliance Reviewer: Compliance Officer (for SOC 2/GDPR/HIPAA sections)
Approval Authority: CTO
Last Reviewed: 2024-10-30
Next Review: 2025-Q2 (after Cycle 6 completion — multi-environment observability)
Review Frequency: Quarterly (or after significant GitOps workflow changes)
Artifacts Produced¶
By following this document, teams will produce the following artifacts and deliverables:
1. GitOps Repository (atp-gitops in Azure Repos):
- Declarative Kubernetes manifests for all 7 ATP microservices
- Helm charts with templates, values files, and dependency specifications
- Kustomize base manifests and environment-specific overlays (dev, test, staging, production)
- FluxCD bootstrap configuration files (GitRepository, Kustomization resources)
- Security policies (Pod Security Policies, Network Policies, Azure Policies)
- Multi-tenant namespace configurations and resource quotas
2. FluxCD Installation:
- FluxCD controllers deployed on all AKS clusters (dev, test, staging, production)
- GitRepository resources configured for Azure Repos integration (SSH/PAT authentication)
- Kustomization resources for each application and environment
- Notification controllers for alerting (Slack, Teams, Azure Monitor)
- Health assessment and drift detection configurations
3. CI/CD Integration:
- Azure Pipelines templates for GitOps handoff (build → publish → manifest update → commit)
- Automated manifest update scripts (image tag bumping, Helm values updates)
- Preview environment provisioning pipelines (ephemeral namespaces per PR)
- Multi-service coordination scripts (atomic updates across dependent services)
4. Infrastructure as Code:
- Pulumi C# programs for AKS cluster provisioning (node pools, networking, SKUs)
- Environment-specific Pulumi stack configurations (dev, test, staging, production)
- Drift detection automation (scheduled reconciliation validation)
- Disaster recovery scripts (cluster recreation from Git and IaC)
5. Secrets Management:
- External Secrets Operator or CSI Driver installation and configuration
- ClusterSecretStore resources (Azure Key Vault integration per environment)
- ExternalSecret or SecretProviderClass definitions for each application
- Azure AD Workload Identity configuration (federated credentials, ServiceAccount annotations)
- Secret rotation runbooks and automation scripts
6. Observability & Compliance:
- Azure Monitor dashboards for GitOps metrics (reconciliation status, drift events, deployment frequency)
- Grafana dashboards for FluxCD monitoring (reconciliation duration, success rate, resource health)
- Compliance evidence collection scripts (deployment receipts, approval records, Git audit trail)
- KQL queries for audit trail analysis (who deployed what, when, why)
- DORA metrics dashboard (deployment frequency, lead time for changes, MTTR, change failure rate)
7. Security & Policy Enforcement:
- Azure Policy definitions for AKS (Pod Security Standards, network policies, resource limits)
- Pod Security Admission configurations (baseline, restricted profiles)
- Network policy templates (default deny, service-to-service rules)
- Image signing workflows (Cosign signatures, admission controller verification)
- RBAC configurations (ServiceAccounts, Roles, RoleBindings per service and tenant)
8. Runbooks & Documentation:
- Troubleshooting guides (sync failures, drift resolution, image pull errors)
- Rollback procedures (simple rollback with `git revert`, complex multi-service rollbacks)
- DR test plans (cluster failure scenarios, region outage, complete platform loss)
- Developer onboarding guides (GitOps workflow, manifest authoring, PR process)
- Operations runbooks (routine maintenance, FluxCD upgrades, AKS patching)
What is GitOps?¶
Definition: GitOps is an operational framework that applies DevOps best practices—version control, collaboration, compliance, and CI/CD—to infrastructure automation and application deployment. The core principle is using Git repositories as the single source of truth for declarative infrastructure and application configurations.
Core Concept: Instead of operators running manual kubectl apply commands or CI/CD pipelines pushing changes to clusters, a GitOps agent (FluxCD, ArgoCD) running inside the Kubernetes cluster continuously pulls the desired state from Git and reconciles the actual cluster state to match it.
History & Evolution¶
GitOps emerged from the evolution of Infrastructure as Code (IaC) practices combined with Kubernetes' declarative nature:
Timeline:
| Year | Milestone | Impact on Industry | Relevance to ATP |
|---|---|---|---|
| 2010-2014 | Infrastructure as Code (IaC) emerges | Terraform, CloudFormation, Ansible enable declarative infrastructure | Foundation for declarative Azure resources |
| 2015 | Kubernetes released (v1.0) | Declarative configuration becomes standard for container orchestration | ATP targets AKS for microservice deployment |
| 2017 | Weaveworks coins "GitOps" term | Flux v1 released as first GitOps operator for Kubernetes | GitOps pattern recognized |
| 2018 | ArgoCD released by Intuit | Alternative GitOps implementation; feature-rich UI | ArgoCD evaluated (FluxCD chosen for simplicity) |
| 2019 | OpenGitOps working group formed | CNCF standardizes 4 core GitOps principles | ATP adopts OpenGitOps principles |
| 2020 | FluxCD v2 released | Complete rewrite with modular architecture (GitOps Toolkit) | ATP uses FluxCD v2 for production |
| 2021 | Flux and Argo join CNCF | GitOps becomes cloud-native standard (incubating projects) | Industry validation for ATP choice |
| 2022 | Azure Arc GitOps integration | Microsoft provides native GitOps support for AKS and Arc-enabled clusters | Azure-native GitOps validated |
| 2024 | Widespread adoption | CNCF surveys show 70%+ of production Kubernetes use GitOps | ATP joins industry leaders |
Why GitOps Now?:
- Kubernetes maturity: Declarative APIs well-established; GitOps is natural evolution
- Security focus: Zero-trust principles demand eliminating cluster credentials from CI/CD
- Compliance: Audit trail requirements favor Git's immutable history
- Cloud-native patterns: CNCF-endorsed pattern with mature tooling (FluxCD, ArgoCD)
Pull-Based vs Push-Based Deployment Models¶
Purpose: Understand the fundamental architectural difference between traditional CI/CD and GitOps deployment models.
Push-Based Deployment (Traditional CI/CD)¶
Architecture:
graph TD
A[Developer] -->|1. git push| B[Source Code<br/>Repository]
B -->|2. trigger webhook| C[CI/CD Pipeline<br/>Azure Pipelines]
C -->|3. build| D[Compile &<br/>Test]
D -->|4. publish| E[Docker Image]
E -->|5. push| F[Container<br/>Registry<br/>ACR]
C -->|6. deploy<br/>kubectl apply| G[Kubernetes<br/>Cluster]
H[Secrets<br/>Vault] -.->|credentials<br/>stored| C
style G fill:#ffcccc
style C fill:#FFE5B4
style H fill:#ffcccc
Characteristics:
- External deployment: CI/CD pipeline (running outside cluster) has direct access to Kubernetes API via kubeconfig or service account token
- Push on trigger: Deployment happens during pipeline execution (synchronous operation)
- Credentials required: Pipeline needs cluster credentials stored as secrets or service connections
- No continuous reconciliation: Cluster state checked only during deployment; drift undetected between deployments
- Secret management: Secrets often stored in CI/CD system variables (security risk)
Workflow Example (Azure Pipelines - Push Model):
# ❌ PUSH-BASED: Pipeline has direct cluster access
- stage: Deploy_Production
jobs:
- deployment: DeployToAKS
environment: ATP-Production-AKS # Requires approval
strategy:
runOnce:
deploy:
steps:
# Pipeline has full kubectl access to production cluster
- task: Kubernetes@1
displayName: 'Deploy ATP Ingestion to Production'
inputs:
connectionType: 'Kubernetes Service Connection'
kubernetesServiceEndpoint: 'ATP-Production-AKS' # ⚠️ Cluster credentials
namespace: 'atp-production'
command: 'apply'
useConfigurationFile: true
configuration: 'manifests/production/atp-ingestion.yaml'
# Update image tag imperatively
- task: Kubernetes@1
inputs:
kubernetesServiceEndpoint: 'ATP-Production-AKS'
command: 'set'
arguments: 'image deployment/atp-ingestion atp-ingestion=$(containerRegistry)/atp/ingestion:$(Build.BuildNumber)'
Security Concerns:
# ⚠️ SECURITY RISK: Cluster credentials stored in Azure DevOps
# Service Connection: ATP-Production-AKS
# Type: Kubernetes Service Connection
# Authentication: Service Account (has cluster-admin rights!)
#
# Attack vectors:
# 1. Anyone with "Use" permission on service connection can deploy to production
# 2. Compromised Azure DevOps account = compromised production cluster
# 3. Service account token rotation requires updating all pipelines
# 4. Credentials visible in pipeline logs (if logging enabled)
Pull-Based Deployment (GitOps)¶
Architecture:
graph TD
A[Developer] -->|1. git push| B[Source Code<br/>Repository]
B -->|2. trigger| C[CI Pipeline<br/>Azure Pipelines]
C -->|3. build + test| D[Docker Image]
D -->|4. push| E[Container<br/>Registry<br/>ACR]
C -->|5. update manifest<br/>commit + push| F[GitOps<br/>Repository]
subgraph "Inside Kubernetes Cluster"
G[FluxCD Agent]
H[AKS Cluster]
G -->|6. git pull<br/>every 1 min| F
G -->|7. kubectl apply| H
H -.->|8. drift<br/>detection| G
G -.->|9. self-heal| H
end
I[Azure Key<br/>Vault] -->|10. secrets<br/>sync| J[External Secrets<br/>Operator]
J -->|11. create K8s<br/>secrets| H
K[Azure Monitor] -.->|metrics| C
K -.->|metrics| G
K -.->|logs| H
style H fill:#90EE90
style G fill:#90EE90
style F fill:#FFE5B4
Characteristics:
- Internal agent: GitOps operator (FluxCD) runs inside Kubernetes cluster as Deployment
- Continuous pull: Agent polls Git repository at regular intervals (configurable: 30s to 10m)
- No external access: Cluster credentials never leave cluster; enhanced security
- Automatic reconciliation: Cluster state continuously compared to Git state; drift corrected automatically
- Secret sync: Secrets managed in Azure Key Vault; synced to cluster via External Secrets Operator
Workflow Example (FluxCD - Pull Model):
# ✅ PULL-BASED: FluxCD inside cluster pulls from Git
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops
namespace: flux-system
spec:
interval: 1m # Poll Git every 1 minute
url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
ref:
branch: production # Production environment uses 'production' branch
secretRef:
name: azure-devops-ssh-key # Read-only SSH key (no cluster credentials!)
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-ingestion
namespace: flux-system
spec:
interval: 5m # Reconcile every 5 minutes
path: ./apps/atp-ingestion/overlays/production
prune: true # Delete resources removed from Git
sourceRef:
kind: GitRepository
name: atp-gitops
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
namespace: atp-production
Security Benefits:
# ✅ NO cluster credentials outside cluster
# FluxCD ServiceAccount has RBAC permissions inside cluster
# CI pipeline NEVER touches cluster; only commits to Git
# FluxCD ServiceAccount (inside cluster)
apiVersion: v1
kind: ServiceAccount
metadata:
name: kustomize-controller
namespace: flux-system
---
# FluxCD RBAC (cluster-admin for reconciliation)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kustomize-controller
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cluster-admin # Full access inside cluster only
subjects:
- kind: ServiceAccount
name: kustomize-controller
namespace: flux-system
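The secret-sync path shown in the pull-based diagram can be made concrete with an External Secrets Operator resource. A minimal sketch, assuming a Key Vault-backed `ClusterSecretStore` named `azure-keyvault` (store and key names are illustrative):

```yaml
# Sketch: sync a Key Vault entry into a Kubernetes Secret (no secrets in Git)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: atp-ingestion-secrets
  namespace: atp-production
spec:
  refreshInterval: 1h               # re-sync from Key Vault hourly
  secretStoreRef:
    kind: ClusterSecretStore
    name: azure-keyvault            # assumed Key Vault-backed store
  target:
    name: atp-ingestion-secrets     # Kubernetes Secret created by the operator
  data:
    - secretKey: ConnectionStrings__Database
      remoteRef:
        key: sql-connection-string  # Key Vault secret name (illustrative)
```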
Comparison Summary¶
| Deployment Model | When to Use | When to Avoid |
|---|---|---|
| Push-Based (Traditional CI/CD) | Small teams with simple deployments; non-Kubernetes environments; immediate feedback required; teams unfamiliar with GitOps | Production Kubernetes deployments; compliance requirements (SOC 2, GDPR, HIPAA); multi-environment setups with drift concerns; security-sensitive environments |
| Pull-Based (GitOps) | Production Kubernetes deployments; compliance/audit requirements; multi-cluster/multi-region; configuration drift concerns; security-first environments | Non-Kubernetes deployments; legacy infrastructure; teams unwilling to learn GitOps; immediate deployment feedback critical |
ATP Decision: ✅ GitOps (Pull-Based) for all Kubernetes deployments
Rationale:
- Audit trail requirement: Git provides immutable, permanent history (vs 30-90 day pipeline logs)
- Security requirement: Zero-trust principle; no cluster credentials outside cluster
- Compliance requirement: SOC 2, GDPR, HIPAA demand tamper-evident change records
- Multi-tenancy: Git structure enables isolated tenant configurations
- Operational resilience: Disaster recovery RTO reduced from 4 hours to 30 minutes
GitOps in Audit Trail Platform Context¶
Purpose: Explain why GitOps is essential (not just beneficial) for ATP's unique requirements.
Audit Trail Requirements¶
ATP provides immutable, tamper-evident audit logs for customers. The platform's own infrastructure must meet the same standards:
Requirement 1: Complete Change History
Every infrastructure change must be tracked with full attribution (who, what, when, why):
# Git history provides complete audit trail
git log --all --pretty=format:"%h | %ai | %an | %ae | %s" \
--since="2024-01-01" \
--grep="production"
# Example output (can be exported for SOC 2 audits):
# abc123d | 2024-10-30 14:23:45 +0000 | Alice Chen | alice.chen@connectsoft.example | feat(ingestion): upgrade to v1.2.3
# def456e | 2024-10-25 10:15:22 +0000 | Bob Smith | bob.smith@connectsoft.example | fix(query): index performance (ATP-BUG-789)
# ghi789f | 2024-10-20 16:42:11 +0000 | Carol Davis | carol.davis@connectsoft.example | scale(integrity): replicas 3→5 (ATP-INC-456)
Requirement 2: Tamper-Evidence
Git commits must be cryptographically signed to prevent tampering:
# Generate GPG key for commit signing
gpg --full-generate-key
# Select: RSA and RSA, 4096 bits, no expiration
# User ID: "Platform Team <platform-team@connectsoft.example>"
# Export public key for verification
gpg --armor --export platform-team@connectsoft.example > platform-team-gpg-public.key
# Configure Git to sign all commits
git config --global user.signingkey <GPG_KEY_ID>
git config --global commit.gpgsign true
git config --global tag.gpgsign true
# Commit with signature
git add apps/atp-ingestion/overlays/production/kustomization.yaml
git commit -S -m "feat(ingestion): upgrade to v1.2.3
- Updated image tag to v1.2.3-abc123d
- Increased memory limit 512Mi → 1Gi (performance optimization)
- Enabled tamper-evidence in production config
Relates to: ATP-EPIC-456
Approved by: architect@connectsoft.example
Tested in: Staging (2024-10-26 to 2024-10-29)"
# Verify signature
git log --show-signature -1
# Output:
# commit abc123d1234567890abcdef1234567890abcdef (HEAD -> production)
# gpg: Signature made Wed Oct 30 14:23:45 2024 UTC
# gpg: using RSA key 1234567890ABCDEF
# gpg: Good signature from "Platform Team <platform-team@connectsoft.example>"
# Author: Platform Team <platform-team@connectsoft.example>
# Date: Wed Oct 30 14:23:45 2024 +0000
#
# feat(ingestion): upgrade to v1.2.3
# ...
Requirement 3: Long-Term Retention
Git history must be retained indefinitely for compliance (SOC 2: 1 year minimum, ATP: 7 years for parity with audit logs):
# Backup Git repository to immutable Azure Blob Storage
az storage blob upload-batch \
--account-name atpgitbackupprod \
--destination gitops-backups \
--source .git/ \
--destination-path "$(date +%Y%m%d)/" \
--overwrite false # Immutable: cannot overwrite
# Enable legal hold (WORM storage)
az storage container legal-hold set \
--account-name atpgitbackupprod \
--container-name gitops-backups \
--tags "compliance=soc2-gdpr-hipaa" "retention=7-years"
# Retention: 7 years (matches audit log retention)
# Cost: ~$50/month for 10 GB Git history (cold storage tier)
Security Benefits¶
No Direct Cluster Access:
Problem Statement: Traditional CI/CD stores cluster credentials in Azure DevOps, creating security risks:
- Broad attack surface: Anyone with Azure DevOps access can potentially access cluster credentials
- Credential sprawl: Each environment/cluster needs separate service connection
- Rotation complexity: Updating service account tokens requires updating all pipelines
- Audit trail: Difficult to trace who accessed cluster credentials
GitOps Solution:
graph TD
subgraph "Outside Cluster"
A[Developer] -->|git push| B[Azure Repos<br/>atp-gitops]
C[CI Pipeline] -->|update manifest<br/>commit + push| B
end
subgraph "Inside AKS Cluster - No External Access"
D[FluxCD Agent]
E[Kustomize Controller]
F[Helm Controller]
G[Kubernetes API]
D -->|git pull| B
D -->|render| E
E -->|render| F
F -->|kubectl apply| G
end
H[Azure Key Vault] -->|Workload Identity<br/>federated auth| I[External Secrets<br/>Operator]
I -->|create secrets| G
J[Azure Monitor] -.->|observability| D
style G fill:#90EE90
style D fill:#90EE90
Security Improvements:
| Security Aspect | Traditional CI/CD | GitOps | Improvement |
|---|---|---|---|
| Cluster Credentials | Stored in CI/CD system | Never leave cluster | ✅ 100% reduction in external credentials |
| Attack Surface | CI/CD system + cluster API | Git repository only | ✅ Cluster API removed from external attack surface |
| Credential Rotation | Manual; update all pipelines | Automatic (Workload Identity) | ✅ Zero-touch rotation |
| Least Privilege | Often cluster-admin for simplicity | RBAC per FluxCD controller | ✅ Principle of least privilege |
| Audit Trail | Pipeline logs (ephemeral) | Git history (permanent) | ✅ Immutable audit evidence |
| Secrets in Git | Risk of accidental commit | Prevented (pre-commit hooks + PR validation) | ✅ Zero secrets in Git |
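The "zero secrets in Git" control above depends on pre-commit scanning. A minimal sketch using gitleaks through the pre-commit framework (the pinned `rev` is illustrative):

```yaml
# .pre-commit-config.yaml — scan staged changes for credentials before commit
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4      # pin to a released version (illustrative)
    hooks:
      - id: gitleaks  # blocks the commit if a secret-like string is detected
```

The same scan can run again as a PR validation step so nothing slips through a developer machine without hooks installed.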
Separation of Duties:
ATP enforces role-based access control at multiple levels:
| Role | Azure Repos Access | AKS Cluster Access | FluxCD Admin | Azure Key Vault Access | Approval Authority |
|---|---|---|---|---|---|
| Developer | ✅ Submit PRs (feature/*) | ❌ No access | ❌ No | ❌ No | None |
| DevOps Engineer | ✅ Approve PRs (dev/test) | ⚠️ Read-only (dev/test) | ⚠️ Read-only | ❌ No | Dev/Test deployments |
| Architect | ✅ Approve PRs (staging/prod) | ⚠️ Read-only (all envs) | ⚠️ Read-only | ⚠️ Read-only (audit) | Staging/Prod deployments |
| SRE Engineer | ✅ Approve PRs (production) | ⚠️ Read-only (production) | ✅ Admin (suspend/resume reconciliation) | ⚠️ Read-only (audit) | Production deployments |
| Security Officer | ✅ Audit access (read-only) | ⚠️ Read-only (all envs) | ⚠️ Read-only | ✅ Admin (rotate secrets) | Security policy changes |
| Compliance Officer | ✅ Audit access (read-only) | ❌ No access | ❌ No | ⚠️ Read-only (audit) | None (audit only) |
| FluxCD Agent | ✅ Read-only (GitOps repo) | ✅ Full access (via ServiceAccount RBAC) | N/A | ❌ No (uses External Secrets Operator) | None (automated) |
| External Secrets Operator | ❌ No | ✅ Create secrets (namespace-scoped) | ❌ No | ✅ Read secrets (Workload Identity) | None (automated) |
RBAC Example (FluxCD ServiceAccount):
# FluxCD runs with least privilege (namespace-scoped for app deployments)
apiVersion: v1
kind: ServiceAccount
metadata:
name: kustomize-controller-atp-apps
namespace: flux-system
---
# Role: namespace-scoped permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: flux-apps-deployer
namespace: atp-production
rules:
- apiGroups: ["apps"]
resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
resources: ["services", "configmaps", "secrets", "persistentvolumeclaims"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["networking.k8s.io"]
resources: ["ingresses", "networkpolicies"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# RoleBinding: bind ServiceAccount to Role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: flux-apps-deployer
namespace: atp-production
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: flux-apps-deployer
subjects:
- kind: ServiceAccount
name: kustomize-controller-atp-apps
namespace: flux-system
Multi-Tenancy¶
Tenant Isolation in Git:
ATP's namespace-per-tenant model is naturally represented in Git:
Directory Structure:
atp-gitops/
├── tenants/
│ ├── tenant-acme-corp/ # Tenant: ACME Corporation
│ │ ├── namespace.yaml # Namespace: atp-tenant-acme
│ │ ├── resource-quota.yaml # Limits: 10 CPU, 20 GB RAM
│ │ ├── network-policy.yaml # Deny cross-tenant traffic
│ │ ├── rbac.yaml # Tenant-specific RBAC
│ │ ├── config.yaml # Data residency: US
│ │ └── kustomization.yaml # FluxCD Kustomization
│ │
│ ├── tenant-widgets-inc/ # Tenant: Widgets Inc.
│ │ ├── namespace.yaml # Namespace: atp-tenant-widgets
│ │ ├── resource-quota.yaml # Limits: 5 CPU, 10 GB RAM
│ │ ├── network-policy.yaml
│ │ ├── rbac.yaml
│ │ ├── config.yaml # Data residency: EU (GDPR)
│ │ └── kustomization.yaml
│ │
│ └── tenant-global-bank/ # Tenant: Global Bank (Enterprise)
│ ├── namespace.yaml # Namespace: atp-tenant-global
│ ├── resource-quota.yaml # Limits: 20 CPU, 40 GB RAM
│ ├── network-policy.yaml # Strict isolation (financial data)
│ ├── rbac.yaml
│ ├── config.yaml # Compliance: HIPAA + SOC 2 + GDPR
│ └── kustomization.yaml
Tenant Configuration Example:
# tenants/tenant-acme-corp/config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: tenant-config
namespace: atp-tenant-acme
data:
# Tenant metadata
tenant-id: "acme-corp"
tenant-name: "ACME Corporation"
tenant-tier: "standard" # standard, premium, enterprise
# Data residency
data-residency: "us" # us, eu, apac
primary-region: "eastus"
backup-region: "westus"
# Compliance requirements
compliance-profile: "soc2-hipaa" # soc2, gdpr, hipaa, soc2-gdpr, soc2-hipaa, soc2-gdpr-hipaa
retention-days: "2555" # 7 years
immutability-enabled: "true"
tamper-evidence-enabled: "true"
# Feature flags (tenant-specific)
enable-advanced-query: "true"
enable-ai-anomaly-detection: "false" # Premium feature
enable-realtime-alerts: "true"
# Resource limits
max-ingestion-rate-rps: "1000" # 1000 requests per second
max-query-rate-rps: "500"
max-storage-gb: "1000" # 1 TB
Benefits:
- ✅ Isolated changes: Tenant config changes don't affect other tenants (isolated Git directories)
- ✅ Audit trail per tenant: git log -- tenants/tenant-acme-corp/ shows all changes for ACME Corp
- ✅ Compliance per tenant: GDPR/HIPAA requirements enforced via namespace labels and network policies
- ✅ Cost allocation: Resource quotas enable accurate chargeback/showback per tenant
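Each tenant directory above ends in a `kustomization.yaml`, which FluxCD reconciles independently. A hedged sketch of the corresponding FluxCD Kustomization (interval and names are illustrative):

```yaml
# Sketch: one FluxCD Kustomization per tenant directory
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: tenant-acme-corp
  namespace: flux-system
spec:
  interval: 10m
  path: ./tenants/tenant-acme-corp
  prune: true                # offboarding = delete the tenant directory in Git
  sourceRef:
    kind: GitRepository
    name: atp-gitops
```

With `prune: true`, removing a tenant directory from Git deprovisions the namespace and its resources on the next reconciliation.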
Operational Resilience¶
Disaster Recovery from Git:
Scenario: Production AKS cluster destroyed (region outage, ransomware, infrastructure failure)
Recovery Steps:
#!/bin/bash
# disaster-recovery-production.sh — Recover production AKS from Git
set -euo pipefail
echo "🔴 DISASTER RECOVERY: Recreating production AKS cluster from Git + IaC"
# ──────────────────────────────────────────────────────────────────────────
# Step 1: Provision new AKS cluster with Pulumi (15 minutes)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 1/5: Provisioning AKS cluster with Pulumi..."
cd infrastructure/pulumi-aks
pulumi stack select production
pulumi refresh --yes # Detect destroyed resources
pulumi up --yes # Recreate cluster
# ──────────────────────────────────────────────────────────────────────────
# Step 2: Configure kubectl context (1 minute)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 2/5: Configuring kubectl..."
az aks get-credentials \
--resource-group ATP-Production-EUS-RG \
--name atp-prod-eus-aks \
--overwrite-existing
export KUBECONFIG=~/.kube/config
# ──────────────────────────────────────────────────────────────────────────
# Step 3: Install FluxCD and bootstrap from Git (5 minutes)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 3/5: Bootstrapping FluxCD..."
flux bootstrap git \
--url=ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops \
--branch=production \
--path=clusters/production \
--private-key-file=~/.ssh/azure-devops-flux \
--author-name="Platform Team" \
--author-email="platform-team@connectsoft.example"
# ──────────────────────────────────────────────────────────────────────────
# Step 4: Wait for FluxCD to reconcile all resources (10 minutes)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 4/5: Waiting for FluxCD reconciliation..."
flux get kustomizations --watch --timeout=15m
# ──────────────────────────────────────────────────────────────────────────
# Step 5: Verify all services healthy (3 minutes)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 5/5: Verifying service health..."
for service in ingestion query integrity export policy search gateway; do
echo "Checking atp-$service..."
kubectl wait --for=condition=available --timeout=300s \
deployment/atp-$service -n atp-production
done
echo "✅ Disaster recovery complete!"
echo "RTO achieved: ~30 minutes"
echo "RPO: 0 minutes (Git contains complete desired state)"
RTO/RPO Targets:
| Environment | RTO Target | RTO Actual (GitOps) | RPO Target | RPO Actual (GitOps) |
|---|---|---|---|---|
| Dev | 4 hours | 20 minutes | 24 hours | 0 minutes |
| Test | 2 hours | 25 minutes | 12 hours | 0 minutes |
| Staging | 1 hour | 30 minutes | 4 hours | 0 minutes |
| Production | 30 minutes | 30-35 minutes | 1 hour | 0 minutes |
GitOps Impact: RPO reduced to zero (Git has complete desired state, no data loss for infrastructure config).
Summary¶
- GitOps in ATP Context: Essential (not just beneficial) for ATP's audit trail, security, and compliance requirements
- Audit Trail Requirements: Complete change history (Git log with attribution), tamper-evidence (GPG-signed commits), long-term retention (7 years in immutable Azure Blob Storage)
- Security Benefits: No cluster credentials outside cluster, separation of duties (7 roles with RBAC matrix), secret management via Key Vault (zero secrets in Git)
- Multi-Tenancy: Namespace-per-tenant with isolated Git directories, tenant-specific configs (data residency, compliance, resource quotas, feature flags)
- Operational Resilience: DR RTO 30-35 minutes (Pulumi 15min + FluxCD 10min + validate 5min), RPO 0 minutes (Git has full state)
- Rollback Simplicity: `git revert` triggers automatic rollback within 5-10 minutes (vs re-running a pipeline)
Four Core Principles (OpenGitOps)¶
The OpenGitOps working group (CNCF) defines 4 core principles that any GitOps implementation must follow. ATP adheres to all four principles using FluxCD, Azure DevOps, and AKS.
Principle 1: Declarative¶
Definition: The desired system state is represented as declarative specifications (what you want, not how to get it). Configuration is stored in a version-controlled source (Git) rather than generated by scripts.
Key Concepts:
- Declarative vs Imperative: Declarative describes the end state (e.g., "3 replicas, 1 GB RAM"), while imperative describes steps (e.g., "scale up by 1, set memory to 1 GB")
- Idempotency: Applying the same declarative configuration multiple times produces the same result
- Configuration as Code: All infrastructure and application config stored as YAML/JSON in Git
ATP Implementation:
ATP uses three layers of declarative configuration:
- Base Kubernetes Manifests (YAML): Raw Kubernetes resource definitions
- Helm Charts: Templated, parameterized manifests with values files
- Kustomize Overlays: Environment-specific customizations applied to base manifests
Kubernetes Deployment Manifest (Base)¶
Complete Example (ATP Ingestion Service):
# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
namespace: atp-production
labels:
app: atp-ingestion
component: ingestion
tier: backend
version: v1.2.3
managed-by: fluxcd
spec:
replicas: 3 # Desired state: 3 replicas
selector:
matchLabels:
app: atp-ingestion
template:
metadata:
labels:
app: atp-ingestion
version: v1.2.3
spec:
serviceAccountName: atp-ingestion-sa
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 2000
seccompProfile:
type: RuntimeDefault
containers:
- name: ingestion
image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
imagePullPolicy: IfNotPresent
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: [ALL]
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
env:
- name: ASPNETCORE_ENVIRONMENT
value: Production
- name: OpenTelemetry__ServiceName
value: atp-ingestion
envFrom:
- configMapRef:
name: atp-ingestion-config
- secretRef:
name: atp-ingestion-secrets
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
volumeMounts:
- name: tmp
mountPath: /tmp
- name: cache
mountPath: /app/cache
volumes:
- name: tmp
emptyDir: {}
- name: cache
emptyDir: {}
imagePullSecrets:
- name: acr-credentials
---
# apps/atp-ingestion/base/service.yaml
apiVersion: v1
kind: Service
metadata:
name: atp-ingestion
namespace: atp-production
labels:
app: atp-ingestion
spec:
type: ClusterIP
ports:
- port: 80
targetPort: 8080
protocol: TCP
name: http
selector:
app: atp-ingestion
---
# apps/atp-ingestion/base/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: atp-ingestion-config
namespace: atp-production
data:
ASPNETCORE_ENVIRONMENT: "Production"
OpenTelemetry__ServiceName: "atp-ingestion"
OpenTelemetry__SamplingRatio: "0.1"
Audit__EnableImmutability: "true"
Audit__RetentionDays: "2555"
Declarative Characteristics:
- ✅ Desired state: `replicas: 3` declares the goal (not "scale by +1")
- ✅ Idempotent: Reapplying the same manifest produces the same result
- ✅ Version-controlled: Stored in Git, not generated by scripts
- ✅ Immutable: Image tag includes the commit SHA (`v1.2.3-abc123d`)
Helm Charts (Parameterized Declarative)¶
Chart Structure:
apps/atp-ingestion/helm/
├── Chart.yaml # Chart metadata
├── values.yaml # Default values
├── values-dev.yaml # Dev environment overrides
├── values-production.yaml # Production environment overrides
└── templates/
├── deployment.yaml # Templated Deployment
├── service.yaml # Templated Service
├── configmap.yaml # Templated ConfigMap
└── ingress.yaml # Templated Ingress
Chart.yaml:
# apps/atp-ingestion/helm/Chart.yaml
apiVersion: v2
name: atp-ingestion
description: ATP Ingestion Service - Receives audit records via HTTP/gRPC
version: 1.2.3 # Chart version (SemVer)
appVersion: 1.2.3 # Application version
type: application
keywords:
- audit-trail
- ingestion
- microservice
maintainers:
- name: ConnectSoft Platform Team
email: platform-team@connectsoft.example
dependencies:
- name: redis
version: 17.x.x
repository: https://charts.bitnami.com/bitnami
condition: redis.enabled
values.yaml (Default):
# apps/atp-ingestion/helm/values.yaml
# Default values for atp-ingestion chart
replicaCount: 3
image:
repository: connectsoft.azurecr.io/atp/ingestion
pullPolicy: IfNotPresent
tag: "" # Overridden by .Values.appVersion or CI pipeline
imagePullSecrets:
- name: acr-credentials
serviceAccount:
create: true
annotations:
azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789abc"
name: atp-ingestion-sa
podAnnotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
podSecurityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 2000
seccompProfile:
type: RuntimeDefault
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: [ALL]
service:
type: ClusterIP
port: 80
targetPort: 8080
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: ingestion.atp.connectsoft.example
paths:
- path: /
pathType: Prefix
tls:
- secretName: atp-ingestion-tls
hosts:
- ingestion.atp.connectsoft.example
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
# Environment-specific configuration
env:
ASPNETCORE_ENVIRONMENT: Production
OpenTelemetry__ServiceName: atp-ingestion
# External Secrets Operator integration
externalSecrets:
enabled: true
secretStore: azure-keyvault
secrets:
- name: ConnectionStrings__Database
key: sql-connection-string
- name: ConnectionStrings__Redis
key: redis-connection-string
- name: ConnectionStrings__RabbitMQ
key: rabbitmq-connection-string
# Redis sub-chart (optional dependency)
redis:
enabled: false # Use Azure Cache for Redis instead
Helm Template (Deployment):
# apps/atp-ingestion/helm/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "atp-ingestion.fullname" . }}
namespace: {{ .Values.namespace }}
labels:
{{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
replicas: {{ .Values.replicaCount }}
selector:
matchLabels:
{{- include "atp-ingestion.selectorLabels" . | nindent 6 }}
template:
metadata:
annotations:
{{- with .Values.podAnnotations }}
{{- toYaml . | nindent 8 }}
{{- end }}
labels:
{{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
spec:
serviceAccountName: {{ .Values.serviceAccount.name }}
securityContext:
{{- toYaml .Values.podSecurityContext | nindent 8 }}
containers:
- name: {{ .Chart.Name }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
securityContext:
{{- toYaml .Values.securityContext | nindent 12 }}
resources:
{{- toYaml .Values.resources | nindent 12 }}
env:
{{- range $key, $value := .Values.env }}
- name: {{ $key }}
value: {{ $value | quote }}
{{- end }}
livenessProbe:
{{- toYaml .Values.livenessProbe | nindent 12 }}
readinessProbe:
{{- toYaml .Values.readinessProbe | nindent 12 }}
Benefits of Helm:
- ✅ Parameterization: One chart, multiple environments (values-dev.yaml, values-production.yaml)
- ✅ Reusability: Chart can be used across multiple services with different values
- ✅ Dependency management: Declare sub-charts (e.g., Redis) as dependencies
- ✅ Versioning: Chart version and app version tracked separately
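For contrast with the defaults above, a hedged sketch of what `values-production.yaml` might override (values are illustrative but mirror the production overlay shown later):

```yaml
# apps/atp-ingestion/helm/values-production.yaml — production overrides (sketch)
replicaCount: 5
resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 2Gi
autoscaling:
  minReplicas: 5
  maxReplicas: 20
env:
  OpenTelemetry__SamplingRatio: "0.1"   # 10% sampling in production
```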
Kustomize Overlays (Environment-Specific Customization)¶
Directory Structure:
apps/atp-ingestion/
├── base/ # Base manifests (reusable)
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── configmap.yaml
│ └── kustomization.yaml
│
└── overlays/ # Environment-specific overlays
├── dev/
│ ├── kustomization.yaml
│ ├── deployment-patch.yaml
│ └── configmap-patch.yaml
├── staging/
│ ├── kustomization.yaml
│ ├── deployment-patch.yaml
│ └── hpa-patch.yaml
└── production/
├── kustomization.yaml
├── deployment-patch.yaml
├── hpa-patch.yaml
└── configmap-patch.yaml
Base Kustomization:
# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: atp-production
resources:
- deployment.yaml
- service.yaml
- configmap.yaml
commonLabels:
app: atp-ingestion
component: ingestion
managed-by: fluxcd
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: v1.2.3-abc123d # Updated by CI pipeline
Production Overlay:
# apps/atp-ingestion/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: atp-production
# Base resources
resources:
- ../../base
# Strategic merge patches
patchesStrategicMerge:
- deployment-patch.yaml
- hpa-patch.yaml
# Image tag override (updated by CI pipeline)
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: v1.2.3-abc123d
# ConfigMap generator (add production-specific values)
configMapGenerator:
- name: atp-ingestion-config
behavior: merge # Merge with base ConfigMap
literals:
- ASPNETCORE_ENVIRONMENT=Production
- OpenTelemetry__SamplingRatio=0.1
- Audit__EnableImmutability=true
- Audit__RetentionDays=2555
# Labels applied to all resources
commonLabels:
environment: production
managed-by: fluxcd
compliance: soc2-gdpr-hipaa
# Annotations applied to all resources
commonAnnotations:
gitops.toolkit.fluxcd.io/reconcile: enabled
azure.connectsoft.com/cost-center: atp-production
# Replicas override (production has more replicas)
replicas:
- name: atp-ingestion
count: 5 # Production: 5 replicas (base has 3)
Deployment Patch (Production-specific changes):
# apps/atp-ingestion/overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 5 # Production: 5 replicas
template:
spec:
containers:
- name: ingestion
resources:
requests:
cpu: 1000m # Production: 1 CPU core (base: 500m)
memory: 1Gi # Production: 1 GB RAM (base: 512Mi)
limits:
cpu: 2000m # Production: 2 CPU cores (base: 1000m)
memory: 2Gi # Production: 2 GB RAM (base: 1Gi)
env:
- name: ASPNETCORE_ENVIRONMENT
value: Production
- name: OpenTelemetry__SamplingRatio
value: "0.1" # Production: 10% sampling (dev: 100%)
Benefits of Kustomize:
- ✅ DRY (Don't Repeat Yourself): Base manifests reused; only differences in overlays
- ✅ Environment isolation: Each environment has isolated overlay directory
- ✅ Strategic merge patches: Fine-grained control over what changes per environment
- ✅ Build-time customization: No runtime templating; manifests rendered before applying
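Because rendering happens at build time, the effective manifest can be inspected before anything reaches the cluster (for example with `kustomize build`). An illustrative excerpt of the rendered production output, given the base and overlay above:

```yaml
# Excerpt: kustomize build apps/atp-ingestion/overlays/production
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
  labels:
    app: atp-ingestion
    environment: production      # added by overlay commonLabels
spec:
  replicas: 5                    # overlay patch wins over base (3)
  template:
    spec:
      containers:
        - name: ingestion
          image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d  # images: override
          resources:
            requests:
              cpu: 1000m         # overlay patch (base: 500m)
              memory: 1Gi
```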
Principle 2: Versioned & Immutable¶
Definition: All desired states are versioned (stored in Git) and immutable (cannot be changed after commit). Changes are made by creating new versions, not modifying existing ones.
Key Concepts:
- Git as Version Control: All manifests stored in Git with commit history
- Immutable Git History: Commits cannot be modified (only new commits added)
- Semantic Versioning: Version numbers follow SemVer (major.minor.patch)
- Image Tagging: Container images tagged with version + commit SHA
- GPG Signing: Commits cryptographically signed to prove authenticity
Git Commit Signing with GPG¶
Purpose: Ensure commits are tamper-evident and authentic (SOC 2, GDPR compliance).
Setup:
# Generate GPG key (one-time per developer/team)
gpg --full-generate-key
# Select:
# - RSA and RSA (default)
# - 4096 bits (secure)
# - No expiration (or 2 years)
# - Real name: "Platform Team"
# - Email: platform-team@connectsoft.example
# - Comment: "ATP GitOps Commits"
# List keys
gpg --list-secret-keys --keyid-format LONG
# Output:
# sec rsa4096/1234567890ABCDEF 2024-01-15 [SC]
# ABC123DEF456GHI789JKL012MNO345PQR678STU
# uid [ultimate] Platform Team <platform-team@connectsoft.example>
# Configure Git to use GPG key
git config --global user.signingkey 1234567890ABCDEF # Use key ID from above
git config --global commit.gpgsign true # Sign all commits automatically
git config --global tag.gpgsign true # Sign all tags automatically
# Export public key (share with team)
gpg --armor --export 1234567890ABCDEF > platform-team-gpg-public.key
# Import public key (for verification)
gpg --import platform-team-gpg-public.key
Commit with Signature:
# Standard commit (automatically signed due to commit.gpgsign=true)
git add apps/atp-ingestion/overlays/production/kustomization.yaml
git commit -m "feat(ingestion): upgrade to v1.2.4
- Updated image tag to v1.2.4-def456e
- Increased memory limit 1Gi → 2Gi (performance optimization)
- Enabled advanced query features
Relates to: ATP-EPIC-789
Approved by: architect@connectsoft.example
Tested in: Staging (2024-10-28 to 2024-10-30)"
# Explicit signing (if auto-sign disabled)
git commit -S -m "..."
# Verify signature
git log --show-signature -1
# Output:
# commit def456e789abcdef0123456789abcdef01234567 (HEAD -> production)
# gpg: Signature made Mon Oct 30 15:30:22 2024 UTC
# gpg: using RSA key 1234567890ABCDEF
# gpg: Good signature from "Platform Team <platform-team@connectsoft.example>"
# Author: Platform Team <platform-team@connectsoft.example>
# Date: Mon Oct 30 15:30:22 2024 +0000
#
# feat(ingestion): upgrade to v1.2.4
Azure DevOps Branch Policy (Require Signed Commits):
# Azure DevOps Branch Policy: Require GPG-signed commits
# Configured in Azure DevOps Portal:
# Repos > Branches > production > Branch Policies
# ✓ Require signed commits (GPG or SSH)
# ✓ Require pull request (minimum 1 reviewer)
# ✓ Require status checks (CI pipeline must pass)
Verify All Commits Signed (Compliance Audit):
#!/bin/bash
# verify-all-commits-signed.sh — Verify all commits in production branch are GPG-signed
BRANCH="production"
UNSIGNED_COMMITS=()
for commit in $(git log --format=%H origin/$BRANCH); do
if ! git verify-commit $commit 2>/dev/null; then
UNSIGNED_COMMITS+=($commit)
fi
done
if [ ${#UNSIGNED_COMMITS[@]} -eq 0 ]; then
echo "✅ All commits are GPG-signed"
exit 0
else
echo "❌ Found ${#UNSIGNED_COMMITS[@]} unsigned commits:"
for commit in "${UNSIGNED_COMMITS[@]}"; do
echo " - $commit"
done
exit 1
fi
Semantic Versioning Strategy¶
Strategy: ATP uses Semantic Versioning (SemVer) for application versions: MAJOR.MINOR.PATCH
- MAJOR: Breaking changes (API incompatibility, schema changes)
- MINOR: New features (backward-compatible)
- PATCH: Bug fixes (backward-compatible)
Version Tagging:
# Tag release in source code repository
git tag -a v1.2.4 -m "Release v1.2.4
- Feature: Advanced query API
- Bug fix: Memory leak in Redis connection pooling
- Security: Upgrade to .NET 8.0
Changelog: https://dev.azure.com/ConnectSoft/ATP/_wiki/wikis/ATP.wiki/12345/Release-Notes-v1.2.4"
git push origin v1.2.4
# CI pipeline uses tag to build Docker image
# Docker image tagged as: connectsoft.azurecr.io/atp/ingestion:v1.2.4
Version Examples:
v1.2.4 # Minor feature release
v1.2.3 # Patch release (bug fix)
v2.0.0 # Major release (breaking changes)
v1.2.4-hotfix1 # Hotfix release
Git Tags for Releases¶
Tag Structure:
# Production release tag
git tag -a v1.2.4 -m "Production Release v1.2.4" production
git push origin v1.2.4
# Hotfix release tag
git tag -a v1.2.4-hotfix1 -m "Hotfix: Memory leak fix" hotfix/memory-leak
git push origin v1.2.4-hotfix1
Tag Verification (Ensure Tags Match Commits):
# Verify tag points to expected commit
git tag -v v1.2.4
# Output:
# object abc123d7890def4567890abc123def4567890ab
# type commit
# tag v1.2.4
# tagger Platform Team <platform-team@connectsoft.example> 2024-10-30 16:00:00 +0000
#
# Production Release v1.2.4
# gpg: Signature made Mon Oct 30 16:00:00 2024 UTC
# gpg: using RSA key 1234567890ABCDEF
# gpg: Good signature from "Platform Team <platform-team@connectsoft.example>"
Environment-Wide Release Tags:
# Production release tag (all services)
git tag -a release/v1.2.4 -m "Production Release v1.2.4
Services:
- atp-ingestion: v1.2.4
- atp-query: v1.3.0
- atp-integrity: v1.1.5
- atp-export: v1.0.2
- atp-policy: v1.2.0
- atp-search: v1.1.0
- atp-gateway: v1.4.0
Changelog: https://dev.azure.com/ConnectSoft/ATP/_wiki/wikis/ATP.wiki/12345/Release-Notes-v1.2.4"
git push origin release/v1.2.4
Image Tagging with Version + Commit SHA¶
Strategy: ATP uses immutable image tags combining version + commit SHA:
Format: {version}-{commit-sha}
Example: v1.2.4-abc123d
Where:
- v1.2.4 = Semantic version (from Git tag)
- abc123d = First 7 characters of Git commit SHA
Benefits:
- ✅ Traceability: Image tag links to the exact Git commit
- ✅ Immutability: Same tag always points to the same image (never overwritten)
- ✅ Version clarity: Version number visible in the tag
- ✅ Rollback simplicity: Revert to the previous version tag
Docker Image Tagging (Azure Pipelines):
# Azure Pipelines: Tag Docker image with version + commit SHA
- task: Docker@2
displayName: 'Build and push Docker image'
inputs:
containerRegistry: 'ConnectSoft-ACR'
repository: 'atp/ingestion'
command: 'buildAndPush'
Dockerfile: 'src/ConnectSoft.ATP.Ingestion/Dockerfile'
tags: |
$(Build.BuildNumber) # v1.2.4
$(Build.BuildNumber)-$(Build.SourceVersion) # v1.2.4-<full commit SHA>; shorten to 7 chars in a prior step for v1.2.4-abc123d
latest # Latest (for dev only)
ACR Tagging Rules:
| Tag Pattern | Mutable? | Use Case | Example |
|---|---|---|---|
| `v{VERSION}` | ❌ Immutable | Production releases | `v1.2.4` |
| `v{VERSION}-{SHA}` | ❌ Immutable | Production releases (traceable) | `v1.2.4-abc123d` |
| `latest` | ✅ Mutable | Development only | `latest` |
Git History as Audit Trail¶
Compliance Report Generation:
#!/bin/bash
# generate-compliance-report.sh — Generate SOC 2 audit report from Git history
BRANCH="production"
START_DATE="2024-01-01"
END_DATE="2024-12-31"
OUTPUT_FILE="compliance-report-q4-2024.md"
echo "# GitOps Compliance Report — Q4 2024" > $OUTPUT_FILE
echo "" >> $OUTPUT_FILE
echo "**Report Period**: $START_DATE to $END_DATE" >> $OUTPUT_FILE
echo "**Branch**: $BRANCH" >> $OUTPUT_FILE
echo "" >> $OUTPUT_FILE
echo "## All Production Deployments" >> $OUTPUT_FILE
echo "" >> $OUTPUT_FILE
echo "| Commit | Timestamp | Author | Email | Description | Signature |" >> $OUTPUT_FILE
echo "|--------|-----------|--------|-------|-------------|-----------|" >> $OUTPUT_FILE
git log --format="%h | %ai | %an | %ae | %s | %G? |" \
--since="$START_DATE" \
--until="$END_DATE" \
origin/$BRANCH | \
sed 's/G$/✅ Good/' | \
sed 's/B$/❌ Bad/' | \
sed 's/U$/⚠️ Unknown/' | \
sed 's/N$/❌ None/' | \
sed 's/X$/❌ Expired/' | \
>> $OUTPUT_FILE
echo "" >> $OUTPUT_FILE
echo "## Summary" >> $OUTPUT_FILE
echo "" >> $OUTPUT_FILE
TOTAL=$(git log --oneline --since="$START_DATE" --until="$END_DATE" origin/$BRANCH | wc -l)
SIGNED=$(git log --show-signature --since="$START_DATE" --until="$END_DATE" origin/$BRANCH | grep -c "Good signature")
echo "- **Total Commits**: $TOTAL" >> $OUTPUT_FILE
echo "- **Signed Commits**: $SIGNED" >> $OUTPUT_FILE
echo "- **Unsigned Commits**: $((TOTAL - SIGNED))" >> $OUTPUT_FILE
echo "✅ Compliance report generated: $OUTPUT_FILE"
Output Example:
# GitOps Compliance Report — Q4 2024
**Report Period**: 2024-01-01 to 2024-12-31
**Branch**: production
## All Production Deployments
| Commit | Timestamp | Author | Email | Description | Signature |
|--------|-----------|--------|-------|-------------|-----------|
| abc123d | 2024-10-30 14:23:45 | Platform Team | platform-team@connectsoft.example | feat(ingestion): upgrade to v1.2.4 | ✅ Good |
| def456e | 2024-10-25 10:15:22 | Alice Chen | alice.chen@connectsoft.example | fix(query): resolve index issue | ✅ Good |
| ghi789f | 2024-10-20 16:42:11 | Bob Smith | bob.smith@connectsoft.example | scale(integrity): replicas 3→5 | ✅ Good |
## Summary
- **Total Commits**: 45
- **Signed Commits**: 45
- **Unsigned Commits**: 0
Long-Term Retention (7 years for compliance):
# Backup Git repository to immutable Azure Blob Storage
az storage blob upload-batch \
--account-name atpgitbackupprod \
--destination gitops-backups \
--source .git/ \
--destination-path "$(date +%Y%m%d)/" \
--overwrite false # Immutable: cannot overwrite
# Enable legal hold (WORM storage)
az storage container legal-hold set \
--account-name atpgitbackupprod \
--container-name gitops-backups \
--tags "compliance=soc2-gdpr-hipaa" "retention=7-years"
# Retention: 7 years (matches audit log retention)
# Cost: ~$50/month for 10 GB Git history (cold storage tier)
Principle 3: Pulled Automatically¶
Definition: The desired state is automatically pulled from the source (Git repository) by an agent running inside the cluster, rather than being pushed by an external system.
Key Concepts:
- Pull-Based Architecture: GitOps agent (FluxCD) inside cluster pulls from Git
- Polling Intervals: Agent checks Git at regular intervals (e.g., every 1 minute)
- Webhook Triggers: Optional webhooks for immediate sync (faster than polling)
- GitRepository Resource: FluxCD custom resource that defines Git source
- Kustomization Resource: FluxCD custom resource that defines what to deploy
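The webhook trigger mentioned above is implemented by the notification-controller's Receiver resource; an Azure DevOps service hook can POST to the generated endpoint to trigger a pull ahead of the polling interval. A minimal sketch using a `generic` receiver (names are illustrative):

```yaml
# Sketch: webhook receiver for near-immediate sync
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: atp-gitops-receiver
  namespace: flux-system
spec:
  type: generic
  secretRef:
    name: webhook-token        # shared token embedded in the receiver URL
  resources:
    - kind: GitRepository
      name: atp-gitops         # source to re-fetch when the hook fires
```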
FluxCD Architecture Overview¶
Component Diagram:
graph TD
A[Git Repository<br/>Azure Repos] -->|git pull<br/>every 1 min| B[Source Controller<br/>flux-system namespace]
B -->|fetch Git| C[GitRepository<br/>Custom Resource]
C -->|notify| D[Kustomize Controller<br/>flux-system namespace]
D -->|render manifests| E[Kustomization<br/>Custom Resource]
E -->|kubectl apply| F[Kubernetes API<br/>AKS Cluster]
F -.->|watch for drift| D
D -.->|reconcile| F
G[Helm Controller] -.->|if Helm chart| E
H[Notification Controller] -->|alerts| I[Slack / Teams]
style B fill:#90EE90
style D fill:#90EE90
style G fill:#90EE90
style H fill:#FFE5B4
FluxCD Components:
| Component | Purpose | Namespace |
|---|---|---|
| source-controller | Fetches Git repositories, Helm charts, OCI artifacts | flux-system |
| kustomize-controller | Renders Kustomize manifests and applies to cluster | flux-system |
| helm-controller | Installs/upgrades Helm charts | flux-system |
| notification-controller | Sends alerts to Slack, Teams, etc. | flux-system |
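The notification-controller's alerting path can likewise be sketched with Provider and Alert resources (API versions vary by Flux release; the webhook address lives in a Secret, and names are illustrative):

```yaml
# Sketch: route reconciliation failures to Microsoft Teams
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: teams-alerts
  namespace: flux-system
spec:
  type: msteams
  secretRef:
    name: teams-webhook-url    # Secret with an 'address' key (illustrative)
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: atp-reconciliation-alerts
  namespace: flux-system
spec:
  providerRef:
    name: teams-alerts
  eventSeverity: error         # failures only; use 'info' for all events
  eventSources:
    - kind: Kustomization
      name: '*'
    - kind: GitRepository
      name: '*'
```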
GitRepository Resource¶
Definition: Defines source of truth (Git repository URL, branch, authentication).
Example:
# clusters/production/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops
namespace: flux-system
spec:
interval: 1m # Poll Git every 1 minute
url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
ref:
branch: production # Git branch to watch
secretRef:
name: azure-devops-ssh-key # SSH key secret for authentication
ignore: |
/*.md
!README.md
suspend: false # Set to true to pause reconciliation
Status (Reconciled):
# Check GitRepository status
kubectl describe gitrepository atp-gitops -n flux-system
# Output:
# Status:
# Artifact:
# Checksum: abc123def4567890
# Last Update: 2024-10-30T15:30:00Z
# Path: gitrepository/flux-system/atp-gitops/abc123.tar.gz
# Revision: production/abc123d7890def4567890abc123def4567890ab
# URL: http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/atp-gitops/abc123.tar.gz
# Conditions:
# Last Transition Time: 2024-10-30T15:30:00Z
# Message: Fetched revision: production/abc123d7890def4567890abc123def4567890ab
# Observed Generation: 1
# Reason: GitOperationSucceed
# Status: True
# Type: Ready
# Observed Generation: 1
# URL: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
Authentication Methods:
Option 1: SSH Key (Recommended for Azure DevOps):
# Create SSH key secret
apiVersion: v1
kind: Secret
metadata:
name: azure-devops-ssh-key
namespace: flux-system
type: Opaque
stringData:
identity: |
-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQNuZW5lAAAABQAAAAB...
-----END OPENSSH PRIVATE KEY-----
known_hosts: |
ssh.dev.azure.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC7...
---
# GitRepository references secret
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops
namespace: flux-system
spec:
url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
secretRef:
name: azure-devops-ssh-key
Option 2: Personal Access Token (PAT) (Alternative):
# Create PAT secret
apiVersion: v1
kind: Secret
metadata:
name: azure-devops-pat
namespace: flux-system
type: Opaque
stringData:
username: git
password: <AZURE_DEVOPS_PAT> # Token with Code (Read) permission
---
# GitRepository uses PAT
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops
namespace: flux-system
spec:
url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
secretRef:
name: azure-devops-pat
Kustomization Resource¶
Definition: Defines what to deploy (path in Git repository, reconciliation interval, health checks).
Example:
# clusters/production/kustomizations/atp-ingestion.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-ingestion
namespace: flux-system
spec:
interval: 5m # Reconcile every 5 minutes
path: ./apps/atp-ingestion/overlays/production # Path in Git repository
prune: true # Delete resources removed from Git
sourceRef:
kind: GitRepository
name: atp-gitops
namespace: flux-system
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
namespace: atp-production
timeout: 10m # Timeout for reconciliation
retryInterval: 2m # Retry interval on failure
suspend: false
Status (Reconciled):
# Check Kustomization status
kubectl describe kustomization atp-ingestion -n flux-system
# Output:
# Status:
# Conditions:
# Last Transition Time: 2024-10-30T15:35:00Z
# Message: Applied revision: production/abc123d7890def4567890abc123def4567890ab
# Observed Generation: 1
# Reason: ReconciliationSucceeded
# Status: True
# Type: Ready
# Inventory:
# Entries:
# Id: apps_v1_Deployment_atp-production_atp-ingestion
# V: v1
# Last Applied Revision: production/abc123d7890def4567890abc123def4567890ab
# Last Attempted Revision: production/abc123d7890def4567890abc123def4567890ab
# Observed Generation: 1
Automatic Sync Policies per Environment¶
Per-Environment Configuration:
| Environment | Git Branch | Polling Interval | Reconciliation Interval | Webhook Trigger |
|---|---|---|---|---|
| Dev | `dev` | 30 seconds | 1 minute | Enabled (immediate) |
| Test | `test` | 1 minute | 2 minutes | Enabled (immediate) |
| Staging | `staging` | 1 minute | 5 minutes | Disabled (manual approval) |
| Production | `production` | 1 minute | 5 minutes | Disabled (manual approval + 24h cooldown) |
Production Sync Policy (Conservative):
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-ingestion-production
namespace: flux-system
spec:
interval: 5m # Reconcile every 5 minutes (not immediate)
path: ./apps/atp-ingestion/overlays/production
prune: true
sourceRef:
kind: GitRepository
name: atp-gitops
# Production: Manual approval required (webhook disabled)
# Production: No automatic sync on push (polling only)
Behavior:
1. Developer pushes commit to production branch
2. GitRepository polls Git every 1 minute (detects new commit)
3. Kustomization reconciles every 5 minutes (applies changes)
4. Total delay: Up to 6 minutes (1 min poll + 5 min reconcile)
Dev Sync Policy (Aggressive):
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-ingestion-dev
namespace: flux-system
spec:
interval: 1m # Reconcile every 1 minute
path: ./apps/atp-ingestion/overlays/dev
prune: true
sourceRef:
kind: GitRepository
name: atp-gitops-dev
Polling Intervals and Webhook Triggers¶
Polling Configuration:
# GitRepository polling interval
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
spec:
interval: 1m # Minimum: 30s, Maximum: 24h
Webhook Triggers (Optional):
Purpose: Trigger immediate reconciliation when Git push occurs (faster than polling).
Azure DevOps Webhook (Receive POST on push):
# FluxCD Receiver (webhook endpoint)
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
name: atp-gitops-webhook
namespace: flux-system
spec:
type: git
events:
- "push"
resources:
- kind: GitRepository
name: atp-gitops
secretRef:
name: webhook-token
# Azure DevOps webhook URL:
# https://<fluxcd-receiver>/hook/xyz123abc456...
Configuration in Azure DevOps:
Azure DevOps > Repos > Hooks > Add Subscription
Name: FluxCD Webhook
Event: Code pushed
Filters:
- Branch: dev, test (production excluded for safety)
Service Hook URL: https://<fluxcd-receiver>/hook/xyz123abc456...
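The Receiver above references a `webhook-token` secret that must exist first; a minimal sketch (the token is generated locally and is only an example):

```bash
# Generate a shared token and store it where the Receiver expects it
TOKEN=$(head -c 12 /dev/urandom | sha256sum | cut -d' ' -f1)
kubectl -n flux-system create secret generic webhook-token \
  --from-literal=token="$TOKEN"

# Read the generated hook path from the Receiver status once it is ready
kubectl -n flux-system get receiver atp-gitops-webhook \
  -o jsonpath='{.status.webhookPath}'
```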
Benefits:
- ✅ Faster sync: Changes applied within seconds (vs at most 6 minutes with polling)
- ✅ Reduced Git polling: Lower load on Azure DevOps Git servers
Trade-offs:
- ⚠️ Security: Webhook endpoint must be publicly accessible (or use Azure DevOps IP allowlist)
- ⚠️ Production risk: Disabled for production (manual approval required)
Principle 4: Continuously Reconciled¶
Definition: Software agents automatically and continuously ensure the actual system state matches the desired state (stored in Git). Any drift from the desired state is automatically corrected.
Key Concepts:
- Drift Detection: Continuous monitoring of cluster state vs Git state
- Self-Healing: Automatic correction of configuration drift
- Reconciliation Loop: Periodic checks and corrections (every 1-5 minutes)
- Drift Correction: Revert manual changes to match Git state
Drift Detection Mechanisms¶
How FluxCD Detects Drift:
- Periodic Reconciliation: FluxCD compares cluster state to Git state every reconciliation interval
- Resource Watching: Kubernetes watch API detects resource changes in real-time
- Inventory Tracking: FluxCD maintains inventory of applied resources (GitOps Toolkit)
Drift Detection Example:
# Scenario: Operator manually scales deployment (NOT via Git)
kubectl scale deployment atp-ingestion --replicas=5 -n atp-production
# FluxCD detects drift within 5 minutes (reconciliation interval)
flux get kustomizations
# Output:
# NAME READY MESSAGE
# atp-ingestion False Spec.Replicas drift detected: Git=3, Live=5
Drift Detection Status:
# Check drift detection status
kubectl describe kustomization atp-ingestion -n flux-system
# Output:
# Status:
# Conditions:
# Last Transition Time: 2024-10-30T15:40:00Z
# Message: Reconciliation failed: drift detected in Deployment atp-ingestion
# Observed Generation: 1
# Reason: DriftDetected
# Status: False
# Type: Ready
# Drift:
# Detected: true
# Resources:
# - Kind: Deployment
# Name: atp-ingestion
# Namespace: atp-production
# Drift: Spec.Replicas: Git=3, Live=5
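Before the next reconciliation overwrites a manual change, the pending correction can be previewed; a sketch using the flux CLI from a local checkout of atp-gitops:

```bash
# Server-side dry-run: diff the Git state against the live cluster state
flux diff kustomization atp-ingestion \
  --path ./apps/atp-ingestion/overlays/production
```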
Self-Healing Configuration¶
Automatic Drift Correction:
FluxCD automatically reverts manual changes to match Git state:
# Git state (desired): replicas=3
# Cluster state (actual): replicas=5 (manually changed)
# FluxCD reconciliation (automatic):
# 1. Detect drift: replicas=5 ≠ replicas=3
# 2. Apply Git state: kubectl scale deployment atp-ingestion --replicas=3
# 3. Cluster state matches Git state: replicas=3
Self-Healing Workflow:
graph TD
A[Manual Change<br/>kubectl scale] -->|immediate| B[Cluster State<br/>replicas=5]
C[Git State<br/>replicas=3] -.->|every 5 min| D[FluxCD<br/>Reconciliation]
D -->|compare| B
D -->|drift detected| E[FluxCD<br/>Auto-Correct]
E -->|apply Git state| B
B -.->|matches| C
style E fill:#90EE90
style D fill:#FFE5B4
Self-Healing Examples:
Example 1: Manual Replica Scaling:
# Operator manually scales deployment
kubectl scale deployment atp-ingestion --replicas=10 -n atp-production
# Within 5 minutes, FluxCD reverts to Git state
kubectl get deployment atp-ingestion -n atp-production
# Output (after reconciliation):
# NAME READY UP-TO-DATE AVAILABLE AGE
# atp-ingestion 3/3 3 3 5m
# Replicas: 3 (reverted from 10)
Example 2: Manual ConfigMap Update:
# Operator manually edits ConfigMap
kubectl edit configmap atp-ingestion-config -n atp-production
# Change: ASPNETCORE_ENVIRONMENT=Production → Development
# Within 5 minutes, FluxCD reverts to Git state
kubectl get configmap atp-ingestion-config -n atp-production -o yaml
# Output (after reconciliation):
# data:
# ASPNETCORE_ENVIRONMENT: Production # Reverted from Development
Example 3: Manual Resource Deletion:
# Operator accidentally deletes deployment
kubectl delete deployment atp-ingestion -n atp-production
# Within 5 minutes, FluxCD recreates deployment from Git
kubectl get deployment atp-ingestion -n atp-production
# Output (after reconciliation):
# NAME READY UP-TO-DATE AVAILABLE AGE
# atp-ingestion 3/3 3 3 30s # Recreated
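In all three cases the correction can also be triggered immediately rather than waiting out the reconciliation interval:

```bash
# Fetch the latest Git revision and reconcile in one step
flux reconcile kustomization atp-ingestion -n flux-system --with-source
```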
Reconciliation Loop Monitoring¶
Monitoring Reconciliation Status:
# Check all Kustomizations status
flux get kustomizations
# Output:
# NAME READY MESSAGE REVISION SUSPENDED
# atp-ingestion True Applied production/abc123d False
# atp-query True Applied production/abc123d False
# atp-integrity False Drift production/abc123d False
# atp-export True Applied production/abc123d False
# Check specific Kustomization
flux get kustomization atp-ingestion
# Output:
# NAME READY MESSAGE REVISION SUSPENDED
# atp-ingestion True Applied revision: production/abc123d production/abc123d False
# Watch reconciliation in real-time
flux get kustomizations --watch
# Output (updates every few seconds):
# NAME READY MESSAGE REVISION
# atp-integrity False Reconciliation in progress... production/abc123d
# atp-integrity True Applied revision: abc123d production/abc123d
Azure Monitor Metrics (FluxCD Reconciliation):
# FluxCD exports Prometheus metrics
# Metrics available in Azure Monitor via Prometheus scraping
# Key Metrics:
# - fluxcd_kustomize_reconciliation_duration_seconds # Time to reconcile
# - fluxcd_kustomize_reconciliation_total # Total reconciliations
# - fluxcd_kustomize_reconciliation_success_total # Successful reconciliations
# - fluxcd_kustomize_reconciliation_failure_total # Failed reconciliations
# - fluxcd_source_git_duration_seconds # Git fetch duration
KQL Query for Reconciliation Monitoring:
// Azure Monitor Log Analytics: Query FluxCD reconciliation metrics
PrometheusMetrics_CL
| where Name_s == "fluxcd_kustomize_reconciliation_duration_seconds"
| summarize
avg(Value_d) by KustomizationName_s, bin(TimeGenerated, 5m)
| render timechart
Grafana Dashboard (FluxCD Reconciliation):
# Grafana dashboard config
dashboard:
title: "FluxCD Reconciliation Status"
panels:
- title: "Reconciliation Duration"
query: "fluxcd_kustomize_reconciliation_duration_seconds"
type: "graph"
- title: "Reconciliation Success Rate"
query: "rate(fluxcd_kustomize_reconciliation_success_total[5m]) / rate(fluxcd_kustomize_reconciliation_total[5m])"
type: "stat"
- title: "Drift Detection Events"
query: "fluxcd_kustomize_reconciliation_failure_total{reason='DriftDetected'}"
type: "graph"
Drift Correction Strategies¶
Automatic Correction (Default):
FluxCD automatically corrects drift during reconciliation:
# Kustomization with automatic drift correction
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
spec:
prune: true # Delete resources removed from Git
# Automatic correction: Always revert to Git state
Manual Correction (When Needed):
# Option 1: Suspend reconciliation, fix manually, resume
flux suspend kustomization atp-ingestion -n flux-system
# Fix drift manually
kubectl scale deployment atp-ingestion --replicas=3 -n atp-production
# Resume reconciliation
flux resume kustomization atp-ingestion -n flux-system
# Option 2: Update Git to match cluster state (if intentional)
git checkout production
# Update kustomization.yaml to match current cluster state
git commit -m "chore: update replicas to match current state"
git push origin production
Drift Alerting:
# FluxCD Notification for drift detection
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
name: drift-detection-alert
namespace: flux-system
spec:
providerRef:
name: slack
namespace: flux-system
eventSeverity: warning
eventSources:
- kind: Kustomization
name: "*"
filters:
- key: reason
value: DriftDetected
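The `providerRef` above assumes a Provider named `slack` already exists in `flux-system`; a minimal sketch (channel and secret names are placeholders, and the API version depends on the installed Flux release):

```bash
kubectl apply -f - <<'EOF'
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: atp-gitops-alerts      # placeholder channel
  secretRef:
    name: slack-webhook-url       # secret whose "address" key holds the webhook URL
EOF
```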
Slack Alert Example:
⚠️ GitOps Drift Detected
Kustomization: atp-ingestion
Namespace: flux-system
Reason: DriftDetected
Resource: Deployment/atp-ingestion (atp-production)
Drift: Spec.Replicas: Git=3, Live=5
Action: FluxCD will automatically correct within 5 minutes
Drift Prevention Best Practices:
- Enforce Git-only Changes: RBAC prevents direct `kubectl` access to production
- Alert on Manual Changes: Monitor Kubernetes audit logs for manual changes (see the query sketch after this list)
- Regular Drift Audits: Weekly checks for unexpected cluster changes
- Documentation: Clear guidelines that all changes must go through Git
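A hedged sketch for the audit-log checks above, querying the cluster's Log Analytics workspace via the Azure CLI (the workspace GUID is a placeholder, and table/column names depend on the diagnostic settings in use):

```bash
# Surface kube-audit entries in atp-production not made by FluxCD's service accounts
az monitor log-analytics query \
  --workspace <LOG_ANALYTICS_WORKSPACE_GUID> \
  --analytics-query "
    AzureDiagnostics
    | where Category == 'kube-audit'
    | where log_s has 'atp-production'
    | where log_s !has 'system:serviceaccount:flux-system'
    | project TimeGenerated, log_s
    | take 100" \
  --timespan P7D
```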
Summary: Four Core Principles¶
- Principle 1: Declarative: Desired state expressed as declarative YAML (Kubernetes, Helm, Kustomize)
- Principle 2: Versioned & Immutable: All changes versioned in Git with GPG signatures, SemVer, immutable image tags, permanent audit trail
- Principle 3: Pulled Automatically: FluxCD inside cluster pulls from Git (GitRepository/Kustomization), polling intervals per environment, optional webhooks
- Principle 4: Continuously Reconciled: Automatic drift detection, self-healing configuration, reconciliation monitoring, drift correction strategies
Azure Repos Structure & Organization¶
Purpose: Define the repository strategy, directory structure, branching model, and access control for the ATP GitOps implementation, ensuring consistency, scalability, and compliance across all environments.
Repository Strategy: Hybrid Monorepo/Polyrepo¶
ATP uses a hybrid approach: polyrepo for service source code (separate repositories per microservice) and monorepo for GitOps manifests (single repository for all Kubernetes configurations).
Monorepo for GitOps Manifests¶
Repository: atp-gitops (Azure Repos)
Rationale:
| Aspect | Benefit | Impact |
|---|---|---|
| Atomic Updates | Update multiple services in single commit/PR | Ensures consistency across services (e.g., gateway + all microservices) |
| Cross-Service Visibility | See all deployments in one place | Easier to understand dependencies and relationships |
| Shared Configurations | Common base manifests, Helm values, Kustomize bases | DRY principle; reduce duplication |
| Compliance Auditing | Single audit trail for all infrastructure changes | Simpler SOC 2/GDPR audit reports |
| Environment Consistency | Same structure across dev/test/staging/production | Easier to promote configurations between environments |
| RBAC Simplification | One repository to manage permissions | Simpler access control (vs managing 7+ repos) |
Monorepo Structure:
atp-gitops/ # Single GitOps repository (monorepo)
├── clusters/ # Per-environment FluxCD configs
├── infrastructure/ # Cluster-wide infrastructure
├── apps/ # All ATP microservices
├── platform/ # Platform configs (RBAC, policies)
├── tenants/ # Multi-tenant configurations
├── monitoring/ # Observability stack
└── docs/ # Documentation and runbooks
Polyrepo for Service Source Code¶
Repositories: Separate repositories per microservice
- `atp-ingestion` (C# source code)
- `atp-query` (C# source code)
- `atp-integrity` (C# source code)
- `atp-export` (C# source code)
- `atp-policy` (C# source code)
- `atp-search` (C# source code)
- `atp-gateway` (C# source code)
Rationale:
| Aspect | Benefit | Impact |
|---|---|---|
| Team Autonomy | Each service team owns their repository | Faster development cycles; reduced merge conflicts |
| Independent CI/CD | Separate build pipelines per service | Parallel builds; faster feedback |
| Service Isolation | Clear ownership boundaries | Easier to onboard new teams; clearer responsibilities |
| Versioning Flexibility | Each service can version independently | Allows different release cadences per service |
| Repository Size | Smaller repositories (faster clones) | Better developer experience; faster CI builds |
Workflow: Source Code → CI → GitOps Repo → FluxCD¶
Complete Flow:
graph LR
subgraph "Source Code Repositories (Polyrepo)"
A1[atp-ingestion<br/>C# source]
A2[atp-query<br/>C# source]
A3[atp-integrity<br/>C# source]
end
subgraph "CI Stage (Azure Pipelines)"
B[Azure Pipelines<br/>Build + Test]
B -->|1. build Docker image| C[Azure Container<br/>Registry]
B -->|2. update manifest<br/>commit + push| D[atp-gitops<br/>Monorepo]
end
A1 -->|git push| B
A2 -->|git push| B
A3 -->|git push| B
subgraph "GitOps Repository (Monorepo)"
D1[apps/atp-ingestion/<br/>overlays/production]
D2[apps/atp-query/<br/>overlays/production]
D3[apps/atp-integrity/<br/>overlays/production]
end
D --> D1
D --> D2
D --> D3
subgraph "CD Stage (FluxCD)"
E[FluxCD Agent<br/>in AKS cluster]
E -->|git pull<br/>reconcile| F[AKS Cluster<br/>Deployments]
end
D -->|3. FluxCD polls Git| E
style D fill:#FFE5B4
style E fill:#90EE90
style F fill:#90EE90
Step-by-Step Workflow:
- Developer pushes to service repository:
cd atp-ingestion # Polyrepo
git add src/ConnectSoft.ATP.Ingestion/Controllers/AuditRecordsController.cs
git commit -m "feat: add gRPC ingestion endpoint"
git push origin feature/grpc-ingestion
- CI pipeline triggers (Azure Pipelines):
# azure-pipelines.yml in atp-ingestion repository
- stage: Build_Test_Publish
jobs:
- job: BuildAndTest
steps:
- task: Docker@2
inputs:
containerRegistry: 'ConnectSoft-ACR'
repository: 'atp/ingestion'
command: 'buildAndPush'
tags: |
$(Build.BuildNumber)
$(Build.BuildNumber)-$(Build.SourceVersion)
- task: Bash@3
displayName: 'Update GitOps Manifest'
inputs:
targetType: 'inline'
script: |
git clone https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
cd atp-gitops
# Update image tag in kustomization.yaml
yq eval '.images[0].newTag = "$(Build.BuildNumber)-$(Build.SourceVersion)"' \
-i apps/atp-ingestion/overlays/production/kustomization.yaml
git add apps/atp-ingestion/overlays/production/kustomization.yaml
git commit -m "chore(ingestion): update to $(Build.BuildNumber)"
git push origin production
- FluxCD detects Git change (within 1-5 minutes):
# FluxCD GitRepository polls Git every 1 minute
# FluxCD Kustomization reconciles every 5 minutes
# Result: New image tag applied to cluster automatically
- Deployment complete:
# Verify deployment
kubectl get pods -n atp-production -l app=atp-ingestion
# Output: Pods using new image tag
Benefits of Hybrid Approach:
- ✅ Best of both worlds: Team autonomy (polyrepo) + consistency (monorepo)
- ✅ Clear separation: Source code changes vs infrastructure changes
- ✅ Atomic deployments: Update multiple services in one PR (if needed)
- ✅ Simplified GitOps: One repository to manage permissions and branch policies
Directory Structure¶
Complete atp-gitops Repository Layout:
atp-gitops/
├── .github/ # GitHub Actions (if used) or Azure DevOps templates
│ ├── workflows/
│ └── templates/
│
├── clusters/ # Per-environment FluxCD bootstrap configs
│ ├── production/
│ │ ├── flux-system/ # FluxCD installation manifests
│ │ │ ├── gitrepository.yaml # GitRepository pointing to production branch
│ │ │ ├── kustomizations.yaml # Root Kustomization pointing to /infrastructure and /apps
│ │ │ └── notifications.yaml # Notification configs (Slack, Teams)
│ │ └── README.md
│ │
│ ├── staging/
│ │ ├── flux-system/
│ │ └── README.md
│ │
│ ├── test/
│ │ ├── flux-system/
│ │ └── README.md
│ │
│ └── dev/
│ ├── flux-system/
│ └── README.md
│
├── infrastructure/ # Cluster-wide infrastructure (base + overlays)
│ ├── base/ # Base infrastructure manifests
│ │ ├── namespaces.yaml # All namespaces (atp-production, atp-staging, etc.)
│ │ ├── resource-quotas.yaml # Default resource quotas
│ │ ├── network-policies.yaml # Default network policies
│ │ ├── pod-security-standards.yaml # Pod Security Admission configs
│ │ ├── azure-policy.yaml # Azure Policy for Kubernetes
│ │ └── kustomization.yaml
│ │
│ └── overlays/ # Environment-specific infrastructure
│ ├── production/
│ │ ├── kustomization.yaml
│ │ ├── resource-quotas-patch.yaml # Production resource quotas
│ │ └── network-policies-patch.yaml # Production network policies
│ ├── staging/
│ ├── test/
│ └── dev/
│
├── apps/ # ATP microservices (7 services)
│ ├── atp-ingestion/
│ │ ├── base/ # Base manifests (reusable)
│ │ │ ├── deployment.yaml
│ │ │ ├── service.yaml
│ │ │ ├── configmap.yaml
│ │ │ ├── ingress.yaml
│ │ │ └── kustomization.yaml
│ │ │
│ │ ├── helm/ # Helm chart (optional, alternative to base)
│ │ │ ├── Chart.yaml
│ │ │ ├── values.yaml
│ │ │ ├── values-dev.yaml
│ │ │ ├── values-production.yaml
│ │ │ └── templates/
│ │ │ ├── deployment.yaml
│ │ │ ├── service.yaml
│ │ │ └── configmap.yaml
│ │ │
│ │ └── overlays/ # Environment-specific overlays
│ │ ├── dev/
│ │ │ ├── kustomization.yaml
│ │ │ ├── deployment-patch.yaml
│ │ │ └── configmap-patch.yaml
│ │ ├── test/
│ │ ├── staging/
│ │ └── production/
│ │ ├── kustomization.yaml
│ │ ├── deployment-patch.yaml
│ │ ├── hpa-patch.yaml # Horizontal Pod Autoscaler
│ │ └── configmap-patch.yaml
│ │
│ ├── atp-query/
│ │ ├── base/
│ │ ├── helm/
│ │ └── overlays/
│ │
│ ├── atp-integrity/
│ │ ├── base/
│ │ ├── helm/
│ │ └── overlays/
│ │
│ ├── atp-export/
│ │ ├── base/
│ │ ├── helm/
│ │ └── overlays/
│ │
│ ├── atp-policy/
│ │ ├── base/
│ │ ├── helm/
│ │ └── overlays/
│ │
│ ├── atp-search/
│ │ ├── base/
│ │ ├── helm/
│ │ └── overlays/
│ │
│ └── atp-gateway/
│ ├── base/
│ ├── helm/
│ └── overlays/
│
├── platform/ # Platform configurations
│ ├── rbac/ # Role-Based Access Control
│ │ ├── service-accounts.yaml # ServiceAccounts for all services
│ │ ├── roles.yaml # Namespace-scoped Roles
│ │ ├── role-bindings.yaml # Role Bindings
│ │ └── cluster-roles.yaml # Cluster-wide Roles
│ │
│ ├── network-policies/ # Network isolation policies
│ │ ├── default-deny.yaml # Default deny all traffic
│ │ ├── allow-namespace-internal.yaml # Allow within namespace
│ │ └── allow-cross-namespace.yaml # Allow specific cross-namespace
│ │
│ ├── pod-security/ # Pod Security Standards
│ │ ├── baseline.yaml # Baseline profile
│ │ └── restricted.yaml # Restricted profile (production)
│ │
│ ├── resource-quotas/ # Resource quotas per namespace
│ │ ├── production.yaml
│ │ ├── staging.yaml
│ │ └── dev.yaml
│ │
│ └── azure-policy/ # Azure Policy for Kubernetes
│ ├── pod-security-standards.yaml
│ ├── resource-limits.yaml
│ └── image-registry.yaml
│
├── tenants/ # Multi-tenant configurations
│ ├── tenant-acme-corp/
│ │ ├── namespace.yaml
│ │ ├── resource-quota.yaml
│ │ ├── network-policy.yaml
│ │ ├── rbac.yaml
│ │ ├── config.yaml
│ │ └── kustomization.yaml
│ │
│ ├── tenant-widgets-inc/
│ │ ├── namespace.yaml
│ │ ├── resource-quota.yaml
│ │ ├── network-policy.yaml
│ │ ├── rbac.yaml
│ │ ├── config.yaml
│ │ └── kustomization.yaml
│ │
│ └── tenant-global-bank/
│ ├── namespace.yaml
│ ├── resource-quota.yaml
│ ├── network-policy.yaml
│ ├── rbac.yaml
│ ├── config.yaml
│ └── kustomization.yaml
│
├── monitoring/ # Observability stack
│ ├── prometheus/ # Prometheus Operator manifests
│ │ ├── prometheus.yaml
│ │ ├── servicemonitor.yaml
│ │ └── alerting-rules.yaml
│ │
│ ├── grafana/ # Grafana dashboards
│ │ ├── dashboards/
│ │ └── datasources.yaml
│ │
│ ├── fluent-bit/ # Log collection
│ │ └── fluent-bit-config.yaml
│ │
│ └── jaeger/ # Distributed tracing
│ └── jaeger-operator.yaml
│
├── docs/ # Documentation and runbooks
│ ├── README.md # Repository overview
│ ├── CONTRIBUTING.md # How to contribute
│ ├── runbooks/
│ │ ├── rollback-procedure.md
│ │ ├── disaster-recovery.md
│ │ └── troubleshooting.md
│ └── architecture/
│ ├── directory-structure.md
│ └── branching-model.md
│
├── .gitignore # Git ignore patterns
├── .pre-commit-hooks.yaml # Pre-commit hooks (secret detection)
├── LICENSE # Repository license
└── README.md # Main README
Directory Purpose Reference¶
| Directory | Purpose | Example Files |
|---|---|---|
| `/clusters` | FluxCD bootstrap configs per environment | `gitrepository.yaml`, `kustomizations.yaml` |
| `/infrastructure` | Cluster-wide infrastructure (namespaces, quotas, policies) | `namespaces.yaml`, `resource-quotas.yaml` |
| `/apps` | ATP microservice deployments | `deployment.yaml`, `service.yaml`, `configmap.yaml` |
| `/platform` | Platform configurations (RBAC, network policies, security) | `service-accounts.yaml`, `network-policies.yaml` |
| `/tenants` | Multi-tenant configurations | `namespace.yaml`, `resource-quota.yaml` |
| `/monitoring` | Observability stack (Prometheus, Grafana, Fluent Bit) | `prometheus.yaml`, `grafana-dashboards/` |
| `/docs` | Documentation and operational runbooks | `runbooks/rollback-procedure.md` |
Naming Conventions¶
Directory Naming Standards¶
Pattern: lowercase-with-hyphens
| Directory Type | Naming Pattern | Example |
|---|---|---|
| Service directories | `atp-{service-name}` | `atp-ingestion`, `atp-query` |
| Environment overlays | `{environment}` | `production`, `staging`, `test`, `dev` |
| Base directories | `base` | `base/` |
| Helm directories | `helm` | `helm/` |
| Tenant directories | `tenant-{tenant-id}` | `tenant-acme-corp`, `tenant-widgets-inc` |
File Naming Patterns¶
Pattern: kebab-case.yaml or kebab-case-patch.yaml
| File Type | Naming Pattern | Example |
|---|---|---|
| Kubernetes manifests | `{resource-kind}.yaml` | `deployment.yaml`, `service.yaml` |
| Kustomization files | `kustomization.yaml` | `kustomization.yaml` |
| Strategic merge patches | `{resource-kind}-patch.yaml` | `deployment-patch.yaml`, `hpa-patch.yaml` |
| Helm values | `values-{environment}.yaml` | `values-production.yaml`, `values-dev.yaml` |
| ConfigMaps | `{service-name}-config.yaml` | `atp-ingestion-config.yaml` |
| Documentation | `kebab-case.md` | `rollback-procedure.md`, `disaster-recovery.md` |
Resource Naming Conventions (Kubernetes)¶
Pattern: {service-name} or {service-name}-{suffix}
| Resource Type | Naming Pattern | Example |
|---|---|---|
| Deployments | `{service-name}` | `atp-ingestion`, `atp-query` |
| Services | `{service-name}` | `atp-ingestion`, `atp-query` |
| ConfigMaps | `{service-name}-config` | `atp-ingestion-config` |
| Secrets | `{service-name}-secrets` | `atp-ingestion-secrets` |
| Ingress | `{service-name}-ingress` | `atp-ingestion-ingress` |
| ServiceAccounts | `{service-name}-sa` | `atp-ingestion-sa` |
| HPA | `{service-name}-hpa` | `atp-ingestion-hpa` |
| NetworkPolicy | `{service-name}-network-policy` | `atp-ingestion-network-policy` |
Complete Examples for All ATP Services:
# Deployment names
atp-ingestion
atp-query
atp-integrity
atp-export
atp-policy
atp-search
atp-gateway
# Service names
atp-ingestion
atp-query
atp-integrity
atp-export
atp-policy
atp-search
atp-gateway
# ConfigMap names
atp-ingestion-config
atp-query-config
atp-integrity-config
atp-export-config
atp-policy-config
atp-search-config
atp-gateway-config
# ServiceAccount names
atp-ingestion-sa
atp-query-sa
atp-integrity-sa
atp-export-sa
atp-policy-sa
atp-search-sa
atp-gateway-sa
# Namespace names
atp-production # All production services
atp-staging # All staging services
atp-test # All test services
atp-dev # All dev services
atp-tenant-acme # Tenant-specific namespace
atp-tenant-widgets # Tenant-specific namespace
Label Naming:
# Standard labels (applied to all resources)
labels:
app: atp-ingestion # Service name
component: ingestion # Component name (matches service name)
tier: backend # Service tier (backend, frontend, database)
version: v1.2.3 # Application version
environment: production # Environment (production, staging, test, dev)
managed-by: fluxcd # Management tool
compliance: soc2-gdpr-hipaa # Compliance requirements
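A consistent label set pays off when selecting resources across the cluster; for example:

```bash
# All production pods for one service, selected by the standard labels
kubectl get pods -n atp-production -l app=atp-ingestion,environment=production

# Everything FluxCD manages in the namespace
kubectl get all -n atp-production -l managed-by=fluxcd
```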
Branching Model¶
ATP uses a GitOps branching model aligned with the environment promotion workflow.
Environment Branches¶
Branch Structure:
main # Production (protected, requires approvals)
├── staging # Staging environment (protected)
│ ├── test # Test environment (protected)
│ │ └── dev # Dev environment (unprotected, fast merge)
│ │ └── feature/* # Feature branches (unprotected)
Branch Details:
| Branch | Environment | Purpose | Protection Level | Merge Strategy |
|---|---|---|---|---|
| `main` (or `production`) | Production | Live production environment | 🔒 Highest | Squash merge + approvals |
| `staging` | Staging | Pre-production testing | 🔒 High | Squash merge + approvals |
| `test` | Test | Integration testing | 🔒 Medium | Squash merge + approvals |
| `dev` | Development | Developer testing | 🔓 Low | Fast-forward merge |
| `feature/*` | N/A | Feature development | 🔓 None | Fast-forward merge |
| `hotfix/*` | Production | Emergency fixes | 🔒 High | Squash merge + expedited approval |
Branch Protection Rules (Azure DevOps):
# Azure DevOps Branch Policy Configuration
# Repos > Branches > {branch-name} > Branch Policies
main (Production):
✓ Require pull request (minimum 2 reviewers)
✓ Require approval from: Architect + SRE Lead
✓ Require status checks: CI pipeline must pass
✓ Require signed commits (GPG)
✓ Require merge strategy: Squash merge only
✓ Require minimum reviewers: 2 (including required reviewers)
✓ Require code review: Yes
✓ Build validation: CI pipeline
✓ Automatic reviewers: Platform Team (suggested)
staging:
✓ Require pull request (minimum 1 reviewer)
✓ Require approval from: Architect or SRE
✓ Require status checks: CI pipeline must pass
✓ Require signed commits (GPG)
✓ Require merge strategy: Squash merge only
✓ Build validation: CI pipeline
test:
✓ Require pull request (minimum 1 reviewer)
✓ Require approval from: Any DevOps Engineer
✓ Require status checks: CI pipeline must pass
✓ Require merge strategy: Squash merge preferred
✓ Build validation: CI pipeline
dev:
✓ No branch protection (fast development)
✓ Allow direct push
✓ Allow fast-forward merge
feature/*:
✓ No branch protection
✓ Allow direct push
✓ Allow fast-forward merge
Branch Promotion Workflow:
graph LR
A[feature/my-feature] -->|PR + merge| B[dev]
B -->|PR + approval| C[test]
C -->|PR + approval| D[staging]
D -->|PR + approval| E[main<br/>production]
F[hotfix/critical-bug] -.->|expedited PR| E
style E fill:#ffcccc
style D fill:#FFE5B4
style C fill:#FFE5B4
style B fill:#90EE90
style A fill:#90EE90
Example: Promoting Change from Dev → Production:
# Step 1: Create feature branch from dev
git checkout dev
git pull origin dev
git checkout -b feature/upgrade-ingestion-v1.2.4
# Make changes to apps/atp-ingestion/overlays/dev/kustomization.yaml
git commit -m "feat(ingestion): upgrade to v1.2.4 in dev"
git push origin feature/upgrade-ingestion-v1.2.4
# Step 2: Merge to dev (fast-forward, no approval needed)
git checkout dev
git merge --ff-only feature/upgrade-ingestion-v1.2.4
git push origin dev
# FluxCD automatically deploys to dev cluster
# Step 3: Create PR dev → test
# Azure DevOps: Create Pull Request from dev to test
# Requires: 1 DevOps Engineer approval
# After merge: FluxCD deploys to test cluster
# Step 4: Create PR test → staging
# Azure DevOps: Create Pull Request from test to staging
# Requires: Architect or SRE approval
# After merge: FluxCD deploys to staging cluster
# Step 5: Create PR staging → main (production)
# Azure DevOps: Create Pull Request from staging to main
# Requires: Architect + SRE Lead approval
# Requires: CI pipeline passing + signed commits
# After merge: FluxCD deploys to production cluster
Approval Requirements Matrix¶
| Branch | Minimum Reviewers | Required Approvers | Status Checks | GPG Signing | Merge Strategy |
|---|---|---|---|---|---|
| main (production) | 2 | Architect + SRE Lead | ✅ Required | ✅ Required | Squash only |
| staging | 1 | Architect or SRE | ✅ Required | ✅ Required | Squash only |
| test | 1 | Any DevOps Engineer | ✅ Required | ⚠️ Preferred | Squash preferred |
| dev | 0 | None | ⚠️ Optional | ❌ Not required | Fast-forward |
| feature/* | 0 | None | ❌ Not required | ❌ Not required | Fast-forward |
| hotfix/* | 1 | Architect or SRE Lead | ✅ Required | ✅ Required | Squash only |
Versioning and Tagging¶
Semantic Versioning (SemVer) Strategy¶
Format: MAJOR.MINOR.PATCH
- MAJOR: Breaking changes (API incompatibility, schema changes)
- MINOR: New features (backward-compatible)
- PATCH: Bug fixes (backward-compatible)
Example Versions:
v1.2.4 # Minor feature release
v1.2.3 # Patch release (bug fix)
v2.0.0 # Major release (breaking changes)
v1.2.4-hotfix1 # Hotfix release
Service-Specific Tags¶
Format: {service-name}/v{VERSION}
# Tag ingestion service v1.2.4
git tag -a atp-ingestion/v1.2.4 -m "ATP Ingestion Service v1.2.4"
git push origin atp-ingestion/v1.2.4
# Tag query service v1.3.0
git tag -a atp-query/v1.3.0 -m "ATP Query Service v1.3.0"
git push origin atp-query/v1.3.0
Environment-Wide Release Tags¶
Format: release/v{VERSION} or release/{ENVIRONMENT}/v{VERSION}
# Production release tag
git tag -a release/v1.2.4 -m "Production Release v1.2.4
Services:
- atp-ingestion: v1.2.4
- atp-query: v1.3.0
- atp-integrity: v1.1.5
- atp-export: v1.0.2
- atp-policy: v1.2.0
- atp-search: v1.1.0
- atp-gateway: v1.4.0
Changelog: https://dev.azure.com/ConnectSoft/ATP/_wiki/wikis/ATP.wiki/12345/Release-Notes-v1.2.4"
git push origin release/v1.2.4
# Staging release tag
git tag -a release/staging/v1.2.4-rc1 -m "Staging Release Candidate v1.2.4-rc1"
git push origin release/staging/v1.2.4-rc1
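The annotated tags then double as an audit index; for example:

```bash
# Newest release tags first
git tag -l 'release/*' --sort=-creatordate

# Inspect an annotated tag's message (service versions, changelog link)
git show release/v1.2.4 --no-patch
```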
Hotfix Tagging Conventions¶
Format: hotfix/v{VERSION}-hotfix{N} or {SERVICE}/v{VERSION}-hotfix{N}
# Service-specific hotfix
git tag -a atp-ingestion/v1.2.4-hotfix1 -m "Hotfix: Memory leak in Redis connection pooling"
git push origin atp-ingestion/v1.2.4-hotfix1
# Environment-wide hotfix
git tag -a hotfix/v1.2.4-hotfix1 -m "Production Hotfix v1.2.4-hotfix1
Critical fixes:
- atp-ingestion: Memory leak fix
- atp-gateway: Rate limiting bug fix"
git push origin hotfix/v1.2.4-hotfix1
Image Tagging in ACR¶
Strategy: Immutable tags combining version + commit SHA
Format: {VERSION}-{COMMIT-SHA}
# Docker image tags (in Azure Container Registry)
connectsoft.azurecr.io/atp/ingestion:v1.2.4 # Semantic version
connectsoft.azurecr.io/atp/ingestion:v1.2.4-abc123d # Version + commit SHA (immutable)
connectsoft.azurecr.io/atp/ingestion:latest # Latest (dev only, mutable)
ACR Tagging Rules:
| Tag Pattern | Mutable? | Use Case | Example |
|---|---|---|---|
| `v{VERSION}` | ❌ Immutable | Production releases | `v1.2.4` |
| `v{VERSION}-{SHA}` | ❌ Immutable | Production releases (traceable) | `v1.2.4-abc123d` |
| `latest` | ✅ Mutable | Development only | `latest` |
| `{BRANCH}` | ✅ Mutable | Feature branches | `feature/grpc-ingestion` |
Access Control and RBAC¶
Azure DevOps Repository Permissions¶
Permission Levels:
| Permission | Description | Typical Roles |
|---|---|---|
| Reader | Can view repository | Compliance Officers, Auditors |
| Contributor | Can create branches, submit PRs | Developers, DevOps Engineers |
| Contribute | Can push to unprotected branches | Developers (dev branch) |
| Contribute + Pull Request | Can create PRs to protected branches | Developers, DevOps Engineers |
| Admin | Full control (manage permissions, delete repo) | Platform Team Leads |
Permission Matrix:
| Role | Repository Access | Branch Access | Approval Authority |
|---|---|---|---|
| Developer | ✅ Contributor | ✅ Create PRs (dev, test) | ❌ None |
| DevOps Engineer | ✅ Contributor | ✅ Approve PRs (dev, test) | ✅ Dev/Test deployments |
| Architect | ✅ Contributor | ✅ Approve PRs (staging, production) | ✅ Staging/Prod deployments |
| SRE Engineer | ✅ Contributor | ✅ Approve PRs (production) | ✅ Production deployments |
| Security Officer | ✅ Reader | ✅ Read-only (audit) | ❌ None |
| Compliance Officer | ✅ Reader | ✅ Read-only (audit) | ❌ None |
| Platform Team | ✅ Admin | ✅ Full access (all branches) | ✅ All deployments |
Azure AD Group Mappings¶
Group Structure:
Azure AD Groups:
├── ATP-Platform-Team # Platform Team (Admin access)
├── ATP-Developers # All developers (Contributor)
├── ATP-DevOps-Engineers # DevOps Engineers (Contributor, approve dev/test)
├── ATP-Architects # Architects (Contributor, approve staging/prod)
├── ATP-SRE-Engineers # SRE Engineers (Contributor, approve production)
├── ATP-Security-Team # Security Team (Reader, audit access)
└── ATP-Compliance-Team # Compliance Team (Reader, audit access)
Azure DevOps Group Configuration:
# Azure DevOps Project Settings > Permissions > Groups
Groups:
- name: ATP-Platform-Team
permissions:
- Repos: Admin
- Build: Admin
- Release: Admin
- name: ATP-Developers
permissions:
- Repos: Contributor
- Build: User
- Release: User
- name: ATP-SRE-Engineers
permissions:
- Repos: Contributor
- Build: User
- Release: Admin
Principle of Least Privilege Enforcement¶
Enforcement Mechanisms:
- Branch Protection Policies: Prevent direct pushes to protected branches
- Required Approvals: Multiple reviewers for production
- GPG Signing: All production commits must be signed
- Status Checks: CI pipeline must pass before merge
- Audit Logging: All access logged in Azure AD audit logs
Access Review Process:
- Frequency: Quarterly access reviews
- Owner: Platform Team Lead
- Review Scope: Repository permissions, branch policies, approval authorities
- Compliance: SOC 2 CC6.1 (Access Control)
Summary¶
- Repository Strategy: Hybrid monorepo (GitOps manifests) + polyrepo (service source code)
- Directory Structure: 7 main directories (`/clusters`, `/infrastructure`, `/apps`, `/platform`, `/tenants`, `/monitoring`, `/docs`)
- Naming Conventions: Kebab-case for directories/files, consistent patterns for Kubernetes resources
- Branching Model: Environment-based branches (main → staging → test → dev → feature/*) with promotion workflow
- Versioning: SemVer for services, environment-wide release tags, hotfix conventions
- Access Control: Azure AD group mappings, branch protection policies, least privilege enforcement
FluxCD Installation & Configuration on AKS¶
Purpose: Provide a complete guide for installing, configuring, and managing FluxCD on Azure Kubernetes Service (AKS) for the ATP GitOps implementation, including multi-cluster setup, Azure integration, and operational best practices.
FluxCD Architecture Overview¶
Definition: FluxCD is a GitOps operator for Kubernetes that automatically keeps clusters in sync with Git repositories. It consists of modular controllers that work together to provide continuous reconciliation.
FluxCD Components¶
Core Controllers:
| Component | Purpose | Namespace | Responsibilities |
|---|---|---|---|
| source-controller | Fetches sources (Git, Helm, OCI) | `flux-system` | Clones Git repos, fetches Helm charts, caches artifacts |
| kustomize-controller | Applies Kustomize manifests | `flux-system` | Renders Kustomize, applies to cluster, monitors drift |
| helm-controller | Manages Helm releases | `flux-system` | Installs/upgrades Helm charts, manages dependencies |
| notification-controller | Sends alerts/notifications | `flux-system` | Slack, Teams, Discord, webhook notifications |
| image-reflector-controller | Scans image repositories | `flux-system` | Discovers new image tags, updates image policies |
| image-automation-controller | Updates Git automatically | `flux-system` | Commits image tag updates back to Git |
Component Architecture:
graph TD
A[Git Repository<br/>Azure Repos] -->|git pull| B[Source Controller]
B -->|artifact cache| C[GitRepository<br/>CRD]
C -->|notify| D[Kustomize Controller]
C -->|notify| E[Helm Controller]
D -->|render + apply| F[Kubernetes API<br/>AKS Cluster]
E -->|install/upgrade| F
F -.->|watch| D
F -.->|watch| E
D -.->|reconcile| F
E -.->|reconcile| F
G[Container Registry<br/>ACR] -->|scan tags| H[Image Reflector<br/>Controller]
H -->|update| I[Image Policy<br/>CRD]
I -->|trigger| J[Image Automation<br/>Controller]
J -->|commit| A
D -->|events| K[Notification<br/>Controller]
E -->|events| K
K -->|alerts| L[Slack / Teams /<br/>Azure Monitor]
style B fill:#90EE90
style D fill:#90EE90
style E fill:#90EE90
style H fill:#90EE90
style J fill:#90EE90
style K fill:#FFE5B4
How FluxCD Works: Reconciliation Loop¶
Reconciliation Process:
- Source Fetch (Source Controller):
  - Polls Git repository at configured interval (e.g., every 1 minute)
  - Clones repository and creates artifact (tar.gz)
  - Stores artifact in cluster-local cache
  - Updates GitRepository CRD status
- Manifest Rendering (Kustomize/Helm Controller):
  - Reads artifact from Source Controller
  - Renders Kustomize overlays or Helm templates
  - Generates final Kubernetes manifests
- State Comparison (Kustomize/Helm Controller):
  - Compares desired state (from Git) with actual state (in cluster)
  - Detects differences (drift detection)
  - Calculates required changes
- Apply Changes (Kustomize/Helm Controller):
  - Applies changes via Kubernetes API (`kubectl apply`)
  - Waits for resources to become ready
  - Updates Kustomization/HelmRelease CRD status
- Health Monitoring (Kustomize/Helm Controller):
  - Monitors resource health (Deployment, StatefulSet, etc.)
  - Reports health status in CRD status
  - Triggers notifications on failure
Reconciliation Loop Diagram:
sequenceDiagram
participant Git as Git Repository
participant SC as Source Controller
participant KC as Kustomize Controller
participant K8s as Kubernetes API
loop Every 1 minute (GitRepository interval)
SC->>Git: git pull
Git-->>SC: repository contents
SC->>SC: create artifact (tar.gz)
SC->>SC: update GitRepository.status
end
loop Every 5 minutes (Kustomization interval)
KC->>SC: fetch artifact
SC-->>KC: artifact.tar.gz
KC->>KC: render Kustomize
KC->>K8s: get current state
K8s-->>KC: current resources
KC->>KC: compare desired vs actual
alt Drift detected
KC->>K8s: kubectl apply (correct drift)
K8s-->>KC: resources updated
end
KC->>KC: update Kustomization.status
KC->>KC: check health
end
FluxCD vs ArgoCD Comparison¶
Feature Comparison:
| Feature | FluxCD | ArgoCD | ATP Choice |
|---|---|---|---|
| Installation | Lightweight, modular | Single deployment, heavier | ✅ FluxCD (simpler) |
| UI | Flux Dashboard (optional) | Rich web UI (default) | ⚠️ ArgoCD (better UX, but ATP uses CLI) |
| GitOps Toolkit | Modular (use only needed controllers) | Monolithic | ✅ FluxCD (flexibility) |
| Helm Support | Full support | Full support | ✅ Both |
| Kustomize Support | Native (built-in) | Native (built-in) | ✅ Both |
| Multi-Cluster | Strong (Fleet, kubeconfig) | Strong (ApplicationSets) | ✅ Both |
| Azure Integration | Native (Workload Identity) | Requires setup | ✅ FluxCD (better Azure native) |
| Learning Curve | Moderate | Steeper (more features) | ✅ FluxCD (simpler) |
| CNCF Status | Graduated | Graduated | ✅ Both |
| Community | Active (CNCF) | Very active (CNCF) | ✅ Both |
| Performance | Fast (lightweight) | Good (heavier) | ✅ FluxCD (lower overhead) |
| Security | Strong (least privilege) | Strong | ✅ Both |
| Drift Detection | Excellent | Excellent | ✅ Both |
ATP Selection Rationale: ✅ FluxCD Chosen
Reasons:
- Azure Native Integration: Better support for Azure AD Workload Identity (zero-trust authentication)
- Modular Architecture: Use only needed controllers (source + kustomize), reduce cluster overhead
- Simpler Operation: Less complexity, easier troubleshooting
- Performance: Lower resource footprint (important for multi-cluster setup)
- CLI-First Approach: ATP team prefers CLI/Git workflow over web UI
AKS Cluster Prerequisites¶
Cluster Requirements¶
Minimum Requirements:
| Component | Requirement | Rationale |
|---|---|---|
| Kubernetes Version | 1.28+ (1.30+ recommended) | FluxCD v2 requires Kubernetes 1.25+, newer versions provide better API support |
| Cluster SKU | Standard (not Free tier) | Required for RBAC, network policies, advanced features |
| Node Pool | 2+ nodes, 4+ vCPUs total | FluxCD controllers need resources; redundancy for HA |
| Network Plugin | Azure CNI (recommended) or kubenet | Azure CNI provides better networking for multi-tenant |
| RBAC | Enabled (default) | Required for FluxCD controllers to manage cluster resources |
| Pod Security Standards | Enabled (default in 1.23+) | Required for compliance (SOC 2, GDPR, HIPAA) |
Recommended Configuration:
# Create AKS cluster with recommended settings
az aks create \
--resource-group ATP-Production-EUS-RG \
--name atp-prod-eus-aks \
--kubernetes-version 1.30.0 \
--node-count 3 \
--node-vm-size Standard_D4s_v3 \
--enable-cluster-autoscaler \
--min-count 3 \
--max-count 10 \
--network-plugin azure \
--network-policy azure \
--enable-oidc-issuer \
--enable-workload-identity \
--enable-managed-identity \
--enable-addons monitoring,azure-policy \
--workspace-resource-id /subscriptions/.../resourceGroups/.../providers/Microsoft.OperationalInsights/workspaces/atp-prod-eus-logs \
--tags environment=production compliance=soc2-gdpr-hipaa
Node Pool Configuration¶
System Node Pool (for FluxCD and system workloads):
# System node pool (dedicated for system pods)
az aks nodepool add \
--resource-group ATP-Production-EUS-RG \
--cluster-name atp-prod-eus-aks \
--name systempool \
--node-count 2 \
--node-vm-size Standard_D4s_v3 \
--mode System \
--labels workload=system tier=platform \
--node-taints CriticalAddonsOnly=true:NoSchedule \
--enable-cluster-autoscaler \
--min-count 2 \
--max-count 4
User Node Pool (for application workloads):
# User node pool (for ATP microservices)
az aks nodepool add \
--resource-group ATP-Production-EUS-RG \
--cluster-name atp-prod-eus-aks \
--name userpool \
--node-count 3 \
--node-vm-size Standard_D8s_v3 \
--mode User \
--labels workload=application tier=backend \
--enable-cluster-autoscaler \
--min-count 3 \
--max-count 20
Azure Integration Setup¶
Required Azure Resources:
- Azure Container Registry (ACR): For container images
- Azure Key Vault: For secrets management
- Azure Monitor / Log Analytics: For observability
- Azure AD Application: For Workload Identity (optional but recommended)
Prerequisites Checklist:
#!/bin/bash
# prerequisites-check.sh — Verify all prerequisites before FluxCD installation
set -euo pipefail
echo "🔍 Checking AKS cluster prerequisites..."
# Check Kubernetes version
KUBERNETES_VERSION=$(az aks show \
--resource-group ATP-Production-EUS-RG \
--name atp-prod-eus-aks \
--query kubernetesVersion -o tsv)
if [[ $(echo "$KUBERNETES_VERSION 1.28.0" | tr " " "\n" | sort -V | head -n 1) != "1.28.0" ]]; then
echo "❌ Kubernetes version $KUBERNETES_VERSION < 1.28.0 (minimum required)"
exit 1
else
echo "✅ Kubernetes version: $KUBERNETES_VERSION"
fi
# Check OIDC issuer enabled
OIDC_ISSUER=$(az aks show \
--resource-group ATP-Production-EUS-RG \
--name atp-prod-eus-aks \
--query "oidcIssuerProfile.enabled" -o tsv)
if [[ "$OIDC_ISSUER" != "true" ]]; then
echo "❌ OIDC issuer not enabled (required for Workload Identity)"
exit 1
else
echo "✅ OIDC issuer enabled"
fi
# Check Workload Identity enabled
WORKLOAD_IDENTITY=$(az aks show \
--resource-group ATP-Production-EUS-RG \
--name atp-prod-eus-aks \
--query "securityProfile.workloadIdentity.enabled" -o tsv)
if [[ "$WORKLOAD_IDENTITY" != "true" ]]; then
echo "❌ Workload Identity not enabled (required for Azure AD integration)"
exit 1
else
echo "✅ Workload Identity enabled"
fi
# Check kubectl access
if ! kubectl cluster-info > /dev/null 2>&1; then
echo "❌ kubectl not configured or cluster not accessible"
exit 1
else
echo "✅ kubectl configured and cluster accessible"
fi
# Check RBAC enabled
RBAC=$(az aks show \
--resource-group ATP-Production-EUS-RG \
--name atp-prod-eus-aks \
--query "enableRbac" -o tsv)
if [[ "$RBAC" != "true" ]]; then
echo "❌ RBAC not enabled (required for FluxCD)"
exit 1
else
echo "✅ RBAC enabled"
fi
# Check node count
NODE_COUNT=$(az aks nodepool show \
--resource-group ATP-Production-EUS-RG \
--cluster-name atp-prod-eus-aks \
--name systempool \
--query count -o tsv)
if [[ "$NODE_COUNT" -lt 2 ]]; then
echo "❌ Node count $NODE_COUNT < 2 (minimum 2 nodes recommended)"
exit 1
else
echo "✅ Node count: $NODE_COUNT"
fi
echo "✅ All prerequisites met!"
FluxCD Installation¶
Installation via Azure CLI (Recommended)¶
Prerequisites: Azure CLI with k8s-extension extension installed
# Install k8s-extension if not already installed
az extension add --name k8s-extension
# Install the FluxCD cluster extension (the extension itself takes no Git settings)
az k8s-extension create \
--resource-group ATP-Production-EUS-RG \
--cluster-name atp-prod-eus-aks \
--cluster-type managedClusters \
--extension-type microsoft.flux \
--name flux \
--scope cluster \
--auto-upgrade-minor-version true
# Point the cluster at the GitOps repository (separate configuration step)
az k8s-configuration flux create \
--resource-group ATP-Production-EUS-RG \
--cluster-name atp-prod-eus-aks \
--cluster-type managedClusters \
--name atp-gitops \
--namespace flux-system \
--scope cluster \
--url ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops \
--branch production \
--ssh-private-key-file "$HOME/.ssh/azure-devops-flux" \
--kustomization name=flux-system path=./clusters/production prune=true
# Verify installation
az k8s-extension show \
--resource-group ATP-Production-EUS-RG \
--cluster-name atp-prod-eus-aks \
--cluster-type managedClusters \
--name flux
Installation via kubectl (Flux CLI)¶
Prerequisites: Flux CLI installed
# Install Flux CLI
curl -s https://fluxcd.io/install.sh | sudo bash
# Verify installation
flux --version
# Install FluxCD components
flux install \
--namespace=flux-system \
--components=source-controller,kustomize-controller,helm-controller,notification-controller \
--export > flux-install.yaml
# Apply to cluster
kubectl apply -f flux-install.yaml
# Wait for FluxCD to be ready
kubectl wait --for=condition=ready pod \
--all \
--namespace flux-system \
--timeout=300s
Installation via Helm¶
Using Flux Helm Chart:
# Add Flux Helm repository
helm repo add fluxcd https://fluxcd-community.github.io/helm-charts
helm repo update
# Install FluxCD via Helm
helm install flux fluxcd/flux2 \
--namespace flux-system \
--create-namespace \
--set sourceController.create=true \
--set kustomizeController.create=true \
--set helmController.create=true \
--set notificationController.create=true \
--set imageReflectorController.create=false \
--set imageAutomationController.create=false
# Verify installation
kubectl get pods -n flux-system
Bootstrap FluxCD on AKS¶
Bootstrap with Azure Repos SSH:
# Generate SSH key for FluxCD (if not exists)
ssh-keygen -t rsa -b 4096 -f ~/.ssh/azure-devops-flux -N ""
# Add public key to Azure DevOps (manual step)
# Azure DevOps > User Settings > SSH Public Keys > Add Key
cat ~/.ssh/azure-devops-flux.pub
# Bootstrap FluxCD pointing to GitOps repository
flux bootstrap git \
--url=ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops \
--branch=production \
--path=clusters/production \
--private-key-file="$HOME/.ssh/azure-devops-flux" \
--author-name="Platform Team" \
--author-email="platform-team@connectsoft.example" \
--components-extra=image-reflector-controller,image-automation-controller
Bootstrap Output:
► connecting to ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
► cloning branch "production" from Git repository
► cloned branch "production" from Git repository
✔ components are healthy
✔ git repository "ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops" is ready
► generating sync manifests
✔ sync manifests pushed successfully
► applying sync manifests
✔ sync components are ready
✔ kubectl -n flux-system get gitrepository flux-system
✔ kubectl -n flux-system get kustomization flux-system
Verify Installation¶
Check FluxCD Components:
# Check all FluxCD pods are running
kubectl get pods -n flux-system
# Expected output:
# NAME READY STATUS RESTARTS AGE
# helm-controller-7d5c8b9f6d-abc12 1/1 Running 0 5m
# kustomize-controller-7d5c8b9f6d-def45 1/1 Running 0 5m
# notification-controller-7d5c8b9f6d-ghi78 1/1 Running 0 5m
# source-controller-7d5c8b9f6d-jkl90 1/1 Running 0 5m
# Check FluxCD CRDs are installed
kubectl get crds | grep fluxcd
# Expected output:
# gitrepositories.source.toolkit.fluxcd.io
# kustomizations.kustomize.toolkit.fluxcd.io
# helmreleases.helm.toolkit.fluxcd.io
# alerts.notification.toolkit.fluxcd.io
# receivers.notification.toolkit.fluxcd.io
# Verify FluxCD CLI can connect
flux check
# Expected output:
# ✔ flux 2.3.0
# ✔ Kubernetes 1.30.0 >= 1.25.0
# ✔ prerequisites are satisfied
# ✔ controllers are healthy
Azure Repos Integration¶
SSH Key Setup for Git Access¶
Generate SSH Key:
# Generate SSH key pair for FluxCD
ssh-keygen -t rsa -b 4096 -f ~/.ssh/azure-devops-flux -N "" -C "fluxcd@atp-production"
# Output files:
# ~/.ssh/azure-devops-flux (private key)
# ~/.ssh/azure-devops-flux.pub (public key)
Add Public Key to Azure DevOps:
# Display public key (copy this)
cat ~/.ssh/azure-devops-flux.pub
# Manual steps in Azure DevOps Portal:
# 1. Navigate to User Settings > SSH Public Keys
# 2. Click "New Key"
# 3. Paste public key
# 4. Add description: "FluxCD Production AKS Cluster"
# 5. Save
Create Kubernetes Secret:
# Create SSH key secret in flux-system namespace
kubectl create namespace flux-system --dry-run=client -o yaml | kubectl apply -f -
kubectl create secret generic azure-devops-ssh-key \
--namespace=flux-system \
--from-file=identity="$HOME/.ssh/azure-devops-flux" \
--from-literal=known_hosts="$(ssh-keyscan ssh.dev.azure.com 2>/dev/null | grep ssh-rsa)"
# Verify secret created
kubectl get secret azure-devops-ssh-key -n flux-system
Azure DevOps Personal Access Token (PAT)¶
Alternative to SSH Key:
# Create PAT in Azure DevOps Portal:
# 1. User Settings > Personal Access Tokens > New Token
# 2. Name: "FluxCD Production AKS"
# 3. Organization: All accessible organizations
# 4. Scopes: Code (Read)
# 5. Copy token
# Create PAT secret
kubectl create secret generic azure-devops-pat \
--namespace=flux-system \
--from-literal=username=git \
--from-literal=password=<AZURE_DEVOPS_PAT>
# Use PAT in GitRepository (HTTPS URL)
Azure AD Authentication (Workload Identity)¶
Recommended Approach (Zero-Trust):
# Create Azure AD Application for FluxCD
az ad app create --display-name "FluxCD-ATP-Production"
# Get application details
APP_ID=$(az ad app list \
--display-name "FluxCD-ATP-Production" \
--query "[0].appId" -o tsv)
# Create service principal
az ad sp create --id $APP_ID
# Get AKS OIDC issuer URL
OIDC_ISSUER=$(az aks show \
--resource-group ATP-Production-EUS-RG \
--name atp-prod-eus-aks \
--query "oidcIssuerProfile.issuerUrl" -o tsv)
# Create federated credential for Workload Identity
az ad app federated-credential create \
--id $APP_ID \
--parameters '{
"name": "fluxcd-atp-production",
"issuer": "'$OIDC_ISSUER'",
"subject": "system:serviceaccount:flux-system:fluxcd-source-controller",
"audiences": ["api://AzureADTokenExchange"]
}'
# Grant Azure DevOps access to application
# Azure DevOps > Project Settings > Service Connections > New Service Connection
# Type: Generic
# Authentication: Workload Identity federation
GitRepository with Workload Identity:
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops
namespace: flux-system
spec:
url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
interval: 1m
ref:
branch: production
provider: azure # Passwordless auth via Azure AD Workload Identity (no credential secret in Git or the cluster)
Bootstrap Configuration¶
Bootstrap FluxCD to Point to atp-gitops Repo¶
Complete Bootstrap Script:
#!/bin/bash
# bootstrap-fluxcd-production.sh — Bootstrap FluxCD on production AKS cluster
set -euo pipefail
RESOURCE_GROUP="ATP-Production-EUS-RG"
CLUSTER_NAME="atp-prod-eus-aks"
GIT_REPO_URL="ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops"
GIT_BRANCH="production"
GIT_PATH="clusters/production"
SSH_KEY_FILE="~/.ssh/azure-devops-flux"
echo "🚀 Bootstrapping FluxCD on production AKS cluster..."
# Get AKS credentials
az aks get-credentials \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--overwrite-existing
# Verify cluster access
kubectl cluster-info
# Bootstrap FluxCD
flux bootstrap git \
--url=$GIT_REPO_URL \
--branch=$GIT_BRANCH \
--path=$GIT_PATH \
--private-key-file=$SSH_KEY_FILE \
--author-name="Platform Team" \
--author-email="platform-team@connectsoft.example" \
--components=source-controller,kustomize-controller,helm-controller,notification-controller
echo "✅ FluxCD bootstrap complete!"
# Verify installation
echo "📋 Verifying FluxCD installation..."
flux check
# Check GitRepository
echo "📋 Checking GitRepository..."
kubectl get gitrepository flux-system -n flux-system
# Check Kustomization
echo "📋 Checking Kustomization..."
kubectl get kustomization flux-system -n flux-system
Configure GitRepository Resource¶
GitRepository Configuration:
# clusters/production/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops
namespace: flux-system
spec:
interval: 1m # Poll Git every 1 minute
url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
ref:
branch: production # Git branch to watch
secretRef:
name: azure-devops-ssh-key # SSH key secret
ignore: |
/*.md
!README.md
/docs/
timeout: 60s # Git clone timeout
suspend: false # Set to true to pause reconciliation
Configure Root Kustomization¶
Root Kustomization (Points to Infrastructure and Apps):
# clusters/production/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: flux-system
namespace: flux-system
spec:
interval: 5m # Reconcile every 5 minutes
path: ./ # Root path in Git repository
prune: true # Delete resources removed from Git
sourceRef:
kind: GitRepository
name: atp-gitops
namespace: flux-system
timeout: 10m # Reconciliation timeout
retryInterval: 2m # Retry on failure
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: source-controller
namespace: flux-system
- apiVersion: apps/v1
kind: Deployment
name: kustomize-controller
namespace: flux-system
suspend: false
Child Kustomizations (Per Application):
# clusters/production/kustomizations/infrastructure.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: infrastructure
namespace: flux-system
spec:
interval: 5m
path: ./infrastructure/overlays/production
prune: true
sourceRef:
kind: GitRepository
name: atp-gitops
dependsOn:
- name: flux-system # Wait for root Kustomization
suspend: false
---
# clusters/production/kustomizations/apps.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps
namespace: flux-system
spec:
interval: 5m
path: ./apps
prune: true
sourceRef:
kind: GitRepository
name: atp-gitops
dependsOn:
- name: infrastructure # Wait for infrastructure first
suspend: false
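Dependency ordering (flux-system → infrastructure → apps) can be confirmed once the Kustomizations are applied:
# Readiness per Kustomization; apps stays pending until infrastructure is Ready
flux get kustomizations
# Full tree of resources managed by the root Kustomization
flux tree kustomization flux-system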
Namespace and RBAC Setup¶
Namespace Creation (via GitOps):
# infrastructure/base/namespaces.yaml
apiVersion: v1
kind: Namespace
metadata:
name: flux-system
labels:
name: flux-system
managed-by: fluxcd
---
apiVersion: v1
kind: Namespace
metadata:
name: atp-production
labels:
name: atp-production
environment: production
managed-by: fluxcd
RBAC for FluxCD (Bootstrap creates automatically):
# FluxCD bootstrap automatically creates these:
# - ServiceAccount: kustomize-controller (flux-system namespace)
# - ClusterRole: cluster-admin (full cluster access)
# - ClusterRoleBinding: kustomize-controller (binds ServiceAccount to ClusterRole)
Multi-Cluster Setup¶
FluxCD Architecture for Dev, Test, Staging, Production¶
Multi-Cluster Topology:
graph TD
subgraph "Azure DevOps"
A[atp-gitops<br/>Repository]
A1[dev branch]
A2[test branch]
A3[staging branch]
A4[production branch]
end
subgraph "Dev Environment"
B1[AKS Dev Cluster]
B2[FluxCD<br/>flux-system]
B2 -->|git pull| A1
end
subgraph "Test Environment"
C1[AKS Test Cluster]
C2[FluxCD<br/>flux-system]
C2 -->|git pull| A2
end
subgraph "Staging Environment"
D1[AKS Staging Cluster]
D2[FluxCD<br/>flux-system]
D2 -->|git pull| A3
end
subgraph "Production Environment"
E1[AKS Production EUS]
E2[FluxCD<br/>flux-system]
E2 -->|git pull| A4
E3[AKS Production WUS]
E4[FluxCD<br/>flux-system]
E4 -->|git pull| A4
end
A --> A1
A --> A2
A --> A3
A --> A4
style E1 fill:#ffcccc
style E3 fill:#ffcccc
Cluster Configuration Matrix:
| Environment | Cluster Name | Git Branch | FluxCD Namespace | Reconcile Interval |
|---|---|---|---|---|
| Dev | `atp-dev-eus-aks` | `dev` | `flux-system` | 1 minute |
| Test | `atp-test-eus-aks` | `test` | `flux-system` | 2 minutes |
| Staging | `atp-staging-eus-aks` | `staging` | `flux-system` | 5 minutes |
| Production EUS | `atp-prod-eus-aks` | `production` | `flux-system` | 5 minutes |
| Production WUS | `atp-prod-wus-aks` | `production` | `flux-system` | 5 minutes |
Cluster-Specific Configurations¶
Per-Cluster GitRepository:
# clusters/production/gitrepository.yaml (Production EUS)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops
namespace: flux-system
spec:
url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
ref:
branch: production
secretRef:
name: azure-devops-ssh-key
# clusters/dev/gitrepository.yaml (Dev)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops
namespace: flux-system
spec:
url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
ref:
branch: dev # Dev branch
secretRef:
name: azure-devops-ssh-key
interval: 30s # Faster polling for dev
Cross-Cluster Orchestration¶
Hub-and-Spoke Management (Optional, for large-scale):
# FluxCD has no "Fleet" product; for centralized multi-cluster management,
# run Flux in a management cluster and point each Kustomization at a member
# cluster via spec.kubeConfig.secretRef (remote apply), or keep the
# cluster-per-bootstrap model shown above.
Regional Deployment Pattern:
# clusters/production-eus/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-eus
namespace: flux-system
spec:
path: ./apps/overlays/production-eus # EUS-specific overlay
sourceRef:
kind: GitRepository
name: atp-gitops
---
# clusters/production-wus/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-wus
namespace: flux-system
spec:
path: ./apps/overlays/production-wus # WUS-specific overlay
sourceRef:
kind: GitRepository
name: atp-gitops
Workload Identity Configuration¶
Azure AD Workload Identity for FluxCD¶
Create Azure AD Application:
# Create Azure AD application
az ad app create --display-name "FluxCD-ATP-Production"
APP_ID=$(az ad app list \
--display-name "FluxCD-ATP-Production" \
--query "[0].appId" -o tsv)
echo "Application ID: $APP_ID"
# Create service principal
SP_ID=$(az ad sp create --id $APP_ID --query id -o tsv)
# Grant permissions (example: Azure DevOps repository read access).
# Note: `az devops security permission update` takes a security namespace ID
# as --id, the principal as --subject, and a --token identifying the resource;
# list namespaces with `az devops security permission namespace list`.
az devops security permission update \
--id $NAMESPACE_ID \
--subject $SP_ID \
--token $REPO_TOKEN \
--allow-bit 1
Service Principal Setup¶
Configure Service Principal Permissions:
# Grant Azure DevOps repository read permission (same pattern as above)
az devops security permission update \
--id $NAMESPACE_ID \
--subject $SP_ID \
--token $AZURE_DEVOPS_TOKEN \
--allow-bit 1
# Grant Azure Container Registry pull permission
az role assignment create \
--assignee $APP_ID \
--role AcrPull \
--scope /subscriptions/.../resourceGroups/.../providers/Microsoft.ContainerRegistry/registries/atp-prod-acr
Federated Credentials Configuration¶
Configure Federated Credential:
# Get AKS OIDC issuer URL
OIDC_ISSUER=$(az aks show \
--resource-group ATP-Production-EUS-RG \
--name atp-prod-eus-aks \
--query "oidcIssuerProfile.issuerUrl" -o tsv)
# Create federated credential for Source Controller
az ad app federated-credential create \
--id $APP_ID \
--parameters '{
"name": "fluxcd-source-controller",
"issuer": "'$OIDC_ISSUER'",
"subject": "system:serviceaccount:flux-system:source-controller",
"audiences": ["api://AzureADTokenExchange"]
}'
# Create federated credential for Kustomize Controller
az ad app federated-credential create \
--id $APP_ID \
--parameters '{
"name": "fluxcd-kustomize-controller",
"issuer": "'$OIDC_ISSUER'",
"subject": "system:serviceaccount:flux-system:kustomize-controller",
"audiences": ["api://AzureADTokenExchange"]
}'
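Verify Federated Credentials:
# Both federated credentials should be listed on the application
az ad app federated-credential list --id $APP_ID -o table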
ServiceAccount Configuration¶
Annotate ServiceAccounts:
# FluxCD ServiceAccount with Workload Identity
apiVersion: v1
kind: ServiceAccount
metadata:
name: source-controller
namespace: flux-system
annotations:
azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789abc"
azure.workload.identity/tenant-id: "87654321-4321-4321-4321-987654321abc"
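Note: annotating the ServiceAccount alone is not enough; the Workload Identity mutating webhook only injects the projected token into pods that carry the azure.workload.identity/use label:
# Pod template label required for token injection (add to the workload's pod template)
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"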
RBAC for Azure Resources:
# Grant Key Vault Secrets User role
az role assignment create \
--assignee $APP_ID \
--role "Key Vault Secrets User" \
--scope /subscriptions/.../resourceGroups/.../providers/Microsoft.KeyVault/vaults/atp-prod-kv
FluxCD Version Management¶
Upgrade Procedures¶
Check Current Version:
# Check installed FluxCD version
flux version
# Output:
# flux: v2.3.0
# Check a controller's image version (controllers version independently of the CLI)
kubectl get deployment source-controller -n flux-system -o jsonpath='{.spec.template.spec.containers[0].image}'
# Output:
# ghcr.io/fluxcd/source-controller:v1.3.0 (the release bundled with Flux v2.3.0)
Upgrade FluxCD:
# Upgrade FluxCD components in place (the flux CLI has no `upgrade` subcommand;
# re-running `flux install` or `flux bootstrap` applies the newer manifests)
flux install
# Or export a specific version for review before applying
flux install \
--version=v2.4.0 \
--namespace=flux-system \
--components=source-controller,kustomize-controller,helm-controller,notification-controller \
--export > flux-install-v2.4.0.yaml
kubectl apply -f flux-install-v2.4.0.yaml
# Wait for upgrade to complete
kubectl wait --for=condition=ready pod \
--all \
--namespace flux-system \
--timeout=300s
Rollback Strategies¶
Rollback to Previous Version:
# Identify previous version
PREVIOUS_VERSION="v2.2.0"
# Apply previous version manifests
flux install \
--version=$PREVIOUS_VERSION \
--namespace=flux-system \
--components=source-controller,kustomize-controller,helm-controller,notification-controller \
--export | kubectl apply -f -
# Verify rollback
flux version
kubectl get pods -n flux-system
Version Compatibility Matrix¶
| FluxCD Version | Kubernetes Minimum | Kubernetes Recommended | Breaking Changes |
|---|---|---|---|
| 2.4.0 | 1.25+ | 1.28+ | None (from 2.3.x) |
| 2.3.0 | 1.25+ | 1.28+ | None (from 2.2.x) |
| 2.2.0 | 1.24+ | 1.27+ | API v1beta1 deprecated |
| 2.1.0 | 1.24+ | 1.27+ | None (from 2.0.x) |
Release Notes and Breaking Changes¶
Monitor FluxCD Releases:
# Verify installed components, including the optional image automation controllers
flux check --components-extra=image-reflector-controller,image-automation-controller
# Review release notes
# https://github.com/fluxcd/flux2/releases
Azure Monitor Integration¶
FluxCD Metrics Export to Prometheus¶
Enable Prometheus Metrics:
# FluxCD controllers expose Prometheus metrics on port 8080 (port name: http-prom).
# Flux ships no Services for its controllers, so scrape the pods directly
# with a PodMonitor for the Prometheus Operator:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flux-system
  namespace: flux-system
spec:
  namespaceSelector:
    matchNames:
      - flux-system
  selector:
    matchExpressions:
      - key: app
        operator: In
        values:
          - source-controller
          - kustomize-controller
          - helm-controller
          - notification-controller
  podMetricsEndpoints:
    - port: http-prom
      interval: 30s
      path: /metrics
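The endpoint can be spot-checked before wiring up dashboards (port 8080 is the controllers' metrics port):
# Port-forward and sample the metrics
kubectl -n flux-system port-forward deploy/source-controller 8080:8080 &
curl -s http://localhost:8080/metrics | grep controller_runtime_reconcile_total | head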
Log Forwarding to Log Analytics¶
Configure Container Insights:
# Enable Azure Monitor Container Insights (if not already enabled)
az aks enable-addons \
--resource-group ATP-Production-EUS-RG \
--name atp-prod-eus-aks \
--addons monitoring \
--workspace-resource-id /subscriptions/.../resourceGroups/.../providers/Microsoft.OperationalInsights/workspaces/atp-prod-eus-logs
KQL Query for FluxCD Logs:
// Azure Monitor Log Analytics: Query FluxCD logs (ContainerLogV2 schema)
ContainerLogV2
| where PodNamespace == "flux-system"
| where PodName contains "source-controller" or PodName contains "kustomize-controller"
| where LogMessage contains "error" or LogMessage contains "warning"
| project TimeGenerated, PodName, LogMessage
| order by TimeGenerated desc
Custom Metrics and Alerts¶
Custom Metrics Dashboard:
# Grafana dashboard for FluxCD metrics (Flux exposes standard controller-runtime
# metric names; source-controller fetch time is included in its reconcile duration)
dashboard:
  title: "FluxCD Reconciliation Metrics"
  panels:
    - title: "Reconciliation Duration (p95)"
      query: "histogram_quantile(0.95, sum(rate(controller_runtime_reconcile_time_seconds_bucket[5m])) by (le, controller))"
    - title: "Reconciliation Error Rate"
      query: "sum(rate(controller_runtime_reconcile_errors_total[5m])) by (controller)"
    - title: "Reconciliation Success Rate"
      query: "sum(rate(controller_runtime_reconcile_total{result=\"success\"}[5m])) / sum(rate(controller_runtime_reconcile_total[5m]))"
Azure Monitor Alert:
# Azure Monitor Alert Rule for FluxCD failures
apiVersion: v1
kind: ConfigMap
metadata:
name: fluxcd-alert
namespace: flux-system
data:
alert-rule.yaml: |
alert:
name: FluxCD Reconciliation Failure
condition: |
  sum(increase(controller_runtime_reconcile_errors_total{namespace="flux-system"}[5m])) > 0
severity: warning
actionGroups:
- /subscriptions/.../resourceGroups/.../providers/microsoft.insights/actionGroups/fluxcd-alerts
Dashboard Setup in Azure Monitor¶
Create FluxCD Dashboard:
# Export FluxCD metrics to Azure Monitor
# Metrics available via Prometheus scraping or Container Insights
# Key controller-runtime metrics to monitor (exposed by all Flux controllers):
# - controller_runtime_reconcile_total (labelled by result: success, error, requeue)
# - controller_runtime_reconcile_errors_total
# - controller_runtime_reconcile_time_seconds (histogram)
# - workqueue_depth / workqueue_queue_duration_seconds (reconcile backlog)
Dashboard JSON (Azure Monitor):
{
"dashboard": {
"name": "FluxCD Reconciliation Dashboard",
"widgets": [
{
"type": "metric",
"properties": {
"metrics": [
{
"namespace": "Microsoft.ContainerService/managedClusters",
"name": "fluxcd_kustomize_reconciliation_duration_seconds",
"aggregation": "Average"
}
],
"title": "Reconciliation Duration"
}
}
]
}
}
Summary: FluxCD Installation & Configuration¶
- FluxCD Architecture: Modular controllers (source, kustomize, helm, notification) with reconciliation loop
- AKS Prerequisites: Kubernetes 1.28+, OIDC issuer, Workload Identity, RBAC enabled
- Installation: Multiple methods (Azure CLI, kubectl, Helm), bootstrap to GitOps repository
- Azure Repos Integration: SSH keys, PAT, or Workload Identity authentication
- Multi-Cluster Setup: Branch-per-environment, cluster-specific configurations, regional deployment patterns
- Workload Identity: Azure AD federated credentials for zero-trust authentication
- Version Management: Upgrade procedures, rollback strategies, compatibility matrix
- Azure Monitor Integration: Prometheus metrics, Log Analytics forwarding, custom alerts and dashboards
Declarative Manifest Management¶
Purpose: Define how ATP microservices are declared, organized, and managed using Kubernetes manifests, Helm charts, and Kustomize overlays, ensuring consistency, reusability, and environment-specific customization across all deployment environments.
Base Manifest Structure¶
ATP microservices use declarative Kubernetes manifests stored in Git as the single source of truth. This section covers the standard resource types and structures used for all ATP services.
Kubernetes Resource Types for ATP Services¶
Required Resources per Service:
| Resource Type | Purpose | Example Name | Required? |
|---|---|---|---|
| Deployment | Manages pod replicas | `atp-ingestion` | ✅ Yes |
| Service | Exposes pods via network | `atp-ingestion` | ✅ Yes |
| ConfigMap | Non-sensitive configuration | `atp-ingestion-config` | ✅ Yes |
| Secret | Sensitive data (references) | `atp-ingestion-secrets` | ⚠️ Via External Secrets |
| Ingress | External HTTP/gRPC access | `atp-ingestion-ingress` | ⚠️ If exposed externally |
| ServiceAccount | Pod identity and RBAC | `atp-ingestion-sa` | ✅ Yes |
| HorizontalPodAutoscaler | Auto-scaling | `atp-ingestion-hpa` | ⚠️ Production only |
| NetworkPolicy | Traffic isolation | `atp-ingestion-network-policy` | ✅ Yes |
| PodDisruptionBudget | High availability | `atp-ingestion-pdb` | ⚠️ Production only |
Deployment Manifest Structure¶
Complete Deployment Example (ATP Ingestion Service):
# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
namespace: atp-production
labels:
app: atp-ingestion
component: ingestion
tier: backend
version: v1.2.3
managed-by: fluxcd
compliance: soc2-gdpr-hipaa
spec:
replicas: 3
revisionHistoryLimit: 10
selector:
matchLabels:
app: atp-ingestion
template:
metadata:
labels:
app: atp-ingestion
component: ingestion
tier: backend
version: v1.2.3
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
checksum/config: "abc123def456" # Trigger restart on ConfigMap change
checksum/secret: "def456ghi789" # Trigger restart on Secret change
spec:
serviceAccountName: atp-ingestion-sa
# Pod Security Standards (Restricted)
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 2000
seccompProfile:
type: RuntimeDefault
containers:
- name: ingestion
image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
imagePullPolicy: IfNotPresent
# Container Security Context
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 1000
capabilities:
drop: [ALL]
# Resource Requests and Limits
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
# Environment Variables
env:
- name: ASPNETCORE_ENVIRONMENT
value: Production
- name: ASPNETCORE_URLS
value: "http://+:8080"
- name: OpenTelemetry__ServiceName
value: atp-ingestion
- name: DOTNET_RUNNING_IN_CONTAINER
value: "true"
# Environment Variables from ConfigMap
envFrom:
- configMapRef:
name: atp-ingestion-config
- secretRef:
name: atp-ingestion-secrets
# Ports
ports:
- name: http
containerPort: 8080
protocol: TCP
- name: metrics
containerPort: 9090
protocol: TCP
- name: grpc
containerPort: 50051
protocol: TCP
# Health Checks
livenessProbe:
httpGet:
path: /health/live
port: 8080
scheme: HTTP
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 8080
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
successThreshold: 1
failureThreshold: 3
startupProbe:
httpGet:
path: /health/startup
port: 8080
scheme: HTTP
initialDelaySeconds: 0
periodSeconds: 5
timeoutSeconds: 3
successThreshold: 1
failureThreshold: 30 # Allow up to 150 seconds for startup
# Volume Mounts (read-only root filesystem requires writable volumes)
volumeMounts:
- name: tmp
mountPath: /tmp
- name: cache
mountPath: /app/cache
- name: logs
mountPath: /app/logs
# Volumes
volumes:
- name: tmp
emptyDir: {}
- name: cache
emptyDir: {}
- name: logs
emptyDir: {}
# Image Pull Secrets (for ACR authentication)
imagePullSecrets:
- name: acr-credentials
# Termination Grace Period
terminationGracePeriodSeconds: 30
# DNS Policy
dnsPolicy: ClusterFirst
# Restart Policy
restartPolicy: Always
Service Manifest Structure¶
Service Example (ATP Ingestion Service):
# apps/atp-ingestion/base/service.yaml
apiVersion: v1
kind: Service
metadata:
name: atp-ingestion
namespace: atp-production
labels:
app: atp-ingestion
component: ingestion
managed-by: fluxcd
spec:
type: ClusterIP # Internal service (use LoadBalancer for external)
ports:
- name: http
port: 80
targetPort: 8080
protocol: TCP
- name: metrics
port: 9090
targetPort: 9090
protocol: TCP
- name: grpc
port: 50051
targetPort: 50051
protocol: TCP
selector:
app: atp-ingestion
sessionAffinity: None # sessionAffinityConfig is only valid with ClientIP affinity, so none is set
ConfigMap and Secret References¶
ConfigMap Example:
# apps/atp-ingestion/base/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: atp-ingestion-config
namespace: atp-production
labels:
app: atp-ingestion
managed-by: fluxcd
data:
# Application Settings
ASPNETCORE_ENVIRONMENT: "Production"
ASPNETCORE_URLS: "http://+:8080"
# OpenTelemetry Configuration
OpenTelemetry__ServiceName: "atp-ingestion"
OpenTelemetry__SamplingRatio: "0.1"
OpenTelemetry__Exporters__Otlp__Endpoint: "http://otel-collector.observability:4317"
# Audit Trail Configuration
Audit__EnableImmutability: "true"
Audit__RetentionDays: "2555"
Audit__EnableTamperEvidence: "true"
# Feature Flags
Features__EnableAdvancedQuery: "true"
Features__EnableRealTimeAlerts: "true"
# Performance Settings
Performance__MaxConcurrentRequests: "1000"
Performance__RequestTimeoutSeconds: "30"
Secret Reference (External Secrets Operator):
# apps/atp-ingestion/base/externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: atp-ingestion-secrets
namespace: atp-production
spec:
refreshInterval: 1h
secretStoreRef:
name: azure-keyvault-production
kind: ClusterSecretStore
target:
name: atp-ingestion-secrets
creationPolicy: Owner
data:
- secretKey: ConnectionStrings__Database
remoteRef:
key: atp-sql-connection-string-prod
- secretKey: ConnectionStrings__Redis
remoteRef:
key: atp-redis-connection-string-prod
- secretKey: ConnectionStrings__RabbitMQ
remoteRef:
key: atp-rabbitmq-connection-string-prod
- secretKey: ApiKeys__IngestionApiKey
remoteRef:
key: atp-ingestion-api-key-prod
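The referenced ClusterSecretStore is defined once per cluster; a minimal sketch assuming Workload Identity and an External Secrets Operator ServiceAccount named external-secrets-sa (vault name as used elsewhere in this document):
# infrastructure/base/clustersecretstore.yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: azure-keyvault-production
spec:
  provider:
    azurekv:
      authType: WorkloadIdentity
      vaultUrl: "https://atp-prod-kv.vault.azure.net"
      serviceAccountRef:
        name: external-secrets-sa # assumed; must carry a federated credential
        namespace: external-secrets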
Ingress Configuration¶
Ingress Example (External Access):
# apps/atp-ingestion/base/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: atp-ingestion-ingress
namespace: atp-production
labels:
app: atp-ingestion
managed-by: fluxcd
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
nginx.ingress.kubernetes.io/backend-protocol: "GRPC" # routes this host's traffic to the backend as gRPC
nginx.ingress.kubernetes.io/limit-rps: "1000" # requests per second per client IP
spec:
ingressClassName: nginx
tls:
- hosts:
- ingestion.atp.connectsoft.example
secretName: atp-ingestion-tls
rules:
- host: ingestion.atp.connectsoft.example
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: atp-ingestion
port:
number: 80
ServiceAccount and RBAC¶
ServiceAccount Example:
# apps/atp-ingestion/base/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: atp-ingestion-sa
namespace: atp-production
labels:
app: atp-ingestion
managed-by: fluxcd
annotations:
azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789abc"
azure.workload.identity/tenant-id: "87654321-4321-4321-4321-987654321abc"
automountServiceAccountToken: true
RBAC Role and RoleBinding:
# apps/atp-ingestion/base/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: atp-ingestion-role
namespace: atp-production
rules:
- apiGroups: [""]
resources: ["configmaps", "secrets"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: atp-ingestion-rolebinding
namespace: atp-production
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: atp-ingestion-role
subjects:
- kind: ServiceAccount
name: atp-ingestion-sa
namespace: atp-production
Helm Charts for ATP Microservices¶
Helm charts provide parameterized, reusable templates for ATP microservices, enabling environment-specific customization via values files.
Chart Structure: Chart.yaml, values.yaml, templates/¶
Directory Structure:
apps/atp-ingestion/helm/
├── Chart.yaml # Chart metadata
├── values.yaml # Default values
├── values-dev.yaml # Dev environment overrides
├── values-test.yaml # Test environment overrides
├── values-staging.yaml # Staging environment overrides
├── values-production.yaml # Production environment overrides
├── templates/
│ ├── deployment.yaml # Deployment template
│ ├── service.yaml # Service template
│ ├── configmap.yaml # ConfigMap template
│ ├── ingress.yaml # Ingress template (optional)
│ ├── serviceaccount.yaml # ServiceAccount template
│ ├── rbac.yaml # RBAC template
│ ├── hpa.yaml # HPA template (conditional)
│ ├── networkpolicy.yaml # NetworkPolicy template
│ └── _helpers.tpl # Template helpers
└── charts/ # Chart dependencies (optional)
Chart.yaml:
# apps/atp-ingestion/helm/Chart.yaml
apiVersion: v2
name: atp-ingestion
description: ATP Ingestion Service - Receives audit records via HTTP/gRPC
version: 1.2.3 # Chart version (SemVer)
appVersion: 1.2.3 # Application version
type: application
keywords:
- audit-trail
- ingestion
- microservice
- connectsoft
maintainers:
- name: ConnectSoft Platform Team
email: platform-team@connectsoft.example
dependencies:
- name: redis
version: 17.x.x
repository: https://charts.bitnami.com/bitnami
condition: redis.enabled
tags:
- atp-ingestion-redis
home: https://connectsoft.example/atp
sources:
- https://dev.azure.com/ConnectSoft/ATP/_git/atp-ingestion
annotations:
category: Backend
licenses: Proprietary
values.yaml (Default):
# apps/atp-ingestion/helm/values.yaml
# Default values for atp-ingestion chart
# Replica configuration
replicaCount: 3
# Image configuration
image:
repository: connectsoft.azurecr.io/atp/ingestion
pullPolicy: IfNotPresent
tag: "" # Overridden by CI pipeline or .Values.appVersion
# Image pull secrets
imagePullSecrets:
- name: acr-credentials
# Service account configuration
serviceAccount:
create: true
annotations:
azure.workload.identity/client-id: ""
name: atp-ingestion-sa
automountServiceAccountToken: true
# Pod annotations
podAnnotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
# Pod security context
podSecurityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 2000
seccompProfile:
type: RuntimeDefault
# Container security context
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 1000
capabilities:
drop: [ALL]
# Service configuration
service:
type: ClusterIP
port: 80
targetPort: 8080
metricsPort: 9090
grpcPort: 50051
annotations: {}
# Ingress configuration
ingress:
enabled: false
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: ingestion.atp.connectsoft.example
paths:
- path: /
pathType: Prefix
tls:
- secretName: atp-ingestion-tls
hosts:
- ingestion.atp.connectsoft.example
# Resource requests and limits
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
# Autoscaling configuration
autoscaling:
enabled: false
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
# Health checks
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
startupProbe:
httpGet:
path: /health/startup
port: 8080
initialDelaySeconds: 0
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
# Environment variables
env:
ASPNETCORE_ENVIRONMENT: Production
OpenTelemetry__ServiceName: atp-ingestion
# External Secrets configuration
externalSecrets:
enabled: true
secretStore: azure-keyvault
secrets:
- name: ConnectionStrings__Database
key: sql-connection-string
- name: ConnectionStrings__Redis
key: redis-connection-string
- name: ConnectionStrings__RabbitMQ
key: rabbitmq-connection-string
# Network policy
networkPolicy:
enabled: true
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: atp-production
- podSelector:
matchLabels:
app: atp-gateway
egress:
- to:
- namespaceSelector:
matchLabels:
name: kube-system
- namespaceSelector:
matchLabels:
name: flux-system
- namespaceSelector:
matchLabels:
name: observability
# Pod Disruption Budget
podDisruptionBudget:
enabled: false
minAvailable: 2
# Redis sub-chart (optional dependency)
redis:
enabled: false # Use Azure Cache for Redis instead
Template Best Practices¶
Helm Template Example (Deployment):
# apps/atp-ingestion/helm/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "atp-ingestion.fullname" . }}
namespace: {{ .Values.namespace | default .Release.Namespace }}
labels:
{{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
replicas: {{ .Values.replicaCount }}
revisionHistoryLimit: {{ .Values.revisionHistoryLimit | default 10 }}
selector:
matchLabels:
{{- include "atp-ingestion.selectorLabels" . | nindent 6 }}
template:
metadata:
annotations:
{{- with .Values.podAnnotations }}
{{- toYaml . | nindent 8 }}
{{- end }}
{{- if .Values.configMap }}
checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
{{- end }}
{{- if .Values.secrets }}
checksum/secret: {{ include (print $.Template.BasePath "/externalsecret.yaml") . | sha256sum }}
{{- end }}
labels:
{{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
spec:
serviceAccountName: {{ include "atp-ingestion.serviceAccountName" . }}
securityContext:
{{- toYaml .Values.podSecurityContext | nindent 8 }}
containers:
- name: {{ .Chart.Name }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
securityContext:
{{- toYaml .Values.securityContext | nindent 12 }}
resources:
{{- toYaml .Values.resources | nindent 12 }}
env:
{{- range $key, $value := .Values.env }}
- name: {{ $key }}
value: {{ $value | quote }}
{{- end }}
envFrom:
- configMapRef:
name: {{ include "atp-ingestion.fullname" . }}-config
- secretRef:
name: {{ include "atp-ingestion.fullname" . }}-secrets
ports:
- name: http
containerPort: {{ .Values.service.targetPort }}
protocol: TCP
- name: metrics
containerPort: {{ .Values.service.metricsPort }}
protocol: TCP
- name: grpc
containerPort: {{ .Values.service.grpcPort }}
protocol: TCP
livenessProbe:
{{- toYaml .Values.livenessProbe | nindent 12 }}
readinessProbe:
{{- toYaml .Values.readinessProbe | nindent 12 }}
{{- if .Values.startupProbe }}
startupProbe:
{{- toYaml .Values.startupProbe | nindent 12 }}
{{- end }}
volumeMounts:
- name: tmp
mountPath: /tmp
- name: cache
mountPath: /app/cache
- name: logs
mountPath: /app/logs
volumes:
- name: tmp
emptyDir: {}
- name: cache
emptyDir: {}
- name: logs
emptyDir: {}
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 8 }}
{{- end }}
Template Helpers (_helpers.tpl):
# apps/atp-ingestion/helm/templates/_helpers.tpl
{{/*
Expand the name of the chart.
*/}}
{{- define "atp-ingestion.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Create a default fully qualified app name.
*/}}
{{- define "atp-ingestion.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}
{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "atp-ingestion.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Common labels
*/}}
{{- define "atp-ingestion.labels" -}}
helm.sh/chart: {{ include "atp-ingestion.chart" . }}
{{ include "atp-ingestion.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
managed-by: fluxcd
{{- end }}
{{/*
Selector labels
*/}}
{{- define "atp-ingestion.selectorLabels" -}}
app.kubernetes.io/name: {{ include "atp-ingestion.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
{{/*
Create the name of the service account to use
*/}}
{{- define "atp-ingestion.serviceAccountName" -}}
{{- if .Values.serviceAccount.create }}
{{- default (include "atp-ingestion.fullname" .) .Values.serviceAccount.name }}
{{- else }}
{{- default "default" .Values.serviceAccount.name }}
{{- end }}
{{- end }}
Values File Organization¶
values-production.yaml (Production Overrides):
# apps/atp-ingestion/helm/values-production.yaml
replicaCount: 5 # Production: 5 replicas
image:
tag: "v1.2.3-abc123d" # Immutable tag with commit SHA
resources:
requests:
cpu: 1000m # Production: 1 CPU core
memory: 1Gi # Production: 1 GB RAM
limits:
cpu: 2000m # Production: 2 CPU cores
memory: 2Gi # Production: 2 GB RAM
autoscaling:
enabled: true
minReplicas: 5
maxReplicas: 20
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
ingress:
enabled: true
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/limit-rps: "1000"
env:
ASPNETCORE_ENVIRONMENT: Production
OpenTelemetry__SamplingRatio: "0.1" # Production: 10% sampling
podDisruptionBudget:
enabled: true
minAvailable: 3 # Ensure at least 3 replicas available during updates
values-dev.yaml (Development Overrides):
# apps/atp-ingestion/helm/values-dev.yaml
replicaCount: 1 # Dev: 1 replica
image:
tag: "latest" # Dev: mutable latest tag
resources:
requests:
cpu: 100m # Dev: minimal resources
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
autoscaling:
enabled: false # Dev: no autoscaling
ingress:
enabled: false # Dev: no external ingress
env:
ASPNETCORE_ENVIRONMENT: Development
OpenTelemetry__SamplingRatio: "1.0" # Dev: 100% sampling (full traces)
Chart Dependencies¶
Managing Dependencies:
# Update dependencies
helm dependency update apps/atp-ingestion/helm/
# Build chart with dependencies
helm package apps/atp-ingestion/helm/
# Output: atp-ingestion-1.2.3.tgz
Versioning and Publishing to ACR¶
Publish Helm Chart to ACR:
# Login to ACR
az acr login --name connectsoft
# Package the chart (helm push requires a packaged .tgz, not a directory)
helm package apps/atp-ingestion/helm/
# Push Helm chart to ACR
helm push atp-ingestion-1.2.3.tgz oci://connectsoft.azurecr.io/helm
# Chart available at:
# oci://connectsoft.azurecr.io/helm/atp-ingestion:1.2.3
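Verify the Published Chart:
# Inspect chart metadata straight from the registry (Helm 3.8+)
helm show chart oci://connectsoft.azurecr.io/helm/atp-ingestion --version 1.2.3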
Helm Hooks for Migrations¶
Helm Hook Example (Database Migration):
# apps/atp-ingestion/helm/templates/migration-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "atp-ingestion.fullname" . }}-migration
namespace: {{ .Values.namespace | default .Release.Namespace }}
annotations:
"helm.sh/hook": pre-upgrade,pre-install
"helm.sh/hook-weight": "-5"
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
template:
spec:
serviceAccountName: {{ include "atp-ingestion.serviceAccountName" . }}
restartPolicy: Never
containers:
- name: migration
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
command: ["dotnet", "run", "--project", "Migrate"]
env:
- name: ConnectionStrings__Database
valueFrom:
secretKeyRef:
name: {{ include "atp-ingestion.fullname" . }}-secrets
key: ConnectionStrings__Database
Kustomize for Environment Overlays¶
Kustomize enables environment-specific customization of base manifests without duplicating code.
Base + Overlay Pattern¶
Directory Structure:
apps/atp-ingestion/
├── base/ # Base manifests (reusable)
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── configmap.yaml
│ ├── serviceaccount.yaml
│ ├── rbac.yaml
│ └── kustomization.yaml
│
└── overlays/ # Environment-specific overlays
├── dev/
│ ├── kustomization.yaml
│ ├── deployment-patch.yaml
│ └── configmap-patch.yaml
├── test/
│ ├── kustomization.yaml
│ └── deployment-patch.yaml
├── staging/
│ ├── kustomization.yaml
│ ├── deployment-patch.yaml
│ └── hpa-patch.yaml
└── production/
├── kustomization.yaml
├── deployment-patch.yaml
├── hpa-patch.yaml
├── configmap-patch.yaml
└── networkpolicy-patch.yaml
Base Kustomization:
# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: atp-production
resources:
- deployment.yaml
- service.yaml
- configmap.yaml
- serviceaccount.yaml
- rbac.yaml
commonLabels:
app: atp-ingestion
component: ingestion
tier: backend
managed-by: fluxcd
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: v1.2.3-abc123d # Updated by CI pipeline
Strategic Merge Patches¶
Deployment Patch (Production):
# apps/atp-ingestion/overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 5 # Production: 5 replicas (base has 3)
template:
spec:
containers:
- name: ingestion
resources:
requests:
cpu: 1000m # Production: 1 CPU (base: 500m)
memory: 1Gi # Production: 1 GB (base: 512Mi)
limits:
cpu: 2000m # Production: 2 CPU (base: 1000m)
memory: 2Gi # Production: 2 GB (base: 1Gi)
env:
- name: ASPNETCORE_ENVIRONMENT
value: Production
- name: OpenTelemetry__SamplingRatio
value: "0.1" # Production: 10% sampling (dev: 100%)
JSON Patches¶
JSON Patch Example (Add Annotation):
# apps/atp-ingestion/overlays/production/json-patch.yaml
- op: add
path: /metadata/annotations/azure.connectsoft.com~1cost-center
value: atp-production
- op: replace
path: /spec/replicas
value: 5
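A JSON patch file only takes effect once referenced from the overlay's kustomization.yaml with a target selector; a sketch:
# apps/atp-ingestion/overlays/production/kustomization.yaml (excerpt)
patches:
  - path: json-patch.yaml
    target:
      group: apps
      version: v1
      kind: Deployment
      name: atp-ingestion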
ConfigMap and Secret Generators¶
ConfigMap Generator:
# apps/atp-ingestion/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
configMapGenerator:
- name: atp-ingestion-config
behavior: merge # Merge with base ConfigMap
literals:
- ASPNETCORE_ENVIRONMENT=Production
- OpenTelemetry__SamplingRatio=0.1
- Audit__EnableImmutability=true
- Audit__RetentionDays=2555
options:
labels:
environment: production
Secret Generator (values are base64-encoded by Kustomize):
# apps/atp-ingestion/overlays/production/kustomization.yaml
secretGenerator:
  - name: atp-ingestion-secrets
    behavior: merge
    type: Opaque
    literals:
      # Literals are plain text (no shell expansion) and are encoded automatically.
      # Avoid committing real secrets to Git — prefer External Secrets (above).
      - ApiKeys__IngestionApiKey=placeholder-value
Built-In Transformers¶
Kustomize replica and image transformers (declarative, no patch files needed):
# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: atp-production
resources:
- deployment.yaml
replicas:
- name: atp-ingestion
count: 3
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: v1.2.3-abc123d
Configuration Layering Strategy¶
Configuration Precedence Hierarchy:
graph TD
A[Base Configuration<br/>apps/atp-ingestion/base] -->|applies to| B[All Environments]
B --> C[Dev Overlay<br/>overlays/dev]
B --> D[Test Overlay<br/>overlays/test]
B --> E[Staging Overlay<br/>overlays/staging]
B --> F[Production Overlay<br/>overlays/production]
C -->|customizes| G[Dev Cluster]
D -->|customizes| H[Test Cluster]
E -->|customizes| I[Staging Cluster]
F -->|customizes| J[Production Cluster]
style A fill:#FFE5B4
style C fill:#90EE90
style D fill:#90EE90
style E fill:#FFE5B4
style F fill:#ffcccc
Configuration Layers:
| Layer | Location | Purpose | Examples |
|---|---|---|---|
| Base | `apps/{service}/base/` | Common configuration for all environments | Deployment structure, service ports, basic labels |
| Dev Overlay | `apps/{service}/overlays/dev/` | Development-specific customization | 1 replica, minimal resources, 100% sampling |
| Test Overlay | `apps/{service}/overlays/test/` | Test environment customization | 2 replicas, medium resources, 50% sampling |
| Staging Overlay | `apps/{service}/overlays/staging/` | Pre-production environment | 3 replicas, production-like resources, 10% sampling |
| Production Overlay | `apps/{service}/overlays/production/` | Production environment | 5+ replicas, full resources, 10% sampling, HPA |
Hierarchical Configuration Precedence¶
Precedence Order (highest to lowest):
- Overlay patches (environment-specific)
- Overlay ConfigMap generators (environment-specific)
- Base configuration (common defaults)
Example:
# Base ConfigMap
data:
ASPNETCORE_ENVIRONMENT: "Development" # Default
# Production Overlay ConfigMap Generator (merges)
configMapGenerator:
- name: atp-ingestion-config
behavior: merge
literals:
- ASPNETCORE_ENVIRONMENT=Production # Overrides base
# Result: Production uses "Production", other environments use "Development"
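Precedence can be confirmed by rendering the overlay:
# Render the production overlay and check the merged value
kustomize build apps/atp-ingestion/overlays/production/ | grep ASPNETCORE_ENVIRONMENT
# ASPNETCORE_ENVIRONMENT: Production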
Image Reference Patterns¶
ACR Image Path Conventions¶
Image Path Format:
{registry}/{project}/{service}:{tag}
Examples:
connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
connectsoft.azurecr.io/atp/query:v1.3.0-def456e
connectsoft.azurecr.io/atp/integrity:v1.1.5-ghi789f
Service Image Mapping:
| Service | ACR Path |
|---|---|
| atp-ingestion | connectsoft.azurecr.io/atp/ingestion |
| atp-query | connectsoft.azurecr.io/atp/query |
| atp-integrity | connectsoft.azurecr.io/atp/integrity |
| atp-export | connectsoft.azurecr.io/atp/export |
| atp-policy | connectsoft.azurecr.io/atp/policy |
| atp-search | connectsoft.azurecr.io/atp/search |
| atp-gateway | connectsoft.azurecr.io/atp/gateway |
Semantic Versioning in Image Tags¶
Tag Format:
v{VERSION}-{COMMIT_SHA}
Examples:
v1.2.3-abc123d # Semantic version + 7-char commit SHA
v1.2.4-hotfix1-def456e # Hotfix version + commit SHA
Tag Rules:
| Tag Pattern | Mutable? | Use Case | Example |
|---|---|---|---|
| `v{VERSION}-{SHA}` | ❌ Immutable | Production releases | `v1.2.3-abc123d` |
| `v{VERSION}` | ❌ Immutable | Production releases (without SHA) | `v1.2.3` |
| `latest` | ✅ Mutable | Development only | `latest` |
| `{BRANCH}` | ✅ Mutable | Feature branches (slashes replaced with dashes; `/` is not valid in image tags) | `feature-grpc-ingestion` |
Commit SHA for Traceability¶
Image Tagging in Azure Pipelines:
# Azure Pipelines: Tag image with version + commit SHA
- task: Docker@2
displayName: 'Build and push Docker image'
inputs:
containerRegistry: 'ConnectSoft-ACR'
repository: 'atp/ingestion'
command: 'buildAndPush'
Dockerfile: 'src/ConnectSoft.ATP.Ingestion/Dockerfile'
tags: |
$(Build.BuildNumber) # v1.2.3
$(Build.BuildNumber)-$(shortSha) # v1.2.3-abc123d (shortSha derived in the step shown below)
latest # Latest (dev only)
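Build.SourceVersion is the full 40-character commit SHA, so a 7-character tag needs a prior script step; a sketch using an Azure Pipelines logging command (the variable name shortSha is an assumption):
# Runs before the Docker@2 task
- bash: |
    echo "##vso[task.setvariable variable=shortSha]${BUILD_SOURCEVERSION:0:7}"
  displayName: 'Compute short commit SHA'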
Image Pull Policies¶
Policy Selection:
| Policy | Behavior | Use Case |
|---|---|---|
| Always | Always pull latest image | Development (latest tag) |
| IfNotPresent | Pull only if not cached | Production (immutable tags) |
| Never | Never pull, use cached only | Air-gapped environments |
Production Configuration:
spec:
containers:
- name: ingestion
image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
imagePullPolicy: IfNotPresent # Production: use cached image (faster, immutable tag)
Development Configuration:
spec:
containers:
- name: ingestion
image: connectsoft.azurecr.io/atp/ingestion:latest
imagePullPolicy: Always # Dev: always pull latest (mutable tag)
Resource Requests and Limits¶
Per-Environment Resource Specifications¶
Resource Configuration Matrix:
| Environment | CPU Request | CPU Limit | Memory Request | Memory Limit | Replicas |
|---|---|---|---|---|---|
| Dev | 100m | 500m | 128Mi | 512Mi | 1 |
| Test | 250m | 500m | 256Mi | 512Mi | 2 |
| Staging | 500m | 1000m | 512Mi | 1Gi | 3 |
| Production | 1000m | 2000m | 1Gi | 2Gi | 5 |
Production Resource Configuration:
# apps/atp-ingestion/overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
template:
spec:
containers:
- name: ingestion
resources:
requests:
cpu: 1000m # Guaranteed: 1 CPU core
memory: 1Gi # Guaranteed: 1 GB RAM
limits:
cpu: 2000m # Maximum: 2 CPU cores (burst capacity)
memory: 2Gi # Maximum: 2 GB RAM
CPU and Memory Allocations¶
Sizing Guidelines:
- CPU Request: Guaranteed CPU (scheduling decision)
- CPU Limit: Maximum CPU (throttling threshold)
- Memory Request: Guaranteed memory (scheduling decision)
- Memory Limit: Maximum memory (OOMKill threshold)
Cost Optimization:
# Production: Right-sizing based on actual usage
resources:
requests:
cpu: 500m # Based on 50th percentile usage
memory: 512Mi # Based on 50th percentile usage
limits:
cpu: 2000m # Allow burst to 4x request
memory: 2Gi # Allow burst to 4x request (less frequent)
Resource Quota Enforcement¶
Namespace Resource Quota:
# infrastructure/overlays/production/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: atp-production-quota
namespace: atp-production
spec:
hard:
requests.cpu: "100" # 100 CPU cores total
requests.memory: 200Gi # 200 GB RAM total
limits.cpu: "200" # 200 CPU cores max
limits.memory: 400Gi # 400 GB RAM max
persistentvolumeclaims: "50" # Max 50 PVCs
services.loadbalancers: "2" # Max 2 load balancers
pods: "200" # Max 200 pods
Health Checks Configuration¶
Liveness Probes (Is the App Running?)¶
Purpose: Detect and restart crashed containers.
Liveness Probe Example:
livenessProbe:
httpGet:
path: /health/live
port: 8080
scheme: HTTP
httpHeaders:
- name: Custom-Header
value: liveness-check
initialDelaySeconds: 30 # Wait 30s after container starts
periodSeconds: 10 # Check every 10 seconds
timeoutSeconds: 5 # Timeout after 5 seconds
successThreshold: 1 # 1 success = healthy
failureThreshold: 3 # 3 failures = restart container
Implementation (.NET Health Checks):
// C# Health Check implementation
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
Predicate = check => check.Tags.Contains("live"),
AllowCachingResponses = false,
ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});
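For the "live" predicate to match anything, at least one check must be registered with that tag; a minimal sketch (registration location assumed):
// Requires Microsoft.Extensions.Diagnostics.HealthChecks
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" });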
Readiness Probes (Is the App Ready for Traffic?)¶
Purpose: Determine if container is ready to receive traffic.
Readiness Probe Example:
readinessProbe:
httpGet:
path: /health/ready
port: 8080
scheme: HTTP
initialDelaySeconds: 10 # Wait 10s after container starts
periodSeconds: 5 # Check every 5 seconds
timeoutSeconds: 3 # Timeout after 3 seconds
successThreshold: 1 # 1 success = ready
failureThreshold: 3 # 3 failures = remove from Service endpoints
Implementation (.NET Health Checks):
// C# Readiness Check (includes dependencies)
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
Predicate = check => check.Tags.Contains("ready"),
ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});
// Check database connectivity
services.AddHealthChecks()
.AddSqlServer(connectionString, tags: new[] { "ready" })
.AddRedis(redisConnectionString, tags: new[] { "ready" });
Startup Probes (For Slow-Starting Apps)¶
Purpose: Allow slow-starting applications time to initialize.
Startup Probe Example:
startupProbe:
httpGet:
path: /health/startup
port: 8080
scheme: HTTP
initialDelaySeconds: 0 # Start immediately
periodSeconds: 5 # Check every 5 seconds
timeoutSeconds: 3 # Timeout after 3 seconds
successThreshold: 1 # 1 success = startup complete
failureThreshold: 30 # Allow up to 150 seconds (30 * 5s) for startup
When to Use Startup Probes:
- Applications with long initialization (database migrations, cache warming)
- JVM-based applications (slow JIT compilation)
- Applications loading large configuration files
Probe Configuration Best Practices¶
Best Practices Table:
| Probe Type | Initial Delay | Period | Timeout | Failure Threshold | Rationale |
|---|---|---|---|---|---|
| Liveness | 30s | 10s | 5s | 3 | Give app time to start; detect crashes quickly |
| Readiness | 10s | 5s | 3s | 3 | Quick feedback for traffic routing |
| Startup | 0s | 5s | 3s | 30 | Allow up to 150s for slow initialization |
Probe Failure Handling:
# Liveness probe failure: Container restart
# Readiness probe failure: Remove from Service endpoints (no traffic)
# Startup probe failure: Keep checking until success or failure threshold
Pod Security Standards (PSS)¶
Security Context Configuration¶
Pod Security Context (Restricted Profile):
# apps/atp-ingestion/base/deployment.yaml
spec:
template:
spec:
securityContext:
runAsNonRoot: true # Run as non-root user
runAsUser: 1000 # Run as user ID 1000
fsGroup: 2000 # File system group
seccompProfile: # System call filtering
type: RuntimeDefault
supplementalGroups: [] # No additional groups
Container Security Context:
containers:
- name: ingestion
securityContext:
allowPrivilegeEscalation: false # Prevent privilege escalation
readOnlyRootFilesystem: true # Read-only root filesystem
runAsNonRoot: true
runAsUser: 1000
capabilities:
drop: [ALL] # Drop all capabilities
# add: [] # No additional capabilities
Pod Security Admission¶
Namespace Pod Security Labels:
# infrastructure/base/namespaces.yaml
apiVersion: v1
kind: Namespace
metadata:
name: atp-production
labels:
pod-security.kubernetes.io/enforce: restricted # Enforce restricted profile
pod-security.kubernetes.io/audit: restricted # Audit violations
pod-security.kubernetes.io/warn: restricted # Warn on violations
Pod Security Levels:
| Level | Description | ATP Use |
|---|---|---|
| Privileged | Unrestricted (all capabilities) | ❌ Never |
| Baseline | Minimally restrictive | ⚠️ Legacy workloads only |
| Restricted | Highly restrictive (best practice) | ✅ Production |
Restricted, Baseline, Privileged Policies¶
Policy Comparison:
| Feature | Privileged | Baseline | Restricted |
|---|---|---|---|
| Host Namespaces | ✅ Allowed | ❌ Disallowed | ❌ Disallowed |
| Host Networking | ✅ Allowed | ❌ Disallowed | ❌ Disallowed |
| Privileged Containers | ✅ Allowed | ❌ Disallowed | ❌ Disallowed |
| Capabilities | ✅ All | ⚠️ Limited | ❌ Drop ALL |
| Volume Types | ✅ All | ⚠️ Limited | ⚠️ Limited |
| Run as Non-Root | ❌ Not required | ❌ Not required | ✅ Required |
| Read-Only Root FS | ❌ Not required | ❌ Not required | ✅ Required |
| Seccomp | ❌ Not required | ✅ Default | ✅ Required |
ATP Production Policy:
# infrastructure/base/pod-security-standards.yaml
apiVersion: v1
kind: Namespace
metadata:
name: atp-production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
Security Best Practices for ATP¶
Security Checklist:
- ✅ Run as non-root: All containers run as UID 1000+
- ✅ Read-only root filesystem: Use emptyDir volumes for writable paths
- ✅ Drop all capabilities: No additional Linux capabilities
- ✅ Seccomp enabled: System call filtering (RuntimeDefault)
- ✅ No host namespaces: No host network, PID, or IPC access
- ✅ No privileged containers: No elevated privileges
- ✅ Network policies: Default deny, explicit allow rules
Network Policies¶
Default Deny All Traffic¶
Default Deny NetworkPolicy:
# platform/network-policies/default-deny.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: atp-production
spec:
podSelector: {} # Applies to all pods
policyTypes:
- Ingress
- Egress
# No ingress rules = deny all ingress
# No egress rules = deny all egress
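The policy can be verified with a throwaway pod, which should time out until an explicit allow rule matches (service name as used elsewhere in this document):
# Expect a timeout while only default-deny is in place
kubectl run np-test --rm -it --restart=Never --image=busybox -n atp-production \
  -- wget -qO- --timeout=2 http://atp-ingestion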
Service-to-Service Communication Rules¶
Allow Internal Traffic (Same Namespace):
# apps/atp-ingestion/base/networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: atp-ingestion-network-policy
namespace: atp-production
spec:
podSelector:
matchLabels:
app: atp-ingestion
policyTypes:
- Ingress
- Egress
ingress:
# Allow from atp-gateway (API Gateway)
- from:
- podSelector:
matchLabels:
app: atp-gateway
ports:
- protocol: TCP
port: 8080 # HTTP port
- protocol: TCP
port: 50051 # gRPC port
# Allow from same namespace (service-to-service)
- from:
- namespaceSelector:
matchLabels:
name: atp-production
ports:
- protocol: TCP
port: 8080
egress:
# Allow DNS
- to:
- namespaceSelector:
matchLabels:
name: kube-system
ports:
- protocol: UDP
port: 53
# Allow to Azure SQL (external; an empty namespaceSelector only matches
# in-cluster pods, so external endpoints need an ipBlock)
- to:
    - ipBlock:
        cidr: 0.0.0.0/0 # Restrict to the SQL endpoint range where known
  ports:
    - protocol: TCP
      port: 1433 # SQL Server
# Allow to Azure Redis (external)
- to:
    - ipBlock:
        cidr: 0.0.0.0/0
  ports:
    - protocol: TCP
      port: 6380 # Redis TLS
# Allow to observability namespace (metrics)
- to:
- namespaceSelector:
matchLabels:
name: observability
ports:
- protocol: TCP
port: 4317 # OTLP gRPC
Ingress and Egress Policies¶
Ingress Policy (Allow External Traffic):
# apps/atp-gateway/base/networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: atp-gateway-network-policy
namespace: atp-production
spec:
podSelector:
matchLabels:
app: atp-gateway
policyTypes:
- Ingress
ingress:
# Allow from the ingress controller (namespaceSelector and podSelector combined
# in a single peer = AND; two separate peers would be OR)
- from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
      podSelector:
        matchLabels:
          app: ingress-nginx
ports:
- protocol: TCP
port: 8080
DNS and Monitoring Exceptions¶
Egress Policy (DNS and Monitoring):
# platform/network-policies/allow-dns-monitoring.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns-monitoring
namespace: atp-production
spec:
podSelector: {} # Applies to all pods
policyTypes:
- Egress
egress:
# Allow DNS queries (single peer: kube-dns pods in kube-system)
- to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
# Allow to Azure Monitor (external endpoints need an ipBlock, not a namespaceSelector)
- to:
    - ipBlock:
        cidr: 0.0.0.0/0 # Restrict to Azure Monitor ranges where known
ports:
- protocol: TCP
port: 443 # HTTPS for Azure Monitor
# Allow to observability namespace (metrics, logs, traces)
- to:
- namespaceSelector:
matchLabels:
name: observability
ports:
- protocol: TCP
port: 4317 # OTLP
- protocol: TCP
port: 9090 # Prometheus metrics
Horizontal Pod Autoscaler (HPA)¶
CPU-Based Autoscaling¶
HPA Configuration (CPU-Based):
# apps/atp-ingestion/overlays/production/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: atp-ingestion-hpa
namespace: atp-production
labels:
app: atp-ingestion
managed-by: fluxcd
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
minReplicas: 5
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70 # Scale when CPU > 70%
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Percent
value: 50
periodSeconds: 60 # Scale down by 50% per minute
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately
policies:
- type: Percent
value: 100
periodSeconds: 15 # Double replicas every 15 seconds
- type: Pods
value: 4
periodSeconds: 15 # Or add 4 pods every 15 seconds
selectPolicy: Max # Use the policy that scales fastest
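Observe Scaling Behavior:
# Watch current vs. target utilization and replica count
kubectl get hpa atp-ingestion-hpa -n atp-production --watch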
Memory-Based Autoscaling¶
HPA Configuration (Memory-Based):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: atp-ingestion-hpa
namespace: atp-production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
minReplicas: 5
maxReplicas: 20
metrics:
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80 # Scale when memory > 80%
Custom Metrics with KEDA¶
KEDA ScaledObject (Custom Metrics):
# apps/atp-ingestion/overlays/production/keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: atp-ingestion-scaler
namespace: atp-production
spec:
scaleTargetRef:
name: atp-ingestion
minReplicaCount: 5
maxReplicaCount: 20
triggers:
# Scale based on CPU
- type: cpu
metricType: Utilization
metadata:
value: "70"
# Scale based on RabbitMQ queue length
- type: rabbitmq
metadata:
host: amqps://rabbitmq.example:5671
queueName: audit-records-queue
queueLength: "100" # Scale when queue > 100 messages
# Scale based on HTTP request rate (the prometheus trigger requires a query)
- type: prometheus
  metadata:
    serverAddress: http://prometheus.observability:9090
    metricName: http_requests_per_second
    query: sum(rate(http_requests_total{app="atp-ingestion"}[1m])) # source metric assumed
    threshold: "1000" # Scale when requests > 1000/sec
Scaling Policies per Environment¶
Environment-Specific Scaling:
| Environment | Min Replicas | Max Replicas | Target CPU | Target Memory | Custom Metrics |
|---|---|---|---|---|---|
| Dev | 1 | 2 | N/A (no HPA) | N/A | ❌ Disabled |
| Test | 2 | 4 | 80% | 80% | ⚠️ Optional |
| Staging | 3 | 10 | 70% | 75% | ⚠️ Optional |
| Production | 5 | 20 | 70% | 80% | ✅ Enabled (KEDA) |
Manifest Validation¶
kubeval for Syntax Validation¶
kubeval Usage:
# Install kubeval (note: the project is archived; kubeconform is a maintained drop-in alternative)
brew install kubeval # macOS
# or
wget https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
# Validate Kubernetes manifests
kubeval apps/atp-ingestion/base/deployment.yaml
# Validate all manifests in directory
find apps/ -name "*.yaml" -exec kubeval {} \;
# Validate with specific Kubernetes version
kubeval --kubernetes-version 1.30.0 apps/atp-ingestion/base/deployment.yaml
kube-score for Best Practices¶
kube-score Usage:
# Install kube-score
brew install kube-score # macOS
# or download from https://github.com/zegl/kube-score/releases
# Score manifests (best practices check)
kube-score score apps/atp-ingestion/base/deployment.yaml
# Output:
# apps/atp-ingestion/base/deployment.yaml
# [CRITICAL] Container Image Tag
# · Image with latest or no tag
# Container 'ingestion' must not use the 'latest' tag
#
# [CRITICAL] Container Resources
# · CPU limit is not set
# Container 'ingestion' does not have a CPU limit
#
# [WARNING] Container Security Context
# · Container does not have a read-only root filesystem
# Container 'ingestion' can write to root filesystem
kube-score Configuration:
# kube-score is configured via CLI flags rather than a config file
kube-score score apps/atp-ingestion/base/deployment.yaml \
  --ignore-test container-image-tag \
  --ignore-test deployment-has-poddisruptionbudget
# Exits non-zero on CRITICAL findings; add --exit-one-on-warning to fail on warnings too
Azure Policy Validation¶
Azure Policy for Kubernetes:
# platform/azure-policy/pod-security-standards.yaml
# Azure Policy for AKS applies rules like these as managed Gatekeeper
# constraints assigned via the Azure portal/CLI; the equivalent checks can
# also be expressed in-cluster as a ValidatingAdmissionPolicy (paired with a
# ValidatingAdmissionPolicyBinding):
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: enforce-pod-security-standards
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
  validations:
    - expression: "has(object.spec.securityContext) && has(object.spec.securityContext.runAsNonRoot) && object.spec.securityContext.runAsNonRoot == true"
      message: "Pods must run as non-root user"
    - expression: "object.spec.containers.all(c, has(c.securityContext) && has(c.securityContext.readOnlyRootFilesystem) && c.securityContext.readOnlyRootFilesystem == true)"
      message: "Containers must have read-only root filesystem"
    - expression: "object.spec.containers.all(c, has(c.resources.limits) && has(c.resources.limits.cpu) && has(c.resources.limits.memory))"
      message: "Containers must have CPU and memory limits"
CI Pipeline Integration¶
Azure Pipelines Validation Stage:
# azure-pipelines.yml
- stage: Validate_Manifests
displayName: 'Validate Kubernetes Manifests'
jobs:
- job: Validate
steps:
- task: Bash@3
displayName: 'Install kubeval and kube-score'
inputs:
targetType: 'inline'
script: |
# Install kubeval
curl -LO https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
tar xf kubeval-linux-amd64.tar.gz
sudo mv kubeval /usr/local/bin/
# Install kube-score (release asset names embed the version, so pin one; 1.18.0 is an example)
KUBE_SCORE_VERSION=1.18.0
curl -LO https://github.com/zegl/kube-score/releases/download/v${KUBE_SCORE_VERSION}/kube-score_${KUBE_SCORE_VERSION}_linux_amd64.tar.gz
tar xf kube-score_${KUBE_SCORE_VERSION}_linux_amd64.tar.gz
sudo mv kube-score /usr/local/bin/
- task: Bash@3
displayName: 'Validate manifests with kubeval'
inputs:
targetType: 'inline'
script: |
# Validate all base manifests
find apps/ -name "*.yaml" -path "*/base/*" -exec kubeval --strict {} \;
- task: Bash@3
displayName: 'Score manifests with kube-score'
inputs:
targetType: 'inline'
script: |
# Score all base manifests
find apps/ -name "*.yaml" -path "*/base/*" -exec kube-score score {} \;
Summary: Declarative Manifest Management¶
- Base Manifest Structure: Standard Kubernetes resources (Deployment, Service, ConfigMap, Ingress, ServiceAccount, RBAC) for all ATP services
- Helm Charts: Parameterized, reusable templates with environment-specific values files
- Kustomize Overlays: Environment-specific customization using strategic merge patches and generators
- Configuration Layering: Base configuration + environment overlays with clear precedence
- Image References: ACR paths with semantic versioning and commit SHA for traceability
- Resource Management: Per-environment resource requests/limits with cost optimization
- Health Checks: Liveness, readiness, and startup probes for reliability
- Pod Security Standards: Restricted profile enforcement for production workloads
- Network Policies: Default deny with explicit service-to-service rules
- Horizontal Pod Autoscaler: CPU/memory-based scaling with KEDA for custom metrics
- Manifest Validation: kubeval (syntax), kube-score (best practices), Azure Policy (compliance)
Git Workflow & Environment Promotion¶
Purpose: Define the complete Git workflow, pull request process, environment promotion strategy, and operational procedures for managing changes through the GitOps pipeline from feature development to production deployment.
Feature Branch Development Workflow¶
ATP GitOps follows a GitOps-native workflow where all infrastructure and application changes flow through Git branches, pull requests, and automated validation before promotion to production.
Creating Feature Branches from Dev¶
Branch Creation Process:
# 1. Ensure you're on the latest dev branch
git checkout dev
git pull origin dev
# 2. Create feature branch from dev
git checkout -b feature/atp-ingestion-add-grpc-support
# 3. Verify branch creation
git branch
# Output:
# * feature/atp-ingestion-add-grpc-support
# dev
# main
# 4. Push feature branch to remote
git push -u origin feature/atp-ingestion-add-grpc-support
Branch Naming Conventions:
| Branch Type | Prefix | Example | Purpose |
|---|---|---|---|
| Feature | feature/ | feature/atp-query-add-cache | New functionality |
| Bugfix | bugfix/ | bugfix/atp-ingestion-memory-leak | Bug fixes |
| Hotfix | hotfix/ | hotfix/atp-gateway-security-patch | Critical production fixes |
| Documentation | docs/ | docs/gitops-troubleshooting | Documentation updates |
| Infrastructure | infra/ | infra/add-monitoring-namespace | Infrastructure changes |
| Chore | chore/ | chore/update-helm-charts | Maintenance tasks |
Local Development and Testing¶
Local GitOps Repository Structure:
# Clone GitOps repository
git clone ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops.git
cd atp-gitops
# Navigate to service manifests
cd apps/atp-ingestion/base/
# Edit deployment manifest
vim deployment.yaml
# Validate changes locally
kubectl apply --dry-run=client -f deployment.yaml
Local Validation Tools:
# Validate YAML syntax
yamllint apps/atp-ingestion/base/deployment.yaml
# Validate Kubernetes manifests
kubeval apps/atp-ingestion/base/deployment.yaml
# Score manifests (best practices)
kube-score score apps/atp-ingestion/base/deployment.yaml
# Preview Kustomize output
kustomize build apps/atp-ingestion/overlays/dev/
# Preview Helm template output
helm template atp-ingestion apps/atp-ingestion/helm/ \
--values apps/atp-ingestion/helm/values-dev.yaml \
--debug
Committing Manifest Changes¶
Commit Process:
# Stage changes
git add apps/atp-ingestion/base/deployment.yaml
# Commit with conventional commit format
git commit -m "feat(ingestion): add gRPC endpoint configuration
- Add gRPC port (50051) to container ports
- Configure gRPC health checks
- Update service manifest for gRPC traffic
Related to: ATP-1234"
# Sign commit (required for production)
git commit -S -m "feat(ingestion): add gRPC endpoint configuration"
# Verify commit signature
git log --show-signature -1
Commit Message Format (Conventional Commits):
Examples:
# Feature addition
git commit -m "feat(ingestion): add Redis cache support"
# Bug fix
git commit -m "fix(query): resolve memory leak in query service"
# Configuration change
git commit -m "chore(infra): update resource limits for production"
# Breaking change
git commit -m "feat(gateway)!: remove legacy authentication
BREAKING CHANGE: Legacy API key authentication removed.
Migrate to OAuth 2.0 before deploying this change."
# Work item reference
git commit -m "feat(integrity): implement tamper detection
Implements ATP-5678
- Add cryptographic signatures to audit records
- Validate signatures on read operations
- Store signature metadata in database"
Syncing with Remote Repository¶
Sync Workflow:
# Fetch latest changes from remote
git fetch origin
# Check status
git status
# Rebase feature branch on latest dev (optional, for clean history)
git checkout feature/atp-ingestion-add-grpc-support
git rebase origin/dev
# Or merge latest dev into feature branch
git merge origin/dev
# Resolve conflicts if any
git status
# Edit conflicted files
vim apps/atp-ingestion/base/deployment.yaml
git add apps/atp-ingestion/base/deployment.yaml
git rebase --continue # or git commit for merge
# Push changes
git push origin feature/atp-ingestion-add-grpc-support
# If rebased, force push (be careful!)
git push --force-with-lease origin feature/atp-ingestion-add-grpc-support
Pull Request Process¶
PR Creation in Azure Repos¶
Create Pull Request:
# Using Azure DevOps CLI
az repos pr create \
--source-branch feature/atp-ingestion-add-grpc-support \
--target-branch dev \
--title "feat(ingestion): Add gRPC endpoint configuration" \
--description "Adds gRPC support to ingestion service. See ATP-1234." \
--work-items 1234 \
--auto-complete false
# Or use Azure DevOps Portal:
# 1. Navigate to Repos > Pull Requests
# 2. Click "New Pull Request"
# 3. Select source branch (feature/atp-ingestion-add-grpc-support)
# 4. Select target branch (dev)
# 5. Fill in title and description
# 6. Link work items
# 7. Add reviewers
# 8. Create pull request
PR Template and Checklist¶
Pull Request Template (.azuredevops/pull_request_template.md):
## Description
<!-- Provide a clear description of the changes -->
## Type of Change
<!-- Mark applicable with [x] -->
- [ ] Feature (non-breaking change adding functionality)
- [ ] Bug fix (non-breaking change fixing an issue)
- [ ] Breaking change (fix or feature causing existing functionality to change)
- [ ] Documentation update
- [ ] Infrastructure change
- [ ] Configuration change
## Service(s) Affected
<!-- List affected services -->
- [ ] atp-ingestion
- [ ] atp-query
- [ ] atp-integrity
- [ ] atp-export
- [ ] atp-policy
- [ ] atp-search
- [ ] atp-gateway
- [ ] Infrastructure/Platform
## Environment(s) Affected
<!-- Mark applicable with [x] -->
- [ ] Dev
- [ ] Test
- [ ] Staging
- [ ] Production
## Pre-Merge Checklist
<!-- Mark applicable with [x] -->
- [ ] Code follows project style guidelines
- [ ] Self-review completed
- [ ] Comments added for complex logic
- [ ] Documentation updated
- [ ] No breaking changes (or documented)
- [ ] All CI/CD checks passing
- [ ] Manifest validation passing (kubeval, kube-score)
- [ ] Security scanning passing (OPA, Azure Policy)
- [ ] Preview environment tested (if applicable)
- [ ] Work items linked
- [ ] Signed commits (for production branches)
## Testing
<!-- Describe testing performed -->
- [ ] Local testing completed
- [ ] Preview environment tested
- [ ] Unit tests passing
- [ ] Integration tests passing
- [ ] Manual testing completed
## Rollback Plan
<!-- Describe rollback procedure if needed -->
## Related Work Items
<!-- Link related work items -->
- ATP-1234: Add gRPC endpoint to ingestion service
## Screenshots/Documentation
<!-- Add screenshots, diagrams, or documentation links -->
## Additional Notes
<!-- Any additional information reviewers should know -->
Code Review Guidelines¶
Review Checklist:
- Manifest Validation:
  - ✅ YAML syntax correct
  - ✅ Kubernetes API version valid
  - ✅ Resource names follow conventions
  - ✅ Labels and annotations present
  - ✅ Resource requests/limits set
- Security:
  - ✅ No secrets in plaintext
  - ✅ Pod Security Standards compliant
  - ✅ Network policies configured
  - ✅ RBAC follows least privilege
- Configuration:
  - ✅ Environment-specific values correct
  - ✅ Image tags immutable (no latest in prod)
  - ✅ Health checks configured
  - ✅ Resource limits appropriate
- Best Practices:
  - ✅ Follows GitOps principles
  - ✅ Changes are declarative
  - ✅ No hardcoded values
  - ✅ Documentation updated
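Several of the mechanical checks above can be pre-screened before human review; a minimal hypothetical sketch, assuming the repository layout used throughout this document:
#!/bin/bash
# scripts/pre-review-checks.sh (hypothetical helper flagging common review issues)
set -euo pipefail
FAIL=0
# Flag mutable ':latest' image tags in production overlays
if grep -rn "image:.*:latest" apps/*/overlays/production/ 2>/dev/null; then
  echo "❌ Found ':latest' image tag in a production overlay"
  FAIL=1
fi
# Flag raw Secret manifests (secrets must come from Key Vault, not Git)
if grep -rln "kind: Secret" apps/ 2>/dev/null; then
  echo "❌ Found plaintext Secret manifests committed to Git"
  FAIL=1
fi
[ "$FAIL" -eq 0 ] && echo "✅ Pre-review checks passed"
exit $FAIL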
Review Comments:
❌ Security Issue: Secret in plaintext
✅ Approved: Looks good, minor suggestion
⚠️ Needs Work: Please add resource limits
📝 Question: Why is this change needed?
Approval Workflow¶
Approval Requirements Matrix:
| Target Branch | Minimum Approvers | Required Roles | GPG Signing | Status Checks |
|---|---|---|---|---|
| dev | 1 | Developer or above | ❌ Optional | ✅ Required |
| test | 1 | Developer or above | ❌ Optional | ✅ Required |
| staging | 2 | Architect or SRE Lead | ✅ Required | ✅ Required |
| production | 2 | Architect or SRE Lead | ✅ Required | ✅ Required |
| hotfix/ | 2 | Architect or SRE Lead | ✅ Required | ✅ Required |
Azure DevOps Branch Policy Configuration:
# Branch policy for production branch
branchPolicy:
branch: production
minimumApproverCount: 2
requiredApproverIds:
- architect-team-group
- sre-lead-group
blockingPolicies:
- buildValidation: true
- mergeStrategy: squash
- requireGpgSigning: true
- requireWorkItemLinking: true
- commentRequirements: true
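Equivalent approver policies can be created from the Azure DevOps CLI; a hedged sketch (the repository ID is a placeholder to fill in):
# Require two approvers on the production branch (blocking policy)
az repos policy approver-count create \
  --repository-id <REPO_ID> \
  --branch production \
  --blocking true \
  --enabled true \
  --minimum-approver-count 2 \
  --creator-vote-counts false \
  --reset-on-source-push true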
Automated PR Validation¶
Manifest Linting (YAML Syntax, Helm Lint)¶
Azure Pipeline: PR Validation Stage:
# .azuredevops/pipelines/pr-validation.yml
trigger: none # Only run on PR
pr:
branches:
include:
- dev
- test
- staging
- production
- hotfix/*
pool:
vmImage: 'ubuntu-latest'
stages:
- stage: Validate_Manifests
displayName: 'Validate Kubernetes Manifests'
jobs:
- job: YAMLLint
displayName: 'YAML Syntax Validation'
steps:
- task: UsePythonVersion@0
inputs:
versionSpec: '3.9'
- script: |
pip install yamllint
yamllint -c .yamllint.yml apps/
displayName: 'Validate YAML syntax'
- job: Kubeval
displayName: 'Kubernetes Manifest Validation'
steps:
- script: |
wget -q https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
tar xf kubeval-linux-amd64.tar.gz
sudo mv kubeval /usr/local/bin/
kubeval --version
displayName: 'Install kubeval'
- script: |
find apps/ -name "*.yaml" -path "*/base/*" -exec kubeval --strict {} \;
displayName: 'Validate Kubernetes manifests'
- job: HelmLint
displayName: 'Helm Chart Linting'
steps:
- task: HelmInstaller@1
inputs:
helmVersionToInstall: 'latest'
- script: |
find apps/ -name "Chart.yaml" -path "*/helm/*" | while read chart; do
chart_dir=$(dirname "$chart")
helm lint "$chart_dir"
done
displayName: 'Lint Helm charts'
- job: KubeScore
displayName: 'Best Practices Check'
steps:
- script: |
wget -q https://github.com/zegl/kube-score/releases/latest/download/kube-score_linux_amd64.tar.gz
tar xf kube-score_linux_amd64.tar.gz
sudo mv kube-score /usr/local/bin/
displayName: 'Install kube-score'
- script: |
find apps/ -name "*.yaml" -path "*/base/*" -exec kube-score score {} \;
displayName: 'Score manifests for best practices'
Security Scanning (OPA Policies, Azure Policy)¶
OPA Policy Validation:
# .azuredevops/pipelines/pr-validation.yml
- job: OPAPolicy
displayName: 'OPA Policy Validation'
steps:
- script: |
wget -q https://github.com/open-policy-agent/conftest/releases/latest/download/conftest_linux_amd64.tar.gz
tar xf conftest_linux_amd64.tar.gz
sudo mv conftest /usr/local/bin/
displayName: 'Install conftest'
- script: |
find apps/ -name "*.yaml" -path "*/base/*" | while read manifest; do
conftest test "$manifest" -p policies/
done
displayName: 'Validate OPA policies'
OPA Policy Examples:
# policies/pod-security.rego
package podsecurity
deny[msg] {
input.kind == "Deployment"
container := input.spec.template.spec.containers[_]
not container.securityContext.runAsNonRoot
msg := "Container must run as non-root user"
}
deny[msg] {
input.kind == "Deployment"
container := input.spec.template.spec.containers[_]
not container.securityContext.readOnlyRootFilesystem
msg := "Container must have read-only root filesystem"
}
deny[msg] {
input.kind == "Deployment"
container := input.spec.template.spec.containers[_]
not container.resources.limits.cpu
msg := "Container must have CPU limit"
}
Dry-Run Validation (Kustomize Build, Helm Template)¶
Dry-Run Validation:
# .azuredevops/pipelines/pr-validation.yml
- job: DryRun
displayName: 'Dry-Run Validation'
steps:
- task: Kubernetes@1
inputs:
connectionType: 'Azure Resource Manager'
azureSubscriptionEndpoint: 'ATP-AKS-Connection'
azureResourceGroup: 'ATP-Production-EUS-RG'
kubernetesCluster: 'atp-prod-eus-aks'
namespace: 'atp-production'
command: 'apply'
arguments: '--dry-run=client -f apps/atp-ingestion/base/deployment.yaml'
displayName: 'Kubectl dry-run'
- script: |
# Kustomize build validation
kustomize build apps/atp-ingestion/overlays/production/ > /dev/null
echo "✅ Kustomize build successful"
displayName: 'Kustomize build validation'
- script: |
# Helm template validation
helm template atp-ingestion apps/atp-ingestion/helm/ \
--values apps/atp-ingestion/helm/values-production.yaml \
--debug > /dev/null
echo "✅ Helm template successful"
displayName: 'Helm template validation'
Breaking Change Detection¶
Breaking Change Detection Script:
#!/bin/bash
# scripts/detect-breaking-changes.sh
set -euo pipefail
BASE_BRANCH="${1:-dev}"
FEATURE_BRANCH="${2:-HEAD}"
echo "🔍 Detecting breaking changes between $BASE_BRANCH and $FEATURE_BRANCH..."
# Check for removed resources
REMOVED_RESOURCES=$(git diff --name-only --diff-filter=D "$BASE_BRANCH" "$FEATURE_BRANCH" | grep -E '\.(yaml|yml)$' || true)
if [ -n "$REMOVED_RESOURCES" ]; then
echo "⚠️ WARNING: Resources removed:"
echo "$REMOVED_RESOURCES"
echo "This may be a breaking change!"
fi
# Check for API version changes
API_VERSION_CHANGES=$(git diff "$BASE_BRANCH" "$FEATURE_BRANCH" | grep -E '^\+.*apiVersion:|^\-.*apiVersion:' || true)
if [ -n "$API_VERSION_CHANGES" ]; then
echo "⚠️ WARNING: API version changes detected:"
echo "$API_VERSION_CHANGES"
fi
# Check for breaking change markers
BREAKING_MARKERS=$(git log --oneline "$BASE_BRANCH..$FEATURE_BRANCH" | grep -i "BREAKING" || true)
if [ -n "$BREAKING_MARKERS" ]; then
echo "🚨 BREAKING CHANGE detected in commit messages:"
echo "$BREAKING_MARKERS"
exit 1
fi
echo "✅ No breaking changes detected"
Test Environment Deployment Preview¶
Preview Environment Deployment:
# .azuredevops/pipelines/pr-validation.yml
- job: PreviewDeploy
displayName: 'Preview Environment Deployment'
condition: and(succeeded(), startsWith(variables['Build.SourceBranch'], 'refs/heads/feature/'))
steps:
- task: Kubernetes@1
inputs:
connectionType: 'Azure Resource Manager'
azureSubscriptionEndpoint: 'ATP-AKS-Connection'
azureResourceGroup: 'ATP-Dev-EUS-RG'
kubernetesCluster: 'atp-dev-eus-aks'
namespace: 'preview-$(System.PullRequest.PullRequestId)'
command: 'apply'
arguments: '-f apps/atp-ingestion/base/'
displayName: 'Deploy to preview namespace'
- script: |
# Wait for deployment to be ready
kubectl wait --for=condition=available \
--timeout=300s \
deployment/atp-ingestion \
-n preview-$(System.PullRequest.PullRequestId)
echo "✅ Preview deployment successful"
displayName: 'Wait for deployment'
- script: |
# Run smoke tests
kubectl exec -n preview-$(System.PullRequest.PullRequestId) \
deployment/atp-ingestion -- \
curl -f http://localhost:8080/health/ready
echo "✅ Smoke tests passed"
displayName: 'Run smoke tests'
Merge Strategies¶
Squash Merge (Production, Staging)¶
Squash Merge Configuration:
# Azure DevOps branch policy
branchPolicy:
branch: production
mergeStrategy: squash
squashMergeCommitMessage: firstLine # Use first commit message line
Squash Merge Example:
# Before squash merge (3 commits in feature branch)
git log --oneline feature/atp-ingestion-add-grpc
# abc123 feat(ingestion): add gRPC port
# def456 feat(ingestion): add gRPC health check
# ghi789 feat(ingestion): update service manifest
# After squash merge to production (1 commit)
git log --oneline production
# jkl012 feat(ingestion): add gRPC port # Single squashed commit
Benefits of Squash Merge:
- ✅ Clean, linear history
- ✅ Easier rollback (single commit)
- ✅ Simpler to review changes
Merge Commit (Test)¶
Merge Commit Configuration:
# Azure DevOps branch policy
branchPolicy:
branch: test
mergeStrategy: noFastForward # Creates merge commit
Merge Commit Example:
# Feature branch merged to test with merge commit
git log --oneline --graph test
# * mno345 Merge pull request #123 from feature/atp-ingestion-add-grpc
# |\
# | * abc123 feat(ingestion): add gRPC port
# | * def456 feat(ingestion): add gRPC health check
# |/
# * pqr678 Previous commit
Benefits of Merge Commit:
- ✅ Preserves branch history
- ✅ Clear feature boundaries
- ✅ Useful for tracking feature development
Rebase (Dev, Optional)¶
Rebase Workflow:
# Rebase feature branch on latest dev
git checkout feature/atp-ingestion-add-grpc
git fetch origin
git rebase origin/dev
# Resolve conflicts if any
git status
# Edit conflicted files
vim apps/atp-ingestion/base/deployment.yaml
git add apps/atp-ingestion/base/deployment.yaml
git rebase --continue
# Force push (be careful!)
git push --force-with-lease origin feature/atp-ingestion-add-grpc
Benefits of Rebase:
- ✅ Linear history
- ✅ Clean, sequential commits
- ⚠️ Requires force push (dangerous)
Strategy Selection Rationale¶
Merge Strategy Matrix:
| Branch | Strategy | Rationale |
|---|---|---|
| dev | Merge commit or rebase | Preserve feature history, flexibility |
| test | Merge commit | Track feature development clearly |
| staging | Squash merge | Clean history, easier rollback |
| production | Squash merge | Clean, linear history essential for compliance |
Environment Promotion Flow¶
Promotion Flow Diagram:
graph LR
A[Feature Branch] -->|PR + Merge| B[Dev Environment]
B -->|Automated<br/>Schedule/Tag| C[Test Environment]
C -->|Manual Approval<br/>Regression Tests| D[Staging Environment]
D -->|CAB Approval<br/>Change Window| E[Production Environment]
F[Hotfix Branch] -.->|Expedited| E
style A fill:#90EE90
style B fill:#90EE90
style C fill:#FFE5B4
style D fill:#FFE5B4
style E fill:#ffcccc
style F fill:#ff9999
Promotion Flow Details:
| From | To | Method | Trigger | Approval Required | Automated Testing |
|---|---|---|---|---|---|
| Feature | Dev | Automatic | PR merge | ❌ No (PR approval only) | ✅ PR validation |
| Dev | Test | Automated | Schedule/Tag | ❌ No | ✅ Smoke tests |
| Test | Staging | Manual | On-demand | ✅ 2 approvers | ✅ Regression tests |
| Staging | Production | Manual | Change window | ✅ CAB (2 approvers) | ✅ Full test suite |
| Hotfix | Production | Expedited | Critical issue | ✅ 2 approvers | ✅ Hotfix tests |
Feature → Dev (Automatic After PR Merge)¶
Automatic Promotion Process:
# Azure Pipeline: Auto-promote to Dev
trigger:
branches:
include:
- dev
pool:
vmImage: 'ubuntu-latest'
stages:
- stage: PromoteToDev
displayName: 'Promote to Dev Environment'
jobs:
- job: DeployToDev
steps:
- task: Kubernetes@1
inputs:
connectionType: 'Azure Resource Manager'
azureSubscriptionEndpoint: 'ATP-AKS-Connection'
azureResourceGroup: 'ATP-Dev-EUS-RG'
kubernetesCluster: 'atp-dev-eus-aks'
namespace: 'atp-dev'
command: 'apply'
arguments: '-f apps/atp-ingestion/overlays/dev/'
displayName: 'Apply manifests to Dev cluster'
- script: |
# Verify deployment
kubectl rollout status deployment/atp-ingestion -n atp-dev --timeout=300s
echo "✅ Deployment to Dev successful"
displayName: 'Verify deployment'
Dev → Test (Automatic, Triggered by Schedule or Tag)¶
Automated Promotion to Test:
# Azure Pipeline: Auto-promote to Test
schedules:
- cron: "0 2 * * *" # Daily at 2 AM UTC
branches:
include:
- dev
displayName: 'Daily promotion to Test'
trigger:
branches:
include:
- dev
tags:
include:
- promote-to-test/*
pool:
vmImage: 'ubuntu-latest'
stages:
- stage: PromoteToTest
displayName: 'Promote to Test Environment'
jobs:
- job: DeployToTest
steps:
- script: |
# Tag current dev commit
git tag -a "test-$(date +%Y%m%d-%H%M%S)" -m "Promote to Test: $(Build.SourceVersion)"
git push origin --tags
displayName: 'Tag promotion'
- task: Kubernetes@1
inputs:
connectionType: 'Azure Resource Manager'
azureSubscriptionEndpoint: 'ATP-AKS-Connection'
azureResourceGroup: 'ATP-Test-EUS-RG'
kubernetesCluster: 'atp-test-eus-aks'
namespace: 'atp-test'
command: 'apply'
arguments: '-f apps/atp-ingestion/overlays/test/'
displayName: 'Apply manifests to Test cluster'
- script: |
# Run smoke tests
./scripts/run-smoke-tests.sh --environment test
displayName: 'Run smoke tests'
Manual Trigger for Test Promotion:
# Tag dev branch to trigger promotion to test
git checkout dev
git pull origin dev
git tag -a "promote-to-test/v1.2.3" -m "Promote version 1.2.3 to Test"
git push origin --tags
Test → Staging (Manual Approval, Regression Tests)¶
Manual Promotion to Staging:
# Azure Pipeline: Manual promotion to Staging
trigger: none # Manual trigger only
parameters:
- name: promoteVersion
displayName: 'Version to Promote'
type: string
default: 'latest'
pool:
vmImage: 'ubuntu-latest'
stages:
- stage: PromoteToStaging
displayName: 'Promote to Staging Environment'
jobs:
- job: DeployToStaging
steps:
- script: |
# Checkout test branch at specified version
git checkout test
git pull origin test
git checkout "${{ parameters.promoteVersion }}"
displayName: 'Checkout version'
- task: Kubernetes@1
inputs:
connectionType: 'Azure Resource Manager'
azureSubscriptionEndpoint: 'ATP-AKS-Connection'
azureResourceGroup: 'ATP-Staging-EUS-RG'
kubernetesCluster: 'atp-staging-eus-aks'
namespace: 'atp-staging'
command: 'apply'
arguments: '-f apps/atp-ingestion/overlays/staging/'
displayName: 'Apply manifests to Staging cluster'
- script: |
# Run full regression test suite
./scripts/run-regression-tests.sh --environment staging
displayName: 'Run regression tests'
Pre-Promotion Checklist:
- ✅ All test environment tests passing
- ✅ Regression test suite passing
- ✅ Performance benchmarks met
- ✅ Security scans passing
- ✅ Documentation updated
- ✅ Rollback plan documented
- ✅ 2 approvers approved
Staging → Production (CAB Approval, Change Window)¶
Production Promotion Process:
# Azure Pipeline: Production promotion (requires manual approval)
trigger: none # Manual trigger only
parameters:
- name: promoteVersion
displayName: 'Version to Promote to Production'
type: string
default: 'latest'
- name: changeWindow
displayName: 'Change Window'
type: string
default: '2024-01-15 02:00 UTC'
pool:
vmImage: 'ubuntu-latest'
stages:
- stage: ApprovalGate
displayName: 'Change Advisory Board Approval'
jobs:
- job: WaitForApproval
displayName: 'Wait for CAB Approval'
pool: server
steps:
- task: ManualValidation@0
timeoutInMinutes: 1440 # 24 hours
inputs:
notifyUsers: 'architect-team@connectsoft.example;sre-lead@connectsoft.example'
instructions: 'Review and approve production promotion'
- stage: PromoteToProduction
displayName: 'Promote to Production Environment'
dependsOn: ApprovalGate
condition: succeeded()
jobs:
- job: DeployToProduction
steps:
- script: |
# Verify change window
CURRENT_TIME=$(date -u +%s)
WINDOW_START=$(date -u -d "${{ parameters.changeWindow }}" +%s)
if [ $CURRENT_TIME -lt $WINDOW_START ]; then
echo "⏳ Waiting for change window..."
sleep $((WINDOW_START - CURRENT_TIME))
fi
displayName: 'Wait for change window'
- script: |
git checkout staging
git pull origin staging
git checkout "${{ parameters.promoteVersion }}"
displayName: 'Checkout version'
- task: Kubernetes@1
inputs:
connectionType: 'Azure Resource Manager'
azureSubscriptionEndpoint: 'ATP-AKS-Connection'
azureResourceGroup: 'ATP-Production-EUS-RG'
kubernetesCluster: 'atp-prod-eus-aks'
namespace: 'atp-production'
command: 'apply'
arguments: '-f apps/atp-ingestion/overlays/production/'
displayName: 'Apply manifests to Production cluster'
- script: |
# Verify deployment
kubectl rollout status deployment/atp-ingestion -n atp-production --timeout=600s
echo "✅ Production deployment successful"
displayName: 'Verify deployment'
- script: |
# Run production smoke tests
./scripts/run-production-smoke-tests.sh
displayName: 'Run production smoke tests'
Change Window Schedule:
| Day | Window | Rationale |
|---|---|---|
| Monday - Thursday | 02:00 - 04:00 UTC | Low traffic period |
| Friday | No deployments | Weekend preparation |
| Saturday - Sunday | Emergency only | Minimal staffing |
Automated Promotion (Dev and Test)¶
Trigger Mechanisms (Schedule, Tags, Webhooks)¶
Schedule-Based Promotion:
# Azure Pipeline: Scheduled promotion
schedules:
- cron: "0 2 * * *" # Daily at 2 AM UTC
branches:
include:
- dev
displayName: 'Daily Dev → Test Promotion'
always: false # Only if changes detected
Tag-Based Promotion:
# Create promotion tag
git checkout dev
git pull origin dev
git tag -a "promote-to-test/v1.2.3" -m "Promote version 1.2.3 to Test environment"
git push origin --tags
# Pipeline triggered automatically
Webhook Trigger:
# Azure Pipeline: Webhook trigger
resources:
webhooks:
- webhook: promotion-webhook
connection: GitHubWebhook
filters:
- path: body.ref
value: refs/heads/dev
Automated Testing Gates¶
Testing Gates Configuration:
# .azuredevops/pipelines/promotion-test-gates.yml
stages:
- stage: TestingGates
displayName: 'Automated Testing Gates'
jobs:
- job: SmokeTests
displayName: 'Smoke Tests'
steps:
- script: |
./scripts/run-smoke-tests.sh --environment test
displayName: 'Run smoke tests'
- job: IntegrationTests
displayName: 'Integration Tests'
steps:
- script: |
./scripts/run-integration-tests.sh --environment test
displayName: 'Run integration tests'
- job: PerformanceTests
displayName: 'Performance Benchmarks'
steps:
- script: |
./scripts/run-performance-tests.sh --environment test
displayName: 'Run performance tests'
- script: |
# Validate performance metrics
METRICS=$(cat performance-results.json)
P95_LATENCY=$(echo "$METRICS" | jq '.p95_latency')
if (( $(echo "$P95_LATENCY > 500" | bc -l) )); then
echo "❌ Performance regression: P95 latency $P95_LATENCY > 500ms"
exit 1
fi
echo "✅ Performance tests passed"
displayName: 'Validate performance'
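The run-smoke-tests.sh helper referenced throughout this section is repository-specific; a minimal hypothetical sketch, assuming the atp-<environment> namespace convention and the /health/ready endpoint used elsewhere in this document:
#!/bin/bash
# scripts/run-smoke-tests.sh (hypothetical sketch of the smoke-test helper)
set -euo pipefail
# Parse --environment <name> (defaults to test)
ENVIRONMENT="test"
while [ $# -gt 0 ]; do
  case "$1" in
    --environment) ENVIRONMENT="$2"; shift 2 ;;
    *) shift ;;
  esac
done
NAMESPACE="atp-${ENVIRONMENT}"
# Probe readiness of each core service from inside its pod
for svc in atp-ingestion atp-query atp-gateway; do
  kubectl exec -n "$NAMESPACE" "deployment/$svc" -- \
    curl -fsS http://localhost:8080/health/ready > /dev/null
  echo "✅ $svc ready in $NAMESPACE"
done
echo "✅ Smoke tests passed for $ENVIRONMENT"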
Rollback on Failure¶
Automatic Rollback Script:
#!/bin/bash
# scripts/auto-rollback.sh
set -euo pipefail
ENVIRONMENT="${1:-test}"
DEPLOYMENT_NAME="${2:-atp-ingestion}"
NAMESPACE="atp-${ENVIRONMENT}"
echo "🔄 Rolling back deployment $DEPLOYMENT_NAME in $NAMESPACE..."
# Get previous revision
PREVIOUS_REVISION=$(kubectl rollout history deployment/$DEPLOYMENT_NAME -n $NAMESPACE | tail -n 2 | head -n 1 | awk '{print $1}')
if [ -z "$PREVIOUS_REVISION" ]; then
echo "❌ No previous revision found"
exit 1
fi
# Rollback
kubectl rollout undo deployment/$DEPLOYMENT_NAME -n $NAMESPACE
# Wait for rollback
kubectl rollout status deployment/$DEPLOYMENT_NAME -n $NAMESPACE --timeout=300s
echo "✅ Rollback successful to revision $PREVIOUS_REVISION"
Notification and Alerting¶
Promotion Notification:
# Azure Pipeline: Notification stage
- stage: Notify
displayName: 'Send Notifications'
condition: always() # Always run, even on failure
jobs:
- job: NotifyTeam
steps:
- task: Slack@1
inputs:
endpoint: 'ATP-Slack-Connection'
channel: '#atp-deployments'
message: |
🚀 Promotion to ${{ parameters.environment }} Environment
*Version*: ${{ parameters.promoteVersion }}
*Status*: $(Agent.JobStatus)
*Pipeline*: $(Build.BuildNumber)
*Author*: $(Build.RequestedFor)
*Changes*:
$(git log --oneline -10)
displayName: 'Send Slack notification'
- task: SendEmail@1
condition: eq(variables['Agent.JobStatus'], 'Failed')
inputs:
to: 'sre-oncall@connectsoft.example'
subject: '❌ Production Promotion Failed'
body: 'Production promotion failed. Check pipeline: $(Build.BuildUri)'
displayName: 'Send alert email'
Manual Promotion (Staging and Production)¶
Approval Gates Configuration¶
Azure DevOps Approval Gates:
# Azure DevOps environment: Production
environments:
- name: Production
approvals:
- approvers:
- architect-team@connectsoft.example
- sre-lead@connectsoft.example
count: 2 # Require 2 approvals
timeoutInMinutes: 1440 # 24 hours
checks:
- type: AzureMonitor
properties:
actionGroupName: production-promotion-alerts
- type: InvokeRESTAPI
properties:
url: 'https://api.connectsoft.example/change-window/validate'
method: 'POST'
Change Advisory Board (CAB) Process¶
CAB Approval Checklist:
- Change Request Review:
  - ✅ Change description clear
  - ✅ Risk assessment completed
  - ✅ Rollback plan documented
  - ✅ Testing evidence provided
  - ✅ Impact analysis completed
- Technical Review:
  - ✅ Architecture review approved
  - ✅ Security review approved
  - ✅ Performance impact assessed
  - ✅ Dependency analysis completed
- Operational Review:
  - ✅ Runbook updated
  - ✅ Monitoring alerts configured
  - ✅ On-call engineer notified
  - ✅ Change window scheduled
CAB Meeting Agenda:
- Review pending change requests
- Assess risk and impact
- Approve/reject change requests
- Schedule change windows
- Document decisions
Change Window Scheduling¶
Change Window Policy:
| Environment | Days | Time Window | Restrictions |
|---|---|---|---|
| Dev | Any day | 24/7 | None |
| Test | Mon-Fri | 24/7 | None |
| Staging | Mon-Thu | 02:00-04:00 UTC | No Friday deployments |
| Production | Mon-Thu | 02:00-04:00 UTC | No Friday/weekend deployments |
Schedule Change Window:
# Azure DevOps CLI: Schedule change
az pipelines variable-group create \
--name "Change-Window-2024-01-15" \
--variables \
changeWindowStart="2024-01-15T02:00:00Z" \
changeWindowEnd="2024-01-15T04:00:00Z" \
changeOwner="john.doe@connectsoft.example"
Pre-Deployment Checklists¶
Pre-Deployment Checklist:
## Pre-Deployment Checklist
### Technical Readiness
- [ ] All tests passing (unit, integration, E2E)
- [ ] Performance benchmarks met
- [ ] Security scans passing (SAST, DAST, dependency scan)
- [ ] Manifest validation passing
- [ ] Helm charts validated
- [ ] Kustomize builds successful
### Documentation
- [ ] Release notes updated
- [ ] API documentation updated
- [ ] Runbook updated
- [ ] Architecture diagrams updated
### Operations
- [ ] Monitoring dashboards configured
- [ ] Alerts configured
- [ ] Logging configured
- [ ] Backup strategy verified
- [ ] Rollback procedure tested
### Communication
- [ ] Stakeholders notified
- [ ] On-call engineer notified
- [ ] Support team briefed
- [ ] Customer communication prepared (if needed)
### Change Management
- [ ] Change request created
- [ ] CAB approval obtained
- [ ] Change window scheduled
- [ ] Risk assessment completed
Post-Deployment Validation¶
Post-Deployment Validation Script:
#!/bin/bash
# scripts/post-deployment-validation.sh
set -euo pipefail
ENVIRONMENT="${1:-production}"
NAMESPACE="atp-${ENVIRONMENT}"
echo "✅ Post-Deployment Validation for $ENVIRONMENT"
# Health checks
echo "1. Health Checks"
kubectl get pods -n $NAMESPACE -l app=atp-ingestion
kubectl wait --for=condition=ready pod -l app=atp-ingestion -n $NAMESPACE --timeout=300s
# Service endpoints
echo "2. Service Endpoints"
kubectl get endpoints atp-ingestion -n $NAMESPACE
# Metrics
echo "3. Metrics"
kubectl exec -n $NAMESPACE deployment/atp-ingestion -- \
curl -s http://localhost:9090/metrics | grep -q "http_requests_total" && \
echo "✅ Metrics endpoint responding"
# Smoke tests
echo "4. Smoke Tests"
./scripts/run-smoke-tests.sh --environment $ENVIRONMENT
echo "✅ Post-deployment validation complete"
Hotfix Workflow¶
Hotfix Workflow Diagram:
graph TD
A[Production Issue Detected] --> B[Create Hotfix Branch<br/>from production]
B --> C[Implement Fix<br/>in Hotfix Branch]
C --> D[PR to Production<br/>Expedited Review]
D --> E[2 Approvers<br/>Required]
E --> F[Merge to Production]
F --> G[Deploy to Production]
G --> H[Verify Fix]
H --> I[Back-merge to<br/>dev/test/staging]
style A fill:#ffcccc
style B fill:#ff9999
style F fill:#ffcccc
style I fill:#90EE90
Creating Hotfix Branch from Production¶
Hotfix Branch Creation:
# 1. Checkout production branch
git checkout production
git pull origin production
# 2. Create hotfix branch
git checkout -b hotfix/atp-gateway-security-patch-CVE-2024-1234
# 3. Push hotfix branch
git push -u origin hotfix/atp-gateway-security-patch-CVE-2024-1234
# 4. Apply fix
vim apps/atp-gateway/base/deployment.yaml
git add apps/atp-gateway/base/deployment.yaml
git commit -S -m "fix(gateway): patch security vulnerability CVE-2024-1234
URGENT: Security patch for authentication bypass vulnerability.
Related to: ATP-9999 (Critical Security Issue)"
Expedited Approval Process¶
Hotfix PR Creation:
# Create hotfix PR with expedited flag
az repos pr create \
--source-branch hotfix/atp-gateway-security-patch-CVE-2024-1234 \
--target-branch production \
--title "🚨 HOTFIX: Security patch CVE-2024-1234" \
--description "Urgent security fix. Requires expedited review." \
--work-items 9999 \
--reviewers "architect-team@connectsoft.example;sre-lead@connectsoft.example" \
--auto-complete false \
--bypass-policy false # Still requires 2 approvers
Expedited Review Checklist:
- ✅ Security issue verified (CVE, vulnerability scan)
- ✅ Fix validated (local testing, security review)
- ✅ Impact assessment completed
- ✅ Rollback plan documented
- ✅ 2 approvers from architecture/SRE teams
Testing in Hotfix Environment¶
Hotfix Testing:
# Deploy to hotfix test environment
kubectl apply -f apps/atp-gateway/overlays/hotfix-test/ \
--namespace atp-hotfix-test
# Run critical path tests
./scripts/run-critical-path-tests.sh --environment hotfix-test
# Security validation
./scripts/run-security-tests.sh --environment hotfix-test \
--focus CVE-2024-1234
Fast-Track Merge to Production¶
Hotfix Merge Process:
# After approval, merge hotfix
az repos pr update \
--id <PR_ID> \
--status completed \
--squash true
# Verify merge
git checkout production
git pull origin production
git log --oneline -5
# Tag hotfix release
git tag -a "hotfix/v1.2.4-CVE-2024-1234" \
-m "Hotfix: Security patch CVE-2024-1234"
git push origin --tags
Back-Merge to Dev/Test/Staging¶
Back-Merge Process:
# Back-merge to staging
git checkout staging
git pull origin staging
git merge production --no-ff -m "Merge hotfix from production: CVE-2024-1234"
git push origin staging
# Back-merge to test
git checkout test
git pull origin test
git merge production --no-ff -m "Merge hotfix from production: CVE-2024-1234"
git push origin test
# Back-merge to dev
git checkout dev
git pull origin dev
git merge production --no-ff -m "Merge hotfix from production: CVE-2024-1234"
git push origin dev
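Since the three back-merges are identical, they can be wrapped in a small loop; a minimal sketch:
# Back-merge the production hotfix into every downstream branch
for branch in staging test dev; do
  git checkout "$branch"
  git pull origin "$branch"
  git merge production --no-ff -m "Merge hotfix from production: CVE-2024-1234"
  git push origin "$branch"
done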
Preview Environments¶
Ephemeral Namespace per PR¶
Preview Environment Configuration:
# .azuredevops/pipelines/preview-environment.yml
trigger: none
pr:
branches:
include:
- feature/*
- bugfix/*
pool:
vmImage: 'ubuntu-latest'
variables:
previewNamespace: 'preview-pr-$(System.PullRequest.PullRequestId)'
stages:
- stage: CreatePreview
displayName: 'Create Preview Environment'
jobs:
- job: ProvisionPreview
steps:
- script: |
# Create preview namespace
kubectl create namespace $(previewNamespace) \
--dry-run=client -o yaml | kubectl apply -f -
# Label namespace
kubectl label namespace $(previewNamespace) \
environment=preview \
pr-id=$(System.PullRequest.PullRequestId) \
managed-by=fluxcd
echo "✅ Preview namespace created: $(previewNamespace)"
displayName: 'Create preview namespace'
- task: Kubernetes@1
inputs:
connectionType: 'Azure Resource Manager'
azureSubscriptionEndpoint: 'ATP-AKS-Connection'
azureResourceGroup: 'ATP-Dev-EUS-RG'
kubernetesCluster: 'atp-dev-eus-aks'
namespace: '$(previewNamespace)'
command: 'apply'
arguments: |
-f apps/atp-ingestion/base/ \
--namespace $(previewNamespace)
displayName: 'Deploy to preview namespace'
- script: |
# Wait for deployment
kubectl wait --for=condition=available \
deployment/atp-ingestion \
-n $(previewNamespace) \
--timeout=300s
# Get preview URL
PREVIEW_URL=$(kubectl get ingress atp-ingestion \
-n $(previewNamespace) \
-o jsonpath='{.spec.rules[0].host}')
echo "##vso[task.setvariable variable=PreviewUrl]$PREVIEW_URL"
echo "✅ Preview environment ready: https://$PREVIEW_URL"
displayName: 'Wait for deployment'
- script: |
# Record an approval vote; note: `az repos pr set-vote` has no comment
# flag, so the preview URL is surfaced in the pipeline log (posting a PR
# comment would require the pull request threads REST API)
az repos pr set-vote \
--id $(System.PullRequest.PullRequestId) \
--vote approved
echo "Preview environment: https://$(PreviewUrl)"
displayName: 'Vote on PR and log preview URL'
- stage: CleanupPreview
displayName: 'Cleanup Preview Environment'
condition: always()
jobs:
- job: DeletePreview
steps:
- script: |
# Delete preview namespace
kubectl delete namespace $(previewNamespace) --ignore-not-found=true
echo "✅ Preview namespace deleted: $(previewNamespace)"
displayName: 'Delete preview namespace'
Automatic Provisioning on PR Creation¶
PR Webhook Trigger:
# Azure Pipeline: Trigger on PR creation
resources:
webhooks:
- webhook: pr-webhook
connection: AzureReposWebhook
filters:
- path: eventType
value: git.pullrequest.created
Testing Isolated Changes¶
Preview Environment Testing:
# Test preview environment
PREVIEW_URL="https://preview-pr-123.ingestion.atp.connectsoft.example"
# Health check
curl "$PREVIEW_URL/health/ready"
# Smoke tests
curl "$PREVIEW_URL/api/v1/audit/records" \
-H "Authorization: Bearer $PREVIEW_API_KEY"
# Integration tests
./scripts/run-integration-tests.sh \
--base-url "$PREVIEW_URL" \
--environment preview
Auto-Deletion After PR Close¶
Cleanup on PR Close:
# Azure Pipeline: Cleanup on PR close
resources:
webhooks:
- webhook: pr-webhook
connection: AzureReposWebhook
filters:
- path: eventType
value: git.pullrequest.closed
stages:
- stage: Cleanup
displayName: 'Cleanup Preview Environment'
jobs:
- job: DeletePreview
steps:
- script: |
# Extract the PR number (assumes Build.SourceBranch is refs/pull/<id>/merge)
PR_ID=$(echo "$(Build.SourceBranch)" | sed -E 's|refs/pull/([0-9]+)/.*|\1|')
PREVIEW_NAMESPACE="preview-pr-${PR_ID}"
kubectl delete namespace $PREVIEW_NAMESPACE --ignore-not-found=true
echo "✅ Preview namespace $PREVIEW_NAMESPACE deleted"
displayName: 'Delete preview namespace'
Git Commit Message Conventions¶
Conventional Commits Format¶
Conventional Commits Specification: commit messages follow <type>(<scope>): <subject>, with an optional body and a footer for work item references and breaking-change notes (see the .gitmessage template below).
Commit Types:
| Type | Description | Example |
|---|---|---|
| feat | New feature | feat(ingestion): add gRPC endpoint |
| fix | Bug fix | fix(query): resolve memory leak |
| docs | Documentation | docs(gitops): update deployment guide |
| style | Code style (formatting) | style(*): format YAML files |
| refactor | Code refactoring | refactor(gateway): simplify auth logic |
| test | Tests | test(integrity): add unit tests |
| chore | Maintenance | chore(infra): update Helm charts |
| perf | Performance | perf(query): optimize database queries |
| ci | CI/CD | ci(pipelines): add validation stage |
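The format can be enforced locally with a Git commit-msg hook; a minimal sketch whose regular expression covers the types in the table above:
#!/bin/bash
# .git/hooks/commit-msg: reject messages that don't follow Conventional Commits
MSG_FILE="$1"
PATTERN='^(feat|fix|docs|style|refactor|test|chore|perf|ci)(\([a-z0-9*-]+\))?!?: .+'
if ! head -n 1 "$MSG_FILE" | grep -qE "$PATTERN"; then
  echo "❌ Commit message must follow '<type>(<scope>): <subject>'"
  exit 1
fi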
Linking to Work Items¶
Work Item References:
# Link to Azure DevOps work item
git commit -m "feat(ingestion): add gRPC endpoint
Related to: ATP-1234"
# Link multiple work items
git commit -m "fix(query): resolve multiple issues
Fixes: ATP-1234, ATP-5678
Closes: ATP-9999"
# Reference work item in footer
git commit -m "feat(gateway): implement OAuth 2.0
Implements ATP-2345
See also: ATP-2346"
Semantic Prefix Examples¶
Complete Commit Message Examples:
# Feature with scope
git commit -m "feat(ingestion): add Redis cache support
- Add Redis connection configuration
- Implement cache layer for audit records
- Add cache health checks
Related to: ATP-1234"
# Breaking change
git commit -m "feat(gateway)!: remove legacy authentication
BREAKING CHANGE: Legacy API key authentication removed.
Migrate to OAuth 2.0 before deploying this change.
Migration guide: https://docs.connectsoft.example/migration/oauth
Closes: ATP-5678"
# Hotfix
git commit -m "fix(gateway): patch security vulnerability CVE-2024-1234
URGENT: Security patch for authentication bypass vulnerability.
Related to: ATP-9999 (Critical Security Issue)"
# Configuration change
git commit -m "chore(infra): update resource limits for production
- Increase CPU limit to 2000m
- Increase memory limit to 2Gi
- Update HPA min replicas to 5
Related to: ATP-3456"
Commit Message Templates¶
Git Commit Template (.gitmessage):
# <type>(<scope>): <subject>
#
# <body>
#
# <footer>
#
# Type: feat, fix, docs, style, refactor, test, chore, perf, ci
# Scope: ingestion, query, gateway, integrity, export, policy, search, infra
#
# Examples:
# feat(ingestion): add gRPC endpoint
# fix(query): resolve memory leak
# docs(gitops): update deployment guide
#
# Related work items: ATP-1234
Configure Git to Use Template:
# Set commit template
git config --global commit.template .gitmessage
# Or per repository
git config commit.template .gitmessage
Release Tagging Strategy¶
Tagging Releases for Production¶
Production Release Tagging:
# Tag production release
git checkout production
git pull origin production
# Create annotated tag
git tag -a "v1.2.3" \
-m "Release v1.2.3
Features:
- Add gRPC endpoint to ingestion service
- Implement Redis cache for query service
- Security enhancements
Breaking Changes:
- None
Related work items: ATP-1234, ATP-5678"
# Push tags
git push origin --tags
Tag Verification:
# List tags
git tag -l "v*"
# Show tag details
git show v1.2.3
# Verify tag signature (if GPG signed)
git tag -v v1.2.3
Service-Specific vs Environment-Wide Tags¶
Service-Specific Tags:
# Tag specific service version
git tag -a "atp-ingestion/v1.2.3" \
-m "ATP Ingestion Service v1.2.3"
git tag -a "atp-query/v1.3.0" \
-m "ATP Query Service v1.3.0"
Environment-Wide Tags:
# Tag environment release
git tag -a "production/2024-01-15" \
-m "Production Release 2024-01-15
Services:
- atp-ingestion: v1.2.3
- atp-query: v1.3.0
- atp-gateway: v1.1.5"
git tag -a "staging/2024-01-10" \
-m "Staging Release 2024-01-10"
Tag Naming Conventions¶
Tag Naming Standards:
| Tag Type | Format | Example | Use Case |
|---|---|---|---|
| Release | v{MAJOR}.{MINOR}.{PATCH} | v1.2.3 | Production releases |
| Pre-release | v{MAJOR}.{MINOR}.{PATCH}-{PRERELEASE} | v1.2.3-rc1 | Release candidates |
| Service | {service}/v{VERSION} | atp-ingestion/v1.2.3 | Service-specific releases |
| Environment | {env}/{DATE} | production/2024-01-15 | Environment snapshots |
| Hotfix | hotfix/v{VERSION}-{ISSUE} | hotfix/v1.2.4-CVE-2024-1234 | Hotfixes |
| Promotion | promote-to-{env}/{VERSION} | promote-to-test/v1.2.3 | Promotion triggers |
Automated Tag Creation¶
Automated Tagging in CI/CD:
# Azure Pipeline: Auto-tag on production merge
trigger:
branches:
include:
- production
pool:
vmImage: 'ubuntu-latest'
stages:
- stage: TagRelease
displayName: 'Tag Production Release'
jobs:
- job: CreateTag
steps:
- script: |
# Extract version from the standard app.kubernetes.io/version label in the manifest
VERSION=$(grep -E 'app.kubernetes.io/version:' apps/atp-ingestion/base/deployment.yaml | head -1 | awk '{print $2}' | tr -d '"')
# Create tag
git config user.name "Azure DevOps"
git config user.email "devops@connectsoft.example"
git tag -a "v$VERSION" \
-m "Release v$VERSION
Automated release from commit $(Build.SourceVersion)
Pipeline: $(Build.BuildNumber)"
git push origin "v$VERSION"
echo "✅ Tagged release: v$VERSION"
displayName: 'Create and push tag'
Rollback via Git Revert¶
Simple Rollback (Single Service)¶
Single Service Rollback:
# 1. Identify commit to revert
git log --oneline production | grep "atp-ingestion"
# abc123 feat(ingestion): add gRPC endpoint
# 2. Revert commit
git checkout production
git pull origin production
git revert abc123 --no-edit
# 3. Push revert commit
git push origin production
# 4. Verify rollback
git log --oneline -5 production
# def456 Revert "feat(ingestion): add gRPC endpoint"
# abc123 feat(ingestion): add gRPC endpoint
Complex Rollback (Multiple Services)¶
Multiple Service Rollback:
# 1. Identify commits to revert
git log --oneline production | grep -E "(ingestion|query|gateway)" | head -5
# abc123 feat(ingestion): add gRPC endpoint
# def456 feat(query): add Redis cache
# ghi789 feat(gateway): update authentication
# 2. Create rollback branch
git checkout production
git pull origin production
git checkout -b rollback/2024-01-15-multiple-services
# 3. Revert commits (newest first)
git revert ghi789 --no-edit # Gateway
git revert def456 --no-edit # Query
git revert abc123 --no-edit # Ingestion
# 4. Test rollback
kubectl apply --dry-run=client -f apps/ --recursive
# 5. Create PR for rollback
az repos pr create \
--source-branch rollback/2024-01-15-multiple-services \
--target-branch production \
--title "ROLLBACK: Multiple services 2024-01-15" \
--description "Revert changes to ingestion, query, and gateway services"
# 6. After approval, merge
az repos pr update --id <PR_ID> --status completed
Git Revert vs Reset¶
Git Revert vs Reset Comparison:
| Method | Command | History | Safety | Use Case |
|---|---|---|---|---|
| Revert | git revert | Preserves (creates new commit) | ✅ Safe (non-destructive) | Production rollback |
| Reset | git reset --hard | Rewrites (destroys commits) | ❌ Dangerous (destructive) | Development only |
Git Reset (Development Only):
# ⚠️ WARNING: Only use in development branches!
git checkout dev
git reset --hard HEAD~3 # Remove last 3 commits
git push --force origin dev # Force push (destructive!)
Git Revert (Production):
# ✅ Safe for production
git checkout production
git revert abc123 # Creates new commit undoing abc123
git push origin production # Safe push
Rollback Validation¶
Rollback Validation Script:
#!/bin/bash
# scripts/validate-rollback.sh
set -euo pipefail
COMMIT_TO_REVERT="${1:-HEAD}"
NAMESPACE="${2:-atp-production}"
echo "🔄 Validating rollback for commit $COMMIT_TO_REVERT..."
# 1. Preview rollback changes
git revert --no-commit $COMMIT_TO_REVERT
git diff --stat
# 2. Validate manifests
find apps/ -name "*.yaml" -path "*/base/*" | while read manifest; do
kubeval "$manifest" || exit 1
done
# 3. Dry-run apply
kubectl apply --dry-run=client -f apps/ --recursive
# 4. Check for breaking changes
git log $COMMIT_TO_REVERT -1 --pretty=format:"%B" | grep -i "BREAKING" && \
echo "⚠️ WARNING: Reverting a breaking change!" || \
echo "✅ No breaking changes detected"
# 5. Restore state
git reset --hard HEAD
echo "✅ Rollback validation complete"
Execute Rollback:
# Validate rollback
./scripts/validate-rollback.sh abc123 atp-production
# Execute rollback
git checkout production
git pull origin production
git revert abc123 --no-edit
# Apply to cluster
kubectl apply -f apps/atp-ingestion/overlays/production/
# Verify rollback
kubectl rollout status deployment/atp-ingestion -n atp-production
kubectl get pods -n atp-production -l app=atp-ingestion
# Run smoke tests
./scripts/run-smoke-tests.sh --environment production
echo "✅ Rollback completed and verified"
Summary: Git Workflow & Environment Promotion¶
- Feature Branch Workflow: Git-centric development with conventional commits and GPG signing
- Pull Request Process: Comprehensive PR templates, automated validation, and approval workflows
- Automated PR Validation: YAML linting, security scanning, dry-run validation, breaking change detection
- Merge Strategies: Squash merge (production), merge commit (test), rebase (dev)
- Environment Promotion: Automated (dev→test), manual (test→staging→production) with CAB approval
- Hotfix Workflow: Expedited process with back-merge to all environments
- Preview Environments: Ephemeral namespaces per PR for isolated testing
- Commit Conventions: Conventional commits with work item linking
- Release Tagging: Semantic versioning with service-specific and environment-wide tags
- Rollback Procedures: Git revert for safe production rollbacks with validation
Azure Pipelines to GitOps Handoff¶
Purpose: Define how Azure Pipelines (CI) hand off to the GitOps workflow by automating artifact publishing, manifest updates, and Git commits, ensuring a clear separation of concerns between build/test (CI) and deployment/reconciliation (GitOps).
Separation of Concerns¶
CI Pipeline Responsibilities (Build, Test, Publish)¶
CI Pipeline Scope (Azure Pipelines):
| Responsibility | Description | Examples |
|---|---|---|
| Source Code Build | Compile, package applications | dotnet build, npm build |
| Unit Testing | Run unit tests, code coverage | dotnet test, jest |
| Integration Testing | Test service interactions | Test containers, API tests |
| Security Scanning | SAST, DAST, dependency scanning | Snyk, Trivy, OWASP ZAP |
| Artifact Creation | Build Docker images, NuGet packages | docker build, dotnet pack |
| Artifact Publishing | Push to registry (ACR, NuGet feed) | docker push, helm push |
| SBOM Generation | Software Bill of Materials | Syft, SPDX format |
| Metadata Recording | Build provenance, vulnerability reports | In-toto attestations |
| GitOps Manifest Update | Commit image tag updates to GitOps repo | Automated Git commits |
CI Pipeline Boundaries:
# CI Pipeline Responsibilities
✅ Build application code
✅ Run tests (unit, integration, security)
✅ Build and push container images to ACR
✅ Generate and publish SBOM
✅ Update GitOps repository with new image tags
✅ Trigger FluxCD sync (via webhook or polling)
❌ NOT: Deploy directly to Kubernetes
❌ NOT: Manage Kubernetes cluster state
❌ NOT: Handle reconciliation loops
❌ NOT: Monitor cluster health
GitOps Responsibilities (Deploy, Reconcile, Monitor)¶
GitOps Scope (FluxCD on AKS):
| Responsibility | Description | Examples |
|---|---|---|
| Git Repository Watch | Poll Git repository for changes | FluxCD Source Controller |
| Manifest Rendering | Render Kustomize/Helm manifests | FluxCD Kustomize/Helm Controller |
| Cluster Deployment | Apply manifests to Kubernetes | kubectl apply (via FluxCD) |
| State Reconciliation | Detect and correct drift | Continuous reconciliation loop |
| Health Monitoring | Monitor deployment health | FluxCD health checks |
| Rollback Management | Revert to previous Git commits | Git revert operations |
GitOps Boundaries:
# GitOps Responsibilities
✅ Watch Git repository for manifest changes
✅ Render and apply Kubernetes manifests
✅ Reconcile cluster state with Git
✅ Detect and correct configuration drift
✅ Monitor deployment health
✅ Rollback via Git operations
❌ NOT: Build application code
❌ NOT: Run unit/integration tests
❌ NOT: Build container images
❌ NOT: Publish artifacts to registries
Clear Handoff Point: Artifact Publishing¶
Handoff Architecture:
graph LR
A[Source Code<br/>Repository] -->|trigger| B[Azure Pipelines<br/>CI Pipeline]
B -->|build + test| C[Container Image<br/>+ SBOM]
C -->|push| D[Azure Container<br/>Registry ACR]
B -->|update manifests| E[GitOps Repository<br/>Azure Repos]
E -->|watch| F[FluxCD<br/>Source Controller]
F -->|fetch artifacts| G[FluxCD<br/>Kustomize Controller]
G -->|deploy| H[AKS Cluster<br/>Production]
style B fill:#90EE90
style D fill:#FFE5B4
style E fill:#90EE90
style F fill:#FFE5B4
style G fill:#FFE5B4
style H fill:#ffcccc
Handoff Criteria:
- ✅ Artifact Published: Image pushed to ACR with immutable tag
- ✅ SBOM Generated: Software Bill of Materials published
- ✅ Vulnerability Scan: Security scan results available
- ✅ Manifest Updated: GitOps repository contains new image tag
- ✅ Commit Signed: Git commit signed (for production)
- ✅ CI Tests Passing: All CI validation gates passed
Handoff Checklist:
## CI → GitOps Handoff Checklist
### Artifacts
- [ ] Container image built and pushed to ACR
- [ ] Image tagged with version + commit SHA (immutable)
- [ ] SBOM generated and published
- [ ] Vulnerability scan completed and results recorded
### Manifest Updates
- [ ] Image tag updated in GitOps repository
- [ ] Kustomize/Helm manifest files updated
- [ ] Changes committed to Git
- [ ] Commit signed (production only)
### Validation
- [ ] All CI tests passing
- [ ] Security scans passing
- [ ] Manifest validation passing
- [ ] Build provenance recorded
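The final step of the handoff, prompting FluxCD to fetch the new commit without waiting for its polling interval, can be triggered with the flux CLI; a sketch assuming the GitRepository source is named atp-gitops and the Kustomization atp-ingestion:
# Force an immediate fetch of the GitOps repository
flux reconcile source git atp-gitops
# Reconcile the Kustomization that deploys the service, refreshing its source first
flux reconcile kustomization atp-ingestion --with-source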
Azure Pipeline Stages¶
Build: Compile, Unit Test¶
Build Stage:
# azure-pipelines.yml
stages:
- stage: Build
displayName: 'Build and Test'
jobs:
- job: BuildApplication
displayName: 'Build ATP Ingestion Service'
pool:
vmImage: 'ubuntu-latest'
variables:
- group: ATP-Common-Variables
- name: BuildConfiguration
value: 'Release'
- name: ServiceName
value: 'atp-ingestion'
steps:
# Checkout source code
- checkout: self
fetchDepth: 0 # Full history for version calculation
# Setup .NET SDK
- task: UseDotNet@2
inputs:
packageType: 'sdk'
version: '8.0.x'
# Restore dependencies
- script: |
dotnet restore src/ConnectSoft.ATP.Ingestion/ConnectSoft.ATP.Ingestion.csproj
displayName: 'Restore NuGet packages'
# Build application
- script: |
dotnet build src/ConnectSoft.ATP.Ingestion/ConnectSoft.ATP.Ingestion.csproj \
--configuration $(BuildConfiguration) \
--no-restore \
-p:Version=$(Build.BuildNumber)
displayName: 'Build application'
# Run unit tests
- script: |
dotnet test src/ConnectSoft.ATP.Ingestion.Tests/ConnectSoft.ATP.Ingestion.Tests.csproj \
--configuration $(BuildConfiguration) \
--no-build \
--collect:"XPlat Code Coverage" \
--results-directory $(Agent.TempDirectory)/test-results \
--logger "trx;LogFileName=test-results.trx"
displayName: 'Run unit tests'
continueOnError: false
# Publish test results
- task: PublishTestResults@2
condition: always()
inputs:
testResultsFormat: 'VSTest'
testResultsFiles: '$(Agent.TempDirectory)/test-results/**/*.trx'
testRunTitle: 'Unit Tests - $(ServiceName)'
# Publish code coverage
- task: PublishCodeCoverageResults@1
condition: always()
inputs:
codeCoverageTool: 'Cobertura'
summaryFileLocation: '$(Agent.TempDirectory)/test-results/**/coverage.cobertura.xml'
Test: Integration Test, Security Scan¶
Test Stage:
- stage: Test
displayName: 'Integration Tests and Security Scans'
dependsOn: Build
jobs:
- job: IntegrationTests
displayName: 'Integration Tests'
pool:
vmImage: 'ubuntu-latest'
services:
postgres: postgres
redis: redis
steps:
- checkout: self
- task: UseDotNet@2
inputs:
packageType: 'sdk'
version: '8.0.x'
- script: |
dotnet test src/ConnectSoft.ATP.Ingestion.IntegrationTests/ \
--configuration Release \
--logger "trx;LogFileName=integration-test-results.trx"
displayName: 'Run integration tests'
env:
ConnectionStrings__Database: $(PostgresConnectionString)
ConnectionStrings__Redis: $(RedisConnectionString)
- task: PublishTestResults@2
condition: always()
inputs:
testResultsFormat: 'VSTest'
testResultsFiles: '**/integration-test-results.trx'
testRunTitle: 'Integration Tests - $(ServiceName)'
- job: SecurityScan
displayName: 'Security Scanning'
pool:
vmImage: 'ubuntu-latest'
steps:
- checkout: self
# SAST (Static Application Security Testing)
- task: SnykSecurityScan@1
inputs:
serviceConnectionEndpoint: 'Snyk-Service-Connection'
testType: 'app'
severityThreshold: 'high'
# Dependency Vulnerability Scan
- script: |
dotnet list package --vulnerable --include-transitive
displayName: 'Check for vulnerable NuGet packages'
# Container Image Scan (after build)
- script: |
trivy image --severity HIGH,CRITICAL \
connectsoft.azurecr.io/atp/ingestion:$(Build.BuildNumber) \
--format json \
--output trivy-scan-results.json
displayName: 'Scan container image with Trivy'
condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'))
Publish: Push to ACR, Generate SBOM¶
Publish Stage:
- stage: Publish
displayName: 'Publish Artifacts'
dependsOn: Test
condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'))
jobs:
- job: BuildAndPushImage
displayName: 'Build and Push Container Image'
pool:
vmImage: 'ubuntu-latest'
variables:
- name: ImageRepository
value: 'connectsoft.azurecr.io/atp/ingestion'
- name: ImageTag
value: '$(Build.BuildNumber)-$(Build.SourceVersion)' # v1.2.3-abc123d
steps:
- checkout: self
# Login to ACR
- task: Docker@2
displayName: 'Login to Azure Container Registry'
inputs:
command: 'login'
containerRegistry: 'ConnectSoft-ACR'
# Build Docker image
- task: Docker@2
displayName: 'Build Docker image'
inputs:
command: 'build'
containerRegistry: 'ConnectSoft-ACR'
repository: 'atp/ingestion'
dockerfile: 'src/ConnectSoft.ATP.Ingestion/Dockerfile'
# Note: '#' is not a comment inside a YAML literal block, so 'latest' is
# listed bare below; apply it only on dev-branch builds
tags: |
$(ImageTag)
latest
buildContext: '$(Build.SourcesDirectory)'
arguments: |
--build-arg BUILD_VERSION=$(Build.BuildNumber)
--build-arg BUILD_COMMIT=$(Build.SourceVersion)
--build-arg BUILD_DATE=$(Build.BuildId)
# Generate SBOM
- script: |
# Install Syft
curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
# Generate SBOM
syft packages docker:$(ImageRepository):$(ImageTag) \
--output spdx-json \
--file sbom-$(ImageTag).spdx.json
echo "✅ SBOM generated: sbom-$(ImageTag).spdx.json"
displayName: 'Generate SBOM (Software Bill of Materials)'
# Scan image for vulnerabilities
- script: |
trivy image \
--format json \
--output trivy-$(ImageTag).json \
--severity HIGH,CRITICAL \
$(ImageRepository):$(ImageTag)
displayName: 'Scan image for vulnerabilities'
# Push image to ACR
- task: Docker@2
displayName: 'Push Docker image to ACR'
inputs:
command: 'push'
containerRegistry: 'ConnectSoft-ACR'
repository: 'atp/ingestion'
tags: |
$(ImageTag)
# Attach SBOM and scan results as pipeline artifacts
- task: PublishPipelineArtifact@1
displayName: 'Publish SBOM'
inputs:
targetPath: 'sbom-$(ImageTag).spdx.json'
artifactName: 'sbom-$(ImageTag)'
publishLocation: 'pipeline'
- task: PublishPipelineArtifact@1
displayName: 'Publish vulnerability scan results'
inputs:
targetPath: 'trivy-$(ImageTag).json'
artifactName: 'vulnerability-scan-$(ImageTag)'
publishLocation: 'pipeline'
# Attach build metadata to the image. Note: `az acr repository update` does
# not accept arbitrary metadata; one alternative (assumed here, requires the
# ORAS CLI on the agent) is attaching the SBOM with OCI annotations.
- script: |
oras attach $(ImageRepository):$(ImageTag) \
--artifact-type application/vnd.connectsoft.build.metadata \
--annotation "build.version=$(Build.BuildNumber)" \
--annotation "build.commit=$(Build.SourceVersion)" \
--annotation "build.pipeline=$(Build.BuildUri)" \
sbom-$(ImageTag).spdx.json
displayName: 'Attach metadata to image'
Update GitOps: Commit Manifest Changes¶
Update GitOps Stage:
- stage: UpdateGitOps
displayName: 'Update GitOps Repository'
dependsOn: Publish
condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'))
jobs:
- job: UpdateManifests
displayName: 'Update GitOps Manifests'
pool:
vmImage: 'ubuntu-latest'
variables:
- name: GitOpsRepoUrl
value: 'https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops'
- name: TargetBranch
value: ${{ iif(eq(variables['Build.SourceBranch'], 'refs/heads/main'), 'production', 'dev') }}
steps:
# Checkout GitOps repository
- checkout: git://ATP/atp-gitops@$(TargetBranch)
  displayName: 'Checkout GitOps repository'
  path: gitops-repo
  persistCredentials: true # keep System.AccessToken available for the later git push
# Install required tools
- script: |
# Install kustomize
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /usr/local/bin/
# Install yq (YAML processor)
wget -q https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O yq
chmod +x yq
sudo mv yq /usr/local/bin/
displayName: 'Install tools (kustomize, yq)'
# Update Kustomize image tags
- script: |
cd gitops-repo
# Update image tag in the Kustomize base
# (kustomize edit operates on the kustomization.yaml in the current directory)
(cd apps/atp-ingestion/base && kustomize edit set image \
  connectsoft.azurecr.io/atp/ingestion=$(ImageRepository):$(ImageTag))
# Update image tag in all overlays (dev, test, staging, production)
for overlay in dev test staging production; do
  if [ -d "apps/atp-ingestion/overlays/$overlay" ]; then
    (cd "apps/atp-ingestion/overlays/$overlay" && kustomize edit set image \
      connectsoft.azurecr.io/atp/ingestion=$(ImageRepository):$(ImageTag))
  fi
done
echo "✅ Updated Kustomize image tags"
displayName: 'Update Kustomize image tags'
# Update Helm values files
- script: |
cd gitops-repo
# Update image tag in Helm values for target branch
yq eval -i '.image.tag = "$(ImageTag)"' \
apps/atp-ingestion/helm/values-$(TargetBranch).yaml
# Also update default values.yaml if exists
if [ -f "apps/atp-ingestion/helm/values.yaml" ]; then
yq eval -i '.image.tag = "$(ImageTag)"' \
apps/atp-ingestion/helm/values.yaml
fi
echo "✅ Updated Helm values files"
displayName: 'Update Helm values files'
# Validate updated manifests
- script: |
cd gitops-repo
# Validate Kustomize builds
kustomize build apps/atp-ingestion/overlays/$(TargetBranch)/ > /dev/null
echo "✅ Kustomize build validation passed"
# Validate Helm templates
helm template atp-ingestion apps/atp-ingestion/helm/ \
--values apps/atp-ingestion/helm/values-$(TargetBranch).yaml > /dev/null
echo "✅ Helm template validation passed"
displayName: 'Validate updated manifests'
# Commit and push changes
- script: |
cd gitops-repo
# Configure Git
git config user.name "Azure DevOps Pipeline"
git config user.email "azure-devops@connectsoft.example"
# Check for changes
if [ -z "$(git status --porcelain)" ]; then
echo "ℹ️ No changes to commit"
exit 0
fi
# Stage changes
git add apps/atp-ingestion/
# Commit with conventional commit format
git commit -m "chore(ingestion): update image tag to $(ImageTag)
Automated update from CI pipeline:
- Image: $(ImageRepository):$(ImageTag)
- Build: $(Build.BuildNumber)
- Commit: $(Build.SourceVersion)
- Pipeline: $(Build.BuildUri)
Related to: $(System.PullRequest.PullRequestId)"
# Push to GitOps repository
git push origin $(TargetBranch)
echo "✅ Pushed manifest updates to GitOps repository"
displayName: 'Commit and push to GitOps repository'
env:
SYSTEM_ACCESSTOKEN: $(System.AccessToken)
Image Tag Generation¶
Semantic Version from Git Tag¶
Version Extraction Script:
#!/bin/bash
# scripts/extract-version.sh
set -euo pipefail
# Extract version from Git tag
VERSION_TAG=$(git describe --tags --exact-match 2>/dev/null || echo "")
if [ -n "$VERSION_TAG" ]; then
# Use version from Git tag (e.g., v1.2.3)
VERSION=$(echo "$VERSION_TAG" | sed 's/^v//')
echo "✅ Version from Git tag: $VERSION"
else
# Fallback: Use version from project file or build number
VERSION=$(grep -rhoP '<Version>\K[^<]+' src --include='*.csproj' | head -1)
if [ -z "$VERSION" ]; then
# Last resort: Use build number format
VERSION="${BUILD_BUILDNUMBER:-1.0.0}"
fi
echo "⚠️ Version from project file/build number: $VERSION"
fi
echo "##vso[task.setvariable variable=Version]$VERSION"
Short Commit SHA for Traceability¶
Commit SHA Extraction:
# Extract short commit SHA (7 characters)
SHORT_SHA=$(git rev-parse --short=7 HEAD)
echo "Commit SHA: $SHORT_SHA"
# Example output: abc123d
Build Number for Uniqueness¶
Build Number Format:
# Build number format (Azure DevOps default): {yyyyMMdd}.{revision}
BUILD_NUMBER="${BUILD_BUILDNUMBER}" # e.g., 20240115.1
# Or use Build.BuildId (unique incrementing number)
BUILD_ID="${BUILD_BUILDID}" # e.g., 12345
Tag Format: v1.2.3-abc123d¶
Complete Tag Generation:
# Azure Pipeline: Generate image tag
- script: |
# Extract version from Git tag or project file
VERSION_TAG=$(git describe --tags --exact-match 2>/dev/null || echo "")
if [ -n "$VERSION_TAG" ]; then
VERSION=$(echo "$VERSION_TAG" | sed 's/^v//')
else
# Extract from .csproj, falling back to a default version
VERSION=$(grep -rhoP '<Version>\K[^<]+' src --include='*.csproj' | head -1)
VERSION="${VERSION:-1.0.0}"
fi
# Extract short commit SHA
SHORT_SHA=$(git rev-parse --short=7 HEAD)
# Generate image tag: v1.2.3-abc123d
IMAGE_TAG="v${VERSION}-${SHORT_SHA}"
echo "Image tag: $IMAGE_TAG"
echo "##vso[task.setvariable variable=ImageTag]$IMAGE_TAG"
echo "##vso[task.setvariable variable=Version]$VERSION"
echo "##vso[task.setvariable variable=ShortSha]$SHORT_SHA"
displayName: 'Generate image tag'
Tag Format Examples:
| Source | Version | Commit SHA | Image Tag | Notes |
|---|---|---|---|---|
| Git Tag | v1.2.3 | abc123d | v1.2.3-abc123d | Semantic version + SHA |
| Project File | 1.2.3 | abc123d | v1.2.3-abc123d | Version from .csproj + SHA |
| Build Number | 20240115.1 | abc123d | v20240115.1-abc123d | Build number + SHA |
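Cutting a release is then just a matter of pushing an annotated Git tag, which the version-extraction step above resolves; a minimal example:
# Tag the release commit and push the tag so CI derives the version from it
git tag -a v1.2.3 -m "ATP release v1.2.3"
git push origin v1.2.3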
Automated Manifest Update¶
Pipeline Script to Update Image Tag in GitOps Repo¶
Manifest Update Script (scripts/update-gitops-manifests.sh):
#!/bin/bash
# scripts/update-gitops-manifests.sh
set -euo pipefail
SERVICE_NAME="${1:-atp-ingestion}"
IMAGE_REPOSITORY="${2:-connectsoft.azurecr.io/atp/ingestion}"
IMAGE_TAG="${3:-latest}"
TARGET_BRANCH="${4:-dev}"
GITOPS_REPO_PATH="${5:-gitops-repo}"
echo "🔄 Updating GitOps manifests for $SERVICE_NAME..."
echo " Image: $IMAGE_REPOSITORY:$IMAGE_TAG"
echo " Branch: $TARGET_BRANCH"
cd "$GITOPS_REPO_PATH"
# Update Kustomize image tags
if [ -d "apps/$SERVICE_NAME/base" ]; then
echo "📝 Updating Kustomize base..."
cd "apps/$SERVICE_NAME/base"
# Update image tag using kustomize edit
kustomize edit set image \
"$IMAGE_REPOSITORY:$IMAGE_TAG"
cd ../../../
fi
# Update Kustomize overlays
for overlay in dev test staging production; do
if [ -d "apps/$SERVICE_NAME/overlays/$overlay" ]; then
echo "📝 Updating Kustomize overlay: $overlay..."
cd "apps/$SERVICE_NAME/overlays/$overlay"
kustomize edit set image \
"$IMAGE_REPOSITORY:$IMAGE_TAG"
cd ../../../../../
fi
done
# Update Helm values files
if [ -d "apps/$SERVICE_NAME/helm" ]; then
echo "📝 Updating Helm values..."
cd "apps/$SERVICE_NAME/helm"
# Update target branch values file
if [ -f "values-${TARGET_BRANCH}.yaml" ]; then
yq eval -i ".image.tag = \"$IMAGE_TAG\"" \
"values-${TARGET_BRANCH}.yaml"
echo " ✅ Updated values-${TARGET_BRANCH}.yaml"
fi
# Update default values.yaml
if [ -f "values.yaml" ]; then
yq eval -i ".image.tag = \"$IMAGE_TAG\"" \
"values.yaml"
echo " ✅ Updated values.yaml"
fi
cd ../../../ # back to the repo root (three levels: apps/<service>/helm)
fi
echo "✅ Manifest update complete"
Kustomize Image Tag Replacement¶
Kustomize Update Example:
# apps/atp-ingestion/base/kustomization.yaml (before)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: v1.2.2-def456e # Old tag
# Update image tag using kustomize edit
kustomize edit set image \
connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
# Result: apps/atp-ingestion/base/kustomization.yaml (after)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: v1.2.3-abc123d # New tag
Helm Values File Update¶
Helm Values Update Example:
# apps/atp-ingestion/helm/values-production.yaml (before)
image:
repository: connectsoft.azurecr.io/atp/ingestion
tag: v1.2.2-def456e # Old tag
pullPolicy: IfNotPresent
# Update Helm values using yq
yq eval -i '.image.tag = "v1.2.3-abc123d"' \
apps/atp-ingestion/helm/values-production.yaml
# Result: apps/atp-ingestion/helm/values-production.yaml (after)
image:
repository: connectsoft.azurecr.io/atp/ingestion
tag: v1.2.3-abc123d # New tag
pullPolicy: IfNotPresent
Git Commit and Push Automation¶
Automated Git Commit Script:
#!/bin/bash
# scripts/commit-gitops-changes.sh
set -euo pipefail
SERVICE_NAME="${1:-atp-ingestion}"
IMAGE_TAG="${2:-latest}"
TARGET_BRANCH="${3:-dev}"
BUILD_NUMBER="${4:-unknown}"
COMMIT_SHA="${5:-unknown}"
GITOPS_REPO_PATH="${6:-gitops-repo}"
cd "$GITOPS_REPO_PATH"
# Check for changes
if [ -z "$(git status --porcelain)" ]; then
echo "ℹ️ No changes to commit"
exit 0
fi
# Configure Git
git config user.name "Azure DevOps Pipeline"
git config user.email "azure-devops@connectsoft.example"
# Stage all changes
git add apps/$SERVICE_NAME/
# Create commit message
COMMIT_MESSAGE="chore($SERVICE_NAME): update image tag to $IMAGE_TAG
Automated update from CI pipeline:
- Service: $SERVICE_NAME
- Image Tag: $IMAGE_TAG
- Build Number: $BUILD_NUMBER
- Source Commit: $COMMIT_SHA
- Pipeline: $BUILD_BUILDURI
Related to: $SYSTEM_PULLREQUEST_PULLREQUESTID"
# Commit changes
if [ -n "${GPG_KEY_ID:-}" ]; then
# Sign commit with GPG (production)
git commit -S -m "$COMMIT_MESSAGE"
else
# Unsigned commit (dev/test)
git commit -m "$COMMIT_MESSAGE"
fi
# Push to target branch
git push origin "$TARGET_BRANCH"
echo "✅ Changes committed and pushed to $TARGET_BRANCH"
echo " Commit: $(git rev-parse HEAD)"
Azure Pipeline Integration:
- script: |
chmod +x scripts/update-gitops-manifests.sh
chmod +x scripts/commit-gitops-changes.sh
# Update manifests (repository and tag are separate arguments to the script)
./scripts/update-gitops-manifests.sh \
  atp-ingestion \
  $(ImageRepository) \
  $(ImageTag) \
  $(TargetBranch) \
  gitops-repo
# Commit and push
./scripts/commit-gitops-changes.sh \
atp-ingestion \
$(ImageTag) \
$(TargetBranch) \
$(Build.BuildNumber) \
$(Build.SourceVersion) \
gitops-repo
displayName: 'Update GitOps repository'
env:
SYSTEM_ACCESSTOKEN: $(System.AccessToken)
${{ if eq(variables['Build.SourceBranch'], 'refs/heads/main') }}:
  GPG_KEY_ID: $(GPG_KEY_ID) # injected only for production (main) builds
Commit Back to GitOps Repository¶
Service Account Credentials (PAT or SSH)¶
Personal Access Token (PAT) Setup:
# Create PAT in Azure DevOps:
# 1. User Settings > Personal Access Tokens > New Token
# 2. Name: "GitOps Pipeline Service Account"
# 3. Organization: All accessible organizations
# 4. Scopes: Code (Read & Write)
# 5. Copy token
# Store PAT as Azure DevOps variable group
az pipelines variable-group create \
--name "GitOps-Credentials" \
--variables \
gitopsPat="<PAT_TOKEN>" \
--authorize true
SSH Key Setup:
# Generate SSH key for pipeline
ssh-keygen -t rsa -b 4096 -f ~/.ssh/gitops-pipeline -N ""
# Add public key to Azure DevOps
# Azure DevOps > User Settings > SSH Public Keys > New Key
cat ~/.ssh/gitops-pipeline.pub
# Store private key as Azure DevOps secret variable
az pipelines variable-group variable create \
--group-id <GROUP_ID> \
--name gitopsSshPrivateKey \
--value "$(cat ~/.ssh/gitops-pipeline | base64)" \
--secret true
Using Credentials in Pipeline:
# Option 1: Use System.AccessToken (recommended for same organization)
- checkout: git://ATP/atp-gitops@$(TargetBranch)
displayName: 'Checkout GitOps repository'
path: gitops-repo
persistCredentials: true # Use System.AccessToken
# Option 2: Use PAT from variable group
- script: |
git config --global credential.helper store
echo "https://PAT:$(gitopsPat)@dev.azure.com" > ~/.git-credentials
displayName: 'Configure Git credentials'
env:
gitopsPat: $(gitopsPat)
# Option 3: Use SSH key
- script: |
mkdir -p ~/.ssh
echo "$(gitopsSshPrivateKey)" | base64 -d > ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
ssh-keyscan ssh.dev.azure.com >> ~/.ssh/known_hosts
displayName: 'Configure SSH key'
env:
gitopsSshPrivateKey: $(gitopsSshPrivateKey)
Commit Message Format¶
Standardized Commit Message:
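A template for the fields the automation scripts in this document fill in (placeholders in angle brackets):
chore(<scope>): update image tag to <image-tag>
Automated update from CI pipeline:
- Service: <service-name>
- Image Tag: <image-tag>
- Build Number: <build-number>
- Source Commit: <source-commit-sha>
- Pipeline: <pipeline-run-url>
Related to: <work-item-or-pr-reference>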
Example Commit Messages:
# Single service update
chore(ingestion): update image tag to v1.2.3-abc123d
Automated update from CI pipeline:
- Service: atp-ingestion
- Image Tag: v1.2.3-abc123d
- Build Number: 20240115.1
- Source Commit: abc123def456
- Pipeline: https://dev.azure.com/ConnectSoft/ATP/_build/results?buildId=12345
Related to: PR-123
# Multiple services update
chore(*): update image tags for release v1.2.3
Automated update from CI pipeline:
- Services: atp-ingestion, atp-query, atp-gateway
- Version: v1.2.3
- Build Number: 20240115.1
- Pipeline: https://dev.azure.com/ConnectSoft/ATP/_build/results?buildId=12345
Services updated:
- atp-ingestion: v1.2.3-abc123d
- atp-query: v1.2.3-def456e
- atp-gateway: v1.2.3-ghi789f
Signed Commits for Audit Trail¶
GPG Commit Signing Setup:
# Generate GPG key for pipeline service account
gpg --batch --gen-key <<EOF
%no-protection
Key-Type: RSA
Key-Length: 4096
Name-Real: Azure DevOps Pipeline
Name-Email: azure-devops@connectsoft.example
Expire-Date: 0
EOF
# Export public key
gpg --armor --export azure-devops@connectsoft.example > pipeline-gpg-public.key
# Export private key (base64 encoded for storage)
gpg --export-secret-keys --armor azure-devops@connectsoft.example | base64 > pipeline-gpg-private.key.b64
# Store private key as Azure DevOps secret variable
Using GPG in Pipeline:
- script: |
# Import GPG key
echo "$(gpgPrivateKey)" | base64 -d | gpg --batch --import
gpg --list-secret-keys --keyid-format LONG
# Configure Git to use GPG
git config user.signingkey "$(gpgKeyId)"
git config commit.gpgsign true
# Sign commit
git commit -S -m "$(commitMessage)"
displayName: 'Sign and commit changes'
env:
gpgPrivateKey: $(gpgPrivateKey)
gpgKeyId: $(gpgKeyId)
commitMessage: $(commitMessage)
Branch Selection (Dev, Test, Staging, Production)¶
Branch Selection Logic:
# Azure Pipeline: Dynamic branch selection
variables:
- ${{ if eq(variables['Build.SourceBranch'], 'refs/heads/main') }}:
  - name: TargetBranch
    value: 'production'
- ${{ elseif eq(variables['Build.SourceBranch'], 'refs/heads/staging') }}:
  - name: TargetBranch
    value: 'staging'
- ${{ elseif eq(variables['Build.SourceBranch'], 'refs/heads/test') }}:
  - name: TargetBranch
    value: 'test'
- ${{ else }}:
  - name: TargetBranch
    value: 'dev'
Branch Selection Script:
#!/bin/bash
# scripts/determine-target-branch.sh
SOURCE_BRANCH="${1:-dev}"
case "$SOURCE_BRANCH" in
refs/heads/main|main)
TARGET_BRANCH="production"
;;
refs/heads/staging|staging)
TARGET_BRANCH="staging"
;;
refs/heads/test|test)
TARGET_BRANCH="test"
;;
*)
TARGET_BRANCH="dev"
;;
esac
echo "Source branch: $SOURCE_BRANCH"
echo "Target branch: $TARGET_BRANCH"
echo "##vso[task.setvariable variable=TargetBranch]$TARGET_BRANCH"
Triggering FluxCD Sync¶
Automatic Sync via Polling (Default)¶
FluxCD Polling Configuration:
# GitRepository polling interval (default: 1 minute)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops
namespace: flux-system
spec:
interval: 1m # Poll every 1 minute
url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
ref:
branch: production
Sync Timeline (Polling-Based):
T+0s: CI pipeline commits to GitOps repo
T+0s: Git commit pushed successfully
T+0s: FluxCD Source Controller idle (last polled at T-30s)
T+60s: FluxCD Source Controller polls Git (detects new commit)
T+60s: FluxCD Kustomize Controller notified of new artifact
T+65s: FluxCD Kustomize Controller reconciles cluster
T+70s: Kubernetes resources updated
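The GitRepository above only fetches the source; a companion Kustomization applies it and drives the cluster-side reconcile loop. A minimal sketch, assuming the apps/ path used throughout this document:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m # periodic re-apply even without new commits (drift correction)
  path: ./apps
  prune: true # remove cluster objects that were deleted from Git
  sourceRef:
    kind: GitRepository
    name: atp-gitops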
Webhook-Based Immediate Sync (Optional)¶
FluxCD Webhook Receiver:
# apps/fluxcd/receiver.yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Receiver
metadata:
name: gitops-webhook
namespace: flux-system
spec:
type: generic # no dedicated Azure Repos receiver type; 'generic' accepts the payload sent below
events:
- push
resources:
- kind: GitRepository
name: atp-gitops
namespace: flux-system
secretRef:
name: webhook-token
Azure DevOps Webhook Configuration:
# Azure Pipeline: Trigger FluxCD webhook
- script: |
# The /hook/ path segment is generated by Flux (see the Receiver status via
# `flux get receivers`); the cluster-local URL below assumes a self-hosted
# agent with network access to the cluster — otherwise expose the receiver via ingress
WEBHOOK_URL="https://fluxcd-receiver.flux-system.svc.cluster.local:8080/hook/$(webhookToken)"
curl -X POST "$WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d '{
"ref": "refs/heads/'"$(TargetBranch)"'",
"commits": [{
"id": "'"$(Build.SourceVersion)"'",
"message": "Automated manifest update"
}]
}'
echo "✅ FluxCD webhook triggered"
displayName: 'Trigger FluxCD sync webhook'
env:
webhookToken: $(fluxcdWebhookToken)
Sync Timeline (Webhook-Based):
T+0s: CI pipeline commits to GitOps repo
T+0s: Git commit pushed successfully
T+1s: Azure DevOps webhook triggered
T+2s: FluxCD Receiver receives webhook
T+2s: FluxCD Source Controller immediately fetches Git
T+3s: FluxCD Kustomize Controller notified
T+8s: Kubernetes resources updated
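For Azure DevOps to reach the receiver at all (the cluster-local URL above only resolves inside the cluster), the notification-controller's webhook-receiver Service is typically exposed through ingress; a sketch with a hypothetical host name:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: flux-webhook
  namespace: flux-system
spec:
  rules:
  - host: flux-webhook.connectsoft.example # hypothetical public host
    http:
      paths:
      - path: /hook/
        pathType: Prefix
        backend:
          service:
            name: webhook-receiver # installed with Flux's notification-controller
            port:
              number: 80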
Flux Reconcile Command (Manual)¶
Manual Reconciliation:
# Manual reconciliation via flux CLI
flux reconcile source git atp-gitops --namespace flux-system
# Force reconciliation (even if no changes)
flux reconcile kustomization apps --namespace flux-system --with-source
# Reconciliation status
flux get kustomizations apps --namespace flux-system
Reconciliation in Pipeline (Optional):
- task: AzureCLI@2
  displayName: 'Trigger FluxCD reconciliation'
  inputs:
    azureSubscription: 'ATP-AKS-Connection'
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: |
      # Requires the flux CLI on the agent; the flux binary is not a kubectl subcommand
      az aks get-credentials \
        --resource-group ATP-Production-EUS-RG \
        --name atp-prod-eus-aks
      flux reconcile source git atp-gitops --namespace flux-system
  condition: and(succeeded(), eq(variables['TargetBranch'], 'production'))
Pipeline Templates for GitOps Integration¶
Reusable YAML Templates¶
GitOps Update Template (templates/gitops-update.yml):
# templates/gitops-update.yml
parameters:
- name: serviceName
type: string
- name: imageRepository
type: string
- name: imageTag
type: string
- name: targetBranch
type: string
default: 'dev'
- name: requireGpgSigning
type: boolean
default: false
steps:
- checkout: git://ATP/atp-gitops@${{ parameters.targetBranch }}
displayName: 'Checkout GitOps repository'
path: gitops-repo
persistCredentials: true
- script: |
# Install tools
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /usr/local/bin/
wget -q https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O yq
chmod +x yq
sudo mv yq /usr/local/bin/
displayName: 'Install tools'
- script: |
cd gitops-repo
# Update Kustomize
if [ -d "apps/${{ parameters.serviceName }}/base" ]; then
cd "apps/${{ parameters.serviceName }}/base"
kustomize edit set image "${{ parameters.imageRepository }}:${{ parameters.imageTag }}"
cd ../../../
fi
# Update Helm
if [ -d "apps/${{ parameters.serviceName }}/helm" ]; then
cd "apps/${{ parameters.serviceName }}/helm"
if [ -f "values-${{ parameters.targetBranch }}.yaml" ]; then
yq eval -i ".image.tag = \"${{ parameters.imageTag }}\"" \
"values-${{ parameters.targetBranch }}.yaml"
fi
cd ../../../ # back to the repo root from apps/<service>/helm
fi
displayName: 'Update manifests'
- script: |
cd gitops-repo
if [ -z "$(git status --porcelain)" ]; then
echo "ℹ️ No changes to commit"
exit 0
fi
git config user.name "Azure DevOps Pipeline"
git config user.email "azure-devops@connectsoft.example"
git add apps/${{ parameters.serviceName }}/
COMMIT_MESSAGE="chore(${{ parameters.serviceName }}): update image tag to ${{ parameters.imageTag }}
Automated update from CI pipeline.
Build: $(Build.BuildNumber)
Commit: $(Build.SourceVersion)"
if [ "${{ parameters.requireGpgSigning }}" == "true" ]; then
echo "$(gpgPrivateKey)" | base64 -d | gpg --batch --import
git config user.signingkey "$(gpgKeyId)"
git config commit.gpgsign true
git commit -S -m "$COMMIT_MESSAGE"
else
git commit -m "$COMMIT_MESSAGE"
fi
git push origin ${{ parameters.targetBranch }}
displayName: 'Commit and push changes'
env:
${{ if eq(parameters.requireGpgSigning, true) }}:
  gpgPrivateKey: $(gpgPrivateKey)
  gpgKeyId: $(gpgKeyId)
Parameterization for Different Services¶
Using Template in Pipeline:
# azure-pipelines.yml
resources:
repositories:
- repository: templates
type: git
name: ATP/azure-pipelines-templates
stages:
- stage: UpdateGitOps
displayName: 'Update GitOps Repository'
jobs:
- job: UpdateManifests
steps:
- template: templates/gitops-update.yml@templates
parameters:
serviceName: 'atp-ingestion'
imageRepository: 'connectsoft.azurecr.io/atp/ingestion'
imageTag: '$(ImageTag)'
targetBranch: '$(TargetBranch)'
requireGpgSigning: ${{ eq(variables['Build.SourceBranch'], 'refs/heads/main') }} # GPG signing only for production (main) builds
Multi-Service Template Usage:
- stage: UpdateGitOps
displayName: 'Update GitOps for Multiple Services'
jobs:
- job: UpdateAllServices
strategy:
matrix:
ingestion:
serviceName: 'atp-ingestion'
imageRepository: 'connectsoft.azurecr.io/atp/ingestion'
query:
serviceName: 'atp-query'
imageRepository: 'connectsoft.azurecr.io/atp/query'
gateway:
serviceName: 'atp-gateway'
imageRepository: 'connectsoft.azurecr.io/atp/gateway'
steps:
- template: templates/gitops-update.yml@templates
parameters:
serviceName: '$(serviceName)'
imageRepository: '$(imageRepository)'
imageTag: '$(ImageTag)'
targetBranch: '$(TargetBranch)'
Template Versioning¶
Versioned Template Reference:
resources:
repositories:
- repository: templates
type: git
name: ATP/azure-pipelines-templates
ref: refs/tags/v1.2.3 # Pin to specific version
stages:
- stage: UpdateGitOps
jobs:
- job: UpdateManifests
steps:
- template: templates/gitops-update.yml@templates
parameters:
serviceName: 'atp-ingestion'
imageRepository: 'connectsoft.azurecr.io/atp/ingestion'
imageTag: '$(ImageTag)'
Multi-Service Coordination¶
Updating Multiple Services Atomically¶
Atomic Multi-Service Update:
- stage: UpdateGitOps
displayName: 'Update GitOps (Atomic)'
jobs:
- job: UpdateAllServices
steps:
- checkout: git://ATP/atp-gitops@$(TargetBranch)
path: gitops-repo
persistCredentials: true
# Update all services in single commit
- script: |
cd gitops-repo
# kustomize edit has no --path flag; run it inside each service's base directory
(cd apps/atp-ingestion/base && kustomize edit set image \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d)
(cd apps/atp-query/base && kustomize edit set image \
  connectsoft.azurecr.io/atp/query:v1.3.0-def456e)
(cd apps/atp-gateway/base && kustomize edit set image \
  connectsoft.azurecr.io/atp/gateway:v1.1.5-ghi789f)
# Single commit for all services
git config user.name "Azure DevOps Pipeline"
git config user.email "azure-devops@connectsoft.example"
git add apps/
git commit -m "chore(*): update all service image tags
Atomic update for release v1.2.3:
- atp-ingestion: v1.2.3-abc123d
- atp-query: v1.3.0-def456e
- atp-gateway: v1.1.5-ghi789f"
git push origin $(TargetBranch)
displayName: 'Atomic multi-service update'
Dependency Management¶
Service Dependency Graph:
# services-dependencies.yaml
services:
- name: atp-gateway
dependsOn: []
updateOrder: 1
- name: atp-ingestion
dependsOn: [atp-gateway]
updateOrder: 2
- name: atp-query
dependsOn: [atp-ingestion]
updateOrder: 3
- name: atp-export
dependsOn: [atp-query]
updateOrder: 4
Dependency-Aware Update Script:
#!/bin/bash
# scripts/update-services-with-dependencies.sh
SERVICES=(
"atp-gateway:v1.1.5-ghi789f:1"
"atp-ingestion:v1.2.3-abc123d:2"
"atp-query:v1.3.0-def456e:3"
)
# Sort by update order
IFS=$'\n' sorted_services=($(sort -t: -k3 <<<"${SERVICES[*]}"))
unset IFS
for service_info in "${sorted_services[@]}"; do
IFS=':' read -r service_name image_tag update_order <<< "$service_info"
echo "📦 Updating $service_name (order: $update_order)..."
# Update manifest (kustomize edit runs against the current directory)
(cd "apps/$service_name/base" && kustomize edit set image \
  "connectsoft.azurecr.io/$service_name:$image_tag")
echo " ✅ Updated $service_name"
done
Coordinated Rollout Strategies¶
Staged Rollout:
- stage: CoordinatedRollout
displayName: 'Coordinated Multi-Service Rollout'
jobs:
- job: Stage1Gateway
displayName: 'Stage 1: Gateway'
steps:
- template: templates/gitops-update.yml@templates
parameters:
serviceName: 'atp-gateway'
imageRepository: 'connectsoft.azurecr.io/atp/gateway' # required: the template defines no default
imageTag: '$(GatewayImageTag)'
- job: Stage2Ingestion
displayName: 'Stage 2: Ingestion'
dependsOn: Stage1Gateway
condition: succeeded()
steps:
- template: templates/gitops-update.yml@templates
parameters:
serviceName: 'atp-ingestion'
imageRepository: 'connectsoft.azurecr.io/atp/ingestion'
imageTag: '$(IngestionImageTag)'
- job: Stage3Query
displayName: 'Stage 3: Query'
dependsOn: Stage2Ingestion
condition: succeeded()
steps:
- template: templates/gitops-update.yml@templates
parameters:
serviceName: 'atp-query'
imageRepository: 'connectsoft.azurecr.io/atp/query'
imageTag: '$(QueryImageTag)'
Artifact Metadata¶
SBOM (Software Bill of Materials) Generation¶
SBOM Generation with Syft:
- script: |
# Install Syft
curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
# Generate SBOM in SPDX format
syft packages docker:$(ImageRepository):$(ImageTag) \
--output spdx-json \
--file sbom-$(ImageTag).spdx.json
# Generate SBOM in CycloneDX format
syft packages docker:$(ImageRepository):$(ImageTag) \
--output cyclonedx-json \
--file sbom-$(ImageTag).cyclonedx.json
echo "✅ SBOM generated: sbom-$(ImageTag).spdx.json"
displayName: 'Generate SBOM'
SBOM Structure (Example):
{
"SPDXID": "SPDXRef-DOCUMENT",
"spdxVersion": "SPDX-2.3",
"name": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d",
"packages": [
{
"SPDXID": "SPDXRef-Package-dotnet-runtime",
"name": "dotnet-runtime",
"versionInfo": "8.0.0",
"downloadLocation": "NOASSERTION"
},
{
"SPDXID": "SPDXRef-Package-aspnetcore",
"name": "aspnetcore",
"versionInfo": "8.0.0",
"downloadLocation": "NOASSERTION"
}
]
}
Vulnerability Scan Results¶
Vulnerability Scanning Integration:
- script: |
# Scan image with Trivy
trivy image \
--format json \
--output trivy-$(ImageTag).json \
--severity HIGH,CRITICAL \
$(ImageRepository):$(ImageTag)
# Generate HTML report
trivy image \
--format template \
--template "@contrib/html.tpl" \
--output trivy-$(ImageTag).html \
$(ImageRepository):$(ImageTag)
# Publish scan results
echo "##vso[task.addattachment type=Distributedtask.Core.Summary;name=Vulnerability Scan;]$PWD/trivy-$(ImageTag).html"
displayName: 'Scan image for vulnerabilities'
Vulnerability Scan Results (Example):
{
"SchemaVersion": 2,
"ArtifactName": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d",
"Results": [
{
"Target": "atp-ingestion:v1.2.3-abc123d",
"Vulnerabilities": [
{
"VulnerabilityID": "CVE-2024-1234",
"Severity": "HIGH",
"PackageName": "aspnetcore",
"InstalledVersion": "8.0.0",
"FixedVersion": "8.0.1"
}
]
}
]
}
Build Provenance Information¶
Provenance Generation (SLSA/In-Toto):
- script: |
# Generate build provenance (SLSA v1.0)
# IMAGE_DIGEST is assumed to be captured by an earlier step (e.g., after push)
cat > provenance-$(ImageTag).json <<EOF
{
"_type": "https://in-toto.io/Statement/v1",
"subject": [
{
"name": "$(ImageRepository):$(ImageTag)",
"digest": {
"sha256": "$(IMAGE_DIGEST)"
}
}
],
"predicateType": "https://slsa.dev/provenance/v1",
"predicate": {
"buildDefinition": {
"buildType": "https://dev.azure.com/ConnectSoft/ATP",
"externalParameters": {
"source": "$(Build.Repository.Uri)",
"ref": "$(Build.SourceBranch)",
"commit": "$(Build.SourceVersion)"
},
"internalParameters": {
"pipeline": "$(Build.DefinitionName)",
"buildId": "$(Build.BuildId)"
},
"resolvedDependencies": [
{
"uri": "$(Build.Repository.Uri)",
"digest": {
"gitCommit": "$(Build.SourceVersion)"
}
}
]
},
"runDetails": {
"builder": {
"id": "Azure DevOps Pipeline"
},
"metadata": {
"invocationId": "$(Build.BuildId)",
"startedOn": "$(Build.QueuedTime)",
"finishedOn": "$(System.Agent.JobFinishTime)"
}
}
}
}
EOF
echo "✅ Build provenance generated"
displayName: 'Generate build provenance'
Metadata Storage in ACR¶
Attach Metadata to ACR Image:
- script: |
    # `az acr repository update` has no --metadata flag; attach build metadata
    # as an OCI referrer artifact instead (requires the ORAS CLI on the agent)
    az acr login --name connectsoft
    cat > build-metadata.json <<EOF
    {
      "build.version": "$(Build.BuildNumber)",
      "build.commit": "$(Build.SourceVersion)",
      "build.branch": "$(Build.SourceBranch)",
      "build.pipeline": "$(Build.BuildUri)"
    }
    EOF
    oras attach connectsoft.azurecr.io/atp/ingestion:$(ImageTag) \
      --artifact-type application/vnd.connectsoft.build-metadata.v1+json \
      build-metadata.json sbom-$(ImageTag).spdx.json trivy-$(ImageTag).json
    echo "✅ Metadata attached to image"
  displayName: 'Attach metadata to ACR image'
Query Image Metadata:
# List referrer artifacts attached to the image (ORAS)
oras discover connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
# Pull an attached metadata artifact by its digest to inspect it
oras pull connectsoft.azurecr.io/atp/ingestion@sha256:<referrer-digest>
# Example build-metadata.json content:
# {
#   "build.version": "20240115.1",
#   "build.commit": "abc123def456",
#   "build.branch": "refs/heads/main",
#   "build.pipeline": "https://dev.azure.com/..."
# }
Pipeline Observability¶
Correlation IDs Between Build and Deployment¶
Correlation ID Generation:
- script: |
# Generate correlation ID
CORRELATION_ID="build-$(Build.BuildId)-$(Build.SourceVersion)"
echo "##vso[task.setvariable variable=CorrelationId;isOutput=true]$CORRELATION_ID"
echo "Correlation ID: $CORRELATION_ID"
displayName: 'Generate correlation ID'
name: GenerateCorrelationId
# Pass correlation ID to GitOps commit
- script: |
git commit -m "chore(ingestion): update image tag
Correlation ID: $(GenerateCorrelationId.CorrelationId)
Build: $(Build.BuildNumber)"
displayName: 'Commit with correlation ID'
Correlation ID in Deployment Metadata:
# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
annotations:
deployment.connectsoft.com/correlation-id: "build-12345-abc123d"
deployment.connectsoft.com/build-number: "20240115.1"
deployment.connectsoft.com/build-uri: "https://dev.azure.com/.../builds/12345"
spec:
template:
metadata:
labels:
correlation-id: "build-12345-abc123d"
Linking Azure Pipeline Runs to FluxCD Reconciliations¶
Link Tracking Script:
#!/bin/bash
# scripts/link-build-to-deployment.sh
CORRELATION_ID="${1:-unknown}"
BUILD_URI="${2:-unknown}"
NAMESPACE="${3:-atp-production}"
# Annotate deployment with build information
kubectl annotate deployment atp-ingestion \
-n "$NAMESPACE" \
deployment.connectsoft.com/build-uri="$BUILD_URI" \
deployment.connectsoft.com/correlation-id="$CORRELATION_ID" \
--overwrite
echo "✅ Deployment linked to build: $BUILD_URI"
Query Links:
# Query deployment for build link
kubectl get deployment atp-ingestion -n atp-production \
-o jsonpath='{.metadata.annotations.deployment\.connectsoft\.com/build-uri}'
# Output: https://dev.azure.com/ConnectSoft/ATP/_build/results?buildId=12345
Deployment Receipt Generation¶
Deployment Receipt Script:
#!/bin/bash
# scripts/generate-deployment-receipt.sh
DEPLOYMENT_NAME="${1:-atp-ingestion}"
NAMESPACE="${2:-atp-production}"
CORRELATION_ID="${3:-unknown}"
# Generate deployment receipt
cat > deployment-receipt-$(date +%Y%m%d-%H%M%S).json <<EOF
{
"deploymentId": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.metadata.uid}')",
"correlationId": "$CORRELATION_ID",
"namespace": "$NAMESPACE",
"deploymentName": "$DEPLOYMENT_NAME",
"image": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.spec.template.spec.containers[0].image}')",
"replicas": $(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.spec.replicas}'),
"deployedAt": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"deployedBy": "FluxCD",
"gitCommit": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.metadata.labels.app\.kubernetes\.io/version}')",
"status": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')"
}
EOF
echo "✅ Deployment receipt generated"
Metrics and Dashboards¶
Pipeline Metrics (Azure Monitor):
- script: |
    # Note: the Azure CLI has no `az monitor metrics create`; custom metrics are
    # posted to the Azure Monitor custom-metrics REST endpoint. A sketch — the
    # region, $(MonitoredResourceId), and JOB_DURATION_SECONDS are assumptions
    # expected to be supplied by earlier steps.
    ACCESS_TOKEN=$(az account get-access-token \
      --resource https://monitoring.azure.com/ --query accessToken -o tsv)
    curl -X POST "https://eastus.monitoring.azure.com$(MonitoredResourceId)/metrics" \
      -H "Authorization: Bearer $ACCESS_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"time": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'",
           "data": {"baseData": {
             "metric": "gitops_manifest_update_duration",
             "namespace": "ATP/Pipelines",
             "series": [{"min": '"$JOB_DURATION_SECONDS"', "max": '"$JOB_DURATION_SECONDS"',
                         "sum": '"$JOB_DURATION_SECONDS"', "count": 1}]}}}'
  displayName: 'Send pipeline metrics'
KQL Query for Build-Deployment Correlation:
// Azure Monitor: Link builds to deployments
let BuildEvents = ContainerLog
| where LogEntry contains "correlation-id"
| extend CorrelationId = extract(@"correlation-id: ([^\s]+)", 1, LogEntry)
| extend BuildUri = extract(@"build-uri: ([^\s]+)", 1, LogEntry)
| project CorrelationId, BuildUri, BuildTime = TimeGenerated;
let DeploymentEvents = KubePodInventory
| where Namespace == "atp-production"
| extend CorrelationId = extract(@"correlation-id: ([^\s]+)", 1, tostring(Labels))
| where isnotempty(CorrelationId)
| project CorrelationId, PodName, DeploymentTime = TimeGenerated;
BuildEvents
| join kind=inner DeploymentEvents on CorrelationId
| extend DeploymentLatency = DeploymentTime - BuildTime
| summarize avg(DeploymentLatency) by bin(BuildTime, 1h)
Grafana Dashboard Configuration:
# Grafana dashboard for CI/CD → GitOps handoff
dashboard:
title: "CI/CD to GitOps Handoff Metrics"
panels:
- title: "Build to Deployment Latency"
query: |
avg(deployment_latency_seconds{namespace="atp-production"})
- title: "GitOps Commit Frequency"
query: |
rate(gitops_commits_total[5m])
- title: "FluxCD Sync Success Rate"
query: |
rate(fluxcd_kustomize_reconciliation_success_total[5m]) /
rate(fluxcd_kustomize_reconciliation_total[5m])
Summary: Azure Pipelines to GitOps Handoff¶
- Separation of Concerns: CI builds/test/publishes artifacts; GitOps deploys/reconciles/monitors
- Pipeline Stages: Build (compile, test), Test (integration, security), Publish (ACR, SBOM), Update GitOps (manifest commits)
- Image Tag Generation: Semantic version + commit SHA format (v1.2.3-abc123d) for immutability and traceability
- Automated Manifest Update: Scripts for Kustomize and Helm manifest updates with Git commit automation
- Git Commit Automation: PAT/SSH credentials, conventional commit messages, GPG signing for production
- FluxCD Sync Triggers: Polling (default), webhooks (immediate), manual reconciliation
- Pipeline Templates: Reusable YAML templates with parameterization and versioning
- Multi-Service Coordination: Atomic updates, dependency management, coordinated rollout strategies
- Artifact Metadata: SBOM generation, vulnerability scans, build provenance, ACR metadata storage
- Pipeline Observability: Correlation IDs, build-deployment linking, deployment receipts, metrics dashboards
Pulumi Infrastructure as Code Integration¶
Purpose: Define how Pulumi with C# is used to provision and manage Azure infrastructure for ATP, integrating with GitOps workflows to ensure infrastructure changes are version-controlled, tested, and deployed through the same Git-based processes as application deployments.
Pulumi Overview for Azure Resources¶
Why Pulumi for ATP (C# Programming Model)¶
ATP Infrastructure Requirements:
- Complex Azure Resource Orchestration: AKS clusters, ACR, Key Vault, Service Bus, Storage Accounts
- Multi-Environment Management: Dev, test, staging, production with environment-specific configurations
- C# Development Team: Leverage existing C# expertise for infrastructure code
- Type Safety: Strong typing and IntelliSense for Azure resources
- Testability: Unit test infrastructure code with standard C# testing frameworks
- Code Reusability: Create reusable components and modules
Pulumi Advantages for ATP:
| Advantage | Description | ATP Benefit |
|---|---|---|
| C# Programming Model | Write infrastructure as C# code | Leverage team's existing C# skills |
| Type Safety | Strong typing with IntelliSense | Reduce configuration errors at compile time |
| Rich Ecosystem | Access to .NET libraries and NuGet packages | Reuse existing code and patterns |
| Imperative Logic | Full programming language capabilities | Complex conditional logic, loops, functions |
| State Management | Built-in state management with locking | Safe concurrent updates |
| Multi-Language Support | C#, TypeScript, Python, Go available | Team flexibility |
Pulumi vs Terraform vs Bicep Comparison¶
Comparison Matrix:
| Feature | Pulumi | Terraform | Bicep |
|---|---|---|---|
| Language | C#, TypeScript, Python, Go | HCL (Hashicorp Config Language) | Domain-specific language (DSL) |
| Programming Model | Imperative with full language features | Declarative | Declarative |
| Type Safety | ✅ Strong typing (C#) | ❌ Limited | ✅ Type checking |
| IntelliSense | ✅ Full IDE support | ⚠️ Basic | ✅ Good |
| Testing | ✅ Unit test with standard frameworks | ⚠️ Limited | ❌ Limited |
| Code Reuse | ✅ Functions, classes, modules | ⚠️ Modules | ⚠️ Modules |
| State Management | ✅ Built-in with locking | ✅ Built-in with locking | ✅ Azure-native |
| Azure Integration | ✅ Excellent | ✅ Good | ✅ Native (Azure-only) |
| Multi-Cloud | ✅ Excellent | ✅ Excellent | ❌ Azure-only |
| Learning Curve | ✅ Low (if team knows C#) | ⚠️ Medium (HCL) | ⚠️ Medium (DSL) |
ATP Selection: Pulumi with C#
Rationale:
- Team Expertise: ATP team is proficient in C#, reducing learning curve
- Type Safety: Catch configuration errors at compile time
- Testability: Unit test infrastructure code with xUnit/NUnit (see the sketch after this list)
- Code Reuse: Create reusable infrastructure components
- Complex Logic: Handle multi-tenant, multi-region scenarios with imperative code
- Azure-First: Strong Azure support while maintaining multi-cloud flexibility
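As a sketch of the testability point: Pulumi's built-in mocks let stack code run in-memory under xUnit. This assumes required stack configuration (e.g. environment) is supplied, for example via the PULUMI_CONFIG environment variable:
// A minimal sketch of in-memory stack testing with Pulumi.Testing and xUnit
using System.Threading.Tasks;
using Pulumi;
using Pulumi.Testing;
using Xunit;

class Mocks : IMocks
{
    // Every resource gets a fake ID; its inputs are echoed back as outputs
    public Task<(string? id, object state)> NewResourceAsync(MockResourceArgs args) =>
        Task.FromResult<(string?, object)>(($"{args.Name}-id", args.Inputs));

    // Provider function calls are not exercised in this sketch
    public Task<object> CallAsync(MockCallArgs args) =>
        Task.FromResult<object>(args.Args);
}

public class ATPStackTests
{
    [Fact]
    public async Task Stack_Creates_At_Least_One_Resource()
    {
        var resources = await Deployment.TestAsync<ATPStack>(
            new Mocks(), new TestOptions { IsPreview = false });
        Assert.NotEmpty(resources);
    }
}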
Infrastructure as Code Principles¶
IaC Best Practices for ATP:
- Version Control: All infrastructure code in Git (Azure Repos)
- Immutable Infrastructure: Recreate rather than modify when possible
- Idempotency: Infrastructure code can be run multiple times safely
- Declarative Intent: Describe desired state, not implementation steps
- Environment Parity: Use same code for all environments (parameterized)
- Code Review: Infrastructure changes require PR approval
- Testing: Preview changes before applying
- State Management: Centralized, versioned state with locking
IaC Workflow:
graph LR
A[Developer] -->|Create PR| B[Infrastructure Code<br/>in Git]
B -->|PR Validation| C[Pulumi Preview]
C -->|Review| D[Manual Approval]
D -->|Merge| E[Pulumi Up]
E -->|Update State| F[Azure Blob Storage<br/>State Backend]
E -->|Provision| G[Azure Resources]
style B fill:#90EE90
style C fill:#FFE5B4
style D fill:#FFE5B4
style E fill:#90EE90
style F fill:#FFE5B4
style G fill:#ffcccc
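The reviewable steps in this workflow map directly onto CLI commands; a minimal sketch using the stack names from this document:
# PR validation: show the proposed changes without applying them
pulumi preview --stack dev --diff
# After review and merge: apply the changes and update state
pulumi up --stack dev --yes
# Inspect what the stack currently manages
pulumi stack --stack dev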
Pulumi Stacks for Environments¶
Stack per Environment (Dev, Test, Staging, Production)¶
Stack Organization:
atp-infrastructure/
├── Pulumi.yaml
├── Pulumi.dev.yaml
├── Pulumi.test.yaml
├── Pulumi.staging.yaml
├── Pulumi.production.yaml
├── Program.cs
└── infrastructure/
├── AKS.cs
├── ACR.cs
├── KeyVault.cs
└── ServiceBus.cs
Pulumi Project Configuration (Pulumi.yaml):
name: atp-infrastructure
runtime: dotnet
description: ATP Infrastructure as Code using Pulumi with C#
Stack Configuration Examples:
Dev Stack (Pulumi.dev.yaml):
config:
azure-native:location: eastus
atp-infrastructure:environment: dev
atp-infrastructure:aksNodeCount: 3
atp-infrastructure:aksNodeVmSize: Standard_D2s_v3
atp-infrastructure:acrSku: Basic
atp-infrastructure:keyVaultSku: standard
atp-infrastructure:enableMonitoring: true
atp-infrastructure:enablePrivateEndpoint: false
atp-infrastructure:tags:
Environment: dev
ManagedBy: pulumi
Project: ATP
Production Stack (Pulumi.production.yaml):
config:
azure-native:location: eastus
atp-infrastructure:environment: production
atp-infrastructure:aksNodeCount: 5
atp-infrastructure:aksNodeVmSize: Standard_D4s_v3
atp-infrastructure:acrSku: Premium
atp-infrastructure:keyVaultSku: premium
atp-infrastructure:enableMonitoring: true
atp-infrastructure:enablePrivateEndpoint: true
atp-infrastructure:enableGeoReplication: true
atp-infrastructure:tags:
Environment: production
ManagedBy: pulumi
Project: ATP
Compliance: SOC2
Stack Configuration and Secrets¶
Stack Configuration with Secrets:
// Program.cs
using Pulumi;
class Program
{
static Task<int> Main() => Deployment.RunAsync<ATPStack>();
}
class ATPStack : Stack
{
public ATPStack()
{
var config = new Config();
// Read configuration
var environment = config.Require("environment");
var location = config.Get("location") ?? "eastus";
var nodeCount = config.GetInt32("aksNodeCount") ?? 3;
var nodeVmSize = config.Get("aksNodeVmSize") ?? "Standard_D2s_v3";
// Read secrets (encrypted)
var sqlAdminPassword = config.RequireSecret("sqlAdminPassword");
var keyVaultAccessKey = config.RequireSecret("keyVaultAccessKey");
// Create infrastructure
var aks = new AKSCluster(this, environment, location, nodeCount, nodeVmSize);
var acr = new AzureContainerRegistry(this, environment, location);
var keyVault = new KeyVault(this, environment, location, keyVaultAccessKey);
}
}
Setting Stack Configuration:
# Set plain configuration
pulumi config set aksNodeCount 5 --stack production
pulumi config set aksNodeVmSize Standard_D4s_v3 --stack production
# Set secrets (encrypted in state)
pulumi config set --secret sqlAdminPassword "SecurePassword123!"
pulumi config set --secret keyVaultAccessKey "access-key-value"
# View configuration
pulumi config --stack production
# View secrets (decrypted)
pulumi config get --secret sqlAdminPassword --stack production
Configuration via Azure DevOps Variable Groups:
# Azure Pipeline: Set Pulumi configuration from variable groups
- script: |
# Set plain configuration
pulumi config set azure-native:location $(AzureLocation) --stack $(PulumiStack)
pulumi config set aksNodeCount $(AKSNodeCount) --stack $(PulumiStack)
# Set secrets from Azure Key Vault
pulumi config set --secret sqlAdminPassword "$(sqlAdminPassword)" --stack $(PulumiStack)
pulumi config set --secret keyVaultAccessKey "$(keyVaultAccessKey)" --stack $(PulumiStack)
displayName: 'Set Pulumi stack configuration'
env:
PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
Stack References for Cross-Stack Dependencies¶
Stack Reference Example:
// Shared infrastructure stack (networking, monitoring)
class SharedStack : Stack
{
[Output]
public Output<string> LogAnalyticsWorkspaceId { get; set; }
[Output]
public Output<string> VirtualNetworkId { get; set; }
public SharedStack()
{
// resourceGroup and location are assumed defined earlier in this stack (omitted)
var workspace = new OperationalInsights.Workspace("atp-shared-loganalytics", new()
{
ResourceGroupName = resourceGroup.Name,
Location = location,
});
this.LogAnalyticsWorkspaceId = workspace.Id;
}
}
// Application stack references shared stack
class ApplicationStack : Stack
{
public ApplicationStack()
{
var sharedStack = new StackReference("ConnectSoft/atp-shared/shared");
var logAnalyticsWorkspaceId = sharedStack.RequireOutput("LogAnalyticsWorkspaceId")
.Apply(id => id.ToString());
// Use shared resources
var aks = new ContainerService.ManagedCluster("atp-aks", new()
{
// Reference shared Log Analytics workspace
AddonProfiles = new()
{
["omsagent"] = new()
{
Enabled = true,
Config = new()
{
["logAnalyticsWorkspaceResourceID"] = logAnalyticsWorkspaceId,
},
},
},
});
}
}
AKS Cluster Provisioning¶
Cluster Configuration (Node Pools, Networking, SKUs)¶
Complete AKS Cluster with C# Pulumi:
// infrastructure/AKS.cs
using Pulumi;
using Pulumi.AzureNative.ContainerService;
using Pulumi.AzureNative.ContainerService.Inputs;
using Pulumi.AzureNative.Network;
using Pulumi.AzureNative.Resources;
using Pulumi.AzureNative; // for ManagedServiceIdentity.* used below
public class AKSCluster
{
public ManagedCluster Cluster { get; }
public Output<string> KubeConfig { get; }
public AKSCluster(Pulumi.Stack stack, string environment, string location,
int nodeCount, string nodeVmSize)
{
var config = new Config();
var resourceGroupName = $"atp-{environment}-rg";
var clusterName = $"atp-{environment}-aks";
// Resource Group
var resourceGroup = new ResourceGroup($"atp-{environment}-rg", new()
{
Location = location,
Tags = new()
{
{ "Environment", environment },
{ "ManagedBy", "pulumi" },
},
});
// Virtual Network
var vnet = new VirtualNetwork($"atp-{environment}-vnet", new()
{
ResourceGroupName = resourceGroup.Name,
Location = location,
AddressSpace = new() { AddressPrefixes = { "10.0.0.0/16" } },
Tags = new()
{
{ "Environment", environment },
{ "ManagedBy", "pulumi" },
},
});
// Subnet for AKS
var aksSubnet = new Subnet($"atp-{environment}-aks-subnet", new()
{
ResourceGroupName = resourceGroup.Name,
VirtualNetworkName = vnet.Name,
AddressPrefix = "10.0.1.0/24",
Delegations = new()
{
new SubnetDelegationArgs
{
Name = "Microsoft.ContainerService.managedClusters",
ServiceDelegation = new ServiceDelegationArgs
{
Name = "Microsoft.ContainerService/managedClusters",
},
},
},
});
// User Assigned Managed Identity
var identity = new ManagedServiceIdentity.UserAssignedIdentity(
$"atp-{environment}-aks-identity", new()
{
ResourceGroupName = resourceGroup.Name,
Location = location,
});
// AKS Cluster
var cluster = new ManagedCluster(clusterName, new()
{
ResourceGroupName = resourceGroup.Name,
Location = location,
// Identity
Identity = new ManagedClusterIdentityArgs
{
Type = ManagedClusterIdentityType.UserAssigned,
UserAssignedIdentities = new[]
{
identity.Id,
},
},
// Kubernetes Version
KubernetesVersion = config.Get("kubernetesVersion") ?? "1.28",
// Node Pool Configuration
AgentPoolProfiles = new[]
{
new ManagedClusterAgentPoolProfileArgs
{
Name = "systempool",
Count = nodeCount,
VmSize = nodeVmSize,
OsType = "Linux",
OsDiskSizeGB = 128,
Mode = AgentPoolMode.System,
EnableAutoScaling = true,
MinCount = 2,
MaxCount = 10,
Type = AgentPoolType.VirtualMachineScaleSets,
VnetSubnetId = aksSubnet.Id,
MaxPods = 30,
NodeLabels = new()
{
{ "pool", "system" },
{ "environment", environment },
},
NodeTaints = new[] { "CriticalAddonsOnly=true:NoSchedule" },
},
new ManagedClusterAgentPoolProfileArgs
{
Name = "userpool",
Count = nodeCount,
VmSize = nodeVmSize,
OsType = "Linux",
OsDiskSizeGB = 128,
Mode = AgentPoolMode.User,
EnableAutoScaling = true,
MinCount = 3,
MaxCount = 20,
Type = AgentPoolType.VirtualMachineScaleSets,
VnetSubnetId = aksSubnet.Id,
MaxPods = 30,
NodeLabels = new()
{
{ "pool", "user" },
{ "environment", environment },
},
},
},
// Network Profile (Azure CNI)
NetworkProfile = new ContainerServiceNetworkProfileArgs
{
NetworkPlugin = "azure",
NetworkPolicy = "azure",
ServiceCidr = "10.1.0.0/16",
DnsServiceIP = "10.1.0.10",
LoadBalancerSku = "standard",
},
// RBAC
EnableRBAC = true,
// Pod Security Standards
SecurityProfile = new ManagedClusterSecurityProfileArgs
{
WorkloadIdentity = new ManagedClusterSecurityProfileWorkloadIdentityArgs
{
Enabled = true,
},
},
// Addon Profiles
AddonProfiles = new()
{
["httpApplicationRouting"] = new ManagedClusterAddonProfileArgs
{
Enabled = false,
},
["omsagent"] = new ManagedClusterAddonProfileArgs
{
Enabled = true,
Config = new()
{
["logAnalyticsWorkspaceResourceID"] = config.Require("logAnalyticsWorkspaceId"),
},
},
},
// Auto Upgrade Channel
AutoUpgradeProfile = new ManagedClusterAutoUpgradeProfileArgs
{
UpgradeChannel = environment == "production"
? UpgradeChannel.Stable
: UpgradeChannel.Rapid,
},
Tags = new()
{
{ "Environment", environment },
{ "ManagedBy", "pulumi" },
{ "Project", "ATP" },
},
});
this.Cluster = cluster;
this.KubeConfig = Output.Tuple(resourceGroup.Name, cluster.Name)
.Apply(names => Output.CreateSecret(GetKubeConfig(names.Item1, names.Item2)));
}
private static string GetKubeConfig(string resourceGroupName, string clusterName)
{
// Placeholder: in practice, call ListManagedClusterUserCredentials
// (Pulumi.AzureNative.ContainerService) and base64-decode the returned kubeconfig,
// or use Pulumi's Kubernetes provider directly
return "";
}
}
Azure CNI vs Kubenet¶
Network Plugin Comparison:
| Feature | Azure CNI | Kubenet |
|---|---|---|
| IP Address Management | Pod IPs from VNet subnet | Pod IPs NAT through node IP |
| Pod Networking | Direct VNet connectivity | Overlay network |
| Max Pods per Node | Up to 250 (configurable) | 110 (fixed) |
| Network Policies | Azure Network Policies or Calico | Calico only |
| VNet Integration | Native VNet integration | Requires route table |
| Performance | Lower latency | Slight NAT overhead |
| Complexity | More complex subnet planning | Simpler setup |
ATP Selection: Azure CNI
Rationale:
- ✅ Native Azure networking for better integration
- ✅ Direct pod-to-VNet connectivity for Azure services
- ✅ Support for Azure Network Policies
- ✅ Better performance for high-throughput workloads
- ✅ Required for advanced features (Private AKS, etc.)
Azure CNI Configuration:
NetworkProfile = new ContainerServiceNetworkProfileArgs
{
NetworkPlugin = "azure",
NetworkPolicy = "azure", // Azure Network Policies
ServiceCidr = "10.1.0.0/16",
DnsServiceIP = "10.1.0.10",
LoadBalancerSku = "standard",
PodCidr = null, // Not used with Azure CNI
}
Managed Identity Setup¶
User Assigned Managed Identity:
// Create User Assigned Managed Identity
var identity = new ManagedServiceIdentity.UserAssignedIdentity(
$"atp-{environment}-aks-identity", new()
{
ResourceGroupName = resourceGroup.Name,
Location = location,
Tags = new()
{
{ "Environment", environment },
{ "ManagedBy", "pulumi" },
},
});
// Grant permissions to identity
var acrRoleAssignment = new Authorization.RoleAssignment(
"aks-acr-role-assignment", new()
{
PrincipalId = identity.PrincipalId,
PrincipalType = "ServicePrincipal",
RoleDefinitionId = "/subscriptions/{subscriptionId}/providers/Microsoft.Authorization/roleDefinitions/7f951dda-4ed3-4680-a7ca-43fe172d538d", // AcrPull
Scope = acr.Id,
});
Azure Monitor Integration¶
Container Insights Configuration:
// Log Analytics Workspace
var logAnalyticsWorkspace = new OperationalInsights.Workspace(
$"atp-{environment}-loganalytics", new()
{
ResourceGroupName = resourceGroup.Name,
Location = location,
Sku = new OperationalInsights.Inputs.WorkspaceSkuArgs
{
Name = "PerGB2018",
},
});
// Enable Container Insights on AKS
var containerInsights = new ContainerService.Inputs.ManagedClusterAddonProfileArgs
{
Enabled = true,
Config = new()
{
["logAnalyticsWorkspaceResourceID"] = logAnalyticsWorkspace.Id,
},
};
// Add to cluster
AddonProfiles = new()
{
["omsagent"] = containerInsights,
}
Network Policies and Security Groups¶
Network Security Group:
// Network Security Group for AKS subnet
var nsg = new NetworkSecurityGroup($"atp-{environment}-aks-nsg", new()
{
ResourceGroupName = resourceGroup.Name,
Location = location,
SecurityRules = new()
{
// Allow inbound from Load Balancer
new SecurityRuleArgs // Pulumi.AzureNative.Network.Inputs.SecurityRuleArgs
{
Name = "Allow-LoadBalancer-Inbound",
Priority = 1000,
Direction = "Inbound",
Access = "Allow",
Protocol = "Tcp",
SourcePortRange = "*",
DestinationPortRange = "*",
SourceAddressPrefix = "AzureLoadBalancer",
DestinationAddressPrefix = "*",
},
// Allow outbound to Internet
new SecurityRuleArgs
{
Name = "Allow-Internet-Outbound",
Priority = 1000,
Direction = "Outbound",
Access = "Allow",
Protocol = "Tcp",
SourcePortRange = "*",
DestinationPortRange = "*",
SourceAddressPrefix = "*",
DestinationAddressPrefix = "Internet",
},
},
Tags = new()
{
{ "Environment", environment },
{ "ManagedBy", "pulumi" },
},
});
// Associate NSG with the subnet (in practice, set this on the original subnet
// definition — declaring a second Subnet with the same prefix would conflict)
var subnetWithNsg = new Subnet($"atp-{environment}-aks-subnet-nsg", new()
{
    ResourceGroupName = resourceGroup.Name,
    VirtualNetworkName = vnet.Name,
    AddressPrefix = "10.0.1.0/24",
    // Nested args object; there is no flat NetworkSecurityGroupId property
    NetworkSecurityGroup = new Pulumi.AzureNative.Network.Inputs.NetworkSecurityGroupArgs
    {
        Id = nsg.Id,
    },
Delegations = new()
{
new SubnetDelegationArgs
{
Name = "Microsoft.ContainerService.managedClusters",
ServiceDelegation = new ServiceDelegationArgs
{
Name = "Microsoft.ContainerService/managedClusters",
},
},
},
});
Azure Resource Provisioning¶
Azure Container Registry (ACR)¶
ACR Provisioning:
// infrastructure/ACR.cs
using Pulumi;
using Pulumi.AzureNative; // brings ContainerRegistry.* into scope
public class AzureContainerRegistry
{
public ContainerRegistry.Registry Registry { get; }
public AzureContainerRegistry(Pulumi.Stack stack, string environment, string location)
{
var config = new Config();
var resourceGroupName = $"atp-{environment}-rg";
var acrName = $"atp{environment}acr".Replace("-", ""); // ACR names must be alphanumeric
this.Registry = new ContainerRegistry.Registry($"atp-{environment}-acr", new()
{
    RegistryName = acrName, // use the sanitized name computed above
    ResourceGroupName = resourceGroupName,
    Location = location,
    Sku = new ContainerRegistry.Inputs.SkuArgs
    {
        Name = config.Get("acrSku") ?? "Basic",
    },
    AdminUserEnabled = environment != "production", // Disable the admin user for production
PublicNetworkAccess = config.GetBool("enablePrivateEndpoint") == true
? "Disabled"
: "Enabled",
Tags = new()
{
{ "Environment", environment },
{ "ManagedBy", "pulumi" },
{ "Project", "ATP" },
},
});
// Enable geo-replication for production
if (environment == "production" && config.GetBool("enableGeoReplication") == true)
{
new ContainerRegistry.Replication($"atp-production-acr-westus2", new()
{
ResourceGroupName = resourceGroupName,
RegistryName = this.Registry.Name,
Location = "westus2",
Tags = new()
{
{ "Environment", environment },
{ "ManagedBy", "pulumi" },
},
});
}
}
}
Azure Key Vault¶
Key Vault Provisioning:
// infrastructure/KeyVault.cs
using Pulumi;
using KV = Pulumi.AzureNative.KeyVault; // alias: the namespace would otherwise clash with this class name
public class KeyVault
{
    public KV.Vault Vault { get; }
public KeyVault(Pulumi.Stack stack, string environment, string location,
Output<string> accessKey)
{
var config = new Config();
var resourceGroupName = $"atp-{environment}-rg";
var keyVaultName = $"atp-{environment}-kv";
// Key Vault
this.Vault = new KV.Vault($"atp-{environment}-kv", new()
{
ResourceGroupName = resourceGroupName,
Location = location,
Properties = new KV.Inputs.VaultPropertiesArgs
{
TenantId = config.Require("tenantId"),
Sku = new KV.Inputs.SkuArgs
{
Family = "A",
Name = config.Get("keyVaultSku") ?? "standard",
},
EnabledForDeployment = false,
EnabledForTemplateDeployment = false,
EnabledForDiskEncryption = false,
EnableRbacAuthorization = true,
PublicNetworkAccess = config.GetBool("enablePrivateEndpoint") == true
? "Disabled"
: "Enabled",
},
Tags = new()
{
{ "Environment", environment },
{ "ManagedBy", "pulumi" },
},
});
// Store initial secret
new KeyVault.Secret("keyVaultAccessKey", new()
{
ResourceGroupName = resourceGroupName,
VaultName = this.Vault.Name,
Properties = new KV.Inputs.SecretPropertiesArgs
{
Value = accessKey,
},
});
}
}
Azure Service Bus¶
Service Bus Provisioning:
// infrastructure/ServiceBus.cs
using Pulumi;
using SB = Pulumi.AzureNative.ServiceBus; // alias: the namespace would otherwise clash with this class name
public class ServiceBus
{
    public SB.Namespace Namespace { get; }
public ServiceBus(Pulumi.Stack stack, string environment, string location)
{
var config = new Config();
var resourceGroupName = $"atp-{environment}-rg";
var serviceBusName = $"atp-{environment}-sb";
this.Namespace = new SB.Namespace($"atp-{environment}-sb", new()
{
ResourceGroupName = resourceGroupName,
Location = location,
Sku = new SB.Inputs.SBSkuArgs
{
Name = environment == "production" ? "Premium" : "Standard",
Tier = environment == "production" ? "Premium" : "Standard",
},
Tags = new()
{
{ "Environment", environment },
{ "ManagedBy", "pulumi" },
},
});
// Create queues
var queues = new[] { "audit-events", "export-requests", "notifications" };
foreach (var queueName in queues)
{
new SB.Queue($"{environment}-{queueName}", new()
{
ResourceGroupName = resourceGroupName,
NamespaceName = this.Namespace.Name,
QueueName = queueName, // azure-native uses QueueName rather than Name
EnablePartitioning = environment == "production",
MaxDeliveryCount = 10,
LockDuration = "PT5M",
DefaultMessageTimeToLive = "P7D",
});
}
}
}
Azure Storage Accounts (Blob, Queue)¶
Storage Account Provisioning:
// infrastructure/StorageAccount.cs
using Pulumi;
using Storage = Pulumi.AzureNative.Storage; // alias keeps the Storage.* references below unambiguous
public class StorageAccount
{
    public Storage.StorageAccount Account { get; }
public StorageAccount(Pulumi.Stack stack, string environment, string location)
{
var config = new Config();
var resourceGroupName = $"atp-{environment}-rg";
var storageName = $"atp{environment}st".Replace("-", ""); // Must be lowercase, alphanumeric
this.Account = new Storage.StorageAccount($"atp-{environment}-st", new()
{
ResourceGroupName = resourceGroupName,
Location = location,
AccountName = storageName,
Kind = "StorageV2",
Sku = new Storage.Inputs.SkuArgs
{
Name = environment == "production" ? "Standard_GRS" : "Standard_LRS",
},
EnableHttpsTrafficOnly = true,
AllowBlobPublicAccess = false,
MinimumTlsVersion = "TLS1_2",
NetworkRuleSet = new Storage.Inputs.NetworkRuleSetArgs
{
DefaultAction = config.GetBool("enablePrivateEndpoint") == true
? "Deny"
: "Allow",
Bypass = "AzureServices",
},
Tags = new()
{
{ "Environment", environment },
{ "ManagedBy", "pulumi" },
},
});
// Blob Container
new Storage.BlobContainer("audit-trail", new()
{
ResourceGroupName = resourceGroupName,
AccountName = this.Account.Name,
ContainerName = "audit-trail",
PublicAccess = "None",
});
// Queue
new Storage.Queue("audit-processing", new()
{
ResourceGroupName = resourceGroupName,
AccountName = this.Account.Name,
QueueName = "audit-processing",
});
}
}
Application Insights / Log Analytics¶
Application Insights and Log Analytics:
// infrastructure/Monitoring.cs
using Pulumi;
using Pulumi.AzureNative; // brings OperationalInsights.* and Insights.* into scope
public class Monitoring
{
    public OperationalInsights.Workspace LogAnalyticsWorkspace { get; }
    public Insights.Component ApplicationInsights { get; }
public Monitoring(Pulumi.Stack stack, string environment, string location)
{
var config = new Config();
var resourceGroupName = $"atp-{environment}-rg";
// Log Analytics Workspace
this.LogAnalyticsWorkspace = new OperationalInsights.Workspace(
$"atp-{environment}-loganalytics", new()
{
ResourceGroupName = resourceGroupName,
Location = location,
Sku = new OperationalInsights.Inputs.WorkspaceSkuArgs
{
Name = "PerGB2018",
},
RetentionInDays = environment == "production" ? 730 : 30,
Tags = new()
{
{ "Environment", environment },
{ "ManagedBy", "pulumi" },
},
});
// Application Insights
this.ApplicationInsights = new Insights.Component($"atp-{environment}-appinsights", new()
{
ResourceGroupName = resourceGroupName,
Location = location,
Kind = "web",
ApplicationType = "web",
WorkspaceResourceId = this.LogAnalyticsWorkspace.Id,
RetentionInDays = environment == "production" ? 730 : 30,
Tags = new()
{
{ "Environment", environment },
{ "ManagedBy", "pulumi" },
},
});
}
}
Pulumi State Management¶
State Backend in Azure Blob Storage¶
Azure Blob Storage Backend Configuration:
# Initialize Pulumi with the Azure Blob Storage backend
# (the azblob backend reads AZURE_STORAGE_ACCOUNT plus AZURE_STORAGE_KEY or AZURE_STORAGE_SAS_TOKEN)
export AZURE_STORAGE_ACCOUNT=atppulumistate
pulumi login azblob://pulumi-state
Backend Configuration (Pulumi.yaml):
# Alternatively, pin the backend in the project file
backend:
  url: azblob://pulumi-state
Backend Setup:
# Create storage account for state (one-time setup)
az storage account create \
--name atppulumistate \
--resource-group atp-shared-rg \
--location eastus \
--sku Standard_LRS \
--allow-blob-public-access false
# Create container
az storage container create \
--name pulumi-state \
--account-name atppulumistate \
--auth-mode login
# Log in to the blob-backed state backend
pulumi login azblob://pulumi-state
State Locking Mechanisms¶
State Locking:
- Automatic Locking: Pulumi automatically locks the stack during operations
- Lock Files: The self-managed (DIY) backend writes lock files under .pulumi/locks/ in the state container
- Stale Locks: An interrupted operation can leave a stale lock behind
- Lock Release: Released automatically when the operation completes; stale locks are cleared with pulumi cancel
Manual Lock Management:
# A held lock shows up as a lock error on the next operation;
# there is no dedicated lock-status command
# Force-release a stuck lock
pulumi cancel --stack production
State Encryption¶
State Encryption at Rest:
# Enable encryption on storage account
az storage account update \
--name atppulumistate \
--resource-group atp-shared-rg \
--encryption-services blob
# Use Azure Key Vault for encryption keys
az storage account update \
--name atppulumistate \
--resource-group atp-shared-rg \
--encryption-key-source Microsoft.Keyvault \
--encryption-key-vault "https://atp-shared-kv.vault.azure.net" \
--encryption-key-name storage-encryption
Encrypted Secrets in State:
// Secrets are automatically encrypted in state
var password = config.RequireSecret("sqlAdminPassword");
// This value is encrypted in the state file
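A minimal sketch of how secretness propagates (stack and config names are illustrative): any output derived from a secret is itself secret, so the composed connection string below is also encrypted in state:
using Pulumi;

class SecretExampleStack : Stack // illustrative stack name
{
    [Output] public Output<string> ConnectionString { get; private set; }

    public SecretExampleStack()
    {
        var config = new Config();
        // RequireSecret returns an Output<string> flagged as secret
        var password = config.RequireSecret("sqlAdminPassword");
        // Output.Format propagates secretness to the derived value,
        // so the full connection string is encrypted in the state file too
        ConnectionString = Output.Format(
            $"Server=tcp:atp-production-sql.database.windows.net,1433;Password={password};");
    }
}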
Backup and Recovery¶
State Backup Strategy:
# Enable blob versioning
az storage account blob-service-properties update \
--account-name atppulumistate \
--resource-group atp-shared-rg \
--enable-versioning true
# Enable soft delete
az storage account blob-service-properties update \
--account-name atppulumistate \
--resource-group atp-shared-rg \
--enable-delete-retention true \
--delete-retention-days 30
# Manual backup (stack state lives under .pulumi/stacks/<project>/ in the container)
az storage blob download \
--account-name atppulumistate \
--container-name pulumi-state \
--name .pulumi/stacks/atp-infrastructure/production.json \
--file backup-$(date +%Y%m%d)-production.json
State Recovery:
# Restore from backup
az storage blob upload \
--account-name atppulumistate \
--container-name pulumi-state \
--name .pulumi/stacks/atp-infrastructure/production.json \
--file backup-20240115-production.json \
--overwrite
GitOps Workflow for Infrastructure¶
Infrastructure Changes via PR¶
PR Workflow for Infrastructure:
graph LR
A[Developer] -->|Create PR| B[Infrastructure Code<br/>in Git]
B -->|PR Validation| C[Pulumi Preview]
C -->|Lint & Validate| D[Security Scan]
D -->|Review| E[Manual Approval]
E -->|Merge| F[Pulumi Up]
F -->|Update State| G[Azure Resources<br/>Provisioned]
style B fill:#90EE90
style C fill:#FFE5B4
style D fill:#FFE5B4
style E fill:#ffcccc
style F fill:#90EE90
style G fill:#ffcccc
Pulumi Preview in PR Validation¶
Azure Pipeline: PR Validation:
# .azuredevops/pipelines/infrastructure-pr-validation.yml
trigger: none
pr:
branches:
include:
- main
- staging
- test
- dev
pool:
vmImage: 'ubuntu-latest'
variables:
- group: ATP-Infrastructure-Variables
stages:
- stage: ValidateInfrastructure
displayName: 'Validate Infrastructure Changes'
jobs:
- job: PulumiPreview
displayName: 'Pulumi Preview'
steps:
- checkout: self
# Determine stack from the PR target branch
# (in PR builds, Build.SourceBranch is the refs/pull/*/merge ref)
- script: |
case "$(System.PullRequest.TargetBranch)" in
refs/heads/main)
STACK="production"
;;
refs/heads/staging)
STACK="staging"
;;
refs/heads/test)
STACK="test"
;;
*)
STACK="dev"
;;
esac
echo "##vso[task.setvariable variable=PulumiStack]$STACK"
displayName: 'Determine Pulumi stack'
# Install Pulumi
- script: |
curl -fsSL https://get.pulumi.com | sh
export PATH="$HOME/.pulumi/bin:$PATH"
pulumi version
displayName: 'Install Pulumi'
# Restore .NET dependencies
- script: |
dotnet restore
displayName: 'Restore .NET dependencies'
# Set stack configuration
- script: |
export PATH="$HOME/.pulumi/bin:$PATH"
pulumi stack select $(PulumiStack)
pulumi config set azure-native:location $(AzureLocation)
pulumi config set environment $(PulumiStack)
displayName: 'Configure Pulumi stack'
env:
PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
PULUMI_CONFIG_PASSPHRASE: $(PulumiConfigPassphrase)
# Run Pulumi preview
- script: |
export PATH="$HOME/.pulumi/bin:$PATH"
pulumi preview --stack $(PulumiStack) \
--diff \
--json > preview-output.json
displayName: 'Run Pulumi preview'
continueOnError: true
env:
PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
PULUMI_CONFIG_PASSPHRASE: $(PulumiConfigPassphrase)
# Publish preview output
- task: PublishPipelineArtifact@1
condition: always()
inputs:
targetPath: 'preview-output.json'
artifactName: 'pulumi-preview-$(PulumiStack)'
publishLocation: 'pipeline'
# Validate preview output
- script: |
if [ -s preview-output.json ]; then
echo "✅ Preview generated successfully"
# Check for destroy operations (require special approval)
if grep -q '"steps".*"delete"' preview-output.json; then
echo "⚠️ WARNING: Preview contains resource deletions"
exit 1
fi
else
echo "❌ Preview failed or produced no output"
exit 1
fi
displayName: 'Validate preview output'
Manual Approval for Infrastructure Changes¶
Approval Gates:
# Azure Pipeline: Infrastructure deployment
trigger:
branches:
include:
- main
- staging
stages:
- stage: ApprovalGate
displayName: 'Infrastructure Change Approval'
jobs:
- job: WaitForApproval
displayName: 'Wait for Approval'
pool: server
steps:
- task: ManualValidation@0
timeoutInMinutes: 1440 # 24 hours
inputs:
notifyUsers: 'architect-team@connectsoft.example;sre-lead@connectsoft.example'
instructions: |
Review the Pulumi preview output before approving infrastructure changes.
⚠️ WARNING: Infrastructure changes can affect production services.
Please verify:
- Resource changes are expected
- No unintended resource deletions
- Configuration values are correct
- Cost impact is acceptable
- stage: DeployInfrastructure
displayName: 'Deploy Infrastructure'
dependsOn: ApprovalGate
condition: succeeded()
jobs:
- job: PulumiUp
steps:
# ... Pulumi up steps
Pulumi Up After Approval¶
Deployment Stage:
- stage: DeployInfrastructure
displayName: 'Deploy Infrastructure'
jobs:
- job: PulumiUp
displayName: 'Pulumi Up'
steps:
- checkout: self
- script: |
curl -fsSL https://get.pulumi.com | sh
export PATH="$HOME/.pulumi/bin:$PATH"
displayName: 'Install Pulumi'
- script: |
dotnet restore
dotnet build
displayName: 'Build Pulumi program'
- script: |
export PATH="$HOME/.pulumi/bin:$PATH"
pulumi stack select $(PulumiStack)
pulumi config set azure-native:location $(AzureLocation)
displayName: 'Configure stack'
env:
PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
PULUMI_CONFIG_PASSPHRASE: $(PulumiConfigPassphrase)
- script: |
export PATH="$HOME/.pulumi/bin:$PATH"
pulumi up --stack $(PulumiStack) \
--yes \
--skip-preview
displayName: 'Deploy infrastructure'
env:
PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
PULUMI_CONFIG_PASSPHRASE: $(PulumiConfigPassphrase)
Infrastructure Drift Detection¶
Drift Detection Script:
#!/bin/bash
# scripts/detect-infrastructure-drift.sh
set -euo pipefail
STACK="${1:-production}"
echo "🔍 Detecting infrastructure drift for stack: $STACK"
# Refresh state
pulumi refresh --stack "$STACK" --yes
# Check for drift; --expect-no-changes exits non-zero when changes exist
if pulumi preview --stack "$STACK" --diff --expect-no-changes > drift-diff.txt; then
echo "✅ No infrastructure drift detected"
exit 0
else
echo "⚠️ Infrastructure drift detected!"
cat drift-diff.txt
# Send alert
echo "🚨 Alert: Infrastructure drift detected in $STACK stack"
exit 1
fi
Scheduled Drift Detection:
# Azure Pipeline: Scheduled drift detection
schedules:
- cron: "0 2 * * *" # Daily at 2 AM UTC
branches:
include:
- main
displayName: 'Daily Infrastructure Drift Detection'
stages:
- stage: DriftDetection
displayName: 'Detect Infrastructure Drift'
jobs:
- job: CheckDrift
steps:
- script: |
./scripts/detect-infrastructure-drift.sh production
displayName: 'Detect drift'
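The same check can be driven programmatically with the Automation API; a hedged sketch, where the stack name and program path are assumptions:
using System;
using System.Linq;
using System.Threading.Tasks;
using Pulumi.Automation;

public static class DriftCheck
{
    public static async Task<bool> HasDriftAsync()
    {
        // Select the existing local-program stack (path is illustrative)
        var stack = await LocalWorkspace.SelectStackAsync(
            new LocalProgramArgs("production", "/src/atp-infrastructure"));

        await stack.RefreshAsync();               // sync state with Azure
        var preview = await stack.PreviewAsync(); // diff desired vs. actual

        // Any operation other than "same" indicates drift
        return preview.ChangeSummary
            .Any(kv => kv.Key != OperationType.Same && kv.Value > 0);
    }
}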
Pulumi Automation API¶
Programmatic Infrastructure Management¶
Pulumi Automation API Example:
// infrastructure/Automation.cs
using Pulumi.Automation;
public class InfrastructureAutomation
{
public static async Task<UpResult?> UpdateInfrastructureAsync(string stackName)
{
// Create or select the stack for an inline program
var stack = await LocalWorkspace.CreateOrSelectStackAsync(
new InlineProgramArgs("atp-infrastructure", stackName,
PulumiFn.Create<ATPStack>()));
// Set stack configuration
await stack.SetConfigAsync("azure-native:location",
new ConfigValue("eastus"));
// Preview changes
var preview = await stack.PreviewAsync(new PreviewOptions
{
OnStandardOutput = Console.WriteLine,
});
if (preview.ChangeSummary.ContainsKey(OperationType.Create) ||
preview.ChangeSummary.ContainsKey(OperationType.Update))
{
// Deploy changes
var update = await stack.UpAsync(new UpOptions
{
OnStandardOutput = Console.WriteLine,
});
return update;
}
return null;
}
}
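A possible caller for the helper above (stack name illustrative):
// Hypothetical caller: apply pending changes to the staging stack
var result = await InfrastructureAutomation.UpdateInfrastructureAsync("staging");
Console.WriteLine(result is null
    ? "No create/update operations were pending"
    : $"Update finished with status: {result.Summary.Result}");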
Dynamic Infrastructure Provisioning¶
Dynamic Resource Creation:
// Create resources dynamically based on configuration
// (inside the ATPStack constructor, so `this` is the current stack)
var environments = new[] { "dev", "test", "staging", "production" };
foreach (var env in environments)
{
// Create environment-specific resources
var aks = new AKSCluster(this, env, "eastus", 3, "Standard_D2s_v3");
var acr = new AzureContainerRegistry(this, env, "eastus");
}
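The environment list can also come from stack configuration instead of being hard-coded; a sketch assuming a config key named environments and the same component classes as above:
// Sketch: drive environment creation from stack configuration,
// e.g. pulumi config set --path 'environments[0]' dev
var config = new Pulumi.Config();
var environments = config.GetObject<string[]>("environments") ?? new[] { "dev" };
foreach (var env in environments)
{
    // Same component constructors as in the loop above
    var aks = new AKSCluster(this, env, "eastus", 3, "Standard_D2s_v3");
    var acr = new AzureContainerRegistry(this, env, "eastus");
}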
Pulumi Policy as Code¶
Resource Validation Policies¶
Pulumi Policy Example:
// policies/enforce-tagging.ts
import { PolicyPack } from "@pulumi/policy";
const policies = new PolicyPack("atp-tagging-policies", {
policies: [{
name: "require-environment-tag",
description: "All resources must have an Environment tag",
enforcementLevel: "mandatory",
validateResource: (args, reportViolation) => {
const tags = args.props.tags || {};
if (!tags.Environment) {
reportViolation("Resource must have an Environment tag");
}
},
}, {
name: "require-managedby-tag",
description: "All resources must have a ManagedBy tag",
enforcementLevel: "mandatory",
validateResource: (args, reportViolation) => {
const tags = args.props.tags || {};
if (tags.ManagedBy !== "pulumi") {
reportViolation("Resource must have ManagedBy=pulumi tag");
}
},
}],
});
Compliance Checks (Tagging, Encryption, etc.)¶
Compliance Policies:
// policies/compliance-policies.ts
import { PolicyPack } from "@pulumi/policy";
const policies = new PolicyPack("atp-compliance-policies", {
policies: [{
name: "require-encryption-at-rest",
description: "Storage accounts must have encryption enabled",
enforcementLevel: "mandatory",
validateResource: (args, reportViolation) => {
if (args.type === "azure-native:storage:StorageAccount") {
if (!args.props.enableHttpsTrafficOnly) {
reportViolation("Storage account must have HTTPS-only traffic enabled");
}
}
},
}, {
name: "prevent-public-blob-access",
description: "Storage accounts must not allow public blob access",
enforcementLevel: "mandatory",
validateResource: (args, reportViolation) => {
if (args.type === "azure-native:storage:StorageAccount") {
if (args.props.allowBlobPublicAccess) {
reportViolation("Storage account must not allow public blob access");
}
}
},
}],
});
Cost Controls (SKU Limits, Region Restrictions)¶
Cost Control Policies:
// policies/cost-control-policies.ts
import { PolicyPack } from "@pulumi/policy";
const policies = new PolicyPack("atp-cost-control-policies", {
policies: [{
name: "limit-aks-node-vm-size",
description: "AKS node VM size must not exceed Standard_D4s_v3",
enforcementLevel: "mandatory",
validateResource: (args, reportViolation) => {
if (args.type === "azure-native:containerservice:ManagedCluster") {
const agentPools = args.props.agentPoolProfiles || [];
// Allow-list approach (sizes shown are illustrative; adjust to the approved set)
const allowedSizes = ["Standard_B2s", "Standard_D2s_v3", "Standard_D4s_v3"];
for (const pool of agentPools) {
if (pool.vmSize && !allowedSizes.includes(pool.vmSize)) {
reportViolation(`VM size ${pool.vmSize} exceeds maximum allowed (Standard_D4s_v3)`);
}
}
}
},
}, {
name: "restrict-regions",
description: "Resources must be deployed only in approved regions",
enforcementLevel: "mandatory",
validateResource: (args, reportViolation) => {
const allowedRegions = ["eastus", "westus2"];
const location = args.props.location;
if (location && !allowedRegions.includes(location)) {
reportViolation(`Region ${location} is not in the approved list: ${allowedRegions.join(", ")}`);
}
},
}],
});
Apply Policies:
# Publish the policy pack to the Pulumi service
pulumi policy publish ConnectSoft
# Enable it for the organization
pulumi policy enable ConnectSoft/atp-policies latest
# Validate locally against the policy pack source
pulumi preview --policy-pack ./policies
Infrastructure Drift Detection¶
Detecting Out-of-Band Changes¶
Drift Detection Workflow:
#!/bin/bash
# scripts/detect-drift.sh
STACK="${1:-production}"
echo "🔄 Refreshing state to detect drift..."
# Refresh state from Azure
pulumi refresh --stack "$STACK" --yes
# Preview changes; --expect-no-changes exits non-zero when drift exists
if pulumi preview --stack "$STACK" --diff --expect-no-changes > drift-report.txt; then
echo "✅ No drift detected"
else
echo "⚠️ Drift detected!"
cat drift-report.txt
# Send alert
echo "🚨 Infrastructure drift detected in $STACK stack" | \
mail -s "Infrastructure Drift Alert" sre-team@connectsoft.example
fi
Pulumi Refresh and Diff¶
Refresh and Diff Commands:
# Refresh state from actual Azure resources
pulumi refresh --stack production
# Preview differences (drift)
pulumi preview --stack production --diff
# List stack resources with their URNs
pulumi stack --show-urns --stack production
Automated Drift Correction or Alerts¶
Automated Drift Correction:
# Azure Pipeline: Automated drift correction
schedules:
- cron: "0 3 * * *" # Daily at 3 AM UTC
branches:
include:
- main
stages:
- stage: DriftCorrection
displayName: 'Automated Drift Correction'
jobs:
- job: CorrectDrift
steps:
- script: |
pulumi refresh --stack production --yes
if pulumi preview --stack production --diff --expect-no-changes > drift-diff.txt; then
echo "✅ No drift detected"
else
# Check if drift is auto-correctable
if grep -q "tags" drift-diff.txt && ! grep -q "delete" drift-diff.txt; then
echo "✅ Auto-correcting drift (tags only)"
pulumi up --stack production --yes
else
echo "⚠️ Manual intervention required"
# Send alert
exit 1
fi
fi
displayName: 'Correct drift'
Disaster Recovery¶
Infrastructure Re-Provisioning from Git¶
DR Procedure:
#!/bin/bash
# scripts/disaster-recovery.sh
ENVIRONMENT="${1:-production}"
RESOURCE_GROUP="${2:-atp-production-rg}"
echo "🚨 Starting disaster recovery for $ENVIRONMENT..."
# 1. Verify Git repository is accessible
git clone https://dev.azure.com/ConnectSoft/ATP/_git/atp-infrastructure.git
cd atp-infrastructure
# 2. Log in to the state backend (restore state first if it was lost)
pulumi login azblob://pulumi-state
# 3. Select Pulumi stack
pulumi stack select "$ENVIRONMENT"
# 4. Re-provision infrastructure
pulumi up --stack "$ENVIRONMENT" --yes
echo "✅ Disaster recovery complete"
RTO and RPO Targets¶
DR Targets:
| Metric | Target | Notes |
|---|---|---|
| RTO | 4 hours | Time to restore infrastructure |
| RPO | 24 hours | Maximum acceptable data loss |
| State Recovery | 1 hour | Time to restore Pulumi state |
| Infrastructure Provisioning | 2 hours | Time to provision all resources |
| Application Deployment | 1 hour | Time to deploy applications via GitOps |
DR Drill Procedures¶
DR Drill Checklist:
## Disaster Recovery Drill Checklist
### Pre-Drill
- [ ] Schedule DR drill (quarterly)
- [ ] Notify stakeholders
- [ ] Backup current state
- [ ] Document current infrastructure state
### Drill Execution
- [ ] Simulate disaster scenario
- [ ] Verify Git repository accessibility
- [ ] Restore Pulumi state (if needed)
- [ ] Re-provision infrastructure
- [ ] Verify resource provisioning
- [ ] Deploy applications via GitOps
- [ ] Run smoke tests
- [ ] Verify application functionality
### Post-Drill
- [ ] Document findings
- [ ] Update DR procedures
- [ ] Review RTO/RPO targets
- [ ] Schedule next drill
DR Drill Script:
#!/bin/bash
# scripts/dr-drill.sh
ENVIRONMENT="${1:-staging}" # Use staging for drills
echo "🎯 Starting DR drill for $ENVIRONMENT environment..."
# 1. Backup current state
echo "📦 Backing up current state..."
az storage blob download \
--account-name atppulumistate \
--container-name pulumi-state \
--name "$ENVIRONMENT.json" \
--file "backup-$(date +%Y%m%d)-$ENVIRONMENT.json"
# 2. Destroy infrastructure (simulate disaster)
echo "💥 Simulating disaster (destroying infrastructure)..."
read -p "Are you sure? (yes/no): " confirm
if [ "$confirm" == "yes" ]; then
pulumi destroy --stack "$ENVIRONMENT" --yes
fi
# 3. Re-provision from Git
echo "🔨 Re-provisioning infrastructure..."
pulumi up --stack "$ENVIRONMENT" --yes
# 4. Verify
echo "✅ DR drill complete. Verify infrastructure is operational."
Summary: Pulumi Infrastructure as Code Integration¶
- Pulumi Overview: C# programming model for ATP infrastructure with type safety and testability
- Stack Management: Environment-specific stacks (dev, test, staging, production) with configuration and secrets
- AKS Provisioning: Complete cluster configuration with node pools, networking (Azure CNI), managed identity, Azure Monitor integration
- Azure Resources: ACR, Key Vault, Service Bus, Storage Accounts, Application Insights/Log Analytics
- State Management: Azure Blob Storage backend with locking, encryption, and backup/recovery
- GitOps Workflow: Infrastructure changes via PR, Pulumi preview validation, manual approval, automated deployment
- Automation API: Programmatic infrastructure management and dynamic provisioning
- Policy as Code: Resource validation, compliance checks, cost controls
- Drift Detection: Automated detection and correction of out-of-band changes
- Disaster Recovery: Infrastructure re-provisioning from Git with RTO/RPO targets and DR drill procedures
Azure Key Vault Secret Management¶
Purpose: Define how Azure Key Vault is used for secure secret management in ATP, integrating with Kubernetes workloads via Workload Identity, External Secrets Operator, and CSI Driver to ensure secrets are never stored in Git and are securely injected into pods at runtime.
Azure Key Vault Architecture¶
Key Vault per Environment¶
Key Vault Organization:
| Environment | Key Vault Name | Resource Group | Purpose |
|---|---|---|---|
| Dev | atp-dev-kv | atp-dev-rg | Development secrets |
| Test | atp-test-kv | atp-test-rg | Testing secrets |
| Staging | atp-staging-kv | atp-staging-rg | Pre-production secrets |
| Production | atp-prod-kv | atp-prod-rg | Production secrets |
| Shared | atp-shared-kv | atp-shared-rg | Cross-environment secrets |
Key Vault Provisioning with Pulumi:
// infrastructure/KeyVault.cs
// Named AtpKeyVault to avoid colliding with the KeyVault namespace alias
public class AtpKeyVault
{
public KeyVault.Vault Vault { get; }
public AtpKeyVault(Pulumi.Stack stack, string environment, string location,
Input<string>? subnetId = null)
{
var config = new Config();
var resourceGroupName = $"atp-{environment}-rg";
var keyVaultName = $"atp-{environment}-kv";
this.Vault = new KeyVault.Vault(keyVaultName, new()
{
ResourceGroupName = resourceGroupName,
Location = location,
Properties = new KeyVault.Inputs.VaultPropertiesArgs
{
TenantId = config.Require("tenantId"),
Sku = new KeyVault.Inputs.SkuArgs
{
Family = "A",
Name = environment == "production" ? "premium" : "standard",
},
EnabledForDeployment = false,
EnabledForTemplateDeployment = false,
EnabledForDiskEncryption = false,
EnableRbacAuthorization = true, // Use RBAC instead of access policies
EnableSoftDelete = true,
SoftDeleteRetentionInDays = environment == "production" ? 90 : 7,
EnablePurgeProtection = environment == "production",
PublicNetworkAccess = config.GetBoolean("enablePrivateEndpoint") == true
? "Disabled"
: "Enabled",
},
Tags = new()
{
{ "Environment", environment },
{ "ManagedBy", "pulumi" },
{ "Compliance", "SOC2" },
},
});
// Private endpoint for production (uses the subnetId constructor parameter)
if (environment == "production" && subnetId != null
&& config.GetBoolean("enablePrivateEndpoint") == true)
{
new Network.PrivateEndpoint($"atp-{environment}-kv-pe", new()
{
ResourceGroupName = resourceGroupName,
Location = location,
Subnet = new Network.Inputs.SubnetArgs
{
Id = subnetId,
},
PrivateLinkServiceConnections = new[]
{
new Network.Inputs.PrivateLinkServiceConnectionArgs
{
Name = "keyvault-connection",
PrivateLinkServiceId = this.Vault.Id,
GroupIds = new[] { "vault" },
},
},
});
}
}
}
Secret Organization and Naming¶
Secret Naming Conventions:
Pattern: {category}/{service}/{secret-name}
Note: Key Vault secret names allow only alphanumerics and hyphens, so the / separators form a logical hierarchy; in the vault itself they are flattened to hyphens (e.g. connection-strings--atp-ingestion--sql-connection-string).
Examples:
- connection-strings/atp-ingestion/sql-connection-string
- api-keys/atp-gateway/stripe-api-key
- certificates/atp-gateway/tls-cert
- credentials/atp-query/service-account-password
- encryption-keys/atp-integrity/data-encryption-key
Secret Categories:
atp-{env}-kv/
├── connection-strings/
│ ├── atp-ingestion/sql-connection-string
│ ├── atp-query/redis-connection-string
│ └── atp-export/blob-storage-connection-string
├── api-keys/
│ ├── atp-gateway/stripe-api-key
│ ├── atp-gateway/sendgrid-api-key
│ └── atp-search/elasticsearch-api-key
├── certificates/
│ ├── atp-gateway/tls-cert
│ └── atp-integrity/signing-cert
├── credentials/
│ ├── atp-query/service-account-password
│ └── atp-export/external-api-credentials
└── encryption-keys/
└── atp-integrity/data-encryption-key
Secret Metadata Tags:
// Set secret with metadata
var secret = new KeyVault.Secret("sql-connection-string", new()
{
ResourceGroupName = resourceGroupName,
VaultName = keyVault.Name,
Properties = new KeyVault.Inputs.SecretPropertiesArgs
{
Value = connectionString,
ContentType = "application/json",
Attributes = new KeyVault.Inputs.SecretAttributesArgs
{
Enabled = true,
Expires = (int)DateTimeOffset.UtcNow.AddYears(1).ToUnixTimeSeconds(), // Unix epoch seconds
},
},
Tags = new()
{
{ "Category", "connection-strings" },
{ "Service", "atp-ingestion" },
{ "Environment", environment },
{ "RotatedBy", "automation" },
{ "LastRotated", DateTimeOffset.UtcNow.ToString("O") },
},
});
Access Policies vs RBAC¶
RBAC Configuration (Recommended):
// Grant Key Vault Secrets User role to AKS Workload Identity
var workloadIdentityRoleAssignment = new Authorization.RoleAssignment(
"workload-identity-kv-secrets-user", new()
{
PrincipalId = workloadIdentityPrincipalId,
PrincipalType = "ServicePrincipal",
RoleDefinitionId = "/subscriptions/{subscriptionId}/providers/Microsoft.Authorization/roleDefinitions/4633458b-17de-408a-b874-0445c86b69e6", // Key Vault Secrets User
Scope = keyVault.Id,
});
// Grant Key Vault Secrets Officer role for secret rotation
var rotationRoleAssignment = new Authorization.RoleAssignment(
"rotation-kv-secrets-officer", new()
{
PrincipalId = rotationServicePrincipalId,
PrincipalType = "ServicePrincipal",
RoleDefinitionId = "/subscriptions/{subscriptionId}/providers/Microsoft.Authorization/roleDefinitions/b86a8fe4-44ce-494c-a47a-613bb0b0c8c7", // Key Vault Secrets Officer
Scope = keyVault.Id,
});
RBAC vs Access Policies Comparison:
| Feature | RBAC (Recommended) | Access Policies |
|---|---|---|
| Granularity | Role-based (Key Vault Secrets User, Officer) | Permission-based (get, list, set, delete) |
| Audit Trail | ✅ Better audit logging | ⚠️ Limited |
| Centralized Management | ✅ Azure AD integration | ❌ Vault-specific |
| Least Privilege | ✅ Fine-grained roles | ⚠️ Can be overly permissive |
| Maintenance | ✅ Easier to manage | ❌ Manual per-vault |
ATP Selection: RBAC
Rationale:
- ✅ Better audit trail for compliance
- ✅ Centralized Azure AD management
- ✅ Fine-grained role assignments
- ✅ Easier to maintain and review
Secret Categories¶
Connection Strings (Databases, Service Bus)¶
SQL Database Connection String:
# Set SQL connection string secret
az keyvault secret set \
--vault-name atp-prod-kv \
--name "connection-strings/atp-ingestion/sql-connection-string" \
--value "Server=tcp:atp-prod-sql.database.windows.net,1433;Initial Catalog=ATP;Persist Security Info=False;User ID=atp-ingestion;Password=SecurePassword123!;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;" \
--content-type "application/json" \
--tags Category=connection-strings Service=atp-ingestion Environment=production
Redis Connection String:
# Set Redis connection string secret
az keyvault secret set \
--vault-name atp-prod-kv \
--name "connection-strings/atp-query/redis-connection-string" \
--value "atp-prod-redis.redis.cache.windows.net:6380,password=SecurePassword123!,ssl=True,abortConnect=False" \
--content-type "application/json" \
--tags Category=connection-strings Service=atp-query Environment=production
Service Bus Connection String:
# Set Service Bus connection string secret
az keyvault secret set \
--vault-name atp-prod-kv \
--name "connection-strings/atp-ingestion/servicebus-connection-string" \
--value "Endpoint=sb://atp-prod-sb.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=SecureKey123!" \
--content-type "application/json" \
--tags Category=connection-strings Service=atp-ingestion Environment=production
API Keys and Tokens¶
External API Keys:
# Set Stripe API key
az keyvault secret set \
--vault-name atp-prod-kv \
--name "api-keys/atp-gateway/stripe-api-key" \
--value "sk_live_51ABC123..." \
--content-type "text/plain" \
--tags Category=api-keys Service=atp-gateway Environment=production Provider=stripe
# Set SendGrid API key
az keyvault secret set \
--vault-name atp-prod-kv \
--name "api-keys/atp-gateway/sendgrid-api-key" \
--value "SG.ABC123..." \
--content-type "text/plain" \
--tags Category=api-keys Service=atp-gateway Environment=production Provider=sendgrid
JWT Tokens:
# Set JWT signing key
az keyvault secret set \
--vault-name atp-prod-kv \
--name "api-keys/atp-gateway/jwt-signing-key" \
--value "-----BEGIN PRIVATE KEY-----\nABC123...\n-----END PRIVATE KEY-----" \
--content-type "application/x-pem-file" \
--tags Category=api-keys Service=atp-gateway Environment=production Type=jwt-signing-key
Certificates (TLS, Signing)¶
TLS Certificate:
# Import TLS certificate from file
az keyvault certificate import \
--vault-name atp-prod-kv \
--name "certificates/atp-gateway/tls-cert" \
--file tls-cert.pfx \
--password "SecurePassword123!" \
--tags Category=certificates Service=atp-gateway Environment=production Type=tls
# Or create certificate from CSR
az keyvault certificate create \
--vault-name atp-prod-kv \
--name "certificates/atp-gateway/tls-cert" \
--policy "$(cat cert-policy.json)"
Signing Certificate:
# Import signing certificate
az keyvault certificate import \
--vault-name atp-prod-kv \
--name "certificates/atp-integrity/signing-cert" \
--file signing-cert.pfx \
--password "SecurePassword123!" \
--tags Category=certificates Service=atp-integrity Environment=production Type=signing
Encryption Keys¶
Data Encryption Key:
# Create encryption key
az keyvault key create \
--vault-name atp-prod-kv \
--name "encryption-keys/atp-integrity/data-encryption-key" \
--kty RSA \
--size 2048 \
--ops encrypt decrypt \
--tags Category=encryption-keys Service=atp-integrity Environment=production
Credentials (Service Accounts)¶
Service Account Password:
# Set service account password
az keyvault secret set \
--vault-name atp-prod-kv \
--name "credentials/atp-query/service-account-password" \
--value "SecurePassword123!" \
--content-type "text/plain" \
--tags Category=credentials Service=atp-query Environment=production Type=service-account
Workload Identity for Pods¶
Azure AD Workload Identity Overview¶
Workload Identity Architecture:
graph LR
A[Pod] -->|Token Request| B[Azure AD<br/>OIDC Issuer]
B -->|JWT Token| A
A -->|Authenticate| C[Azure Key Vault]
C -->|Return Secret| A
style A fill:#90EE90
style B fill:#FFE5B4
style C fill:#ffcccc
Benefits of Workload Identity:
- ✅ No secrets stored in Kubernetes
- ✅ Automatic token rotation
- ✅ Fine-grained RBAC permissions
- ✅ Audit trail via Azure AD logs
- ✅ No certificate management
Federated Credentials Configuration¶
Federated Credential Setup:
// Create User Assigned Managed Identity
var workloadIdentity = new ManagedIdentity.UserAssignedIdentity(
"atp-workload-identity", new()
{
ResourceGroupName = resourceGroupName,
Location = location,
});
// Create federated credential for the Kubernetes ServiceAccount.
// The issuer must be the AKS cluster's OIDC issuer URL
// (oidcIssuerProfile.issuerUrl), not the in-cluster API server address.
var federatedCredential = new ManagedIdentity.FederatedIdentityCredential(
"atp-federated-credential", new()
{
ResourceGroupName = resourceGroupName,
ResourceName = workloadIdentity.Name,
Issuer = aksOidcIssuerUrl, // e.g. an output from the AKS cluster resource
Subject = "system:serviceaccount:atp-production:atp-ingestion", // K8s ServiceAccount
Audiences = new[] { "api://AzureADTokenExchange" },
});
Azure CLI Setup:
# Create User Assigned Managed Identity
az identity create \
--name atp-workload-identity \
--resource-group atp-production-rg
# Get the cluster's OIDC issuer URL (cluster name is illustrative)
AKS_OIDC_ISSUER="$(az aks show \
--name atp-production-aks \
--resource-group atp-production-rg \
--query "oidcIssuerProfile.issuerUrl" -o tsv)"
# Create federated credential
az identity federated-credential create \
--name atp-federated-credential \
--identity-name atp-workload-identity \
--resource-group atp-production-rg \
--issuer "$AKS_OIDC_ISSUER" \
--subject "system:serviceaccount:atp-production:atp-ingestion" \
--audiences "api://AzureADTokenExchange"
ServiceAccount Annotation¶
ServiceAccount with Workload Identity:
# apps/atp-ingestion/base/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: atp-ingestion
namespace: atp-production
annotations:
azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789abc" # Managed Identity Client ID
azure.workload.identity/tenant-id: "87654321-4321-4321-4321-cba987654321" # Azure AD Tenant ID
Pod Authentication Flow¶
Pod Configuration with Workload Identity:
# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
namespace: atp-production
spec:
template:
metadata:
labels:
azure.workload.identity/use: "true" # Enable Workload Identity
spec:
serviceAccountName: atp-ingestion
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
env:
# Secret will be injected via External Secrets Operator
- name: SQL_CONNECTION_STRING
valueFrom:
secretKeyRef:
name: sql-connection-string # Created by External Secrets Operator
key: connection-string
Authentication Flow:
- Pod starts with the Workload Identity label and ServiceAccount annotation
- The Azure AD Workload Identity webhook injects the AZURE_CLIENT_ID, AZURE_TENANT_ID, and AZURE_FEDERATED_TOKEN_FILE environment variables
- Pod authenticates to Azure AD using the projected federated token
- Azure AD returns an access token
- Pod uses the access token to access Key Vault (via External Secrets Operator or CSI Driver)
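From application code this flow is transparent; a hedged sketch using Azure.Identity and the Key Vault Secrets SDK, where the flattened secret name is illustrative:
using System;
using Azure.Identity;
using Azure.Security.KeyVault.Secrets;

// DefaultAzureCredential picks up the injected AZURE_CLIENT_ID,
// AZURE_TENANT_ID and AZURE_FEDERATED_TOKEN_FILE variables and
// performs the federated token exchange automatically
var client = new SecretClient(
    new Uri("https://atp-prod-kv.vault.azure.net"),
    new DefaultAzureCredential());

// Key Vault secret names allow only alphanumerics and hyphens,
// so the logical path is flattened here (illustrative name)
KeyVaultSecret secret = await client.GetSecretAsync(
    "connection-strings--atp-ingestion--sql-connection-string");
Console.WriteLine($"Fetched version {secret.Properties.Version}");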
No Secrets in Pod Specs!¶
❌ BAD: Plaintext Secrets in Pod Specs:
# ❌ NEVER DO THIS!
apiVersion: v1
kind: Pod
spec:
containers:
- name: app
env:
- name: PASSWORD
value: "PlaintextPassword123!" # ❌ Exposed in Git!
✅ GOOD: Reference External Secret:
# ✅ Correct: Reference secret from External Secrets Operator
apiVersion: v1
kind: Pod
spec:
containers:
- name: app
env:
- name: PASSWORD
valueFrom:
secretKeyRef:
name: sql-connection-string # Created by External Secrets Operator
key: connection-string
External Secrets Operator¶
Installation and Configuration¶
Install External Secrets Operator:
# Add Helm repository
helm repo add external-secrets https://charts.external-secrets.io
helm repo update
# Install External Secrets Operator
helm install external-secrets \
external-secrets/external-secrets \
-n external-secrets-system \
--create-namespace \
--version 0.9.0
Verify Installation:
kubectl get pods -n external-secrets-system
# NAME READY STATUS RESTARTS AGE
# external-secrets-operator-7d8f9c4b5-abc123 1/1 Running 0 2m
ClusterSecretStore Setup¶
ClusterSecretStore for Azure Key Vault:
# infrastructure/clustersecretstore.yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
name: azure-keyvault
spec:
provider:
azurekv:
vaultUrl: "https://atp-prod-kv.vault.azure.net"
tenantId: "87654321-4321-4321-4321-cba987654321"
authType: WorkloadIdentity
serviceAccountRef:
name: external-secrets-operator
namespace: external-secrets-system
# Or use Service Principal (not recommended)
# authType: ServicePrincipal
# servicePrincipalRef:
# tenantId: "87654321-4321-4321-4321-cba987654321"
# clientId: "12345678-1234-1234-1234-123456789abc"
# clientSecret:
# secretRef:
# name: external-secrets-sp
# key: client-secret
Grant Permissions to External Secrets Operator:
// Grant Key Vault Secrets User role to External Secrets Operator Workload Identity
var esoRoleAssignment = new Authorization.RoleAssignment(
"eso-kv-secrets-user", new()
{
PrincipalId = externalSecretsOperatorIdentityPrincipalId,
PrincipalType = "ServicePrincipal",
RoleDefinitionId = "/subscriptions/{subscriptionId}/providers/Microsoft.Authorization/roleDefinitions/4633458b-17de-408a-b874-0445c86b69e6", // Key Vault Secrets User
Scope = keyVault.Id,
});
ExternalSecret Resources¶
ExternalSecret for Connection String:
# apps/atp-ingestion/base/externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: sql-connection-string
namespace: atp-production
spec:
refreshInterval: 1h # Refresh every hour
secretStoreRef:
name: azure-keyvault
kind: ClusterSecretStore
target:
name: sql-connection-string # Kubernetes Secret name
creationPolicy: Owner
template:
type: Opaque
data:
connection-string: "{{ .connectionString | toString }}"
data:
- secretKey: connectionString
remoteRef:
# property omitted: it is only needed to extract a field from a JSON-valued secret
key: connection-strings/atp-ingestion/sql-connection-string
version: "" # Empty = latest version
ExternalSecret for Certificate:
# apps/atp-gateway/base/externalsecret-cert.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: tls-certificate
namespace: atp-production
spec:
refreshInterval: 24h
secretStoreRef:
name: azure-keyvault
kind: ClusterSecretStore
target:
name: tls-certificate
creationPolicy: Owner
template:
type: kubernetes.io/tls
data:
tls.crt: "{{ .certificate | b64enc }}"
tls.key: "{{ .privateKey | b64enc }}"
data:
- secretKey: certificate
remoteRef:
key: certificates/atp-gateway/tls-cert
property: cert
- secretKey: privateKey
remoteRef:
key: certificates/atp-gateway/tls-cert
property: key
ExternalSecret for Multiple Secrets:
# apps/atp-gateway/base/externalsecret-multiple.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: gateway-secrets
namespace: atp-production
spec:
refreshInterval: 1h
secretStoreRef:
name: azure-keyvault
kind: ClusterSecretStore
target:
name: gateway-secrets
creationPolicy: Owner
data:
- secretKey: stripe-api-key
remoteRef:
key: api-keys/atp-gateway/stripe-api-key
- secretKey: sendgrid-api-key
remoteRef:
key: api-keys/atp-gateway/sendgrid-api-key
- secretKey: jwt-signing-key
remoteRef:
key: api-keys/atp-gateway/jwt-signing-key
Sync Interval and Refresh¶
Refresh Strategies:
| Strategy | Refresh Interval | Use Case |
|---|---|---|
| Frequent | 5m | High-security, frequently rotated secrets |
| Standard | 1h | Most application secrets |
| Infrequent | 24h | Stable certificates, long-lived keys |
| On-Demand | Manual refresh | Rarely changed secrets |
Manual Refresh:
# Trigger manual refresh
kubectl annotate externalsecret sql-connection-string \
-n atp-production \
force-sync=$(date +%s) \
--overwrite
ExternalSecret Status:
# Check ExternalSecret status
kubectl get externalsecret sql-connection-string -n atp-production -o yaml
# Status output:
status:
conditions:
- lastTransitionTime: "2024-01-15T10:00:00Z"
message: Secret was synced
reason: SecretSynced
status: "True"
type: Ready
refreshTime: "2024-01-15T10:00:00Z"
syncedResourceVersion: "12345"
Secret Rotation Handling¶
ExternalSecret with Version Tracking:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: sql-connection-string
namespace: atp-production
spec:
refreshInterval: 1h
secretStoreRef:
name: azure-keyvault
kind: ClusterSecretStore
target:
name: sql-connection-string
creationPolicy: Owner
template:
metadata:
annotations:
external-secrets.io/last-sync-time: "{{ .refreshTime | date \"2006-01-02T15:04:05Z07:00\" }}"
data:
- secretKey: connectionString
remoteRef:
key: connection-strings/atp-ingestion/sql-connection-string
# Track version for rotation
version: "" # Empty = latest, or specify version ID
Application Secret Rotation:
// C# application: Handle secret rotation gracefully
public class SecretRotationHandler
{
private readonly ILogger<SecretRotationHandler> _logger;
private string _currentConnectionString;
private readonly SemaphoreSlim _rotationLock = new(1, 1);
public SecretRotationHandler(ILogger<SecretRotationHandler> logger) => _logger = logger;
public async Task<string> GetConnectionStringAsync()
{
// Read from mounted secret file or environment variable
var newConnectionString = await File.ReadAllTextAsync(
"/mnt/secrets/sql-connection-string/connection-string");
if (_currentConnectionString != newConnectionString)
{
await _rotationLock.WaitAsync();
try
{
if (_currentConnectionString != newConnectionString)
{
_logger.LogInformation("Connection string rotated, updating connection");
await RotateConnectionAsync(newConnectionString);
_currentConnectionString = newConnectionString;
}
}
finally
{
_rotationLock.Release();
}
}
return _currentConnectionString;
}
private async Task RotateConnectionAsync(string newConnectionString)
{
// Close old connections
// Establish new connections with new connection string
// Zero-downtime rotation
}
}
CSI Driver Alternative¶
Azure Key Vault CSI Driver¶
Installation:
# Install Azure Key Vault CSI Driver
helm repo add csi-secrets-store-provider-azure https://raw.githubusercontent.com/Azure/secrets-store-csi-driver-provider-azure/master/charts
helm repo update
helm install csi-secrets-store-provider-azure \
csi-secrets-store-provider-azure/csi-secrets-store-provider-azure \
--namespace kube-system \
--version 1.4.0
Verify Installation:
kubectl get pods -n kube-system | grep csi-secrets-store
# NAME READY STATUS RESTARTS AGE
# csi-secrets-store-provider-azure-7d8f9c4b5-abc123 1/1 Running 0 2m
# csi-secrets-store-driver-9f8e7d6c5-def456 2/2 Running 0 2m
SecretProviderClass Configuration¶
SecretProviderClass with Workload Identity:
# apps/atp-ingestion/base/secretproviderclass.yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
name: atp-ingestion-secrets
namespace: atp-production
spec:
provider: azure
secretObjects:
- secretName: sql-connection-string # Kubernetes Secret to create
type: Opaque
data:
- objectName: sql-connection-string
key: connection-string
parameters:
usePodIdentity: "false"
useVMManagedIdentity: "false"
clientID: "12345678-1234-1234-1234-123456789abc" # Workload Identity (Managed Identity) client ID
tenantId: "87654321-4321-4321-4321-cba987654321" # Azure AD Tenant ID
keyvaultName: "atp-prod-kv"
objects: |
array:
- |
objectName: connection-strings/atp-ingestion/sql-connection-string
objectType: secret
objectVersion: "" # Empty = latest version
tenantId: "87654321-4321-4321-4321-cba987654321"
SecretProviderClass for Certificate:
# apps/atp-gateway/base/secretproviderclass-cert.yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
name: tls-certificate
namespace: atp-production
spec:
provider: azure
secretObjects:
- secretName: tls-certificate
type: kubernetes.io/tls
data:
- objectName: tls-cert
key: tls.crt
- objectName: tls-key
key: tls.key
parameters:
clientID: "12345678-1234-1234-1234-123456789abc" # Workload Identity client ID
tenantId: "87654321-4321-4321-4321-cba987654321"
keyvaultName: "atp-prod-kv"
objects: |
array:
- |
objectName: certificates/atp-gateway/tls-cert
objectType: secret
objectFormat: pfx
objectEncoding: base64
tenantId: "87654321-4321-4321-4321-cba987654321"
Mounting Secrets as Volumes¶
Deployment with CSI Volume Mount:
# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
namespace: atp-production
spec:
template:
metadata:
labels:
azure.workload.identity/use: "true"
spec:
serviceAccountName: atp-ingestion
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
volumeMounts:
- name: secrets-store
mountPath: "/mnt/secrets-store"
readOnly: true
env:
- name: SQL_CONNECTION_STRING
valueFrom:
secretKeyRef:
name: sql-connection-string # Created by SecretProviderClass secretObjects
key: connection-string
volumes:
- name: secrets-store
csi:
driver: secrets-store.csi.k8s.io
readOnly: true
volumeAttributes:
secretProviderClass: "atp-ingestion-secrets"
Automatic Rotation¶
Secret Rotation with CSI Driver:
# Rotation is enabled at the driver level (not per SecretProviderClass),
# via the Secrets Store CSI Driver rotation flags exposed as Helm values
helm upgrade csi-secrets-store-provider-azure \
csi-secrets-store-provider-azure/csi-secrets-store-provider-azure \
--namespace kube-system \
--set secrets-store-csi-driver.enableSecretRotation=true \
--set secrets-store-csi-driver.rotationPollInterval=2m
The SecretProviderClass itself only declares the objects to mount and sync:
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
name: atp-ingestion-secrets
namespace: atp-production
spec:
provider: azure
secretObjects:
- secretName: sql-connection-string
type: Opaque
data:
- objectName: sql-connection-string
key: connection-string
parameters:
clientID: "12345678-1234-1234-1234-123456789abc" # Workload Identity client ID
tenantId: "87654321-4321-4321-4321-cba987654321"
keyvaultName: "atp-prod-kv"
objects: |
array:
- |
objectName: connection-strings/atp-ingestion/sql-connection-string
objectType: secret
objectVersion: "" # Latest version
Rotation Status:
# Check rotation status
kubectl describe secretproviderclass atp-ingestion-secrets -n atp-production
# View mounted secrets
kubectl exec -it deployment/atp-ingestion -n atp-production -- \
ls -la /mnt/secrets-store/
When to Use CSI vs External Secrets¶
Comparison Matrix:
| Feature | External Secrets Operator | CSI Driver |
|---|---|---|
| Secret Access | Creates Kubernetes Secrets | Mounts directly or creates Secrets |
| Rotation | Manual refresh or polling | Automatic rotation support |
| Use Case | Standard Kubernetes Secret consumption | Direct file access or Secret creation |
| Performance | Slight delay (polling) | Faster (direct mount) |
| Compatibility | ✅ Works with existing Secret consumers | ⚠️ Requires CSI volume mounts |
| Complexity | ✅ Simpler | ⚠️ More complex setup |
ATP Selection Guide:
- External Secrets Operator: ✅ Recommended for most use cases
  - Standard Kubernetes Secret consumption
  - Works with existing applications
  - Simpler to manage
- CSI Driver: Use when:
  - Direct file access to secrets is needed
  - Automatic rotation without polling is required
  - High-performance secret access is needed
Secret References in Manifests¶
Never Plaintext Secrets in Git¶
❌ BAD: Plaintext Secret in Git:
# ❌ NEVER COMMIT THIS TO GIT!
apiVersion: v1
kind: Secret
metadata:
name: sql-connection-string
data:
connection-string: U2VjdXJlUGFzc3dvcmQxMjMh # Base64 encoded, but still in Git!
✅ GOOD: External Secret Reference:
# ✅ Correct: Reference External Secret
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: sql-connection-string
spec:
refreshInterval: 1h
secretStoreRef:
name: azure-keyvault
kind: ClusterSecretStore
target:
name: sql-connection-string
data:
- secretKey: connectionString
remoteRef:
key: connection-strings/atp-ingestion/sql-connection-string
Referencing Secrets by Name¶
Deployment Using Secret Reference:
# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
template:
spec:
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
env:
# Reference secret created by External Secrets Operator
- name: SQL_CONNECTION_STRING
valueFrom:
secretKeyRef:
name: sql-connection-string
key: connection-string
- name: REDIS_CONNECTION_STRING
valueFrom:
secretKeyRef:
name: redis-connection-string
key: connection-string
envFrom:
# Or use envFrom for multiple secrets
- secretRef:
name: gateway-secrets
Environment-Specific Secret Mappings¶
Environment-Specific ExternalSecret:
# apps/atp-ingestion/overlays/production/externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: sql-connection-string
namespace: atp-production
spec:
refreshInterval: 1h
secretStoreRef:
name: azure-keyvault-prod # Production Key Vault
kind: ClusterSecretStore
target:
name: sql-connection-string
data:
- secretKey: connectionString
remoteRef:
key: connection-strings/atp-ingestion/sql-connection-string
# Production-specific secret path
# apps/atp-ingestion/overlays/dev/externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: sql-connection-string
namespace: atp-dev
spec:
refreshInterval: 24h # Less frequent refresh for dev
secretStoreRef:
name: azure-keyvault-dev # Dev Key Vault
kind: ClusterSecretStore
target:
name: sql-connection-string
data:
- secretKey: connectionString
remoteRef:
key: connection-strings/atp-ingestion/sql-connection-string
# Dev-specific secret path
Secret Versioning¶
Reference Specific Secret Version:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: sql-connection-string
spec:
data:
- secretKey: connectionString
remoteRef:
key: connection-strings/atp-ingestion/sql-connection-string
version: "abc123def456789" # Specific version ID
Track Secret Versions:
# List secret versions
az keyvault secret show-versions \
--vault-name atp-prod-kv \
--name "connection-strings/atp-ingestion/sql-connection-string" \
--query "[].{id:id, enabled:attributes.enabled, updated:attributes.updated}"
# Output:
# [
# {
# "id": "https://atp-prod-kv.vault.azure.net/secrets/connection-strings/atp-ingestion/sql-connection-string/abc123def456789",
# "enabled": true,
# "updated": "2024-01-15T10:00:00Z"
# }
# ]
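The same enumeration is available from application code; a sketch using the Key Vault Secrets SDK (vault URL and the flattened secret name are illustrative):
using System;
using Azure.Identity;
using Azure.Security.KeyVault.Secrets;

var client = new SecretClient(
    new Uri("https://atp-prod-kv.vault.azure.net"),
    new DefaultAzureCredential());

// Enumerate every version of a secret (flattened name is illustrative)
await foreach (SecretProperties version in
    client.GetPropertiesOfSecretVersionsAsync(
        "connection-strings--atp-ingestion--sql-connection-string"))
{
    Console.WriteLine(
        $"{version.Version} enabled={version.Enabled} updated={version.UpdatedOn}");
}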
Secret Rotation Procedures¶
Manual Rotation Workflow¶
Secret Rotation Checklist:
## Manual Secret Rotation Checklist
### Pre-Rotation
- [ ] Notify team of rotation schedule
- [ ] Verify application can handle secret rotation gracefully
- [ ] Backup current secret (if needed)
- [ ] Prepare new secret value
### Rotation
- [ ] Create new secret version in Key Vault
- [ ] Test new secret in non-production environment
- [ ] Update ExternalSecret to reference new version (optional)
- [ ] Trigger ExternalSecret refresh
- [ ] Verify application picks up new secret
- [ ] Monitor application for errors
### Post-Rotation
- [ ] Verify application is functioning correctly
- [ ] Disable old secret version (don't delete yet)
- [ ] Monitor for 24-48 hours
- [ ] Delete old secret version after confirmation
Manual Rotation Script:
#!/bin/bash
# scripts/rotate-secret.sh
VAULT_NAME="${1:-atp-prod-kv}"
SECRET_NAME="${2:-connection-strings/atp-ingestion/sql-connection-string}"
NEW_SECRET_VALUE="${3:-}"
if [ -z "$NEW_SECRET_VALUE" ]; then
echo "Usage: $0 <vault-name> <secret-name> <new-secret-value>"
exit 1
fi
echo "🔄 Rotating secret: $SECRET_NAME"
# 1. Create new secret version
echo "📝 Creating new secret version..."
az keyvault secret set \
--vault-name "$VAULT_NAME" \
--name "$SECRET_NAME" \
--value "$NEW_SECRET_VALUE" \
--tags LastRotated="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# 2. Trigger ExternalSecret refresh
echo "🔄 Triggering ExternalSecret refresh..."
kubectl annotate externalsecret "$(basename $SECRET_NAME)" \
-n atp-production \
force-sync="$(date +%s)" \
--overwrite
# 3. Verify secret was synced
echo "✅ Waiting for secret sync..."
sleep 10
kubectl get externalsecret "$(basename $SECRET_NAME)" -n atp-production
echo "✅ Secret rotation complete"
Automated Rotation with Key Vault¶
Azure Key Vault Automatic Rotation:
# Enable automatic renewal by updating the certificate policy
az keyvault certificate set-attributes \
--vault-name atp-prod-kv \
--name "certificates/atp-gateway/tls-cert" \
--enabled true \
--policy "$(cat rotation-policy.json)"
Rotation Policy (rotation-policy.json):
{
"lifetimeActions": [
{
"trigger": {
"daysBeforeExpiry": 30
},
"action": {
"actionType": "Rotate"
}
},
{
"trigger": {
"daysBeforeExpiry": 7
},
"action": {
"actionType": "EmailContacts"
}
}
],
"issuerParameters": {
"name": "Self"
},
"keyProperties": {
"exportable": true,
"keySize": 2048,
"keyType": "RSA",
"reuseKey": true
},
"secretProperties": {
"contentType": "application/x-pkcs12"
}
}
Application Handling of Rotated Secrets¶
C# Application: Secret Rotation Handler:
// SecretRotationHandler.cs
public class SecretRotationHandler : BackgroundService
{
private readonly ILogger<SecretRotationHandler> _logger;
private readonly SemaphoreSlim _rotationLock = new(1, 1);
private string _currentSecret;
public SecretRotationHandler(ILogger<SecretRotationHandler> logger) => _logger = logger;
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
while (!stoppingToken.IsCancellationRequested)
{
try
{
await CheckForSecretRotationAsync();
await Task.Delay(TimeSpan.FromMinutes(5), stoppingToken); // Check every 5 minutes
}
catch (Exception ex)
{
_logger.LogError(ex, "Error checking for secret rotation");
await Task.Delay(TimeSpan.FromMinutes(1), stoppingToken);
}
}
}
private async Task CheckForSecretRotationAsync()
{
// Read secret from mounted file or environment variable
var secretPath = "/mnt/secrets-store/connection-strings/atp-ingestion/sql-connection-string";
if (File.Exists(secretPath))
{
var newSecret = await File.ReadAllTextAsync(secretPath);
if (_currentSecret != null && _currentSecret != newSecret)
{
_logger.LogInformation("Secret rotated, updating connection");
await RotateSecretAsync(newSecret);
}
_currentSecret = newSecret;
}
}
private async Task RotateSecretAsync(string newSecret)
{
await _rotationLock.WaitAsync();
try
{
// Zero-downtime rotation:
// 1. Create new connection with new secret
// 2. Migrate traffic to new connection
// 3. Close old connection
_logger.LogInformation("Secret rotation complete");
}
finally
{
_rotationLock.Release();
}
}
}
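One way to wire the handler into a .NET generic host (a minimal sketch):
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

// Register the rotation handler as a hosted background service
var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddHostedService<SecretRotationHandler>();
await builder.Build().RunAsync();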
Zero-Downtime Rotation¶
Zero-Downtime Rotation Strategy:
graph LR
A[Old Secret<br/>Active] -->|1. New Secret<br/>Created| B[New Secret<br/>Available]
B -->|2. New Connection<br/>Established| C[Dual Connections<br/>Active]
C -->|3. Migrate Traffic| D[New Connection<br/>Primary]
D -->|4. Close Old| E[New Secret<br/>Active]
style A fill:#ffcccc
style C fill:#FFE5B4
style E fill:#90EE90
Implementation:
public class ZeroDowntimeSecretRotation
{
private IDbConnection _primaryConnection;
private IDbConnection _secondaryConnection;
private bool _isRotating = false;
public async Task RotateConnectionStringAsync(string newConnectionString)
{
if (_isRotating) return;
_isRotating = true;
try
{
// 1. Create new connection
var newConnection = new SqlConnection(newConnectionString);
await newConnection.OpenAsync();
// 2. Verify new connection works
using var testCommand = new SqlCommand("SELECT 1", newConnection);
await testCommand.ExecuteScalarAsync();
// 3. Set secondary connection
_secondaryConnection = newConnection;
// 4. Migrate traffic gradually (e.g., 10% at a time)
await MigrateTrafficGraduallyAsync();
// 5. Close old connection
if (_primaryConnection != null)
{
_primaryConnection.Close();
_primaryConnection.Dispose();
}
// 6. Promote new connection to primary
_primaryConnection = _secondaryConnection;
_secondaryConnection = null;
}
finally
{
_isRotating = false;
}
}
private async Task MigrateTrafficGraduallyAsync()
{
// Placeholder: drain in-flight work from the old connection before
// promoting the new one (implementation is application-specific)
await Task.CompletedTask;
}
}
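A possible trigger point, assuming the CSI-mounted secret file from the sections above (path illustrative):
// Hypothetical wiring: rotate when the mounted secret file changes
var rotator = new ZeroDowntimeSecretRotation();
var newConnectionString = await File.ReadAllTextAsync(
    "/mnt/secrets-store/sql-connection-string");
await rotator.RotateConnectionStringAsync(newConnectionString.Trim());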
Secret Versioning and Rollback¶
Key Vault Secret Versions¶
List Secret Versions:
# List all versions of a secret
az keyvault secret show-versions \
--vault-name atp-prod-kv \
--name "connection-strings/atp-ingestion/sql-connection-string" \
--query "[].{id:id, enabled:attributes.enabled, updated:attributes.updated, expires:attributes.expires}" \
--output table
# Output:
# ID ENABLED UPDATED EXPIRES
# https://atp-prod-kv.../abc123def456789 True 2024-01-15T10:00:00Z None
# https://atp-prod-kv.../def456ghi789012 True 2024-01-14T10:00:00Z None
# https://atp-prod-kv.../ghi789jkl012345 False 2024-01-13T10:00:00Z None # Disabled
Get Specific Secret Version:
# Get specific version
az keyvault secret show \
--vault-name atp-prod-kv \
--name "connection-strings/atp-ingestion/sql-connection-string" \
--version "def456ghi789012"
Rolling Back to Previous Secret Version¶
Rollback Procedure:
#!/bin/bash
# scripts/rollback-secret.sh
VAULT_NAME="${1:-atp-prod-kv}"
SECRET_NAME="${2:-connection-strings/atp-ingestion/sql-connection-string}"
VERSION_TO_ROLLBACK="${3:-}"
if [ -z "$VERSION_TO_ROLLBACK" ]; then
echo "Usage: $0 <vault-name> <secret-name> <version-id>"
echo "Listing available versions:"
az keyvault secret show-versions \
--vault-name "$VAULT_NAME" \
--name "$SECRET_NAME" \
--query "[].{version:split(id,'/')[-1], updated:attributes.updated}" \
--output table
exit 1
fi
echo "🔄 Rolling back secret to version: $VERSION_TO_ROLLBACK"
# 1. Get previous version value
PREVIOUS_VALUE=$(az keyvault secret show \
--vault-name "$VAULT_NAME" \
--name "$SECRET_NAME" \
--version "$VERSION_TO_ROLLBACK" \
--query value -o tsv)
# 2. Create new version with previous value
az keyvault secret set \
--vault-name "$VAULT_NAME" \
--name "$SECRET_NAME" \
--value "$PREVIOUS_VALUE" \
--tags RollbackFrom="$VERSION_TO_ROLLBACK" RollbackAt="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# 3. Trigger ExternalSecret refresh
kubectl annotate externalsecret "$(basename $SECRET_NAME)" \
-n atp-production \
force-sync="$(date +%s)" \
--overwrite
echo "✅ Secret rolled back successfully"
Coordinating Secret Changes with Deployments¶
Coordinated Secret and Deployment Update:
# Strategy: Update secret first, then deployment
# 1. Update ExternalSecret to reference new secret version
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: sql-connection-string
spec:
data:
- secretKey: connectionString
remoteRef:
key: connection-strings/atp-ingestion/sql-connection-string
version: "abc123def456789" # New version
---
# 2. Wait for secret sync
# 3. Update deployment (triggers rolling update with new secret)
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
template:
metadata:
annotations:
secret-version: "abc123def456789" # Track secret version
Audit Logging¶
Key Vault Access Logs¶
Enable Diagnostic Settings:
// Enable Key Vault diagnostic logs
new Insights.DiagnosticSetting("keyvault-diagnostics", new()
{
ResourceUri = keyVault.Id,
WorkspaceId = logAnalyticsWorkspace.Id,
Logs = new[]
{
new Insights.Inputs.LogSettingsArgs
{
CategoryGroup = "audit",
Enabled = true,
RetentionPolicy = new Insights.Inputs.RetentionPolicyArgs
{
Enabled = true,
Days = environment == "production" ? 365 : 30,
},
},
new Insights.Inputs.LogSettingsArgs
{
CategoryGroup = "allLogs",
Enabled = true,
RetentionPolicy = new Insights.Inputs.RetentionPolicyArgs
{
Enabled = true,
Days = environment == "production" ? 365 : 30,
},
},
},
Metrics = new[]
{
new Insights.Inputs.MetricSettingsArgs
{
Category = "AllMetrics",
Enabled = true,
RetentionPolicy = new Insights.Inputs.RetentionPolicyArgs
{
Enabled = true,
Days = environment == "production" ? 365 : 30,
},
},
},
});
Monitoring Secret Access¶
KQL Query for Secret Access:
// Key Vault access logs
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where OperationName == "SecretGet" or OperationName == "SecretList"
| extend SecretName = tostring(parse_json(properties_s).objectName)
| extend CallerIP = CallerIPAddress
| extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
| project TimeGenerated, SecretName, OperationName, Identity, CallerIP, ResultSignature
| order by TimeGenerated desc
Access Pattern Analysis:
// Secret access patterns
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where OperationName == "SecretGet"
| extend SecretName = tostring(parse_json(properties_s).objectName)
| summarize
AccessCount = count(),
UniqueIdentities = dcount(parse_json(properties_s).identity_claim_appid_g),
LastAccess = max(TimeGenerated)
by SecretName, bin(TimeGenerated, 1h)
| order by AccessCount desc
Alerting on Unauthorized Access¶
Azure Monitor Alert Rule:
# alerts/keyvault-unauthorized-access.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: keyvault-unauthorized-access
namespace: monitoring
spec:
groups:
- name: keyvault
rules:
- alert: KeyVaultUnauthorizedAccess
expr: |
count(
azure_keyvault_secret_access_total{
result="Forbidden"
} > 0
) by (secret_name)
for: 5m
labels:
severity: critical
annotations:
summary: "Unauthorized access attempt to Key Vault secret"
description: "Secret {{ $labels.secret_name }} has {{ $value }} unauthorized access attempts"
Log Analytics Alert:
// Alert query: Failed secret access
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where OperationName == "SecretGet"
| where httpStatusCode_d == 403 // Forbidden
| extend SecretName = tostring(parse_json(properties_s).objectName)
| extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
| project TimeGenerated, SecretName, Identity
Compliance Reporting¶
Compliance Report Query:
// Secret access compliance report
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where TimeGenerated > ago(30d)
| extend SecretName = tostring(parse_json(properties_s).objectName)
| extend Operation = OperationName
| extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
| extend Result = ResultSignature
| summarize
TotalAccess = count(),
SuccessfulAccess = countif(Result == "OK"),
FailedAccess = countif(Result != "OK"),
UniqueIdentities = dcount(Identity),
LastAccess = max(TimeGenerated)
by SecretName, Operation
| order by TotalAccess desc
Compliance: SOC 2, GDPR, HIPAA¶
Encryption at Rest in Key Vault¶
Key Vault Encryption:
- ✅ Automatic Encryption: All secrets encrypted at rest by default
- ✅ Hardware Security Module (HSM): Premium SKU uses HSM-backed keys
- ✅ Azure Key Vault Managed HSM: Dedicated HSM for highest security
Enable HSM-Backed Keys:
// Use Premium SKU for HSM-backed keys
this.Vault = new KeyVault.Vault(keyVaultName, new()
{
Properties = new KeyVault.Inputs.VaultPropertiesArgs
{
Sku = new KeyVault.Inputs.SkuArgs
{
Family = "A",
Name = "premium", // Premium SKU for HSM
},
},
});
Access Reviews and Audits¶
Regular Access Reviews:
#!/bin/bash
# scripts/access-review.sh
VAULT_NAME="${1:-atp-prod-kv}"
echo "📋 Key Vault Access Review for: $VAULT_NAME"
# List all role assignments
az role assignment list \
--scope "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.KeyVault/vaults/$VAULT_NAME" \
--query "[].{principal:principalName, role:roleDefinitionName, scope:scope}" \
--output table
# List all secrets and their access patterns
echo "📊 Secret Access Summary:"
az keyvault secret list \
--vault-name "$VAULT_NAME" \
--query "[].name" -o tsv | while read secret; do
echo "Secret: $secret"
az keyvault secret list-versions \
--vault-name "$VAULT_NAME" \
--name "$secret" \
--query "[].{updated:attributes.updated, enabled:attributes.enabled}" \
--output table
done
Automated Access Review:
# Azure Policy: Require access reviews
apiVersion: policy.azure.com/v1beta1
kind: PolicyAssignment
metadata:
name: require-keyvault-access-reviews
spec:
displayName: "Require Key Vault Access Reviews"
policyDefinitionId: "/providers/Microsoft.Authorization/policyDefinitions/..."
parameters:
reviewFrequency: "30d"
Secret Lifecycle Management¶
Secret Lifecycle Policy:
# Secret lifecycle management
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: sql-connection-string
namespace: atp-production
spec:
refreshInterval: 1h
secretStoreRef:
name: azure-keyvault
kind: ClusterSecretStore
target:
name: sql-connection-string
template:
metadata:
annotations:
secret-lifecycle/created: "{{ .creationTime }}"
secret-lifecycle/expires: "{{ .expirationTime }}"
secret-lifecycle/rotation-policy: "30d"
data:
- secretKey: connectionString
remoteRef:
key: connection-strings/atp-ingestion/sql-connection-string
Evidence Collection for Auditors¶
Audit Evidence Report:
// SOC 2 / GDPR Audit Evidence: Secret Access Log
let SecretAccess = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where TimeGenerated > ago(90d)
| extend SecretName = tostring(parse_json(properties_s).objectName)
| extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
| extend Result = ResultSignature
| extend IPAddress = CallerIPAddress
| project TimeGenerated, SecretName, Identity, Result, IPAddress, OperationName;
// Generate report
SecretAccess
| summarize
TotalAccess = count(),
SuccessfulAccess = countif(Result == "OK"),
FailedAccess = countif(Result != "OK"),
UniqueIdentities = dcount(Identity),
DateRange = strcat(min(TimeGenerated), " to ", max(TimeGenerated))
by SecretName
| order by SecretName
Export Audit Logs:
# Export audit logs for compliance
az monitor log-analytics query \
--workspace "atp-prod-loganalytics" \
--analytics-query "
AzureDiagnostics
| where ResourceProvider == 'MICROSOFT.KEYVAULT'
| where Category == 'AuditEvent'
| where TimeGenerated > ago(90d)
| extend SecretName = tostring(parse_json(properties_s).objectName)
| extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
| project TimeGenerated, OperationName, SecretName, Identity, ResultSignature
" \
--output tsv > keyvault-audit-log-$(date +%Y%m%d).tsv
Summary: Azure Key Vault Secret Management¶
- Key Vault Architecture: Environment-specific Key Vaults with RBAC (recommended over access policies), organized secret naming conventions
- Secret Categories: Connection strings, API keys, certificates, encryption keys, credentials with proper tagging
- Workload Identity: Azure AD Workload Identity for pod authentication, federated credentials, ServiceAccount annotation, no secrets in pod specs
- External Secrets Operator: ClusterSecretStore setup, ExternalSecret resources, sync intervals, secret rotation handling
- CSI Driver: Alternative for direct secret mounting, SecretProviderClass configuration, automatic rotation support
- Secret References: Never plaintext secrets in Git, reference secrets by name, environment-specific mappings, secret versioning
- Secret Rotation: Manual and automated rotation procedures, application handling of rotated secrets, zero-downtime rotation strategies
- Secret Versioning: Key Vault secret versions, rollback procedures, coordinating secret changes with deployments
- Audit Logging: Key Vault access logs, monitoring secret access, alerting on unauthorized access, compliance reporting
- Compliance: Encryption at rest, access reviews, secret lifecycle management, evidence collection for SOC 2, GDPR, HIPAA audits
Security Policies & Compliance¶
Purpose: Define security policies, compliance controls, and enforcement mechanisms for ATP GitOps, ensuring all Kubernetes workloads, network traffic, and container images meet security standards and regulatory requirements (SOC 2, GDPR, HIPAA) through policy-as-code and automated enforcement.
Azure Policy for Kubernetes¶
Policy Overview and Architecture¶
Azure Policy for Kubernetes Architecture:
graph LR
A[Policy Definition<br/>in Azure] -->|Assignment| B[AKS Cluster<br/>with Policy Add-on]
B -->|Enforces| C[Admission Controller]
C -->|Validates| D[Kubernetes Resources]
D -->|Creates| E[Compliant Resources]
D -.->|Violates| F[Policy Violation<br/>Blocked/Reported]
style A fill:#90EE90
style B fill:#FFE5B4
style C fill:#FFE5B4
style D fill:#ffcccc
style E fill:#90EE90
style F fill:#ff9999
Azure Policy Components:
| Component | Purpose | Example |
|---|---|---|
| Policy Definition | Defines the policy rule | "All pods must have resource limits" |
| Policy Assignment | Applies policy to AKS cluster | Assign to atp-prod-eus-aks |
| Policy Effect | Enforcement action | deny, audit, disabled |
| Policy Parameters | Configurable values | Minimum CPU: 100m |
Built-in Policies for AKS¶
Enable Azure Policy Add-on:
# Enable Azure Policy add-on on AKS
az aks enable-addons \
--resource-group atp-production-rg \
--name atp-prod-eus-aks \
--addons azure-policy
# Verify installation
az aks show \
--resource-group atp-production-rg \
--name atp-prod-eus-aks \
--query addonProfiles.azurepolicy
Common Built-in Policies:
// Built-in policy: Kubernetes cluster containers should only use allowed capabilities
{
"policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/c26596ff-4d70-4e6a-9a30-c2506bd2f80c",
"parameters": {
"allowedCapabilities": {
"value": ["NET_BIND_SERVICE"]
},
"requiredDropCapabilities": {
"value": ["ALL"]
},
"effect": {
"value": "Audit"
}
}
}
Assign Built-in Policy:
# Assign built-in policy: Container images should be deployed from trusted registries only
az policy assignment create \
--name "aks-trusted-registries" \
--scope "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.ContainerService/managedClusters/atp-prod-eus-aks" \
--policy "/providers/Microsoft.Authorization/policyDefinitions/febd0533-8e55-448f-b837-bd0e06f16469" \
--params '{
"allowedContainerImagesRegex": {
"value": "^connectsoft\\.azurecr\\.io/.*"
},
"effect": {
"value": "Deny"
}
}'
Custom Policy Definitions¶
Custom Policy: Require Resource Limits:
// policies/require-resource-limits.json
{
"properties": {
"displayName": "ATP: Require resource limits on containers",
"description": "Ensures all containers have CPU and memory limits set",
"mode": "Microsoft.Kubernetes.Data",
"metadata": {
"category": "ATP Security"
},
"parameters": {
"minCpu": {
"type": "String",
"metadata": {
"displayName": "Minimum CPU limit",
"description": "Minimum CPU limit (e.g., 100m)"
},
"defaultValue": "100m"
},
"minMemory": {
"type": "String",
"metadata": {
"displayName": "Minimum memory limit",
"description": "Minimum memory limit (e.g., 128Mi)"
},
"defaultValue": "128Mi"
},
"effect": {
"type": "String",
"metadata": {
"displayName": "Effect",
"description": "Policy effect"
},
"allowedValues": ["audit", "deny", "disabled"],
"defaultValue": "deny"
}
},
"policyRule": {
"if": {
"field": "type",
"equals": "Microsoft.Kubernetes/connectedClusters"
},
"then": {
"effect": "[parameters('effect')]",
"details": {
"templateInfo": {
"sourceType": "PublicURL",
"url": "https://raw.githubusercontent.com/ConnectSoft/ATP-Policies/main/policies/require-resource-limits.yaml"
},
"apiGroups": ["apps"],
"kinds": ["Deployment", "StatefulSet"],
"excludedNamespaces": ["kube-system", "gatekeeper-system"]
}
}
}
}
}
Gatekeeper Constraint Template (Referenced by Policy):
# policies/require-resource-limits.yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
name: k8srequiredresourcelimits
spec:
crd:
spec:
names:
kind: K8sRequiredResourceLimits
validation:
openAPIV3Schema:
type: object
properties:
minCpu:
type: string
minMemory:
type: string
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package k8srequiredresourcelimits
violation[{"msg": msg}] {
container := input.review.object.spec.template.spec.containers[_]
not container.resources.limits.cpu
msg := sprintf("Container '%v' must have CPU limit", [container.name])
}
violation[{"msg": msg}] {
container := input.review.object.spec.template.spec.containers[_]
not container.resources.limits.memory
msg := sprintf("Container '%v' must have memory limit", [container.name])
}
Create Custom Policy:
# Create custom policy definition
az policy definition create \
--name "atp-require-resource-limits" \
--display-name "ATP: Require resource limits on containers" \
--description "Ensures all containers have CPU and memory limits set" \
--rules policies/require-resource-limits.json \
--params policies/require-resource-limits.parameters.json \
--mode Microsoft.Kubernetes.Data
Policy Assignment and Enforcement¶
Policy Assignment with Pulumi:
// Assign Azure Policy to AKS cluster
new Authorization.PolicyAssignment("atp-require-resource-limits", new()
{
Name = "atp-require-resource-limits",
DisplayName = "ATP: Require resource limits",
PolicyDefinitionId = "/providers/Microsoft.Authorization/policyDefinitions/atp-require-resource-limits",
Scope = aksCluster.Id,
Parameters = new()
{
{ "minCpu", new() { Value = "100m" } },
{ "minMemory", new() { Value = "128Mi" } },
{ "effect", new() { Value = "deny" } },
},
EnforcementMode = "Default", // Enforced
Identity = new Authorization.Inputs.IdentityArgs
{
Type = Authorization.ResourceIdentityType.SystemAssigned,
},
});
Policy Enforcement Modes:
| Mode | Behavior | Use Case |
|---|---|---|
| Default | Policy effect is enforced; non-compliant resources are blocked or flagged | Production environments |
| DoNotEnforce | Audits only, doesn't block | Testing policy effectiveness |
To disable a policy temporarily, set its effect to disabled or remove the assignment; there is no Disabled enforcement mode.
Policy Compliance Check:
# Check policy compliance
az policy state list \
--resource "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.ContainerService/managedClusters/atp-prod-eus-aks" \
--policy-assignment "atp-require-resource-limits" \
--query "[].{resource:resourceId, complianceState:complianceState}" \
--output table
Pod Security Standards (PSS)¶
Privileged, Baseline, Restricted Profiles¶
Pod Security Levels:
| Level | Restrictions | ATP Use Case |
|---|---|---|
| Privileged | No restrictions | System pods only (CNI, CSI drivers) |
| Baseline | Minimal restrictions | Legacy applications |
| Restricted | Maximum restrictions | ✅ ATP production workloads |
Restricted Profile Requirements:
- ✅ Run as non-root user
- ✅ Read-only root filesystem
- ✅ Drop all capabilities
- ✅ Disallow privilege escalation
- ✅ Seccomp profile enforced
- ✅ AppArmor/SELinux enforced
Pod Security Admission Configuration¶
Enable Pod Security Admission:
# infrastructure/namespace-pod-security.yaml
apiVersion: v1
kind: Namespace
metadata:
name: atp-production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
Admission Configuration:
# cluster-config/admission-configuration.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
configuration:
apiVersion: pod-security.admission.config.k8s.io/v1beta1
kind: PodSecurityConfiguration
defaults:
enforce: "restricted"
audit: "restricted"
warn: "restricted"
exemptions:
usernames: []
runtimeClasses: []
namespaces:
- kube-system
- gatekeeper-system
- external-secrets-system
Security Context Requirements¶
Deployment with Restricted Security Context:
# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
template:
spec:
securityContext: # Pod-level security context
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
supplementalGroups: []
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
securityContext: # Container-level security context
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 1000
capabilities:
drop:
- ALL
add: [] # No additional capabilities
seccompProfile:
type: RuntimeDefault
volumeMounts:
- name: tmp
mountPath: /tmp
- name: var-run
mountPath: /var/run
volumes:
- name: tmp
emptyDir: {}
- name: var-run
emptyDir: {}
Gradual Enforcement Strategy¶
Enforcement Strategy:
| Environment | Enforce Level | Audit Level | Warn Level | Timeline |
|---|---|---|---|---|
| Dev | Baseline | Restricted | Restricted | Immediate |
| Test | Baseline | Restricted | Restricted | Month 1 |
| Staging | Restricted | Restricted | Restricted | Month 2 |
| Production | Restricted | Restricted | Restricted | Month 3 |
Migration Plan:
Phase 1 (Month 1): Dev and Test
- Set enforce: baseline
- Set audit: restricted
- Identify violations (see the dry-run sketch after this plan)
- Fix applications
Phase 2 (Month 2): Staging
- Set enforce: restricted
- Fix remaining violations
- Validate applications
Phase 3 (Month 3): Production
- Set enforce: restricted
- Full enforcement
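To execute the "Identify violations" step without changing enforcement, Pod Security Admission supports a server-side dry run: relabeling the namespace with --dry-run=server makes the API server report every pod that would violate the target profile while changing nothing:
# Dry-run the enforce label; violating pods are reported as warnings, nothing is relabeled
kubectl label --dry-run=server --overwrite \
namespace atp-production \
pod-security.kubernetes.io/enforce=restricted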
Network Policies¶
Default Deny All Traffic¶
Default Deny Network Policy:
# platform/network-policies/default-deny-all.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: atp-production
spec:
podSelector: {} # Match all pods
policyTypes:
- Ingress
- Egress
# No ingress or egress rules = deny all
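A quick probe from a throwaway pod confirms the policy is active; this sketch assumes the atp-ingestion Service listens on port 8080 as configured elsewhere in this document:
# With default-deny in place, this request should time out
kubectl run netpol-probe --rm -it --restart=Never \
--image=busybox:1.36 -n atp-production -- \
wget -q -O- -T 3 http://atp-ingestion:8080 || echo "blocked as expected"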
Ingress Rules (Allow Specific Sources)¶
Allow Ingress from Gateway:
# apps/atp-ingestion/base/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: atp-ingestion-ingress
namespace: atp-production
spec:
podSelector:
matchLabels:
app: atp-ingestion
policyTypes:
- Ingress
ingress:
# Allow ingress from gateway
- from:
- podSelector:
matchLabels:
app: atp-gateway
namespaceSelector:
matchLabels:
name: atp-production
ports:
- protocol: TCP
port: 8080
# Allow ingress from ingress controller
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
podSelector:
matchLabels:
app.kubernetes.io/name: ingress-nginx
ports:
- protocol: TCP
port: 8080
Egress Rules (Allow Specific Destinations)¶
Allow Egress to Dependencies:
# apps/atp-ingestion/base/network-policy-egress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: atp-ingestion-egress
namespace: atp-production
spec:
podSelector:
matchLabels:
app: atp-ingestion
policyTypes:
- Egress
egress:
# Allow egress to SQL Database
- to:
- namespaceSelector:
matchLabels:
name: external-services
ports:
- protocol: TCP
port: 1433 # SQL Server
# Allow egress to Redis
- to:
- podSelector:
matchLabels:
app: redis
ports:
- protocol: TCP
port: 6379
# Allow DNS resolution
- to:
- namespaceSelector:
matchLabels:
name: kube-system
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
# Allow HTTPS egress to Azure services via Private Link
# (empty pod/namespace selectors only match in-cluster pods; external
# endpoints need an ipBlock covering the Private Endpoint subnet)
- to:
- ipBlock:
cidr: 10.1.0.0/24 # example Private Endpoint subnet; adjust to your VNet layout
ports:
- protocol: TCP
port: 443
# Allow egress to monitoring
- to:
- namespaceSelector:
matchLabels:
name: monitoring
podSelector:
matchLabels:
app: prometheus
ports:
- protocol: TCP
port: 9090
DNS and Monitoring Exceptions¶
DNS Exception:
# platform/network-policies/allow-dns.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns
namespace: atp-production
spec:
podSelector: {} # Apply to all pods
policyTypes:
- Egress
egress:
# Allow DNS queries to CoreDNS
- to:
- namespaceSelector:
matchLabels:
name: kube-system
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
Monitoring Exception:
# platform/network-policies/allow-monitoring.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-monitoring
namespace: atp-production
spec:
podSelector: {} # Apply to all pods
policyTypes:
- Ingress
- Egress
ingress:
# Allow Prometheus to scrape pod metrics endpoints (scraping is inbound to the pods)
- from:
- namespaceSelector:
matchLabels:
name: monitoring
podSelector:
matchLabels:
app: prometheus
egress:
# Allow log forwarding
- to:
- namespaceSelector:
matchLabels:
name: logging
podSelector:
matchLabels:
app: fluent-bit
ports:
- protocol: TCP
port: 24224
Network Policy per Service¶
Service-Specific Network Policies:
# apps/atp-query/base/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: atp-query-network-policy
namespace: atp-production
spec:
podSelector:
matchLabels:
app: atp-query
policyTypes:
- Ingress
- Egress
ingress:
# Allow from gateway
- from:
- podSelector:
matchLabels:
app: atp-gateway
ports:
- protocol: TCP
port: 8080
# Allow from ingestion service
- from:
- podSelector:
matchLabels:
app: atp-ingestion
ports:
- protocol: TCP
port: 8080
egress:
# Allow to Redis
- to:
- podSelector:
matchLabels:
app: redis
ports:
- protocol: TCP
port: 6379
# Allow to SQL
- to:
- namespaceSelector:
matchLabels:
name: external-services
ports:
- protocol: TCP
port: 1433
OPA Gatekeeper Alternative¶
Open Policy Agent Overview¶
OPA Gatekeeper Architecture:
graph LR
A[Policy Templates<br/>in Git] -->|Deploy| B[Gatekeeper<br/>Controller]
B -->|Creates| C[Constraint CRDs]
C -->|Enforces| D[Kubernetes<br/>Admission Webhook]
D -->|Validates| E[Resource Requests]
E -->|Allows| F[Compliant Resources]
E -.->|Violates| G[Rejected Resources]
style A fill:#90EE90
style B fill:#FFE5B4
style C fill:#FFE5B4
style D fill:#FFE5B4
style E fill:#ffcccc
style F fill:#90EE90
style G fill:#ff9999
Install Gatekeeper:
# Install Gatekeeper
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/release-3.14/deploy/gatekeeper.yaml
# Verify installation
kubectl get pods -n gatekeeper-system
Gatekeeper Constraints and Templates¶
ConstraintTemplate: Require Resource Limits:
# policies/gatekeeper/require-resource-limits-template.yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
name: k8srequiredresourcelimits
spec:
crd:
spec:
names:
kind: K8sRequiredResourceLimits
validation:
openAPIV3Schema:
type: object
properties:
cpu:
type: string
default: "100m"
memory:
type: string
default: "128Mi"
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package k8srequiredresourcelimits
violation[{"msg": msg}] {
container := input.review.object.spec.template.spec.containers[_]
not container.resources
msg := sprintf("Container '%v' must have resources defined", [container.name])
}
violation[{"msg": msg}] {
container := input.review.object.spec.template.spec.containers[_]
not container.resources.limits
msg := sprintf("Container '%v' must have resource limits", [container.name])
}
violation[{"msg": msg}] {
container := input.review.object.spec.template.spec.containers[_]
not container.resources.limits.cpu
msg := sprintf("Container '%v' must have CPU limit", [container.name])
}
violation[{"msg": msg}] {
container := input.review.object.spec.template.spec.containers[_]
not container.resources.limits.memory
msg := sprintf("Container '%v' must have memory limit", [container.name])
}
Constraint: Enforce Resource Limits:
# policies/gatekeeper/require-resource-limits-constraint.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResourceLimits
metadata:
name: require-resource-limits-production
spec:
match:
kinds:
- apiGroups: ["apps"]
kinds: ["Deployment", "StatefulSet", "DaemonSet"]
excludedNamespaces:
- kube-system
- gatekeeper-system
- external-secrets-system
- ingress-nginx
parameters:
cpu: "100m"
memory: "128Mi"
Custom Policy Authoring with Rego¶
Rego Policy: Require Non-Root User:
# policies/gatekeeper/require-non-root.rego
package requirenonroot
violation[{"msg": msg}] {
container := input.review.object.spec.template.spec.containers[_]
not container.securityContext
msg := sprintf("Container '%v' must have securityContext defined", [container.name])
}
violation[{"msg": msg}] {
container := input.review.object.spec.template.spec.containers[_]
container.securityContext
not container.securityContext.runAsNonRoot
msg := sprintf("Container '%v' must run as non-root user", [container.name])
}
violation[{"msg": msg}] {
container := input.review.object.spec.template.spec.containers[_]
container.securityContext
container.securityContext.runAsNonRoot == false
msg := sprintf("Container '%v' must run as non-root user (currently runAsNonRoot=false)", [container.name])
}
ConstraintTemplate:
# policies/gatekeeper/require-non-root-template.yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
name: k8srequirednonroot
spec:
crd:
spec:
names:
kind: K8sRequiredNonRoot
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package requirenonroot
violation[{"msg": msg}] {
container := input.review.object.spec.template.spec.containers[_]
not container.securityContext
msg := sprintf("Container '%v' must have securityContext defined", [container.name])
}
violation[{"msg": msg}] {
container := input.review.object.spec.template.spec.containers[_]
container.securityContext
not container.securityContext.runAsNonRoot
msg := sprintf("Container '%v' must run as non-root user", [container.name])
}
Constraint:
# policies/gatekeeper/require-non-root-constraint.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredNonRoot
metadata:
name: require-non-root-production
spec:
match:
kinds:
- apiGroups: ["apps"]
kinds: ["Deployment", "StatefulSet"]
namespaces:
- atp-production
excludedNamespaces:
- kube-system
Integration with CI/CD¶
PR Validation with Gatekeeper:
# .azuredevops/pipelines/pr-validation-gatekeeper.yml
stages:
- stage: ValidateGatekeeper
displayName: 'Validate with Gatekeeper'
jobs:
- job: GatekeeperValidation
steps:
- script: |
# Install the gator CLI (Gatekeeper's policy evaluation tool)
GATOR_VERSION=v3.14.0
wget -q "https://github.com/open-policy-agent/gatekeeper/releases/download/${GATOR_VERSION}/gator-${GATOR_VERSION}-linux-amd64.tar.gz"
tar xf "gator-${GATOR_VERSION}-linux-amd64.tar.gz"
sudo mv gator /usr/local/bin/
displayName: 'Install gator CLI'
- script: |
# Evaluate manifests against the Gatekeeper templates and constraints in the repo
gator test \
--filename=policies/gatekeeper/ \
--filename=apps/ || exit 1
displayName: 'Validate manifests against Gatekeeper policies'
Image Signing and Verification¶
Image Signing with Notary/Cosign¶
Cosign Image Signing:
# Install Cosign
wget -O cosign https://github.com/sigstore/cosign/releases/latest/download/cosign-linux-amd64
chmod +x cosign
sudo mv cosign /usr/local/bin/
# Generate signing key pair
cosign generate-key-pair
# Sign container image
cosign sign --key cosign.key \
connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
# Verify signature
cosign verify --key cosign.pub \
connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
Azure Pipeline: Image Signing:
# .azuredevops/pipelines/image-signing.yml
- stage: SignImage
displayName: 'Sign Container Image'
jobs:
- job: SignWithCosign
steps:
- script: |
# Install Cosign
wget -O cosign https://github.com/sigstore/cosign/releases/latest/download/cosign-linux-amd64
chmod +x cosign
sudo mv cosign /usr/local/bin/
displayName: 'Install Cosign'
- task: AzureKeyVault@2
inputs:
azureSubscription: 'ATP-KeyVault-Connection'
KeyVaultName: 'atp-shared-kv'
SecretsFilter: 'cosign-private-key'
displayName: 'Retrieve Cosign private key'
- script: |
# Sign image; the env:// scheme reads the key from an environment variable instead of disk
cosign sign --key env://COSIGN_PRIVATE_KEY \
--yes \
$(ImageRepository):$(ImageTag)
echo "✅ Image signed: $(ImageRepository):$(ImageTag)"
displayName: 'Sign container image'
env:
COSIGN_PRIVATE_KEY: $(cosign-private-key)
COSIGN_PASSWORD: $(cosign-key-password)
Signature Storage in ACR¶
ACR Repository Configuration:
# Enable a retention policy for untagged manifests
az acr config retention update \
--registry connectsoft \
--days 30 \
--status Enabled
# List manifests (Cosign signatures appear as additional manifests/tags)
az acr repository show-manifests \
--name connectsoft \
--repository atp/ingestion \
--detail
Cosign with ACR:
# Sign the image; the signature is pushed to the same ACR repository
cosign sign --key cosign.key \
connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
Admission Controller for Verification¶
Image Policy Webhook:
# platform/image-policy-webhook.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
name: image-policy-webhook
webhooks:
- name: image-policy.atp.connectsoft.com
clientConfig:
service:
name: image-policy-webhook
namespace: image-policy-system
path: "/validate"
rules:
- operations: ["CREATE", "UPDATE"]
apiGroups: ["apps"]
apiVersions: ["v1"]
resources: ["deployments", "statefulsets", "daemonsets"]
admissionReviewVersions: ["v1", "v1beta1"]
sideEffects: None
failurePolicy: Fail
Image Verification with Cosign Admission Controller:
# Install Cosign admission controller
kubectl apply -f https://raw.githubusercontent.com/sigstore/policy-controller/main/config/release/policy-controller.yaml
Image Policy:
# platform/image-policy.yaml
apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
name: atp-image-policy
spec:
images:
- glob: "connectsoft.azurecr.io/atp/**"
authorities:
- key:
data: |
-----BEGIN PUBLIC KEY-----
MFkwEwYHKoZIzj0CAQYIKoZIzj0CAQYIKoZIzj0CAQYIKoZIzj0CAQYIKoZI...
-----END PUBLIC KEY-----
- keyless:
identities:
- issuer: "https://token.actions.githubusercontent.com"
subject: "https://github.com/ConnectSoft/ATP/.github/workflows/*"
mode: enforce # enforce or warn
Rejecting Unsigned Images¶
Policy Enforcement:
# With mode: enforce
# Unsigned images will be rejected at admission time
# Error: Image signature verification failed
# With mode: warn
# Unsigned images will be allowed but warnings logged
SBOM Generation and Storage¶
Generating SBOM During CI Build¶
SBOM Generation in Pipeline:
# .azuredevops/pipelines/sbom-generation.yml
- stage: GenerateSBOM
displayName: 'Generate SBOM'
jobs:
- job: GenerateSBOM
steps:
- script: |
# Install Syft
curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
syft version
displayName: 'Install Syft'
- script: |
# Generate SBOM in SPDX format
syft packages docker:$(ImageRepository):$(ImageTag) \
--output spdx-json \
--file sbom-$(ImageTag).spdx.json
# Generate SBOM in CycloneDX format
syft packages docker:$(ImageRepository):$(ImageTag) \
--output cyclonedx-json \
--file sbom-$(ImageTag).cyclonedx.json
echo "✅ SBOM generated for $(ImageRepository):$(ImageTag)"
displayName: 'Generate SBOM'
- script: |
# Attach SBOM to ACR image as OCI artifact
oras attach \
--artifact-type application/vnd.cyclonedx+json \
connectsoft.azurecr.io/atp/ingestion:$(ImageTag) \
sbom-$(ImageTag).cyclonedx.json
displayName: 'Attach SBOM to image'
SBOM Formats (CycloneDX, SPDX)¶
SPDX Format Example:
{
"SPDXID": "SPDXRef-DOCUMENT",
"spdxVersion": "SPDX-2.3",
"name": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d",
"dataLicense": "CC0-1.0",
"documentNamespace": "https://connectsoft.example/sbom/atp-ingestion/v1.2.3-abc123d",
"packages": [
{
"SPDXID": "SPDXRef-Package-dotnet-runtime",
"name": "dotnet-runtime",
"versionInfo": "8.0.0",
"downloadLocation": "NOASSERTION",
"filesAnalyzed": false,
"licenseConcluded": "NOASSERTION"
},
{
"SPDXID": "SPDXRef-Package-aspnetcore",
"name": "aspnetcore",
"versionInfo": "8.0.0",
"downloadLocation": "NOASSERTION",
"filesAnalyzed": false,
"licenseConcluded": "NOASSERTION"
}
]
}
CycloneDX Format Example:
{
"bomFormat": "CycloneDX",
"specVersion": "1.5",
"version": 1,
"metadata": {
"timestamp": "2024-01-15T10:00:00Z",
"tools": [
{
"vendor": "Anchore",
"name": "syft",
"version": "1.0.0"
}
],
"component": {
"type": "container",
"name": "atp-ingestion",
"version": "v1.2.3-abc123d"
}
},
"components": [
{
"type": "library",
"name": "dotnet-runtime",
"version": "8.0.0"
},
{
"type": "library",
"name": "aspnetcore",
"version": "8.0.0"
}
]
}
Storing SBOM in ACR Artifacts¶
Attach SBOM to ACR Image:
# Install ORAS CLI (release assets are versioned)
ORAS_VERSION=1.1.0
wget -q "https://github.com/oras-project/oras/releases/download/v${ORAS_VERSION}/oras_${ORAS_VERSION}_linux_amd64.tar.gz"
tar xf "oras_${ORAS_VERSION}_linux_amd64.tar.gz" oras
sudo mv oras /usr/local/bin/
# Login to ACR
az acr login --name connectsoft
# Attach SBOM as OCI artifact
oras attach \
--artifact-type application/vnd.cyclonedx+json \
connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d \
sbom-v1.2.3-abc123d.cyclonedx.json
# Attach SPDX SBOM
oras attach \
--artifact-type application/spdx+json \
connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d \
sbom-v1.2.3-abc123d.spdx.json
Query SBOM from ACR:
# List attached artifacts (including SBOMs) as referrers
oras discover \
--artifact-type application/vnd.cyclonedx+json \
connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
# Download the SBOM by the digest reported above
oras pull connectsoft.azurecr.io/atp/ingestion@<sbom-digest> -o ./sbom/
Vulnerability Scanning of SBOM¶
Scan SBOM for Vulnerabilities:
# Scan SBOM with Grype
grype sbom:sbom-$(ImageTag).cyclonedx.json \
--output json \
--file vulnerability-report-$(ImageTag).json
# Or scan with Trivy
trivy sbom sbom-$(ImageTag).cyclonedx.json \
--format json \
--output trivy-sbom-report-$(ImageTag).json
Vulnerability Scanning¶
Azure Defender for Containers¶
Enable Azure Defender:
# Enable Defender for Containers
az security pricing create \
--name "Containers" \
--tier "Standard"
Defender for Containers Configuration:
// Enable Defender for Containers via Pulumi
new Security.Pricing("defender-containers", new()
{
PricingTier = "Standard",
SubPlan = "DefenderForContainers",
});
Trivy Scanning in CI Pipeline¶
Trivy Vulnerability Scan:
# .azuredevops/pipelines/vulnerability-scanning.yml
- stage: VulnerabilityScan
displayName: 'Vulnerability Scanning'
jobs:
- job: TrivyScan
steps:
- script: |
# Install Trivy via the official install script
curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | \
sh -s -- -b /usr/local/bin
displayName: 'Install Trivy'
- script: |
# Scan container image
trivy image \
--format json \
--output trivy-$(ImageTag).json \
--severity HIGH,CRITICAL \
--exit-code 0 \
$(ImageRepository):$(ImageTag)
displayName: 'Scan image for vulnerabilities'
continueOnError: true
- script: |
# Generate HTML report
trivy image \
--format template \
--template "@contrib/html.tpl" \
--output trivy-$(ImageTag).html \
--severity HIGH,CRITICAL \
$(ImageRepository):$(ImageTag)
# Publish report
echo "##vso[task.addattachment type=Distributedtask.Core.Summary;name=Vulnerability Report;]$PWD/trivy-$(ImageTag).html"
displayName: 'Generate vulnerability report'
- script: |
# Fail build if critical vulnerabilities found
CRITICAL_COUNT=$(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity=="CRITICAL")] | length' trivy-$(ImageTag).json)
if [ "$CRITICAL_COUNT" -gt 0 ]; then
echo "❌ Critical vulnerabilities found: $CRITICAL_COUNT"
exit 1
fi
echo "✅ No critical vulnerabilities found"
displayName: 'Check for critical vulnerabilities'
Runtime Vulnerability Detection¶
Trivy Operator for Runtime Scanning:
# Install Trivy Operator
helm repo add aqua https://aquasecurity.github.io/helm-charts/
helm repo update
helm install trivy-operator aqua/trivy-operator \
--namespace trivy-system \
--create-namespace \
--version 0.18.0
VulnerabilityReport CRD:
# Trivy Operator automatically creates VulnerabilityReport resources
apiVersion: aquasecurity.github.io/v1alpha1
kind: VulnerabilityReport
metadata:
name: atp-ingestion-abc123
namespace: atp-production
report:
artifact:
repository: connectsoft.azurecr.io/atp/ingestion
tag: v1.2.3-abc123d
summary:
criticalCount: 0
highCount: 2
mediumCount: 5
lowCount: 10
Query Vulnerability Reports:
# List vulnerability reports
kubectl get vulnerabilityreports -n atp-production
# View detailed report
kubectl get vulnerabilityreport atp-ingestion-abc123 -n atp-production -o yaml
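For a namespace-wide overview, the reports can be summarized with jq (a sketch; field paths follow the Trivy Operator report schema):
# Summarize severity counts across all reports in the namespace
kubectl get vulnerabilityreports -n atp-production -o json | \
jq -r '.items[] | "\(.metadata.name): critical=\(.report.summary.criticalCount) high=\(.report.summary.highCount)"'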
Remediation Workflows¶
Vulnerability Remediation Process:
graph LR
A[Vulnerability<br/>Detected] -->|Alert| B[Security Team]
B -->|Assess| C{Critical?}
C -->|Yes| D[Immediate Patch]
C -->|No| E[Schedule Patch]
D -->|Rebuild Image| F[Rescan]
E -->|Rebuild Image| F
F -->|Verify| G[Deploy]
style A fill:#ffcccc
style D fill:#ff9999
style F fill:#90EE90
style G fill:#90EE90
Remediation Script:
#!/bin/bash
# scripts/remediate-vulnerability.sh
IMAGE="${1:-connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d}"
VULN_ID="${2:-CVE-2024-1234}"
echo "🔧 Remediating vulnerability: $VULN_ID in $IMAGE"
# 1. Update dependencies
# 2. Rebuild image
# 3. Rescan
trivy image --severity HIGH,CRITICAL "$IMAGE" | grep -q "$VULN_ID" && \
echo "❌ Vulnerability still present" || \
echo "✅ Vulnerability remediated"
RBAC Policies in Kubernetes¶
ServiceAccounts per ATP Service¶
ServiceAccount Definition:
# apps/atp-ingestion/base/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: atp-ingestion
namespace: atp-production
labels:
app: atp-ingestion
managed-by: fluxcd
Roles and RoleBindings¶
Role: Service-Specific Permissions:
# apps/atp-ingestion/base/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: atp-ingestion-role
namespace: atp-production
rules:
# Allow read ConfigMaps in same namespace
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list", "watch"]
# Allow read Secrets in same namespace
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list"]
resourceNames:
- sql-connection-string
- redis-connection-string
RoleBinding:
# apps/atp-ingestion/base/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: atp-ingestion-rolebinding
namespace: atp-production
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: atp-ingestion-role
subjects:
- kind: ServiceAccount
name: atp-ingestion
namespace: atp-production
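Effective permissions can be verified with kubectl auth can-i by impersonating the ServiceAccount:
# Granted by the Role above (expect: yes)
kubectl auth can-i get configmaps -n atp-production \
--as=system:serviceaccount:atp-production:atp-ingestion
# Not granted (expect: no)
kubectl auth can-i create pods -n atp-production \
--as=system:serviceaccount:atp-production:atp-ingestion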
ClusterRoles and ClusterRoleBindings¶
ClusterRole: Cross-Namespace Permissions:
# platform/rbac/clusterrole-monitoring.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: atp-monitoring-reader
rules:
# Allow read pods for metrics
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "watch"]
# Allow read nodes for metrics
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch"]
resourceNames: []
ClusterRoleBinding:
# platform/rbac/clusterrolebinding-monitoring.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: atp-monitoring-reader-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: atp-monitoring-reader
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring
Least Privilege Principle¶
Least Privilege RBAC Matrix:
| Service | ServiceAccount | Role | Permissions |
|---|---|---|---|
| atp-ingestion | atp-ingestion | Role (namespace-scoped) | Read ConfigMaps, read specific Secrets |
| atp-query | atp-query | Role (namespace-scoped) | Read ConfigMaps, read specific Secrets |
| prometheus | prometheus | ClusterRole | Read Pods, Nodes (cluster-wide) |
| fluent-bit | fluent-bit | ClusterRole | Read Pods, Namespaces (cluster-wide) |
RBAC Audit Script:
#!/bin/bash
# scripts/audit-rbac.sh
echo "🔍 Auditing RBAC permissions..."
# List all ServiceAccounts with excessive permissions
kubectl get clusterrolebindings -o json | \
jq -r '.items[] | select(.subjects[].kind=="ServiceAccount") | .metadata.name'
# Check for wildcard permissions
kubectl get clusterroles -o json | \
jq -r '.items[] | select(.rules[]?.verbs[]?=="*") | .metadata.name'
Audit Logging¶
Kubernetes Audit Logs¶
Enable Audit Logging:
# cluster-config/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Log all requests in these stages
- level: Metadata
namespaces: ["atp-production"]
verbs: ["create", "update", "patch", "delete"]
- level: RequestResponse
namespaces: ["atp-production"]
resources:
- group: ""
resources: ["secrets", "configmaps"]
verbs: ["create", "update", "patch", "delete"]
- level: None
users: ["system:kube-proxy"]
verbs: ["watch"]
resources:
- group: ""
resources: ["endpoints", "services", "services/status"]
Configure Audit Logging on AKS:
# AKS control-plane audit logs are exported via a diagnostic setting (kube-audit category)
az monitor diagnostic-settings create \
--name aks-audit-logs \
--resource "$(az aks show --resource-group atp-production-rg --name atp-prod-eus-aks --query id -o tsv)" \
--workspace atp-prod-loganalytics \
--logs '[{"category": "kube-audit", "enabled": true}]'
Forwarding to Azure Monitor¶
Audit Log Forwarding:
# platform/audit-log-forwarder.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: audit-log-forwarder-config
namespace: kube-system
data:
fluent-bit.conf: |
[INPUT]
Name tail
Path /var/log/audit/kube-apiserver-audit.log
Parser json
Tag kube-audit.*
Refresh_Interval 5
[OUTPUT]
Name azure
Match kube-audit.*
Customer_ID ${LOG_ANALYTICS_WORKSPACE_ID}
Shared_Key ${LOG_ANALYTICS_SHARED_KEY}
Audit Policy Configuration¶
Comprehensive Audit Policy:
# cluster-config/audit-policy-comprehensive.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
- "RequestReceived"
rules:
# Log all requests to production namespace
- level: RequestResponse
namespaces: ["atp-production"]
- level: Metadata
namespaces: ["atp-staging"]
# Log secret access
- level: RequestResponse
resources:
- group: ""
resources: ["secrets"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# Log RBAC changes
- level: RequestResponse
resources:
- group: "rbac.authorization.k8s.io"
resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
verbs: ["create", "update", "patch", "delete"]
Log Retention and Analysis¶
Audit Log Query (kube-audit correlated with pod inventory):
// Kubernetes audit events joined with pod inventory
KubePodInventory
| where Namespace == "atp-production"
| join kind=inner (
AzureDiagnostics
| where Category == "kube-audit"
| extend Audit = parse_json(log_s)
| extend Verb = tostring(Audit.verb), User = tostring(Audit.user.username), ObjectName = tostring(Audit.objectRef.name)
| where Verb in ("create", "update", "delete")
) on $left.Name == $right.ObjectName
| project TimeGenerated, Verb, User, Name, Namespace
| order by TimeGenerated desc
Audit Log Analysis:
// Secret access audit trail (kube-audit entries carry the audit record in log_s)
AzureDiagnostics
| where Category == "kube-audit"
| extend Audit = parse_json(log_s)
| where tostring(Audit.objectRef.resource) == "secrets"
| extend User = tostring(Audit.user.username), Action = tostring(Audit.verb)
| project TimeGenerated, User, Action, ObjectName = tostring(Audit.objectRef.name), Namespace = tostring(Audit.objectRef.namespace)
| order by TimeGenerated desc
Policy Enforcement via GitOps¶
Policy as Code in Git¶
Policy Organization in Git:
atp-gitops/
├── policies/
│ ├── azure-policy/
│ │ ├── require-resource-limits.json
│ │ └── require-non-root.json
│ ├── gatekeeper/
│ │ ├── constraint-templates/
│ │ │ ├── require-resource-limits-template.yaml
│ │ │ └── require-non-root-template.yaml
│ │ └── constraints/
│ │ ├── require-resource-limits-constraint.yaml
│ │ └── require-non-root-constraint.yaml
│ └── network-policies/
│ ├── default-deny-all.yaml
│ └── service-policies/
│ ├── atp-ingestion-network-policy.yaml
│ └── atp-query-network-policy.yaml
Automated Policy Application¶
FluxCD Kustomization for Policies:
# infrastructure/policies/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
# Azure Policy assignments
- azure-policy/assignments.yaml
# Gatekeeper templates
- gatekeeper/constraint-templates/
# Gatekeeper constraints
- gatekeeper/constraints/
# Network policies
- network-policies/default-deny-all.yaml
- network-policies/service-policies/
FluxCD GitRepository for Policies:
# clusters/production/policies-gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-policies
namespace: flux-system
spec:
interval: 5m
url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
ref:
branch: production
secretRef:
name: gitops-credentials
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: policies
namespace: flux-system
spec:
interval: 10m
sourceRef:
kind: GitRepository
name: atp-policies
path: ./policies/
prune: true
Policy Violation Detection¶
Monitor Policy Violations:
# Check Azure Policy violations
az policy state list \
--resource "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.ContainerService/managedClusters/atp-prod-eus-aks" \
--filter "complianceState eq 'NonCompliant'" \
--query "[].{resource:resourceId, policy:policyAssignmentName, reason:complianceReason}" \
--output table
# Check Gatekeeper constraint violations
kubectl get constraints -A
kubectl describe k8srequiredresourcelimits require-resource-limits-production -n atp-production
Policy Violation Alert:
# alerts/policy-violation.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: policy-violations
namespace: monitoring
spec:
groups:
- name: policy-violations
rules:
- alert: AzurePolicyViolation
expr: |
azure_policy_compliance_state{state="NonCompliant"} > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Azure Policy violation detected"
description: "{{ $value }} non-compliant resources detected"
Remediation Through PR¶
Policy Violation Remediation Workflow:
graph LR
A[Policy Violation<br/>Detected] -->|Alert| B[Developer]
B -->|Create PR| C[Fix Manifest]
C -->|Merge| D[GitOps Sync]
D -->|Apply| E[Compliant Resource]
style A fill:#ffcccc
style C fill:#90EE90
style E fill:#90EE90
Remediation PR Process:
- Developer receives policy violation alert
- Create PR to fix manifest
- PR validation ensures compliance
- Merge PR triggers FluxCD sync
- Policy violation resolved
Compliance Evidence Generation¶
Deployment Receipts with Approvals¶
Deployment Receipt Generation:
#!/bin/bash
# scripts/generate-deployment-receipt.sh
DEPLOYMENT_NAME="${1:-atp-ingestion}"
NAMESPACE="${2:-atp-production}"
cat > deployment-receipt-$(date +%Y%m%d-%H%M%S).json <<EOF
{
"deploymentId": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.metadata.uid}')",
"deployedAt": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"deployedBy": "FluxCD",
"gitCommit": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.metadata.labels.app\.kubernetes\.io/version}')",
"approvals": [
{
"approver": "architect-team@connectsoft.example",
"approvedAt": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"approvalType": "CAB"
}
],
"policyCompliance": {
"azurePolicy": "Compliant",
"gatekeeper": "Compliant",
"podSecurity": "Compliant"
}
}
EOF
Security Scan Reports¶
Security Scan Evidence:
# Generate security scan evidence
cat > security-scan-evidence-$(date +%Y%m%d).json <<EOF
{
"scanDate": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"image": "$(ImageRepository):$(ImageTag)",
"scanner": "Trivy",
"vulnerabilities": {
"critical": 0,
"high": 2,
"medium": 5,
"low": 10
},
"compliance": "Pass",
"scanReport": "trivy-$(ImageTag).json"
}
EOF
SBOM Artifacts¶
SBOM Evidence:
{
"sbomGeneratedAt": "2024-01-15T10:00:00Z",
"image": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d",
"sbomFormat": "CycloneDX",
"sbomVersion": "1.5",
"sbomLocation": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d (OCI artifact)",
"components": 245,
"verification": {
"signed": true,
"signatureVerified": true
}
}
Policy Compliance Reports¶
Compliance Report Generation:
// Generate compliance report
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where TimeGenerated > ago(30d)
| extend SecretName = tostring(parse_json(properties_s).objectName)
| summarize
TotalAccess = count(),
UniqueIdentities = dcount(parse_json(properties_s).identity_claim_appid_g)
by SecretName
| project SecretName, TotalAccess, UniqueIdentities, ComplianceStatus = "Compliant"
SOC 2 / GDPR / HIPAA Controls¶
Mapping GitOps Workflows to Controls¶
SOC 2 Control Mapping:
| SOC 2 Control | GitOps Implementation | Evidence |
|---|---|---|
| CC6.1 Logical Access Controls | RBAC in Kubernetes, Azure AD integration | RBAC audit logs, access reviews |
| CC6.2 Authentication | Workload Identity, Azure AD | Authentication logs |
| CC6.7 Audit Logging | Kubernetes audit logs, Key Vault logs | Log Analytics queries |
| CC7.2 Change Management | GitOps PR workflow, approvals | PR history, deployment receipts |
| CC8.1 Encryption | Secrets in Key Vault, TLS in transit | Key Vault encryption logs |
GDPR Control Mapping:
| GDPR Article | GitOps Implementation | Evidence |
|---|---|---|
| Art. 32 Security of Processing | Pod Security Standards, Network Policies | Security policy compliance reports |
| Art. 33 Breach Notification | Audit logging, alerting | Security incident logs |
| Art. 35 Data Protection Impact Assessment | SBOM, vulnerability scanning | SBOM artifacts, scan reports |
Evidence Collection Automation¶
Automated Evidence Collection:
# .azuredevops/pipelines/compliance-evidence.yml
- stage: CollectEvidence
displayName: 'Collect Compliance Evidence'
jobs:
- job: GenerateEvidence
steps:
- script: |
# Generate deployment receipt
./scripts/generate-deployment-receipt.sh atp-ingestion atp-production
# Generate security scan evidence
./scripts/generate-security-scan-evidence.sh
# Generate SBOM evidence
./scripts/generate-sbom-evidence.sh
# Generate policy compliance report
./scripts/generate-policy-compliance-report.sh
# Archive all evidence
tar -czf compliance-evidence-$(Build.BuildNumber).tar.gz \
deployment-receipt-*.json \
security-scan-evidence-*.json \
sbom-evidence-*.json \
policy-compliance-report-*.json
displayName: 'Collect compliance evidence'
- task: PublishPipelineArtifact@1
inputs:
targetPath: 'compliance-evidence-$(Build.BuildNumber).tar.gz'
artifactName: 'compliance-evidence'
Audit Trail for Compliance¶
Audit Trail Generation:
// Complete audit trail for compliance
let DeploymentEvents = KubePodInventory
| where Namespace == "atp-production"
| extend DeploymentTime = TimeGenerated;
let PolicyCompliance = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.AUTHORIZATION"
| where Category == "PolicyState"
| extend ComplianceTime = TimeGenerated;
let SecretAccess = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| extend AccessTime = TimeGenerated;
union withsource=EventType DeploymentEvents, PolicyCompliance, SecretAccess
| project TimeGenerated, EventType, Resource, OperationName
| order by TimeGenerated desc
Regular Access Reviews¶
Access Review Automation:
#!/bin/bash
# scripts/access-review.sh
echo "📋 Generating Access Review Report..."
# Review Kubernetes RBAC
echo "## Kubernetes RBAC Review" > access-review-$(date +%Y%m%d).md
kubectl get clusterrolebindings -o json | \
jq -r '.items[] | select(.subjects[].kind=="ServiceAccount") | "\(.metadata.name): \(.subjects[].name)"' \
>> access-review-$(date +%Y%m%d).md
# Review Azure Key Vault access
echo "## Key Vault Access Review" >> access-review-$(date +%Y%m%d).md
az role assignment list \
--scope "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.KeyVault/vaults/atp-prod-kv" \
--query "[].{principal:principalName, role:roleDefinitionName}" \
--output table >> access-review-$(date +%Y%m%d).md
Summary: Security Policies & Compliance¶
- Azure Policy for Kubernetes: Policy definitions, assignments, and enforcement for AKS clusters
- Pod Security Standards: Restricted profile enforcement, Pod Security Admission configuration, security context requirements
- Network Policies: Default deny, ingress/egress rules, DNS and monitoring exceptions, service-specific policies
- OPA Gatekeeper: Constraint templates, custom Rego policies, CI/CD integration
- Image Signing: Cosign signing, signature storage in ACR, admission controller verification
- SBOM Generation: CycloneDX and SPDX formats, storage in ACR artifacts, vulnerability scanning
- Vulnerability Scanning: Azure Defender, Trivy in CI, runtime detection, remediation workflows
- RBAC Policies: ServiceAccounts, Roles/RoleBindings, ClusterRoles, least privilege principle
- Audit Logging: Kubernetes audit logs, Azure Monitor forwarding, log retention and analysis
- Policy Enforcement via GitOps: Policy as code, automated application, violation detection, remediation through PR
- Compliance Evidence: Deployment receipts, security scan reports, SBOM artifacts, policy compliance reports
- SOC 2/GDPR/HIPAA Controls: Control mapping, evidence collection automation, audit trails, access reviews
FluxCD Continuous Reconciliation¶
Purpose: Define how FluxCD continuously reconciles the desired state from Git with the live Kubernetes cluster state, including reconciliation loops, drift detection, self-healing mechanisms, health assessment, and observability to ensure ATP deployments remain aligned with Git-managed manifests.
FluxCD Reconciliation Loop¶
How Reconciliation Works¶
Reconciliation Flow:
sequenceDiagram
participant Git as Git Repository
participant Source as Source Controller
participant Kustomize as Kustomize Controller
participant K8s as Kubernetes Cluster
Git->>Source: Poll for changes (every 1m)
Source->>Source: Fetch latest commit
Source->>Source: Compare with last sync
alt Changes detected
Source->>Source: Update GitRepository status
Source->>Kustomize: Trigger reconciliation
end
Kustomize->>Git: Fetch manifest artifacts
Kustomize->>K8s: Apply manifests (kubectl apply)
K8s->>Kustomize: Return apply result
Kustomize->>Kustomize: Update Kustomization status
alt Drift detected
Kustomize->>K8s: Re-apply to correct drift
end
Kustomize->>Source: Report reconciliation result
Reconciliation Components:
| Component | Responsibility | Reconciliation Trigger |
|---|---|---|
| Source Controller | Monitors Git repository | Polls Git every interval (default: 1m) |
| Kustomize Controller | Applies Kustomization | Triggered by Source Controller when changes detected |
| Helm Controller | Applies Helm releases | Triggered by Source Controller when HelmRepository changes |
| Image Automation Controller | Updates image tags | Triggered by ImagePolicy changes |
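Each controller's health and last reconciled revision can be inspected with the flux CLI:
# Verify controllers are installed and healthy
flux check
# Source Controller: fetched Git revisions
flux get sources git
# Kustomize Controller: applied revisions
flux get kustomizations
# Helm Controller: release status
flux get helmreleases -A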
Polling Interval Configuration¶
GitRepository Polling Interval:
# clusters/production/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops
namespace: flux-system
spec:
interval: 1m # Poll Git every 1 minute
url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
ref:
branch: production
timeout: 20s
ignore: |
# Exclude paths from reconciliation (gitignore-style .sourceignore syntax)
/docs/
/.git/
Kustomization Reconciliation Interval:
# apps/atp-ingestion/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-ingestion
namespace: flux-system
spec:
interval: 5m # Reconcile every 5 minutes
path: ./apps/atp-ingestion
prune: true
sourceRef:
kind: GitRepository
name: atp-gitops
dependsOn:
- name: infrastructure
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
namespace: atp-production
Environment-Specific Intervals:
| Environment | GitRepository Interval | Kustomization Interval | Rationale |
|---|---|---|---|
| Dev | 30s | 1m | Faster feedback loop |
| Test | 1m | 2m | Balance between speed and load |
| Staging | 2m | 5m | Reduced cluster load |
| Production | 5m | 10m | Stability over speed |
Reconciliation Triggers¶
Automatic Triggers:
- Git Commit: New commit pushed to monitored branch
- Polling Interval: Periodic check (even if no changes)
- Webhook: Immediate trigger via webhook (bypasses polling)
Webhook Configuration:
# clusters/production/receiver.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
name: gitops-receiver
namespace: flux-system
spec:
type: generic # Azure DevOps service hooks are handled via the generic receiver type
resources:
- kind: GitRepository
name: atp-gitops
secretRef:
name: gitops-webhook-token
---
# Azure DevOps webhook trigger
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops
spec:
interval: 1m
url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
ref:
branch: production
# Webhook URL: https://<cluster-ip>/hook/<token>
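The Receiver needs the token secret it references, and its generated webhook path must be registered as a service hook in Azure DevOps; a minimal sketch:
# Create the webhook token referenced by the Receiver
TOKEN=$(head -c 16 /dev/urandom | base64)
kubectl -n flux-system create secret generic gitops-webhook-token \
--from-literal=token="$TOKEN"
# Read the generated webhook path to register in Azure DevOps service hooks
kubectl -n flux-system get receiver gitops-receiver \
-o jsonpath='{.status.webhookPath}'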
Manual Trigger:
# Force immediate reconciliation
flux reconcile source git atp-gitops --with-source
# Force Kustomization reconciliation
flux reconcile kustomization atp-ingestion --with-source
# Trigger all Git sources (flux reconcile takes a name; loop over sources)
for src in $(kubectl get gitrepositories -n flux-system -o name); do
flux reconcile source git "${src##*/}"
done
Retry Strategies and Backoff¶
Retry Configuration:
# Kustomization with retry settings
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-ingestion
spec:
interval: 5m
retryInterval: 2m # Retry failed reconciliation every 2 minutes
timeout: 10m # Timeout after 10 minutes
path: ./apps/atp-ingestion
sourceRef:
kind: GitRepository
name: atp-gitops
# Failed reconciliations are retried every retryInterval until they succeed
Retry Behavior:
| Failure Type | Retry Interval | Max Retries | Backoff |
|---|---|---|---|
| Git fetch error | 1m | 3 | Linear |
| Apply error (transient) | 2m | 5 | Exponential (2x) |
| Apply error (permanent) | 5m | 10 | Exponential (1.5x) |
| Health check failure | 1m | Until healthy | Linear |
Retry Status:
# Check reconciliation status and retries
kubectl get kustomization atp-ingestion -n flux-system -o jsonpath='{.status}'
# Output:
# {
# "conditions": [{
# "type": "Ready",
# "status": "False",
# "reason": "ReconciliationFailed",
# "message": "apply failed: error applying manifests",
# "lastTransitionTime": "2024-01-15T10:00:00Z"
# }],
# "lastAppliedRevision": "abc123...",
# "lastAttemptedRevision": "def456...",
# "observedGeneration": 1,
# "snapshot": {...}
# }
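In pipelines, kubectl wait can block on the same Ready condition instead of polling the status manually:
# Block until the Kustomization reports Ready (or the timeout elapses)
kubectl -n flux-system wait kustomization/atp-ingestion \
--for=condition=Ready --timeout=10m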
Automated Sync Policies¶
Auto-Sync for Dev and Test Environments¶
Dev Environment Auto-Sync:
# clusters/dev/kustomization-dev.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-dev
namespace: flux-system
spec:
interval: 1m # Fast sync interval
path: ./apps
prune: true # Auto-delete removed resources
wait: true # Wait for resources to be ready
timeout: 5m
sourceRef:
kind: GitRepository
name: atp-gitops
force: true # Recreate resources when immutable fields change (acceptable in dev)
# Namespaces are created from manifests in the repo; Flux has no CreateNamespace option
Test Environment Auto-Sync:
# clusters/test/kustomization-test.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-test
namespace: flux-system
spec:
interval: 2m
path: ./apps
prune: true
wait: true
timeout: 10m
sourceRef:
kind: GitRepository
name: atp-gitops
force: false
Manual Sync for Staging and Production¶
Staging Environment Manual Sync:
# clusters/staging/kustomization-staging.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-staging
namespace: flux-system
spec:
suspend: false # Reconciliation enabled
interval: 5m # Still polls for changes
path: ./apps
prune: false # Manual pruning only
wait: true
timeout: 15m
sourceRef:
kind: GitRepository
name: atp-gitops
# Manual sync via: flux reconcile kustomization apps-staging
Production Environment Manual Sync:
# clusters/production/kustomization-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
namespace: flux-system
spec:
suspend: false
interval: 10m # Longer interval
path: ./apps
prune: false # Never auto-prune in production
wait: true
timeout: 20m
sourceRef:
kind: GitRepository
name: atp-gitops
# Requires explicit: flux reconcile kustomization apps-production
Manual Sync Workflow:
# 1. Review changes in Git
git log --oneline production
# 2. Trigger manual sync
flux reconcile kustomization apps-production --with-source
# 3. Monitor sync status
flux get kustomizations apps-production --watch
# 4. Verify deployment
kubectl rollout status deployment/atp-ingestion -n atp-production
Sync Options (Prune, Force, Wait)¶
Sync Options Reference:
| Option | Description | Use Case |
|---|---|---|
| prune | Delete resources removed from Git | Cleanup unused resources |
| wait | Wait for applied resources to become ready | Ensure deployment success |
| force | Recreate resources when immutable fields change | Handle immutable field updates |
| suspend | Pause reconciliation for the Kustomization | Maintenance windows, incident freeze |
| targetNamespace | Set the namespace for all reconciled resources | Simplify namespace management |
| timeout | Bound how long apply and health checks may run | Prevent hung reconciliations |
Comprehensive Sync Options:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-ingestion-production
spec:
interval: 10m
path: ./apps/atp-ingestion
prune: false # Production: manual pruning
wait: true # Wait for Deployment to be ready
timeout: 20m
retryInterval: 5m
sourceRef:
kind: GitRepository
name: atp-gitops
  force: false # Don't delete/recreate on immutable-field conflicts (safer)
  # Namespace creation and apply-then-prune ordering are implicit in Flux;
  # there is no syncOptions field
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
namespace: atp-production
Per-Resource Sync Configuration¶
Service-Specific Sync Settings:
# apps/atp-ingestion/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-ingestion
namespace: flux-system
spec:
interval: 5m
path: ./apps/atp-ingestion
prune: true
wait: true
timeout: 15m
sourceRef:
kind: GitRepository
name: atp-gitops
dependsOn:
- name: infrastructure # Wait for infrastructure first
- name: secrets # Wait for secrets to be synced
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
namespace: atp-production
# Per-resource sync control uses labels/annotations on the resources themselves,
# e.g. kustomize.toolkit.fluxcd.io/prune: disabled to exempt a resource from pruning
Drift Detection Mechanisms¶
Comparing Git State to Live Cluster¶
Drift Detection Flow:
graph LR
A[Git State<br/>Manifests] -->|Compare| B[Cluster State<br/>Live Resources]
B -->|Matches| C[No Action]
B -->|Differs| D[Drift Detected]
D -->|Self-Heal Enabled| E[Re-apply from Git]
D -->|Self-Heal Disabled| F[Alert Only]
E -->|Success| C
E -->|Failure| G[Retry/Alert]
style A fill:#90EE90
style B fill:#FFE5B4
style D fill:#ffcccc
style E fill:#90EE90
style F fill:#ff9999
Drift Detection Process:
- Fetch Git State: Source Controller fetches latest manifests
- Fetch Cluster State: Kustomize Controller queries Kubernetes API
- Compute Diff: Compare desired (Git) vs actual (Cluster) state
- Detect Drift: Identify differences
- Correct Drift: Re-apply manifests (if self-healing enabled)
Check Drift Status:
# Check for drift
flux get kustomizations atp-ingestion
# Output shows:
# NAME READY MESSAGE REVISION SUSPENDED
# atp-ingestion True Applied revision: abc123def abc123def False
# Detailed drift information
kubectl describe kustomization atp-ingestion -n flux-system
# Events show drift detection:
# Warning ReconciliationFailed drift detected: Deployment replicas changed from 3 to 5
Drift Types (Manual Changes, External Controllers)¶
Manual Changes:
# Example: Manual replica scaling
kubectl scale deployment atp-ingestion -n atp-production --replicas=5
# FluxCD detects drift:
# Warning ReconciliationFailed drift detected in Deployment/atp-ingestion:
# spec.replicas: expected 3, found 5
# With self-healing: FluxCD reverts to 3 replicas
# Without self-healing: Alert only
External Controller Changes:
# Example: HPA scales a deployment to 4 replicas based on CPU usage
# FluxCD behavior depends on whether Git sets spec.replicas:
# - replicas omitted in Git: no drift; the HPA owns the field
# - replicas: 3 in Git: drift detected, FluxCD reverts to 3 and fights the HPA
# Rule: omit spec.replicas from manifests for any HPA-managed Deployment
Resource Annotation for Drift Ignore:
# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  annotations:
    kustomize.toolkit.fluxcd.io/reconcile: disabled # Exclude from drift correction
spec:
  replicas: 3 # May be modified by HPA; FluxCD won't revert
Drift Detection Frequency¶
Drift Detection Intervals:
| Component | Detection Method | Frequency |
|---|---|---|
| GitRepository | Polls Git for changes | Every interval (default: 1m) |
| Kustomization | Compares Git state to cluster | Every interval (default: 10m) |
| Manual Trigger | Immediate comparison | On-demand via flux reconcile |
Optimized Drift Detection:
# Production: Less frequent drift checks
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
spec:
interval: 10m # Check for drift every 10 minutes
path: ./apps
sourceRef:
kind: GitRepository
name: atp-gitops
# Dev: More frequent drift checks
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-dev
spec:
interval: 1m # Check for drift every minute
path: ./apps
sourceRef:
kind: GitRepository
name: atp-gitops
Alerting on Drift¶
Drift Alert Configuration:
# alerts/drift-detection.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
name: drift-detection
namespace: flux-system
spec:
providerRef:
name: azure-monitor
eventSeverity: warning
eventSources:
- kind: Kustomization
name: apps-production
namespace: flux-system
exclusionList:
- ".* is ready"
- ".*applied revision.*"
Drift Alert with Notification:
# Notification provider for Azure Monitor
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
name: azure-monitor
namespace: flux-system
spec:
type: generic
address: https://api.loganalytics.io/v1/workspaces/{workspaceId}/events
secretRef:
name: azure-monitor-credentials
Query Drift Alerts:
// Query FluxCD drift alerts from Log Analytics
FluxCDEvents
| where EventType == "Warning"
| where Message contains "drift detected"
| project TimeGenerated, Kustomization, Message, Namespace
| order by TimeGenerated desc
Self-Healing Configuration¶
Automatic Revert of Manual Changes¶
Self-Healing Enabled (Default):
# Self-healing is enabled by default
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-ingestion
spec:
interval: 5m
path: ./apps/atp-ingestion
prune: true
# Self-healing: automatically reverts manual changes
sourceRef:
kind: GitRepository
name: atp-gitops
Self-Healing Behavior:
# 1. Manual change
kubectl patch deployment atp-ingestion -n atp-production \
-p '{"spec":{"replicas":5}}'
# 2. FluxCD detects drift (within 5 minutes)
# 3. FluxCD reverts to Git state (replicas: 3)
# 4. Deployment restored to desired state
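One way to observe the revert in real time (or force it instead of waiting out the interval):
# Watch FluxCD revert the manual scale-out
kubectl get deployment atp-ingestion -n atp-production -w
# Force an immediate reconciliation rather than waiting up to 5 minutes
flux reconcile kustomization atp-ingestion
# Confirm the replica count is back to the Git-declared value
kubectl get deployment atp-ingestion -n atp-production -o jsonpath='{.spec.replicas}'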
Disable Self-Healing for Specific Resource:
# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  annotations:
    # Exclude this resource from reconciliation (drift is not reverted)
    kustomize.toolkit.fluxcd.io/reconcile: disabled
spec:
  replicas: 3
Self-Heal Enable/Disable per Environment¶
Environment-Specific Self-Healing:
| Environment | Self-Healing | Rationale |
|---|---|---|
| Dev | ✅ Enabled | Fast feedback, automatic correction |
| Test | ✅ Enabled | Validate self-healing behavior |
| Staging | ⚠️ Selective | Enable for critical resources only |
| Production | ⚠️ Selective | Manual intervention preferred for critical changes |
Production: Selective Self-Healing:
# clusters/production/kustomization-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
spec:
interval: 10m
path: ./apps
prune: false # Manual pruning in production
# Self-healing enabled, but prune disabled for safety
sourceRef:
kind: GitRepository
name: atp-gitops
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: atp-gateway
namespace: atp-production
# Self-healing reverts manual changes to Gateway
Disable Self-Healing Globally:
# Suspend Kustomization (disables all reconciliation including self-healing)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
spec:
suspend: true # Temporarily disable all reconciliation
interval: 10m
path: ./apps
Force Recreation of Resources¶
Force Recreate on Drift:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  force: true # Delete and recreate resources when an immutable field changes
  sourceRef:
    kind: GitRepository
    name: atp-gitops
Force Recreation via Annotation:
# Request an immediate reconciliation (the same annotation the flux CLI sets)
kubectl annotate kustomization atp-ingestion -n flux-system \
  reconcile.fluxcd.io/requestedAt="$(date +%s)" \
  --overwrite
# With spec.force: true, the next reconciliation recreates conflicting resources
Preserving Stateful Resources¶
Protect Stateful Resources from Self-Healing:
# apps/atp-query/base/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: atp-query
  annotations:
    # Preserve manual changes to this StatefulSet (excluded from reconciliation)
    kustomize.toolkit.fluxcd.io/reconcile: disabled
spec:
replicas: 3
# ... other spec
Protect PVCs from Pruning:
# Flux has no pruneOptions field; exempt stateful resources individually:
# apps/atp-query/base/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: atp-query-data
  labels:
    kustomize.toolkit.fluxcd.io/prune: disabled # Survives garbage collection
Health Assessment¶
Built-in Health Checks (Deployment, StatefulSet)¶
Deployment Health Check:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-ingestion
spec:
interval: 5m
path: ./apps/atp-ingestion
wait: true # Wait for health checks to pass
timeout: 15m
sourceRef:
kind: GitRepository
name: atp-gitops
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
namespace: atp-production
# FluxCD checks:
# - Deployment status.availableReplicas == spec.replicas
# - All pods are Ready
StatefulSet Health Check:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-query
spec:
interval: 5m
path: ./apps/atp-query
wait: true
timeout: 20m # Longer timeout for StatefulSet
sourceRef:
kind: GitRepository
name: atp-gitops
healthChecks:
- apiVersion: apps/v1
kind: StatefulSet
name: atp-query
namespace: atp-production
# FluxCD checks:
# - StatefulSet status.readyReplicas == spec.replicas
# - All pods are Ready and in correct order
Multiple Health Checks:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-gateway
spec:
interval: 5m
path: ./apps/atp-gateway
wait: true
timeout: 15m
sourceRef:
kind: GitRepository
name: atp-gitops
healthChecks:
# Deployment health check
- apiVersion: apps/v1
kind: Deployment
name: atp-gateway
namespace: atp-production
# Service health check
- apiVersion: v1
kind: Service
name: atp-gateway
namespace: atp-production
# Ingress health check
- apiVersion: networking.k8s.io/v1
kind: Ingress
name: atp-gateway
namespace: atp-production
Custom Health Checks¶
Custom Health Check with CRD:
# Custom resources can be health-checked when their CRD exposes kstatus-compatible
# status conditions (the apiVersion/kind below are placeholders)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-ingestion
spec:
interval: 5m
path: ./apps/atp-ingestion
wait: true
timeout: 15m
sourceRef:
kind: GitRepository
name: atp-gitops
healthChecks:
- apiVersion: custom.health.check/v1
kind: HealthCheck
name: atp-ingestion-health
namespace: atp-production
Health Check Status:
# Check health check status
kubectl get kustomization atp-ingestion -n flux-system -o jsonpath='{.status.conditions}'
# Output:
# [
# {
# "type": "Ready",
# "status": "True",
# "reason": "HealthCheckPassed",
# "message": "all health checks passed",
# "lastTransitionTime": "2024-01-15T10:00:00Z"
# },
# {
# "type": "Healthy",
# "status": "True",
# "reason": "AllHealthChecksPassed",
# "message": "Deployment/atp-ingestion is healthy"
# }
# ]
Readiness Gates¶
Readiness Gate Configuration:
# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 3
template:
spec:
readinessGates:
- conditionType: PodHasNetwork
- conditionType: PodHasStorage
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
FluxCD Health Check with Readiness Gates:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-ingestion
spec:
interval: 5m
path: ./apps/atp-ingestion
wait: true
timeout: 20m # Longer timeout if readiness gates present
sourceRef:
kind: GitRepository
name: atp-gitops
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
namespace: atp-production
# FluxCD waits for:
# - Deployment ready
# - All pods Ready
# - All readiness gates conditions met
Timeout and Failure Thresholds¶
Health Check Timeout:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-ingestion
spec:
interval: 5m
path: ./apps/atp-ingestion
wait: true
timeout: 15m # Max time to wait for health checks
retryInterval: 2m # Retry failed health checks every 2 minutes
sourceRef:
kind: GitRepository
name: atp-gitops
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
namespace: atp-production
Health Check Failure Handling:
| Scenario | Behavior | Action |
|---|---|---|
| Health check passes | Reconciliation succeeds | Continue normal operation |
| Health check fails (transient) | Retry up to timeout | Retry every retryInterval |
| Health check fails (permanent) | Reconciliation marked failed | Alert and manual intervention |
Check Health Check Status:
# View health check failures
kubectl describe kustomization atp-ingestion -n flux-system
# Events show:
# Warning ReconciliationFailed health check failed:
# Deployment/atp-ingestion not ready: 2/3 replicas available
Prune Policies¶
Automatic Cleanup of Deleted Resources¶
Prune Enabled:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-dev
spec:
interval: 1m
path: ./apps
prune: true # Auto-delete resources removed from Git
sourceRef:
kind: GitRepository
name: atp-gitops
Prune Behavior:
# 1. Resource exists in Git and cluster
# apps/atp-old-service/deployment.yaml
# 2. Delete resource from Git
rm apps/atp-old-service/deployment.yaml
git commit -m "Remove old service"
git push
# 3. FluxCD detects resource removed from Git
# 4. FluxCD deletes resource from cluster (prune enabled)
# 5. Resource removed from cluster
Prune Safety (PVC, PV Protection)¶
Prune with Safety Labels:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  interval: 10m
  path: ./apps
  prune: true
  sourceRef:
    kind: GitRepository
    name: atp-gitops
# Protection is declared on the resources themselves (see below);
# Flux has no pruneOptions/keepLabels field
Protect PVCs from Pruning:
# apps/atp-query/base/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: atp-query-data
  labels:
    kustomize.toolkit.fluxcd.io/prune: disabled # Protected from garbage collection
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
Prune Exclusions:
# Any object carrying the label or annotation
#   kustomize.toolkit.fluxcd.io/prune: disabled
# is skipped by Flux's garbage collection regardless of the Kustomization's
# prune setting. Apply it to persistent data, backups, and resources owned by
# external operators.
Selective Pruning¶
Selective Prune by Namespace:
# Flux prunes only objects recorded in this Kustomization's inventory, so
# scoping what each Kustomization manages is the selective-prune mechanism
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  interval: 10m
  path: ./apps
  prune: true
  targetNamespace: atp-production # All managed objects placed in this namespace
  sourceRef:
    kind: GitRepository
    name: atp-gitops
Selective Prune by Resource Type:
# There is no resource-type prune filter in Flux; exempt individual kinds by
# labeling them in the base manifests instead:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: atp-query-data
  labels:
    kustomize.toolkit.fluxcd.io/prune: disabled # Exempt from garbage collection
Prune Validation¶
Dry-Run Prune:
# Preview changes, including objects that would be pruned, without applying
flux diff kustomization apps-production --path ./apps
# Output is a server-side dry-run diff of creates, updates, and deletions
Prune Status:
# Check prune status
kubectl get kustomization apps-production -n flux-system -o jsonpath='{.status.inventory}'
# Output:
# {
# "entries": [
# {"id": "apps_v1_Deployment_atp-production_atp-ingestion", "v": "v1"},
# {"id": "v1_Service_atp-production_atp-ingestion", "v": "v1"}
# ]
# }
# Resources not in inventory but in cluster will be pruned
Sync Ordering and Dependencies¶
Depends-On in Kustomization¶
Dependency Chain:
# 1. Infrastructure (base)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: infrastructure
namespace: flux-system
spec:
interval: 5m
path: ./infrastructure
sourceRef:
kind: GitRepository
name: atp-gitops
# No dependencies
---
# 2. Secrets (depends on infrastructure)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: secrets
namespace: flux-system
spec:
interval: 5m
path: ./secrets
sourceRef:
kind: GitRepository
name: atp-gitops
dependsOn:
- name: infrastructure # Wait for infrastructure first
---
# 3. Apps (depends on infrastructure and secrets)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
namespace: flux-system
spec:
interval: 10m
path: ./apps
sourceRef:
kind: GitRepository
name: atp-gitops
dependsOn:
- name: infrastructure
- name: secrets # Wait for both
Dependency Graph:
graph TD
A[Infrastructure] --> B[Secrets]
A --> C[Apps]
B --> C
C --> D[atp-ingestion]
C --> E[atp-query]
C --> F[atp-gateway]
style A fill:#90EE90
style B fill:#FFE5B4
style C fill:#FFE5B4
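Recent flux CLI versions can render the applied object tree, which is a quick way to confirm this dependency chain actually materialized (output abbreviated and illustrative):
flux tree kustomization apps-production
# Kustomization/flux-system/apps-production
# ├── Deployment/atp-production/atp-ingestion
# ├── Deployment/atp-production/atp-query
# └── Deployment/atp-production/atp-gateway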
Ordering Infrastructure Before Apps¶
Infrastructure First:
# Infrastructure Kustomization (no dependencies)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: infrastructure
namespace: flux-system
spec:
interval: 5m
path: ./infrastructure
prune: false # Don't auto-prune infrastructure
sourceRef:
kind: GitRepository
name: atp-gitops
Apps After Infrastructure:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
namespace: flux-system
spec:
interval: 10m
path: ./apps
sourceRef:
kind: GitRepository
name: atp-gitops
dependsOn:
- name: infrastructure # Ensure infrastructure ready first
Cross-Resource Dependencies¶
Service Dependencies:
# atp-query depends on atp-ingestion
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: atp-query
namespace: flux-system
spec:
interval: 5m
path: ./apps/atp-query
sourceRef:
kind: GitRepository
name: atp-gitops
dependsOn:
- name: atp-ingestion # Wait for ingestion service
Cross-Namespace Dependencies:
# Apps in production namespace depend on monitoring in monitoring namespace
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
namespace: flux-system
spec:
interval: 10m
path: ./apps
sourceRef:
kind: GitRepository
name: atp-gitops
dependsOn:
- name: monitoring # Wait for monitoring stack
healthChecks:
- apiVersion: v1
kind: Service
name: prometheus
namespace: monitoring # Cross-namespace health check
Wait for Readiness¶
Wait for Dependencies to be Ready:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
namespace: flux-system
spec:
interval: 10m
path: ./apps
wait: true # Wait for resources to be ready
timeout: 20m
sourceRef:
kind: GitRepository
name: atp-gitops
dependsOn:
- name: infrastructure
# Waits for:
# 1. Infrastructure Kustomization to be ready
# 2. All health checks in infrastructure to pass
# 3. Then proceeds with apps reconciliation
Dependency Readiness Check:
# Check dependency status
flux get kustomizations apps-production
# Shows dependency status:
# NAME READY MESSAGE REVISION
# infrastructure True Applied revision: abc123 abc123
# apps-production True Applied revision: def456 def456
# If dependency not ready:
# apps-production False dependency 'infrastructure' is not ready
Notification Controller¶
Sending Alerts to Azure Monitor¶
Azure Monitor Provider:
# notification/provider-azure-monitor.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
name: azure-monitor
namespace: flux-system
spec:
type: generic
address: https://api.loganalytics.io/v1/workspaces/{workspaceId}/events
secretRef:
name: azure-monitor-credentials
---
# Secret with Azure Monitor credentials
apiVersion: v1
kind: Secret
metadata:
name: azure-monitor-credentials
namespace: flux-system
type: Opaque
stringData:
token: "{workspace-key}" # Log Analytics workspace key
Alert Configuration:
# notification/alert-reconciliation.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
name: reconciliation-alerts
namespace: flux-system
spec:
providerRef:
name: azure-monitor
eventSeverity: info
eventSources:
- kind: Kustomization
name: apps-production
namespace: flux-system
- kind: GitRepository
name: atp-gitops
namespace: flux-system
Slack/Teams Integration¶
Slack Provider:
# notification/provider-slack.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
name: slack
namespace: flux-system
spec:
type: slack
channel: "#atp-alerts"
username: fluxcd
secretRef:
name: slack-credentials
---
apiVersion: v1
kind: Secret
metadata:
name: slack-credentials
namespace: flux-system
type: Opaque
stringData:
address: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
Teams Provider:
# notification/provider-teams.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
name: teams
namespace: flux-system
spec:
  type: msteams # Native Teams message formatting
address: "https://outlook.office.com/webhook/YOUR/WEBHOOK/URL"
secretRef:
name: teams-credentials
Alert for Slack/Teams:
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
name: production-alerts
namespace: flux-system
spec:
providerRef:
name: slack # or teams
eventSeverity: error
eventSources:
- kind: Kustomization
name: apps-production
namespace: flux-system
exclusionList:
- ".* is ready"
- ".*applied revision.*"
summary: "Production deployment {{ .InvolvedObject.kind }} {{ .InvolvedObject.name }}"
Email Notifications¶
Email Provider:
# notification/provider-email.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
name: email
namespace: flux-system
spec:
  # FluxCD's notification-controller has no native SMTP provider; route events
  # through an HTTPS webhook-to-email bridge (e.g., an Azure Logic App)
  type: generic
  address: "https://{logic-app-email-bridge-endpoint}" # hypothetical bridge URL
  secretRef:
    name: email-credentials
---
apiVersion: v1
kind: Secret
metadata:
  name: email-credentials
  namespace: flux-system
type: Opaque
stringData:
  token: "{bridge-auth-token}"
Email Alert:
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
name: critical-alerts-email
namespace: flux-system
spec:
providerRef:
name: email
eventSeverity: error
eventSources:
- kind: Kustomization
name: apps-production
namespace: flux-system
# Only send critical errors via email
Custom Webhooks¶
Webhook Provider:
# notification/provider-webhook.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
name: custom-webhook
namespace: flux-system
spec:
type: generic
address: "https://api.connectsoft.example/fluxcd/webhook"
secretRef:
name: webhook-credentials
---
apiVersion: v1
kind: Secret
metadata:
name: webhook-credentials
namespace: flux-system
type: Opaque
stringData:
token: "{webhook-token}"
Webhook Alert:
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
name: webhook-alerts
namespace: flux-system
spec:
providerRef:
name: custom-webhook
eventSeverity: info
eventSources:
- kind: Kustomization
name: apps-production
namespace: flux-system
Handling Stuck Reconciliations¶
Identifying Stuck Reconciliations¶
Check Reconciliation Status:
# Check if Kustomization is stuck
flux get kustomizations apps-production
# Stuck indicators:
# - READY: False for extended period
# - MESSAGE: Contains "error" or "failed"
# - No recent status updates
# Detailed status
kubectl describe kustomization apps-production -n flux-system
# Check events for stuck reconciliation
kubectl get events -n flux-system \
--field-selector involvedObject.name=apps-production \
--sort-by='.lastTimestamp'
Common Stuck Scenarios:
| Scenario | Symptom | Resolution |
|---|---|---|
| Git fetch error | MESSAGE: git fetch failed | Check Git credentials, network |
| Apply timeout | MESSAGE: apply timeout | Increase timeout, check resource complexity |
| Health check failure | MESSAGE: health check failed | Fix failing resource, disable health check |
| Dependency stuck | MESSAGE: dependency not ready | Resolve dependency issue |
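A triage pass usually works through these scenarios in order; a minimal script sketch combining the commands above:
#!/usr/bin/env bash
# Quick triage for a stuck Kustomization (name and namespace assumed)
NAME=apps-production
NS=flux-system
flux get kustomizations "$NAME"                       # high-level status
flux get source git atp-gitops                        # rule out Git fetch errors
kubectl get events -n "$NS" \
  --field-selector involvedObject.name="$NAME" \
  --sort-by='.lastTimestamp' | tail -n 20             # recent reconciliation events
kubectl logs -n "$NS" -l app=kustomize-controller --tail=50 | grep "$NAME"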
Suspending and Resuming¶
Suspend Reconciliation:
# Suspend to stop reconciliation
flux suspend kustomization apps-production
# Or via kubectl
kubectl patch kustomization apps-production -n flux-system \
-p '{"spec":{"suspend":true}}'
Resume Reconciliation:
# Resume reconciliation
flux resume kustomization apps-production
# Or via kubectl
kubectl patch kustomization apps-production -n flux-system \
-p '{"spec":{"suspend":false}}'
Suspend Declaratively (in Git):
# clusters/production/kustomization-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  suspend: true # spec.suspend is the supported mechanism; there is no suspend annotation
  interval: 10m
  path: ./apps
Force Reconciliation¶
Force Reconciliation:
# Force immediate reconciliation
flux reconcile kustomization apps-production --with-source
# Force with source update
flux reconcile source git atp-gitops
flux reconcile kustomization apps-production
# Force all reconciliations
flux reconcile kustomization --all
Force via Annotation:
# Add annotation to force reconciliation
kubectl annotate kustomization apps-production -n flux-system \
reconcile.fluxcd.io/requestedAt="$(date +%s)" \
--overwrite
Debugging Techniques¶
Enable Verbose Logging:
# Check FluxCD controller logs
kubectl logs -n flux-system \
-l app=kustomize-controller \
--tail=100
# Follow logs in real-time
kubectl logs -n flux-system \
-l app=kustomize-controller \
-f
# Filter for specific Kustomization
kubectl logs -n flux-system \
-l app=kustomize-controller \
| grep "apps-production"
Debug Commands:
# Check GitRepository status
flux get source git atp-gitops
# Check Kustomization status
flux get kustomizations apps-production
# Check events
kubectl get events -n flux-system \
--field-selector involvedObject.name=apps-production
# Check resource status
kubectl get deployment atp-ingestion -n atp-production -o yaml
Dry-Run Reconciliation:
# Preview what a reconciliation would change, without applying anything
flux diff kustomization apps-production --path ./apps
# Shows a server-side dry-run diff of creates, updates, and deletions
Observability¶
FluxCD Metrics in Prometheus¶
Enable Metrics:
# FluxCD automatically exposes Prometheus metrics
# Metrics endpoint: http://kustomize-controller:8080/metrics
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: fluxcd-kustomize-controller
namespace: flux-system
spec:
selector:
matchLabels:
app: kustomize-controller
endpoints:
- port: http-prom
interval: 30s
path: /metrics
Key Metrics:
| Metric | Description |
|---|---|
| controller_runtime_reconcile_total | Total reconciliations per controller |
| controller_runtime_reconcile_errors_total | Reconciliation errors per controller |
| gotk_reconcile_duration_seconds | Reconciliation duration per Flux object (histogram) |
| gotk_reconcile_condition | Readiness gauge per Flux object (1 = condition holds) |
(Metric names follow Flux v2's controller-runtime and gotk_ conventions; verify them against your installed Flux version.)
Prometheus Query Examples:
# Reconciliation success rate (kustomize-controller)
sum(rate(controller_runtime_reconcile_total{controller="kustomization",result="success"}[5m]))
/
sum(rate(controller_runtime_reconcile_total{controller="kustomization"}[5m]))
# Average reconciliation duration
sum(rate(gotk_reconcile_duration_seconds_sum[5m]))
/
sum(rate(gotk_reconcile_duration_seconds_count[5m]))
# Error rate
sum(rate(controller_runtime_reconcile_errors_total{controller="kustomization"}[5m]))
Grafana Dashboards¶
Grafana Dashboard JSON:
{
"dashboard": {
"title": "FluxCD Reconciliation",
"panels": [
{
"title": "Reconciliation Success Rate",
"targets": [{
"expr": "sum(rate(fluxcd_kustomize_reconciliation_total{status=\"success\"}[5m])) / sum(rate(fluxcd_kustomize_reconciliation_total[5m]))"
}]
},
{
"title": "Reconciliation Duration",
"targets": [{
"expr": "avg(fluxcd_kustomize_reconciliation_duration_seconds)"
}]
},
{
"title": "Reconciliation Errors",
"targets": [{
"expr": "sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m]))"
}]
}
]
}
}
Log Forwarding to Log Analytics¶
Fluent Bit Configuration:
# platform/logging/fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: logging
data:
fluent-bit.conf: |
[INPUT]
Name tail
Path /var/log/containers/kustomize-controller-*.log
Parser docker
Tag kube.fluxcd.*
Refresh_Interval 5
[FILTER]
Name kubernetes
Match kube.fluxcd.*
Merge_Log On
[OUTPUT]
Name azure
Match kube.fluxcd.*
Workspace_ID {workspace-id}
Shared_Key {workspace-key}
Log_Type FluxCD
Log Analytics Query:
// Query FluxCD logs
FluxCDLogs_CL
| where ContainerName_s contains "kustomize-controller"
| where LogMessage_s contains "reconciliation"
| project TimeGenerated, ContainerName_s, LogMessage_s
| order by TimeGenerated desc
Reconciliation Duration and Success Rate¶
Success Rate Monitoring:
# Overall success rate (kustomize-controller)
sum(rate(controller_runtime_reconcile_total{controller="kustomization",result="success"}[5m]))
/
sum(rate(controller_runtime_reconcile_total{controller="kustomization"}[5m]))
# Per-object readiness (1 = Ready) via Flux's condition gauge
avg by (name) (
  gotk_reconcile_condition{kind="Kustomization",type="Ready",status="True"}
)
Duration Monitoring:
# Average reconciliation duration
sum(rate(gotk_reconcile_duration_seconds_sum[5m]))
/
sum(rate(gotk_reconcile_duration_seconds_count[5m]))
# P95 reconciliation duration
histogram_quantile(0.95,
  sum by (le) (rate(gotk_reconcile_duration_seconds_bucket[5m]))
)
# Per-object average duration
sum by (name) (rate(gotk_reconcile_duration_seconds_sum[5m]))
/
sum by (name) (rate(gotk_reconcile_duration_seconds_count[5m]))
Alert on High Error Rate:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: fluxcd-reconciliation-alerts
namespace: monitoring
spec:
groups:
- name: fluxcd
rules:
- alert: FluxCDHighErrorRate
expr: |
sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m])) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "FluxCD reconciliation error rate is high"
description: "{{ $value }} errors per second"
- alert: FluxCDReconciliationSlow
expr: |
avg(fluxcd_kustomize_reconciliation_duration_seconds) > 300
for: 10m
labels:
severity: warning
annotations:
summary: "FluxCD reconciliations are taking longer than expected"
description: "Average duration: {{ $value }}s"
Summary: FluxCD Continuous Reconciliation¶
- Reconciliation Loop: Polling intervals, reconciliation triggers, retry strategies and backoff
- Automated Sync Policies: Auto-sync for dev/test, manual sync for staging/production, sync options, per-resource sync configuration
- Drift Detection: Comparing Git state to live cluster, drift types, detection frequency, alerting on drift
- Self-Healing: Automatic revert of manual changes, enable/disable per environment, force recreation, preserving stateful resources
- Health Assessment: Built-in health checks, custom health checks, readiness gates, timeout and failure thresholds
- Prune Policies: Automatic cleanup, prune safety (PVC/PV protection), selective pruning, prune validation
- Sync Ordering: depends-on in Kustomization, infrastructure before apps, cross-resource dependencies, wait for readiness
- Notification Controller: Azure Monitor alerts, Slack/Teams integration, email notifications, custom webhooks
- Handling Stuck Reconciliations: Identifying stuck reconciliations, suspending and resuming, force reconciliation, debugging techniques
- Observability: FluxCD metrics in Prometheus, Grafana dashboards, log forwarding to Log Analytics, reconciliation duration and success rate monitoring
Multi-Environment AKS Deployment¶
Purpose: Define how ATP is deployed across multiple environments (dev, test, staging, production) using separate AKS clusters, environment-specific configurations, Kustomize overlays, Helm values, and FluxCD per-environment reconciliation, ensuring proper isolation, resource management, and multi-region high availability.
Environment-Specific AKS Clusters¶
Separate Clusters per Environment Rationale¶
Multi-Cluster Architecture:
graph TB
subgraph "Production Subscription"
PROD[Production AKS<br/>East US]
STAGING[Staging AKS<br/>East US]
end
subgraph "Non-Prod Subscription"
TEST[Test AKS<br/>East US]
DEV[Dev AKS<br/>East US]
end
subgraph "Production Subscription - DR"
PROD_DR[Production AKS<br/>West Europe]
end
PROD -->|Traffic| PROD_DR
STAGING -->|Validate| PROD
style PROD fill:#90EE90
style PROD_DR fill:#90EE90
style STAGING fill:#FFE5B4
style TEST fill:#FFE5B4
style DEV fill:#FFE5B4
Rationale for Separate Clusters:
| Aspect | Separate Clusters | Shared Cluster | ATP Decision |
|---|---|---|---|
| Isolation | ✅ Complete isolation | ⚠️ Namespace-level only | ✅ Separate Clusters |
| Security | ✅ Environment boundaries | ⚠️ Shared RBAC/network | ✅ Separate Clusters |
| Resource Management | ✅ Independent scaling | ⚠️ Shared resources | ✅ Separate Clusters |
| Cost | ❌ Higher (multiple clusters) | ✅ Lower (single cluster) | ✅ Separate Clusters (security/compliance priority) |
| Operational Complexity | ⚠️ More clusters to manage | ✅ Simpler | ✅ Separate Clusters (acceptable trade-off) |
| Blast Radius | ✅ Isolated failures | ❌ Cross-environment impact | ✅ Separate Clusters |
ATP Selection: Separate Clusters
Reasons:
- ✅ Compliance: SOC 2 requires production isolation
- ✅ Security: No risk of dev/test workloads accessing production resources
- ✅ Resource Isolation: Production resources guaranteed, not shared
- ✅ Independent Scaling: Each environment scales independently
- ✅ Rollback Safety: Production cluster unaffected by dev/test issues
Cluster Sizing and SKU Selection¶
Environment-Specific Cluster Sizing:
| Environment | Node Pool SKU | Node Count | CPU/Memory per Node | Total Capacity | Rationale |
|---|---|---|---|---|---|
| Dev | Standard_D4s_v3 | 2-3 nodes | 4 vCPU / 16 GB | 8-12 vCPU / 32-48 GB | Minimal resources, cost-effective |
| Test | Standard_D4s_v3 | 3-5 nodes | 4 vCPU / 16 GB | 12-20 vCPU / 48-80 GB | Integration testing needs |
| Staging | Standard_D8s_v3 | 5-10 nodes | 8 vCPU / 32 GB | 40-80 vCPU / 160-320 GB | Production-like capacity |
| Production | Standard_D16s_v3 | 10-20 nodes | 16 vCPU / 64 GB | 160-320 vCPU / 640-1280 GB | High availability, performance |
Production Cluster Configuration:
// infrastructure/AKS-Production.cs
public class AKSProduction
{
public ContainerService.KubernetesCluster Cluster { get; }
public AKSProduction(Pulumi.Stack stack, string location)
{
this.Cluster = new ContainerService.KubernetesCluster("atp-prod-eus-aks", new()
{
ResourceGroupName = "atp-production-rg",
Location = location, // "eastus"
DnsPrefix = "atp-prod-eus",
DefaultNodePool = new ContainerService.Inputs.KubernetesClusterDefaultNodePoolArgs
{
Name = "system",
NodeCount = 3,
VmSize = "Standard_D16s_v3",
OsDiskSizeGb = 256,
OsDiskType = "Ephemeral",
Type = "VirtualMachineScaleSets",
EnableAutoScaling = true,
MinCount = 3,
MaxCount = 5,
MaxPods = 110,
NodeTaints = new[]
{
"CriticalAddonsOnly=true:NoSchedule"
},
},
// User node pools for workloads
NodeResourceGroup = "atp-prod-eus-aks-nodes",
KubernetesVersion = "1.28.0",
NetworkProfile = new ContainerService.Inputs.KubernetesClusterNetworkProfileArgs
{
NetworkPlugin = "azure",
NetworkPolicy = "azure",
ServiceCidr = "10.0.1.0/24",
DnsServiceIp = "10.0.1.10",
DockerBridgeCidr = "172.17.0.1/16",
},
Identity = new ContainerService.Inputs.KubernetesClusterIdentityArgs
{
Type = "UserAssigned",
IdentityIds = new[] { managedIdentity.Id }, // user-assigned identity created elsewhere in the stack
},
AzurePolicyEnabled = true,
HttpApplicationRoutingEnabled = false,
RoleBasedAccessControlEnabled = true,
AzureRbacEnabled = true,
PrivateClusterEnabled = true,
ApiServerAuthorizedIpRanges = new[]
{
"10.0.0.0/16", // VNet CIDR
},
Tags = new()
{
{ "Environment", "production" },
{ "CostCenter", "ATP-Production" },
{ "Compliance", "SOC2" },
},
});
}
}
Networking Configuration per Environment¶
Environment Network Isolation:
| Environment | VNet | Subnet | Private Endpoints | Network Policies | Rationale |
|---|---|---|---|---|---|
| Dev | atp-dev-vnet | atp-dev-subnet | ❌ Disabled | ⚠️ Baseline | Cost optimization |
| Test | atp-test-vnet | atp-test-subnet | ❌ Disabled | ✅ Enforced | Test network policies |
| Staging | atp-staging-vnet | atp-staging-subnet | ✅ Enabled | ✅ Enforced | Production-like |
| Production | atp-prod-vnet | atp-prod-subnet | ✅ Enabled | ✅ Enforced | Maximum security |
Production Network Configuration:
// Production VNet with private endpoints
var prodVNet = new Network.VirtualNetwork("atp-prod-vnet", new()
{
ResourceGroupName = "atp-production-rg",
Location = "eastus",
AddressSpace = new[] { "10.1.0.0/16" },
Subnets = new[]
{
new Network.Inputs.SubnetArgs
{
Name = "atp-prod-aks-subnet",
AddressPrefix = "10.1.1.0/24",
PrivateEndpointNetworkPoliciesEnabled = true,
},
new Network.Inputs.SubnetArgs
{
Name = "atp-prod-private-endpoints",
AddressPrefix = "10.1.2.0/24",
PrivateEndpointNetworkPoliciesEnabled = false,
},
},
});
Subscription Strategy (Shared vs Dedicated)¶
ATP Subscription Strategy:
| Environment | Subscription | Rationale |
|---|---|---|
| Dev | ATP-NonProd | Cost optimization, shared resources |
| Test | ATP-NonProd | Cost optimization, shared resources |
| Staging | ATP-Production | Production-like isolation, compliance |
| Production (East US) | ATP-Production | Production isolation, compliance |
| Production (West Europe) | ATP-Production | DR region, same subscription |
Subscription Configuration:
# List subscriptions
az account list --output table
# Set production subscription
az account set --subscription "ATP-Production"
# Set non-production subscription
az account set --subscription "ATP-NonProd"
Regional Deployment Strategy¶
Primary Region: East US¶
Primary Region Configuration:
// Primary region: East US
var primaryRegion = new AKSCluster("atp-prod-eus-aks", new()
{
Location = "eastus",
ResourceGroupName = "atp-production-rg",
Environment = "production",
ClusterSku = "Standard",
NodePools = new[]
{
new NodePoolConfig
{
Name = "system",
VmSize = "Standard_D16s_v3",
MinCount = 3,
MaxCount = 5,
},
new NodePoolConfig
{
Name = "user",
VmSize = "Standard_D16s_v3",
MinCount = 10,
MaxCount = 20,
},
},
});
Primary Region Resources:
- ✅ Production AKS cluster
- ✅ Azure SQL Database (Primary)
- ✅ Azure Redis Cache
- ✅ Azure Service Bus
- ✅ Azure Key Vault
- ✅ Azure Container Registry (geo-replicated)
Secondary Region: West Europe¶
Secondary Region (DR) Configuration:
// Secondary region: West Europe (DR)
var secondaryRegion = new AKSCluster("atp-prod-weu-aks", new()
{
Location = "westeurope",
ResourceGroupName = "atp-production-rg",
Environment = "production",
ClusterSku = "Standard",
NodePools = new[]
{
new NodePoolConfig
{
Name = "system",
VmSize = "Standard_D16s_v3",
MinCount = 2,
MaxCount = 3,
},
new NodePoolConfig
{
Name = "user",
VmSize = "Standard_D16s_v3",
MinCount = 5,
MaxCount = 10,
},
},
});
Secondary Region Resources:
- ✅ Production AKS cluster (standby/DR)
- ✅ Azure SQL Database (Geo-replica)
- ✅ Azure Redis Cache (Geo-replica)
- ✅ Azure Service Bus (DR namespace)
- ✅ Azure Key Vault (Geo-replicated)
- ✅ Azure Container Registry (geo-replicated)
Multi-Region for Production (HA/DR)¶
Multi-Region Architecture:
graph TB
subgraph "East US (Primary)"
PROD_EUS[Production AKS<br/>East US]
SQL_EUS[SQL Primary]
REDIS_EUS[Redis Primary]
end
subgraph "West Europe (DR)"
PROD_WEU[Production AKS<br/>West Europe<br/>Standby]
SQL_WEU[SQL Geo-Replica]
REDIS_WEU[Redis Geo-Replica]
end
subgraph "Traffic Management"
FD[Azure Front Door]
end
FD -->|Primary| PROD_EUS
FD -.->|Failover| PROD_WEU
SQL_EUS -.->|Replication| SQL_WEU
REDIS_EUS -.->|Replication| REDIS_WEU
style PROD_EUS fill:#90EE90
style PROD_WEU fill:#FFE5B4
style FD fill:#FFE5B4
Multi-Region RTO/RPO Targets:
| Component | RTO | RPO | Strategy |
|---|---|---|---|
| AKS Cluster | 1 hour | 5 minutes | GitOps-based recreation |
| SQL Database | 5 minutes | < 1 minute | Active geo-replication |
| Redis Cache | 15 minutes | < 1 minute | Geo-replication |
| Application | 5 minutes | < 1 minute | Traffic failover via Front Door |
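Failover drills validate these targets; for the SQL tier, a hedged sketch using the Azure CLI (failover group and server names hypothetical):
# Promote the West Europe replica to primary (names are hypothetical)
az sql failover-group set-primary \
  --name atp-prod-fog \
  --resource-group atp-production-rg \
  --server atp-sql-weu
# Verify the replication role after failover
az sql failover-group show \
  --name atp-prod-fog \
  --resource-group atp-production-rg \
  --server atp-sql-weu \
  --query replicationRole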
Regional Failover Mechanisms¶
Azure Front Door Failover:
# infrastructure/azure-front-door.yaml
apiVersion: networking.azure.com/v1
kind: FrontDoor
metadata:
name: atp-frontdoor
spec:
backendPools:
- name: primary-eus
backends:
- address: atp-prod-eus-aks.region.cloudapp.azure.com
enabled: true
priority: 1
weight: 100
healthProbe:
path: /health
protocol: Http
interval: 30
- name: secondary-weu
backends:
- address: atp-prod-weu-aks.region.cloudapp.azure.com
enabled: true
priority: 2
weight: 0
healthProbe:
path: /health
protocol: Http
interval: 30
routingRules:
- name: failover-rule
acceptedProtocols:
- Http
- Https
patternsToMatch:
- "/*"
routeConfiguration:
'@odata.type': "#Microsoft.Azure.FrontDoor.Models.FrontdoorForwardingConfiguration"
forwardingProtocol: MatchRequest
backendPool:
id: /subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.Network/frontDoors/atp-frontdoor/backendPools/primary-eus
Kustomize Overlays Per Environment¶
Base Manifests (Common)¶
Base Structure:
apps/
├── atp-ingestion/
│ ├── base/
│ │ ├── deployment.yaml
│ │ ├── service.yaml
│ │ ├── configmap.yaml
│ │ └── kustomization.yaml
│ ├── overlays/
│ │ ├── dev/
│ │ │ └── kustomization.yaml
│ │ ├── test/
│ │ │ └── kustomization.yaml
│ │ ├── staging/
│ │ │ └── kustomization.yaml
│ │ └── production/
│ │ └── kustomization.yaml
Base Kustomization:
# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: atp-production
resources:
- deployment.yaml
- service.yaml
- configmap.yaml
commonLabels:
app: atp-ingestion
managed-by: fluxcd
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: v1.2.3-abc123d
Dev Overlay (Minimal Resources, Debug Logging)¶
Dev Overlay Configuration:
# apps/atp-ingestion/overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: atp-dev
resources:
- ../../base
patchesStrategicMerge:
- deployment-patch.yaml
- configmap-patch.yaml
commonLabels:
environment: dev
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: latest # Dev uses latest images
Dev Deployment Patch:
# apps/atp-ingestion/overlays/dev/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 1 # Single replica for dev
template:
spec:
containers:
- name: atp-ingestion
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
env:
- name: ASPNETCORE_ENVIRONMENT
value: "Development"
- name: Logging__LogLevel__Default
value: "Debug"
- name: Logging__LogLevel__Microsoft
value: "Debug"
Dev ConfigMap Patch:
# apps/atp-ingestion/overlays/dev/configmap-patch.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: atp-ingestion-config
data:
telemetry:sampling: "100" # 100% sampling in dev
feature-flags:enable-debug-mode: "true"
feature-flags:enable-profiling: "true"
Test Overlay (Moderate Resources, Integration Tests)¶
Test Overlay Configuration:
# apps/atp-ingestion/overlays/test/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: atp-test
resources:
- ../../base
patchesStrategicMerge:
- deployment-patch.yaml
commonLabels:
environment: test
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: v1.2.3 # Test uses tagged releases
Test Deployment Patch:
# apps/atp-ingestion/overlays/test/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 2 # Two replicas for test
template:
spec:
containers:
- name: atp-ingestion
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
env:
- name: ASPNETCORE_ENVIRONMENT
value: "Test"
- name: Logging__LogLevel__Default
value: "Information"
Staging Overlay (Production-Like)¶
Staging Overlay Configuration:
# apps/atp-ingestion/overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: atp-staging
resources:
- ../../base
patchesStrategicMerge:
- deployment-patch.yaml
commonLabels:
environment: staging
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: v1.2.3 # Staging uses production-ready tags
Staging Deployment Patch:
# apps/atp-ingestion/overlays/staging/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 3 # Production-like replica count
template:
spec:
containers:
- name: atp-ingestion
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 2000m
memory: 2Gi
env:
- name: ASPNETCORE_ENVIRONMENT
value: "Staging"
- name: Logging__LogLevel__Default
value: "Warning"
- name: telemetry:sampling
valueFrom:
configMapKeyRef:
name: atp-ingestion-config
key: telemetry:sampling
Production Overlay (Optimized, Strict Policies)¶
Production Overlay Configuration:
# apps/atp-ingestion/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: atp-production
resources:
- ../../base
patchesStrategicMerge:
- deployment-patch.yaml
- network-policy-patch.yaml
commonLabels:
environment: production
compliance: soc2
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: v1.2.3-abc123d # Production uses immutable tags
Production Deployment Patch:
# apps/atp-ingestion/overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 5 # High availability
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
containers:
- name: atp-ingestion
resources:
requests:
cpu: 1000m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
env:
- name: ASPNETCORE_ENVIRONMENT
value: "Production"
- name: Logging__LogLevel__Default
value: "Error" # Minimal logging in production
- name: telemetry:sampling
value: "10" # 10% sampling
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
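Overlays can be rendered and compared locally before committing, using the same directory layout shown above:
# Render an overlay exactly as FluxCD's kustomize-controller would build it
kustomize build apps/atp-ingestion/overlays/production > /tmp/prod.yaml
# Compare environments to review exactly what differs
kustomize build apps/atp-ingestion/overlays/staging > /tmp/staging.yaml
diff /tmp/staging.yaml /tmp/prod.yaml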
Helm Values Files Per Environment¶
values-dev.yaml¶
Dev Helm Values:
# charts/atp-ingestion/values-dev.yaml
replicaCount: 1
image:
repository: connectsoft.azurecr.io/atp/ingestion
tag: latest
pullPolicy: Always
serviceAccount:
create: true
annotations:
azure.workload.identity/client-id: "{dev-workload-identity-id}"
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
autoscaling:
enabled: false # No autoscaling in dev
environment:
name: Development
logging:
level: Debug
telemetry:
sampling: 100 # 100% sampling
featureFlags:
enableDebugMode: true
enableProfiling: true
config:
database:
connectionString: "{dev-sql-connection-string}"
redis:
connectionString: "{dev-redis-connection-string}"
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-staging # Staging certs in dev
hosts:
- host: atp-ingestion-dev.connectsoft.example
paths:
- path: /
pathType: Prefix
tls:
- secretName: atp-ingestion-dev-tls
hosts:
- atp-ingestion-dev.connectsoft.example
values-test.yaml¶
Test Helm Values:
# charts/atp-ingestion/values-test.yaml
replicaCount: 2
image:
repository: connectsoft.azurecr.io/atp/ingestion
tag: v1.2.3
pullPolicy: IfNotPresent
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 4
targetCPUUtilizationPercentage: 70
environment:
name: Test
logging:
level: Information
telemetry:
sampling: 50 # 50% sampling
config:
database:
connectionString: "{test-sql-connection-string}"
values-staging.yaml¶
Staging Helm Values:
# charts/atp-ingestion/values-staging.yaml
replicaCount: 3
image:
repository: connectsoft.azurecr.io/atp/ingestion
tag: v1.2.3
pullPolicy: IfNotPresent
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 2000m
memory: 2Gi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 6
targetCPUUtilizationPercentage: 70
environment:
name: Staging
logging:
level: Warning
telemetry:
sampling: 25 # 25% sampling
config:
database:
connectionString: "{staging-sql-connection-string}"
values-production.yaml¶
Production Helm Values:
# charts/atp-ingestion/values-production.yaml
replicaCount: 5
image:
repository: connectsoft.azurecr.io/atp/ingestion
tag: v1.2.3-abc123d # Immutable tag
pullPolicy: IfNotPresent
serviceAccount:
create: true
annotations:
azure.workload.identity/client-id: "{prod-workload-identity-id}"
resources:
requests:
cpu: 1000m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
autoscaling:
enabled: true
minReplicas: 5
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
podSecurityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
environment:
name: Production
logging:
level: Error # Minimal logging
telemetry:
sampling: 10 # 10% sampling
featureFlags:
enableDebugMode: false
enableProfiling: false
config:
database:
connectionString: "{prod-sql-connection-string}"
redis:
connectionString: "{prod-redis-connection-string}"
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/rate-limit: "100"
hosts:
- host: atp-ingestion.connectsoft.example
paths:
- path: /
pathType: Prefix
tls:
- secretName: atp-ingestion-tls
hosts:
- atp-ingestion.connectsoft.example
networkPolicy:
enabled: true
ingress:
- from:
- podSelector:
matchLabels:
app: atp-gateway
podDisruptionBudget:
enabled: true
minAvailable: 3
Value Precedence and Overrides¶
Helm Value Precedence (Highest to Lowest):
1. --set command-line flags (highest)
2. values-production.yaml (or other environment-specific values file)
3. values.yaml (base/default values, lowest)
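Precedence can be verified locally with helm template before any real upgrade; a minimal sketch:
# Render with layered values; the --set override should win for replicaCount
helm template atp-ingestion ./charts/atp-ingestion \
  -f charts/atp-ingestion/values.yaml \
  -f charts/atp-ingestion/values-production.yaml \
  --set replicaCount=10 \
  | grep -B2 -A1 "replicas:"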
Deploy with Environment-Specific Values:
# Deploy to dev
helm upgrade --install atp-ingestion ./charts/atp-ingestion \
-f charts/atp-ingestion/values.yaml \
-f charts/atp-ingestion/values-dev.yaml \
-n atp-dev
# Deploy to production
helm upgrade --install atp-ingestion ./charts/atp-ingestion \
-f charts/atp-ingestion/values.yaml \
-f charts/atp-ingestion/values-production.yaml \
-n atp-production
# Override specific value
helm upgrade --install atp-ingestion ./charts/atp-ingestion \
-f charts/atp-ingestion/values-production.yaml \
--set replicaCount=10 \
-n atp-production
FluxCD Configuration Per Environment¶
GitRepository per Environment (Branch Targeting)¶
Dev GitRepository:
# clusters/dev/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops-dev
namespace: flux-system
spec:
interval: 30s # Fast polling for dev
url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
ref:
branch: dev # Dev branch
secretRef:
name: gitops-credentials
  ignore: |
    # .sourceignore syntax: exclude other environments' directories
    /production/
    /staging/
    /test/
Test GitRepository:
# clusters/test/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops-test
namespace: flux-system
spec:
interval: 1m
url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
ref:
branch: test
secretRef:
name: gitops-credentials
Production GitRepository:
# clusters/production/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops-production
namespace: flux-system
spec:
interval: 5m # Slower polling for production
url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
ref:
branch: production # Production branch
secretRef:
name: gitops-credentials
  ignore: |
    # .sourceignore syntax: exclude non-production directories
    /dev/
    /test/
    /staging/
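Each cluster's Flux installation should track only its own branch; a quick per-cluster check (kubectl context names hypothetical):
# Confirm which branch/revision each cluster tracks
kubectl config use-context atp-dev-aks && flux get sources git
kubectl config use-context atp-prod-eus-aks && flux get sources git
# The REVISION column should show dev@sha1:... and production@sha1:... respectively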
Kustomization per Environment¶
Dev Kustomization:
# clusters/dev/kustomization-apps.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-dev
namespace: flux-system
spec:
interval: 1m
path: ./apps
prune: true # Auto-prune in dev
wait: false # Don't wait for readiness in dev
timeout: 5m
sourceRef:
kind: GitRepository
name: atp-gitops-dev
Production Kustomization:
# clusters/production/kustomization-apps.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
namespace: flux-system
spec:
interval: 10m
path: ./apps
prune: false # Manual pruning only in production
wait: true # Wait for readiness
timeout: 20m
retryInterval: 5m
sourceRef:
kind: GitRepository
name: atp-gitops-production
dependsOn:
- name: infrastructure
- name: secrets
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: atp-gateway
namespace: atp-production
Sync Policies per Environment¶
Environment Sync Policy Matrix:
| Environment | Auto-Sync | Prune | Wait | Timeout | Manual Approval |
|---|---|---|---|---|---|
| Dev | ✅ Yes | ✅ Yes | ❌ No | 5m | ❌ No |
| Test | ✅ Yes | ✅ Yes | ✅ Yes | 10m | ❌ No |
| Staging | ⚠️ Selective | ❌ No | ✅ Yes | 15m | ✅ Yes (1 approver) |
| Production | ❌ No | ❌ No | ✅ Yes | 20m | ✅ Yes (2 approvers, CAB) |
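Operationally, the staging/production approval gates can be enforced by suspending reconciliation outside approved change windows; a sketch:
# Before the change window: pause reconciliation
flux suspend kustomization apps-production
# After CAB approval and the merge to the production branch: resume and apply
flux resume kustomization apps-production
flux reconcile kustomization apps-production --with-source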
Environment-Specific Reconciliation Settings¶
Production Reconciliation Settings:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
spec:
interval: 10m
retryInterval: 5m
timeout: 20m
suspend: false # Reconciliation enabled
path: ./apps
prune: false # Never auto-prune
wait: true # Wait for health checks
sourceRef:
kind: GitRepository
name: atp-gitops-production
syncOptions:
- CreateNamespace=true
- ReplaceOnCreate=false # Safer in production
Environment-Specific Configurations¶
Log Levels (Debug → Error)¶
Environment Log Levels:
| Environment | Default Level | Microsoft Level | Log Retention |
|---|---|---|---|
| Dev | Debug | Debug | 7 days |
| Test | Information | Information | 30 days |
| Staging | Warning | Warning | 90 days |
| Production | Error | Error | 365 days |
Log Level Configuration:
# apps/atp-ingestion/base/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: atp-ingestion-config
data:
Logging__LogLevel__Default: "Information" # Base level
Logging__LogLevel__Microsoft: "Warning"
Logging__LogLevel__System: "Error"
# Production overlay patch
apiVersion: v1
kind: ConfigMap
metadata:
name: atp-ingestion-config
data:
Logging__LogLevel__Default: "Error" # Override for production
Logging__LogLevel__Microsoft: "Error"
Telemetry Sampling (100% → 10%)¶
Telemetry Sampling Rates:
| Environment | Sampling Rate | Rationale |
|---|---|---|
| Dev | 100% | Full visibility for debugging |
| Test | 50% | Balance between visibility and cost |
| Staging | 25% | Production-like, reduced cost |
| Production | 10% | Cost optimization, sufficient insights |
Telemetry Configuration:
# Production telemetry settings
env:
- name: telemetry:sampling
value: "10" # 10% sampling
- name: APPLICATIONINSIGHTS_SAMPLING_PERCENTAGE
value: "10"
Feature Flags per Environment¶
Feature Flag Configuration:
# Dev feature flags
featureFlags:
enableDebugMode: true
enableProfiling: true
enableDetailedMetrics: true
enableExperimentalFeatures: true
# Production feature flags
featureFlags:
enableDebugMode: false
enableProfiling: false
enableDetailedMetrics: false
enableExperimentalFeatures: false
enableMaintenanceMode: false
Database Connection Strings¶
Environment-Specific Database Connections:
# Dev database
env:
- name: ConnectionStrings__DefaultConnection
valueFrom:
secretKeyRef:
name: sql-connection-string
key: connection-string
---
# ExternalSecret for dev
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: sql-connection-string
namespace: atp-dev
spec:
secretStoreRef:
name: azure-keyvault-dev
kind: ClusterSecretStore
data:
- secretKey: connectionString
remoteRef:
key: connection-strings/atp-ingestion/sql-connection-string
External Service Endpoints¶
Environment-Specific Endpoints:
# Dev endpoints
config:
externalServices:
paymentGateway: "https://api.stripe.com/test"
emailService: "https://api.sendgrid.com/v3/test"
storageAccount: "https://atpdevstorage.blob.core.windows.net"
# Production endpoints
config:
externalServices:
paymentGateway: "https://api.stripe.com"
emailService: "https://api.sendgrid.com/v3"
storageAccount: "https://atpprodstorage.blob.core.windows.net"
Resource Quotas and Limits¶
Namespace-Level Quotas¶
Dev Namespace Quota:
# platform/resource-quotas/dev-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: atp-dev-quota
namespace: atp-dev
spec:
hard:
requests.cpu: "4"
requests.memory: 8Gi
limits.cpu: "8"
limits.memory: 16Gi
persistentvolumeclaims: "5"
pods: "20"
services: "10"
Production Namespace Quota:
# platform/resource-quotas/production-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: atp-production-quota
namespace: atp-production
spec:
hard:
requests.cpu: "100"
requests.memory: 200Gi
limits.cpu: "200"
limits.memory: 400Gi
persistentvolumeclaims: "50"
pods: "200"
services: "50"
CPU and Memory Limits per Environment¶
Resource Limit Matrix:
| Environment | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Dev | 100m | 500m | 256Mi | 512Mi |
| Test | 200m | 1000m | 512Mi | 1Gi |
| Staging | 500m | 2000m | 1Gi | 2Gi |
| Production | 1000m | 2000m | 2Gi | 4Gi |
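A LimitRange per namespace can default these values for containers that omit requests/limits; a sketch for production using the matrix values (file path hypothetical):
# platform/limit-ranges/production-limits.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: atp-production-limits
  namespace: atp-production
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 1000m
        memory: 2Gi
      default:
        cpu: 2000m
        memory: 4Gi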
Storage Quotas¶
Storage Quota per Environment:
# Production storage quota
apiVersion: v1
kind: ResourceQuota
metadata:
name: atp-production-storage-quota
namespace: atp-production
spec:
hard:
requests.storage: 500Gi
persistentvolumeclaims: "50"
Pod Count Limits¶
Pod Count Limits:
| Environment | Max Pods | Rationale |
|---|---|---|
| Dev | 20 | Minimal footprint |
| Test | 50 | Integration testing needs |
| Staging | 100 | Production-like scale |
| Production | 200 | High availability, scale |
HPA Configuration Per Environment¶
Min/Max Replicas per Environment¶
HPA Configuration Matrix:
| Environment | Min Replicas | Max Replicas | Target CPU | Target Memory |
|---|---|---|---|---|
| Dev | 1 | 2 | 70% | 80% |
| Test | 2 | 4 | 70% | 80% |
| Staging | 3 | 6 | 70% | 80% |
| Production | 5 | 10 | 70% | 80% |
Production HPA:
# apps/atp-ingestion/overlays/production/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: atp-ingestion-hpa
namespace: atp-production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
minReplicas: 5
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 60
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Max
Scaling Thresholds (CPU, Memory)¶
Scaling Thresholds:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: atp-ingestion-hpa
spec:
metrics:
# CPU-based scaling
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70 # Scale up when CPU > 70%
# Memory-based scaling
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80 # Scale up when memory > 80%
Custom Metrics with KEDA¶
KEDA ScaledObject:
# apps/atp-ingestion/overlays/production/keda-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: atp-ingestion-scaler
namespace: atp-production
spec:
scaleTargetRef:
name: atp-ingestion
minReplicaCount: 5
maxReplicaCount: 10
triggers:
# CPU-based scaling
- type: cpu
metadata:
type: Utilization
value: "70"
# Memory-based scaling
- type: memory
metadata:
type: Utilization
value: "80"
# RabbitMQ queue length
- type: rabbitmq
metadata:
queueName: audit-events
queueLength: "100"
host: "amqp://rabbitmq.atp-production:5672"
# HTTP requests per second
- type: prometheus
metadata:
serverAddress: "http://prometheus.monitoring:9090"
metricName: http_requests_per_second
threshold: "100"
query: "sum(rate(http_requests_total[1m]))"
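The inline amqp:// host above keeps connection details in Git. In keeping with the platform's no-credentials-in-Git principle, the RabbitMQ trigger can instead reference a TriggerAuthentication backed by a Kubernetes Secret (secret name and key are assumptions):
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: rabbitmq-auth
  namespace: atp-production
spec:
  secretTargetRef:
    - parameter: host
      name: rabbitmq-connection # Secret synced from Key Vault
      key: host
---
# The trigger then references the auth instead of an inline host:
# - type: rabbitmq
#   metadata:
#     queueName: audit-events
#     queueLength: "100"
#   authenticationRef:
#     name: rabbitmq-auth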
Scale-to-Zero in Dev¶
Dev Scale-to-Zero:
# apps/atp-ingestion/overlays/dev/keda-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: atp-ingestion-scaler
  namespace: atp-dev
spec:
  scaleTargetRef:
    name: atp-ingestion
  minReplicaCount: 0 # scale to zero when idle
  maxReplicaCount: 2
  triggers:
    # NOTE: core KEDA has no built-in "http" trigger; this entry is shorthand
    # for the KEDA HTTP add-on's HTTPScaledObject, sketched below
    - type: http
      metadata:
        targetValue: "1"
        activationTargetValue: "1"
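Actual HTTP-driven scale-to-zero comes from the KEDA HTTP add-on (interceptor plus external scaler) and its HTTPScaledObject resource. A sketch under that assumption; field names vary slightly across add-on versions, and the host, service name, and port are assumptions:
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: atp-ingestion
  namespace: atp-dev
spec:
  hosts:
    - atp-ingestion.dev.connectsoft.example
  scaleTargetRef:
    name: atp-ingestion # target Deployment
    service: atp-ingestion # Service fronting the pods
    port: 8080
  replicas:
    min: 0 # scale to zero when idle
    max: 2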
Multi-Region Traffic Routing¶
Azure Front Door Configuration¶
Azure Front Door Setup:
# infrastructure/azure-front-door.yaml
# Illustrative declarative schema (likewise the Traffic Manager profile below);
# in practice these global resources are provisioned via Pulumi or ARM
apiVersion: networking.azure.com/v1
kind: FrontDoor
metadata:
name: atp-frontdoor
spec:
resourceGroupName: atp-production-rg
location: global
frontendEndpoints:
- name: atp-frontend
hostName: atp.connectsoft.example
sessionAffinityEnabledState: Enabled
sessionAffinityTtlSeconds: 0
backendPools:
- name: primary-eus
loadBalancingSettings:
name: default
healthProbeSettings:
name: default
backends:
- address: atp-prod-eus-aks.region.cloudapp.azure.com
enabled: true
priority: 1
weight: 100
httpPort: 80
httpsPort: 443
- name: secondary-weu
backends:
- address: atp-prod-weu-aks.region.cloudapp.azure.com
enabled: true
priority: 2
weight: 1 # standby — failover is governed by priority (valid weight range is 1-1000)
httpPort: 80
httpsPort: 443
routingRules:
- name: failover-rule
acceptedProtocols:
- Http
- Https
patternsToMatch:
- "/*"
routeConfiguration:
forwardingConfiguration:
forwardingProtocol: MatchRequest
backendPool:
id: primary-eus
cacheConfiguration:
queryParameterStripDirective: StripAll
dynamicCompression: Enabled
frontendEndpoints:
- atp-frontend
Traffic Manager for DNS-Based Routing¶
Traffic Manager Configuration:
# infrastructure/traffic-manager.yaml
apiVersion: network.azure.com/v1
kind: TrafficManagerProfile
metadata:
name: atp-traffic-manager
spec:
resourceGroupName: atp-production-rg
location: global
profileStatus: Enabled
trafficRoutingMethod: Priority # Failover routing
dnsConfig:
relativeName: atp-connectsoft
ttl: 60
monitorConfig:
protocol: Https
port: 443
path: /health
intervalInSeconds: 30
timeoutInSeconds: 10
toleratedNumberOfFailures: 3
endpoints:
- name: primary-eus
target: atp-prod-eus-aks.region.cloudapp.azure.com
type: ExternalEndpoints
priority: 1
weight: 100
enabled: true
- name: secondary-weu
target: atp-prod-weu-aks.region.cloudapp.azure.com
type: ExternalEndpoints
priority: 2
weight: 1 # standby — priority-based failover (valid weight range is 1-1000)
enabled: true
Regional Failover Policies¶
Failover Policy Configuration:
# Front Door failover rules
routingRules:
- name: failover-rule
acceptedProtocols:
- Http
- Https
routeConfiguration:
forwardingConfiguration:
backendPool:
id: primary-eus
# Failover to secondary if primary unhealthy
loadBalancingSettings:
sampleSize: 4
successfulSamplesRequired: 3
Health Probe Configuration:
healthProbeSettings:
- name: default
path: /health
protocol: Https
intervalInSeconds: 30
enabledState: Enabled
Health Probe Configuration¶
Health Probe Setup:
# Health probe for Front Door
healthProbeSettings:
- name: atp-health-probe
path: /health/live
protocol: Https
intervalInSeconds: 30
timeoutInSeconds: 10
unhealthyThreshold: 3
enabledState: Enabled
healthProbeMethod: Head
Application Health Endpoint:
// Health check endpoint for multi-region routing
[ApiController]
[Route("health")]
public class HealthController : ControllerBase
{
    private readonly HealthCheckService _healthCheck;

    public HealthController(HealthCheckService healthCheck)
    {
        _healthCheck = healthCheck;
    }

    [HttpGet("live")]
    public async Task<IActionResult> Liveness()
    {
        var result = await _healthCheck.CheckHealthAsync();
        return result.Status == HealthStatus.Healthy
            ? Ok()
            : StatusCode(503);
    }

    [HttpGet("ready")]
    public async Task<IActionResult> Readiness()
    {
        // Only checks tagged "ready" gate readiness (critical dependencies)
        var result = await _healthCheck.CheckHealthAsync(
            predicate: check => check.Tags.Contains("ready"));
        return result.Status == HealthStatus.Healthy
            ? Ok()
            : StatusCode(503);
    }
}
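Kubernetes needs to call these same endpoints for the pods themselves; a sketch of the corresponding container probes (the container port is an assumption):
# Probe configuration in the Deployment's pod spec
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 10
  failureThreshold: 3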
Cross-Environment Dependencies¶
Shared Services (Monitoring, Logging)¶
Shared Monitoring Stack:
# Shared monitoring namespace (single instance for all environments)
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
labels:
shared: "true"
---
# Prometheus (shared across environments)
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: monitoring
spec:
selector:
app: prometheus
ports:
- port: 9090
name: http
Cross-Environment Service Access:
# Alias in the consuming namespace (atp-production) pointing at shared monitoring
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: atp-production
spec:
  type: ExternalName
  externalName: prometheus.monitoring.svc.cluster.local
Service Discovery Across Environments¶
Multi-Cluster Service Discovery:
# Service export (if using service mesh)
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
name: atp-gateway
namespace: atp-production
---
# Service import
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceImport
metadata:
name: atp-gateway
namespace: atp-staging
spec:
type: ClusterSetIP
ports:
- port: 8080
protocol: TCP
VNet Peering (If Needed)¶
VNet Peering for Cross-Environment Access:
// VNet peering between environments (if required)
var devTestPeering = new Network.VirtualNetworkPeering("dev-test-peering", new()
{
ResourceGroupName = "atp-nonprod-rg",
VirtualNetworkName = "atp-dev-vnet",
RemoteVirtualNetworkId = testVNet.Id,
AllowForwardedTraffic = true,
AllowGatewayTransit = false,
UseRemoteGateways = false,
});
VNet Peering Policy:
| Environment Pair | Peering | Rationale |
|---|---|---|
| Dev ↔ Test | ⚠️ Optional | Shared resources, cost optimization |
| Staging ↔ Production | ❌ No | Security isolation required |
| Production EUS ↔ Production WEU | ✅ Yes | Multi-region HA/DR |
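Where environments share a cluster rather than a VNet, the same isolation stance can be expressed as a namespace-scoped NetworkPolicy that only admits in-namespace and monitoring traffic; a minimal sketch, relying on the standard kubernetes.io/metadata.name namespace labels:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-atp-production
  namespace: atp-production
spec:
  podSelector: {} # all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: atp-production
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring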
Summary: Multi-Environment AKS Deployment¶
- Environment-Specific AKS Clusters: Separate clusters per environment with rationale, cluster sizing/SKU selection, networking configuration, subscription strategy
- Regional Deployment Strategy: Primary region (East US), secondary region (West Europe), multi-region HA/DR, regional failover mechanisms
- Kustomize Overlays: Base manifests, dev overlay (minimal resources, debug logging), test overlay, staging overlay (production-like), production overlay (optimized, strict policies)
- Helm Values Files: values-dev.yaml, values-test.yaml, values-staging.yaml, values-production.yaml, value precedence and overrides
- FluxCD Configuration: GitRepository per environment (branch targeting), Kustomization per environment, sync policies per environment, environment-specific reconciliation settings
- Environment-Specific Configurations: Log levels, telemetry sampling rates, feature flags, database connection strings, external service endpoints
- Resource Quotas: Namespace-level quotas, CPU/memory limits per environment, storage quotas, pod count limits
- HPA Configuration: Min/max replicas per environment, scaling thresholds, custom metrics with KEDA, scale-to-zero in dev
- Multi-Region Traffic Routing: Azure Front Door configuration, Traffic Manager for DNS-based routing, regional failover policies, health probe configuration
- Cross-Environment Dependencies: Shared services (monitoring, logging), service discovery across environments, VNet peering if needed
Azure Monitor Integration & Observability¶
Purpose: Define how Azure Monitor, Log Analytics, and Grafana are integrated with ATP GitOps workflows to provide comprehensive observability, monitoring, alerting, and compliance evidence collection, ensuring complete visibility into deployment health, FluxCD operations, and application performance across all environments.
Azure Monitor Container Insights¶
Enabling Container Insights on AKS¶
Enable Container Insights:
# Enable Container Insights on AKS cluster
az aks enable-addons \
--resource-group atp-production-rg \
--name atp-prod-eus-aks \
--addons monitoring \
--workspace-resource-id /subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.OperationalInsights/workspaces/atp-prod-loganalytics
Container Insights via Pulumi:
// Enable Container Insights
var logAnalyticsWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
ResourceGroupName = "atp-production-rg",
Location = "eastus",
Sku = new OperationalInsights.Inputs.WorkspaceSkuArgs
{
Name = "PerGB2018",
},
RetentionInDays = environment == "production" ? 2555 : 30, // 7 years for production
Tags = new()
{
{ "Environment", environment },
{ "Retention", environment == "production" ? "7years" : "30days" },
},
});
// The monitoring addon itself is attached separately (e.g., the
// `az aks enable-addons --addons monitoring` command shown above,
// pointing at this workspace's resource ID)
Verify Container Insights:
# Check Container Insights status
az aks show \
--resource-group atp-production-rg \
--name atp-prod-eus-aks \
--query addonProfiles.omsagent
# Check OMS agent pods
kubectl get pods -n kube-system | grep omsagent
Metrics Collection and Aggregation¶
Container Insights Metrics:
| Metric Category | Examples | Collection Interval |
|---|---|---|
| Node Metrics | CPU, Memory, Disk I/O, Network | 60s |
| Pod Metrics | CPU, Memory, Restart count | 60s |
| Container Metrics | CPU, Memory per container | 60s |
| Controller Metrics | Replica count, Ready replicas | 60s |
| Workload Metrics | Deployment, StatefulSet status | 60s |
Key Metrics Collected:
// Node metrics
InsightsMetrics
| where Origin == "container.azm.ms"
| where Namespace == "insights-metrics"
| where Name == "cpuUsageNanoCores"
| summarize avg(Val) by Computer, bin(TimeGenerated, 1m)
// Pod metrics
InsightsMetrics
| where Name == "podCpuUsageNanoCores"
| summarize avg(Val) by Computer, bin(TimeGenerated, 1m)
// Container restart count (KubePodInventory carries the restart counter)
KubePodInventory
| where ContainerRestartCount > 0
| project TimeGenerated, Computer, ContainerName, ContainerRestartCount
Log Analytics Workspace Configuration¶
Workspace Configuration:
// Log Analytics Workspace with long retention for production
var logAnalyticsWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
ResourceGroupName = "atp-production-rg",
Location = "eastus",
Sku = new OperationalInsights.Inputs.WorkspaceSkuArgs
{
Name = "PerGB2018", // Pay-as-you-go
},
RetentionInDays = 2555, // 7 years for compliance (interactive retention caps at 730 days; the remainder uses the archive tier)
DailyQuotaGb = -1, // No daily quota
Tags = new()
{
{ "Environment", "production" },
{ "Retention", "7years" },
{ "Compliance", "SOC2" },
},
});
Workspace Strategy:
| Environment | Dedicated Workspace | Shared Workspace |
|---|---|---|
| Dev/Test | ⚠️ Optional | ✅ Recommended (cost optimization) |
| Staging | ✅ Recommended | ⚠️ Acceptable |
| Production | ✅ Required | ❌ Not recommended |
ATP Workspace Strategy:
- Dev/Test: Shared atp-nonprod-loganalytics workspace
- Staging: Separate atp-staging-loganalytics workspace
- Production: Separate atp-prod-loganalytics workspace (7-year retention)
Cost Optimization (Sampling, Retention)¶
Cost Optimization Strategies:
| Strategy | Configuration | Impact |
|---|---|---|
| Log Sampling | 10% sampling in production | ✅ ~90% less ingested telemetry |
| Metric Aggregation | 5-minute aggregation | ✅ Reduced data volume |
| Retention Tiers | 7 years (prod), 30 days (dev) | ✅ Cost-optimized retention |
| Data Export | Archive to Blob Storage | ✅ Long-term storage cost reduction |
Production Log Sampling:
# Application Insights sampling (10% in production)
env:
- name: APPLICATIONINSIGHTS_SAMPLING_PERCENTAGE
value: "10"
# Or via ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: appinsights-config
data:
samplingPercentage: "10"
Log Retention Configuration:
// Production: 7-year retention
var prodWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
RetentionInDays = 2555, // 7 years
});
// Dev/Test: 30-day retention
var nonprodWorkspace = new OperationalInsights.Workspace("atp-nonprod-loganalytics", new()
{
RetentionInDays = 30,
});
Log Analytics Workspace¶
Workspace per Environment or Shared¶
Workspace Organization:
graph TB
subgraph "Production Subscription"
PROD_WS[atp-prod-loganalytics<br/>7-year retention]
STAGING_WS[atp-staging-loganalytics<br/>90-day retention]
end
subgraph "Non-Prod Subscription"
NONPROD_WS[atp-nonprod-loganalytics<br/>30-day retention]
end
PROD_EUS[Production AKS<br/>East US] --> PROD_WS
PROD_WEU[Production AKS<br/>West Europe] --> PROD_WS
STAGING[Staging AKS] --> STAGING_WS
DEV[Dev AKS] --> NONPROD_WS
TEST[Test AKS] --> NONPROD_WS
style PROD_WS fill:#90EE90
style STAGING_WS fill:#FFE5B4
style NONPROD_WS fill:#FFE5B4
Workspace Matrix:
| Environment | Workspace Name | Retention | Data Sources |
|---|---|---|---|
| Dev/Test | atp-nonprod-loganalytics | 30 days | Dev AKS, Test AKS |
| Staging | atp-staging-loganalytics | 90 days | Staging AKS |
| Production | atp-prod-loganalytics | 7 years (2555 days) | Production AKS (EUS, WEU) |
Log Retention Policies¶
Retention Policy Configuration:
// Log Analytics Workspace with retention
var logAnalyticsWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
ResourceGroupName = "atp-production-rg",
Location = "eastus",
RetentionInDays = 2555, // 7 years for compliance
DailyQuotaGb = -1, // No daily quota
PublicNetworkAccessForIngestion = "Enabled",
PublicNetworkAccessForQuery = "Enabled",
});
// Data export to Blob Storage for long-term archival
var dataExport = new OperationalInsights.DataExport("atp-prod-export", new()
{
ResourceGroupName = "atp-production-rg",
WorkspaceName = logAnalyticsWorkspace.Name,
TableNames = new[]
{
"ContainerLog",
"ContainerInventory",
"InsightsMetrics",
"AzureDiagnostics",
},
Destination = new OperationalInsights.Inputs.DestinationArgs
{
ResourceId = storageAccount.Id,
Type = "StorageAccount",
},
Enabled = true,
});
Retention by Table:
| Table | Production Retention | Non-Production Retention |
|---|---|---|
| ContainerLog | 7 years | 30 days |
| InsightsMetrics | 7 years | 30 days |
| AzureDiagnostics | 7 years | 30 days |
| FluxCDLogs | 7 years | 30 days |
Kusto Query Language (KQL) Examples¶
Pod Restart Query:
// Pod restart count per namespace
ContainerLog
| where TimeGenerated > ago(24h)
| where ContainerRestartCount > 0
| summarize
RestartCount = count(),
UniquePods = dcount(ContainerName),
LastRestart = max(TimeGenerated)
by Namespace, Computer
| order by RestartCount desc
Deployment Status Query:
// Deployment status from Container Insights
InsightsMetrics
| where Origin == "container.azm.ms"
| where Name == "k8sPodCount"
| where Namespace == "atp-production"
| extend PodCount = Val
| summarize
TotalPods = sum(PodCount),
AvgPods = avg(PodCount),
MaxPods = max(PodCount)
by Namespace, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
FluxCD Reconciliation Query:
// FluxCD reconciliation events
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "reconciliation"
| extend
Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
Status = extract(@"status=(\S+)", 1, LogEntry, typeof(string)),
Duration = extract(@"duration=(\d+\.\d+)", 1, LogEntry, typeof(real))
| summarize
TotalReconciliations = count(),
AvgDuration = avg(Duration),
MaxDuration = max(Duration),
SuccessCount = countif(Status == "success"),
FailureCount = countif(Status == "failure")
by Kustomization, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
Error Rate Query:
// Application error rate
ContainerLog
| where TimeGenerated > ago(1h)
| where LogEntry contains "ERROR" or LogEntry contains "Exception"
| extend
Service = extract(@"app=(\S+)", 1, LogEntry, typeof(string)),
ErrorType = extract(@"(\w+Exception)", 1, LogEntry, typeof(string))
| summarize
ErrorCount = count(),
UniqueErrors = dcount(ErrorType)
by Service, Computer, bin(TimeGenerated, 5m)
| order by ErrorCount desc
Custom Log Tables¶
Custom Log Table: Deployment Events:
// Kusto (Azure Data Explorer) management commands; in Log Analytics itself,
// custom tables are created via the Tables API or DCR-based custom logs
.create table DeploymentEvents (TimeGenerated:datetime, DeploymentId:string, ServiceName:string, Environment:string, Status:string, GitCommit:string, DeployedBy:string, Duration:real)
// Ingest deployment events
.ingest inline into table DeploymentEvents <|
2024-01-15T10:00:00Z, "deployment-abc123", "atp-ingestion", "production", "success", "abc123def", "FluxCD", 45.2
2024-01-15T11:00:00Z, "deployment-def456", "atp-query", "production", "success", "def456ghi", "FluxCD", 52.8
// Query deployment events
DeploymentEvents
| where Environment == "production"
| where TimeGenerated > ago(7d)
| summarize
TotalDeployments = count(),
SuccessfulDeployments = countif(Status == "success"),
FailedDeployments = countif(Status == "failure"),
AvgDuration = avg(Duration)
by ServiceName, bin(TimeGenerated, 1d)
Custom Log via Azure Function:
// Azure Function to ingest deployment events
// NOTE: [LogAnalyticsOutput] is an illustrative custom binding — Functions has
// no built-in Log Analytics output binding; in practice use the Logs Ingestion API
[FunctionName("IngestDeploymentEvent")]
public async Task IngestDeploymentEvent(
    [EventGridTrigger] EventGridEvent eventGridEvent,
    [LogAnalyticsOutput] IAsyncCollector<LogAnalyticsEvent> logAnalytics)
{
var deploymentEvent = JsonSerializer.Deserialize<DeploymentEvent>(eventGridEvent.Data.ToString());
await logAnalytics.AddAsync(new LogAnalyticsEvent
{
TimeGenerated = DateTime.UtcNow,
DeploymentId = deploymentEvent.DeploymentId,
ServiceName = deploymentEvent.ServiceName,
Environment = deploymentEvent.Environment,
Status = deploymentEvent.Status,
GitCommit = deploymentEvent.GitCommit,
DeployedBy = "FluxCD",
Duration = deploymentEvent.Duration,
});
}
FluxCD Metrics Export¶
Prometheus Metrics from FluxCD¶
FluxCD Metrics Endpoint:
# FluxCD automatically exposes Prometheus metrics
# Endpoint: http://kustomize-controller:8080/metrics
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: fluxcd-kustomize-controller
namespace: flux-system
spec:
selector:
matchLabels:
app: kustomize-controller
endpoints:
- port: http-prom
interval: 30s
path: /metrics
scrapeTimeout: 10s
Key FluxCD Metrics:
| Metric | Description | Labels |
|---|---|---|
| fluxcd_kustomize_reconciliation_total | Total reconciliations | status, kustomization |
| fluxcd_kustomize_reconciliation_duration_seconds | Reconciliation duration | kustomization |
| fluxcd_kustomize_reconciliation_errors_total | Reconciliation errors | kustomization, error_type |
| fluxcd_source_git_reconciliation_total | Git fetch reconciliations | status, gitrepository |
| fluxcd_source_git_reconciliation_duration_seconds | Git fetch duration | gitrepository |
Metrics Scraping Configuration¶
Prometheus Scrape Configuration:
# Prometheus scrape config for FluxCD
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 30s
scrape_timeout: 10s
scrape_configs:
- job_name: 'fluxcd-kustomize-controller'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- flux-system
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
regex: kustomize-controller
action: keep
- source_labels: [__meta_kubernetes_pod_container_port_number]
regex: "8080"
action: keep
- job_name: 'fluxcd-source-controller'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- flux-system
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
regex: source-controller
action: keep
Prometheus ServiceMonitor for FluxCD:
# ServiceMonitor for FluxCD controllers
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: fluxcd-controllers
namespace: monitoring
spec:
selector:
matchLabels:
app.kubernetes.io/part-of: flux
namespaceSelector:
matchNames:
- flux-system
endpoints:
- port: http-prom
interval: 30s
path: /metrics
scrapeTimeout: 10s
Key Metrics to Monitor¶
Critical FluxCD Metrics:
# Reconciliation success rate
sum(rate(fluxcd_kustomize_reconciliation_total{status="success"}[5m]))
/
sum(rate(fluxcd_kustomize_reconciliation_total[5m]))
# Reconciliation error rate
sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m]))
# Average reconciliation duration
avg(fluxcd_kustomize_reconciliation_duration_seconds)
# Git fetch duration (indicates network issues)
avg(fluxcd_source_git_reconciliation_duration_seconds)
Per-Kustomization Metrics:
# Reconciliation success rate per Kustomization
sum by (kustomization) (
rate(fluxcd_kustomize_reconciliation_total{status="success"}[5m])
)
/
sum by (kustomization) (
rate(fluxcd_kustomize_reconciliation_total[5m])
)
# Reconciliation duration per Kustomization
avg by (kustomization) (
fluxcd_kustomize_reconciliation_duration_seconds
)
Alerting on FluxCD Issues¶
FluxCD Alert Rules:
# alerts/fluxcd-reconciliation-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: fluxcd-reconciliation-alerts
namespace: monitoring
spec:
groups:
- name: fluxcd
interval: 30s
rules:
- alert: FluxCDHighErrorRate
expr: |
sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m])) > 0.1
for: 5m
labels:
severity: warning
component: fluxcd
annotations:
summary: "FluxCD reconciliation error rate is high"
description: "{{ $value }} errors per second detected"
- alert: FluxCDReconciliationSlow
expr: |
avg(fluxcd_kustomize_reconciliation_duration_seconds) > 300
for: 10m
labels:
severity: warning
component: fluxcd
annotations:
summary: "FluxCD reconciliations are taking longer than expected"
description: "Average duration: {{ $value }}s"
- alert: FluxCDGitFetchFailed
expr: |
increase(fluxcd_source_git_reconciliation_total{status="failure"}[5m]) > 3
for: 5m
labels:
severity: critical
component: fluxcd
annotations:
summary: "FluxCD Git fetch failures detected"
description: "Git repository {{ $labels.gitrepository }} failed to fetch"
Deployment Metrics¶
Sync Status per Application¶
Application Sync Status Query:
// Sync status per application
let FluxCDEvents = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied" or LogEntry contains "sync"
| extend
Application = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
Status = case(
LogEntry contains "successfully applied", "Success",
LogEntry contains "sync failed", "Failed",
LogEntry contains "drift detected", "Drift",
"Unknown"
),
GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string))
| project TimeGenerated, Application, Status, GitCommit;
FluxCDEvents
| summarize arg_max(TimeGenerated, Status, GitCommit) by Application
| project Application, LastSync = TimeGenerated, SyncStatus = Status, GitCommit
| order by LastSync desc
Sync Status Dashboard Query:
// Real-time sync status per application
ContainerLog
| where ContainerName contains "kustomize-controller"
| where TimeGenerated > ago(1h)
| extend
Application = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
Status = case(
LogEntry contains "successfully applied", "Success",
LogEntry contains "sync failed", "Failed",
"InProgress"
)
| summarize
Count = count(),
LastSync = max(TimeGenerated)
by Application, Status
| order by LastSync desc
Reconciliation Duration¶
Reconciliation Duration Metrics:
# Average reconciliation duration
avg(fluxcd_kustomize_reconciliation_duration_seconds)
# P50, P95, P99 reconciliation duration
histogram_quantile(0.50,
rate(fluxcd_kustomize_reconciliation_duration_seconds_bucket[5m])
)
histogram_quantile(0.95,
rate(fluxcd_kustomize_reconciliation_duration_seconds_bucket[5m])
)
histogram_quantile(0.99,
rate(fluxcd_kustomize_reconciliation_duration_seconds_bucket[5m])
)
# Per-Kustomization duration
avg by (kustomization) (
fluxcd_kustomize_reconciliation_duration_seconds
)
KQL Query for Reconciliation Duration:
// Reconciliation duration from logs
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "reconciliation"
| extend
Duration = extract(@"duration=(\d+\.\d+)", 1, LogEntry, typeof(real)),
Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| where isnotnull(Duration)
| summarize
AvgDuration = avg(Duration),
P50Duration = percentile(Duration, 50),
P95Duration = percentile(Duration, 95),
P99Duration = percentile(Duration, 99),
MaxDuration = max(Duration)
by Kustomization, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
Reconciliation Failure Rate¶
Failure Rate Metrics:
# Reconciliation failure rate
sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m]))
/
sum(rate(fluxcd_kustomize_reconciliation_total[5m]))
# Per-Kustomization failure rate
sum by (kustomization) (
rate(fluxcd_kustomize_reconciliation_errors_total[5m])
)
/
sum by (kustomization) (
rate(fluxcd_kustomize_reconciliation_total[5m])
)
KQL Failure Rate Query:
// Reconciliation failure rate
ContainerLog
| where ContainerName contains "kustomize-controller"
| where TimeGenerated > ago(24h)
| extend
Status = case(
LogEntry contains "successfully", "Success",
LogEntry contains "failed" or LogEntry contains "error", "Failure",
"Unknown"
),
Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| where Status != "Unknown"
| summarize
TotalReconciliations = count(),
Successful = countif(Status == "Success"),
Failed = countif(Status == "Failure"),
FailureRate = (countif(Status == "Failure") * 100.0) / count()
by Kustomization, bin(TimeGenerated, 1h)
| order by FailureRate desc
Drift Detection Events¶
Drift Detection Metrics:
# Drift detection rate
sum(rate(fluxcd_kustomize_drift_detected_total[5m]))
# Drift correction rate
sum(rate(fluxcd_kustomize_drift_corrected_total[5m]))
KQL Drift Detection Query:
// Drift detection events
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "drift detected"
| extend
Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
Resource = extract(@"resource=(\S+)", 1, LogEntry, typeof(string)),
DriftType = extract(@"drift type=(\S+)", 1, LogEntry, typeof(string))
| summarize
DriftCount = count(),
UniqueResources = dcount(Resource),
LastDrift = max(TimeGenerated)
by Kustomization, DriftType, bin(TimeGenerated, 1h)
| order by DriftCount desc
Deployment Frequency¶
Deployment Frequency Calculation:
// Deployment frequency (DORA metric)
let Deployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| extend
Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string))
| project TimeGenerated, Service, GitCommit;
Deployments
| summarize
DeploymentCount = count(),
UniqueServices = dcount(Service)
by bin(TimeGenerated, 1d)
| extend
DeploymentFrequency = DeploymentCount // Deployments per day
| order by TimeGenerated desc
Prometheus Query for Deployment Frequency:
# Deployment frequency (successful reconciliations per day)
sum(increase(fluxcd_kustomize_reconciliation_total{status="success"}[1d]))
Application Health After Deployment¶
Readiness Probe Success Rate¶
Readiness Probe Metrics:
# Readiness rate (kube_pod_status_ready is a gauge, so no rate())
sum(kube_pod_status_ready{condition="true"})
/
sum(kube_pod_status_ready)
# Pods currently failing readiness
sum(kube_pod_status_ready{condition="false"})
KQL Readiness Probe Query:
// Readiness approximation from pod phase (KubePodInventory)
KubePodInventory
| where Namespace == "atp-production"
| summarize
    TotalPods = dcount(PodUid),
    RunningPods = dcountif(PodUid, PodStatus == "Running")
    by Namespace, bin(TimeGenerated, 5m)
| extend ReadinessRate = (RunningPods * 100.0) / TotalPods
| order by TimeGenerated desc
Pod Restart Count¶
Pod Restart Metrics:
# Pod restart count
sum(increase(kube_pod_container_status_restarts_total[1h]))
# Pod restart rate per service
sum by (pod, namespace) (
increase(kube_pod_container_status_restarts_total[1h])
)
KQL Pod Restart Query:
// Pod restart count (KubePodInventory, not ContainerLog, carries the counter)
KubePodInventory
| where ContainerRestartCount > 0
| where TimeGenerated > ago(24h)
| summarize
    RestartCount = max(ContainerRestartCount),
    RestartEvents = count(),
    LastRestart = max(TimeGenerated)
    by Computer, ContainerName, Namespace
| order by RestartCount desc
HTTP Error Rates¶
HTTP Error Rate Metrics:
# HTTP 5xx error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# HTTP 4xx error rate
sum(rate(http_requests_total{status=~"4.."}[5m]))
/
sum(rate(http_requests_total[5m]))
KQL HTTP Error Rate Query:
// HTTP error rates from Application Insights (workspace-based AppRequests schema)
AppRequests
| where TimeGenerated > ago(1h)
| extend IsError = toint(ResultCode) >= 400 // ResultCode is a string column
| summarize
    TotalRequests = count(),
    ErrorRequests = countif(IsError),
    ErrorRate = (countif(IsError) * 100.0) / count()
    by AppRoleName, bin(TimeGenerated, 5m)
| order by ErrorRate desc
Response Latency¶
Response Latency Metrics:
# Average response latency
avg(http_request_duration_seconds)
# P95 response latency
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
)
# P99 response latency
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m])
)
KQL Response Latency Query:
// Response latency from Application Insights (DurationMs is already milliseconds)
AppRequests
| where TimeGenerated > ago(1h)
| summarize
    AvgLatency = avg(DurationMs),
    P50Latency = percentile(DurationMs, 50),
    P95Latency = percentile(DurationMs, 95),
    P99Latency = percentile(DurationMs, 99),
    MaxLatency = max(DurationMs)
    by AppRoleName, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
Integration with Application Metrics¶
Custom Application Metrics:
// C# application: Export custom metrics
public class MetricsExporter
{
private readonly IMetricsCollector _metrics;
public void RecordDeploymentEvent(string serviceName, string gitCommit)
{
_metrics.IncrementCounter("atp_deployment_total", new Dictionary<string, string>
{
{ "service", serviceName },
{ "git_commit", gitCommit },
{ "environment", "production" },
});
}
public void RecordDeploymentDuration(double durationSeconds)
{
_metrics.RecordHistogram("atp_deployment_duration_seconds", durationSeconds);
}
}
Prometheus Metrics Export:
// Prometheus metrics endpoint
app.UseMetricServer(); // Exposes /metrics endpoint
// Custom metrics
var deploymentCounter = Metrics.CreateCounter(
"atp_deployment_total",
"Total deployments",
new[] { "service", "environment", "status" });
Grafana Dashboards¶
FluxCD Operational Dashboard¶
FluxCD Dashboard JSON:
{
"dashboard": {
"title": "FluxCD Operational Dashboard",
"panels": [
{
"title": "Reconciliation Success Rate",
"targets": [{
"expr": "sum(rate(fluxcd_kustomize_reconciliation_total{status=\"success\"}[5m])) / sum(rate(fluxcd_kustomize_reconciliation_total[5m]))",
"legendFormat": "Success Rate"
}],
"type": "stat",
"thresholds": {
"steps": [
{ "value": 0, "color": "red" },
{ "value": 0.95, "color": "yellow" },
{ "value": 0.99, "color": "green" }
]
}
},
{
"title": "Reconciliation Duration",
"targets": [{
"expr": "avg(fluxcd_kustomize_reconciliation_duration_seconds)",
"legendFormat": "Avg Duration"
}],
"type": "graph"
},
{
"title": "Reconciliation Errors",
"targets": [{
"expr": "sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m]))",
"legendFormat": "Errors/sec"
}],
"type": "graph"
},
{
"title": "Reconciliation Status by Kustomization",
"targets": [{
"expr": "sum by (kustomization) (rate(fluxcd_kustomize_reconciliation_total[5m]))",
"legendFormat": "{{kustomization}}"
}],
"type": "bargauge"
}
]
}
}
Deployment Status Dashboard¶
Deployment Status Dashboard:
{
"dashboard": {
"title": "Deployment Status Dashboard",
"panels": [
{
"title": "Deployment Frequency",
"targets": [{
"expr": "sum(increase(fluxcd_kustomize_reconciliation_total{status=\"success\"}[1d]))",
"legendFormat": "Deployments/Day"
}],
"type": "stat"
},
{
"title": "Deployment Success Rate",
"targets": [{
"expr": "sum(rate(fluxcd_kustomize_reconciliation_total{status=\"success\"}[1h])) / sum(rate(fluxcd_kustomize_reconciliation_total[1h]))",
"legendFormat": "Success Rate"
}],
"type": "gauge"
},
{
"title": "Deployment Status by Service",
"targets": [{
"expr": "sum by (kustomization) (fluxcd_kustomize_reconciliation_total)",
"legendFormat": "{{kustomization}}"
}],
"type": "table"
}
]
}
}
Application Health Dashboard¶
Application Health Dashboard:
{
"dashboard": {
"title": "Application Health Dashboard",
"panels": [
{
"title": "Pod Readiness",
"targets": [{
"expr": "sum(rate(kube_pod_status_ready{condition=\"true\"}[5m])) / sum(rate(kube_pod_status_ready[5m]))",
"legendFormat": "Readiness Rate"
}],
"type": "stat"
},
{
"title": "Pod Restart Count",
"targets": [{
"expr": "sum(increase(kube_pod_container_status_restarts_total[1h]))",
"legendFormat": "Restarts"
}],
"type": "graph"
},
{
"title": "HTTP Error Rate",
"targets": [{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
"legendFormat": "5xx Error Rate"
}],
"type": "graph"
},
{
"title": "Response Latency (P95)",
"targets": [{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "P95 Latency"
}],
"type": "graph"
}
]
}
}
DORA Metrics Dashboard¶
DORA Metrics Dashboard:
{
"dashboard": {
"title": "DORA Metrics Dashboard",
"panels": [
{
"title": "Deployment Frequency",
"targets": [{
"expr": "sum(increase(fluxcd_kustomize_reconciliation_total{status=\"success\"}[1d]))",
"legendFormat": "Deployments/Day"
}],
"type": "stat"
},
{
"title": "Lead Time for Changes",
"targets": [{
"expr": "avg(deployment_lead_time_seconds)",
"legendFormat": "Avg Lead Time"
}],
"type": "stat"
},
{
"title": "Mean Time to Recovery (MTTR)",
"targets": [{
"expr": "avg(incident_recovery_time_seconds)",
"legendFormat": "MTTR"
}],
"type": "stat"
},
{
"title": "Change Failure Rate",
"targets": [{
"expr": "sum(rate(deployment_failures_total[1d])) / sum(rate(deployments_total[1d]))",
"legendFormat": "Failure Rate"
}],
"type": "gauge"
}
]
}
}
Azure Monitor Workbooks¶
Custom Workbooks for GitOps¶
GitOps Workbook Template:
{
"version": "Notebook/1.0",
"items": [
{
"type": 9,
"content": {
"version": "KqlParameterItem/1.0",
"parameters": [
{
"id": "timeRange",
"version": "KqlParameterItem/1.0",
"name": "TimeRange",
"type": 4,
"value": {
"durationMs": 86400000
}
},
{
"id": "environment",
"version": "KqlParameterItem/1.0",
"name": "Environment",
"type": 1,
"value": "production"
}
]
}
},
{
"type": 1,
"content": {
"version": "TextBlock/1.0",
"text": "## GitOps Deployment Status"
}
},
{
"type": 3,
"content": {
"version": "KqlItem/1.0",
"query": "ContainerLog\n| where ContainerName contains \"kustomize-controller\"\n| where TimeGenerated > ago({TimeRange})\n| where Namespace == \"{Environment}\"\n| summarize DeploymentCount = count() by bin(TimeGenerated, 1h)\n| render timechart",
"visualization": "timechart",
"size": 0,
"queryType": 0,
"resourceType": "microsoft.operationalinsights/workspaces"
}
}
]
}
Compliance Reporting Workbooks¶
Compliance Workbook:
{
"version": "Notebook/1.0",
"items": [
{
"type": 1,
"content": {
"version": "TextBlock/1.0",
"text": "## Compliance Audit Report"
}
},
{
"type": 3,
"content": {
"version": "KqlItem/1.0",
"query": "// Deployment audit trail\nContainerLog\n| where ContainerName contains \"kustomize-controller\"\n| where LogEntry contains \"applied\"\n| extend \n DeploymentId = extract(@\"deployment=(\\S+)\", 1, LogEntry, typeof(string)),\n GitCommit = extract(@\"revision=(\\S+)\", 1, LogEntry, typeof(string)),\n DeployedBy = \"FluxCD\"\n| project TimeGenerated, DeploymentId, GitCommit, DeployedBy, Namespace\n| order by TimeGenerated desc",
"visualization": "table",
"size": 0,
"queryType": 0
}
},
{
"type": 3,
"content": {
"version": "KqlItem/1.0",
"query": "// Policy compliance status\nAzureDiagnostics\n| where ResourceProvider == \"MICROSOFT.AUTHORIZATION\"\n| where Category == \"PolicyState\"\n| where TimeGenerated > ago(7d)\n| extend ComplianceState = tostring(parse_json(properties_s).complianceState_s)\n| summarize \n Compliant = countif(ComplianceState == \"Compliant\"),\n NonCompliant = countif(ComplianceState == \"NonCompliant\")\n by bin(TimeGenerated, 1d)\n| render timechart",
"visualization": "timechart",
"size": 0
}
}
]
}
Cost Analysis Workbooks¶
Cost Analysis Workbook:
{
"version": "Notebook/1.0",
"items": [
{
"type": 1,
"content": {
"version": "TextBlock/1.0",
"text": "## GitOps Cost Analysis"
}
},
{
"type": 3,
"content": {
"version": "KqlItem/1.0",
"query": "// Resource usage by environment\nInsightsMetrics\n| where Origin == \"container.azm.ms\"\n| where Name == \"cpuUsageNanoCores\"\n| extend Environment = extract(@\"namespace=(\\S+)\", 1, Namespace, typeof(string))\n| summarize \n AvgCPU = avg(Val),\n MaxCPU = max(Val)\n by Environment, bin(TimeGenerated, 1d)\n| render timechart",
"visualization": "timechart",
"size": 0
}
}
]
}
Alerting¶
Sync Failure Alerts¶
Sync Failure Alert Rule:
# alerts/fluxcd-sync-failure.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: fluxcd-sync-failure
namespace: monitoring
spec:
groups:
- name: fluxcd-sync
rules:
- alert: FluxCDSyncFailure
expr: |
sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m])) > 0
for: 5m
labels:
severity: critical
component: fluxcd
annotations:
summary: "FluxCD sync failure detected"
description: "{{ $value }} sync failures in the last 5 minutes"
Azure Monitor Alert Rule:
{
"location": "global",
"properties": {
"displayName": "FluxCD Sync Failure",
"description": "Alert when FluxCD sync failures detected",
"severity": 1,
"enabled": true,
"evaluationFrequency": "PT5M",
"windowSize": "PT5M",
"criteria": {
"allOf": [{
"odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
"name": "SyncFailure",
"metricName": "fluxcd_kustomize_reconciliation_errors_total",
"operator": "GreaterThan",
"threshold": 0,
"timeAggregation": "Total"
}]
},
"actions": []
}
}
Drift Detection Alerts¶
Drift Detection Alert:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: fluxcd-drift-detection
namespace: monitoring
spec:
groups:
- name: fluxcd-drift
rules:
- alert: FluxCDDriftDetected
expr: |
sum(rate(fluxcd_kustomize_drift_detected_total[5m])) > 0
for: 5m
labels:
severity: warning
component: fluxcd
annotations:
summary: "FluxCD drift detected"
description: "Cluster state differs from Git state"
KQL-Based Drift Alert:
// Drift detection alert query
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "drift detected"
| where TimeGenerated > ago(5m)
| extend
Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
Resource = extract(@"resource=(\S+)", 1, LogEntry, typeof(string))
| summarize DriftCount = count() by Kustomization, Resource
| where DriftCount > 0
Deployment Failure Alerts¶
Deployment Failure Alert:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: deployment-failure
namespace: monitoring
spec:
groups:
- name: deployments
rules:
- alert: DeploymentFailure
expr: |
sum(rate(fluxcd_kustomize_reconciliation_errors_total[10m])) > 2
for: 10m
labels:
severity: critical
component: deployment
annotations:
summary: "Deployment failure detected"
description: "{{ $value }} deployment failures in the last 10 minutes"
Health Check Failure Alerts¶
Health Check Failure Alert:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: health-check-failure
namespace: monitoring
spec:
groups:
- name: health
rules:
- alert: HealthCheckFailure
expr: |
sum(rate(kube_pod_status_ready{condition="false"}[5m])) > 0
for: 5m
labels:
severity: warning
component: health
annotations:
summary: "Health check failure detected"
description: "{{ $value }} pods with failed health checks"
Alert Routing (Email, Teams, PagerDuty)¶
Alert Manager Configuration:
# alertmanager-config.yaml
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'pagerduty'
- match:
severity: warning
receiver: 'teams'
- match:
component: fluxcd
receiver: 'slack'
receivers:
- name: 'default'
email_configs:
- to: 'team@connectsoft.example'
send_resolved: true
- name: 'teams'
webhook_configs:
- url: 'https://outlook.office.com/webhook/YOUR/WEBHOOK/URL'
send_resolved: true
- name: 'slack'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
channel: '#atp-alerts'
send_resolved: true
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_KEY'
send_resolved: true
Azure Monitor Action Groups:
{
"location": "global",
"properties": {
"groupShortName": "atp-alerts",
"enabled": true,
"emailReceivers": [{
"name": "team-email",
"emailAddress": "team@connectsoft.example",
"useCommonAlertSchema": true
}],
"smsReceivers": [{
"name": "oncall-sms",
"countryCode": "1",
"phoneNumber": "5551234567"
}],
"webhookReceivers": [{
"name": "teams-webhook",
"serviceUri": "https://outlook.office.com/webhook/YOUR/WEBHOOK/URL",
"useCommonAlertSchema": true
}]
}
}
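Alongside Alertmanager and Action Groups, FluxCD's own notification-controller can push reconciliation failures straight to a chat channel. A minimal sketch, assuming the notification-controller is installed and a teams-webhook Secret holds the webhook URL under the key address (the API version varies by Flux release):
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Provider
metadata:
  name: msteams
  namespace: flux-system
spec:
  type: msteams
  secretRef:
    name: teams-webhook # Secret with key `address`
---
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Alert
metadata:
  name: atp-sync-alerts
  namespace: flux-system
spec:
  providerRef:
    name: msteams
  eventSeverity: error # only reconciliation failures
  eventSources:
    - kind: Kustomization
      name: '*'
    - kind: GitRepository
      name: '*'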
Correlation¶
Linking Git Commits to Deployments¶
Correlation via Git Commit SHA:
// Link Git commits to deployments
let GitCommits = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied"
| extend
GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
DeploymentTime = TimeGenerated
| project GitCommit, Service, DeploymentTime;
let ApplicationMetrics = AppRequests
| extend
    GitCommit = tostring(Properties["git_commit"]),
    RequestTime = TimeGenerated
| project GitCommit, RequestTime, Success, DurationMs;
GitCommits
| join kind=inner ApplicationMetrics on GitCommit
| summarize
    DeploymentCount = count(),
    AvgLatency = avg(DurationMs),
    ErrorRate = (countif(Success == false) * 100.0) / count()
    by Service, GitCommit, bin(DeploymentTime, 1h)
Deployment Correlation Script:
#!/bin/bash
# scripts/correlate-deployment.sh
GIT_COMMIT="${1:-$(git rev-parse HEAD)}"
SERVICE_NAME="${2:-atp-ingestion}"
echo "🔗 Correlating deployment for commit: $GIT_COMMIT"
# Add annotation to deployment
kubectl annotate deployment "$SERVICE_NAME" -n atp-production \
gitops.git-commit="$GIT_COMMIT" \
gitops.deployed-at="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--overwrite
# Query correlation
az monitor log-analytics query \
--workspace "atp-prod-loganalytics" \
--analytics-query "
ContainerLog
| where ContainerName contains \"kustomize-controller\"
| where LogEntry contains \"$GIT_COMMIT\"
| project TimeGenerated, LogEntry
"
Linking Deployments to Application Metrics¶
Deployment-to-Metrics Correlation:
// Link deployments to application metrics
let Deployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| extend
Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
DeploymentTime = TimeGenerated
| project Service, GitCommit, DeploymentTime;
let Metrics = AppRequests
| extend
    Service = AppRoleName,
    MetricTime = TimeGenerated
| project Service, MetricTime, Success, DurationMs;
Deployments
| join kind=inner Metrics on Service
| where MetricTime between (DeploymentTime .. (DeploymentTime + 30m))
| summarize
    RequestCount = count(),
    ErrorRate = (countif(Success == false) * 100.0) / count(),
    AvgLatency = avg(DurationMs)
    by Service, GitCommit, bin(DeploymentTime, 5m)
Correlation IDs Throughout Stack¶
Correlation ID Injection:
// C#: Inject correlation ID in HTTP requests
public class CorrelationIdMiddleware
{
private readonly RequestDelegate _next;
public async Task InvokeAsync(HttpContext context)
{
var correlationId = context.Request.Headers["X-Correlation-ID"].FirstOrDefault()
?? Guid.NewGuid().ToString();
context.Items["CorrelationId"] = correlationId;
context.Response.Headers["X-Correlation-ID"] = correlationId;
using (LogContext.PushProperty("CorrelationId", correlationId))
{
await _next(context);
}
}
}
Correlation ID in Kubernetes:
# Add correlation ID to pod annotations
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
template:
metadata:
annotations:
gitops.git-commit: "abc123def456"
gitops.correlation-id: "deployment-abc123"
gitops.deployed-at: "2024-01-15T10:00:00Z"
Distributed Tracing with Azure Application Insights¶
Application Insights Integration:
// Configure Application Insights with distributed tracing
services.AddApplicationInsightsTelemetry(options =>
{
options.ConnectionString = "InstrumentationKey={key};IngestionEndpoint=https://eastus-8.in.applicationinsights.azure.com/";
options.EnableDependencyTrackingTelemetryModule = true;
options.EnableRequestTrackingTelemetryModule = true;
options.EnableAdaptiveSampling = true;
options.AdaptiveSamplingInitialSamplingPercentage = 10; // 10% in production
});
// Custom telemetry with correlation
var telemetryClient = new TelemetryClient();
telemetryClient.Context.Operation.Id = correlationId;
telemetryClient.Context.Operation.Name = "Deployment";
telemetryClient.TrackEvent("DeploymentCompleted", new Dictionary<string, string>
{
{ "Service", "atp-ingestion" },
{ "GitCommit", gitCommit },
{ "Environment", "production" },
});
Trace Correlation Query:
// Distributed trace correlation (time columns renamed so the join stays unambiguous)
let Traces = AppTraces
| extend
    CorrelationId = tostring(Properties["correlation_id"]),
    TraceTime = TimeGenerated
| project CorrelationId, OperationId, TraceTime, Message;
let Requests = AppRequests
| extend
    CorrelationId = tostring(Properties["correlation_id"]),
    RequestTime = TimeGenerated
| project CorrelationId, OperationId, RequestTime, Name, DurationMs;
Traces
| join kind=inner Requests on CorrelationId
| project CorrelationId, TraceTime, RequestTime, RequestDuration = DurationMs
| order by CorrelationId asc, TraceTime asc
Compliance Evidence¶
Deployment Audit Trail in Log Analytics¶
Deployment Audit Trail Query:
// Complete deployment audit trail
let DeploymentEvents = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied" or LogEntry contains "sync"
| extend
DeploymentId = extract(@"deployment=(\S+)", 1, LogEntry, typeof(string)),
Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
Status = case(
LogEntry contains "successfully", "Success",
LogEntry contains "failed", "Failed",
"InProgress"
),
DeployedBy = "FluxCD",
DeploymentTime = TimeGenerated
| project DeploymentTime, DeploymentId, Service, GitCommit, Status, DeployedBy, Namespace;
let Approvals = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DEVOPS"
| where Category == "PullRequest"
| extend
GitCommit = extract(@"commit=(\S+)", 1, properties_s, typeof(string)),
Approver = tostring(parse_json(properties_s).approver),
ApprovalTime = TimeGenerated
| project GitCommit, Approver, ApprovalTime;
DeploymentEvents
| join kind=leftouter Approvals on GitCommit
| project
DeploymentTime,
DeploymentId,
Service,
GitCommit,
Status,
DeployedBy,
Approver,
ApprovalTime,
Namespace
| order by DeploymentTime desc
Retention for 7 Years (Compliance Requirement)¶
7-Year Retention Configuration:
// Log Analytics Workspace with 7-year retention
var logAnalyticsWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
ResourceGroupName = "atp-production-rg",
Location = "eastus",
RetentionInDays = 2555, // 7 years (365 * 7); interactive retention caps at 730 days, the remainder via archive tier
Tags = new()
{
{ "Retention", "7years" },
{ "Compliance", "SOC2" },
},
});
// Export to Blob Storage for additional backup
var storageAccount = new Storage.Account("atp-prod-logs-backup", new()
{
ResourceGroupName = "atp-production-rg",
Location = "eastus",
Kind = "StorageV2",
SkuName = "Standard_LRS",
AccessTier = "Archive", // Cold storage for compliance
EnableHttpsTrafficOnly = true,
MinimumTlsVersion = "TLS1_2",
BlobProperties = new Storage.Inputs.BlobServicePropertiesArgs
{
DeleteRetentionPolicy = new Storage.Inputs.DeleteRetentionPolicyArgs
{
Enabled = true,
Days = 2555, // 7 years
},
VersioningEnabled = true,
},
});
Query Examples for Auditors¶
Auditor Query: Deployment History:
// Deployment history for auditors
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied"
| where TimeGenerated > ago(365d)
| extend
Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
DeploymentTime = TimeGenerated
| summarize
DeploymentCount = count(),
LastDeployment = max(DeploymentTime),
UniqueServices = dcount(Service)
by bin(TimeGenerated, 1d)
| order by TimeGenerated desc
Auditor Query: Change Approval:
// Change approval audit trail (deployments joined to PRs on the Git commit)
let PullRequests = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DEVOPS"
| where Category == "PullRequest"
| extend
    PRId = tostring(parse_json(properties_s).pullRequestId),
    GitCommit = extract(@"commit=(\S+)", 1, properties_s, typeof(string)),
    Approver = tostring(parse_json(properties_s).approver),
    ApprovalTime = TimeGenerated,
    Status = tostring(parse_json(properties_s).status)
| project PRId, GitCommit, Approver, ApprovalTime, Status;
let Deployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied"
| extend
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
    DeploymentTime = TimeGenerated
| project GitCommit, DeploymentTime;
PullRequests
| join kind=inner Deployments on GitCommit
| project PRId, ApprovalTime, Approver, DeploymentTime, Status
| order by ApprovalTime desc
Auditor Query: Policy Compliance:
// Policy compliance audit
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.AUTHORIZATION"
| where Category == "PolicyState"
| where TimeGenerated > ago(365d)
| extend
PolicyName = tostring(parse_json(properties_s).policyDefinitionName),
ComplianceState = tostring(parse_json(properties_s).complianceState),
ResourceId = tostring(parse_json(properties_s).resourceId)
| summarize
CompliantCount = countif(ComplianceState == "Compliant"),
NonCompliantCount = countif(ComplianceState == "NonCompliant"),
TotalChecks = count()
by PolicyName, bin(TimeGenerated, 1d)
| extend ComplianceRate = (CompliantCount * 100.0) / TotalChecks
| order by TimeGenerated desc
Export for eDiscovery¶
eDiscovery Export Script:
#!/bin/bash
# scripts/export-ediscovery.sh
START_DATE="${1:-$(date -u -d '7 years ago' +%Y-%m-%dT%H:%M:%SZ)}"
END_DATE="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
OUTPUT_PATH="${3:-./ediscovery-export}"
echo "📤 Exporting compliance logs for eDiscovery: $START_DATE to $END_DATE"
# Export deployment audit trail
az monitor log-analytics query \
--workspace "atp-prod-loganalytics" \
--analytics-query "
ContainerLog
| where ContainerName contains \"kustomize-controller\"
| where TimeGenerated between (datetime($START_DATE) .. datetime($END_DATE))
| where LogEntry contains \"applied\" or LogEntry contains \"sync\"
| project TimeGenerated, ContainerName, LogEntry, Namespace
" \
--output table > "$OUTPUT_PATH/deployment-audit-trail.csv"
# Export policy compliance
az monitor log-analytics query \
--workspace "atp-prod-loganalytics" \
--analytics-query "
AzureDiagnostics
| where ResourceProvider == \"MICROSOFT.AUTHORIZATION\"
| where Category == \"PolicyState\"
| where TimeGenerated between (datetime($START_DATE) .. datetime($END_DATE))
| project TimeGenerated, Category, properties_s
" \
--output table > "$OUTPUT_PATH/policy-compliance.csv"
# Export to Blob Storage for long-term storage
az storage blob upload-batch \
--destination "ediscovery-export" \
--source "$OUTPUT_PATH" \
--account-name "atpprodlogsbackup"
echo "✅ Export complete: $OUTPUT_PATH"
DORA Metrics¶
Deployment Frequency¶
Deployment Frequency Calculation:
// Deployment frequency (DORA metric)
// Definition: How often deployments are successfully released to production
let SuccessfulDeployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| where Namespace == "atp-production"
| extend
Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
DeploymentTime = TimeGenerated
| project DeploymentTime, Service;
SuccessfulDeployments
| summarize
DeploymentCount = count(),
UniqueServices = dcount(Service)
by bin(DeploymentTime, 1d)
| extend
DeploymentFrequency = DeploymentCount, // Deployments per day
DORA_Level = case(
DeploymentFrequency >= 1, "Elite", // At least daily (on-demand)
DeploymentFrequency >= 0.142, "High", // Once per week
DeploymentFrequency >= 0.033, "Medium", // Once per month
"Low" // Less than once per month
)
| order by DeploymentTime desc
Deployment Frequency Prometheus Query:
# Deployment frequency (deployments per day)
sum(increase(fluxcd_kustomize_reconciliation_total{status="success", namespace="atp-production"}[1d]))
Lead Time for Changes¶
Lead Time Calculation:
// Lead time for changes (DORA metric)
// Definition: Time from code commit to production deployment
let Commits = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "Git commit"
| extend
GitCommit = extract(@"commit=(\S+)", 1, LogEntry, typeof(string)),
CommitTime = TimeGenerated
| project GitCommit, CommitTime;
let Deployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| where Namespace == "atp-production"
| extend
GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
DeploymentTime = TimeGenerated
| project GitCommit, DeploymentTime;
Commits
| join kind=inner Deployments on GitCommit
| extend LeadTimeHours = datetime_diff('hour', DeploymentTime, CommitTime)
| summarize
AvgLeadTime = avg(LeadTimeHours),
P50LeadTime = percentile(LeadTimeHours, 50),
P95LeadTime = percentile(LeadTimeHours, 95),
DORA_Level = case(
avg(LeadTimeHours) < 24, "Elite", // Less than 1 day
avg(LeadTimeHours) < 168, "High", // Less than 1 week
avg(LeadTimeHours) < 720, "Medium", // Less than 1 month
"Low" // More than 1 month
)
by bin(DeploymentTime, 1d)
| order by DeploymentTime desc
Mean Time to Recovery (MTTR)¶
MTTR Calculation:
// Mean Time to Recovery (MTTR) - DORA metric
// Definition: Average time to recover from a failure in production
let Incidents = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "failed" or LogEntry contains "error"
| where Namespace == "atp-production"
| extend
IncidentStart = TimeGenerated,
Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| project IncidentStart, Service;
let Recoveries = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| where Namespace == "atp-production"
| extend
RecoveryTime = TimeGenerated,
Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| project RecoveryTime, Service;
Incidents
| join kind=inner Recoveries on Service
| where RecoveryTime >= IncidentStart
| summarize RecoveryTime = min(RecoveryTime) by Service, IncidentStart // first recovery after each incident
| extend RecoveryDurationMinutes = datetime_diff('minute', RecoveryTime, IncidentStart)
| summarize
MTTR = avg(RecoveryDurationMinutes),
P50MTTR = percentile(RecoveryDurationMinutes, 50),
P95MTTR = percentile(RecoveryDurationMinutes, 95),
IncidentCount = count(),
DORA_Level = case(
avg(RecoveryDurationMinutes) < 60, "Elite", // Less than 1 hour
avg(RecoveryDurationMinutes) < 1440, "High", // Less than 1 day
avg(RecoveryDurationMinutes) < 10080, "Medium", // Less than 1 week
"Low" // More than 1 week
)
by Day = bin(IncidentStart, 1d)
| order by Day desc
Change Failure Rate¶
Change Failure Rate Calculation:
// Change failure rate (DORA metric)
// Definition: Percentage of deployments that result in a failure in production
let AllDeployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied"
| where Namespace == "atp-production"
| extend
DeploymentTime = TimeGenerated,
Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
Status = case(
LogEntry contains "successfully", "Success",
LogEntry contains "failed", "Failed",
"Unknown"
)
| where Status != "Unknown"
| project DeploymentTime, Service, Status;
let Failures = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "failed" or LogEntry contains "error"
| where Namespace == "atp-production"
| extend
FailureTime = TimeGenerated,
Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| project FailureTime, Service;
AllDeployments
| join kind=leftouter Failures on Service
| extend
DeploymentFailed = case(
isnotnull(FailureTime)
and FailureTime >= DeploymentTime
and FailureTime <= DeploymentTime + 1h, 1, // Failure within 1 hour of deployment
0
)
| summarize DeploymentFailed = max(DeploymentFailed) by DeploymentTime, Service // one row per deployment (undo join fan-out)
| summarize
TotalDeployments = count(),
FailedDeployments = sum(DeploymentFailed),
ChangeFailureRate = (sum(DeploymentFailed) * 100.0) / count(),
DORA_Level = case(
(sum(DeploymentFailed) * 100.0) / count() < 5, "Elite", // Less than 5%
(sum(DeploymentFailed) * 100.0) / count() < 15, "High", // Less than 15%
(sum(DeploymentFailed) * 100.0) / count() < 45, "Medium", // Less than 45%
"Low" // More than 45%
)
by Day = bin(DeploymentTime, 1d)
| order by Day desc
Dashboard and Reporting¶
DORA Metrics Dashboard:
{
"dashboard": {
"title": "DORA Metrics Dashboard",
"panels": [
{
"title": "Deployment Frequency",
"targets": [{
"expr": "sum(increase(fluxcd_kustomize_reconciliation_total{status=\"success\", namespace=\"atp-production\"}[1d]))",
"legendFormat": "Deployments/Day"
}],
"type": "stat",
"thresholds": {
"steps": [
{ "value": 0, "color": "red" },
{ "value": 1, "color": "yellow" },
{ "value": 7, "color": "green" }
]
}
},
{
"title": "Lead Time for Changes",
"targets": [{
"expr": "avg(deployment_lead_time_hours)",
"legendFormat": "Avg Lead Time (hours)"
}],
"type": "stat"
},
{
"title": "Mean Time to Recovery (MTTR)",
"targets": [{
"expr": "avg(incident_recovery_time_minutes)",
"legendFormat": "MTTR (minutes)"
}],
"type": "stat"
},
{
"title": "Change Failure Rate",
"targets": [{
"expr": "sum(rate(deployment_failures_total[1d])) / sum(rate(deployments_total[1d]))",
"legendFormat": "Failure Rate %"
}],
"type": "gauge",
"thresholds": {
"steps": [
{ "value": 0, "color": "green" },
{ "value": 0.05, "color": "yellow" },
{ "value": 0.15, "color": "red" }
]
}
}
]
}
}
DORA Metrics Report:
// Comprehensive DORA metrics report (sketch)
// Each sub-query must project the common schema (TimeGenerated, Metric, Value, DORA_Level) before the union
let DeploymentFrequency = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| where Namespace == "atp-production"
| summarize Value = todouble(count()) by TimeGenerated = bin(TimeGenerated, 1d)
| extend Metric = "DeploymentFrequency", DORA_Level = "";
let LeadTime = ...; // lead-time query from above, projected to the common schema
let MTTR = ...; // MTTR query from above, projected to the common schema
let ChangeFailureRate = ...; // change-failure-rate query from above, projected to the common schema
union DeploymentFrequency, LeadTime, MTTR, ChangeFailureRate
| project TimeGenerated, Metric, Value, DORA_Level
| order by TimeGenerated desc
Summary: Azure Monitor Integration & Observability¶
- Azure Monitor Container Insights: Enabling Container Insights on AKS, metrics collection and aggregation, Log Analytics workspace configuration, cost optimization (sampling, retention)
- Log Analytics Workspace: Workspace per environment or shared strategy, log retention policies (7 years for production), KQL query examples, custom log tables
- FluxCD Metrics Export: Prometheus metrics from FluxCD, metrics scraping configuration, key metrics to monitor, alerting on FluxCD issues
- Deployment Metrics: Sync status per application, reconciliation duration, reconciliation failure rate, drift detection events, deployment frequency
- Application Health: Readiness probe success rate, pod restart count, HTTP error rates, response latency, integration with application metrics
- Grafana Dashboards: FluxCD operational dashboard, deployment status dashboard, application health dashboard, DORA metrics dashboard
- Azure Monitor Workbooks: Custom workbooks for GitOps, compliance reporting workbooks, cost analysis workbooks
- Alerting: Sync failure alerts, drift detection alerts, deployment failure alerts, health check failure alerts, alert routing (email, Teams, PagerDuty)
- Correlation: Linking Git commits to deployments, linking deployments to application metrics, correlation IDs throughout stack, distributed tracing with Application Insights
- Compliance Evidence: Deployment audit trail in Log Analytics, 7-year retention, query examples for auditors, export for eDiscovery
- DORA Metrics: Deployment frequency, lead time for changes, mean time to recovery (MTTR), change failure rate, dashboard and reporting
Rolling Updates & Deployment Strategies¶
Purpose: Define deployment strategies for ATP services including rolling updates, blue-green deployments, canary releases, and progressive delivery with Flagger, ensuring zero-downtime deployments, automated rollback capabilities, and risk mitigation through gradual traffic shifting and validation gates.
Kubernetes Rolling Updates¶
Default Rolling Update Strategy¶
Rolling Update Overview:
graph LR
subgraph "Rolling Update Process"
V1[V1 Pods<br/>3 replicas] --> V2[V1: 2 pods<br/>V2: 1 pod]
V2 --> V3[V1: 1 pod<br/>V2: 2 pods]
V3 --> V4[V2 Pods<br/>3 replicas]
end
style V1 fill:#90EE90
style V4 fill:#90EE90
style V2 fill:#FFE5B4
style V3 fill:#FFE5B4
Default Rolling Update Configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
namespace: atp-production
spec:
replicas: 5
strategy:
type: RollingUpdate # Default strategy
rollingUpdate:
maxSurge: 1 # Allow 1 extra pod during update
maxUnavailable: 0 # No downtime allowed
selector:
matchLabels:
app: atp-ingestion
template:
metadata:
labels:
app: atp-ingestion
version: v1.2.3
spec:
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
Rolling Update Strategy Types:
| Strategy Type | Description | Use Case |
|---|---|---|
| RollingUpdate | Gradually replaces old pods with new ones | ✅ Default for ATP (zero-downtime) |
| Recreate | Terminates all old pods before creating new ones | ❌ Not recommended (downtime) |
maxSurge and maxUnavailable Settings¶
maxSurge and maxUnavailable Configuration:
| Configuration | maxSurge | maxUnavailable | Effect |
|---|---|---|---|
| Zero Downtime | 1 | 0 | ✅ ATP Production - Always maintain service availability |
| Fast Rollout | 2 | 1 | ⚠️ Test/Dev - Faster updates, slight capacity reduction |
| Conservative | 1 | 1 | ⚠️ Staging - Balanced approach |
Production Configuration:
# Production: Zero-downtime rolling update
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 5
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow 1 extra pod (total: 6 pods during update)
maxUnavailable: 0 # Always maintain 5 ready pods
Rolling Update Math:
- Total Pods: 5 replicas
- maxSurge: 1 (can have 6 pods total during update)
- maxUnavailable: 0 (must always have 5 ready pods)
- Update Process: Replace 1 pod at a time, wait for readiness, then replace next
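The effective surge/unavailable settings can be confirmed on the live object before a rollout; a quick sketch using standard Kubernetes API field paths:
# Inspect the effective rolling update parameters on the live Deployment
kubectl get deployment atp-ingestion -n atp-production \
  -o jsonpath='{.spec.strategy.rollingUpdate}{"\n"}'
# Expected for the production profile: {"maxSurge":1,"maxUnavailable":0}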
Dev/Test Configuration (Faster Rollout):
# Dev/Test: Faster rollout with slight capacity reduction
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow 1 extra pod
maxUnavailable: 1 # Can temporarily reduce to 1 pod
Rolling Update Process¶
Rolling Update Steps:
- Create New Pod: Kubernetes creates a new pod with new image
- Wait for Readiness: New pod must pass readiness probe
- Add to Service: New pod receives traffic from Service
- Terminate Old Pod: Old pod receives SIGTERM, drains connections
- Repeat: Process repeats for remaining pods
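These steps can also be driven manually when inspecting a risky change; a minimal sketch using the same deployment and image tag as above:
# Trigger the update, then pause to inspect the mixed-version state
kubectl set image deployment/atp-ingestion \
  atp-ingestion=connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d \
  -n atp-production
kubectl rollout pause deployment/atp-ingestion -n atp-production
# ... verify the first new pod, then let the rollout continue
kubectl rollout resume deployment/atp-ingestion -n atp-production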
Rolling Update Visualization:
sequenceDiagram
participant K8s as Kubernetes
participant Old as Old Pods (v1.2.2)
participant New as New Pods (v1.2.3)
participant Svc as Service
Note over K8s: Start Rolling Update
K8s->>New: Create Pod 1 (v1.2.3)
New->>New: Wait for Readiness Probe
New->>Svc: Register with Service
Svc->>New: Route traffic to Pod 1
K8s->>Old: Terminate Pod 1 (SIGTERM)
Old->>Svc: Drain connections
Old->>Old: Graceful shutdown
K8s->>New: Create Pod 2 (v1.2.3)
New->>New: Wait for Readiness Probe
New->>Svc: Register with Service
Svc->>New: Route traffic to Pod 2
K8s->>Old: Terminate Pod 2 (SIGTERM)
Note over K8s: Repeat until all pods updated
Monitor Rolling Update Progress:
# Watch rolling update progress
kubectl rollout status deployment/atp-ingestion -n atp-production
# Get rollout history
kubectl rollout history deployment/atp-ingestion -n atp-production
# Describe rollout
kubectl describe deployment atp-ingestion -n atp-production
Monitoring Rollout Progress¶
Rollout Status Command:
# Monitor rollout in real-time
kubectl rollout status deployment/atp-ingestion -n atp-production --timeout=10m
# Output example:
# Waiting for deployment "atp-ingestion" rollout to finish: 2 of 5 updated replicas are available...
# Waiting for deployment "atp-ingestion" rollout to finish: 3 of 5 updated replicas are available...
# Waiting for deployment "atp-ingestion" rollout to finish: 4 of 5 updated replicas are available...
# deployment "atp-ingestion" successfully rolled out
Prometheus Metrics for Rollout:
# Rolling update progress
kube_deployment_status_replicas_available{deployment="atp-ingestion"}
/
kube_deployment_status_replicas{deployment="atp-ingestion"}
# Old vs new pods during rollout (derive version/status labels from the pod name)
label_replace(
label_replace(
kube_pod_info{pod=~"atp-ingestion-.*"},
"version", "$1", "pod", "(.*-v\\d+\\.\\d+\\.\\d+).*"
),
"status", "$1", "pod", ".*-(running|pending|terminating).*"
)
KQL Query for Rollout Status:
// Rolling update status from Container Insights
InsightsMetrics
| where Origin == "container.azm.ms"
| where Name == "k8sPodCount"
| where Namespace == "atp-production"
| extend
Deployment = extract(@"deployment=(\S+)", 1, Tags, typeof(string)),
PodVersion = extract(@"version=(v\d+\.\d+\.\d+)", 1, Tags, typeof(string))
| summarize
PodCount = sum(Val)
by Deployment, PodVersion, bin(TimeGenerated, 1m)
| order by TimeGenerated desc
Blue-Green Deployments¶
Blue-Green Concept and Benefits¶
Blue-Green Deployment Architecture:
graph TB
subgraph "Traffic Router"
ING[Ingress Controller]
end
subgraph "Blue Environment (Current)"
BLUE_NS[Namespace: atp-blue]
BLUE_SVC[Service: atp-ingestion-blue]
BLUE_PODS[Pods: v1.2.2<br/>5 replicas]
end
subgraph "Green Environment (New)"
GREEN_NS[Namespace: atp-green]
GREEN_SVC[Service: atp-ingestion-green]
GREEN_PODS[Pods: v1.2.3<br/>5 replicas]
end
ING -->|Current| BLUE_SVC
ING -.->|Switch| GREEN_SVC
BLUE_SVC --> BLUE_PODS
GREEN_SVC --> GREEN_PODS
style BLUE_PODS fill:#4A90E2
style GREEN_PODS fill:#90EE90
Blue-Green Benefits:
| Benefit | Description | ATP Use Case |
|---|---|---|
| Instant Rollback | Switch traffic back to blue instantly | ✅ Critical production updates |
| Zero Downtime | Green environment fully ready before switch | ✅ High availability requirement |
| Testing | Validate green environment before traffic | ✅ Production-like validation |
| Risk Reduction | Keep blue environment running during switch | ✅ Critical services |
Blue-Green vs Rolling Update:
| Aspect | Blue-Green | Rolling Update | ATP Decision |
|---|---|---|---|
| Downtime | ✅ Zero | ✅ Zero | ✅ Both viable |
| Rollback Speed | ✅ Instant (traffic switch) | ⚠️ Slow (re-rollout) | ✅ Blue-Green for critical |
| Resource Usage | ❌ 2x during switch | ✅ Efficient | ⚠️ Acceptable for critical services |
| Complexity | ❌ Higher | ✅ Lower | ⚠️ Blue-Green for staging/production |
Implementation with Namespace Switching¶
Blue Namespace Configuration:
# apps/atp-ingestion/overlays/production-blue/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: atp-production-blue
labels:
environment: production
deployment-color: blue
---
# Blue Service
apiVersion: v1
kind: Service
metadata:
name: atp-ingestion-blue
namespace: atp-production-blue
spec:
selector:
app: atp-ingestion
version: v1.2.2
ports:
- port: 80
targetPort: 8080
---
# Blue Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion-blue
namespace: atp-production-blue
spec:
replicas: 5
selector:
matchLabels:
app: atp-ingestion
version: v1.2.2
template:
metadata:
labels:
app: atp-ingestion
version: v1.2.2
deployment-color: blue
spec:
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:v1.2.2-def456g
readinessProbe:
httpGet:
path: /health/ready
port: 8080
Green Namespace Configuration:
# apps/atp-ingestion/overlays/production-green/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: atp-production-green
labels:
environment: production
deployment-color: green
---
# Green Service
apiVersion: v1
kind: Service
metadata:
name: atp-ingestion-green
namespace: atp-production-green
spec:
selector:
app: atp-ingestion
version: v1.2.3
ports:
- port: 80
targetPort: 8080
---
# Green Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion-green
namespace: atp-production-green
spec:
replicas: 5
selector:
matchLabels:
app: atp-ingestion
version: v1.2.3
template:
metadata:
labels:
app: atp-ingestion
version: v1.2.3
deployment-color: green
spec:
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
readinessProbe:
httpGet:
path: /health/ready
port: 8080
Traffic Routing with Ingress¶
Ingress with Blue-Green Routing:
# Ingress routing to blue (current)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: atp-ingestion-ingress
namespace: atp-production
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
nginx.ingress.kubernetes.io/upstream-vhost: atp-ingestion-blue.atp-production-blue.svc.cluster.local
spec:
ingressClassName: nginx
rules:
- host: atp-ingestion.connectsoft.example
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: atp-ingestion-blue
port:
number: 80
# Cross-namespace service reference
# Requires ExternalName Service in production namespace
Cross-Namespace Service Reference:
# ExternalName Service in production namespace pointing to blue
apiVersion: v1
kind: Service
metadata:
name: atp-ingestion-blue
namespace: atp-production
spec:
type: ExternalName
externalName: atp-ingestion-blue.atp-production-blue.svc.cluster.local
---
# Switch to green (update Ingress)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: atp-ingestion-ingress
namespace: atp-production
spec:
rules:
- host: atp-ingestion.connectsoft.example
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: atp-ingestion-green # Switched to green
port:
number: 80
Blue-Green Switch Script:
#!/bin/bash
# scripts/blue-green-switch.sh
ENVIRONMENT="${1:-production}"
CURRENT_COLOR="${2:-blue}"
NEW_COLOR="${3:-green}"
echo "🔄 Switching from $CURRENT_COLOR to $NEW_COLOR environment"
# Update Ingress to route to new color
kubectl patch ingress atp-ingestion-ingress -n atp-$ENVIRONMENT --type=json \
-p="[{\"op\": \"replace\", \"path\": \"/spec/rules/0/http/paths/0/backend/service/name\", \"value\": \"atp-ingestion-$NEW_COLOR\"}]"
# Wait for green pods to be ready
kubectl wait --for=condition=available --timeout=5m \
deployment/atp-ingestion-$NEW_COLOR -n atp-$ENVIRONMENT-$NEW_COLOR
# Verify traffic is routing to green
kubectl get ingress atp-ingestion-ingress -n atp-$ENVIRONMENT -o jsonpath='{.spec.rules[0].http.paths[0].backend.service.name}'
echo "✅ Traffic switched to $NEW_COLOR environment"
Rollback to Blue Environment¶
Instant Rollback to Blue:
#!/bin/bash
# scripts/blue-green-rollback.sh
ENVIRONMENT="${1:-production}"
CURRENT_COLOR="${2:-green}"
ROLLBACK_COLOR="${3:-blue}"
echo "⏪ Rolling back to $ROLLBACK_COLOR environment"
# Switch traffic back to blue
kubectl patch ingress atp-ingestion-ingress -n atp-$ENVIRONMENT --type=json \
-p="[{\"op\": \"replace\", \"path\": \"/spec/rules/0/http/paths/0/backend/service/name\", \"value\": \"atp-ingestion-$ROLLBACK_COLOR\"}]"
echo "✅ Rollback complete - traffic routed to $ROLLBACK_COLOR"
# Optionally scale down green environment to save resources
# kubectl scale deployment atp-ingestion-$CURRENT_COLOR -n atp-$ENVIRONMENT-$CURRENT_COLOR --replicas=0
When to Use Blue-Green¶
Blue-Green Deployment Decision Matrix:
| Service Type | Blue-Green Recommended? | Rationale |
|---|---|---|
| Critical Services (Gateway, Authentication) | ✅ Yes | Instant rollback capability |
| Database Migrations | ✅ Yes | Test new version before traffic |
| High-Traffic Services | ✅ Yes | Reduce risk of performance issues |
| Low-Risk Updates | ⚠️ Optional | Rolling update may be sufficient |
| Resource-Constrained | ❌ No | 2x resource usage during switch |
ATP Blue-Green Strategy:
- Production Critical Services: Blue-Green for gateway, authentication, ingestion
- Production Standard Services: Rolling update sufficient
- Staging: Blue-Green for validation before production promotion
Canary Releases¶
Canary Deployment Concept¶
Canary Release Architecture:
graph TB
subgraph "Traffic Router"
ING[Ingress Controller]
SVC[Service]
end
subgraph "Stable Version"
STABLE[Stable Pods<br/>v1.2.2<br/>90% traffic]
end
subgraph "Canary Version"
CANARY[Canary Pods<br/>v1.2.3<br/>10% traffic]
end
ING -->|90%| STABLE
ING -->|10%| CANARY
SVC --> STABLE
SVC --> CANARY
style STABLE fill:#4A90E2
style CANARY fill:#FFD700
Canary Release Benefits:
| Benefit | Description |
|---|---|
| Risk Reduction | Test new version with limited traffic |
| Gradual Rollout | Increase traffic percentage gradually (10% → 50% → 100%) |
| Automated Validation | Monitor metrics and auto-rollback on issues |
| User Impact Minimization | Only small percentage of users affected if issues occur |
Traffic Splitting Strategies¶
Traffic Splitting with Service Mesh (Istio):
# Istio VirtualService for canary traffic splitting
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: atp-ingestion-canary
namespace: atp-production
spec:
hosts:
- atp-ingestion.connectsoft.example
http:
- match:
- headers:
canary:
exact: "true"
route:
- destination:
host: atp-ingestion
subset: canary
weight: 100
- route:
- destination:
host: atp-ingestion
subset: stable
weight: 90 # 90% to stable
- destination:
host: atp-ingestion
subset: canary
weight: 10 # 10% to canary
---
# DestinationRule for stable and canary subsets
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: atp-ingestion
namespace: atp-production
spec:
host: atp-ingestion
subsets:
- name: stable
labels:
version: v1.2.2
- name: canary
labels:
version: v1.2.3
Traffic Splitting with Nginx Ingress:
# Nginx Ingress with canary annotations
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: atp-ingestion-canary
namespace: atp-production
annotations:
nginx.ingress.kubernetes.io/canary: "true"
nginx.ingress.kubernetes.io/canary-weight: "10" # 10% traffic to canary
nginx.ingress.kubernetes.io/canary-by-header: "canary"
nginx.ingress.kubernetes.io/canary-by-header-value: "true"
spec:
ingressClassName: nginx
rules:
- host: atp-ingestion.connectsoft.example
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: atp-ingestion-canary
port:
number: 80
---
# Main Ingress (90% traffic)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: atp-ingestion-main
namespace: atp-production
spec:
ingressClassName: nginx
rules:
- host: atp-ingestion.connectsoft.example
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: atp-ingestion-stable
port:
number: 80
Service Mesh Requirement (Linkerd/Istio)¶
Service Mesh Comparison:
| Feature | Istio | Linkerd | ATP Selection |
|---|---|---|---|
| Traffic Splitting | ✅ Advanced | ✅ Simple | ✅ Istio (more features) |
| Observability | ✅ Comprehensive | ✅ Good | ✅ Istio |
| Resource Usage | ❌ High | ✅ Low | ⚠️ Acceptable for production |
| Learning Curve | ❌ Steep | ✅ Easy | ⚠️ Investment required |
ATP Decision: Istio (for advanced canary features)
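Given the Istio selection, the mesh must be installed and sidecar injection enabled before the traffic-splitting examples above will take effect; a minimal sketch (the default profile is an assumption, not an ATP standard):
# Install Istio and enable sidecar injection for the production namespace
istioctl install --set profile=default -y
kubectl label namespace atp-production istio-injection=enabled --overwrite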
Gradual Traffic Shift (10% → 50% → 100%)¶
Progressive Traffic Shift:
# Stage 1: 10% canary
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: atp-ingestion-canary
spec:
http:
- route:
- destination:
host: atp-ingestion
subset: stable
weight: 90
- destination:
host: atp-ingestion
subset: canary
weight: 10 # 10% canary
---
# Stage 2: 50% canary (after validation)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: atp-ingestion-canary
spec:
http:
- route:
- destination:
host: atp-ingestion
subset: stable
weight: 50
- destination:
host: atp-ingestion
subset: canary
weight: 50 # 50% canary
---
# Stage 3: 100% canary (promote to stable)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: atp-ingestion-canary
spec:
http:
- route:
- destination:
host: atp-ingestion
subset: canary
weight: 100 # 100% canary (new stable)
Automated Traffic Shift Script:
#!/bin/bash
# scripts/canary-traffic-shift.sh
CANARY_WEIGHT="${1:-10}"
NAMESPACE="${2:-atp-production}"
echo "🎯 Shifting $CANARY_WEIGHT% traffic to canary"
# Update VirtualService
kubectl patch virtualservice atp-ingestion-canary -n $NAMESPACE --type=json \
-p="[{\"op\": \"replace\", \"path\": \"/spec/http/0/route/0/weight\", \"value\": $((100 - CANARY_WEIGHT))}, {\"op\": \"replace\", \"path\": \"/spec/http/0/route/1/weight\", \"value\": $CANARY_WEIGHT}]"
echo "✅ Traffic shifted: $CANARY_WEIGHT% canary, $((100 - CANARY_WEIGHT))% stable"
# Monitor for 5 minutes before next stage
sleep 300
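Example usage, staging the shift through the same 10% → 50% → 100% sequence described above:
./scripts/canary-traffic-shift.sh 10 atp-production
# validate metrics between steps, then continue
./scripts/canary-traffic-shift.sh 50 atp-production
./scripts/canary-traffic-shift.sh 100 atp-production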
Automated Canary Analysis¶
Canary Analysis Metrics:
# Flagger Canary with automated analysis (see Flagger section)
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
namespace: atp-production
spec:
analysis:
interval: 1m
threshold: 5
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
- name: request-duration
thresholdRange:
max: 500
interval: 1m
- name: error-rate
thresholdRange:
max: 1
interval: 1m
Progressive Delivery with Flagger¶
Flagger Overview and Architecture¶
Flagger Architecture:
graph TB
subgraph "GitOps Repository"
GIT[Git Commit<br/>New Version]
end
subgraph "Flagger Controller"
FLAGGER[Flagger<br/>Canary Controller]
METRICS[Metrics Provider<br/>Prometheus]
end
subgraph "Traffic Router"
ISTIO[Istio VirtualService]
end
subgraph "Deployment"
STABLE[Stable Deployment]
CANARY[Canary Deployment]
end
GIT -->|Triggers| FLAGGER
FLAGGER -->|Creates| CANARY
FLAGGER -->|Monitors| METRICS
METRICS -->|Validates| FLAGGER
FLAGGER -->|Updates| ISTIO
ISTIO -->|Routes| STABLE
ISTIO -->|Routes| CANARY
style FLAGGER fill:#FFD700
style CANARY fill:#90EE90
Flagger Benefits:
| Benefit | Description |
|---|---|
| Automated Rollout | Automatic canary promotion based on metrics |
| Automated Rollback | Automatic rollback on threshold violations |
| Traffic Shifting | Gradual traffic increase (10% → 50% → 100%) |
| Metric Validation | Validate latency, error rate, custom metrics |
Installation and Configuration¶
Flagger Installation via Helm:
# Add Flagger Helm repository
helm repo add flagger https://flagger.app
helm repo update
# Install Flagger with Istio support
helm upgrade --install flagger flagger/flagger \
--namespace flagger-system \
--create-namespace \
--set meshProvider=istio \
--set metricsServer=http://prometheus.monitoring:9090
Flagger Installation via Pulumi:
// Install Flagger via Helm chart
var flaggerRelease = new Pulumi.Kubernetes.Helm.V3.Release("flagger", new()
{
Chart = "flagger",
RepositoryOpts = new Pulumi.Kubernetes.Types.Inputs.Helm.V3.RepositoryOptsArgs
{
Repo = "https://flagger.app",
},
Namespace = "flagger-system",
CreateNamespace = true,
Values = new Dictionary<string, object>
{
{ "meshProvider", "istio" },
{ "metricsServer", "http://prometheus.monitoring:9090" },
},
});
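Whichever installation method is used, it is worth verifying the controller and its CRDs before creating Canary resources:
# Confirm the Flagger controller is ready and the Canary CRD is registered
kubectl -n flagger-system rollout status deployment/flagger --timeout=2m
kubectl get crd canaries.flagger.app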
Canary Resource Definition¶
Flagger Canary Configuration:
# apps/atp-ingestion/overlays/production/canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
namespace: atp-production
spec:
# Target deployment to manage
targetRef:
apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
# Service to create/update (Flagger generates the Istio VirtualService and
# DestinationRules from these fields; there is no separate trafficRouting block)
service:
port: 8080
targetPort: 8080
portDiscovery: true
hosts:
- atp-ingestion.connectsoft.example
gateways:
- istio-system/public-gateway
# Traffic management provider
provider: istio
# Canary analysis configuration
analysis:
interval: 1m # Check metrics every 1 minute
threshold: 5 # Max failed metric checks before rollback
maxWeight: 50 # Maximum canary traffic (50%)
stepWeight: 10 # Traffic increase per step (10%)
stepWeightPromotion: 50 # Traffic increase on promotion (50%)
stepWeights: [10, 20, 30, 40, 50] # Custom traffic steps (take precedence over stepWeight when set)
# Metrics to validate
metrics:
# Request success rate
- name: request-success-rate
thresholdRange:
min: 99 # Minimum 99% success rate
interval: 1m
queryTemplate: |
sum(rate(istio_requests_total{
destination_workload_namespace="{{ namespace }}",
destination_workload=~"{{ target }}",
response_code!~"5.."
}[1m]))
/
sum(rate(istio_requests_total{
destination_workload_namespace="{{ namespace }}",
destination_workload=~"{{ target }}"
}[1m]))
* 100
# Request duration (p95 latency)
- name: request-duration
thresholdRange:
max: 500 # Maximum 500ms p95 latency
interval: 1m
queryTemplate: |
histogram_quantile(0.95,
sum(rate(istio_request_duration_milliseconds_bucket{
destination_workload_namespace="{{ namespace }}",
destination_workload=~"{{ target }}"
}[1m])) by (le)
)
# Error rate
- name: error-rate
thresholdRange:
max: 1 # Maximum 1% error rate
interval: 1m
queryTemplate: |
sum(rate(istio_requests_total{
destination_workload_namespace="{{ namespace }}",
destination_workload=~"{{ target }}",
response_code=~"5.."
}[1m]))
/
sum(rate(istio_requests_total{
destination_workload_namespace="{{ namespace }}",
destination_workload=~"{{ target }}"
}[1m]))
* 100
# Custom business metric
- name: business-metric-threshold
thresholdRange:
min: 95 # Minimum business metric value
interval: 1m
queryTemplate: |
avg(rate(atp_business_metric_total{
service="{{ target }}",
namespace="{{ namespace }}"
}[1m]))
# Webhooks for pre/post deployment validation
webhooks:
# Pre-rollout validation (smoke tests)
- name: smoke-tests
type: pre-rollout
url: http://smoke-tests.atp-production:8080/validate
timeout: 30s
metadata:
type: "bash"
cmd: "kubectl exec -n atp-production deployment/smoke-tests -- /bin/sh -c 'curl -f http://atp-ingestion-canary:8080/health || exit 1'"
# Post-rollout validation
- name: integration-tests
type: rollout
url: http://integration-tests.atp-production:8080/validate
timeout: 2m
metadata:
type: "bash"
cmd: "kubectl exec -n atp-production deployment/integration-tests -- /bin/sh -c 'curl -f http://atp-ingestion-canary:8080/api/health || exit 1'"
# Load testing
- name: load-test
type: rollout
url: http://load-test.atp-production:8080/start
timeout: 5m
metadata:
cmd: "kubectl exec -n atp-production deployment/load-test -- /bin/sh -c 'artillery run test.yaml'"
Automated Rollback on Metric Thresholds¶
Flagger Rollback Configuration:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
spec:
# Don't skip canary analysis (spec-level field)
skipAnalysis: false
# Revert the workload to the primary spec if the Canary resource is deleted
revertOnDeletion: true
analysis:
# Alerting on rollback events
alerts:
- name: "error-rate-high"
severity: error
providerRef:
name: prometheus-alerts
namespace: monitoring
# Rollback trigger: after `threshold` consecutive failed checks of any metric,
# Flagger halts the rollout and shifts all traffic back to the primary
metrics:
- name: error-rate
thresholdRange:
max: 1 # Rollback if error rate > 1%
interval: 30s
Flagger Rollback Status:
# Check canary status
kubectl get canary atp-ingestion -n atp-production
# Watch canary rollout
kubectl describe canary atp-ingestion -n atp-production
# Check Flagger logs
kubectl logs -n flagger-system deployment/flagger -f
Integration with Service Mesh¶
Flagger with Istio Integration:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
spec:
provider: istio
service:
port: 8080
hosts:
- atp-ingestion.connectsoft.example
gateways:
- istio-system/public-gateway
# Applied to the DestinationRules Flagger generates (session affinity)
trafficPolicy:
loadBalancer:
consistentHash:
httpHeaderName: X-User-ID
Feature Flags Integration¶
LaunchDarkly or Azure App Configuration¶
Feature Flags Strategy:
| Feature Flag Provider | Pros | Cons | ATP Selection |
|---|---|---|---|
| LaunchDarkly | ✅ Advanced targeting, A/B testing | ❌ Cost, external dependency | ⚠️ Consider for advanced use cases |
| Azure App Configuration | ✅ Native Azure, integrated | ⚠️ Less features than LaunchDarkly | ✅ ATP Default (cost-effective) |
| Custom Solution | ✅ Full control | ❌ Maintenance overhead | ❌ Not recommended |
ATP Decision: Azure App Configuration (native Azure integration)
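Under that decision, the flags referenced below must exist in the App Configuration store first; a hedged sketch using the Azure CLI (the store name atp-prod-appconfig is an assumed example):
# Create the canary feature flag and its traffic-percentage setting
az appconfig feature set --name atp-prod-appconfig --feature CanaryEnabled --label Production --yes
az appconfig kv set --name atp-prod-appconfig --key FeatureFlags:CanaryTrafficPercentage --label Production --value 0 --yes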
Feature Flag-Based Rollout¶
Azure App Configuration Setup:
// Configure Azure App Configuration
services.AddAzureAppConfiguration(options =>
{
options.Connect(connectionString)
.Select(KeyFilter.Any, LabelFilter.Null)
.Select(KeyFilter.Any, "Production")
.ConfigureRefresh(refresh =>
{
refresh.Register("FeatureFlags:CanaryEnabled", refreshAll: true)
.SetCacheExpiration(TimeSpan.FromSeconds(30));
});
});
Feature Flag Integration in Application:
// C#: Feature flag for canary deployment
public class FeatureFlagService
{
private readonly IConfiguration _configuration;
public FeatureFlagService(IConfiguration configuration) => _configuration = configuration;
public bool IsCanaryEnabled()
{
return _configuration.GetValue<bool>("FeatureFlags:CanaryEnabled", defaultValue: false);
}
public int GetCanaryTrafficPercentage()
{
return _configuration.GetValue<int>("FeatureFlags:CanaryTrafficPercentage", defaultValue: 0);
}
}
// Use feature flag to control behavior
[ApiController]
[Route("[controller]")]
public class IngestionController : ControllerBase
{
private readonly FeatureFlagService _featureFlags;
[HttpPost("events")]
public async Task<IActionResult> IngestEvent([FromBody] Event evt)
{
// New feature enabled via feature flag
if (_featureFlags.IsCanaryEnabled())
{
// Use new processing logic
await ProcessEventV2(evt);
}
else
{
// Use stable processing logic
await ProcessEventV1(evt);
}
return Ok();
}
}
Feature Flag Configuration:
# Azure App Configuration via ConfigMap (reference)
apiVersion: v1
kind: ConfigMap
metadata:
name: feature-flags
namespace: atp-production
data:
FeatureFlags__CanaryEnabled: "false"
FeatureFlags__CanaryTrafficPercentage: "0"
---
# External Secret for Azure App Configuration connection
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: app-config-connection
namespace: atp-production
spec:
secretStoreRef:
name: azure-keyvault
kind: ClusterSecretStore
data:
- secretKey: AppConfigConnectionString
remoteRef:
key: connection-strings/app-config-connection-string
Gradual Feature Enablement¶
Gradual Feature Rollout:
// Gradually enable feature for a percentage of users
public class FeatureFlagService
{
private readonly IConfiguration _configuration;
public FeatureFlagService(IConfiguration configuration) => _configuration = configuration;
public bool ShouldEnableFeature(string userId)
{
var percentage = GetFeatureRolloutPercentage();
var userHash = GetUserHash(userId);
return (userHash % 100) < percentage;
}
private int GetFeatureRolloutPercentage()
{
// Gradually increase: 10% → 25% → 50% → 100%
return _configuration.GetValue<int>("FeatureFlags:RolloutPercentage", defaultValue: 0);
}
// Stable hash so a given user stays in the same rollout cohort across calls
private static int GetUserHash(string userId)
{
using var sha = System.Security.Cryptography.SHA256.Create();
var bytes = sha.ComputeHash(System.Text.Encoding.UTF8.GetBytes(userId));
return Math.Abs(BitConverter.ToInt32(bytes, 0) % 100);
}
}
Kill Switch for Problem Features¶
Kill Switch Implementation:
// Kill switch for emergency feature disable
public class FeatureFlagService
{
public bool IsFeatureKilled(string featureName)
{
return _configuration.GetValue<bool>($"FeatureFlags:KillSwitch:{featureName}", defaultValue: false);
}
}
// Use kill switch
if (_featureFlags.IsFeatureKilled("NewProcessingLogic"))
{
// Immediately fall back to stable logic
await ProcessEventV1(evt);
}
else
{
await ProcessEventV2(evt);
}
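Flipping the kill switch is a configuration change, not a deployment; a sketch of the emergency toggle (same assumed store name as above, with the key mirroring the configuration path):
# Disable the problem feature immediately; running pods pick up the change
# on their next configuration refresh (per the refresh registration above)
az appconfig kv set --name atp-prod-appconfig \
  --key FeatureFlags:KillSwitch:NewProcessingLogic \
  --label Production --value true --yes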
Pre-Deployment Validation¶
Smoke Tests Before Traffic Routing¶
Smoke Test Webhook:
# Flagger pre-rollout webhook
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
spec:
analysis:
webhooks:
- name: smoke-tests
type: pre-rollout
url: http://smoke-tests.atp-production:8080/validate
timeout: 30s
metadata:
type: "bash"
cmd: |
kubectl exec -n atp-production deployment/smoke-tests -- /bin/sh -c '
# Health check
curl -f http://atp-ingestion-canary:8080/health/live || exit 1
curl -f http://atp-ingestion-canary:8080/health/ready || exit 1
# Basic API test
curl -f -X POST http://atp-ingestion-canary:8080/api/events \
-H "Content-Type: application/json" \
-d "{\"eventType\":\"test\"}" || exit 1
'
Smoke Test Job:
# Pre-deployment smoke test job
apiVersion: batch/v1
kind: Job
metadata:
name: smoke-tests-pre-deploy
namespace: atp-production
spec:
template:
spec:
containers:
- name: smoke-tests
image: connectsoft.azurecr.io/atp/smoke-tests:latest
env:
- name: TARGET_URL
value: "http://atp-ingestion-canary:8080"
command:
- /bin/sh
- -c
- |
echo "Running smoke tests..."
# Health checks
curl -f $TARGET_URL/health/live || exit 1
curl -f $TARGET_URL/health/ready || exit 1
# API validation
response=$(curl -s -X POST $TARGET_URL/api/events \
-H "Content-Type: application/json" \
-d '{"eventType":"test"}')
if [ $? -ne 0 ]; then
echo "Smoke tests failed"
exit 1
fi
echo "Smoke tests passed"
restartPolicy: Never
Integration Tests in Canary¶
Integration Test Webhook:
# Flagger rollout webhook for integration tests
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
spec:
analysis:
webhooks:
- name: integration-tests
type: rollout
url: http://integration-tests.atp-production:8080/validate
timeout: 5m
metadata:
type: "bash"
cmd: |
kubectl exec -n atp-production deployment/integration-tests -- /bin/sh -c '
# Run integration test suite
dotnet test IntegrationTests.csproj \
--filter "Category=CanaryValidation" \
--logger "trx;LogFileName=results.trx" \
--results-directory /tmp/results
# Check test results
if [ $? -ne 0 ]; then
echo "Integration tests failed"
exit 1
fi
'
Database Migration Checks¶
Database Migration Validation:
# Pre-deployment database migration check
apiVersion: batch/v1
kind: Job
metadata:
name: db-migration-check
namespace: atp-production
spec:
template:
spec:
containers:
- name: migration-check
image: connectsoft.azurecr.io/atp/migration-tool:latest
env:
- name: CONNECTION_STRING
valueFrom:
secretKeyRef:
name: sql-connection-string
key: connection-string
command:
- /bin/sh
- -c
- |
echo "Checking database migrations..."
# Check if pending migrations exist
dotnet ef migrations list --connection "$CONNECTION_STRING"
# Validate migration scripts without applying them (generate idempotent SQL)
dotnet ef migrations script --idempotent -o /tmp/migrations.sql
if [ $? -ne 0 ]; then
echo "Database migration validation failed"
exit 1
fi
echo "Database migrations validated"
restartPolicy: Never
Dependency Availability Checks¶
Dependency Check Script:
#!/bin/bash
# scripts/pre-deployment-checks.sh
echo "🔍 Running pre-deployment validation checks..."
# Check Redis availability
redis-cli -h redis.atp-production ping || {
echo "❌ Redis not available"
exit 1
}
# Check SQL Database connectivity
sqlcmd -S sql-server.database.windows.net -U $DB_USER -P $DB_PASSWORD -Q "SELECT 1" || {
echo "❌ SQL Database not accessible"
exit 1
}
# Check Service Bus
az servicebus queue show --namespace-name atp-servicebus --resource-group atp-production --name audit-events || {
echo "❌ Service Bus not accessible"
exit 1
}
# Check Key Vault
az keyvault secret show --vault-name atp-keyvault --name test-secret || {
echo "❌ Key Vault not accessible"
exit 1
}
echo "✅ All dependency checks passed"
Post-Deployment Validation¶
Health Check Monitoring¶
Post-Deployment Health Checks:
# Flagger post-rollout validation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
spec:
analysis:
webhooks:
- name: post-deployment-health
type: post-rollout
url: http://health-monitor.atp-production:8080/validate
timeout: 2m
metadata:
type: "bash"
cmd: |
kubectl exec -n atp-production deployment/health-monitor -- /bin/sh -c '
# Monitor health for 2 minutes
for i in $(seq 1 24); do
health=$(curl -s -o /dev/null -w "%{http_code}" http://atp-ingestion-canary:8080/health/live)
if [ "$health" != "200" ]; then
echo "Health check failed: $health"
exit 1
fi
sleep 5
done
echo "Health checks passed"
'
Error Rate Thresholds¶
Error Rate Validation:
# Flagger metric for error rate validation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
spec:
analysis:
metrics:
- name: error-rate
thresholdRange:
max: 1 # Maximum 1% error rate
interval: 1m
queryTemplate: |
sum(rate(http_requests_total{
service="{{ target }}",
status=~"5.."
}[1m]))
/
sum(rate(http_requests_total{
service="{{ target }}"
}[1m]))
* 100
Latency Thresholds¶
Latency Validation:
# Flagger metric for latency validation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
spec:
analysis:
metrics:
- name: request-duration-p95
thresholdRange:
max: 500 # Maximum 500ms p95 latency
interval: 1m
queryTemplate: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{
service="{{ target }}"
}[1m])) by (le)
) * 1000
Business Metric Validation¶
Custom Business Metric:
# Flagger metric for business metric validation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
spec:
analysis:
metrics:
- name: event-processing-success-rate
thresholdRange:
min: 99.5 # Minimum 99.5% success rate
interval: 1m
queryTemplate: |
sum(rate(atp_events_processed_total{
service="{{ target }}",
status="success"
}[1m]))
/
sum(rate(atp_events_processed_total{
service="{{ target }}"
}[1m]))
* 100
Automatic Rollback Triggers¶
Error Rate Exceeds Threshold¶
Error Rate Rollback:
# Flagger automatic rollback on error rate
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
spec:
analysis:
metrics:
- name: error-rate
thresholdRange:
max: 1 # Rollback if error rate > 1%
interval: 30s
# After `threshold` failed checks Flagger rolls back automatically; alerts fire via the provider
alerts:
- name: prometheus
severity: error
providerRef:
name: prometheus-alerts
namespace: monitoring
Prometheus Alert for Error Rate:
# PrometheusRule for error rate alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: canary-error-rate
namespace: monitoring
spec:
groups:
- name: canary
rules:
- alert: CanaryHighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[1m]))
/
sum(rate(http_requests_total[1m]))
> 0.01 # 1% error rate
for: 30s
labels:
severity: critical
annotations:
summary: "Canary error rate exceeds threshold - rollback triggered"
Latency Degrades Beyond SLO¶
Latency Rollback:
# Flagger automatic rollback on latency degradation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
spec:
analysis:
metrics:
- name: request-duration-p95
thresholdRange:
max: 500 # Rollback if p95 latency > 500ms
interval: 30s
alerts:
- name: prometheus
severity: error
providerRef:
name: prometheus-alerts
namespace: monitoring
Health Checks Fail Consistently¶
Health Check Rollback:
# Flagger automatic rollback on health check failure
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
spec:
analysis:
webhooks:
- name: health-check
type: rollout
url: http://health-monitor:8080/check
timeout: 10s
metadata:
cmd: |
health=$(curl -s -o /dev/null -w "%{http_code}" http://atp-ingestion-canary:8080/health/live)
if [ "$health" != "200" ]; then
echo "Health check failed"
exit 1 # Triggers rollback
fi
Custom Metric-Based Rollback¶
Custom Metric Rollback:
# Flagger automatic rollback on custom metric
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
spec:
analysis:
metrics:
- name: business-metric-threshold
thresholdRange:
min: 95 # Rollback if business metric < 95
interval: 1m
queryTemplate: |
avg(rate(atp_business_metric_total{
service="{{ target }}"
}[1m]))
alerts:
- name: prometheus
severity: error
providerRef:
name: prometheus-alerts
namespace: monitoring
Flagger Rollback Status:
# Check if rollback occurred
kubectl get canary atp-ingestion -n atp-production -o jsonpath='{.status.conditions[?(@.type=="Promoted")].status}'
# View rollback reason
kubectl describe canary atp-ingestion -n atp-production | grep -A 10 "Status"
Deployment Windows¶
Scheduled Maintenance Windows¶
Maintenance Window Configuration:
# Flagger has no built-in cron schedule; gate rollouts to a maintenance window
# with a confirm-rollout webhook that returns 200 only during the window
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
spec:
analysis:
webhooks:
- name: maintenance-window-gate
type: confirm-rollout
url: http://deployment-gate.atp-production:8080/approve # e.g., approves only 02:00-04:00 UTC
Azure Pipeline Deployment Window:
# Azure Pipeline with deployment window
trigger: none
schedules:
- cron: "0 2 * * *" # 2 AM UTC daily
branches:
include:
- production
always: true
pool:
vmImage: 'ubuntu-latest'
stages:
- stage: Deploy
condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/production'))
jobs:
- job: Deploy
steps:
- script: |
echo "Deploying during maintenance window (2 AM UTC)"
# Deployment steps
Change Freeze Periods¶
Change Freeze Configuration:
# Change freeze marker: the annotation is an ATP convention for visibility only;
# the freeze itself is enforced by suspending FluxCD reconciliation (see script below)
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: atp-ingestion
annotations:
flagger.app/change-freeze: "true"
Change Freeze Script:
#!/bin/bash
# scripts/change-freeze.sh
ACTION="${1:-enable}" # enable or disable
NAMESPACE="${2:-atp-production}"
if [ "$ACTION" == "enable" ]; then
echo "🔒 Enabling change freeze"
kubectl annotate canary atp-ingestion -n $NAMESPACE \
flagger.app/change-freeze="true" \
--overwrite
# Suspend FluxCD reconciliations
flux suspend kustomization apps-production -n flux-system
echo "✅ Change freeze enabled"
elif [ "$ACTION" == "disable" ]; then
echo "🔓 Disabling change freeze"
kubectl annotate canary atp-ingestion -n $NAMESPACE \
flagger.app/change-freeze- \
--overwrite
# Resume FluxCD reconciliations
flux resume kustomization apps-production -n flux-system
echo "✅ Change freeze disabled"
fi
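Example usage around a planned freeze window:
# Freeze before the window, unfreeze after
./scripts/change-freeze.sh enable atp-production
./scripts/change-freeze.sh disable atp-production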
Emergency Deployment Procedures¶
Emergency Deployment Bypass:
#!/bin/bash
# scripts/emergency-deploy.sh
SERVICE="${1:-atp-ingestion}"
VERSION="${2:-v1.2.3-abc123d}"
NAMESPACE="${3:-atp-production}"
echo "🚨 Emergency deployment: $SERVICE@$VERSION"
# Bypass change freeze
kubectl annotate canary $SERVICE -n $NAMESPACE \
flagger.app/emergency-deploy="true" \
flagger.app/change-freeze- \
--overwrite
# Update the image directly; kubectl set image itself triggers a rollout,
# bypassing the canary analysis (container name matches the service name)
kubectl set image deployment/$SERVICE \
$SERVICE=connectsoft.azurecr.io/atp/$SERVICE:$VERSION \
-n $NAMESPACE
# Watch the emergency rollout (readiness probes still gate each new pod)
kubectl rollout status deployment/$SERVICE -n $NAMESPACE --timeout=5m
echo "✅ Emergency deployment initiated"
Zero-Downtime Deployments¶
Connection Draining¶
Connection Draining Configuration:
# Service with session affinity; on AKS, connection draining is achieved through
# preStop hooks and terminationGracePeriodSeconds (see below), not an AWS-style
# load-balancer annotation
apiVersion: v1
kind: Service
metadata:
name: atp-ingestion
namespace: atp-production
spec:
sessionAffinity: ClientIP
sessionAffinityConfig:
clientIP:
timeoutSeconds: 10800 # 3 hours
ports:
- port: 80
targetPort: 8080
Graceful Shutdown (SIGTERM Handling)¶
Graceful Shutdown Implementation:
// C#: Graceful shutdown handler
public class Program
{
public static async Task Main(string[] args)
{
var host = CreateHostBuilder(args).Build();
// Register graceful shutdown
var lifetime = host.Services.GetRequiredService<IHostApplicationLifetime>();
lifetime.ApplicationStopping.Register(() =>
{
Console.WriteLine("SIGTERM received, starting graceful shutdown...");
// Stop accepting new requests
// Wait for in-flight requests to complete
// Close connections
// Cleanup resources
Console.WriteLine("Graceful shutdown complete");
});
await host.RunAsync();
}
}
Graceful Shutdown in Kubernetes:
# Deployment with terminationGracePeriodSeconds
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
template:
spec:
terminationGracePeriodSeconds: 60 # 60 seconds for graceful shutdown
containers:
- name: atp-ingestion
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- |
# Drain connections
sleep 10
# Stop accepting new requests
curl -X POST http://localhost:8080/admin/shutdown
Pod Disruption Budget¶
Pod Disruption Budget Configuration:
# Pod Disruption Budget for zero-downtime
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: atp-ingestion-pdb
namespace: atp-production
spec:
minAvailable: 3 # Always maintain at least 3 pods available
selector:
matchLabels:
app: atp-ingestion
---
# Alternative: maxUnavailable
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: atp-ingestion-pdb
spec:
maxUnavailable: 1 # Allow maximum 1 pod unavailable
selector:
matchLabels:
app: atp-ingestion
Pod Disruption Budget Calculation:
- Total Pods: 5 replicas
- minAvailable: 3 pods
- During Rolling Update: Can terminate maximum 2 pods at a time
- Ensures: Always 3+ pods serving traffic (zero downtime)
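The live budget can be checked at any time; disruptionsAllowed shows how many voluntary evictions the cluster will currently permit:
# Inspect the PDB and the number of disruptions currently allowed
kubectl get pdb atp-ingestion-pdb -n atp-production
kubectl get pdb atp-ingestion-pdb -n atp-production -o jsonpath='{.status.disruptionsAllowed}{"\n"}'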
PreStop Hooks¶
PreStop Hook for Graceful Shutdown:
# Deployment with preStop hook
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
template:
spec:
terminationGracePeriodSeconds: 60
containers:
- name: atp-ingestion
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- |
echo "PreStop hook: Starting graceful shutdown..."
# Remove from load balancer
# Wait for connections to drain
# Send shutdown signal to application
curl -X POST http://localhost:8080/admin/drain || true
# Wait for in-flight requests
sleep 15
echo "PreStop hook: Graceful shutdown complete"
readinessProbe:
httpGet:
path: /health/ready
port: 8080
# Remove from service endpoints when readiness fails
periodSeconds: 5
Zero-Downtime Deployment Checklist:
## Zero-Downtime Deployment Checklist
### Pre-Deployment
- [ ] Readiness probe configured and tested
- [ ] Liveness probe configured
- [ ] Pod Disruption Budget configured (minAvailable or maxUnavailable)
- [ ] Graceful shutdown implemented (SIGTERM handler)
- [ ] PreStop hook configured
- [ ] Connection draining enabled
- [ ] terminationGracePeriodSeconds set appropriately (30-60s)
### During Deployment
- [ ] Rolling update strategy with maxUnavailable: 0
- [ ] maxSurge configured (1 or 2 extra pods)
- [ ] Monitor rollout progress (`kubectl rollout status`)
- [ ] Verify pods pass readiness probes
- [ ] Check traffic routing to new pods
### Post-Deployment
- [ ] Health checks passing
- [ ] Error rates within threshold
- [ ] Latency within SLO
- [ ] Business metrics validated
- [ ] Old pods terminated gracefully
Summary: Rolling Updates & Deployment Strategies¶
- Kubernetes Rolling Updates: Default rolling update strategy, maxSurge and maxUnavailable settings, rolling update process, monitoring rollout progress
- Blue-Green Deployments: Blue-green concept and benefits, implementation with namespace switching, traffic routing with Ingress, instant rollback, when to use blue-green
- Canary Releases: Canary deployment concept, traffic splitting strategies (Istio/Nginx), service mesh requirement, gradual traffic shift (10% → 50% → 100%), automated canary analysis
- Progressive Delivery with Flagger: Flagger overview and architecture, installation and configuration, canary resource definition, automated rollback on metric thresholds, integration with service mesh
- Feature Flags Integration: LaunchDarkly or Azure App Configuration, feature flag-based rollout, gradual feature enablement, kill switch for problem features
- Pre-Deployment Validation: Smoke tests before traffic routing, integration tests in canary, database migration checks, dependency availability checks
- Post-Deployment Validation: Health check monitoring, error rate thresholds, latency thresholds, business metric validation
- Automatic Rollback Triggers: Error rate exceeds threshold, latency degrades beyond SLO, health checks fail consistently, custom metric-based rollback
- Deployment Windows: Scheduled maintenance windows, change freeze periods, emergency deployment procedures
- Zero-Downtime Deployments: Connection draining, graceful shutdown (SIGTERM handling), Pod disruption budgets, PreStop hooks
Preview Environments (Ephemeral)¶
Purpose: Define how ephemeral preview environments are automatically provisioned for pull requests, used for isolated testing and validation, and automatically cleaned up after PR merge or closure, ensuring developers can test changes in a production-like environment without manual infrastructure setup while optimizing resource costs.
Preview Environment Architecture¶
Ephemeral Namespaces in Dev Cluster¶
Preview Environment Architecture:
graph TB
subgraph "Dev AKS Cluster"
subgraph "PR #123 Preview"
NS1[Namespace: atp-preview-pr123]
SVC1[Service: atp-ingestion]
PODS1[Pods: v1.2.3<br/>1 replica]
ING1[Ingress: pr123.preview.atp.connectsoft.example]
end
subgraph "PR #124 Preview"
NS2[Namespace: atp-preview-pr124]
SVC2[Service: atp-ingestion]
PODS2[Pods: v1.2.4<br/>1 replica]
ING2[Ingress: pr124.preview.atp.connectsoft.example]
end
subgraph "Shared Resources"
MON[Shared Monitoring]
DB[Shared Test DB]
end
end
NS1 --> MON
NS2 --> MON
NS1 --> DB
NS2 --> DB
ING1 --> SVC1
ING2 --> SVC2
style NS1 fill:#90EE90
style NS2 fill:#90EE90
style MON fill:#FFE5B4
style DB fill:#FFE5B4
Preview Namespace Structure:
atp-preview-pr123/
├── deployments/
│ ├── atp-ingestion/
│ ├── atp-query/
│ └── atp-gateway/
├── services/
├── ingress/
├── configmaps/
└── secrets/ (references from External Secrets)
Namespace Naming Convention:
- Format: atp-preview-pr{PR_NUMBER}
- Examples: atp-preview-pr123, atp-preview-pr456, atp-preview-pr789
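The labels applied at provisioning make these namespaces easy to enumerate; for example:
# List active preview namespaces with their PR metadata columns
kubectl get namespaces -l preview=true -L pr-number,ttl,created-by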
Resource Isolation per PR¶
Resource Isolation Strategy:
| Resource | Isolation Level | Sharing | Rationale |
|---|---|---|---|
| Namespace | ✅ Complete isolation | ❌ Per PR | Complete resource isolation |
| Deployments | ✅ Isolated | ❌ Per PR | Independent testing |
| Services | ✅ Isolated | ❌ Per PR | Independent service endpoints |
| Ingress | ✅ Isolated hostname | ❌ Per PR | Unique preview URL |
| ConfigMaps | ✅ Isolated | ❌ Per PR | PR-specific configuration |
| Secrets | ⚠️ Referenced | ✅ Shared Key Vault | Cost optimization |
| Database | ⚠️ Shared/Mocked | ✅ Shared test DB | Cost optimization |
| Redis | ⚠️ Shared | ✅ Shared test Redis | Cost optimization |
| Monitoring | ✅ Namespace labels | ✅ Shared Prometheus | Cost optimization |
Resource Isolation Configuration:
# Preview namespace with labels for isolation; the timestamp lives in an
# annotation because label values may not contain ':'
apiVersion: v1
kind: Namespace
metadata:
name: atp-preview-pr123
labels:
environment: preview
preview: "true"
pr-number: "123"
created-by: "azure-pipelines"
auto-cleanup: "true"
ttl: "24h" # Time-to-live for auto-cleanup
annotations:
created-at: "2024-01-15T10:00:00Z"
Cost Optimization Strategies¶
Cost Optimization Matrix:
| Strategy | Implementation | Cost Savings |
|---|---|---|
| Single Replica | 1 replica vs 3 in dev | ✅ ~67% reduction |
| Minimal Resources | 100m CPU, 256Mi memory | ✅ ~80% reduction |
| Shared Node Pool | Use dev cluster node pool | ✅ No additional nodes |
| Auto-Shutdown | Scale to zero after 4h inactivity | ✅ ~60% reduction |
| Spot Instances | Use spot node pool | ✅ ~90% cost reduction |
| Shared Dependencies | Shared test DB/Redis | ✅ Significant savings |
Cost Comparison:
| Environment | Replicas | CPU/Memory per Pod | Monthly Cost (Est.) |
|---|---|---|---|
| Dev | 3 | 500m / 1Gi | $150 |
| Preview (Standard) | 1 | 500m / 1Gi | $50 |
| Preview (Optimized) | 1 | 100m / 256Mi | $10 |
Lifecycle Management¶
Preview Environment Lifecycle:
stateDiagram-v2
[*] --> PR_Created: Developer opens PR
PR_Created --> Provisioning: Azure Pipeline triggered
Provisioning --> Active: Preview ready
Active --> Testing: Integration tests
Testing --> Active: Tests pass
Active --> Idle: 4h inactivity
Idle --> Active: New activity
Active --> Cleaning: PR merged/closed
Idle --> Cleaning: TTL expired
Cleaning --> [*]: Resources deleted
Active --> Failed: Tests fail
Failed --> Cleaning: Manual cleanup
Lifecycle States:
| State | Description | Actions |
|---|---|---|
| Provisioning | Namespace and resources being created | Create namespace, deploy manifests |
| Active | Preview environment running, receiving traffic | Monitor health, run tests |
| Testing | Integration tests executing | Execute test suite |
| Idle | No activity for 4+ hours | Scale to zero, monitor for activity |
| Cleaning | Resources being deleted | Delete namespace and all resources |
| Failed | Provisioning or testing failed | Retry or manual cleanup |
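The Idle → Cleaning transition on TTL expiry can be implemented as a scheduled sweep over the provisioning labels and annotations; a hypothetical sketch (assumes GNU date and the created-at annotation set at provisioning):
#!/bin/bash
# TTL sweep: delete auto-cleanup previews older than 24 hours
CUTOFF=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)
for ns in $(kubectl get ns -l preview=true,auto-cleanup=true -o jsonpath='{.items[*].metadata.name}'); do
  created=$(kubectl get ns "$ns" -o jsonpath='{.metadata.annotations.created-at}')
  # ISO-8601 UTC timestamps compare correctly as plain strings
  if [[ -n "$created" && "$created" < "$CUTOFF" ]]; then
    echo "🧹 Deleting expired preview: $ns (created $created)"
    kubectl delete ns "$ns" --wait=false
  fi
done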
Automatic Provisioning on PR Creation¶
Azure Pipeline Triggered by PR¶
PR Trigger Configuration:
# azure-pipelines-preview.yml
trigger: none # No CI trigger
pr:
branches:
include:
- dev
- test
- staging
paths:
include:
- apps/**/*
- infrastructure/**/*
exclude:
- docs/**/*
pool:
vmImage: 'ubuntu-latest'
variables:
- group: atp-preview-env
- name: PR_NUMBER
value: $(System.PullRequest.PullRequestNumber)
- name: PR_BRANCH
value: $(System.PullRequest.SourceBranch)
- name: PREVIEW_NAMESPACE
value: atp-preview-pr$(PR_NUMBER)
- name: PREVIEW_HOSTNAME
value: pr$(PR_NUMBER).preview.atp.connectsoft.example
stages:
- stage: ProvisionPreview
displayName: 'Provision Preview Environment'
condition: and(succeeded(), ne(variables['Build.Reason'], 'Manual'))
jobs:
- job: Provision
displayName: 'Create Preview Environment'
steps:
- task: AzureCLI@2
displayName: 'Get PR details'
inputs:
azureSubscription: 'ATP-NonProd-ServiceConnection'
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
echo "PR Number: $(PR_NUMBER)"
echo "PR Branch: $(PR_BRANCH)"
echo "Preview Namespace: $(PREVIEW_NAMESPACE)"
echo "Preview Hostname: $(PREVIEW_HOSTNAME)"
- task: Bash@3
displayName: 'Generate Preview Manifests'
inputs:
targetType: 'inline'
script: |
# Generate preview manifests
./scripts/generate-preview-manifests.sh \
--pr-number $(PR_NUMBER) \
--branch $(PR_BRANCH) \
--namespace $(PREVIEW_NAMESPACE) \
--hostname $(PREVIEW_HOSTNAME) \
--output-dir ./preview-manifests
Namespace Creation Script¶
Namespace Creation:
#!/bin/bash
# scripts/create-preview-namespace.sh
PR_NUMBER="${1}"
NAMESPACE="atp-preview-pr${PR_NUMBER}"
TTL="${2:-24h}" # Default 24 hours
echo "📦 Creating preview namespace: ${NAMESPACE}"
# Create namespace (idempotent)
kubectl create namespace "${NAMESPACE}" --dry-run=client -o yaml | kubectl apply -f -
# Apply labels (label values may not contain ':', so the timestamp goes in an annotation)
kubectl label namespace "${NAMESPACE}" \
environment=preview \
preview=true \
pr-number="${PR_NUMBER}" \
auto-cleanup=true \
ttl="${TTL}" \
created-by=azure-pipelines \
--overwrite
kubectl annotate namespace "${NAMESPACE}" \
created-at="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--overwrite
# Create ResourceQuota for cost control
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
name: preview-quota
namespace: ${NAMESPACE}
spec:
hard:
requests.cpu: "2"
requests.memory: 4Gi
limits.cpu: "4"
limits.memory: 8Gi
persistentvolumeclaims: "2"
pods: "10"
services: "5"
EOF
# Create LimitRange for default resource limits
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: LimitRange
metadata:
name: preview-limits
namespace: ${NAMESPACE}
spec:
limits:
- default:
cpu: "100m"
memory: "256Mi"
defaultRequest:
cpu: "50m"
memory: "128Mi"
type: Container
EOF
echo "✅ Preview namespace created: ${NAMESPACE}"
Manifest Generation with PR-Specific Values¶
Preview Manifest Generation Script:
#!/bin/bash
# scripts/generate-preview-manifests.sh
PR_NUMBER="${1}"
BRANCH="${2}"
NAMESPACE="atp-preview-pr${PR_NUMBER}"
HOSTNAME="${3}"
OUTPUT_DIR="${4:-./preview-manifests}"
echo "🔨 Generating preview manifests for PR #${PR_NUMBER}"
mkdir -p "${OUTPUT_DIR}"
# Generate namespace
cat > "${OUTPUT_DIR}/namespace.yaml" <<EOF
apiVersion: v1
kind: Namespace
metadata:
name: ${NAMESPACE}
  labels:
    environment: preview
    preview: "true"
    pr-number: "${PR_NUMBER}"
    auto-cleanup: "true"
    ttl: "24h"
    created-by: "azure-pipelines"
  annotations:
    # branch and created-at are annotations: label values may not
    # contain '/' or ':' characters
    branch: "${BRANCH}"
    created-at: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
EOF
# Generate Kustomization with PR-specific values
cat > "${OUTPUT_DIR}/kustomization.yaml" <<EOF
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ${NAMESPACE}
resources:
- ../../apps/atp-ingestion/base
- ../../apps/atp-query/base
- ../../apps/atp-gateway/base
commonLabels:
environment: preview
pr-number: "${PR_NUMBER}"
patchesStrategicMerge:
- preview-patch.yaml
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: ${BRANCH}-$(git rev-parse --short HEAD)
- name: connectsoft.azurecr.io/atp/query
newTag: ${BRANCH}-$(git rev-parse --short HEAD)
- name: connectsoft.azurecr.io/atp/gateway
newTag: ${BRANCH}-$(git rev-parse --short HEAD)
EOF
# Generate preview-specific patches
cat > "${OUTPUT_DIR}/preview-patch.yaml" <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 1 # Single replica for preview
template:
spec:
containers:
- name: atp-ingestion
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
env:
- name: ASPNETCORE_ENVIRONMENT
value: "Preview"
- name: Preview__PRNumber
value: "${PR_NUMBER}"
- name: Preview__Hostname
value: "${HOSTNAME}"
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: atp-ingestion-ingress
annotations:
cert-manager.io/cluster-issuer: letsencrypt-staging
spec:
rules:
- host: ${HOSTNAME}
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: atp-ingestion
port:
number: 80
EOF
echo "✅ Preview manifests generated in ${OUTPUT_DIR}"
FluxCD Kustomization for Preview¶
Preview Kustomization Resource:
# clusters/dev/preview-kustomizations/pr123-kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: preview-pr123
namespace: flux-system
labels:
preview: "true"
pr-number: "123"
spec:
interval: 1m
path: ./apps/preview/pr123
prune: true # Auto-prune in preview
wait: false # Don't wait for readiness
timeout: 5m
sourceRef:
kind: GitRepository
name: atp-gitops-dev
dependsOn:
- name: infrastructure
Dynamic Preview Kustomization Creation:
#!/bin/bash
# scripts/create-preview-kustomization.sh
PR_NUMBER="${1}"
NAMESPACE="atp-preview-pr${PR_NUMBER}"
echo "🔧 Creating FluxCD Kustomization for preview PR #${PR_NUMBER}"
# Create preview Kustomization
cat <<EOF | kubectl apply -f -
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: preview-pr${PR_NUMBER}
namespace: flux-system
labels:
preview: "true"
pr-number: "${PR_NUMBER}"
auto-cleanup: "true"
spec:
interval: 1m
path: ./apps/preview/pr${PR_NUMBER}
prune: true
wait: false
timeout: 5m
sourceRef:
kind: GitRepository
name: atp-gitops-dev
dependsOn:
- name: infrastructure
EOF
echo "✅ Preview Kustomization created"
Dynamic Manifest Generation¶
Namespace: atp-preview-pr{number}¶
Dynamic Namespace Template:
# templates/preview-namespace-template.yaml
apiVersion: v1
kind: Namespace
metadata:
name: atp-preview-pr{{ .Values.prNumber }}
  labels:
    environment: preview
    preview: "true"
    pr-number: "{{ .Values.prNumber }}"
    auto-cleanup: "true"
    ttl: "{{ .Values.ttl | default "24h" }}"
    created-by: "azure-pipelines"
  annotations:
    # branch and created-at may contain '/' or ':', which are invalid
    # in label values
    branch: "{{ .Values.branch }}"
    created-at: "{{ .Values.createdAt }}"
Namespace Generation:
# Generate namespace with PR number
PR_NUMBER=123
NAMESPACE="atp-preview-pr${PR_NUMBER}"
kubectl create namespace "${NAMESPACE}" \
--dry-run=client -o yaml | \
kubectl label --local -f - \
environment=preview \
pr-number="${PR_NUMBER}" \
-o yaml | \
kubectl apply -f -
Ingress Hostname: pr{number}.preview.atp.connectsoft.example¶
Dynamic Ingress Generation:
# Generated Ingress for PR #123
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: atp-preview-ingress
namespace: atp-preview-pr123
annotations:
cert-manager.io/cluster-issuer: letsencrypt-staging
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
ingressClassName: nginx
tls:
- hosts:
- pr123.preview.atp.connectsoft.example
secretName: preview-pr123-tls
rules:
- host: pr123.preview.atp.connectsoft.example
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: atp-ingestion
port:
number: 80
Hostname Generation Script:
#!/bin/bash
# scripts/generate-preview-hostname.sh
PR_NUMBER="${1}"
BASE_DOMAIN="preview.atp.connectsoft.example"
PREVIEW_HOSTNAME="pr${PR_NUMBER}.${BASE_DOMAIN}"
echo "${PREVIEW_HOSTNAME}"
# Output: pr123.preview.atp.connectsoft.example
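Wildcard DNS Record:
Per-PR hostnames resolve without any per-environment DNS work if a single wildcard record points at the ingress controller's public IP. A hedged sketch using Azure DNS; the zone name, resource group, and ingress service names are assumptions:
# One-time setup: wildcard A record for *.preview.atp.connectsoft.example
# pointing at the NGINX ingress controller's public IP (names assumed)
INGRESS_IP=$(kubectl get service ingress-nginx-controller \
  -n ingress-nginx \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
az network dns record-set a add-record \
  --resource-group atp-nonprod-rg \
  --zone-name atp.connectsoft.example \
  --record-set-name "*.preview" \
  --ipv4-address "${INGRESS_IP}"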
Resource Limits (Smaller than Dev)¶
Preview Resource Limits:
# Preview ResourceQuota (smaller than dev)
apiVersion: v1
kind: ResourceQuota
metadata:
name: preview-quota
namespace: atp-preview-pr123
spec:
hard:
requests.cpu: "2" # 2 CPU total (vs 8 in dev)
requests.memory: 4Gi # 4Gi memory (vs 16Gi in dev)
limits.cpu: "4" # 4 CPU limit (vs 16 in dev)
limits.memory: 8Gi # 8Gi limit (vs 32Gi in dev)
pods: "10" # 10 pods max (vs 50 in dev)
services: "5" # 5 services max
Preview Deployment Resource Limits:
# Preview Deployment with minimal resources
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
namespace: atp-preview-pr123
spec:
replicas: 1 # Single replica
template:
spec:
containers:
- name: atp-ingestion
resources:
requests:
cpu: 100m # 100m CPU (vs 500m in dev)
memory: 256Mi # 256Mi memory (vs 1Gi in dev)
limits:
cpu: 500m # 500m CPU limit (vs 2000m in dev)
memory: 512Mi # 512Mi limit (vs 2Gi in dev)
Resource Comparison:
| Resource | Dev | Preview | Reduction |
|---|---|---|---|
| Replicas | 3 | 1 | 67% |
| CPU Request | 500m | 100m | 80% |
| Memory Request | 1Gi | 256Mi | 75% |
| CPU Limit | 2000m | 500m | 75% |
| Memory Limit | 2Gi | 512Mi | 75% |
Image Tag from PR Branch¶
Image Tag Strategy:
- Format: {BRANCH_NAME}-{SHORT_COMMIT_SHA}
- Examples: feature-123-abc456d, bugfix-456-def789g
Image Tag Generation:
#!/bin/bash
# scripts/generate-preview-image-tag.sh
BRANCH="${1}"
COMMIT_SHA="${2:-$(git rev-parse --short HEAD)}"
# Sanitize branch name (remove special characters)
SANITIZED_BRANCH=$(echo "${BRANCH}" | sed 's/[^a-zA-Z0-9]/-/g' | tr '[:upper:]' '[:lower:]' | cut -c1-50)
IMAGE_TAG="${SANITIZED_BRANCH}-${COMMIT_SHA}"
echo "${IMAGE_TAG}"
# Output: feature-123-abc456d
Kustomize Image Override:
# apps/preview/pr123/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: feature-123-abc456d # PR branch + commit SHA
- name: connectsoft.azurecr.io/atp/query
newTag: feature-123-abc456d
- name: connectsoft.azurecr.io/atp/gateway
newTag: feature-123-abc456d
FluxCD Configuration for Previews¶
Dynamic GitRepository per PR¶
Preview GitRepository:
# clusters/dev/preview-gitrepositories/pr123-gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: preview-pr123-git
namespace: flux-system
labels:
preview: "true"
pr-number: "123"
spec:
interval: 30s # Fast polling for preview
url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
ref:
branch: feature/123-new-feature # PR branch
secretRef:
name: gitops-credentials
ignore: |
  # .gitignore-style patterns (Flux does not support regex here):
  # exclude other environments and other PRs' preview paths
  /production/
  /staging/
  /test/
  /apps/preview/*
  !/apps/preview/pr123/
Dynamic GitRepository Creation:
#!/bin/bash
# scripts/create-preview-gitrepository.sh
PR_NUMBER="${1}"
PR_BRANCH="${2}"
echo "📂 Creating GitRepository for preview PR #${PR_NUMBER}"
cat <<EOF | kubectl apply -f -
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: preview-pr${PR_NUMBER}-git
namespace: flux-system
labels:
preview: "true"
pr-number: "${PR_NUMBER}"
auto-cleanup: "true"
spec:
interval: 30s
url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
ref:
branch: ${PR_BRANCH}
secretRef:
name: gitops-credentials
EOF
echo "✅ Preview GitRepository created"
Preview Kustomization¶
Preview Kustomization Configuration:
# clusters/dev/preview-kustomizations/pr123-kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: preview-pr123
namespace: flux-system
labels:
preview: "true"
pr-number: "123"
spec:
interval: 1m
path: ./apps/preview/pr123
prune: true # Auto-prune deleted resources
wait: false # Don't wait for readiness
timeout: 5m
retryInterval: 1m
sourceRef:
kind: GitRepository
name: preview-pr123-git
dependsOn:
- name: infrastructure
Sync Policies for Preview¶
Preview Sync Policy:
| Policy | Value | Rationale |
|---|---|---|
| Auto-Sync | ✅ Enabled | Fast feedback for developers |
| Prune | ✅ Enabled | Clean up deleted resources |
| Wait | ❌ Disabled | Don't block on readiness |
| Timeout | 5m | Fast timeout for quick feedback |
| Retry Interval | 1m | Quick retries |
Sync Policy Configuration:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: preview-pr123
spec:
interval: 1m
prune: true # Auto-prune
wait: false # Don't wait
timeout: 5m # Fast timeout
retryInterval: 1m
Health Checks and Validation¶
Preview Health Check:
# Health check webhook for preview
apiVersion: v1
kind: Service
metadata:
name: preview-health-check
namespace: atp-preview-pr123
spec:
selector:
app: atp-ingestion
ports:
- port: 8080
targetPort: 8080
---
# Health check Job
apiVersion: batch/v1
kind: Job
metadata:
name: preview-health-check
namespace: atp-preview-pr123
spec:
template:
spec:
containers:
- name: health-check
image: curlimages/curl:latest
command:
- /bin/sh
- -c
- |
echo "Checking preview environment health..."
# Wait for pods to be ready
sleep 30
# Check liveness
curl -f http://atp-ingestion:8080/health/live || exit 1
# Check readiness
curl -f http://atp-ingestion:8080/health/ready || exit 1
echo "✅ Preview environment is healthy"
restartPolicy: Never
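Blocking on the Health Check:
The provisioning pipeline can block on this Job before posting the preview URL to the PR. A minimal sketch:
# Wait for the health check Job to complete, then surface its result
kubectl wait --for=condition=complete \
  job/preview-health-check -n atp-preview-pr123 --timeout=5m || {
  echo "❌ Health check failed or timed out"
  kubectl logs job/preview-health-check -n atp-preview-pr123
  exit 1
}
echo "✅ Preview health check passed"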
Resource Cleanup¶
Auto-Delete After PR Merge¶
Auto-Cleanup on PR Merge:
# Azure Pipeline: Cleanup after merge
trigger: none
pr:
  branches:
    include:
      - dev
      - test
  autoCancel: false
pool:
vmImage: 'ubuntu-latest'
variables:
- name: PR_NUMBER
  # Runtime macro syntax; PullRequestId is the Azure Repos PR number
  value: $(System.PullRequest.PullRequestId)
- name: PREVIEW_NAMESPACE
value: atp-preview-pr$(PR_NUMBER)
stages:
- stage: CleanupPreview
displayName: 'Cleanup Preview Environment'
condition: and(succeeded(), eq(variables['System.PullRequest.Status'], 'Completed'))
jobs:
- job: Cleanup
displayName: 'Delete Preview Resources'
steps:
- task: Bash@3
displayName: 'Delete Preview Namespace'
inputs:
targetType: 'inline'
script: |
./scripts/cleanup-preview-environment.sh \
--pr-number $(PR_NUMBER) \
--reason "PR merged"
Auto-Delete After PR Close¶
Auto-Cleanup on PR Close:
#!/bin/bash
# scripts/cleanup-preview-on-close.sh
PR_NUMBER="${1}"
REASON="${2:-PR closed}"
echo "🧹 Cleaning up preview environment for PR #${PR_NUMBER}: ${REASON}"
NAMESPACE="atp-preview-pr${PR_NUMBER}"
# Delete namespace (cascades to all resources)
kubectl delete namespace "${NAMESPACE}" --wait=true --timeout=5m || true
# Delete FluxCD Kustomization
kubectl delete kustomization preview-pr${PR_NUMBER} -n flux-system || true
# Delete FluxCD GitRepository
kubectl delete gitrepository preview-pr${PR_NUMBER}-git -n flux-system || true
# Clean up GitOps manifests in Git
./scripts/cleanup-preview-manifests.sh --pr-number "${PR_NUMBER}"
echo "✅ Preview environment cleaned up"
Manual Cleanup for Stuck Resources¶
Manual Cleanup Script:
#!/bin/bash
# scripts/manual-cleanup-preview.sh
PR_NUMBER="${1}"
if [ -z "${PR_NUMBER}" ]; then
echo "Usage: $0 <PR_NUMBER>"
echo "Example: $0 123"
exit 1
fi
NAMESPACE="atp-preview-pr${PR_NUMBER}"
echo "🧹 Manual cleanup for PR #${PR_NUMBER}"
# Force delete namespace (if stuck)
kubectl delete namespace "${NAMESPACE}" --force --grace-period=0 || true
# Wait and check if namespace still exists
sleep 10
if kubectl get namespace "${NAMESPACE}" 2>/dev/null; then
echo "⚠️ Namespace still exists, forcing deletion..."
# Patch namespace to remove finalizers
kubectl patch namespace "${NAMESPACE}" -p '{"metadata":{"finalizers":[]}}' --type=merge
# Delete again
kubectl delete namespace "${NAMESPACE}" --force --grace-period=0
fi
# Clean up FluxCD resources
kubectl delete kustomization preview-pr${PR_NUMBER} -n flux-system --ignore-not-found=true
kubectl delete gitrepository preview-pr${PR_NUMBER}-git -n flux-system --ignore-not-found=true
# Clean up any remaining pods
kubectl delete pods --all -n "${NAMESPACE}" --force --grace-period=0 2>/dev/null || true
echo "✅ Manual cleanup complete"
List All Preview Environments:
#!/bin/bash
# scripts/list-preview-environments.sh
echo "📋 Active Preview Environments:"
echo ""
kubectl get namespaces -l preview=true --no-headers | while read -r line; do
NAMESPACE=$(echo "$line" | awk '{print $1}')
AGE=$(echo "$line" | awk '{print $3}')   # columns: NAME STATUS AGE
PR_NUMBER=$(kubectl get namespace "${NAMESPACE}" -o jsonpath='{.metadata.labels.pr-number}')
TTL=$(kubectl get namespace "${NAMESPACE}" -o jsonpath='{.metadata.labels.ttl}')
echo "PR #${PR_NUMBER}: ${NAMESPACE}"
echo " Created: ${CREATED}"
echo " TTL: ${TTL}"
echo ""
done
Cost Tracking and Alerts¶
Cost Tracking:
#!/bin/bash
# scripts/track-preview-costs.sh
echo "💰 Preview Environment Cost Tracking"
echo ""
# Get all preview namespaces
kubectl get namespaces -l preview=true --no-headers | while read -r line; do
NAMESPACE=$(echo "$line" | awk '{print $1}')
PR_NUMBER=$(kubectl get namespace "${NAMESPACE}" -o jsonpath='{.metadata.labels.pr-number}')
CREATED=$(kubectl get namespace "${NAMESPACE}" -o jsonpath='{.metadata.annotations.created-at}')
# Calculate hours since creation
CREATED_TIMESTAMP=$(date -d "${CREATED}" +%s 2>/dev/null || echo "0")
CURRENT_TIMESTAMP=$(date +%s)
HOURS=$(( (CURRENT_TIMESTAMP - CREATED_TIMESTAMP) / 3600 ))
# Estimate cost (assuming $0.10/hour per preview environment)
ESTIMATED_COST=$(echo "scale=2; $HOURS * 0.10" | bc)
echo "PR #${PR_NUMBER}: ${HOURS} hours, ~\$${ESTIMATED_COST}"
done
echo ""
echo "Total active preview environments: $(kubectl get namespaces -l preview=true --no-headers | wc -l)"
Cost Alert:
# PrometheusRule for preview cost alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: preview-cost-alert
namespace: monitoring
spec:
groups:
- name: preview-cost
rules:
- alert: TooManyPreviewEnvironments
expr: |
count(kube_namespace_labels{label_preview="true"}) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "Too many preview environments active"
description: "{{ $value }} preview environments are active (threshold: 10)"
Cost Optimization¶
Shared Node Pools¶
Preview Node Pool Strategy:
| Strategy | Node Pool Type | Cost | Rationale |
|---|---|---|---|
| Shared with Dev | ✅ Dev node pool | ✅ Low | ✅ Recommended - No additional nodes |
| Dedicated Preview Pool | ❌ Separate pool | ❌ High | ❌ Not recommended (cost) |
| Spot Instance Pool | ⚠️ Spot nodes | ✅ Very Low | ⚠️ Consider for cost optimization |
Preview Node Pool Configuration:
# Use existing dev node pool (no additional cost)
# Preview pods scheduled on dev cluster nodes
# No dedicated node pool needed
Reduced Replica Counts (1 vs 3)¶
Replica Count Configuration:
# Preview Deployment: Single replica
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
namespace: atp-preview-pr123
spec:
replicas: 1 # Single replica (vs 3 in dev)
Cost Savings:
- Dev: 3 replicas × $0.05/hour = $0.15/hour
- Preview: 1 replica × $0.05/hour = $0.05/hour
- Savings: 67% cost reduction
Minimal Resource Requests¶
Minimal Resource Configuration:
# Preview resources: Minimal requests
resources:
requests:
cpu: 100m # 100m (vs 500m in dev) - 80% reduction
memory: 256Mi # 256Mi (vs 1Gi in dev) - 75% reduction
limits:
cpu: 500m # 500m (vs 2000m in dev) - 75% reduction
memory: 512Mi # 512Mi (vs 2Gi in dev) - 75% reduction
Auto-Shutdown After 4 Hours of Inactivity¶
Inactivity Detection and Auto-Shutdown:
# CronJob to detect inactivity and scale to zero
apiVersion: batch/v1
kind: CronJob
metadata:
name: preview-inactivity-check
namespace: monitoring
spec:
schedule: "*/15 * * * *" # Every 15 minutes
jobTemplate:
spec:
template:
spec:
containers:
- name: inactivity-check
image: bitnami/kubectl:latest
command:
- /bin/sh
- -c
- |
# Check all preview namespaces
kubectl get namespaces -l preview=true -o json | \
jq -r '.items[].metadata.name' | \
while read namespace; do
# Check last activity (last HTTP request)
LAST_ACTIVITY=$(kubectl get namespace "${namespace}" -o jsonpath='{.metadata.annotations.last-activity-time}')
if [ -z "${LAST_ACTIVITY}" ]; then
CREATED=$(kubectl get namespace "${namespace}" -o jsonpath='{.metadata.annotations.created-at}')
LAST_ACTIVITY="${CREATED}"
fi
# Calculate hours since last activity
LAST_TS=$(date -d "${LAST_ACTIVITY}" +%s)
CURRENT_TS=$(date +%s)
HOURS=$(( (CURRENT_TS - LAST_TS) / 3600 ))
# Scale to zero if inactive for 4+ hours
if [ "${HOURS}" -ge 4 ]; then
echo "Scaling down ${namespace} (inactive for ${HOURS} hours)"
# Scale all deployments to zero
kubectl get deployments -n "${namespace}" -o json | \
jq -r '.items[].metadata.name' | \
while read deployment; do
kubectl scale deployment "${deployment}" -n "${namespace}" --replicas=0
done
fi
done
restartPolicy: OnFailure
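RBAC for the Inactivity Checker:
As written, the CronJob's pod needs permission to list namespaces and scale deployments, which the default service account does not have. A sketch of the required RBAC; the resource names are assumptions, and the CronJob pod spec would reference the account via serviceAccountName: preview-inactivity-checker:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: preview-inactivity-checker
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: preview-inactivity-checker
rules:
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments", "deployments/scale"]
    verbs: ["get", "list", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: preview-inactivity-checker
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: preview-inactivity-checker
subjects:
  - kind: ServiceAccount
    name: preview-inactivity-checker
    namespace: monitoring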
Activity Tracking:
# Track activity on the namespace: the inactivity CronJob above reads
# .metadata.annotations.last-activity-time from the namespace object
kubectl annotate namespace atp-preview-pr123 \
  last-activity-time="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite
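Waking a Scaled-Down Preview:
Scaled-to-zero previews need a way back when a developer returns. A minimal sketch; the helper script name is an assumption, and the replica count of 1 matches the preview defaults above:
#!/bin/bash
# scripts/wake-preview.sh (hypothetical helper)
PR_NUMBER="${1}"
NAMESPACE="atp-preview-pr${PR_NUMBER}"
# Scale every deployment back to the preview default of 1 replica
kubectl get deployments -n "${NAMESPACE}" -o name | while read -r deployment; do
  kubectl scale "${deployment}" -n "${NAMESPACE}" --replicas=1
done
# Refresh the activity timestamp so the next sweep does not re-suspend it
kubectl annotate namespace "${NAMESPACE}" \
  last-activity-time="$(date -u +%Y-%m-%dT%H:%M:%SZ)" --overwrite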
Spot Instances for Preview Environments¶
Spot Node Pool Configuration:
// Pulumi: Spot node pool for preview environments
var previewNodePool = new ContainerService.KubernetesClusterNodePool("preview-spot-pool", new()
{
KubernetesClusterId = devCluster.Id,
VmSize = "Standard_D4s_v3",
NodeCount = 2,
Priority = "Spot",
EvictionPolicy = "Delete",
SpotMaxPrice = 0.05, // Max $0.05/hour (vs $0.20 for regular)
NodeTaints = new[]
{
"preview=true:NoSchedule"
},
NodeLabels = new()
{
{ "pool", "preview-spot" },
{ "preview", "true" },
},
});
Preview Pod Tolerations:
# Preview Deployment with spot tolerations
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
namespace: atp-preview-pr123
spec:
template:
spec:
tolerations:
- key: preview
operator: Equal
value: "true"
effect: NoSchedule
nodeSelector:
pool: preview-spot
Access Control¶
Preview URL Generation¶
Preview URL Format:
- Format: https://pr{PR_NUMBER}.preview.atp.connectsoft.example
- Examples: https://pr123.preview.atp.connectsoft.example, https://pr456.preview.atp.connectsoft.example
URL Generation Script:
#!/bin/bash
# scripts/generate-preview-url.sh
PR_NUMBER="${1}"
BASE_DOMAIN="preview.atp.connectsoft.example"
PREVIEW_URL="https://pr${PR_NUMBER}.${BASE_DOMAIN}"
echo "${PREVIEW_URL}"
# Output: https://pr123.preview.atp.connectsoft.example
Update PR Description with Preview URL:
#!/bin/bash
# scripts/update-pr-with-preview-url.sh
PR_NUMBER="${1}"
PREVIEW_URL="${2}"
echo "🔗 Updating PR #${PR_NUMBER} with preview URL: ${PREVIEW_URL}"
# Add preview URL to PR description via Azure DevOps API
az repos pr update \
--organization "https://dev.azure.com/ConnectSoft" \
--project "ATP" \
--pull-request-id "${PR_NUMBER}" \
--description "
## 🚀 Preview Environment
Preview environment is ready for testing:
**Preview URL**: ${PREVIEW_URL}
**Status**: ✅ Active
**Services**:
- API Gateway: ${PREVIEW_URL}/gateway
- Ingestion: ${PREVIEW_URL}/ingestion
- Query: ${PREVIEW_URL}/query
**TTL**: 24 hours (auto-cleanup after PR merge/close)
"
Authentication for Preview Environments¶
Preview Authentication Options:
| Method | Implementation | Security | ATP Selection |
|---|---|---|---|
| No Auth | Public access | ❌ None | ❌ Not recommended |
| Basic Auth | Nginx basic auth | ⚠️ Low | ⚠️ Option for simple testing |
| OAuth/SSO | Azure AD integration | ✅ High | ✅ Recommended for production-like testing |
| IP Whitelist | Network policy | ⚠️ Moderate | ⚠️ Option for restricted access |
Basic Auth Configuration:
# Ingress with basic auth
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: atp-preview-ingress
namespace: atp-preview-pr123
annotations:
nginx.ingress.kubernetes.io/auth-type: basic
nginx.ingress.kubernetes.io/auth-secret: preview-basic-auth
nginx.ingress.kubernetes.io/auth-realm: "Preview Environment - PR #123"
spec:
ingressClassName: nginx
rules:
- host: pr123.preview.atp.connectsoft.example
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: atp-ingestion
port:
number: 80
---
# Basic auth secret
apiVersion: v1
kind: Secret
metadata:
name: preview-basic-auth
namespace: atp-preview-pr123
type: Opaque
data:
  # nginx expects an htpasswd entry here, not a plain user:password string.
  # Generate with: htpasswd -nb preview preview123 | base64 -w0
  auth: <base64-encoded htpasswd entry>
OAuth Configuration:
# Ingress with OAuth
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: atp-preview-ingress
namespace: atp-preview-pr123
annotations:
cert-manager.io/cluster-issuer: letsencrypt-staging
    # auth-url is resolved by the ingress controller from inside the cluster;
    # auth-signin must be a URL the user's browser can reach
    nginx.ingress.kubernetes.io/auth-url: "http://oauth2-proxy.atp-production.svc.cluster.local/oauth2/auth"
    nginx.ingress.kubernetes.io/auth-signin: "https://<external-oauth2-proxy-host>/oauth2/start?rd=$scheme://$host$request_uri"
spec:
ingressClassName: nginx
rules:
- host: pr123.preview.atp.connectsoft.example
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: atp-ingestion
port:
number: 80
Network Policies for Preview¶
Preview Network Policy:
# Network policy for preview: Allow ingress from internet
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: preview-allow-ingress
namespace: atp-preview-pr123
spec:
podSelector:
matchLabels:
app: atp-ingestion
policyTypes:
- Ingress
- Egress
ingress:
# Allow from ingress controller
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
- podSelector:
matchLabels:
app: ingress-nginx
# Allow from monitoring (for metrics)
- from:
- namespaceSelector:
matchLabels:
name: monitoring
egress:
# Allow DNS
- to:
- namespaceSelector:
matchLabels:
name: kube-system
ports:
- protocol: UDP
port: 53
# Allow to shared test database
- to:
- namespaceSelector:
matchLabels:
name: atp-test-db
ports:
- protocol: TCP
port: 5432
Integration Testing in Preview¶
Running Integration Tests Against Preview¶
Integration Test Pipeline:
# azure-pipelines-preview-tests.yml
trigger: none
pr:
branches:
include:
- dev
pool:
vmImage: 'ubuntu-latest'
variables:
- name: PR_NUMBER
  # Runtime macro syntax; PullRequestId is the Azure Repos PR number
  value: $(System.PullRequest.PullRequestId)
- name: PREVIEW_URL
value: https://pr$(PR_NUMBER).preview.atp.connectsoft.example
stages:
- stage: RunIntegrationTests
displayName: 'Run Integration Tests Against Preview'
jobs:
- job: IntegrationTests
displayName: 'Integration Tests'
steps:
- task: DotNetCoreCLI@2
displayName: 'Run Integration Tests'
inputs:
command: 'test'
projects: '**/IntegrationTests.csproj'
    arguments: >-
      --filter "Category=Preview"
      --logger "trx;LogFileName=results.trx"
      --results-directory $(Agent.TempDirectory)/test-results
  env:
    # The tests read this with Environment.GetEnvironmentVariable("PreviewUrl");
    # values passed after "--" would be runsettings, not environment variables
    PreviewUrl: $(PREVIEW_URL)
- task: PublishTestResults@2
displayName: 'Publish Test Results'
inputs:
testResultsFiles: '**/*.trx'
testRunTitle: 'Preview Integration Tests - PR #$(PR_NUMBER)'
Integration Test Configuration:
// C#: Integration test configuration (xUnit)
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Xunit;

public class PreviewIntegrationTests
{
    private readonly string _previewUrl;

    public PreviewIntegrationTests()
    {
        _previewUrl = Environment.GetEnvironmentVariable("PreviewUrl")
            ?? "https://pr123.preview.atp.connectsoft.example";
    }

    [Fact]
    [Trait("Category", "Preview")]   // matched by --filter "Category=Preview"
    public async Task TestIngestionService()
    {
        var client = new HttpClient
        {
            BaseAddress = new Uri(_previewUrl)
        };

        var response = await client.GetAsync("/health/ready");

        Assert.Equal(HttpStatusCode.OK, response.StatusCode);
    }
}
Database/Dependency Mocking¶
Mocking Strategy:
| Dependency | Strategy | Implementation |
|---|---|---|
| Database | ⚠️ Shared test DB | ✅ Real database (isolated schema) |
| Redis | ✅ Shared test Redis | ✅ Real Redis (isolated keys) |
| External APIs | ✅ Mock | ✅ WireMock or MSW |
| Service Bus | ⚠️ Shared test queue | ✅ Real Service Bus (isolated queue) |
Mocked Dependencies Configuration:
# External API mocks
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-api-mock
  namespace: atp-preview-pr123
spec:
  replicas: 1
  selector:
    matchLabels:
      app: external-api-mock
  template:
    metadata:
      labels:
        app: external-api-mock
    spec:
containers:
- name: wiremock
image: wiremock/wiremock:latest
ports:
- containerPort: 8080
env:
- name: MAPPINGS_DIR
value: /home/wiremock/mappings
volumeMounts:
- name: mappings
mountPath: /home/wiremock/mappings
volumes:
- name: mappings
configMap:
name: wiremock-mappings
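WireMock Mappings ConfigMap:
The Deployment above mounts a wiremock-mappings ConfigMap that is not shown. A minimal sketch of one stub mapping; the endpoint and payload are illustrative:
apiVersion: v1
kind: ConfigMap
metadata:
  name: wiremock-mappings
  namespace: atp-preview-pr123
data:
  external-api-stub.json: |
    {
      "request": {
        "method": "GET",
        "urlPath": "/v1/external/status"
      },
      "response": {
        "status": 200,
        "headers": { "Content-Type": "application/json" },
        "jsonBody": { "status": "ok", "source": "wiremock-stub" }
      }
    }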
Shared Test Services¶
Shared Test Services Architecture:
graph TB
subgraph "Shared Test Namespace"
TEST_DB[(Shared Test DB<br/>Isolated schemas)]
TEST_REDIS[(Shared Test Redis<br/>Isolated keys)]
TEST_SB[Shared Test Service Bus<br/>Isolated queues]
end
subgraph "Preview PR #123"
PREVIEW1[Preview Services]
end
subgraph "Preview PR #124"
PREVIEW2[Preview Services]
end
PREVIEW1 -->|Isolated schema| TEST_DB
PREVIEW2 -->|Isolated schema| TEST_DB
PREVIEW1 -->|Isolated keys| TEST_REDIS
PREVIEW2 -->|Isolated keys| TEST_REDIS
PREVIEW1 -->|Isolated queue| TEST_SB
PREVIEW2 -->|Isolated queue| TEST_SB
style TEST_DB fill:#FFE5B4
style TEST_REDIS fill:#FFE5B4
style TEST_SB fill:#FFE5B4
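Redis Key Isolation Sketch:
Key isolation on the shared Redis can be as simple as a per-PR prefix. A hedged C# sketch using StackExchange.Redis; the host name and helper class are assumptions:
// Per-PR key prefix on the shared test Redis
using StackExchange.Redis;
using StackExchange.Redis.KeyspaceIsolation;

public static class PreviewRedis
{
    public static IDatabase Connect(string prNumber)
    {
        var mux = ConnectionMultiplexer.Connect("atp-test-redis:6379");

        // WithKeyPrefix scopes every key to this PR, e.g. "pr123:session:42"
        return mux.GetDatabase().WithKeyPrefix($"pr{prNumber}:");
    }
}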
Database and Dependencies¶
Mock Services vs Real Dependencies¶
Dependency Strategy Matrix:
| Dependency | Mock | Real | ATP Decision |
|---|---|---|---|
| SQL Database | ⚠️ Possible | ✅ Real (isolated schema) | ✅ Real with isolation |
| Redis | ⚠️ Possible | ✅ Real (isolated keys) | ✅ Real with isolation |
| Service Bus | ⚠️ Possible | ✅ Real (isolated queue) | ✅ Real with isolation |
| External APIs | ✅ Mock | ⚠️ Costly | ✅ Mock (WireMock) |
| Key Vault | ❌ N/A | ✅ Real | ✅ Real (shared) |
Shared Test Database Approach¶
Shared Test Database with Isolated Schemas:
# ExternalSecret for shared test database
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: sql-connection-preview
namespace: atp-preview-pr123
spec:
secretStoreRef:
name: azure-keyvault-dev
kind: ClusterSecretStore
data:
- secretKey: connectionString
remoteRef:
key: connection-strings/atp-test-db/preview-connection-string
# Connection string with PR-specific schema: atp_preview_pr123
Database Schema Isolation:
-- Create isolated schema per PR (PostgreSQL syntax; the shared test
-- database is assumed to be PostgreSQL, matching the psql tooling below
-- and the port-5432 network policy above)
CREATE SCHEMA IF NOT EXISTS atp_preview_pr123;
GRANT ALL PRIVILEGES ON SCHEMA atp_preview_pr123 TO atp_preview_user;
-- Npgsql connection string for PR #123 (schema selected via search path):
-- Host=test-db.postgres.database.azure.com;Database=atp_test;Search Path=atp_preview_pr123;...
Database Cleanup:
#!/bin/bash
# scripts/cleanup-preview-database.sh
PR_NUMBER="${1}"
echo "🗑️ Cleaning up database schema for PR #${PR_NUMBER}"
SCHEMA_NAME="atp_preview_pr${PR_NUMBER}"
# Drop schema (cascades to all objects; assumes a PostgreSQL test server,
# since Azure SQL does not speak the psql protocol)
psql -h test-db.postgres.database.azure.com \
  -U atp_admin \
  -d atp_test \
  -c "DROP SCHEMA IF EXISTS ${SCHEMA_NAME} CASCADE;"
echo "✅ Database schema cleaned up"
Ephemeral Database per Preview¶
Ephemeral Database Option:
# Option: Create ephemeral database per preview (costlier but more isolated)
apiVersion: v1
kind: ConfigMap
metadata:
name: preview-db-config
namespace: atp-preview-pr123
data:
database-name: "atp_preview_pr123"
create-database: "true"
ttl: "24h"
Ephemeral Database Creation:
#!/bin/bash
# scripts/create-preview-database.sh
PR_NUMBER="${1}"
echo "📦 Creating ephemeral database for PR #${PR_NUMBER}"
DB_NAME="atp_preview_pr${PR_NUMBER}"
# Create database via Azure CLI
az sql db create \
--resource-group atp-nonprod-rg \
--server atp-test-sql-server \
--name "${DB_NAME}" \
--service-objective S0 \
--tags \
Environment=Preview \
PRNumber="${PR_NUMBER}" \
AutoCleanup=true \
TTL="24h"
echo "✅ Ephemeral database created: ${DB_NAME}"
ATP Decision: Shared test database with isolated schemas (cost-effective, sufficient isolation)
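Ephemeral Database Cleanup:
If the ephemeral option is chosen instead, it needs a matching cleanup step on PR merge/close. A sketch mirroring the creation script:
#!/bin/bash
# scripts/delete-preview-database.sh (counterpart to create-preview-database.sh)
PR_NUMBER="${1}"
DB_NAME="atp_preview_pr${PR_NUMBER}"
echo "🗑️ Deleting ephemeral database for PR #${PR_NUMBER}"
az sql db delete \
  --resource-group atp-nonprod-rg \
  --server atp-test-sql-server \
  --name "${DB_NAME}" \
  --yes
echo "✅ Ephemeral database deleted: ${DB_NAME}"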
Preview Environment Lifecycle¶
Creation → Testing → Validation → Deletion¶
Preview Lifecycle Flow:
sequenceDiagram
participant Dev as Developer
participant PR as Pull Request
participant Pipeline as Azure Pipeline
participant K8s as Kubernetes
participant FluxCD as FluxCD
participant Tests as Integration Tests
Dev->>PR: Create PR
PR->>Pipeline: Trigger preview pipeline
Pipeline->>K8s: Create namespace
Pipeline->>FluxCD: Create GitRepository/Kustomization
FluxCD->>K8s: Deploy preview manifests
K8s->>Pipeline: Preview ready
Pipeline->>PR: Update PR with preview URL
Pipeline->>Tests: Run integration tests
Tests->>PR: Update PR with test results
PR->>Pipeline: PR merged/closed
Pipeline->>K8s: Delete namespace
Pipeline->>FluxCD: Delete GitRepository/Kustomization
K8s->>Pipeline: Cleanup complete
Status Reporting in PR Comments¶
PR Status Comment:
#!/bin/bash
# scripts/update-pr-status.sh
PR_NUMBER="${1}"
STATUS="${2}" # provisioning, active, testing, failed, cleaning
PREVIEW_URL="${3}"
echo "📝 Updating PR #${PR_NUMBER} status: ${STATUS}"
STATUS_EMOJI=""
case "${STATUS}" in
provisioning) STATUS_EMOJI="🔄" ;;
active) STATUS_EMOJI="✅" ;;
testing) STATUS_EMOJI="🧪" ;;
failed) STATUS_EMOJI="❌" ;;
cleaning) STATUS_EMOJI="🧹" ;;
esac
COMMENT="## ${STATUS_EMOJI} Preview Environment Status
**Status**: ${STATUS}
${PREVIEW_URL:+**Preview URL**: ${PREVIEW_URL}}
**Timestamp**: $(date -u +%Y-%m-%dT%H:%M:%SZ)
"
# Add comment to PR via the Azure DevOps REST API
# (az repos pr has no thread/comment subcommand; assumes a PAT in AZURE_DEVOPS_PAT)
PAYLOAD=$(jq -n --arg content "${COMMENT}" \
  '{comments: [{content: $content, commentType: 1}], status: "active"}')
curl -s -u ":${AZURE_DEVOPS_PAT}" \
  -H "Content-Type: application/json" \
  -d "${PAYLOAD}" \
  "https://dev.azure.com/ConnectSoft/ATP/_apis/git/repositories/atp-gitops/pullRequests/${PR_NUMBER}/threads?api-version=7.0"
Preview URL in PR Description¶
PR Description Update:
# Azure Pipeline: Update PR description
- task: Bash@3
displayName: 'Update PR Description'
inputs:
targetType: 'inline'
script: |
./scripts/update-pr-description.sh \
--pr-number $(PR_NUMBER) \
--preview-url $(PREVIEW_URL) \
--status active
PR Description Template:
## 🚀 Preview Environment
Preview environment has been provisioned for this PR.
### Access Information
- **Preview URL**: https://pr123.preview.atp.connectsoft.example
- **Status**: ✅ Active
- **Namespace**: `atp-preview-pr123`
### Services
- **API Gateway**: https://pr123.preview.atp.connectsoft.example/gateway
- **Ingestion Service**: https://pr123.preview.atp.connectsoft.example/ingestion
- **Query Service**: https://pr123.preview.atp.connectsoft.example/query
### Testing
Integration tests have been executed against the preview environment.
- ✅ Smoke tests: Passed
- ✅ Integration tests: Passed
- ✅ Health checks: Passed
### Cleanup
This preview environment will be automatically cleaned up when:
- PR is merged
- PR is closed
- 24 hours of inactivity (auto-shutdown)
**Created**: 2024-01-15T10:00:00Z
**TTL**: 24 hours
Summary: Preview Environments (Ephemeral)¶
- Preview Environment Architecture: Ephemeral namespaces in dev cluster, resource isolation per PR, cost optimization strategies, lifecycle management
- Automatic Provisioning: Azure Pipeline triggered by PR, namespace creation script, manifest generation with PR-specific values, FluxCD Kustomization for preview
- Dynamic Manifest Generation: Namespace naming (atp-preview-pr{number}), Ingress hostname (pr{number}.preview.atp.connectsoft.example), resource limits (smaller than dev), image tag from PR branch
- FluxCD Configuration: Dynamic GitRepository per PR, preview Kustomization, sync policies for preview, health checks and validation
- Resource Cleanup: Auto-delete after PR merge, auto-delete after PR close, manual cleanup for stuck resources, cost tracking and alerts
- Cost Optimization: Shared node pools, reduced replica counts (1 vs 3), minimal resource requests, auto-shutdown after 4 hours inactivity, spot instances for preview environments
- Access Control: Preview URL generation, authentication for preview environments (basic auth/OAuth), network policies for preview
- Integration Testing: Running integration tests against preview, database/dependency mocking, shared test services
- Database and Dependencies: Mock services vs real dependencies, shared test database approach with isolated schemas, ephemeral database per preview option
- Preview Environment Lifecycle: Creation → testing → validation → deletion flow, status reporting in PR comments, preview URL in PR description
Rollback & Disaster Recovery¶
Purpose: Define rollback procedures for ATP GitOps deployments including Git-based rollbacks, progressive rollback strategies, application state recovery, database migration rollbacks, FluxCD rollback mechanisms, Azure backup integration, disaster recovery scenarios, and incident response procedures to ensure rapid recovery from failures and minimize downtime.
Git-Based Rollback¶
Simple Rollback: Git Revert¶
Git Revert for Simple Rollback:
#!/bin/bash
# scripts/rollback-simple.sh
SERVICE="${1:-atp-ingestion}"
ENVIRONMENT="${2:-production}"
NAMESPACE="atp-${ENVIRONMENT}"
echo "⏪ Rolling back ${SERVICE} in ${ENVIRONMENT}"
# Find the last deployment commit
LAST_COMMIT=$(git log --oneline --grep="deploy.*${SERVICE}" -n 1 --format="%H")
if [ -z "${LAST_COMMIT}" ]; then
echo "❌ No deployment commit found for ${SERVICE}"
exit 1
fi
echo "📝 Last deployment commit: ${LAST_COMMIT}"
# Revert the commit
git revert --no-edit "${LAST_COMMIT}"
# Push the revert commit
git push origin ${ENVIRONMENT}
echo "✅ Rollback committed: ${SERVICE} reverted to previous state"
# FluxCD will automatically reconcile to the new Git state
Git Revert for Multiple Commits:
#!/bin/bash
# scripts/rollback-multiple.sh
SERVICE="${1}"
ENVIRONMENT="${2:-production}"
COMMIT_COUNT="${3:-1}" # Number of commits to revert
echo "⏪ Rolling back ${COMMIT_COUNT} commits for ${SERVICE}"
# Revert multiple commits, newest first (reverting oldest-first can
# conflict with later commits that build on the reverted changes)
git log -n ${COMMIT_COUNT} --format="%H" | while read commit; do
  echo "Reverting commit: ${commit}"
  git revert --no-edit "${commit}"
done
# Push all revert commits
git push origin ${ENVIRONMENT}
echo "✅ Rolled back ${COMMIT_COUNT} commits"
Complex Rollback: Git Reset¶
Git Reset for Complex Rollback (Use with caution):
#!/bin/bash
# scripts/rollback-reset.sh
ENVIRONMENT="${1:-production}"
TARGET_COMMIT="${2}" # Commit hash or tag to rollback to
if [ -z "${TARGET_COMMIT}" ]; then
echo "Usage: $0 <environment> <commit-hash-or-tag>"
echo "Example: $0 production v1.2.2"
exit 1
fi
echo "⚠️ WARNING: Git reset will rewrite history"
echo "⏪ Rolling back ${ENVIRONMENT} to ${TARGET_COMMIT}"
# Verify target commit exists
if ! git cat-file -e "${TARGET_COMMIT}^{commit}" 2>/dev/null; then
echo "❌ Target commit ${TARGET_COMMIT} not found"
exit 1
fi
# Create backup branch before reset
BACKUP_BRANCH="${ENVIRONMENT}-backup-$(date +%Y%m%d-%H%M%S)"
git branch "${BACKUP_BRANCH}" "${ENVIRONMENT}"
echo "📦 Backup branch created: ${BACKUP_BRANCH}"
# Reset branch and working tree to the target commit (a soft reset would
# keep the current content staged and roll nothing back)
git checkout "${ENVIRONMENT}"
git reset --hard "${TARGET_COMMIT}"
# Force push (requires branch protection override for emergency)
git push origin "${ENVIRONMENT}" --force
echo "✅ Rollback complete: ${ENVIRONMENT} reset to ${TARGET_COMMIT}"
echo "⚠️ Backup branch: ${BACKUP_BRANCH} (keep for reference)"
ATP Recommendation: Prefer git revert over git reset (preserves history, safer for audit trail)
Rollback to Specific Commit¶
Rollback to Specific Commit:
#!/bin/bash
# scripts/rollback-to-commit.sh
SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_COMMIT="${3}"
if [ -z "${TARGET_COMMIT}" ]; then
echo "Usage: $0 <service> <environment> <commit-hash>"
echo "Example: $0 atp-ingestion production abc123def456"
exit 1
fi
echo "⏪ Rolling back ${SERVICE} to commit ${TARGET_COMMIT}"
# Checkout the target commit
git checkout "${TARGET_COMMIT}" -- "apps/${SERVICE}/"
# Check if changes exist
if git diff --quiet "${ENVIRONMENT}" -- "apps/${SERVICE}/"; then
echo "⚠️ No changes to rollback (already at target commit)"
exit 0
fi
# Commit the rollback
git add "apps/${SERVICE}/"
git commit -m "rollback(${SERVICE}): Revert to commit ${TARGET_COMMIT}"
# Push to environment branch
git push origin "${ENVIRONMENT}"
echo "✅ Rollback complete: ${SERVICE} reverted to ${TARGET_COMMIT}"
Rollback to Commit with Validation:
#!/bin/bash
# scripts/rollback-to-commit-validated.sh
SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_COMMIT="${3}"
echo "⏪ Rolling back ${SERVICE} to commit ${TARGET_COMMIT}"
# Validate target commit
echo "🔍 Validating target commit..."
git show --no-patch --format="%H%n%an%n%ae%n%ad%n%s" "${TARGET_COMMIT}"
read -p "Continue with rollback? (yes/no): " confirm
if [ "${confirm}" != "yes" ]; then
echo "❌ Rollback cancelled"
exit 1
fi
# Perform rollback
./scripts/rollback-to-commit.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_COMMIT}"
# Wait for FluxCD reconciliation
echo "⏳ Waiting for FluxCD to reconcile..."
sleep 60
# Verify rollback
./scripts/verify-rollback.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_COMMIT}"
Rollback to Specific Tag¶
Rollback to Specific Tag:
#!/bin/bash
# scripts/rollback-to-tag.sh
SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_TAG="${3}" # e.g., v1.2.2
if [ -z "${TARGET_TAG}" ]; then
echo "Usage: $0 <service> <environment> <tag>"
echo "Example: $0 atp-ingestion production v1.2.2"
exit 1
fi
echo "⏪ Rolling back ${SERVICE} to tag ${TARGET_TAG}"
# Verify tag exists
if ! git rev-parse "${TARGET_TAG}" >/dev/null 2>&1; then
echo "❌ Tag ${TARGET_TAG} not found"
echo "Available tags:"
git tag --sort=-creatordate | head -10
exit 1
fi
# Get commit hash for tag
TARGET_COMMIT=$(git rev-parse "${TARGET_TAG}")
echo "📦 Tag ${TARGET_TAG} points to commit ${TARGET_COMMIT}"
# Rollback to the tagged commit
./scripts/rollback-to-commit.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_COMMIT}"
echo "✅ Rollback complete: ${SERVICE} reverted to ${TARGET_TAG} (${TARGET_COMMIT})"
List Available Tags for Rollback:
#!/bin/bash
# scripts/list-rollback-tags.sh
SERVICE="${1}"
ENVIRONMENT="${2:-production}"
echo "📋 Available rollback tags for ${SERVICE}:"
echo ""
git tag --sort=-creatordate --format="%(refname:short)|%(creatordate:iso)|%(subject)" | \
while IFS='|' read -r tag date subject; do
# Check if tag affects the service
if git diff "${tag}~1" "${tag}" --name-only | grep -q "apps/${SERVICE}/"; then
echo " ${tag} - ${date}"
echo " ${subject}"
echo ""
fi
done
Progressive Rollback¶
Rolling Back One Service at a Time¶
Progressive Service Rollback:
#!/bin/bash
# scripts/progressive-rollback.sh
ENVIRONMENT="${1:-production}"
SERVICES="${2}" # Comma-separated: atp-ingestion,atp-query,atp-gateway
if [ -z "${SERVICES}" ]; then
echo "Usage: $0 <environment> <service1,service2,service3>"
echo "Example: $0 production atp-ingestion,atp-query,atp-gateway"
exit 1
fi
echo "🔄 Progressive rollback: ${SERVICES} in ${ENVIRONMENT}"
# Split services into array
IFS=',' read -ra SERVICE_ARRAY <<< "${SERVICES}"
for SERVICE in "${SERVICE_ARRAY[@]}"; do
echo ""
echo "⏪ Rolling back ${SERVICE}..."
# Rollback service
./scripts/rollback-simple.sh "${SERVICE}" "${ENVIRONMENT}"
# Wait for reconciliation
echo "⏳ Waiting for reconciliation (60s)..."
sleep 60
# Validate rollback
echo "🔍 Validating rollback..."
if ./scripts/verify-service-health.sh "${SERVICE}" "${ENVIRONMENT}"; then
echo "✅ ${SERVICE} rollback validated"
else
echo "❌ ${SERVICE} rollback validation failed"
read -p "Continue with next service? (yes/no): " continue
if [ "${continue}" != "yes" ]; then
echo "⚠️ Progressive rollback stopped"
exit 1
fi
fi
done
echo ""
echo "✅ Progressive rollback complete: All services rolled back"
Rollback with Canary (Gradual Revert)¶
Canary Rollback Configuration:
# Flagger rolls back automatically when analysis fails: after `threshold`
# failed checks it routes all traffic back to the primary. stepWeight
# cannot be negative, so a manual gradual revert is done by adjusting
# VirtualService weights directly (see the script below).
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  analysis:
    interval: 1m
    threshold: 3              # failed checks before automatic rollback
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 30s
Gradual Rollback Script:
#!/bin/bash
# scripts/canary-rollback.sh
SERVICE="${1}"
ENVIRONMENT="${2:-production}"
ROLLBACK_STEPS="${3:-4}" # Number of steps
echo "🔄 Gradual canary rollback: ${SERVICE} in ${ENVIRONMENT}"
# Current canary weight (assume 100% for rollback start)
CURRENT_WEIGHT=100
STEP_SIZE=$((100 / ROLLBACK_STEPS))
for STEP in $(seq 1 ${ROLLBACK_STEPS}); do
NEW_WEIGHT=$((CURRENT_WEIGHT - STEP_SIZE))
echo "📊 Step ${STEP}/${ROLLBACK_STEPS}: Reducing traffic to ${NEW_WEIGHT}%"
# Update Istio VirtualService weights (primary + canary must sum to 100)
PRIMARY_WEIGHT=$((100 - NEW_WEIGHT))
kubectl patch virtualservice "${SERVICE}" -n "atp-${ENVIRONMENT}" --type=json \
  -p="[{\"op\": \"replace\", \"path\": \"/spec/http/0/route/0/weight\", \"value\": ${PRIMARY_WEIGHT}},
      {\"op\": \"replace\", \"path\": \"/spec/http/0/route/1/weight\", \"value\": ${NEW_WEIGHT}}]"
# Wait and validate
echo "⏳ Waiting 2 minutes for validation..."
sleep 120
# Check error rate
ERROR_RATE=$(./scripts/get-error-rate.sh "${SERVICE}" "${ENVIRONMENT}")
echo "📈 Error rate: ${ERROR_RATE}%"
if (( $(echo "${ERROR_RATE} > 5" | bc -l) )); then
echo "❌ Error rate too high, accelerating rollback"
NEW_WEIGHT=$((NEW_WEIGHT - STEP_SIZE))
fi
CURRENT_WEIGHT=${NEW_WEIGHT}
if [ ${CURRENT_WEIGHT} -le 0 ]; then
echo "✅ Full rollback complete (0% traffic to canary)"
break
fi
done
echo "✅ Gradual rollback complete"
Validation at Each Rollback Step¶
Rollback Validation Script:
#!/bin/bash
# scripts/validate-rollback-step.sh
SERVICE="${1}"
ENVIRONMENT="${2:-production}"
STEP="${3}"
echo "🔍 Validating rollback step ${STEP} for ${SERVICE}"
# Health check validation
HEALTH_STATUS=$(kubectl get deployment "${SERVICE}" -n "atp-${ENVIRONMENT}" \
-o jsonpath='{.status.conditions[?(@.type=="Available")].status}')
if [ "${HEALTH_STATUS}" != "True" ]; then
echo "❌ Health check failed: Deployment not available"
exit 1
fi
# Error rate validation
ERROR_RATE=$(./scripts/get-error-rate.sh "${SERVICE}" "${ENVIRONMENT}")
ERROR_THRESHOLD=5
if (( $(echo "${ERROR_RATE} > ${ERROR_THRESHOLD}" | bc -l) )); then
echo "❌ Error rate validation failed: ${ERROR_RATE}% > ${ERROR_THRESHOLD}%"
exit 1
fi
# Latency validation
P95_LATENCY=$(./scripts/get-p95-latency.sh "${SERVICE}" "${ENVIRONMENT}")
LATENCY_THRESHOLD=500 # 500ms
if (( $(echo "${P95_LATENCY} > ${LATENCY_THRESHOLD}" | bc -l) )); then
echo "❌ Latency validation failed: ${P95_LATENCY}ms > ${LATENCY_THRESHOLD}ms"
exit 1
fi
# Readiness probe validation
READY_REPLICAS=$(kubectl get deployment "${SERVICE}" -n "atp-${ENVIRONMENT}" \
-o jsonpath='{.status.readyReplicas}')
DESIRED_REPLICAS=$(kubectl get deployment "${SERVICE}" -n "atp-${ENVIRONMENT}" \
-o jsonpath='{.spec.replicas}')
if [ "${READY_REPLICAS}" != "${DESIRED_REPLICAS}" ]; then
echo "❌ Replica validation failed: ${READY_REPLICAS}/${DESIRED_REPLICAS} ready"
exit 1
fi
echo "✅ All validation checks passed for step ${STEP}"
Application State Recovery¶
Handling Database Schema Changes¶
Database Schema Rollback Strategy:
| Migration Type | Rollback Strategy | ATP Decision |
|---|---|---|
| Add Column | Drop column (if nullable) | ✅ Safe rollback |
| Drop Column | Add column back | ⚠️ Data loss risk |
| Rename Column | Rename back | ✅ Safe rollback |
| Change Type | Revert type change | ⚠️ Data truncation risk |
| Add Table | Drop table | ✅ Safe rollback |
| Drop Table | Recreate table | ❌ Data loss |
Forward-Only Migrations (Preferred):
// C#: Forward-only migration (no rollback)
// Entity Framework migration: AddAuditIndex
public partial class AddAuditIndex : Migration
{
protected override void Up(MigrationBuilder migrationBuilder)
{
migrationBuilder.CreateIndex(
name: "IX_AuditTrail_Timestamp",
table: "AuditTrail",
column: "Timestamp");
}
// No Down() method - forward-only migration
// Rollback = deploy previous version that doesn't use the index
}
Database Rollback Coordination:
#!/bin/bash
# scripts/rollback-with-db.sh
SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_VERSION="${3}"
echo "🔄 Coordinated rollback: Application + Database"
# Step 1: Check if database rollback is needed
CURRENT_SCHEMA_VERSION=$(./scripts/get-db-schema-version.sh "${ENVIRONMENT}")
TARGET_SCHEMA_VERSION=$(./scripts/get-schema-version-for-tag.sh "${TARGET_VERSION}")
if [ "${CURRENT_SCHEMA_VERSION}" != "${TARGET_SCHEMA_VERSION}" ]; then
echo "⚠️ Database schema rollback required"
echo " Current: ${CURRENT_SCHEMA_VERSION}"
echo " Target: ${TARGET_SCHEMA_VERSION}"
read -p "Proceed with database rollback? (yes/no): " confirm
if [ "${confirm}" != "yes" ]; then
echo "❌ Rollback cancelled"
exit 1
fi
# Rollback database schema
./scripts/rollback-database-schema.sh "${ENVIRONMENT}" "${TARGET_SCHEMA_VERSION}"
fi
# Step 2: Rollback application
./scripts/rollback-to-tag.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_VERSION}"
echo "✅ Coordinated rollback complete"
Data Migration Rollback¶
Data Migration Rollback Strategy:
#!/bin/bash
# scripts/rollback-data-migration.sh
ENVIRONMENT="${1:-production}"
MIGRATION_ID="${2}"
echo "🔄 Rolling back data migration: ${MIGRATION_ID}"
# Check if migration has been applied
if ! ./scripts/check-migration-applied.sh "${MIGRATION_ID}" "${ENVIRONMENT}"; then
echo "⚠️ Migration ${MIGRATION_ID} not applied, skipping rollback"
exit 0
fi
# Execute rollback script (if exists)
ROLLBACK_SCRIPT="migrations/${MIGRATION_ID}/rollback.sql"
if [ -f "${ROLLBACK_SCRIPT}" ]; then
echo "📝 Executing rollback script: ${ROLLBACK_SCRIPT}"
psql -h "${DB_HOST}" -U "${DB_USER}" -d "${DB_NAME}" -f "${ROLLBACK_SCRIPT}"
else
echo "⚠️ No rollback script found: ${ROLLBACK_SCRIPT}"
echo "⚠️ Manual data recovery may be required"
exit 1
fi
# Mark migration as rolled back
./scripts/mark-migration-rolled-back.sh "${MIGRATION_ID}" "${ENVIRONMENT}"
echo "✅ Data migration rollback complete"
Stateful Application Considerations¶
StatefulSet Rollback:
#!/bin/bash
# scripts/rollback-statefulset.sh
SERVICE="${1}"
ENVIRONMENT="${2:-production}"
echo "⏪ Rolling back StatefulSet: ${SERVICE}"
# Show rollout history before rolling back (updateRevision reports the
# incoming revision, not the previous one, so rely on rollout undo)
kubectl rollout history statefulset "${SERVICE}" -n "atp-${ENVIRONMENT}"
# Rollback to previous revision
kubectl rollout undo statefulset "${SERVICE}" -n "atp-${ENVIRONMENT}"
# Monitor rollout
kubectl rollout status statefulset "${SERVICE}" -n "atp-${ENVIRONMENT}" --timeout=10m
echo "✅ StatefulSet rollback complete"
PVC Retention During Rollback:
# StatefulSet with PVC retention
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: atp-stateful-service
spec:
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 10Gi
# PVCs are NOT deleted on StatefulSet deletion
# Data is preserved during rollback
Database Migration Rollback¶
Forward-Only Migrations (Preferred)¶
Forward-Only Migration Strategy:
| Approach | Pros | Cons | ATP Decision |
|---|---|---|---|
| Forward-Only | ✅ Simpler, safer | ⚠️ No automatic rollback | ✅ Preferred |
| Reversible Migrations | ✅ Can rollback | ❌ Complex, risky | ⚠️ Use sparingly |
| No Migrations | ✅ No risk | ❌ No schema changes | ❌ Not practical |
Forward-Only Migration Example:
// Entity Framework: Forward-only migration
public partial class AddAuditIndex : Migration
{
protected override void Up(MigrationBuilder migrationBuilder)
{
// Add index
migrationBuilder.CreateIndex(
name: "IX_AuditTrail_Timestamp",
table: "AuditTrail",
column: "Timestamp");
}
// No Down() method - rollback = deploy previous app version
}
Rollback Strategy for Forward-Only Migrations:
- Rollback Application: Deploy previous application version (doesn't use new schema)
- Schema Compatibility: New schema must be backward compatible with old application
- Cleanup Migration: Create a new migration later to clean up unused schema (see the sketch below)
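Cleanup Migration Sketch:
The cleanup step above can itself be a forward-only migration once no deployed version reads the old schema. A hedged EF Core sketch; the migration and column names are illustrative:
// C#: Later forward-only cleanup migration (illustrative names)
public partial class DropUnusedEventDateColumn : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        // Safe only after every running version reads Timestamp, not EventDate
        migrationBuilder.DropColumn(
            name: "EventDate",
            table: "AuditTrail");
    }
    // Forward-only: no Down(); rollback means deploying a version
    // that never referenced the dropped column
}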
Rollback Scripts (If Necessary)¶
Reversible Migration with Rollback:
// Entity Framework: Reversible migration (use sparingly)
public partial class RenameAuditColumn : Migration
{
protected override void Up(MigrationBuilder migrationBuilder)
{
migrationBuilder.RenameColumn(
name: "EventDate",
table: "AuditTrail",
newName: "Timestamp");
}
protected override void Down(MigrationBuilder migrationBuilder)
{
migrationBuilder.RenameColumn(
name: "Timestamp",
table: "AuditTrail",
newName: "EventDate");
}
}
Rollback Script:
-- migrations/20240115_AddAuditIndex/rollback.sql
-- Rollback script for AddAuditIndex migration
-- Drop the index
DROP INDEX IF EXISTS IX_AuditTrail_Timestamp ON AuditTrail;
-- Log rollback
INSERT INTO MigrationHistory (MigrationId, AppliedAt, RolledBackAt, Status)
VALUES ('20240115_AddAuditIndex', GETDATE(), GETDATE(), 'RolledBack');
Data Loss Prevention¶
Data Loss Prevention Checklist:
#!/bin/bash
# scripts/prevent-data-loss-rollback.sh
MIGRATION_ID="${1}"
ENVIRONMENT="${2:-production}"
echo "🔒 Data Loss Prevention Check for Migration: ${MIGRATION_ID}"
# Check if migration involves data deletion
if grep -q "DELETE\|DROP\|TRUNCATE" "migrations/${MIGRATION_ID}/up.sql"; then
echo "⚠️ WARNING: Migration contains data deletion operations"
# Create backup before rollback
echo "📦 Creating database backup..."
./scripts/backup-database.sh "${ENVIRONMENT}" "pre-rollback-${MIGRATION_ID}"
# Ask for confirmation
read -p "Migration may cause data loss. Continue? (yes/no): " confirm
if [ "${confirm}" != "yes" ]; then
echo "❌ Rollback cancelled"
exit 1
fi
fi
# Check for dependent data
echo "🔍 Checking for dependent data..."
DEPENDENT_RECORDS=$(./scripts/check-dependent-data.sh "${MIGRATION_ID}")
if [ "${DEPENDENT_RECORDS}" -gt 0 ]; then
echo "⚠️ WARNING: ${DEPENDENT_RECORDS} dependent records found"
read -p "Continue with rollback? (yes/no): " confirm
if [ "${confirm}" != "yes" ]; then
echo "❌ Rollback cancelled"
exit 1
fi
fi
echo "✅ Data loss prevention checks passed"
Coordinating App Rollback with DB Rollback¶
Coordinated Rollback Procedure:
sequenceDiagram
participant Admin as Administrator
participant App as Application Rollback
participant DB as Database Rollback
participant FluxCD as FluxCD
participant K8s as Kubernetes
Admin->>App: Initiate rollback
App->>DB: Check schema compatibility
DB-->>App: Schema version check
App->>DB: Rollback database (if needed)
DB->>DB: Execute rollback script
DB-->>App: Database rolled back
App->>FluxCD: Revert Git commit
FluxCD->>K8s: Reconcile to previous state
K8s->>App: Deploy previous app version
App->>Admin: Rollback complete
Coordinated Rollback Script:
#!/bin/bash
# scripts/coordinated-rollback.sh
SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_TAG="${3}"
echo "🔄 Coordinated Application + Database Rollback"
# Step 1: Get target application version and schema version
TARGET_APP_VERSION="${TARGET_TAG}"
TARGET_SCHEMA_VERSION=$(./scripts/get-schema-version-for-tag.sh "${TARGET_TAG}")
CURRENT_SCHEMA_VERSION=$(./scripts/get-db-schema-version.sh "${ENVIRONMENT}")
echo "📊 Rollback Plan:"
echo " Application: ${TARGET_APP_VERSION}"
echo " Database Schema: ${CURRENT_SCHEMA_VERSION} → ${TARGET_SCHEMA_VERSION}"
# Step 2: Check schema compatibility
if [ "${CURRENT_SCHEMA_VERSION}" != "${TARGET_SCHEMA_VERSION}" ]; then
echo "⚠️ Database schema rollback required"
# Verify backward compatibility
if ! ./scripts/verify-schema-compatibility.sh "${TARGET_SCHEMA_VERSION}" "${TARGET_APP_VERSION}"; then
echo "❌ Schema version ${TARGET_SCHEMA_VERSION} not compatible with app ${TARGET_APP_VERSION}"
exit 1
fi
# Step 2a: Rollback database schema first
echo "🔄 Step 1/2: Rolling back database schema..."
./scripts/rollback-database-schema.sh "${ENVIRONMENT}" "${TARGET_SCHEMA_VERSION}"
# Wait for schema rollback to complete
sleep 30
else
echo "✅ No database schema rollback needed"
fi
# Step 3: Rollback application
echo "🔄 Step 2/2: Rolling back application..."
./scripts/rollback-to-tag.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_TAG}"
# Step 4: Validate rollback
echo "🔍 Validating coordinated rollback..."
./scripts/validate-rollback.sh "${SERVICE}" "${ENVIRONMENT}"
echo "✅ Coordinated rollback complete"
FluxCD Rollback¶
Reverting Kustomization¶
Revert Kustomization via Git:
#!/bin/bash
# scripts/fluxcd-rollback-kustomization.sh
KUSTOMIZATION="${1}" # e.g., apps-production
ENVIRONMENT="${2:-production}"
TARGET_COMMIT="${3}"
echo "⏪ Rolling back Kustomization: ${KUSTOMIZATION}"
# Revert the Kustomization path in Git
git checkout "${TARGET_COMMIT}" -- "apps/" "infrastructure/"
# Commit the rollback
git add apps/ infrastructure/
git commit -m "rollback: Revert ${KUSTOMIZATION} to ${TARGET_COMMIT}"
# Push to environment branch
git push origin "${ENVIRONMENT}"
echo "✅ Kustomization rollback committed"
echo "⏳ FluxCD will reconcile automatically (polling interval: 5m)"
Suspend Kustomization for Manual Rollback:
#!/bin/bash
# scripts/suspend-kustomization.sh
KUSTOMIZATION="${1}"
NAMESPACE="${2:-flux-system}"
echo "⏸️ Suspending Kustomization: ${KUSTOMIZATION}"
# Suspend reconciliation
flux suspend kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}"
# Verify suspension
kubectl get kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}" \
-o jsonpath='{.spec.suspend}'
echo "✅ Kustomization suspended (reconciliation paused)"
Resume Kustomization After Rollback:
#!/bin/bash
# scripts/resume-kustomization.sh
KUSTOMIZATION="${1}"
NAMESPACE="${2:-flux-system}"
echo "▶️ Resuming Kustomization: ${KUSTOMization}"
# Resume reconciliation
flux resume kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}"
echo "✅ Kustomization resumed (reconciliation active)"
Reverting HelmRelease¶
HelmRelease Rollback:
#!/bin/bash
# scripts/fluxcd-rollback-helmrelease.sh
HELMRELEASE="${1}"
NAMESPACE="${2:-atp-production}"
echo "⏪ Rolling back HelmRelease: ${HELMRELEASE}"
# Get current release version
CURRENT_REVISION=$(kubectl get helmrelease "${HELMRELEASE}" -n "${NAMESPACE}" \
-o jsonpath='{.status.lastReleaseRevision}')
PREVIOUS_REVISION=$((CURRENT_REVISION - 1))
echo "📊 Current revision: ${CURRENT_REVISION}"
echo "📊 Rolling back to revision: ${PREVIOUS_REVISION}"
# Roll back using ONE of the following options:

# Option 1 (preferred, GitOps): revert the Helm values in Git to a known
# good commit (set PREVIOUS_COMMIT first), then let FluxCD reconcile
#   git checkout "${PREVIOUS_COMMIT}" -- "apps/${HELMRELEASE}/values.yaml"

# Option 2 (emergency): direct Helm rollback; this bypasses GitOps, so
# FluxCD will re-apply the Git state on its next reconciliation unless suspended
helm rollback "${HELMRELEASE}" "${PREVIOUS_REVISION}" -n "${NAMESPACE}"

# Option 3: patch the HelmRelease spec with the previous values
#   kubectl patch helmrelease "${HELMRELEASE}" -n "${NAMESPACE}" --type=json \
#     -p='[{"op": "replace", "path": "/spec/values", "value": {...previous values...}}]'
echo "✅ HelmRelease rollback initiated"
HelmRelease Git-Based Rollback:
#!/bin/bash
# scripts/fluxcd-helmrelease-git-rollback.sh
HELMRELEASE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_COMMIT="${3}"
echo "⏪ Git-based HelmRelease rollback: ${HELMRELEASE}"
# Revert Helm values to target commit
git checkout "${TARGET_COMMIT}" -- "apps/${HELMRELEASE}/values.yaml" \
"apps/${HELMRELEASE}/Chart.yaml"
# Commit the rollback
git add "apps/${HELMRELEASE}/"
git commit -m "rollback(helm): Revert ${HELMRELEASE} to ${TARGET_COMMIT}"
# Push to environment branch
git push origin "${ENVIRONMENT}"
echo "✅ HelmRelease rollback committed"
echo "⏳ FluxCD will reconcile and deploy previous Helm chart version"
Suspend and Resume Reconciliation¶
Suspend Reconciliation:
#!/bin/bash
# scripts/suspend-reconciliation.sh
KUSTOMIZATION="${1}"
NAMESPACE="${2:-flux-system}"
echo "⏸️ Suspending reconciliation for: ${KUSTOMIZATION}"
# Suspend via Flux CLI
flux suspend kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}"
# Or via kubectl
kubectl patch kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}" --type=json \
-p='[{"op": "replace", "path": "/spec/suspend", "value": true}]'
# Verify suspension
kubectl get kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}" \
-o jsonpath='{.spec.suspend}'
echo "✅ Reconciliation suspended"
Resume Reconciliation:
#!/bin/bash
# scripts/resume-reconciliation.sh
KUSTOMIZATION="${1}"
NAMESPACE="${2:-flux-system}"
echo "▶️ Resuming reconciliation for: ${KUSTOMIZATION}"
# Resume via Flux CLI
flux resume kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}"
# Or via kubectl
kubectl patch kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}" --type=json \
-p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'
echo "✅ Reconciliation resumed"
Manual Intervention Procedures¶
Manual Intervention Runbook:
#!/bin/bash
# scripts/manual-intervention-runbook.sh
INCIDENT_TYPE="${1}" # deployment-failure, drift-detection, reconciliation-stuck
echo "🚨 Manual Intervention Runbook"
echo "Incident Type: ${INCIDENT_TYPE}"
case "${INCIDENT_TYPE}" in
"deployment-failure")
echo "📋 Deployment Failure Intervention:"
echo "1. Check deployment status: kubectl get deployment -n atp-production"
echo "2. Check pod logs: kubectl logs -n atp-production deployment/<service>"
echo "3. Check FluxCD status: flux get kustomizations -n flux-system"
echo "4. Suspend reconciliation: flux suspend kustomization <name> -n flux-system"
echo "5. Manually fix issue or rollback: ./scripts/rollback-simple.sh <service> production"
echo "6. Resume reconciliation: flux resume kustomization <name> -n flux-system"
;;
"drift-detection")
echo "📋 Drift Detection Intervention:"
echo "1. Check drift: flux get kustomizations --watch"
echo "2. Identify drifted resources: kubectl diff -f <manifest>"
echo "3. Option A: Fix cluster state to match Git"
echo " kubectl delete <resource> (let FluxCD recreate)"
echo "4. Option B: Update Git to match cluster state"
echo " git checkout <cluster-state>"
echo "5. Force reconciliation: flux reconcile kustomization <name>"
;;
"reconciliation-stuck")
echo "📋 Stuck Reconciliation Intervention:"
echo "1. Check Kustomization status: flux get kustomizations -n flux-system"
echo "2. Describe for details: kubectl describe kustomization <name> -n flux-system"
echo "3. Check logs: kubectl logs -n flux-system deployment/kustomize-controller"
echo "4. Suspend: flux suspend kustomization <name> -n flux-system"
echo "5. Fix issue (check GitRepository, permissions, etc.)"
echo "6. Resume: flux resume kustomization <name> -n flux-system"
echo "7. Force reconcile: flux reconcile kustomization <name> --with-source"
;;
esac
Azure Backup Integration¶
Backing Up AKS Resources (Velero)¶
Velero Installation:
# Install Velero CLI
curl -fsSL -o velero-v1.11.0-linux-amd64.tar.gz \
https://github.com/vmware-tanzu/velero/releases/download/v1.11.0/velero-v1.11.0-linux-amd64.tar.gz
tar -xvf velero-v1.11.0-linux-amd64.tar.gz
sudo mv velero-v1.11.0-linux-amd64/velero /usr/local/bin/
# Install Velero on AKS
velero install \
--provider azure \
--plugins velero/velero-plugin-for-microsoft-azure:v1.7.0 \
--bucket velero-backups \
--secret-file ./credentials-velero \
--backup-location-config resourceGroup=atp-production-rg,storageAccount=atpprodvelero,subscriptionId=<subscription-id> \
--snapshot-location-config apiTimeout=5m,resourceGroup=atp-production-rg,subscriptionId=<subscription-id>
Velero Backup Configuration:
# velero/backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: daily-backup-production
namespace: velero
spec:
schedule: "0 2 * * *" # 2 AM daily
template:
includedNamespaces:
- atp-production
excludedResources:
- events
- events.events.k8s.io
ttl: 720h0m0s # Retain backups for 30 days (Velero TTL is a Go duration; "d" is not a valid unit)
storageLocation: default
volumeSnapshotLocations:
- default
metadata:
labels:
environment: production
backup-type: scheduled
Manual Backup:
#!/bin/bash
# scripts/velero-backup.sh
BACKUP_NAME="manual-backup-$(date +%Y%m%d-%H%M%S)"
NAMESPACE="${1:-atp-production}"
echo "📦 Creating Velero backup: ${BACKUP_NAME}"
# Create backup
velero backup create "${BACKUP_NAME}" \
--include-namespaces "${NAMESPACE}" \
--ttl 720h \
--wait
# Verify backup
velero backup describe "${BACKUP_NAME}"
echo "✅ Backup created: ${BACKUP_NAME}"
PersistentVolume Snapshots¶
Volume Snapshot Configuration:
# VolumeSnapshot for StatefulSet
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: atp-stateful-data-snapshot-20240115 # date-stamped by the script that applies this manifest (shell substitution does not run in static YAML)
namespace: atp-production
spec:
volumeSnapshotClassName: csi-azuredisk-vsc
source:
persistentVolumeClaimName: data-atp-stateful-service-0
Automated Volume Snapshots:
# Velero: Automated volume snapshots
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: volume-snapshots-production
namespace: velero
spec:
schedule: "0 */6 * * *" # Every 6 hours
template:
includedNamespaces:
- atp-production
includedResources:
- persistentvolumes
- persistentvolumeclaims
volumeSnapshotLocations:
- default
ttl: 168h0m0s # Retain snapshots for 7 days
Etcd Backup¶
Etcd Backup via AKS:
#!/bin/bash
# scripts/backup-etcd.sh
RESOURCE_GROUP="${1:-atp-production-rg}"
CLUSTER_NAME="${2:-atp-prod-eus-aks}"
echo "📦 Backing up AKS etcd"
# AKS automatically backs up etcd, but we can trigger manual snapshot
# Note: etcd backup requires Azure support or cluster admin access
# Alternative: Use Velero for cluster-level backup
velero backup create "etcd-backup-$(date +%Y%m%d)" \
--include-cluster-resources=true \
--wait
echo "✅ Etcd backup initiated"
AKS Automatic Etcd Backup:
- ✅ Automatic: AKS automatically backs up etcd every 8 hours
- ✅ Retention: 30 days
- ✅ Recovery: Available via Azure support
Backup Retention Policies¶
Backup Retention Configuration:
# Velero: Backup retention policy
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: production-backup-schedule
spec:
schedule: "0 2 * * *"
template:
ttl: 720h0m0s # Keep backups for 30 days
includedNamespaces:
- atp-production
Retention Policy by Backup Type:
| Backup Type | Retention | Rationale |
|---|---|---|
| Daily Scheduled | 30 days | Standard retention |
| Weekly Full | 90 days | Long-term retention |
| Monthly Full | 365 days | Compliance (1 year) |
| Pre-Deployment | 7 days | Short-term rollback |
| Manual Backup | 30 days | On-demand backups |
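The weekly and monthly tiers from the matrix above can be implemented as additional Velero schedules with matching TTLs; a sketch (schedule names are illustrative):
# Weekly full backup, retained 90 days (2160h)
velero schedule create weekly-full-production \
  --schedule "0 3 * * 0" \
  --include-namespaces atp-production \
  --ttl 2160h
# Monthly full backup, retained 365 days (8760h)
velero schedule create monthly-full-production \
  --schedule "0 4 1 * *" \
  --include-namespaces atp-production \
  --ttl 8760h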
Backup Retention Cleanup:
#!/bin/bash
# scripts/cleanup-old-backups.sh
# Velero's garbage collector deletes backups automatically once their TTL expires.
# As a safety net, force-delete any backups older than 30 days:
CUTOFF=$(date -d '30 days ago' +%s)
velero backup get --output json | \
jq -r --argjson cutoff "${CUTOFF}" \
'.items[] | select((.metadata.creationTimestamp | fromdateiso8601) < $cutoff) | .metadata.name' | \
while read -r BACKUP; do
velero backup delete "${BACKUP}" --confirm
done
echo "🧹 Cleaned up backups older than 30 days"
Disaster Recovery Scenarios¶
Cluster Failure¶
Cluster Failure Recovery:
graph TB
subgraph "Disaster: Cluster Failure"
FAIL[AKS Cluster<br/>Failure]
end
subgraph "Recovery Process"
DETECT[Detect Failure]
ASSESS[Assess Impact]
RECREATE[Recreate Cluster<br/>from GitOps]
RESTORE[Restore Data<br/>from Velero]
VALIDATE[Validate Recovery]
end
subgraph "Backup Sources"
GIT[Git Repository<br/>Manifests]
VELERO[Velero Backups<br/>State]
ACR[ACR Images]
end
FAIL --> DETECT
DETECT --> ASSESS
ASSESS --> RECREATE
RECREATE --> GIT
RECREATE --> RESTORE
RESTORE --> VELERO
RESTORE --> VALIDATE
RESTORE --> ACR
style FAIL fill:#FF6B6B
style RECREATE fill:#90EE90
style RESTORE fill:#90EE90
Cluster Failure Recovery Procedure:
#!/bin/bash
# scripts/recover-cluster-failure.sh
CLUSTER_NAME="${1:-atp-prod-eus-aks}"
RESOURCE_GROUP="${2:-atp-production-rg}"
echo "🚨 Cluster Failure Recovery"
echo "Cluster: ${CLUSTER_NAME}"
echo "Resource Group: ${RESOURCE_GROUP}"
# Step 1: Verify cluster is actually down
if az aks show --resource-group "${RESOURCE_GROUP}" --name "${CLUSTER_NAME}" \
--query "provisioningState" -o tsv | grep -q "Succeeded"; then
echo "⚠️ Cluster appears to be running. Verify the issue."
exit 1
fi
# Step 2: Recreate cluster from Pulumi
echo "🔄 Step 2: Recreating AKS cluster from Pulumi..."
cd infrastructure/
pulumi stack select production
pulumi up --yes
# Step 3: Wait for cluster to be ready
echo "⏳ Step 3: Waiting for cluster to be ready..."
az aks wait --name "${CLUSTER_NAME}" --resource-group "${RESOURCE_GROUP}" \
--created --timeout 1800 # seconds; cluster creation takes far longer than 30
# Step 4: Bootstrap FluxCD
echo "🔄 Step 4: Bootstrapping FluxCD..."
az aks get-credentials --resource-group "${RESOURCE_GROUP}" --name "${CLUSTER_NAME}"
flux bootstrap git \
--url=https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops \
--branch=production \
--path=./clusters/production
# Step 5: Restore from the most recent completed Velero backup
echo "🔄 Step 5: Restoring from Velero backup..."
LATEST_BACKUP=$(velero backup get --output json | \
jq -r '[.items[] | select(.status.phase == "Completed")] | sort_by(.metadata.creationTimestamp) | last | .metadata.name')
velero restore create "restore-${CLUSTER_NAME}-$(date +%Y%m%d)" \
--from-backup "${LATEST_BACKUP}" \
--wait
# Step 6: Validate recovery
echo "🔍 Step 6: Validating recovery..."
./scripts/validate-cluster-health.sh
echo "✅ Cluster recovery complete"
Region Outage¶
Multi-Region Recovery:
#!/bin/bash
# scripts/recover-region-outage.sh
PRIMARY_REGION="${1:-eastus}"
SECONDARY_REGION="${2:-westeurope}"
echo "🚨 Region Outage Recovery"
echo "Primary Region: ${PRIMARY_REGION} (DOWN)"
echo "Secondary Region: ${SECONDARY_REGION} (DR)"
# Step 1: Failover traffic to secondary region
echo "🔄 Step 1: Failing over traffic to ${SECONDARY_REGION}..."
az network front-door backend-pool update \
--resource-group atp-production-rg \
--front-door-name atp-frontdoor \
--name primary-eus \
--backend-pool-parameters enabled=false
az network front-door backend-pool update \
--resource-group atp-production-rg \
--front-door-name atp-frontdoor \
--name secondary-weu \
--backend-pool-parameters enabled=true priority=1
# Step 2: Promote the geo-secondary database to primary
echo "🔄 Step 2: Promoting secondary database..."
az sql db replica set-primary \
--resource-group atp-production-rg \
--server atp-prod-sql-server-weu \
--name atp-prod-db
# Step 3: Scale up secondary cluster
echo "🔄 Step 3: Scaling up secondary cluster..."
az aks scale \
--resource-group atp-production-rg \
--name atp-prod-weu-aks \
--node-count 10
# Step 4: Validate failover
echo "🔍 Step 4: Validating failover..."
./scripts/validate-failover.sh "${SECONDARY_REGION}"
echo "✅ Region failover complete"
Data Corruption¶
Data Corruption Recovery:
#!/bin/bash
# scripts/recover-data-corruption.sh
ENVIRONMENT="${1:-production}"
CORRUPTION_TIME="${2}" # ISO timestamp of when corruption occurred
echo "🚨 Data Corruption Recovery"
echo "Environment: ${ENVIRONMENT}"
echo "Corruption Detected At: ${CORRUPTION_TIME}"
# Step 1: Find backup before corruption
echo "🔍 Step 1: Finding backup before corruption..."
TARGET_BACKUP=$(velero backup get --output json | \
jq -r --arg time "${CORRUPTION_TIME}" \
'[.items[] | select(.status.phase == "Completed") | select(.metadata.creationTimestamp < $time)] | sort_by(.metadata.creationTimestamp) | last | .metadata.name // empty')
if [ -z "${TARGET_BACKUP}" ]; then
echo "❌ No backup found before corruption time"
exit 1
fi
echo "📦 Target backup: ${TARGET_BACKUP}"
# Step 2: Stop application to prevent further corruption
echo "🛑 Step 2: Stopping application..."
kubectl scale deployment --all --replicas=0 -n "atp-${ENVIRONMENT}"
# Step 3: Restore from backup
echo "🔄 Step 3: Restoring from backup..."
velero restore create "restore-corruption-$(date +%Y%m%d)" \
--from-backup "${TARGET_BACKUP}" \
--include-namespaces "atp-${ENVIRONMENT}" \
--wait
# Step 4: Validate data integrity
echo "🔍 Step 4: Validating data integrity..."
./scripts/validate-data-integrity.sh "${ENVIRONMENT}"
# Step 5: Restart application (FluxCD then reconciles replica counts
# back to the Git-declared values)
echo "▶️ Step 5: Restarting application..."
kubectl scale deployment --all --replicas=5 -n "atp-${ENVIRONMENT}"
echo "✅ Data corruption recovery complete"
Complete Platform Loss¶
Complete Platform Recovery:
#!/bin/bash
# scripts/recover-complete-platform-loss.sh
echo "🚨 Complete Platform Loss Recovery"
echo "This procedure recreates the entire ATP platform from GitOps"
# Step 1: Verify GitOps repository is accessible
echo "🔍 Step 1: Verifying GitOps repository..."
if ! git ls-remote https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops >/dev/null 2>&1; then
echo "❌ GitOps repository not accessible"
exit 1
fi
# Step 2: Recreate infrastructure from Pulumi
echo "🔄 Step 2: Recreating infrastructure..."
cd infrastructure/
pulumi stack select production
pulumi up --yes
# Step 3: Create AKS clusters
echo "🔄 Step 3: Creating AKS clusters..."
./scripts/create-aks-clusters.sh production
# Step 4: Bootstrap FluxCD on all clusters
echo "🔄 Step 4: Bootstrapping FluxCD..."
for CLUSTER in atp-prod-eus-aks atp-prod-weu-aks; do
echo " Bootstrapping ${CLUSTER}..."
az aks get-credentials --resource-group atp-production-rg --name "${CLUSTER}"
flux bootstrap git \
--url=https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops \
--branch=production \
--path=./clusters/production
done
# Step 5: Restore application state from Velero
echo "🔄 Step 5: Restoring application state..."
LATEST_BACKUP=$(velero backup get --output json | \
jq -r '[.items[] | select(.status.phase == "Completed")] | sort_by(.metadata.creationTimestamp) | last | .metadata.name')
velero restore create "restore-platform-$(date +%Y%m%d)" \
--from-backup "${LATEST_BACKUP}" \
--wait
# Step 6: Validate platform
echo "🔍 Step 6: Validating platform recovery..."
./scripts/validate-platform.sh
echo "✅ Complete platform recovery complete"
RTO/RPO Targets Per Environment¶
RTO/RPO Targets Matrix:
| Environment | RTO (Recovery Time Objective) | RPO (Recovery Point Objective) | Rationale |
|---|---|---|---|
| Dev | 4 hours | 24 hours | Lower priority, acceptable downtime |
| Test | 2 hours | 12 hours | Moderate priority, faster recovery needed |
| Staging | 1 hour | 4 hours | Production-like, important for validation |
| Production | 30 minutes | 1 hour | Critical, minimal downtime required |
RTO/RPO Validation:
#!/bin/bash
# scripts/validate-rto-rpo.sh
ENVIRONMENT="${1:-production}"
echo "📊 RTO/RPO Validation for ${ENVIRONMENT}"
# Get target RTO/RPO from matrix
case "${ENVIRONMENT}" in
"dev") TARGET_RTO=14400 TARGET_RPO=86400 ;; # 4h / 24h
"test") TARGET_RTO=7200 TARGET_RPO=43200 ;; # 2h / 12h
"staging") TARGET_RTO=3600 TARGET_RPO=14400 ;; # 1h / 4h
"production") TARGET_RTO=1800 TARGET_RPO=3600 ;; # 30m / 1h
esac
echo "Target RTO: ${TARGET_RTO} seconds ($(($TARGET_RTO / 60)) minutes)"
echo "Target RPO: ${TARGET_RPO} seconds ($(($TARGET_RPO / 60)) minutes)"
# Execute the recovery procedure and measure elapsed time
# (run this drill against a non-production or dedicated drill cluster)
START_TIME=$(date +%s)
./scripts/recover-cluster-failure.sh
END_TIME=$(date +%s)
ACTUAL_RTO=$((END_TIME - START_TIME))
# Get latest backup timestamp
# Get the most recent completed backup's timestamp
LATEST_BACKUP_TIME=$(velero backup get --output json | \
jq -r '[.items[] | select(.status.phase == "Completed")] | sort_by(.metadata.creationTimestamp) | last | .metadata.creationTimestamp' | \
xargs -I {} date -d {} +%s)
CURRENT_TIME=$(date +%s)
ACTUAL_RPO=$((CURRENT_TIME - LATEST_BACKUP_TIME))
# Validate
if [ ${ACTUAL_RTO} -le ${TARGET_RTO} ]; then
echo "✅ RTO Met: ${ACTUAL_RTO}s <= ${TARGET_RTO}s"
else
echo "❌ RTO Exceeded: ${ACTUAL_RTO}s > ${TARGET_RTO}s"
fi
if [ ${ACTUAL_RPO} -le ${TARGET_RPO} ]; then
echo "✅ RPO Met: ${ACTUAL_RPO}s <= ${TARGET_RPO}s"
else
echo "❌ RPO Exceeded: ${ACTUAL_RPO}s > ${TARGET_RPO}s"
fi
DR Testing and Drills¶
Quarterly DR Drills for Production¶
DR Drill Schedule:
| Frequency | Environment | Drill Type | Rationale |
|---|---|---|---|
| Quarterly | Production | Full DR drill | Validate production recovery procedures |
| Monthly | Staging | Partial DR drill | Test recovery procedures in production-like environment |
| Bi-weekly | Test | Automated DR test | Continuous validation |
Quarterly DR Drill Plan:
#!/bin/bash
# scripts/dr-drill-production.sh
DRILL_DATE="${1:-$(date +%Y%m%d)}"
SCENARIO="${2:-cluster-failure}" # cluster-failure, region-outage, data-corruption
echo "🎯 Quarterly DR Drill - Production"
echo "Date: ${DRILL_DATE}"
echo "Scenario: ${SCENARIO}"
# Pre-drill checklist
echo "📋 Pre-Drill Checklist:"
echo " [ ] Notify stakeholders"
echo " [ ] Backup current state"
echo " [ ] Prepare recovery scripts"
echo " [ ] Verify backup availability"
echo " [ ] Document baseline metrics"
# Execute drill scenario
case "${SCENARIO}" in
"cluster-failure")
echo "🔄 Executing cluster failure drill..."
./scripts/dr-drill-cluster-failure.sh
;;
"region-outage")
echo "🔄 Executing region outage drill..."
./scripts/dr-drill-region-outage.sh
;;
"data-corruption")
echo "🔄 Executing data corruption drill..."
./scripts/dr-drill-data-corruption.sh
;;
esac
# Post-drill validation
echo "🔍 Post-Drill Validation:"
./scripts/validate-dr-drill.sh
# Generate drill report
echo "📝 Generating drill report..."
./scripts/generate-dr-drill-report.sh "${DRILL_DATE}" "${SCENARIO}"
echo "✅ DR Drill complete"
Drill Scenarios and Checklists¶
DR Drill Scenarios:
| Scenario | Description | Recovery Procedure | Frequency |
|---|---|---|---|
| Cluster Failure | Complete AKS cluster failure | Recreate cluster, restore from Velero | Quarterly |
| Region Outage | Primary region unavailable | Failover to secondary region | Quarterly |
| Data Corruption | Database corruption detected | Restore from point-in-time backup | Quarterly |
| Network Isolation | Network connectivity issues | Route traffic via secondary path | Monthly |
| Storage Failure | PersistentVolume failures | Restore from volume snapshots (see the sketch below) | Monthly |
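The monthly storage-failure drill can be exercised by restoring a PVC from its most recent CSI snapshot; a minimal sketch, reusing the VolumeSnapshot example from the backup section (PVC name and size are illustrative):
# List available snapshots in the namespace
kubectl get volumesnapshot -n atp-production
# Restore a new PVC from a snapshot via the CSI dataSource field
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-atp-stateful-service-0-restored
  namespace: atp-production
spec:
  storageClassName: managed-premium
  accessModes: ["ReadWriteOnce"]
  dataSource:
    name: atp-stateful-data-snapshot-20240115
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  resources:
    requests:
      storage: 8Gi
EOF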
Cluster Failure Drill Checklist:
## DR Drill: Cluster Failure
### Pre-Drill
- [ ] Schedule drill with stakeholders
- [ ] Create backup before drill
- [ ] Document baseline metrics
- [ ] Notify on-call team
### Drill Execution
- [ ] Simulate cluster failure (scale cluster to 0 nodes)
- [ ] Measure detection time
- [ ] Execute recovery procedure
- [ ] Recreate cluster from Pulumi
- [ ] Bootstrap FluxCD
- [ ] Restore from Velero backup
- [ ] Measure recovery time (RTO)
- [ ] Validate application health
- [ ] Verify data integrity (RPO)
### Post-Drill
- [ ] Restore cluster to normal state
- [ ] Document actual RTO/RPO
- [ ] Identify improvement opportunities
- [ ] Update runbooks
- [ ] Generate drill report
Region Outage Drill:
#!/bin/bash
# scripts/dr-drill-region-outage.sh
echo "🎯 DR Drill: Region Outage"
# Simulate region outage (disable primary region)
echo "🔄 Simulating region outage..."
az network front-door backend-pool update \
--resource-group atp-production-rg \
--front-door-name atp-frontdoor \
--name primary-eus \
--backend-pool-parameters enabled=false
# Execute failover and measure time
echo "🔄 Executing failover..."
FAILOVER_START=$(date +%s)
./scripts/recover-region-outage.sh eastus westeurope
FAILOVER_END=$(date +%s)
FAILOVER_TIME=$((FAILOVER_END - FAILOVER_START))
echo "⏱️ Failover time: ${FAILOVER_TIME} seconds"
# Validate
./scripts/validate-failover.sh westeurope
# Restore (post-drill)
echo "🔄 Restoring primary region..."
az network front-door backend-pool update \
--resource-group atp-production-rg \
--front-door-name atp-frontdoor \
--name primary-eus \
--backend-pool-parameters enabled=true priority=1
echo "✅ DR Drill complete"
Drill Report and Improvements¶
DR Drill Report Template:
# DR Drill Report
## Drill Information
- **Date**: 2024-01-15
- **Scenario**: Cluster Failure
- **Environment**: Production
- **Duration**: 45 minutes
## Objectives Met
- [x] RTO Target: 30 minutes (Actual: 28 minutes) ✅
- [x] RPO Target: 1 hour (Actual: 45 minutes) ✅
- [x] All services recovered successfully ✅
## Issues Identified
1. Velero restore took longer than expected (15 minutes)
2. Database restore required manual intervention
## Improvements
1. Optimize Velero restore process
2. Automate database restore procedure
3. Update runbooks with lessons learned
## Action Items
- [ ] Update recovery scripts
- [ ] Improve backup frequency
- [ ] Add automated validation steps
Generate DR Drill Report:
#!/bin/bash
# scripts/generate-dr-drill-report.sh
DRILL_DATE="${1}"
SCENARIO="${2}"
REPORT_FILE="dr-drill-report-${DRILL_DATE}-${SCENARIO}.md"
echo "📝 Generating DR Drill Report..."
cat > "${REPORT_FILE}" <<EOF
# DR Drill Report
**Date**: ${DRILL_DATE}
**Scenario**: ${SCENARIO}
**Environment**: Production
## Results
### RTO/RPO Metrics
- **Target RTO**: 30 minutes
- **Actual RTO**: $(./scripts/get-actual-rto.sh)
- **Target RPO**: 1 hour
- **Actual RPO**: $(./scripts/get-actual-rpo.sh)
### Recovery Steps
1. $(./scripts/get-recovery-step.sh 1)
2. $(./scripts/get-recovery-step.sh 2)
3. $(./scripts/get-recovery-step.sh 3)
## Lessons Learned
$(./scripts/get-drill-lessons.sh)
## Action Items
$(./scripts/get-drill-action-items.sh)
EOF
echo "✅ Report generated: ${REPORT_FILE}"
Lessons Learned Process¶
Lessons Learned Template:
#!/bin/bash
# scripts/capture-dr-drill-lessons.sh
DRILL_DATE="${1}"
SCENARIO="${2}"
echo "📚 Capturing Lessons Learned from DR Drill..."
cat >> "dr-lessons-learned.md" <<EOF
## DR Drill: ${SCENARIO} - ${DRILL_DATE}
### What Went Well
- Automated cluster recreation from Pulumi worked seamlessly
- FluxCD bootstrap completed quickly
- Application recovery was faster than expected
### What Could Be Improved
- Velero restore process needs optimization
- Database restore requires more automation
- Communication during drill could be better
### Action Items
1. [ ] Optimize Velero restore scripts
2. [ ] Automate database restore procedure
3. [ ] Update incident response runbook
4. [ ] Schedule follow-up drill in 3 months
### Updated Procedures
- Recovery procedure updated: ./scripts/recover-cluster-failure.sh
- Runbook updated: docs/operations/disaster-recovery.md
---
EOF
echo "✅ Lessons learned captured"
Incident Response Integration¶
Automated Rollback on Critical Alerts¶
Automated Rollback Trigger:
# PrometheusRule: Trigger automated rollback on critical alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: auto-rollback-trigger
namespace: monitoring
spec:
groups:
- name: auto-rollback
rules:
- alert: CriticalErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.10 # 10% error rate
for: 2m
labels:
severity: critical
auto-rollback: "true"
annotations:
summary: "Critical error rate detected - triggering automated rollback"
description: "Error rate: {{ $value | humanizePercentage }}"
Automated Rollback Webhook:
# AlertManager: Configure webhook for automated rollback
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
namespace: monitoring
data:
alertmanager.yml: |
global:
resolve_timeout: 5m
route:
receiver: 'default'
routes:
- match:
auto-rollback: "true"
receiver: 'auto-rollback'
receivers:
- name: 'auto-rollback'
webhook_configs:
- url: 'http://auto-rollback-service.monitoring:8080/rollback'
send_resolved: false
Auto-Rollback Service:
// C#: Auto-rollback service (sketch; Alert model and helper methods omitted)
[ApiController]
public class AutoRollbackController : ControllerBase
{
// Absolute route template so the endpoint matches the AlertManager webhook URL (/rollback)
[HttpPost("/rollback")]
public async Task<IActionResult> TriggerRollback([FromBody] Alert alert)
{
// Parse alert to determine service
var service = ExtractServiceFromAlert(alert);
var environment = ExtractEnvironmentFromAlert(alert);
// Check if auto-rollback is enabled for this service
if (!await IsAutoRollbackEnabled(service, environment))
{
return Ok(new { message = "Auto-rollback disabled for this service" });
}
// Execute rollback
var rollbackResult = await ExecuteRollback(service, environment);
// Notify team
await NotifyTeam($"Auto-rollback triggered for {service}: {rollbackResult.Status}");
return Ok(rollbackResult);
}
}
Incident Commander Decision Making¶
Incident Commander Decision Tree:
graph TD
START[Incident Detected] --> ASSESS{Assess Impact}
ASSESS -->|High Impact| ROLLBACK{Can Rollback?}
ASSESS -->|Low Impact| INVESTIGATE[Investigate Root Cause]
ROLLBACK -->|Yes| EXECUTE[Execute Rollback]
ROLLBACK -->|No| MITIGATE[Apply Mitigation]
EXECUTE --> VALIDATE[Validate Rollback]
VALIDATE -->|Success| MONITOR[Monitor Recovery]
VALIDATE -->|Failure| ESCALATE[Escalate to Senior]
MITIGATE --> INVESTIGATE
INVESTIGATE --> FIX[Develop Fix]
FIX --> DEPLOY[Deploy Fix]
DEPLOY --> VALIDATE
MONITOR --> CLOSE[Close Incident]
style START fill:#FF6B6B
style EXECUTE fill:#FFD700
style VALIDATE fill:#90EE90
style CLOSE fill:#90EE90
Incident Commander Decision Matrix:
| Impact | Error Rate | Decision | Action |
|---|---|---|---|
| Critical | > 10% | ✅ Immediate Rollback | Execute rollback, investigate later |
| High | 5-10% | ⚠️ Investigate + Prepare Rollback | Investigate, rollback if no fix in 15min |
| Medium | 1-5% | ⚠️ Investigate First | Investigate, rollback if worsens |
| Low | < 1% | ✅ Monitor | Monitor, no immediate action |
Rollback vs Forward Fix Decision Tree¶
Rollback vs Forward Fix Decision:
#!/bin/bash
# scripts/rollback-vs-fix-decision.sh
ERROR_RATE="${1}" # Percentage
AFFECTED_USERS="${2}" # Number of users
HAS_FIX="${3}" # yes/no - Do we have a fix ready?
echo "🤔 Rollback vs Forward Fix Decision"
echo "Error Rate: ${ERROR_RATE}%"
echo "Affected Users: ${AFFECTED_USERS}"
echo "Has Fix Ready: ${HAS_FIX}"
# Decision logic
if (( $(echo "${ERROR_RATE} > 10" | bc -l) )); then
DECISION="ROLLBACK"
REASON="Critical error rate (>10%)"
elif (( $(echo "${ERROR_RATE} > 5" | bc -l) )) && [ "${HAS_FIX}" != "yes" ]; then
DECISION="ROLLBACK"
REASON="High error rate (>5%) and no fix ready"
elif (( $(echo "${ERROR_RATE} > 5" | bc -l) )) && [ "${HAS_FIX}" == "yes" ]; then
DECISION="FORWARD_FIX"
REASON="High error rate but fix available"
elif [ "${AFFECTED_USERS}" -gt 10000 ]; then
DECISION="ROLLBACK"
REASON="Large number of affected users"
else
DECISION="FORWARD_FIX"
REASON="Low impact, proceed with fix"
fi
echo "📊 Decision: ${DECISION}"
echo "📝 Reason: ${REASON}"
case "${DECISION}" in
"ROLLBACK")
echo "🔄 Executing rollback..."
./scripts/rollback-simple.sh
;;
"FORWARD_FIX")
echo "🔧 Proceeding with forward fix..."
./scripts/deploy-fix.sh
;;
esac
Post-Incident Review¶
Post-Incident Review Template:
# Post-Incident Review
## Incident Summary
- **Incident ID**: INC-2024-001
- **Date**: 2024-01-15
- **Duration**: 45 minutes
- **Impact**: 5% of users affected
- **Resolution**: Rollback to previous version
## Timeline
- 10:00 AM: Incident detected (error rate spike)
- 10:05 AM: Incident declared, on-call engaged
- 10:10 AM: Root cause identified (deployment issue)
- 10:15 AM: Rollback decision made
- 10:20 AM: Rollback executed
- 10:30 AM: Rollback validated, services restored
- 10:45 AM: Incident resolved
## Root Cause
Deployment of v1.2.3 introduced memory leak causing pod restarts and increased error rate.
## Actions Taken
1. Rolled back to v1.2.2
2. Validated service health
3. Investigated root cause
## Lessons Learned
- Need better pre-deployment testing for memory issues
- Rollback procedure worked well (RTO: 20 minutes)
## Action Items
- [ ] Add memory leak detection to CI pipeline
- [ ] Improve error rate monitoring
- [ ] Update deployment procedures
Post-Incident Review Script:
#!/bin/bash
# scripts/generate-post-incident-review.sh
INCIDENT_ID="${1}"
INCIDENT_DATE="${2}"
echo "📝 Generating Post-Incident Review..."
cat > "post-incident-review-${INCIDENT_ID}.md" <<EOF
# Post-Incident Review: ${INCIDENT_ID}
**Date**: ${INCIDENT_DATE}
**Status**: Resolved
## Timeline
$(./scripts/get-incident-timeline.sh "${INCIDENT_ID}")
## Root Cause
$(./scripts/get-root-cause.sh "${INCIDENT_ID}")
## Impact
- Users Affected: $(./scripts/get-affected-users.sh "${INCIDENT_ID}")
- Error Rate: $(./scripts/get-max-error-rate.sh "${INCIDENT_ID}")%
- Duration: $(./scripts/get-incident-duration.sh "${INCIDENT_ID}")
## Resolution
$(./scripts/get-resolution.sh "${INCIDENT_ID}")
## Lessons Learned
$(./scripts/get-lessons-learned.sh "${INCIDENT_ID}")
## Action Items
$(./scripts/get-action-items.sh "${INCIDENT_ID}")
EOF
echo "✅ Post-incident review generated"
Summary: Rollback & Disaster Recovery¶
- Git-Based Rollback: Simple rollback (git revert), complex rollback (git reset), rollback to specific commit, rollback to specific tag
- Progressive Rollback: Rolling back one service at a time, rollback with canary (gradual revert), validation at each rollback step
- Application State Recovery: Handling database schema changes, data migration rollback, stateful application considerations
- Database Migration Rollback: Forward-only migrations (preferred), rollback scripts (if necessary), data loss prevention, coordinating app rollback with DB rollback
- FluxCD Rollback: Reverting Kustomization, reverting HelmRelease, suspend and resume reconciliation, manual intervention procedures
- Azure Backup Integration: Backing up AKS resources (Velero), PersistentVolume snapshots, Etcd backup, backup retention policies
- Disaster Recovery Scenarios: Cluster failure, region outage, data corruption, complete platform loss
- RTO/RPO Targets: Dev (RTO 4h, RPO 24h), Test (RTO 2h, RPO 12h), Staging (RTO 1h, RPO 4h), Production (RTO 30m, RPO 1h)
- DR Testing and Drills: Quarterly DR drills for production, drill scenarios and checklists, drill report and improvements, lessons learned process
- Incident Response Integration: Automated rollback on critical alerts, incident commander decision making, rollback vs forward fix decision tree, post-incident review
Helm Chart Development for ATP Services¶
Purpose: Define the standards, best practices, and procedures for developing, testing, versioning, and publishing Helm charts for ATP microservices, ensuring consistent deployment patterns, maintainable chart structures, and reliable application deployments across all environments.
Helm Chart Structure¶
Chart.yaml: Metadata, Version, Dependencies¶
Chart.yaml for ATP Service:
# charts/atp-ingestion/Chart.yaml
apiVersion: v2
name: atp-ingestion
description: A Helm chart for ATP Ingestion Service - Collects and processes audit trail events
type: application
version: 1.2.3 # Chart version (SemVer)
appVersion: "1.2.3" # Application version (from source code)
home: https://github.com/ConnectSoft/ATP
sources:
- https://github.com/ConnectSoft/ATP/ConnectSoft.Audit.Ingestion
maintainers:
- name: ATP Team
email: atp-team@connectsoft.example
keywords:
- audit-trail
- atp
- ingestion
- microservice
annotations:
category: Microservice
architecture: microservices
dependencies:
- name: redis
version: "17.15.0"
repository: "https://charts.bitnami.com/bitnami"
condition: redis.enabled
- name: postgresql
version: "12.1.9"
repository: "https://charts.bitnami.com/bitnami"
condition: postgresql.enabled
Chart Metadata Standards:
| Field | Required | Description | ATP Convention |
|---|---|---|---|
| apiVersion | ✅ Yes | Chart API version | v2 (Helm 3+) |
| name | ✅ Yes | Chart name | atp-{service-name} (kebab-case) |
| version | ✅ Yes | Chart version | SemVer (MAJOR.MINOR.PATCH) |
| appVersion | ✅ Yes | Application version | Matches source code version |
| description | ✅ Yes | Chart description | One-line service description |
| type | ⚠️ Recommended | Chart type | application (default) |
| keywords | ⚠️ Recommended | Search keywords | Include audit-trail, atp, service name |
| maintainers | ⚠️ Recommended | Maintainer info | ATP Team contact |
| dependencies | ⚠️ Optional | Chart dependencies | External charts (Redis, PostgreSQL) |
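These conventions can be enforced with a small CI check; a sketch assuming mikefarah yq v4 is on the agent (script name and checks are illustrative):
#!/bin/bash
# scripts/check-chart-conventions.sh (hypothetical helper)
CHART_FILE="${1:-charts/atp-ingestion/Chart.yaml}"
NAME=$(yq '.name' "${CHART_FILE}")
VERSION=$(yq '.version' "${CHART_FILE}")
APP_VERSION=$(yq '.appVersion' "${CHART_FILE}")
[[ "${NAME}" == atp-* ]] || { echo "❌ name must follow atp-{service-name}"; exit 1; }
[[ "${VERSION}" =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]] || { echo "❌ version must be SemVer"; exit 1; }
[[ -n "${APP_VERSION}" && "${APP_VERSION}" != "null" ]] || { echo "❌ appVersion is required"; exit 1; }
echo "✅ Chart metadata conventions satisfied"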
values.yaml: Default Values¶
Complete values.yaml:
# charts/atp-ingestion/values.yaml
# Default values for atp-ingestion
# This is a YAML-formatted file
# Application Configuration
replicaCount: 3
image:
repository: connectsoft.azurecr.io/atp/ingestion
pullPolicy: IfNotPresent
tag: "" # Override via --set image.tag=v1.2.3
imagePullSecrets:
- name: acr-pull-secret
nameOverride: ""
fullnameOverride: ""
serviceAccount:
create: true
annotations: {}
name: ""
podAnnotations: {}
podSecurityContext:
fsGroup: 2000
runAsNonRoot: true
runAsUser: 1000
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
service:
type: ClusterIP
port: 80
targetPort: 8080
ingress:
enabled: false
className: "nginx"
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
hosts:
- host: ingestion.atp.connectsoft.example
paths:
- path: /
pathType: Prefix
tls: []
resources:
limits:
cpu: 2000m
memory: 2Gi
requests:
cpu: 500m
memory: 1Gi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
nodeSelector: {}
tolerations: []
affinity: {}
# External Secrets
externalSecrets:
enabled: true
secrets:
- name: sql-connection-string
keyVaultName: atp-prod-kv
secretName: connection-strings/atp-db/production
# Database Configuration
database:
host: ""
port: 5432
name: atp_production
schema: public
# Redis Configuration
redis:
enabled: false # Use managed Redis
host: "" # External Redis host
port: 6379
# Environment Variables
env:
- name: ASPNETCORE_ENVIRONMENT
value: "Production"
- name: Logging__LogLevel__Default
value: "Information"
envFrom:
- secretRef:
name: app-secrets
# Health Checks
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
startupProbe:
httpGet:
path: /health/startup
port: 8080
initialDelaySeconds: 0
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 30
# Pod Disruption Budget
podDisruptionBudget:
enabled: true
minAvailable: 2
# Network Policy
networkPolicy:
enabled: true
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
egress:
- to:
- namespaceSelector:
matchLabels:
name: kube-system
ports:
- protocol: UDP
port: 53
# Service Bus Configuration
serviceBus:
connectionString: "" # From ExternalSecret
queueName: audit-events
# Monitoring
monitoring:
enabled: true
serviceMonitor:
enabled: true
interval: 30s
scrapeTimeout: 10s
templates/: Resource Templates¶
Helm Chart Directory Structure:
charts/atp-ingestion/
├── Chart.yaml # Chart metadata
├── values.yaml # Default values
├── values-dev.yaml # Dev environment overrides
├── values-test.yaml # Test environment overrides
├── values-staging.yaml # Staging environment overrides
├── values-production.yaml # Production environment overrides
├── .helmignore # Files to exclude
├── README.md # Chart documentation
├── charts/ # Sub-charts (dependencies)
│ └── .gitkeep
├── templates/ # Kubernetes resource templates
│ ├── _helpers.tpl # Named templates and helpers
│ ├── deployment.yaml # Deployment resource
│ ├── service.yaml # Service resource
│ ├── ingress.yaml # Ingress resource (conditional)
│ ├── serviceaccount.yaml # ServiceAccount resource
│ ├── configmap.yaml # ConfigMap resource
│ ├── networkpolicy.yaml # NetworkPolicy resource (conditional)
│ ├── poddisruptionbudget.yaml # PDB resource (conditional)
│ ├── servicemonitor.yaml # ServiceMonitor for Prometheus (conditional)
│ ├── externalsecret.yaml # ExternalSecret resource (conditional)
│ ├── NOTES.txt # Post-install notes
│ ├── tests/ # Helm test templates
│ │ ├── test-connection.yaml
│ │ └── test-health.yaml
│ └── hooks/ # Helm hooks
│ ├── pre-install-migration.yaml
│ └── post-install-verification.yaml
└── schemas/ # JSON Schema validation
└── values.schema.json
Chart Structure Diagram:
graph TB
subgraph "Helm Chart: atp-ingestion"
CHART[Chart.yaml<br/>Metadata & Dependencies]
VALUES[values.yaml<br/>Default Configuration]
VALUES_DEV[values-dev.yaml<br/>Dev Overrides]
VALUES_PROD[values-production.yaml<br/>Prod Overrides]
subgraph "templates/"
HELPERS[_helpers.tpl<br/>Named Templates]
DEPLOY[deployment.yaml]
SVC[service.yaml]
INGRESS[ingress.yaml]
SA[serviceaccount.yaml]
NETPOL[networkpolicy.yaml]
subgraph "tests/"
TEST_CONN[test-connection.yaml]
TEST_HEALTH[test-health.yaml]
end
subgraph "hooks/"
HOOK_PRE[pre-install-migration.yaml]
HOOK_POST[post-install-verification.yaml]
end
end
subgraph "charts/"
DEP_REDIS[redis/]
DEP_POSTGRES[postgresql/]
end
end
CHART --> DEPLOY
VALUES --> DEPLOY
VALUES_DEV --> DEPLOY
VALUES_PROD --> DEPLOY
HELPERS --> DEPLOY
DEPLOY --> SVC
SVC --> INGRESS
CHART --> DEP_REDIS
CHART --> DEP_POSTGRES
style CHART fill:#FFE5B4
style VALUES fill:#FFE5B4
style HELPERS fill:#90EE90
charts/: Sub-charts (Dependencies)¶
Sub-chart Dependencies:
# Chart.yaml dependencies section
dependencies:
- name: redis
version: "17.15.0"
repository: "https://charts.bitnami.com/bitnami"
condition: redis.enabled
tags:
- cache
- name: postgresql
version: "12.1.9"
repository: "https://charts.bitnami.com/bitnaml/bitnami"
condition: postgresql.enabled
tags:
- database
Sub-chart Values Override:
# values.yaml - Sub-chart value overrides
redis:
enabled: false # Use managed Redis in production
architecture: standalone
auth:
enabled: true
master:
persistence:
enabled: true
size: 8Gi
resources:
requests:
memory: 256Mi
cpu: 250m
postgresql:
enabled: false # Use managed PostgreSQL
auth:
database: atp_production
username: atp_user
primary:
persistence:
enabled: true
size: 20Gi
resources:
requests:
memory: 512Mi
cpu: 500m
Managing Dependencies:
# Update dependencies
helm dependency update charts/atp-ingestion/
# Build dependencies
helm dependency build charts/atp-ingestion/
# List dependencies
helm dependency list charts/atp-ingestion/
.helmignore: Files to Exclude¶
.helmignore File:
# charts/atp-ingestion/.helmignore
# Patterns to ignore when building packages
# Git
.git/
.gitignore
.gitattributes
# CI/CD
.azuredevops/
.github/
.gitlab-ci.yml
# IDE
.vscode/
.idea/
*.swp
*.swo
*~
# Documentation (keep README.md)
docs/
*.md
!README.md
# Tests (not part of chart)
tests/
*.test.go
# Build artifacts
bin/
obj/
*.dll
*.exe
# Dependencies (managed via Chart.yaml)
charts/*.tgz
# Temporary files
*.tmp
*.log
.DS_Store
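To confirm the ignore patterns take effect, package the chart and list the archive contents (the version matches the Chart.yaml example above):
# Package the chart, then inspect what was actually included
helm package charts/atp-ingestion/ --destination /tmp/
tar -tzf /tmp/atp-ingestion-1.2.3.tgz | head -20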
Template Best Practices¶
Named Templates and Helpers (_helpers.tpl)¶
_helpers.tpl:
# templates/_helpers.tpl
{{/*
Expand the name of the chart.
*/}}
{{- define "atp-ingestion.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Create a default fully qualified app name.
*/}}
{{- define "atp-ingestion.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}
{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "atp-ingestion.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Common labels
*/}}
{{- define "atp-ingestion.labels" -}}
helm.sh/chart: {{ include "atp-ingestion.chart" . }}
{{ include "atp-ingestion.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
app.kubernetes.io/part-of: atp-platform
{{- end }}
{{/*
Selector labels
*/}}
{{- define "atp-ingestion.selectorLabels" -}}
app.kubernetes.io/name: {{ include "atp-ingestion.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
{{/*
Create the name of the service account to use
*/}}
{{- define "atp-ingestion.serviceAccountName" -}}
{{- if .Values.serviceAccount.create }}
{{- default (include "atp-ingestion.fullname" .) .Values.serviceAccount.name }}
{{- else }}
{{- default "default" .Values.serviceAccount.name }}
{{- end }}
{{- end }}
{{/*
Image reference
*/}}
{{- define "atp-ingestion.image" -}}
{{- $tag := .Values.image.tag | default .Chart.AppVersion }}
{{- printf "%s:%s" .Values.image.repository $tag }}
{{- end }}
{{/*
Environment variables from ConfigMap
*/}}
{{- define "atp-ingestion.envFromConfigMap" -}}
{{- if .Values.envFrom }}
{{- range .Values.envFrom }}
{{- if .configMapRef }}
- configMapRef:
name: {{ .configMapRef.name }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{/*
Environment variables from Secrets
*/}}
{{- define "atp-ingestion.envFromSecret" -}}
{{- if .Values.envFrom }}
{{- range .Values.envFrom }}
{{- if .secretRef }}
- secretRef:
name: {{ .secretRef.name }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{/*
Security context
*/}}
{{- define "atp-ingestion.securityContext" -}}
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: {{ .Values.securityContext.readOnlyRootFilesystem | default true }}
runAsNonRoot: {{ .Values.securityContext.runAsNonRoot | default true }}
{{- if .Values.securityContext.runAsUser }}
runAsUser: {{ .Values.securityContext.runAsUser }}
{{- end }}
{{- end }}
{{/*
Pod security context
*/}}
{{- define "atp-ingestion.podSecurityContext" -}}
{{- if .Values.podSecurityContext }}
fsGroup: {{ .Values.podSecurityContext.fsGroup }}
runAsNonRoot: {{ .Values.podSecurityContext.runAsNonRoot | default true }}
{{- if .Values.podSecurityContext.runAsUser }}
runAsUser: {{ .Values.podSecurityContext.runAsUser }}
{{- end }}
{{- end }}
{{- end }}
{{/*
Resource requests and limits
*/}}
{{- define "atp-ingestion.resources" -}}
{{- if .Values.resources }}
{{- toYaml .Values.resources }}
{{- else }}
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
{{- end }}
{{- end }}
Template Functions (include, default, required)¶
Using Template Functions:
# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "atp-ingestion.fullname" . }}
labels:
{{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
replicas: {{ .Values.replicaCount | default 3 }}
selector:
matchLabels:
{{- include "atp-ingestion.selectorLabels" . | nindent 6 }}
template:
metadata:
annotations:
{{- with .Values.podAnnotations }}
{{- toYaml . | nindent 8 }}
{{- end }}
labels:
{{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
spec:
serviceAccountName: {{ include "atp-ingestion.serviceAccountName" . }}
securityContext:
{{- include "atp-ingestion.podSecurityContext" . | nindent 8 }}
containers:
- name: {{ .Chart.Name }}
image: "{{ include "atp-ingestion.image" . }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
securityContext:
{{- include "atp-ingestion.securityContext" . | nindent 10 }}
ports:
- name: http
containerPort: {{ .Values.service.targetPort | default 8080 }}
protocol: TCP
env:
{{- range .Values.env }}
- name: {{ .name }}
value: {{ .value | quote }}
{{- end }}
{{- include "atp-ingestion.envFromConfigMap" . | nindent 8 }}
{{- include "atp-ingestion.envFromSecret" . | nindent 8 }}
resources:
{{- include "atp-ingestion.resources" . | nindent 10 }}
livenessProbe:
{{- toYaml .Values.livenessProbe | nindent 10 }}
readinessProbe:
{{- toYaml .Values.readinessProbe | nindent 10 }}
{{- if .Values.startupProbe }}
startupProbe:
{{- toYaml .Values.startupProbe | nindent 10 }}
{{- end }}
Using required Function:
# Require critical values
image:
repository: {{ required "image.repository is required" .Values.image.repository }}
tag: {{ required "image.tag is required" .Values.image.tag }}
Using default and coalesce:
# Default values with fallback chain
replicas: {{ .Values.replicaCount | default 3 }}
namespace: {{ .Values.namespace | default .Release.Namespace }}
tag: {{ coalesce .Values.image.tag .Chart.AppVersion "latest" }}
Flow Control (if, with, range)¶
Conditional Rendering:
# templates/ingress.yaml
{{- if .Values.ingress.enabled -}}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: {{ include "atp-ingestion.fullname" . }}
labels:
{{- include "atp-ingestion.labels" . | nindent 4 }}
{{- with .Values.ingress.annotations }}
annotations:
{{- toYaml . | nindent 4 }}
{{- end }}
spec:
{{- if .Values.ingress.className }}
ingressClassName: {{ .Values.ingress.className }}
{{- end }}
{{- if .Values.ingress.tls }}
tls:
{{- range .Values.ingress.tls }}
- hosts:
{{- range .hosts }}
- {{ . | quote }}
{{- end }}
secretName: {{ .secretName }}
{{- end }}
{{- end }}
rules:
{{- range .Values.ingress.hosts }}
- host: {{ .host | quote }}
http:
paths:
{{- range .paths }}
- path: {{ .path }}
pathType: {{ .pathType }}
backend:
service:
name: {{ include "atp-ingestion.fullname" $ }}
port:
number: {{ $.Values.service.port }}
{{- end }}
{{- end }}
{{- end }}
Using with for Scoped Values:
{{- with .Values.monitoring.serviceMonitor }}
{{- if .enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: {{ include "atp-ingestion.fullname" $ }}
spec:
selector:
matchLabels:
{{- include "atp-ingestion.selectorLabels" $ | nindent 6 }}
endpoints:
- port: http
interval: {{ .interval | default "30s" }}
scrapeTimeout: {{ .scrapeTimeout | default "10s" }}
{{- end }}
{{- end }}
Variable Scoping¶
Understanding Variable Scoping:
# Scoping with $ (root context)
{{- range .Values.env }}
- name: {{ .name }}
value: {{ .value }}
# Use $ to access root context
namespace: {{ $.Release.Namespace }}
{{- end }}
# Scoping with with
{{- with .Values.resources }}
limits:
cpu: {{ .limits.cpu }}
memory: {{ .limits.memory }}
{{- end }}
# Preserving root context in nested scopes
{{- range .Values.env }}
{{- if eq .name "DATABASE_HOST" }}
{{- with $.Values.database }}
value: {{ .host }}
{{- end }}
{{- end }}
{{- end }}
Whitespace Management¶
Whitespace Control:
# Remove leading/trailing whitespace
{{- include "atp-ingestion.labels" . | nindent 4 }}
{{- if .Values.ingress.enabled -}}
# ... content ...
{{- end }}
# Trim left whitespace
{{- include "template" . }}
# Trim right whitespace
{{ include "template" . -}}
# Trim both sides
{{- include "template" . -}}
# Preserve whitespace (default)
{{ include "template" . }}
# Indent (nindent adds newline before)
{{- include "labels" . | nindent 4 }}
# Indent an arbitrary multi-line value without a leading newline
{{ .Values.script | indent 8 }}
Values File Organization¶
Hierarchical Values Structure¶
Values Hierarchy:
# Base values.yaml
replicaCount: 3
resources:
limits:
cpu: 2000m
memory: 2Gi
requests:
cpu: 500m
memory: 1Gi
# Environment-specific override (values-production.yaml)
replicaCount: 5
resources:
limits:
cpu: 4000m
memory: 4Gi
requests:
cpu: 1000m
memory: 2Gi
Values Precedence:
- User-provided values (--set, --set-file): highest precedence
- Environment-specific values files (values-production.yaml)
- Default values (values.yaml): lowest precedence (see the example below)
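A concrete example of the chain, where each layer overrides the one below it:
# values.yaml defaults < values-production.yaml < --set (highest precedence)
helm template atp-ingestion charts/atp-ingestion/ \
  -f charts/atp-ingestion/values-production.yaml \
  --set image.tag=v1.2.3 \
  --set replicaCount=7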
Environment Overrides¶
values-dev.yaml:
# charts/atp-ingestion/values-dev.yaml
replicaCount: 1
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 100m
memory: 256Mi
autoscaling:
enabled: false
env:
- name: ASPNETCORE_ENVIRONMENT
value: "Development"
- name: Logging__LogLevel__Default
value: "Debug"
ingress:
enabled: true
className: "nginx"
hosts:
- host: ingestion.dev.atp.connectsoft.example
paths:
- path: /
values-production.yaml:
# charts/atp-ingestion/values-production.yaml
replicaCount: 5
resources:
limits:
cpu: 4000m
memory: 4Gi
requests:
cpu: 1000m
memory: 2Gi
autoscaling:
enabled: true
minReplicas: 5
maxReplicas: 20
targetCPUUtilizationPercentage: 70
env:
- name: ASPNETCORE_ENVIRONMENT
value: "Production"
- name: Logging__LogLevel__Default
value: "Warning"
ingress:
enabled: true
className: "nginx"
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
hosts:
- host: ingestion.atp.connectsoft.example
paths:
- path: /
tls:
- secretName: ingestion-tls
hosts:
- ingestion.atp.connectsoft.example
Secret References (Never Plaintext)¶
External Secret Reference in Values:
# values.yaml - NEVER include plaintext secrets
externalSecrets:
enabled: true
secrets:
- name: sql-connection-string
keyVaultName: atp-prod-kv
secretName: connection-strings/atp-db/production
secretKey: connectionString
- name: redis-connection-string
keyVaultName: atp-prod-kv
secretName: cache/redis/connection-string
secretKey: connectionString
# ❌ BAD: Plaintext secret in values
# secrets:
# sqlConnectionString: "Server=..."
# ✅ GOOD: Reference to ExternalSecret
envFrom:
- secretRef:
name: app-secrets # Created by ExternalSecret operator
ExternalSecret Template:
# templates/externalsecret.yaml
{{- if .Values.externalSecrets.enabled }}
{{- range .Values.externalSecrets.secrets }}
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: {{ .name }}
namespace: {{ $.Release.Namespace }}
spec:
secretStoreRef:
name: azure-keyvault-{{ .keyVaultName }} # per-secret Key Vault name from the range scope
kind: ClusterSecretStore
target:
name: {{ .name }}
creationPolicy: Owner
data:
- secretKey: {{ .secretKey | default "value" }}
remoteRef:
key: {{ .secretName }}
{{- end }}
{{- end }}
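After a release, the sync status of the generated ExternalSecrets and their target Secrets can be verified:
# Check ExternalSecret sync status and the Secrets they created
kubectl get externalsecrets -n atp-production
kubectl get secret sql-connection-string -n atp-production
# Inspect sync conditions if a Secret is not appearing
kubectl describe externalsecret sql-connection-string -n atp-production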
Documentation in values.yaml Comments¶
Documented values.yaml:
# charts/atp-ingestion/values.yaml
# Default values for atp-ingestion Helm chart
# -- Number of replicas
replicaCount: 3
# -- Image configuration
image:
# -- Image repository
repository: connectsoft.azurecr.io/atp/ingestion
# -- Image pull policy (IfNotPresent, Always, Never)
pullPolicy: IfNotPresent
# -- Image tag (defaults to appVersion)
tag: ""
# -- Service account configuration
serviceAccount:
# -- Create service account
create: true
# -- Service account annotations
annotations: {}
# -- Service account name (defaults to fullname)
name: ""
# -- Resource requests and limits
resources:
limits:
# -- CPU limit (e.g., 2000m, 2)
cpu: 2000m
# -- Memory limit (e.g., 2Gi, 2048Mi)
memory: 2Gi
requests:
# -- CPU request (e.g., 500m, 0.5)
cpu: 500m
# -- Memory request (e.g., 1Gi, 1024Mi)
memory: 1Gi
# -- Horizontal Pod Autoscaler configuration
autoscaling:
# -- Enable HPA
enabled: true
# -- Minimum replicas
minReplicas: 3
# -- Maximum replicas
maxReplicas: 10
# -- Target CPU utilization percentage
targetCPUUtilizationPercentage: 70
# -- Target memory utilization percentage
targetMemoryUtilizationPercentage: 80
Chart Dependencies¶
Depending on Other Charts¶
Defining Dependencies:
# Chart.yaml
dependencies:
- name: redis
version: "17.15.0"
repository: "https://charts.bitnami.com/bitnami"
condition: cache.enabled # with an alias, values (and conditions) use the alias key
alias: cache
- name: postgresql
version: "12.1.9"
repository: "https://charts.bitnami.com/bitnami"
condition: database.enabled
alias: database
Dependency Management Workflow:
sequenceDiagram
participant Dev as Developer
participant Chart as Chart.yaml
participant Helm as Helm CLI
participant Repo as Chart Repository
Dev->>Chart: Add dependency to Chart.yaml
Dev->>Helm: helm dependency update
Helm->>Repo: Fetch dependency chart
Repo-->>Helm: Return chart.tgz
Helm->>Chart: Extract to charts/ directory
Chart-->>Dev: Dependencies ready
Sub-chart Values Override¶
Overriding Sub-chart Values:
# values.yaml - Override Redis sub-chart values
redis:
enabled: true
architecture: standalone
auth:
enabled: true
password: "" # From ExternalSecret
master:
persistence:
enabled: true
storageClass: managed-premium
size: 8Gi
resources:
requests:
memory: 256Mi
cpu: 250m
limits:
memory: 512Mi
cpu: 500m
# Override PostgreSQL sub-chart values
postgresql:
enabled: true
auth:
database: atp_production
username: atp_user
password: "" # From ExternalSecret
primary:
persistence:
enabled: true
storageClass: managed-premium
size: 20Gi
resources:
requests:
memory: 512Mi
cpu: 500m
limits:
memory: 1Gi
cpu: 1000m
Conditional Dependencies¶
Conditional Dependency Rendering:
# Chart.yaml
dependencies:
- name: redis
version: "17.15.0"
repository: "https://charts.bitnami.com/bitnami"
condition: redis.enabled
tags:
- cache
- name: postgresql
version: "12.1.9"
repository: "https://charts.bitnami.com/bitnami"
condition: postgresql.enabled
tags:
- database
# values.yaml
redis:
enabled: false # Don't install Redis (use managed)
postgresql:
enabled: false # Don't install PostgreSQL (use managed)
Enable/Disable by Tag:
# Install with cache tag only
helm install my-release ./chart --set tags.cache=true
# Install with database tag only
helm install my-release ./chart --set tags.database=true
Dependency Management Commands¶
Dependency Management:
# Update dependencies (download latest)
helm dependency update charts/atp-ingestion/
# Build dependencies (rebuild from Chart.lock)
helm dependency build charts/atp-ingestion/
# List dependencies
helm dependency list charts/atp-ingestion/
# Output:
# NAME VERSION REPOSITORY STATUS
# redis 17.15.0 https://charts.bitnami.com/bitnami ok
# postgresql 12.1.9 https://charts.bitnami.com/bitnami ok
# Update dependencies and verify packages against provenance signatures
helm dependency update --verify charts/atp-ingestion/
Chart Versioning and Publishing¶
Chart Versioning Strategy (SemVer)¶
Semantic Versioning:
| Version Component | When to Increment | Example |
|---|---|---|
| MAJOR | Breaking changes (incompatible values, removed features) | 1.2.3 → 2.0.0 |
| MINOR | New features (backward compatible) | 1.2.3 → 1.3.0 |
| PATCH | Bug fixes (backward compatible) | 1.2.3 → 1.2.4 |
Chart Version Examples:
# Chart.yaml
version: 1.2.3 # Chart version (SemVer)
appVersion: "1.2.3" # Application version
# Version bump examples:
# 1.2.3 → 1.2.4 (patch: bug fix)
# 1.2.3 → 1.3.0 (minor: new feature added)
# 1.2.3 → 2.0.0 (major: breaking change)
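Version bumps can be scripted in the release pipeline; a sketch assuming mikefarah yq v4 (script name is illustrative):
#!/bin/bash
# scripts/bump-chart-patch-version.sh (hypothetical helper)
CHART="charts/atp-ingestion/Chart.yaml"
CURRENT=$(yq '.version' "${CHART}")
NEXT=$(echo "${CURRENT}" | awk -F. '{printf "%d.%d.%d", $1, $2, $3 + 1}')
yq -i ".version = \"${NEXT}\"" "${CHART}"
echo "Chart version bumped: ${CURRENT} → ${NEXT}"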
Publishing to Azure Container Registry¶
ACR Helm Repository Setup:
# Login to ACR (enables OCI push/pull for Helm)
az acr login --name connectsoft
# OCI registries are not added with `helm repo add`; charts are addressed
# directly by oci:// reference (Helm 3.8+), e.g.:
helm pull oci://connectsoft.azurecr.io/helm/atp-ingestion --version 1.2.3
Publishing Chart to ACR:
#!/bin/bash
# scripts/publish-chart-to-acr.sh
CHART_NAME="${1}"
CHART_PATH="charts/${CHART_NAME}"
ACR_NAME="connectsoft"
ACR_REPO="oci://${ACR_NAME}.azurecr.io/helm"
echo "📦 Publishing ${CHART_NAME} to ACR"
# Package chart
helm package "${CHART_PATH}" --destination ./dist/
# Get package name
PACKAGE=$(ls -t ./dist/${CHART_NAME}-*.tgz | head -1)
# Push to ACR
helm push "${PACKAGE}" "${ACR_REPO}"
echo "✅ Chart published: ${PACKAGE}"
Installing from ACR:
# Login, then install directly from the OCI registry (no `helm repo add` needed)
az acr login --name connectsoft
# Install chart by OCI reference
helm install atp-ingestion oci://connectsoft.azurecr.io/helm/atp-ingestion \
--version 1.2.3 \
-f values-production.yaml
Chart Repository Structure¶
ACR OCI Repository Structure:
connectsoft.azurecr.io/helm/
├── atp-ingestion # OCI artifact; one tag per chart version: 1.0.0, 1.1.0, 1.2.3
├── atp-query # tags: ...
└── atp-gateway # tags: ...
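Published chart versions can be listed directly from ACR:
# List chart versions (OCI tags) stored in ACR
az acr repository show-tags \
  --name connectsoft \
  --repository helm/atp-ingestion \
  --output table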
Helm Hooks¶
Pre-Install: Run Before Installation¶
Pre-Install Hook (Database Migration):
# templates/hooks/pre-install-migration.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "atp-ingestion.fullname" . }}-pre-install-migration
annotations:
"helm.sh/hook": pre-install
"helm.sh/hook-weight": "-5"
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
labels:
{{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
template:
metadata:
labels:
{{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
spec:
restartPolicy: Never
serviceAccountName: {{ include "atp-ingestion.serviceAccountName" . }}
containers:
- name: migration
image: {{ include "atp-ingestion.image" . }}
command:
- dotnet
- ConnectSoft.Audit.Ingestion.Migrations.dll
env:
{{- range .Values.env }}
- name: {{ .name }}
value: {{ .value | quote }}
{{- end }}
envFrom:
{{- include "atp-ingestion.envFromSecret" . | nindent 8 }}
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
Post-Install: Run After Installation¶
Post-Install Hook (Verification):
# templates/hooks/post-install-verification.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "atp-ingestion.fullname" . }}-post-install-verification
annotations:
"helm.sh/hook": post-install
"helm.sh/hook-weight": "5"
"helm.sh/hook-delete-policy": hook-succeeded
labels:
{{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
template:
metadata:
labels:
{{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
spec:
restartPolicy: Never
containers:
- name: verification
image: curlimages/curl:latest
command:
- /bin/sh
- -c
- |
echo "Verifying service health..."
sleep 10
curl -f http://{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}/health/ready || exit 1
echo "✅ Service is healthy"
Pre-Upgrade: Run Before Upgrade¶
Pre-Upgrade Hook (Backup):
# templates/hooks/pre-upgrade-backup.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "atp-ingestion.fullname" . }}-pre-upgrade-backup
annotations:
"helm.sh/hook": pre-upgrade
"helm.sh/hook-weight": "-5"
"helm.sh/hook-delete-policy": before-hook-creation
labels:
{{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
template:
metadata:
labels:
{{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
spec:
restartPolicy: Never
containers:
- name: backup
image: mcr.microsoft.com/azure-cli:latest
command:
- /bin/bash
- -c
- |
echo "Creating backup before upgrade..."
# Backup logic here
az storage blob upload-batch \
--destination backup \
--source /data \
--account-name atpstorage
echo "✅ Backup complete"
Post-Upgrade: Run After Upgrade¶
Post-Upgrade Hook (Smoke Tests):
# templates/hooks/post-upgrade-smoke-tests.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "atp-ingestion.fullname" . }}-post-upgrade-smoke-tests
annotations:
"helm.sh/hook": post-upgrade
"helm.sh/hook-weight": "5"
"helm.sh/hook-delete-policy": hook-succeeded
spec:
template:
spec:
restartPolicy: Never
containers:
- name: smoke-tests
image: mcr.microsoft.com/dotnet/sdk:8.0
command:
- dotnet
- test
- --filter
- Category=Smoke
env:
- name: API_URL
value: http://{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}
Pre-Delete: Run Before Deletion¶
Pre-Delete Hook (Cleanup):
# templates/hooks/pre-delete-cleanup.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "atp-ingestion.fullname" . }}-pre-delete-cleanup
annotations:
"helm.sh/hook": pre-delete
"helm.sh/hook-weight": "-5"
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
template:
spec:
restartPolicy: Never
containers:
- name: cleanup
image: mcr.microsoft.com/azure-cli:latest
command:
- /bin/bash
- -c
- |
echo "Cleaning up resources..."
# Cleanup logic
echo "✅ Cleanup complete"
Hook Use Cases¶
Hook Execution Flow:
sequenceDiagram
participant Helm as Helm
participant PreInstall as Pre-Install Hook
participant Install as Installation
participant PostInstall as Post-Install Hook
participant PreUpgrade as Pre-Upgrade Hook
participant Upgrade as Upgrade
participant PostUpgrade as Post-Upgrade Hook
participant PreDelete as Pre-Delete Hook
participant Delete as Deletion
Note over Helm,Delete: Installation Flow
Helm->>PreInstall: Execute pre-install hooks
PreInstall-->>Helm: Migration complete
Helm->>Install: Install resources
Install-->>Helm: Installed
Helm->>PostInstall: Execute post-install hooks
PostInstall-->>Helm: Verification complete
Note over Helm,Delete: Upgrade Flow
Helm->>PreUpgrade: Execute pre-upgrade hooks
PreUpgrade-->>Helm: Backup complete
Helm->>Upgrade: Upgrade resources
Upgrade-->>Helm: Upgraded
Helm->>PostUpgrade: Execute post-upgrade hooks
PostUpgrade-->>Helm: Smoke tests passed
Note over Helm,Delete: Deletion Flow
Helm->>PreDelete: Execute pre-delete hooks
PreDelete-->>Helm: Cleanup complete
Helm->>Delete: Delete resources
Hook Use Cases Table:
| Hook | Use Case | Example |
|---|---|---|
| pre-install | Database migrations, schema setup | Run EF migrations before deploying app |
| post-install | Verification, smoke tests | Verify service is healthy after install |
| pre-upgrade | Backup, data migration | Backup database before upgrade |
| post-upgrade | Smoke tests, validation | Run integration tests after upgrade |
| pre-delete | Cleanup, data export | Export data before deleting service |
| post-delete | Final cleanup | Remove temporary resources |
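The post-delete phase from the table above has no example earlier in this section; a minimal sketch following the same conventions as the other hooks (the cleanup command is a placeholder):
# templates/hooks/post-delete-final-cleanup.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "atp-ingestion.fullname" . }}-post-delete-cleanup
  annotations:
    "helm.sh/hook": post-delete
    "helm.sh/hook-weight": "5"
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: final-cleanup
          image: mcr.microsoft.com/azure-cli:latest
          command:
            - /bin/bash
            - -c
            - |
              echo "Removing temporary resources..."
              # Final cleanup logic here (placeholder)
              echo "✅ Final cleanup complete"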
Testing Helm Charts¶
helm lint: Syntax and Best Practices¶
Linting Helm Charts:
# Lint chart
helm lint charts/atp-ingestion/
# Lint with strict mode
helm lint charts/atp-ingestion/ --strict
# Lint with values file
helm lint charts/atp-ingestion/ -f values-production.yaml
# Lint all charts
for chart in charts/*/; do
echo "Linting $chart"
helm lint "$chart"
done
helm template: Render Templates Locally¶
Rendering Templates:
# Render all templates
helm template my-release charts/atp-ingestion/
# Render with values
helm template my-release charts/atp-ingestion/ -f values-production.yaml
# Render specific template
helm template my-release charts/atp-ingestion/ -s templates/deployment.yaml
# Dry-run (validate without installing)
helm install my-release charts/atp-ingestion/ --dry-run --debug
# Output to file
helm template my-release charts/atp-ingestion/ > rendered-manifests.yaml
Template Validation Script:
#!/bin/bash
# scripts/validate-helm-chart.sh
CHART_PATH="${1}"
VALUES_FILE="${2}"
echo "🔍 Validating Helm chart: ${CHART_PATH}"
# Lint
echo "1. Running helm lint..."
helm lint "${CHART_PATH}" ${VALUES_FILE:+-f "${VALUES_FILE}"} || exit 1
# Template rendering
echo "2. Rendering templates..."
helm template test-release "${CHART_PATH}" ${VALUES_FILE:+-f "${VALUES_FILE}"} > /tmp/rendered.yaml || exit 1
# Validate with kubeval
echo "3. Validating Kubernetes manifests..."
kubeval /tmp/rendered.yaml || exit 1
# Validate with kube-score
echo "4. Scoring manifests..."
kube-score score /tmp/rendered.yaml || exit 1
echo "✅ Chart validation passed"
helm test: Run Tests in Cluster¶
Helm Test Templates:
# templates/tests/test-connection.yaml
apiVersion: v1
kind: Pod
metadata:
name: "{{ include "atp-ingestion.fullname" . }}-test-connection"
annotations:
"helm.sh/hook": test
labels:
{{- include "atp-ingestion.selectorLabels" . | nindent 4 }}
spec:
restartPolicy: Never
containers:
- name: wget
image: busybox:1.35
command: ['wget']
args: ['{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}']
# templates/tests/test-health.yaml
apiVersion: v1
kind: Pod
metadata:
name: "{{ include "atp-ingestion.fullname" . }}-test-health"
annotations:
"helm.sh/hook": test
spec:
restartPolicy: Never
containers:
- name: curl-test
image: curlimages/curl:latest
command:
- /bin/sh
- -c
- |
curl -f http://{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}/health/ready || exit 1
curl -f http://{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}/health/live || exit 1
echo "✅ Health checks passed"
Running Helm Tests:
# Run tests
helm test my-release
# Run tests with logs
helm test my-release --logs
# Run tests with timeout
helm test my-release --timeout 5m
chart-testing Tool (ct)¶
Install chart-testing:
# Install ct
curl -LO https://github.com/helm/chart-testing/releases/download/v3.9.0/chart-testing_3.9.0_linux_amd64.tar.gz
tar -xzf chart-testing_3.9.0_linux_amd64.tar.gz
sudo mv ct /usr/local/bin/
chart-testing Configuration:
# .github/ct.yaml
chart-dirs:
- charts
chart-repos:
- bitnami=https://charts.bitnami.com/bitnami
target-branch: main
validate-maintainers: true
check-version-increment: true
Using chart-testing:
# Lint and validate
ct lint --charts charts/atp-ingestion/
# Install and test
ct install --charts charts/atp-ingestion/
# Lint all changed charts
ct lint --target-branch main
Integration with CI Pipeline¶
Azure Pipeline for Chart Testing:
# azure-pipelines-helm-charts.yml
trigger:
branches:
include:
- main
paths:
include:
- charts/**/*
pool:
vmImage: 'ubuntu-latest'
steps:
- task: HelmInstaller@1
displayName: 'Install Helm'
inputs:
helmVersionToInstall: '3.12.0'
- task: Bash@3
displayName: 'Install chart-testing'
inputs:
targetType: 'inline'
script: |
curl -LO https://github.com/helm/chart-testing/releases/download/v3.9.0/chart-testing_3.9.0_linux_amd64.tar.gz
tar -xzf chart-testing_3.9.0_linux_amd64.tar.gz
sudo mv ct /usr/local/bin/
- task: Bash@3
displayName: 'Lint Charts'
inputs:
targetType: 'inline'
script: |
for chart in charts/*/; do
echo "Linting $chart"
helm lint "$chart"
ct lint --charts "$chart"
done
- task: Bash@3
displayName: 'Render Templates'
inputs:
targetType: 'inline'
script: |
for chart in charts/*/; do
echo "Rendering $chart"
helm template test-release "$chart" -f "$chart/values-production.yaml" > /dev/null
done
- task: Bash@3
displayName: 'Package Charts'
inputs:
targetType: 'inline'
script: |
mkdir -p dist
for chart in charts/*/; do
helm package "$chart" --destination ./dist/
done
Helm Chart CI Pipeline¶
Complete CI Pipeline:
# azure-pipelines-helm-chart-ci.yml
trigger:
branches:
include:
- main
- feature/*
paths:
include:
- charts/**/*
pr:
branches:
include:
- main
paths:
include:
- charts/**/*
pool:
vmImage: 'ubuntu-latest'
variables:
- group: atp-helm-charts
- name: ACR_NAME
value: 'connectsoft'
stages:
- stage: Lint
displayName: 'Lint Charts'
jobs:
- job: Lint
displayName: 'Lint Helm Charts'
steps:
- task: HelmInstaller@1
displayName: 'Install Helm'
inputs:
helmVersionToInstall: '3.12.0'
- task: Bash@3
displayName: 'Install kubeval and kube-score'
inputs:
targetType: 'inline'
script: |
wget https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
tar xf kubeval-linux-amd64.tar.gz
sudo mv kubeval /usr/local/bin/
wget https://github.com/zegl/kube-score/releases/download/v1.17.0/kube-score_1.17.0_linux_amd64.tar.gz
tar xf kube-score_1.17.0_linux_amd64.tar.gz
sudo mv kube-score /usr/local/bin/
- task: Bash@3
displayName: 'Lint and Validate Charts'
inputs:
targetType: 'inline'
script: |
for chart in charts/*/; do
CHART_NAME=$(basename "$chart")
echo "Linting ${CHART_NAME}..."
# Helm lint
helm lint "$chart" || exit 1
# Render and validate
helm template test-release "$chart" -f "$chart/values-production.yaml" | \
kubeval --strict || exit 1
# Score
helm template test-release "$chart" -f "$chart/values-production.yaml" | \
kube-score score - || exit 1
done
- stage: Package
displayName: 'Package Charts'
condition: succeeded()
jobs:
- job: Package
displayName: 'Package Helm Charts'
steps:
- task: HelmInstaller@1
displayName: 'Install Helm'
- task: Bash@3
displayName: 'Package Charts'
inputs:
targetType: 'inline'
script: |
mkdir -p dist
for chart in charts/*/; do
helm package "$chart" --destination ./dist/
done
echo "##vso[task.setVariable variable=CHARTS_PACKAGED]true"
- stage: Publish
displayName: 'Publish to ACR'
condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
jobs:
- deployment: Publish
displayName: 'Publish Charts to ACR'
environment: 'Production'
strategy:
runOnce:
deploy:
steps:
- task: AzureCLI@2
displayName: 'Login to ACR'
inputs:
azureSubscription: 'ATP-Prod-ServiceConnection'
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
az acr login --name $(ACR_NAME)
- task: HelmInstaller@1
displayName: 'Install Helm'
- task: Bash@3
displayName: 'Publish Charts'
inputs:
targetType: 'inline'
script: |
# Deployment jobs download pipeline artifacts to $(Pipeline.Workspace) automatically
for package in "$(Pipeline.Workspace)/charts"/*.tgz; do
CHART_NAME=$(basename "$package" .tgz | cut -d- -f1-2)
echo "Publishing ${CHART_NAME}..."
helm push "$package" "oci://$(ACR_NAME).azurecr.io/helm"
done
Chart Documentation¶
README.md with Usage Instructions¶
Chart README Template:
# atp-ingestion
A Helm chart for ATP Ingestion Service - Collects and processes audit trail events.
## Introduction
This chart deploys the ATP Ingestion Service on a Kubernetes cluster using the Helm package manager.
## Prerequisites
- Kubernetes 1.24+
- Helm 3.8+
- Azure Container Registry access
- External Secrets Operator (for secret management)
## Installing the Chart
To install the chart with the release name `atp-ingestion`:
```bash
# OCI registries are not added with "helm repo add"; log in and install directly from the OCI reference
helm registry login connectsoft.azurecr.io
helm install atp-ingestion oci://connectsoft.azurecr.io/helm/atp-ingestion \
  --version 1.2.3 \
  -f values-production.yaml
Uninstalling the Chart¶
To uninstall/delete the atp-ingestion deployment:
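The standard Helm command, matching the release name used above:
helm uninstall atp-ingestion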
Configuration¶
The following table lists the configurable parameters:
| Parameter | Description | Default |
|---|---|---|
| replicaCount | Number of replicas | 3 |
| image.repository | Image repository | connectsoft.azurecr.io/atp/ingestion |
| image.tag | Image tag | "" (defaults to appVersion) |
| resources.limits.cpu | CPU limit | 2000m |
| resources.limits.memory | Memory limit | 2Gi |
| resources.requests.cpu | CPU request | 500m |
| resources.requests.memory | Memory request | 1Gi |
| autoscaling.enabled | Enable HPA | true |
| autoscaling.minReplicas | Minimum replicas | 3 |
| autoscaling.maxReplicas | Maximum replicas | 10 |
Values Files¶
- values.yaml: Default values
- values-dev.yaml: Development environment
- values-test.yaml: Test environment
- values-staging.yaml: Staging environment
- values-production.yaml: Production environment
Dependencies¶
- Redis (optional, via Bitnami chart)
- PostgreSQL (optional, via Bitnami chart)
Hooks¶
- pre-install: Runs database migrations
- post-install: Verifies service health
- pre-upgrade: Creates backup
- post-upgrade: Runs smoke tests
Values Schema (JSON Schema)¶
values.schema.json:
{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "properties": {
    "replicaCount": {
      "type": "integer",
      "minimum": 1,
      "maximum": 100,
      "description": "Number of replicas"
    },
    "image": {
      "type": "object",
      "properties": {
        "repository": {
          "type": "string",
          "description": "Image repository"
        },
        "tag": {
          "type": "string",
          "description": "Image tag"
        },
        "pullPolicy": {
          "type": "string",
          "enum": ["IfNotPresent", "Always", "Never"],
          "description": "Image pull policy"
        }
      },
      "required": ["repository"]
    },
    "resources": {
      "type": "object",
      "properties": {
        "limits": {
          "type": "object",
          "properties": {
            "cpu": {
              "type": "string",
              "pattern": "^[0-9]+m?$|^[0-9]+\\.[0-9]+$"
            },
            "memory": {
              "type": "string",
              "pattern": "^[0-9]+(Mi|Gi|Ti|Pi|Ei|m|K|M|G|T|P|E)$"
            }
          }
        },
        "requests": {
          "type": "object",
          "properties": {
            "cpu": {
              "type": "string",
              "pattern": "^[0-9]+m?$|^[0-9]+\\.[0-9]+$"
            },
            "memory": {
              "type": "string",
              "pattern": "^[0-9]+(Mi|Gi|Ti|Pi|Ei|m|K|M|G|T|P|E)$"
            }
          }
        }
      }
    }
  },
  "required": ["image"]
}
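Helm validates supplied values against values.schema.json automatically during lint, template, install, and upgrade, so schema violations fail fast. For example:
# Fails schema validation: replicaCount minimum is 1
helm lint charts/atp-ingestion/ --set replicaCount=0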
Chart Security¶
Scanning Charts for Vulnerabilities¶
Chart Security Scanning:
# Install checkov for Helm chart scanning
pip install checkov
# Scan Helm chart
checkov -d charts/atp-ingestion/ --framework helm
Policy Validation¶
OPA Policy for Helm Charts:
# policies/helm-chart-policy.rego
package helm
deny[msg] {
input.kind == "Deployment"
not input.spec.template.spec.securityContext.runAsNonRoot
msg := "Deployment must set runAsNonRoot to true"
}
deny[msg] {
input.kind == "Deployment"
container := input.spec.template.spec.containers[_]
not container.securityContext.allowPrivilegeEscalation == false
msg := "Deployment containers must set allowPrivilegeEscalation: false"
}
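These rules can be evaluated against rendered manifests with conftest (invocation is a sketch; --namespace helm selects the package declared above):
helm template test-release charts/atp-ingestion/ \
  | conftest test --policy policies/ --namespace helm -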
Image Scanning in Chart Images¶
Image Scanning in CI:
# Azure Pipeline: Scan chart images
- task: Bash@3
displayName: 'Scan Images'
inputs:
targetType: 'inline'
script: |
# Extract images from chart
IMAGES=$(helm template test-release charts/atp-ingestion/ | \
grep -E 'image:' | \
awk '{print $2}' | \
tr -d '"')
# Scan each image with Trivy
for image in $IMAGES; do
echo "Scanning $image"
trivy image --severity HIGH,CRITICAL "$image" || exit 1
done
Summary: Helm Chart Development for ATP Services¶
- Helm Chart Structure: Chart.yaml metadata, values.yaml defaults, templates/ directory, charts/ dependencies, .helmignore exclusions
- Template Best Practices: Named templates and helpers (_helpers.tpl), template functions (include, default, required), flow control (if, with, range), variable scoping, whitespace management
- Values File Organization: Hierarchical values structure, default values, environment overrides (dev/test/staging/production), secret references (never plaintext), documentation in comments
- Chart Dependencies: Depending on other charts, sub-chart values override, conditional dependencies, dependency management commands
- Chart Versioning and Publishing: Semantic versioning (SemVer), publishing to Azure Container Registry (ACR), chart repository structure
- Helm Hooks: Pre-install (migrations), post-install (verification), pre-upgrade (backup), post-upgrade (smoke tests), pre-delete (cleanup), hook use cases and execution flow
- Testing Helm Charts: helm lint, helm template (render locally), helm test (run in cluster), chart-testing tool (ct), CI pipeline integration
- Helm Chart CI Pipeline: Lint charts on PR, package charts, publish to ACR, version management
- Chart Documentation: README.md with usage instructions, values schema (JSON Schema), changelog for versions
- Chart Security: Scanning charts for vulnerabilities, policy validation (OPA), image scanning in chart images
Kustomize Advanced Patterns¶
Purpose: Define advanced Kustomize patterns, strategies, and best practices for ATP GitOps deployments including strategic merge patches, JSON patches, generators, transformers, component composition, remote bases, and FluxCD integration to enable flexible, maintainable, and reusable Kubernetes configuration management.
Kustomize Architecture¶
Base, Overlays, Components¶
Kustomize Architecture Overview:
graph TB
subgraph "Base"
BASE[kustomization.yaml<br/>Base Resources]
DEPLOY_BASE[deployment.yaml]
SVC_BASE[service.yaml]
CM_BASE[configmap.yaml]
end
subgraph "Overlays"
subgraph "Dev Overlay"
DEV_KUST[dev/kustomization.yaml]
DEV_PATCH[dev/patch.yaml]
end
subgraph "Prod Overlay"
PROD_KUST[production/kustomization.yaml]
PROD_PATCH[production/patch.yaml]
end
end
subgraph "Components"
COMP_KUST[components/monitoring/kustomization.yaml]
COMP_RESOURCES[components/monitoring/resources/]
end
BASE --> DEPLOY_BASE
BASE --> SVC_BASE
BASE --> CM_BASE
DEV_KUST -.references.-> BASE
DEV_KUST -.uses.-> DEV_PATCH
DEV_KUST -.includes.-> COMP_KUST
PROD_KUST -.references.-> BASE
PROD_KUST -.uses.-> PROD_PATCH
PROD_KUST -.includes.-> COMP_KUST
style BASE fill:#FFE5B4
style DEV_KUST fill:#90EE90
style PROD_KUST fill:#FFB6C1
style COMP_KUST fill:#87CEEB
Directory Structure:
apps/atp-ingestion/
├── base/
│ ├── kustomization.yaml
│ ├── deployment.yaml
│ ├── service.yaml
│ └── configmap.yaml
├── overlays/
│ ├── dev/
│ │ ├── kustomization.yaml
│ │ ├── deployment-patch.yaml
│ │ └── configmap-patch.yaml
│ ├── test/
│ │ ├── kustomization.yaml
│ │ └── deployment-patch.yaml
│ ├── staging/
│ │ ├── kustomization.yaml
│ │ └── deployment-patch.yaml
│ └── production/
│ ├── kustomization.yaml
│ ├── deployment-patch.yaml
│ └── configmap-patch.yaml
└── components/
├── monitoring/
│ ├── kustomization.yaml
│ └── servicemonitor.yaml
└── networking/
├── kustomization.yaml
└── networkpolicy.yaml
Kustomization File Structure¶
Base kustomization.yaml:
# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
metadata:
name: atp-ingestion-base
resources:
- deployment.yaml
- service.yaml
- configmap.yaml
commonLabels:
app: atp-ingestion
component: ingestion
managed-by: kustomize
commonAnnotations:
description: "ATP Ingestion Service Base Configuration"
namespace: default
Overlay kustomization.yaml:
# apps/atp-ingestion/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
metadata:
name: atp-ingestion-production
resources:
- ../../base
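# Note: patchesStrategicMerge is deprecated in Kustomize v5;
# newer configurations fold these into the unified "patches" field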
patchesStrategicMerge:
- deployment-patch.yaml
- configmap-patch.yaml
patches:
- path: service-patch.json
target:
kind: Service
images:
- name: connectsoft.azurecr.io/atp/ingestion
newName: connectsoft.azurecr.io/atp/ingestion
newTag: v1.2.3
replicas:
- name: atp-ingestion
count: 5
namespace: atp-production
commonLabels:
environment: production
configMapGenerator:
- name: app-config
literals:
- ASPNETCORE_ENVIRONMENT=Production
Resource Selection¶
Resource Selection in Kustomization:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# Resources to include
resources:
- deployment.yaml
- service.yaml
- ../../base # Include entire base
# Components to include
components:
- ../../components/monitoring
# Exclude resources (via selector)
# Note: Kustomize doesn't support exclude directly,
# use patches to remove resources
# Select resources by label
# (requires custom transformer or post-processing)
Transformation Order¶
Kustomize Transformation Pipeline:
graph LR
BASE[Base Resources] --> COMMON[Common Labels/Annotations]
COMMON --> NAMESPACE[Namespace Transform]
NAMESPACE --> PREFIX[Name Prefix/Suffix]
PREFIX --> IMAGES[Image Transform]
IMAGES --> REPLICAS[Replica Transform]
REPLICAS --> PATCHES[Strategic Merge Patches]
PATCHES --> JSON[JSON Patches]
JSON --> GENERATORS[ConfigMap/Secret Generators]
GENERATORS --> OUTPUT[Final Output]
style BASE fill:#FFE5B4
style OUTPUT fill:#90EE90
Transformation Order:
- Load Resources: Read base resources and all referenced resources
- Common Labels/Annotations: Apply common labels and annotations
- Namespace Transform: Set namespace on all resources
- Name Prefix/Suffix: Apply name transformations
- Image Transform: Replace image references
- Replica Transform: Update replica counts
- Strategic Merge Patches: Apply strategic merge patches
- JSON Patches: Apply JSON patches
- ConfigMap/Secret Generators: Generate ConfigMaps and Secrets
- Replacements: Apply replacements transformations
- Final Output: Emit transformed resources
Strategic Merge Patches¶
How Strategic Merge Works¶
Strategic Merge Patch Overview:
Strategic merge patches use Kubernetes's strategic merge patch logic to merge patches into base resources, following Kubernetes-specific semantics for merging lists and maps.
Strategic Merge Process:
graph TB
BASE[Base Resource]
PATCH[Strategic Merge Patch]
BASE --> MERGE{Strategic Merge}
PATCH --> MERGE
MERGE --> RESULT[Merged Resource]
subgraph "Merge Semantics"
REPLACE[Replace<br/>Explicit values]
ADD[Add<br/>New fields]
DELETE[Delete<br/>null values]
ARRAY[Array Merge<br/>Strategic merge keys]
end
MERGE -.uses.-> REPLACE
MERGE -.uses.-> ADD
MERGE -.uses.-> DELETE
MERGE -.uses.-> ARRAY
Merge Semantics (Replace, Add, Delete)¶
Strategic Merge Examples:
Base Deployment:
# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 3
template:
spec:
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:latest
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 2000m
memory: 2Gi
env:
- name: ASPNETCORE_ENVIRONMENT
value: "Development"
Strategic Merge Patch (Replace):
# overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 5 # Replace: 3 → 5
template:
spec:
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:v1.2.3 # Replace image
resources:
requests:
cpu: 1000m # Replace: 500m → 1000m
memory: 2Gi # Replace: 1Gi → 2Gi
limits:
cpu: 4000m # Replace: 2000m → 4000m
memory: 4Gi # Replace: 2Gi → 4Gi
env:
- name: ASPNETCORE_ENVIRONMENT
value: "Production" # Replace: Development → Production
Strategic Merge Patch (Add):
# overlays/production/deployment-patch-add.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
template:
spec:
containers:
- name: atp-ingestion
env:
# Add new environment variables
- name: Logging__LogLevel__Default
value: "Warning"
- name: Telemetry__SamplingRate
value: "0.1"
resources:
limits:
# Add new resource limit
ephemeral-storage: 10Gi
Strategic Merge Patch (Delete):
# overlays/minimal/deployment-patch-delete.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
template:
spec:
containers:
- name: atp-ingestion
env:
# Setting a map field to null deletes only that field; to remove
# the whole list entry, use the $patch directive:
- name: Telemetry__SamplingRate
  $patch: delete
Array Merging Strategies¶
Array Merging with Strategic Merge Keys:
Kubernetes uses strategic merge keys to identify array items for merging:
| Resource Type | Strategic Merge Key |
|---|---|
| Deployment.spec.template.spec.containers | name |
| Deployment.spec.template.spec.initContainers | name |
| Service.spec.ports | port |
| ConfigMap.data | Key name |
| Pod.spec.volumes | name |
Container Array Merge Example:
Base Deployment:
# base/deployment.yaml
spec:
template:
spec:
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:latest
env:
- name: ASPNETCORE_ENVIRONMENT
value: "Development"
- name: sidecar
image: connectsoft.azurecr.io/atp/sidecar:latest
Production Patch (Update Existing Container, Add New):
# overlays/production/deployment-patch.yaml
spec:
template:
spec:
containers:
# Update existing container (matched by name: "atp-ingestion")
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:v1.2.3
env:
- name: ASPNETCORE_ENVIRONMENT
value: "Production"
- name: Logging__LogLevel__Default
value: "Warning"
# Add new container
- name: metrics-exporter
image: prom/node-exporter:latest
Result: The atp-ingestion container is updated, sidecar remains unchanged, and metrics-exporter is added.
Common Patterns¶
Common Strategic Merge Patterns:
Pattern 1: Update Replicas and Resources:
# overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 5
template:
spec:
containers:
- name: atp-ingestion
resources:
requests:
cpu: 1000m
memory: 2Gi
limits:
cpu: 4000m
memory: 4Gi
Pattern 2: Add Environment Variables:
# overlays/production/deployment-patch-env.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
template:
spec:
containers:
- name: atp-ingestion
env:
- name: ASPNETCORE_ENVIRONMENT
value: "Production"
- name: Logging__LogLevel__Default
value: "Warning"
- name: Telemetry__SamplingRate
value: "0.1"
Pattern 3: Add Volume Mounts:
# overlays/production/deployment-patch-volumes.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
template:
spec:
containers:
- name: atp-ingestion
volumeMounts:
- name: config
mountPath: /app/config
volumes:
- name: config
configMap:
name: app-config
Pattern 4: Update Service Type:
# overlays/production/service-patch.yaml
apiVersion: v1
kind: Service
metadata:
name: atp-ingestion
spec:
type: LoadBalancer # Change from ClusterIP to LoadBalancer
ports:
- port: 80
targetPort: 8080
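To preview the combined effect of these patches before committing, the built overlay can be diffed against live cluster state (cluster access assumed):
# Render the production overlay and diff it against the cluster
kubectl diff -k apps/atp-ingestion/overlays/production
# Or inspect the rendered output directly
kustomize build apps/atp-ingestion/overlays/production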
JSON Patches¶
JSON Patch Operations (Add, Replace, Remove)¶
JSON Patch Operations:
| Operation | Description | Use Case |
|---|---|---|
| add | Add field or array element | Add new annotation, add new container |
| replace | Replace existing field value | Update replica count, change image tag |
| remove | Remove field or array element | Remove environment variable, remove port |
| copy | Copy value from one path to another | Copy annotation value |
| move | Move value from one path to another | Move label |
| test | Test value equality | Validate before patch |
JSON Patch Example:
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
patches:
- target:
kind: Deployment
name: atp-ingestion
path: deployment-patch.json
deployment-patch.json:
[
{
"op": "replace",
"path": "/spec/replicas",
"value": 5
},
{
"op": "replace",
"path": "/spec/template/spec/containers/0/image",
"value": "connectsoft.azurecr.io/atp/ingestion:v1.2.3"
},
{
"op": "add",
"path": "/spec/template/metadata/annotations/prometheus.io~1scrape",
"value": "true"
},
{
"op": "add",
"path": "/spec/template/spec/containers/0/env/-",
"value": {
"name": "Logging__LogLevel__Default",
"value": "Warning"
}
},
{
"op": "remove",
"path": "/spec/template/spec/containers/0/env/0"
}
]
Path Targeting¶
JSON Patch Path Examples:
[
// Replace replica count
{
"op": "replace",
"path": "/spec/replicas",
"value": 5
},
// Replace image in first container
{
"op": "replace",
"path": "/spec/template/spec/containers/0/image",
"value": "new-image:tag"
},
// Add annotation (use ~1 for /)
{
"op": "add",
"path": "/metadata/annotations/prometheus.io~1scrape",
"value": "true"
},
// Add to array (use - to append)
{
"op": "add",
"path": "/spec/template/spec/containers/0/env/-",
"value": {
"name": "NEW_VAR",
"value": "value"
}
},
// Remove array element by index
{
"op": "remove",
"path": "/spec/template/spec/containers/0/env/0"
},
// Remove field
{
"op": "remove",
"path": "/spec/template/spec/containers/0/resources/limits/cpu"
}
]
Path Escaping:
- Use ~1 for / in paths
- Use ~0 for ~ in paths
- Example: prometheus.io/scrape → prometheus.io~1scrape
When to Use JSON Patches vs Strategic Merge¶
Comparison:
| Feature | Strategic Merge | JSON Patch |
|---|---|---|
| Simplicity | ✅ Easier to write and read | ⚠️ More verbose |
| Type Safety | ✅ YAML-native | ❌ JSON only |
| Array Operations | ✅ Smart merging with keys | ⚠️ Index-based |
| Precision | ⚠️ Can be ambiguous | ✅ Very precise |
| Removal | ⚠️ Requires null | ✅ Direct remove |
| ATP Preference | ✅ Preferred for most cases | ⚠️ Use for complex cases |
ATP Decision Matrix:
| Use Case | Recommended Approach |
|---|---|
| Update replicas | ✅ Strategic merge |
| Update image tag | ✅ Strategic merge |
| Add environment variables | ✅ Strategic merge |
| Remove specific array element | ✅ JSON patch |
| Add annotation with / in key | ⚠️ JSON patch (or use quotes in YAML) |
| Precise field replacement | ✅ JSON patch |
| Complex array manipulation | ⚠️ JSON patch |
ConfigMap and Secret Generators¶
Generating ConfigMaps from Literals¶
ConfigMap Generator from Literals:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
configMapGenerator:
- name: app-config
literals:
- ASPNETCORE_ENVIRONMENT=Production
- Logging__LogLevel__Default=Warning
- Telemetry__SamplingRate=0.1
options:
labels:
app: atp-ingestion
annotations:
description: "Application configuration"
disableNameSuffixHash: false # Include hash suffix for updates
Generated ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config-abc123 # Hash suffix added
labels:
app: atp-ingestion
annotations:
description: "Application configuration"
data:
ASPNETCORE_ENVIRONMENT: Production
Logging__LogLevel__Default: Warning
Telemetry__SamplingRate: "0.1"
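Kustomize also rewrites workload references to generated ConfigMaps, so base manifests refer to the generator name and the rendered output picks up the hashed name (hash illustrative):
# base/deployment.yaml (excerpt)
envFrom:
  - configMapRef:
      name: app-config   # rendered as app-config-abc123 in the output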
Generating ConfigMaps from Files¶
ConfigMap Generator from Files:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
configMapGenerator:
- name: app-config
files:
- appsettings.json
- logging.json
options:
disableNameSuffixHash: false
Directory Structure:
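(A representative layout, assuming the generator files sit beside the kustomization — paths illustrative:)
overlays/production/
├── kustomization.yaml
├── appsettings.json
└── logging.json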
File Contents:
// appsettings.json
{
"Logging": {
"LogLevel": {
"Default": "Warning"
}
},
"Telemetry": {
"SamplingRate": 0.1
}
}
Generated ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config-xyz789
data:
appsettings.json: |
{
"Logging": {
"LogLevel": {
"Default": "Warning"
}
}
}
logging.json: |
{...}
Generating Secrets (Encrypted)¶
Secret Generator:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
secretGenerator:
- name: app-secrets
type: Opaque
literals:
- connectionString=Server=...
- apiKey=secret-key-123
options:
labels:
app: atp-ingestion
disableNameSuffixHash: false
⚠️ Security Warning: Secrets in kustomization.yaml are base64 encoded, not encrypted. Always use External Secrets Operator or Azure Key Vault CSI Driver for production secrets.
Recommended: Secret Generator from Files:
# kustomization.yaml
secretGenerator:
- name: app-secrets
  type: Opaque
  files:
    - connectionString.txt # plain text; kustomize base64-encodes file contents into the Secret data
    - apiKey.txt
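Creating the Secret Source Files (a sketch — values are placeholders):
# Write the secret values as plain text; kustomize encodes them at build time.
# Keep these files out of Git (add them to .gitignore).
printf '%s' 'Server=atp-db.database.windows.net;...' > connectionString.txt
printf '%s' 'secret-key-123' > apiKey.txt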
Hash Suffixes for Updates¶
Hash Suffix Behavior:
# kustomization.yaml
configMapGenerator:
- name: app-config
literals:
- KEY=VALUE
options:
disableNameSuffixHash: false # Default: include hash
Hash Suffix Purpose:
- With hash (disableNameSuffixHash: false):
  - ConfigMap name: app-config-abc123
  - Changing content generates a new hash: app-config-xyz789
  - Forces a Pod restart when the ConfigMap changes (rolling update)
- Without hash (disableNameSuffixHash: true):
  - ConfigMap name: app-config
  - Changing content updates the same ConfigMap in place
  - Pods may not restart automatically (depends on how the ConfigMap is consumed)
ATP Recommendation: Use hash suffixes (disableNameSuffixHash: false) to ensure Pods restart when ConfigMaps change.
Variable Substitution¶
Defining Variables in kustomization.yaml¶
Variable Definition (note: vars is deprecated in current Kustomize releases in favor of the replacements feature covered below):
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
vars:
- name: SERVICE_NAME
objref:
kind: Service
name: atp-ingestion
fieldref:
fieldpath: metadata.name
- name: SERVICE_PORT
objref:
kind: Service
name: atp-ingestion
fieldref:
fieldpath: spec.ports[0].port
- name: REPLICA_COUNT
objref:
kind: Deployment
name: atp-ingestion
fieldref:
fieldpath: spec.replicas
Using Variables in Resources¶
Using Variables in Deployment (by default Kustomize substitutes vars only in certain fields, notably container command, args, and env; other fields may require extending the varReference configuration):
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: $(SERVICE_NAME)
spec:
replicas: $(REPLICA_COUNT)
template:
spec:
containers:
- name: atp-ingestion
env:
- name: SERVICE_NAME
value: $(SERVICE_NAME)
- name: SERVICE_PORT
value: "$(SERVICE_PORT)"
Variable Substitution Process:
graph LR
DEFINE[Define Variables<br/>in kustomization.yaml]
REF[Reference Resources<br/>via objref]
EXTRACT[Extract Values<br/>via fieldref]
SUBSTITUTE[Substitute<br/>$(VAR_NAME)]
OUTPUT[Final Resource]
DEFINE --> REF
REF --> EXTRACT
EXTRACT --> SUBSTITUTE
SUBSTITUTE --> OUTPUT
style DEFINE fill:#FFE5B4
style OUTPUT fill:#90EE90
Environment-Specific Variables¶
Environment-Specific Variable Configuration:
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
vars:
- name: ENVIRONMENT
objref:
kind: ConfigMap
name: app-config
fieldref:
fieldpath: data.ASPNETCORE_ENVIRONMENT
configMapGenerator:
- name: app-config
literals:
- ASPNETCORE_ENVIRONMENT=Production
Replacements¶
Replacing Values Across Resources¶
Replacement Configuration:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
replacements:
- source:
kind: ConfigMap
name: app-config
fieldPath: data.database-host
targets:
- select:
kind: Deployment
fieldPaths:
- spec.template.spec.containers.[name=atp-ingestion].env.[name=DATABASE_HOST].value
Example: Replace Database Host:
# ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
data:
database-host: "atp-db.database.windows.net"
# Deployment (before replacement)
apiVersion: apps/v1
kind: Deployment
spec:
template:
spec:
containers:
- name: atp-ingestion
env:
- name: DATABASE_HOST
value: "placeholder"
# Replacement configuration
replacements:
- source:
kind: ConfigMap
name: app-config
fieldPath: data.database-host
targets:
- select:
kind: Deployment
fieldPaths:
- spec.template.spec.containers.[name=atp-ingestion].env.[name=DATABASE_HOST].value
# Deployment (after replacement)
# DATABASE_HOST value becomes: "atp-db.database.windows.net"
Source and Target Configuration¶
Replacement Source Options:
replacements:
- source:
    # Option 1: Reference a ConfigMap/Secret field
    kind: ConfigMap
    name: app-config
    fieldPath: data.key-name
    # Option 2: Reference a field on any other resource
    # (a source must point at a resource field; literal values are not supported)
    # kind: Service
    # name: atp-ingestion
    # fieldPath: spec.clusterIP
Replacement Target Options:
targets:
- select:
# Select resources by kind
kind: Deployment
# Optional: name filter
name: atp-ingestion
# Optional: label selector
labelSelector: "app=atp-ingestion"
fieldPaths:
# Target field path (supports array selectors)
- spec.template.spec.containers.[name=atp-ingestion].env.[name=KEY].value
# Multiple targets
- metadata.annotations.config-hash
options:
create: true # Create field if missing
delimiter: "/" # Path delimiter
Complex Replacement Patterns¶
Multiple Replacements:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
configMapGenerator:
- name: app-config
literals:
- database-host=atp-db.database.windows.net
- database-port=5432
- redis-host=atp-redis.redis.cache.windows.net
replacements:
# Replace database host
- source:
kind: ConfigMap
name: app-config
fieldPath: data.database-host
targets:
- select:
kind: Deployment
fieldPaths:
- spec.template.spec.containers.[name=atp-ingestion].env.[name=DATABASE_HOST].value
# Replace database port
- source:
kind: ConfigMap
name: app-config
fieldPath: data.database-port
targets:
- select:
kind: Deployment
fieldPaths:
- spec.template.spec.containers.[name=atp-ingestion].env.[name=DATABASE_PORT].value
# Replace Redis host
- source:
kind: ConfigMap
name: app-config
fieldPath: data.redis-host
targets:
- select:
kind: Deployment
fieldPaths:
- spec.template.spec.containers.[name=atp-ingestion].env.[name=REDIS_HOST].value
Replacement with Delimiter Options:
Replacements copy source values verbatim — there is no string-rewriting step. To use only part of a source value (for example, the host portion of a URL), split it with the delimiter/index options:
replacements:
- source:
    kind: ConfigMap
    name: app-config
    fieldPath: data.api-url # e.g. "http://api.example.com"
    options:
      delimiter: "//"
      index: 1 # take the part after "http://"
  targets:
  - select:
      kind: Ingress
    fieldPaths:
      - spec.rules.0.host
    options:
      create: true
Remote Bases¶
Referencing Remote Kustomizations¶
Remote Base Configuration:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
# Git repository as base
- git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=v1.2.3
# HTTPS URL
- https://github.com/ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=main
Git Repository as Base¶
Git Base with SSH:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=v1.2.3
Git Base with HTTPS:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- https://github.com/ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=v1.2.3
HTTPS URLs for Bases¶
HTTPS Base URL Format:
https://<host>/<org>/<repo>(.git)//<path-to-kustomization>?ref=<branch|tag|sha>
Examples:
resources:
# GitHub
- https://github.com/ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=main
# Azure Repos
- https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops//apps/atp-ingestion/base?ref=production
# GitLab
- https://gitlab.com/ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=main
Version Pinning¶
Version Pinning Strategies:
# Option 1: Pin to tag (recommended)
resources:
- git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=v1.2.3
# Option 2: Pin to branch (less stable)
resources:
- git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=production
# Option 3: Pin to commit SHA (most stable)
resources:
- git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=abc123def456
ATP Recommendation: Pin to Git tags for stability, update tags during releases.
Component Composition¶
Creating Reusable Components¶
Component Structure:
components/
├── monitoring/
│ ├── kustomization.yaml
│ └── servicemonitor.yaml
├── networking/
│ ├── kustomization.yaml
│ └── networkpolicy.yaml
└── security/
├── kustomization.yaml
├── podsecuritypolicy.yaml
└── rbac.yaml
Component kustomization.yaml:
# components/monitoring/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
metadata:
name: monitoring
resources:
- servicemonitor.yaml
- prometheusrule.yaml
commonLabels:
component: monitoring
Including Components in Overlays¶
Using Components in Overlay:
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
components:
- ../../components/monitoring
- ../../components/networking
- ../../components/security
patchesStrategicMerge:
- deployment-patch.yaml
Component Composition Diagram:
graph TB
BASE[Base<br/>kustomization.yaml]
subgraph "Components"
MON[Monitoring<br/>Component]
NET[Networking<br/>Component]
SEC[Security<br/>Component]
end
OVERLAY[Production Overlay<br/>kustomization.yaml]
PATCHES[Strategic Merge<br/>Patches]
BASE --> OVERLAY
MON --> OVERLAY
NET --> OVERLAY
SEC --> OVERLAY
PATCHES --> OVERLAY
OVERLAY --> OUTPUT[Final Resources]
style BASE fill:#FFE5B4
style OVERLAY fill:#90EE90
style OUTPUT fill:#87CEEB
Component Library for ATP¶
ATP Component Library:
components/
├── monitoring/
│ ├── kustomization.yaml
│ ├── servicemonitor.yaml
│ └── prometheusrule.yaml
├── networking/
│ ├── kustomization.yaml
│ ├── networkpolicy.yaml
│ └── ingress-policy.yaml
├── security/
│ ├── kustomization.yaml
│ ├── podsecuritypolicy.yaml
│ └── rbac.yaml
├── autoscaling/
│ ├── kustomization.yaml
│ └── hpa.yaml
└── observability/
├── kustomization.yaml
├── servicemonitor.yaml
└── log-forwarding.yaml
Reusable Monitoring Component:
# components/monitoring/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
metadata:
name: monitoring
resources:
- servicemonitor.yaml
commonLabels:
component: monitoring
configMapGenerator:
- name: monitoring-config
literals:
- scrape-interval=30s
Monitoring Component Template:
# components/monitoring/servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: $(name)-servicemonitor
spec:
selector:
matchLabels:
app: $(name)
endpoints:
- port: http
interval: 30s
Transformers¶
Label Injectors¶
Common Labels:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
commonLabels:
app: atp-ingestion
component: ingestion
environment: production
managed-by: kustomize
version: v1.2.3
Labels Added to All Resources (note that commonLabels is also injected into selector fields, so avoid changing it on workloads that are already deployed — selectors are immutable):
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
labels:
app: atp-ingestion # ← Added
component: ingestion # ← Added
environment: production # ← Added
managed-by: kustomize # ← Added
version: v1.2.3 # ← Added
Namespace Transformers¶
Namespace Configuration:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: atp-production
# All resources get namespace: atp-production
Name Prefix/Suffix Transformers¶
Name Prefix/Suffix:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namePrefix: prod- # Prefix: prod-atp-ingestion
nameSuffix: -v1 # Suffix: atp-ingestion-v1
# ATP Recommendation: Use labels/annotations for versioning instead
Image Transformers¶
Image Transform:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
images:
- name: connectsoft.azurecr.io/atp/ingestion
newName: connectsoft.azurecr.io/atp/ingestion
newTag: v1.2.3
- name: redis
newName: connectsoft.azurecr.io/atp/redis
digest: sha256:abc123... # Use digest for immutability
Image Transform Process:
graph LR
BASE[Base Resources<br/>image: latest]
TRANSFORM[Image Transform<br/>newTag: v1.2.3]
OUTPUT[Output Resources<br/>image: v1.2.3]
BASE --> TRANSFORM
TRANSFORM --> OUTPUT
style BASE fill:#FFE5B4
style OUTPUT fill:#90EE90
Replica Transformers¶
Replica Transform:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
replicas:
- name: atp-ingestion
count: 5
- name: atp-query
count: 3
Replica Transform Example:
# Base deployment (replicas: 3)
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 3
# After replica transform (replicas: 5)
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 5 # ← Updated
Kustomize with Helm¶
Combining Helm and Kustomize¶
Helm + Kustomize Workflow:
graph LR
HELM[Helm Chart<br/>helm template]
OUTPUT1[Helm Output<br/>YAML Manifests]
KUST[Kustomize<br/>kustomize build]
OUTPUT2[Final Output<br/>Patched Manifests]
HELM --> OUTPUT1
OUTPUT1 --> KUST
KUST --> OUTPUT2
style HELM fill:#FFE5B4
style OUTPUT2 fill:#90EE90
helm template → kustomize build¶
Post-Rendering Helm Output with Kustomize:
# Step 1: Render Helm templates
helm template my-release ./charts/atp-ingestion \
-f values-production.yaml \
> /tmp/helm-output.yaml
# Step 2: Use Kustomize to patch the Helm output
mkdir -p kustomize-overlay
# Copy the rendered output inside the kustomize root; by default
# Kustomize refuses to load files outside it (load restrictions)
cp /tmp/helm-output.yaml kustomize-overlay/helm-output.yaml
cat > kustomize-overlay/kustomization.yaml <<EOF
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- helm-output.yaml
patchesStrategicMerge:
- production-patch.yaml
EOF
# Step 3: Build final output
kustomize build kustomize-overlay > final-manifests.yaml
Kustomization for Helm Output:
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- helm-output.yaml # Generated from: helm template
patchesStrategicMerge:
- production-patch.yaml
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: v1.2.3
Post-Rendering with Kustomize¶
Helm Post-Renderer Script:
#!/bin/bash
# scripts/helm-post-render-kustomize.sh
# Helm pipes the rendered manifests to this script's stdin and
# expects the final manifests on stdout.
KUSTOMIZE_DIR="${1:-overlays/production}"
# Capture Helm's output where the overlay's kustomization.yaml references it
cat > "${KUSTOMIZE_DIR}/helm-output.yaml"
kustomize build "${KUSTOMIZE_DIR}"
Use Post-Renderer in Helm:
# Install with post-renderer
helm install atp-ingestion ./charts/atp-ingestion \
-f values-production.yaml \
--post-renderer ./scripts/helm-post-render-kustomize.sh \
--post-renderer-args "overlays/production"
FluxCD Kustomization CRD¶
FluxCD-Specific Configuration¶
FluxCD Kustomization CRD:
# clusters/production/kustomizations/apps-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
namespace: flux-system
spec:
interval: 5m
path: ./apps/atp-ingestion/overlays/production
prune: true
wait: true
timeout: 5m
retryInterval: 1m
sourceRef:
kind: GitRepository
name: atp-gitops-production
dependsOn:
- name: infrastructure
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
namespace: atp-production
postBuild:
substitute:
IMAGE_TAG: v1.2.3
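Variables under postBuild.substitute are expanded wherever the ${VAR} syntax appears in the built manifests (envsubst-style); a sketch of a manifest consuming IMAGE_TAG (field placement illustrative):
# apps/atp-ingestion/overlays/production/deployment-patch.yaml (excerpt)
containers:
  - name: atp-ingestion
    image: connectsoft.azurecr.io/atp/ingestion:${IMAGE_TAG} # becomes v1.2.3 at apply time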
Kustomization CRD Fields¶
Kustomization CRD Reference:
| Field | Type | Description | Required |
|---|---|---|---|
| interval | duration | Reconciliation interval | ✅ Yes |
| path | string | Path to kustomization.yaml | ✅ Yes |
| prune | boolean | Delete resources not in Git | ⚠️ Optional |
| wait | boolean | Wait for resources to be ready | ⚠️ Optional |
| timeout | duration | Wait timeout | ⚠️ Optional |
| retryInterval | duration | Retry interval on failure | ⚠️ Optional |
| sourceRef | object | GitRepository reference | ✅ Yes |
| dependsOn | array | Dependency Kustomizations | ⚠️ Optional |
| healthChecks | array | Health check resources | ⚠️ Optional |
| postBuild | object | Post-build substitutions | ⚠️ Optional |
| suspend | boolean | Suspend reconciliation | ⚠️ Optional |
Integration with Git¶
FluxCD Kustomization with Git:
# GitRepository (source)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: atp-gitops-production
namespace: flux-system
spec:
interval: 1m
url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
ref:
branch: production
secretRef:
name: gitops-credentials
---
# Kustomization (deployment target)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
namespace: flux-system
spec:
interval: 5m
path: ./apps/atp-ingestion/overlays/production
sourceRef:
kind: GitRepository
name: atp-gitops-production # ← References GitRepository
prune: true
wait: true
FluxCD Integration Diagram:
sequenceDiagram
participant Git as Git Repository
participant GitRepo as GitRepository CRD
participant FluxCD as FluxCD Controller
participant Kust as Kustomization CRD
participant K8s as Kubernetes API
GitRepo->>Git: Poll for changes (1m)
Git-->>GitRepo: New commit detected
GitRepo->>FluxCD: Trigger reconciliation
FluxCD->>Git: Fetch kustomization.yaml
Git-->>FluxCD: Return kustomization
FluxCD->>Git: Fetch base + overlays
Git-->>FluxCD: Return resources
FluxCD->>Kust: Build kustomize (kustomize build)
Kust->>K8s: Apply resources
K8s-->>FluxCD: Resources applied
Testing Kustomize Configurations¶
kustomize build for Validation¶
Validate Kustomize Build:
# Build and validate
kustomize build apps/atp-ingestion/overlays/production
# Validate syntax
kustomize build apps/atp-ingestion/overlays/production > /dev/null && echo "✅ Valid"
# Validate with kubeval
kustomize build apps/atp-ingestion/overlays/production | kubeval --strict
# Validate with kube-score
kustomize build apps/atp-ingestion/overlays/production | kube-score score -
Validation Script:
#!/bin/bash
# scripts/validate-kustomize.sh
OVERLAY_PATH="${1}"
echo "🔍 Validating Kustomize: ${OVERLAY_PATH}"
# Build
echo "1. Building kustomization..."
kustomize build "${OVERLAY_PATH}" > /tmp/kustomize-output.yaml || exit 1
# Validate Kubernetes syntax
echo "2. Validating Kubernetes syntax..."
kubeval /tmp/kustomize-output.yaml --strict || exit 1
# Score manifests
echo "3. Scoring manifests..."
kube-score score /tmp/kustomize-output.yaml || exit 1
echo "✅ Kustomize validation passed"
Diff Validation Against Expected Output¶
Diff Validation:
#!/bin/bash
# scripts/validate-kustomize-diff.sh
OVERLAY_PATH="${1}"
EXPECTED_OUTPUT="${2}"
echo "🔍 Validating Kustomize output against expected..."
# Build current output
kustomize build "${OVERLAY_PATH}" > /tmp/current.yaml
# Compare with expected
if diff -u "${EXPECTED_OUTPUT}" /tmp/current.yaml; then
echo "✅ Output matches expected"
else
echo "❌ Output differs from expected"
exit 1
fi
Golden File Testing:
# Generate golden file (expected output)
kustomize build apps/atp-ingestion/overlays/production > \
tests/golden/production-expected.yaml
# Validate against golden file
kustomize build apps/atp-ingestion/overlays/production > /tmp/actual.yaml
diff tests/golden/production-expected.yaml /tmp/actual.yaml
CI Pipeline Integration¶
Azure Pipeline for Kustomize Testing:
# azure-pipelines-kustomize-test.yml
trigger:
branches:
include:
- main
paths:
include:
- apps/**/*
pool:
vmImage: 'ubuntu-latest'
steps:
- task: Bash@3
displayName: 'Install kustomize'
inputs:
targetType: 'inline'
script: |
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /usr/local/bin/
- task: Bash@3
displayName: 'Install kubeval and kube-score'
inputs:
targetType: 'inline'
script: |
wget https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
tar xf kubeval-linux-amd64.tar.gz
sudo mv kubeval /usr/local/bin/
wget https://github.com/zegl/kube-score/releases/download/v1.17.0/kube-score_1.17.0_linux_amd64.tar.gz
tar xf kube-score_1.17.0_linux_amd64.tar.gz
sudo mv kube-score /usr/local/bin/
- task: Bash@3
displayName: 'Validate Kustomize Configurations'
inputs:
targetType: 'inline'
script: |
for overlay in apps/*/overlays/*/; do
OVERLAY_NAME=$(basename "$overlay")
echo "Validating overlay: ${OVERLAY_NAME}"
# Build
kustomize build "$overlay" > /tmp/${OVERLAY_NAME}.yaml || exit 1
# Validate
kubeval /tmp/${OVERLAY_NAME}.yaml --strict || exit 1
kube-score score /tmp/${OVERLAY_NAME}.yaml || exit 1
done
Summary: Kustomize Advanced Patterns¶
- Kustomize Architecture: Base, overlays, components structure, kustomization file structure, resource selection, transformation order
- Strategic Merge Patches: How strategic merge works, merge semantics (replace, add, delete), array merging strategies, common patterns
- JSON Patches: JSON patch operations (add, replace, remove), path targeting, when to use JSON patches vs strategic merge
- ConfigMap and Secret Generators: Generating ConfigMaps from literals and files, generating secrets (encrypted), hash suffixes for updates
- Variable Substitution: Defining variables in kustomization.yaml, using variables in resources, environment-specific variables
- Replacements: Replacing values across resources, source and target configuration, complex replacement patterns
- Remote Bases: Referencing remote kustomizations, Git repository as base, HTTPS URLs for bases, version pinning
- Component Composition: Creating reusable components, including components in overlays, component library for ATP
- Transformers: Label injectors, namespace transformers, name prefix/suffix transformers, image transformers, replica transformers
- Kustomize with Helm: Combining Helm and Kustomize, helm template → kustomize build, post-rendering with Kustomize
- FluxCD Kustomization CRD: FluxCD-specific configuration, Kustomization CRD fields reference, integration with Git
- Testing Kustomize Configurations: kustomize build for validation, diff validation against expected output, CI pipeline integration
Multi-Tenancy in GitOps¶
Purpose: Define multi-tenancy strategies, tenant isolation mechanisms, tenant-specific configurations, automated onboarding/offboarding procedures, and compliance controls for ATP's GitOps deployments, ensuring complete tenant isolation, secure resource management, and adherence to data residency and regulatory requirements (GDPR, HIPAA, SOC 2).
Tenant Isolation Strategies¶
Namespace per Tenant (ATP Approach)¶
Multi-Tenant Architecture with Namespace Isolation:
graph TB
subgraph "Production AKS Cluster"
subgraph "Tenant A Namespace"
NS_A[Namespace: atp-tenant-a]
DEPLOY_A[Deployments<br/>atp-ingestion<br/>atp-query<br/>atp-gateway]
SVC_A[Services<br/>ClusterIP]
DB_A[(Database Schema<br/>tenant_a)]
SECRETS_A[Secrets<br/>tenant-a-secrets]
end
subgraph "Tenant B Namespace"
NS_B[Namespace: atp-tenant-b]
DEPLOY_B[Deployments<br/>atp-ingestion<br/>atp-query<br/>atp-gateway]
SVC_B[Services<br/>ClusterIP]
DB_B[(Database Schema<br/>tenant_b)]
SECRETS_B[Secrets<br/>tenant-b-secrets]
end
subgraph "Tenant C Namespace"
NS_C[Namespace: atp-tenant-c]
DEPLOY_C[Deployments<br/>atp-ingestion<br/>atp-query<br/>atp-gateway]
SVC_C[Services<br/>ClusterIP]
DB_C[(Database Schema<br/>tenant_c)]
SECRETS_C[Secrets<br/>tenant-c-secrets]
end
subgraph "Platform Services"
MON[Monitoring<br/>Shared]
INGRESS[Ingress Controller<br/>Shared]
end
end
NS_A --> MON
NS_B --> MON
NS_C --> MON
INGRESS --> SVC_A
INGRESS --> SVC_B
INGRESS --> SVC_C
style NS_A fill:#90EE90
style NS_B fill:#FFE5B4
style NS_C fill:#87CEEB
style MON fill:#DDA0DD
Namespace per Tenant Benefits:
| Aspect | Benefit | ATP Justification |
|---|---|---|
| Resource Isolation | ✅ Complete resource isolation | Prevents resource contention |
| Network Isolation | ✅ Network policies per namespace | Ensures tenant data isolation |
| RBAC Isolation | ✅ Per-namespace RBAC | Tenant-specific access control |
| Quota Management | ✅ Resource quotas per namespace | Cost control per tenant |
| Compliance | ✅ Isolated audit logs | GDPR/HIPAA compliance |
| Data Residency | ✅ Deploy to specific regions | EU/US data residency requirements |
ATP Decision: Namespace per tenant - Complete isolation, best security, compliance-friendly
Cluster per Tenant (Not Used in ATP)¶
Cluster per Tenant Comparison:
| Aspect | Cluster per Tenant | Namespace per Tenant | ATP Decision |
|---|---|---|---|
| Isolation | ✅ Maximum isolation | ⚠️ Good isolation | Namespace (sufficient) |
| Cost | ❌ Very high (separate clusters) | ✅ Low (shared cluster) | Namespace (cost-effective) |
| Management | ❌ Complex (many clusters) | ✅ Simple (one cluster) | Namespace (operational simplicity) |
| Resource Utilization | ❌ Poor (underutilized clusters) | ✅ Good (shared resources) | Namespace (efficiency) |
| Compliance | ✅ Maximum compliance | ✅ Good compliance | Namespace (sufficient) |
ATP Rationale: Cluster per tenant is overkill for ATP's requirements. Namespace isolation provides sufficient security and compliance while maintaining cost efficiency.
Shared Namespace with Labels (Not Recommended)¶
Shared Namespace with Labels:
# ❌ NOT RECOMMENDED: Shared namespace with labels
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
namespace: atp-production
labels:
tenant: tenant-a # Label-based separation
spec:
# ...
Why Not Recommended:
| Issue | Impact | ATP Decision |
|---|---|---|
| No Resource Isolation | ❌ Resource contention between tenants | ❌ Not acceptable |
| Network Policy Complexity | ❌ Complex label selectors | ❌ Error-prone |
| RBAC Complexity | ❌ Difficult to enforce tenant boundaries | ❌ Security risk |
| Audit Trail | ❌ Harder to isolate tenant activities | ❌ Compliance issue |
ATP Decision: Not used - Insufficient isolation for audit trail platform requirements
Tenant-Specific Configurations¶
Resource Limits per Tenant¶
Resource Quota per Tenant:
# tenants/tenant-a/resources/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: tenant-a-quota
namespace: atp-tenant-a
labels:
tenant: tenant-a
managed-by: kustomize
spec:
hard:
requests.cpu: "8" # 8 CPU cores
requests.memory: 16Gi # 16Gi memory
limits.cpu: "16" # 16 CPU cores
limits.memory: 32Gi # 32Gi memory
persistentvolumeclaims: "5"
pods: "20"
services: "10"
configmaps: "20"
secrets: "10"
Tenant Resource Limits Matrix:
| Tenant Tier | CPU Requests | Memory Requests | CPU Limits | Memory Limits | Pods Max | Monthly Cost (Est.) |
|---|---|---|---|---|---|---|
| Basic | 2 cores | 4Gi | 4 cores | 8Gi | 10 | $500 |
| Standard | 8 cores | 16Gi | 16 cores | 32Gi | 20 | $2,000 |
| Premium | 32 cores | 64Gi | 64 cores | 128Gi | 50 | $8,000 |
| Enterprise | 128 cores | 256Gi | 256 cores | 512Gi | 200 | $32,000 |
Data Residency Requirements (EU vs US)¶
Data Residency Configuration:
# tenants/tenant-eu/resources/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: atp-tenant-eu
labels:
tenant: tenant-eu
data-residency: "eu-west" # EU data residency
compliance: "gdpr"
region: "westeurope"
managed-by: kustomize
annotations:
data-residency-policy: "EU-only"
compliance-requirements: "GDPR"
Regional Deployment Strategy:
| Region | Tenants | Compliance | AKS Cluster |
|---|---|---|---|
| East US | US-based tenants | US regulations | atp-prod-eus-aks |
| West Europe | EU-based tenants | GDPR | atp-prod-weu-aks |
Tenant Region Assignment:
# tenants/tenant-eu/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- resources/namespace.yaml
- resources/resource-quota.yaml
- resources/network-policy.yaml
commonLabels:
tenant: tenant-eu
region: "westeurope" # EU region
data-residency: "eu-west"
Compliance Controls (GDPR, HIPAA)¶
Compliance Labels and Annotations:
# tenants/tenant-hipaa/resources/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: atp-tenant-hipaa
labels:
tenant: tenant-hipaa
compliance: "hipaa"
data-classification: "phi" # Protected Health Information
encryption-required: "true"
annotations:
compliance-policy: "HIPAA"
encryption-at-rest: "required"
encryption-in-transit: "required"
audit-logging: "required"
data-retention: "6-years"
Compliance Configuration Matrix:
| Compliance Type | Labels | Annotations | Requirements |
|---|---|---|---|
| GDPR | compliance: gdpr, data-residency: eu-west |
data-residency-policy, right-to-be-forgotten: true |
EU data residency, data deletion on request |
| HIPAA | compliance: hipaa, data-classification: phi |
encryption-at-rest: required, audit-logging: required |
Encryption, audit logs, 6-year retention |
| SOC 2 | compliance: soc2 |
audit-logging: required, access-control: required |
Audit logs, access controls |
Custom Ingestion Rules¶
Tenant-Specific Ingestion Configuration:
# tenants/tenant-a/config/configmap-ingestion.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: tenant-a-ingestion-config
namespace: atp-tenant-a
data:
ingestion-rules.yaml: |
rules:
- eventType: "audit"
batchSize: 100
batchTimeout: "30s"
maxRetries: 3
- eventType: "compliance"
batchSize: 50
batchTimeout: "60s"
maxRetries: 5
rateLimits:
requestsPerSecond: 1000
burstSize: 2000
retention:
audit: "7-years"
compliance: "10-years"
GitOps Structure for Tenants¶
/tenants/{tenant-id}/ Directory Structure¶
Tenant Directory Structure:
atp-gitops/
├── tenants/
│ ├── tenant-a/
│ │ ├── kustomization.yaml
│ │ ├── resources/
│ │ │ ├── namespace.yaml
│ │ │ ├── resource-quota.yaml
│ │ │ ├── network-policy.yaml
│ │ │ ├── rbac.yaml
│ │ │ └── serviceaccount.yaml
│ │ ├── apps/
│ │ │ ├── ingestion/
│ │ │ │ ├── kustomization.yaml
│ │ │ │ └── deployment.yaml
│ │ │ ├── query/
│ │ │ │ ├── kustomization.yaml
│ │ │ │ └── deployment.yaml
│ │ │ └── gateway/
│ │ │ ├── kustomization.yaml
│ │ │ └── deployment.yaml
│ │ ├── config/
│ │ │ ├── configmap-ingestion.yaml
│ │ │ └── configmap-query.yaml
│ │ └── values/
│ │ ├── values-tenant-a.yaml
│ │ └── values-production.yaml
│ ├── tenant-b/
│ │ └── ...
│ └── tenant-eu/
│ └── ...
Tenant Namespace Manifest¶
Tenant Namespace:
# tenants/tenant-a/resources/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: atp-tenant-a
labels:
tenant: tenant-a
environment: production
managed-by: kustomize
compliance: "soc2"
data-residency: "us-east"
region: "eastus"
annotations:
description: "ATP Tenant A - Production Environment"
created-by: "tenant-onboarding"
created-at: "2024-01-15T10:00:00Z"
owner: "tenant-a-admin@example.com"
tier: "standard"
cost-center: "sales"
Tenant Resource Quota¶
Resource Quota Configuration:
# tenants/tenant-a/resources/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: tenant-a-quota
namespace: atp-tenant-a
labels:
tenant: tenant-a
spec:
hard:
# CPU and Memory
requests.cpu: "8"
requests.memory: 16Gi
limits.cpu: "16"
limits.memory: 32Gi
# Storage
persistentvolumeclaims: "5"
requests.storage: 100Gi
# Pod and Service Limits
pods: "20"
services: "10"
# Object Counts
configmaps: "20"
secrets: "10"
services.nodeports: "0"
services.loadbalancers: "2"
Tier-Based Quota Templates:
# tenants/_templates/resource-quota-basic.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: ${TENANT_ID}-quota
namespace: atp-${TENANT_ID}
spec:
hard:
requests.cpu: "2"
requests.memory: 4Gi
limits.cpu: "4"
limits.memory: 8Gi
pods: "10"
# tenants/_templates/resource-quota-enterprise.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: ${TENANT_ID}-quota
namespace: atp-${TENANT_ID}
spec:
hard:
requests.cpu: "128"
requests.memory: 256Gi
limits.cpu: "256"
limits.memory: 512Gi
pods: "200"
Tenant Network Policy¶
Tenant Network Policy (Complete Isolation):
# tenants/tenant-a/resources/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: tenant-a-isolation
namespace: atp-tenant-a
spec:
podSelector: {} # Apply to all pods
policyTypes:
- Ingress
- Egress
ingress:
# Allow from ingress controller (shared platform)
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
- podSelector:
matchLabels:
app: ingress-nginx
# Allow from monitoring namespace (shared)
- from:
- namespaceSelector:
matchLabels:
name: monitoring
# Deny all other ingress (including other tenant namespaces)
egress:
# Allow DNS
- to:
- namespaceSelector:
matchLabels:
name: kube-system
ports:
- protocol: UDP
port: 53
# Allow to monitoring
- to:
- namespaceSelector:
matchLabels:
name: monitoring
# Allow to database (external)
- to:
- ipBlock:
cidr: 10.0.0.0/16 # Database subnet
ports:
- protocol: TCP
port: 5432
# Deny egress to other tenant namespaces
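The isolation is directly testable: a probe pod in tenant A should resolve DNS but time out against tenant B. A quick manual check, assuming each tenant namespace exposes an atp-gateway service:
Network Isolation Smoke Test:
# From tenant A, cross-tenant traffic should be dropped and the probe should fail
kubectl run np-test --rm -i --restart=Never -n atp-tenant-a \
  --image=busybox:1.36 -- \
  wget -qO- --timeout=5 http://atp-gateway.atp-tenant-b.svc.cluster.local \
  && echo "❌ isolation broken" || echo "✅ cross-tenant traffic blocked"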
Network Policy Isolation Diagram:
graph TB
subgraph "Tenant A Namespace"
POD_A1[Pod A1]
POD_A2[Pod A2]
NP_A[Network Policy<br/>Deny cross-tenant]
end
subgraph "Tenant B Namespace"
POD_B1[Pod B1]
POD_B2[Pod B2]
NP_B[Network Policy<br/>Deny cross-tenant]
end
subgraph "Platform Namespaces"
INGRESS[Ingress Controller]
MON[Monitoring]
DNS[Kube DNS]
end
INGRESS -->|Allowed| POD_A1
INGRESS -->|Allowed| POD_B1
POD_A1 -.->|Blocked| POD_B1
POD_B1 -.->|Blocked| POD_A1
POD_A1 -->|Allowed| DNS
POD_B1 -->|Allowed| DNS
POD_A1 -->|Allowed| MON
POD_B1 -->|Allowed| MON
style NP_A fill:#FF6B6B
style NP_B fill:#FF6B6B
style POD_A1 fill:#90EE90
style POD_B1 fill:#FFE5B4
Tenant RBAC¶
Tenant-Specific RBAC:
# tenants/tenant-a/resources/rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: tenant-a-sa
namespace: atp-tenant-a
labels:
tenant: tenant-a
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: tenant-a-role
namespace: atp-tenant-a
rules:
# Allow read/write to tenant namespace resources
- apiGroups: [""]
resources: ["configmaps", "secrets", "pods", "services"]
verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: ["apps"]
resources: ["deployments", "statefulsets"]
verbs: ["get", "list", "watch", "create", "update", "patch"]
# Deny access to other namespaces (implicitly denied)
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: tenant-a-rolebinding
namespace: atp-tenant-a
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: tenant-a-role
subjects:
- kind: ServiceAccount
name: tenant-a-sa
namespace: atp-tenant-a
- kind: User
name: tenant-a-admin@example.com
apiGroup: rbac.authorization.k8s.io
Tenant Admin RBAC (Limited Cluster Role):
# tenants/tenant-a/resources/cluster-role-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: tenant-a-admin
rules:
# Read-only access to the tenant's own namespace object
# (resourceNames cannot restrict list/watch requests, so only "get" is granted)
- apiGroups: [""]
  resources: ["namespaces"]
  resourceNames: ["atp-tenant-a"] # Only tenant namespace
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: tenant-a-admin-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: tenant-a-admin
subjects:
- kind: User
name: tenant-a-admin@example.com
apiGroup: rbac.authorization.k8s.io
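The bindings can be verified with impersonation before handing credentials to the tenant admin:
RBAC Verification:
# Should return "yes": tenant admin inside their own namespace
kubectl auth can-i list pods -n atp-tenant-a --as tenant-a-admin@example.com
# Should return "no": the same user against another tenant's namespace
kubectl auth can-i list pods -n atp-tenant-b --as tenant-a-admin@example.com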
Dynamic Tenant Provisioning¶
Onboarding Script¶
Tenant Onboarding Automation Script:
#!/bin/bash
# scripts/onboard-tenant.sh
TENANT_ID="${1}"
TENANT_TIER="${2:-standard}" # basic, standard, premium, enterprise
REGION="${3:-eastus}" # eastus, westeurope
COMPLIANCE="${4:-soc2}" # soc2, gdpr, hipaa
OWNER_EMAIL="${5}"
if [ -z "${TENANT_ID}" ] || [ -z "${OWNER_EMAIL}" ]; then
echo "Usage: $0 <tenant-id> [tier] [region] [compliance] <owner-email>"
echo "Example: $0 tenant-a standard eastus soc2 admin@tenant-a.example.com"
exit 1
fi
TENANT_DIR="tenants/${TENANT_ID}"
NAMESPACE="atp-${TENANT_ID}"
echo "🏢 Onboarding tenant: ${TENANT_ID}"
echo " Tier: ${TENANT_TIER}"
echo " Region: ${REGION}"
echo " Compliance: ${COMPLIANCE}"
echo " Owner: ${OWNER_EMAIL}"
# Step 1: Create tenant directory structure
echo "📁 Step 1: Creating tenant directory structure..."
mkdir -p "${TENANT_DIR}/resources"
mkdir -p "${TENANT_DIR}/apps/ingestion"
mkdir -p "${TENANT_DIR}/apps/query"
mkdir -p "${TENANT_DIR}/apps/gateway"
mkdir -p "${TENANT_DIR}/config"
mkdir -p "${TENANT_DIR}/values"
# Step 2: Generate namespace
echo "📦 Step 2: Generating namespace..."
cat > "${TENANT_DIR}/resources/namespace.yaml" <<EOF
apiVersion: v1
kind: Namespace
metadata:
name: ${NAMESPACE}
labels:
tenant: ${TENANT_ID}
tier: ${TENANT_TIER}
environment: production
region: ${REGION}
compliance: ${COMPLIANCE}
managed-by: kustomize
annotations:
description: "ATP Tenant ${TENANT_ID} - Production Environment"
created-by: "tenant-onboarding"
created-at: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
owner: "${OWNER_EMAIL}"
tier: "${TENANT_TIER}"
EOF
# Step 3: Generate resource quota based on tier
echo "📊 Step 3: Generating resource quota..."
./scripts/generate-tenant-quota.sh "${TENANT_ID}" "${TENANT_TIER}" > "${TENANT_DIR}/resources/resource-quota.yaml"
# Step 4: Generate network policy
echo "🔒 Step 4: Generating network policy..."
./scripts/generate-tenant-network-policy.sh "${TENANT_ID}" > "${TENANT_DIR}/resources/network-policy.yaml"
# Step 5: Generate RBAC
echo "🔐 Step 5: Generating RBAC..."
./scripts/generate-tenant-rbac.sh "${TENANT_ID}" "${OWNER_EMAIL}" > "${TENANT_DIR}/resources/rbac.yaml"
# Step 6: Generate application manifests
echo "🚀 Step 6: Generating application manifests..."
./scripts/generate-tenant-apps.sh "${TENANT_ID}" "${TENANT_TIER}" "${REGION}"
# Step 7: Generate kustomization.yaml
echo "📝 Step 7: Generating kustomization.yaml..."
cat > "${TENANT_DIR}/kustomization.yaml" <<EOF
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ${NAMESPACE}
resources:
- resources/namespace.yaml
- resources/resource-quota.yaml
- resources/network-policy.yaml
- resources/rbac.yaml
- apps/ingestion
- apps/query
- apps/gateway
commonLabels:
tenant: ${TENANT_ID}
tier: ${TENANT_TIER}
region: ${REGION}
compliance: ${COMPLIANCE}
EOF
echo "✅ Tenant directory structure created: ${TENANT_DIR}"
echo ""
echo "📋 Next steps:"
echo "1. Review generated manifests: ${TENANT_DIR}/"
echo "2. Commit to Git: git add ${TENANT_DIR} && git commit -m 'feat: onboard tenant ${TENANT_ID}'"
echo "3. Push to repository: git push origin main"
echo "4. FluxCD will automatically reconcile and deploy tenant resources"
Automated Manifest Generation¶
Generate Tenant Quota Script:
#!/bin/bash
# scripts/generate-tenant-quota.sh
TENANT_ID="${1}"
TIER="${2:-standard}"
case "${TIER}" in
"basic")
CPU_REQ="2"
MEM_REQ="4Gi"
CPU_LIM="4"
MEM_LIM="8Gi"
PODS="10"
;;
"standard")
CPU_REQ="8"
MEM_REQ="16Gi"
CPU_LIM="16"
MEM_LIM="32Gi"
PODS="20"
;;
"premium")
CPU_REQ="32"
MEM_REQ="64Gi"
CPU_LIM="64"
MEM_LIM="128Gi"
PODS="50"
;;
"enterprise")
CPU_REQ="128"
MEM_REQ="256Gi"
CPU_LIM="256"
MEM_LIM="512Gi"
PODS="200"
;;
*)
echo "❌ Unknown tier: ${TIER} (expected basic|standard|premium|enterprise)" >&2
exit 1
;;
esac
cat <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
name: ${TENANT_ID}-quota
namespace: atp-${TENANT_ID}
spec:
hard:
requests.cpu: "${CPU_REQ}"
requests.memory: ${MEM_REQ}
limits.cpu: "${CPU_LIM}"
limits.memory: ${MEM_LIM}
pods: "${PODS}"
persistentvolumeclaims: "5"
services: "10"
EOF
Generate Tenant Network Policy Script:
#!/bin/bash
# scripts/generate-tenant-network-policy.sh
TENANT_ID="${1}"
NAMESPACE="atp-${TENANT_ID}"
cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: ${TENANT_ID}-isolation
namespace: ${NAMESPACE}
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
ingress:
# Allow from ingress controller
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
# Allow from monitoring
- from:
- namespaceSelector:
matchLabels:
name: monitoring
egress:
# Allow DNS
- to:
- namespaceSelector:
matchLabels:
name: kube-system
ports:
- protocol: UDP
port: 53
# Allow to monitoring
- to:
- namespaceSelector:
matchLabels:
name: monitoring
# Allow to database (external)
- to:
- ipBlock:
cidr: 10.0.0.0/16
ports:
- protocol: TCP
port: 5432
EOF
Generate Tenant Apps Script:
#!/bin/bash
# scripts/generate-tenant-apps.sh
TENANT_ID="${1}"
TIER="${2}"
REGION="${3}"
TENANT_DIR="tenants/${TENANT_ID}"
# Generate ingestion app kustomization
cat > "${TENANT_DIR}/apps/ingestion/kustomization.yaml" <<EOF
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../../../apps/atp-ingestion/base
patchesStrategicMerge:
- deployment-patch.yaml
images:
- name: connectsoft.azurecr.io/atp/ingestion
newTag: v1.2.3
namespace: atp-${TENANT_ID}
commonLabels:
tenant: ${TENANT_ID}
EOF
# Generate deployment patch
cat > "${TENANT_DIR}/apps/ingestion/deployment-patch.yaml" <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
replicas: 3
template:
spec:
containers:
- name: atp-ingestion
env:
- name: TENANT_ID
value: "${TENANT_ID}"
- name: REGION
value: "${REGION}"
EOF
echo "✅ Application manifests generated"
Git Commit for New Tenant¶
Commit Tenant Configuration:
#!/bin/bash
# scripts/commit-tenant-config.sh
TENANT_ID="${1}"
if [ -z "${TENANT_ID}" ]; then
echo "Usage: $0 <tenant-id>"
exit 1
fi
TENANT_DIR="tenants/${TENANT_ID}"
echo "📝 Committing tenant configuration: ${TENANT_ID}"
# Add tenant directory
git add "${TENANT_DIR}"
# Commit with conventional commit format
git commit -m "feat(tenant): onboard tenant ${TENANT_ID}
- Add namespace: atp-${TENANT_ID}
- Add resource quota and network policy
- Add tenant-specific RBAC
- Add application deployments
Signed-off-by: $(git config user.name) <$(git config user.email)>" \
--gpg-sign
# Push to repository
git push origin main
echo "✅ Tenant configuration committed and pushed"
echo "⏳ FluxCD will reconcile and deploy tenant resources automatically"
FluxCD Applies Tenant Resources¶
FluxCD Kustomization for Tenant:
# clusters/production/kustomizations/tenants/tenant-a.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: tenant-a
namespace: flux-system
labels:
tenant: tenant-a
spec:
interval: 5m
path: ./tenants/tenant-a
prune: true
wait: true
timeout: 10m
sourceRef:
kind: GitRepository
name: atp-gitops-production
dependsOn:
- name: infrastructure
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
namespace: atp-tenant-a
- apiVersion: apps/v1
kind: Deployment
name: atp-query
namespace: atp-tenant-a
- apiVersion: apps/v1
kind: Deployment
name: atp-gateway
namespace: atp-tenant-a
Auto-Create FluxCD Kustomization for Tenant:
#!/bin/bash
# scripts/create-tenant-fluxcd-kustomization.sh
TENANT_ID="${1}"
KUST_FILE="clusters/production/kustomizations/tenants/${TENANT_ID}.yaml"
cat > "${KUST_FILE}" <<EOF
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: ${TENANT_ID}
namespace: flux-system
labels:
tenant: ${TENANT_ID}
spec:
interval: 5m
path: ./tenants/${TENANT_ID}
prune: true
wait: true
timeout: 10m
sourceRef:
kind: GitRepository
name: atp-gitops-production
dependsOn:
- name: infrastructure
EOF
# Track the Kustomization in Git (the source of truth), then apply once for immediate effect
git add "${KUST_FILE}"
git commit -m "feat(tenant): add FluxCD Kustomization for ${TENANT_ID}"
git push origin main
kubectl apply -f "${KUST_FILE}"
echo "✅ FluxCD Kustomization created for tenant: ${TENANT_ID}"
Tenant Onboarding Automation¶
Step-by-Step Onboarding Process¶
Tenant Onboarding Workflow:
sequenceDiagram
participant Admin as Administrator
participant Script as Onboarding Script
participant Git as Git Repository
participant FluxCD as FluxCD
participant K8s as Kubernetes
Admin->>Script: Execute onboard-tenant.sh
Script->>Script: Create directory structure
Script->>Script: Generate namespace
Script->>Script: Generate resource quota
Script->>Script: Generate network policy
Script->>Script: Generate RBAC
Script->>Script: Generate app manifests
Script->>Git: Commit tenant config
Git->>FluxCD: Poll for changes
FluxCD->>Git: Fetch tenant manifests
FluxCD->>K8s: Apply namespace
FluxCD->>K8s: Apply resource quota
FluxCD->>K8s: Apply network policy
FluxCD->>K8s: Apply RBAC
FluxCD->>K8s: Deploy applications
K8s-->>FluxCD: Resources ready
FluxCD-->>Admin: Tenant onboarded
Complete Onboarding Automation:
#!/bin/bash
# scripts/onboard-tenant-complete.sh
TENANT_ID="${1}"
TENANT_TIER="${2:-standard}"
REGION="${3:-eastus}"
COMPLIANCE="${4:-soc2}"
OWNER_EMAIL="${5}"
if [ -z "${TENANT_ID}" ] || [ -z "${OWNER_EMAIL}" ]; then
echo "Usage: $0 <tenant-id> [tier] [region] [compliance] <owner-email>"
exit 1
fi
echo "🏢 Complete Tenant Onboarding: ${TENANT_ID}"
echo ""
# Step 1: Create tenant directory
echo "📁 Step 1: Creating tenant directory..."
./scripts/onboard-tenant.sh "${TENANT_ID}" "${TENANT_TIER}" "${REGION}" "${COMPLIANCE}" "${OWNER_EMAIL}" || exit 1
# Step 2: Validate manifests
echo "🔍 Step 2: Validating manifests..."
./scripts/validate-kustomize.sh "tenants/${TENANT_ID}" || exit 1
# Step 3: Commit to Git
echo "📝 Step 3: Committing to Git..."
./scripts/commit-tenant-config.sh "${TENANT_ID}" || exit 1
# Step 4: Create FluxCD Kustomization
echo "⚙️ Step 4: Creating FluxCD Kustomization..."
./scripts/create-tenant-fluxcd-kustomization.sh "${TENANT_ID}" || exit 1
# Step 5: Wait for FluxCD reconciliation
echo "⏳ Step 5: Waiting for FluxCD reconciliation..."
sleep 60
# Step 6: Verify tenant resources
echo "✅ Step 6: Verifying tenant resources..."
./scripts/verify-tenant-onboarding.sh "${TENANT_ID}" || exit 1
echo ""
echo "🎉 Tenant onboarding complete: ${TENANT_ID}"
Verify Tenant Onboarding:
#!/bin/bash
# scripts/verify-tenant-onboarding.sh
TENANT_ID="${1}"
NAMESPACE="atp-${TENANT_ID}"
echo "🔍 Verifying tenant onboarding: ${TENANT_ID}"
# Check namespace exists
if ! kubectl get namespace "${NAMESPACE}" >/dev/null 2>&1; then
echo "❌ Namespace ${NAMESPACE} does not exist"
exit 1
fi
# Check resource quota
if ! kubectl get resourcequota -n "${NAMESPACE}" >/dev/null 2>&1; then
echo "❌ Resource quota not found"
exit 1
fi
# Check network policy
if ! kubectl get networkpolicy -n "${NAMESPACE}" >/dev/null 2>&1; then
echo "❌ Network policy not found"
exit 1
fi
# Check deployments are ready
DEPLOYMENTS=("atp-ingestion" "atp-query" "atp-gateway")
for DEPLOYMENT in "${DEPLOYMENTS[@]}"; do
if ! kubectl wait --for=condition=available --timeout=5m deployment/"${DEPLOYMENT}" -n "${NAMESPACE}"; then
echo "❌ Deployment ${DEPLOYMENT} not ready"
exit 1
fi
done
echo "✅ Tenant onboarding verified: ${TENANT_ID}"
Tenant-Specific Helm Values¶
values-tenant-{id}.yaml¶
Tenant-Specific Helm Values:
# tenants/tenant-a/values/values-tenant-a.yaml
# Tenant-specific Helm values for tenant-a
replicaCount: 3
image:
tag: v1.2.3
resources:
limits:
cpu: 4000m
memory: 4Gi
requests:
cpu: 1000m
memory: 2Gi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 10
env:
- name: TENANT_ID
value: "tenant-a"
- name: TENANT_TIER
value: "standard"
- name: REGION
value: "eastus"
ingress:
enabled: true
hosts:
- host: tenant-a.atp.connectsoft.example
paths:
- path: /
database:
host: atp-db-tenant-a.database.windows.net
name: atp_tenant_a
redis:
host: atp-redis-tenant-a.redis.cache.windows.net
featureFlags:
enableAdvancedQuerying: true
enableRealTimeEvents: true
enableComplianceReports: true
Override Replicas, Resources, Endpoints¶
Environment-Specific Tenant Values:
# tenants/tenant-a/values/values-production.yaml
# Production-specific overrides for tenant-a
replicaCount: 5
resources:
limits:
cpu: 8000m
memory: 8Gi
requests:
cpu: 2000m
memory: 4Gi
autoscaling:
minReplicas: 5
maxReplicas: 20
service:
type: LoadBalancer
ingress:
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
tls:
- secretName: tenant-a-tls
hosts:
- tenant-a.atp.connectsoft.example
Tenant-Specific Feature Flags¶
Feature Flags Configuration:
# tenants/tenant-a/config/configmap-feature-flags.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: tenant-a-feature-flags
namespace: atp-tenant-a
data:
feature-flags.json: |
{
"enableAdvancedQuerying": true,
"enableRealTimeEvents": true,
"enableComplianceReports": true,
"enableDataExport": false,
"enableCustomDashboards": true,
"maxRetentionDays": 2555,
"auditLogLevel": "Detailed"
}
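Services read this file from a mounted ConfigMap volume; operators can inspect the live flags directly (the jsonpath expression escapes the dot in the key name):
Inspect Live Feature Flags:
kubectl get configmap tenant-a-feature-flags -n atp-tenant-a \
  -o jsonpath='{.data.feature-flags\.json}' | jq '.enableDataExport'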
Multi-Tenancy and FluxCD¶
Per-Tenant Kustomization¶
FluxCD Kustomization Per Tenant:
# clusters/production/kustomizations/tenants/tenant-a.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: tenant-a
namespace: flux-system
labels:
tenant: tenant-a
type: tenant
spec:
interval: 5m
path: ./tenants/tenant-a
prune: true
wait: true
timeout: 10m
retryInterval: 2m
sourceRef:
kind: GitRepository
name: atp-gitops-production
dependsOn:
- name: infrastructure
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
namespace: atp-tenant-a
# NOTE: the Flux Kustomization CRD has no kustomizeFlags field; kustomize-controller
# already builds with LoadRestrictionsNone, so paths outside this directory resolve
List All Tenant Kustomizations:
kubectl get kustomizations -n flux-system -l type=tenant
# Output:
# NAME READY STATUS AGE
# tenant-a True Applied 5d
# tenant-b True Applied 10d
# tenant-eu True Applied 2d
Tenant Isolation in Reconciliation¶
Isolation Benefits:
| Aspect | Isolation Benefit | ATP Implementation |
|---|---|---|
| Reconciliation Failure | ✅ One tenant failure doesn't affect others | Separate Kustomization per tenant |
| Resource Conflicts | ✅ Namespace isolation prevents conflicts | Namespace per tenant |
| Network Isolation | ✅ Network policies prevent cross-tenant traffic | Tenant-specific network policies |
| RBAC Isolation | ✅ Tenant admin can only access their namespace | Per-namespace RBAC |
FluxCD Reconciliation Isolation:
graph TB
subgraph "FluxCD Controller"
RECONCILE[Reconciliation Loop]
end
subgraph "Git Repository"
TENANT_A[tenants/tenant-a/]
TENANT_B[tenants/tenant-b/]
TENANT_C[tenants/tenant-c/]
end
subgraph "Kubernetes Cluster"
KUST_A[Kustomization: tenant-a]
KUST_B[Kustomization: tenant-b]
KUST_C[Kustomization: tenant-c]
NS_A[Namespace: atp-tenant-a]
NS_B[Namespace: atp-tenant-b]
NS_C[Namespace: atp-tenant-c]
end
RECONCILE --> TENANT_A
RECONCILE --> TENANT_B
RECONCILE --> TENANT_C
TENANT_A --> KUST_A
TENANT_B --> KUST_B
TENANT_C --> KUST_C
KUST_A --> NS_A
KUST_B --> NS_B
KUST_C --> NS_C
style KUST_A fill:#90EE90
style KUST_B fill:#FFE5B4
style KUST_C fill:#87CEEB
Failure Isolation (One Tenant Doesn't Affect Others)¶
Failure Isolation Example:
# Tenant A reconciliation fails
kubectl get kustomizations -n flux-system -l type=tenant
# Output:
# NAME        READY   STATUS    AGE
# tenant-a    False   Failed    5d
# tenant-b    True    Applied   10d  ← Still working
# tenant-eu   True    Applied   2d   ← Still working
Independent Reconciliation:
- ✅ Tenant A failure does not affect Tenant B or Tenant C
- ✅ Each tenant Kustomization reconciles independently
- ✅ Namespace isolation prevents resource conflicts
- ✅ Network policies prevent cross-tenant traffic
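The type=tenant label also makes failure triage scriptable; a small sketch that prints only unhealthy tenant Kustomizations:
Find Failing Tenants:
# Print tenant Kustomizations that lack a Ready=True condition
kubectl get kustomizations -n flux-system -l type=tenant -o json | \
  jq -r '.items[]
         | select([.status.conditions[]? | select(.type=="Ready" and .status=="True")] | length == 0)
         | .metadata.name'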
Tenant Offboarding¶
Data Deletion Procedures¶
GDPR Right to be Forgotten:
#!/bin/bash
# scripts/offboard-tenant.sh
TENANT_ID="${1}"
REASON="${2:-tenant-request}"
GDPR_REQUEST="${3:-false}" # true if GDPR right-to-be-forgotten
if [ -z "${TENANT_ID}" ]; then
echo "Usage: $0 <tenant-id> [reason] [gdpr-request]"
echo "Example: $0 tenant-a tenant-request true"
exit 1
fi
NAMESPACE="atp-${TENANT_ID}"
TENANT_DIR="tenants/${TENANT_ID}"
echo "🗑️ Offboarding tenant: ${TENANT_ID}"
echo " Reason: ${REASON}"
echo " GDPR Request: ${GDPR_REQUEST}"
# Step 1: Export tenant data (if GDPR request)
if [ "${GDPR_REQUEST}" = "true" ]; then
echo "📦 Step 1: Exporting tenant data for GDPR compliance..."
./scripts/export-tenant-data.sh "${TENANT_ID}" || exit 1
fi
# Step 2: Delete database data
echo "🗄️ Step 2: Deleting database data..."
./scripts/delete-tenant-database.sh "${TENANT_ID}" || exit 1
# Step 3: Delete Azure resources
echo "☁️ Step 3: Deleting Azure resources..."
./scripts/delete-tenant-azure-resources.sh "${TENANT_ID}" || exit 1
# Step 4: Delete Kubernetes namespace (deletes all resources)
echo "📦 Step 4: Deleting Kubernetes namespace..."
kubectl delete namespace "${NAMESPACE}" --wait=true --timeout=10m || true
# Step 5: Remove tenant directory from Git
echo "📝 Step 5: Removing tenant configuration from Git..."
git rm -r "${TENANT_DIR}" || true
git commit -m "feat(tenant): offboard tenant ${TENANT_ID}
- Remove tenant namespace and resources
- Delete tenant data
- Reason: ${REASON}
- GDPR Request: ${GDPR_REQUEST}
Signed-off-by: $(git config user.name) <$(git config user.email)>" \
--gpg-sign
git push origin main
# Step 6: Delete FluxCD Kustomization
echo "⚙️ Step 6: Deleting FluxCD Kustomization..."
kubectl delete kustomization "${TENANT_ID}" -n flux-system || true
echo "✅ Tenant offboarding complete: ${TENANT_ID}"
Namespace Cleanup¶
Namespace Cleanup Script:
#!/bin/bash
# scripts/cleanup-tenant-namespace.sh
TENANT_ID="${1}"
NAMESPACE="atp-${TENANT_ID}"
echo "🧹 Cleaning up namespace: ${NAMESPACE}"
# Delete all resources in namespace
kubectl delete all --all -n "${NAMESPACE}" --wait=true --timeout=5m || true
# Delete PVCs
kubectl delete pvc --all -n "${NAMESPACE}" --wait=true --timeout=5m || true
# Delete secrets and configmaps
kubectl delete secrets --all -n "${NAMESPACE}" || true
kubectl delete configmaps --all -n "${NAMESPACE}" || true
# Delete network policies
kubectl delete networkpolicies --all -n "${NAMESPACE}" || true
# Delete namespace
kubectl delete namespace "${NAMESPACE}" --wait=true --timeout=5m || true
echo "✅ Namespace cleanup complete"
Git Commit to Remove Tenant¶
Remove Tenant from Git:
#!/bin/bash
# scripts/remove-tenant-from-git.sh
TENANT_ID="${1}"
REASON="${2}"
TENANT_DIR="tenants/${TENANT_ID}"
echo "📝 Removing tenant from Git: ${TENANT_ID}"
# Remove tenant directory
git rm -r "${TENANT_DIR}" || true
# Remove FluxCD Kustomization
git rm "clusters/production/kustomizations/tenants/${TENANT_ID}.yaml" || true
# Commit removal
git commit -m "feat(tenant): remove tenant ${TENANT_ID}
- Remove tenant namespace configuration
- Remove tenant FluxCD Kustomization
- Reason: ${REASON}
Signed-off-by: $(git config user.name) <$(git config user.email)>" \
--gpg-sign
# Push to repository
git push origin main
echo "✅ Tenant removed from Git"
Compliance with GDPR (Right to be Forgotten)¶
GDPR Data Deletion Procedure:
#!/bin/bash
# scripts/gdpr-data-deletion.sh
TENANT_ID="${1}"
echo "🔒 GDPR Data Deletion Request: ${TENANT_ID}"
# Step 1: Export data (for audit trail)
echo "📦 Step 1: Exporting data for audit trail..."
./scripts/export-tenant-data.sh "${TENANT_ID}" \
--output "exports/tenant-${TENANT_ID}-$(date +%Y%m%d).json"
# Step 2: Verify export
if [ ! -f "exports/tenant-${TENANT_ID}-$(date +%Y%m%d).json" ]; then
echo "❌ Data export failed"
exit 1
fi
# Step 3: Delete database records
echo "🗄️ Step 3: Deleting database records..."
./scripts/delete-tenant-database.sh "${TENANT_ID}" --confirm || exit 1
# Step 4: Delete blob storage
echo "💾 Step 4: Deleting blob storage..."
./scripts/delete-tenant-blob-storage.sh "${TENANT_ID}" || exit 1
# Step 5: Delete logs
echo "📋 Step 5: Deleting logs..."
./scripts/delete-tenant-logs.sh "${TENANT_ID}" || exit 1
# Step 6: Offboard tenant
echo "🗑️ Step 6: Offboarding tenant..."
./scripts/offboard-tenant.sh "${TENANT_ID}" "gdpr-request" "true" || exit 1
# Step 7: Generate deletion certificate
echo "📜 Step 7: Generating deletion certificate..."
cat > "certificates/gdpr-deletion-${TENANT_ID}-$(date +%Y%m%d).md" <<EOF
# GDPR Data Deletion Certificate
**Tenant ID**: ${TENANT_ID}
**Date**: $(date -u +%Y-%m-%dT%H:%M:%SZ)
**Request Type**: Right to be Forgotten (GDPR Article 17)
## Data Deletion Summary
- ✅ Database records deleted
- ✅ Blob storage deleted
- ✅ Logs deleted
- ✅ Kubernetes resources deleted
- ✅ Git configuration removed
## Export Location
- Data exported to: exports/tenant-${TENANT_ID}-$(date +%Y%m%d).json
- Retention: 7 years (legal requirement)
## Verification
All data related to tenant ${TENANT_ID} has been permanently deleted
from ATP systems in compliance with GDPR Article 17.
EOF
echo "✅ GDPR data deletion complete"
echo "📜 Deletion certificate: certificates/gdpr-deletion-${TENANT_ID}-$(date +%Y%m%d).md"
Tenant Cost Allocation¶
Namespace-Level Resource Tagging¶
Resource Tagging for Cost Allocation:
# tenants/tenant-a/resources/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: atp-tenant-a
labels:
tenant: tenant-a
cost-center: "sales"
business-unit: "enterprise"
project: "audit-trail-platform"
environment: "production"
tier: "standard"
annotations:
cost-allocation: "tenant-a"
billing-account: "account-12345"
owner: "tenant-a-admin@example.com"
Azure Resource Tagging:
# Tag Azure resources for tenant
az aks update \
--resource-group atp-production-rg \
--name atp-prod-eus-aks \
--tags \
Tenant=tenant-a \
CostCenter=sales \
Environment=production
Cost Reporting per Tenant¶
Cost Reporting Script:
#!/bin/bash
# scripts/tenant-cost-report.sh
TENANT_ID="${1}"
START_DATE="${2:-$(date -d '30 days ago' +%Y-%m-%d)}"
END_DATE="${3:-$(date +%Y-%m-%d)}"
echo "💰 Cost Report for Tenant: ${TENANT_ID}"
echo " Period: ${START_DATE} to ${END_DATE}"
echo ""
# Query Azure Cost Management API
az consumption usage list \
--start-date "${START_DATE}" \
--end-date "${END_DATE}" \
--query "[?tags.Tenant=='${TENANT_ID}'].{Instance:instanceName, Cost:pretaxCost}" \
--output table
# Calculate total cost
TOTAL_COST=$(az consumption usage list \
--start-date "${START_DATE}" \
--end-date "${END_DATE}" \
--query "[?tags.Tenant=='${TENANT_ID}'].pretaxCost" \
--output tsv | \
awk '{sum+=$1} END {print sum}')
echo ""
echo "Total Cost: \$${TOTAL_COST}"
Chargeback/Showback Models¶
Chargeback Model Configuration:
# tenants/tenant-a/config/cost-allocation.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: tenant-a-cost-allocation
namespace: atp-tenant-a
data:
chargeback-model: "showback" # or "chargeback"
billing-frequency: "monthly"
cost-center: "sales"
business-unit: "enterprise"
billing-contact: "finance@example.com"
cost-breakdown.yaml: |
resources:
compute:
cpu-requests: 0.05 # $0.05 per CPU-hour
memory-requests: 0.01 # $0.01 per GiB-hour
storage:
persistent-volumes: 0.10 # $0.10 per GiB-month
network:
egress: 0.09 # $0.09 per GB
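A minimal showback sketch built on the unit rates above (the rates are the hypothetical values from this ConfigMap; only CPU requests are priced here):
Showback Estimate Script:
#!/bin/bash
# scripts/showback-estimate.sh — rough monthly compute showback for one tenant,
# pricing CPU requests at the cpu-requests rate ($0.05 per CPU-hour)
NAMESPACE="${1:-atp-tenant-a}"
CPU_RATE_PER_HOUR="0.05"
HOURS_PER_MONTH="730"
# Sum CPU requests in cores ("500m" -> 0.5, "2" -> 2)
TOTAL_CPU=$(kubectl get pods -n "${NAMESPACE}" -o json | \
  jq '[.items[].spec.containers[].resources.requests.cpu // "0"
       | if endswith("m") then (.[:-1] | tonumber) / 1000 else tonumber end]
      | add // 0')
echo "Namespace: ${NAMESPACE}"
echo "CPU requests: ${TOTAL_CPU} cores"
echo "Estimated monthly compute: \$$(echo "${TOTAL_CPU} * ${CPU_RATE_PER_HOUR} * ${HOURS_PER_MONTH}" | bc)"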
Cost Allocation Diagram:
graph TB
subgraph "Cluster Costs"
CLUSTER[AKS Cluster<br/>$10,000/month]
end
subgraph "Tenant Costs"
TENANT_A[Tenant A<br/>$2,000/month<br/>20%]
TENANT_B[Tenant B<br/>$5,000/month<br/>50%]
TENANT_C[Tenant C<br/>$3,000/month<br/>30%]
end
CLUSTER --> TENANT_A
CLUSTER --> TENANT_B
CLUSTER --> TENANT_C
style TENANT_A fill:#90EE90
style TENANT_B fill:#FFE5B4
style TENANT_C fill:#87CEEB
Compliance Per Tenant¶
SOC 2, GDPR, HIPAA Configurations¶
Compliance Configuration per Tenant:
# tenants/tenant-a/resources/compliance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: tenant-a-compliance
namespace: atp-tenant-a
data:
compliance-type: "soc2"
audit-logging: "enabled"
data-retention-years: "7"
encryption-at-rest: "required"
encryption-in-transit: "required"
soc2-controls.yaml: |
controls:
- id: "CC6.1"
name: "Logical and Physical Access Controls"
status: "implemented"
- id: "CC7.2"
name: "System Operations"
status: "implemented"
GDPR Tenant Configuration:
# tenants/tenant-eu/resources/compliance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: tenant-eu-compliance
namespace: atp-tenant-eu
data:
compliance-type: "gdpr"
data-residency: "eu-west"
right-to-be-forgotten: "enabled"
data-export: "enabled"
audit-logging: "enabled"
data-retention-years: "7"
gdpr-requirements.yaml: |
requirements:
- article: "17"
name: "Right to Erasure"
implementation: "automated-deletion"
- article: "20"
name: "Data Portability"
implementation: "data-export-api"
- article: "30"
name: "Records of Processing"
implementation: "audit-logs"
HIPAA Tenant Configuration:
# tenants/tenant-hipaa/resources/compliance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: tenant-hipaa-compliance
namespace: atp-tenant-hipaa
data:
compliance-type: "hipaa"
data-classification: "phi"
encryption-at-rest: "required"
encryption-in-transit: "required"
audit-logging: "required"
access-control: "required"
data-retention-years: "6"
hipaa-requirements.yaml: |
requirements:
- section: "164.312(a)(1)"
name: "Access Control"
implementation: "rbac"
- section: "164.312(e)(1)"
name: "Transmission Security"
implementation: "tls-encryption"
- section: "164.312(c)(1)"
name: "Integrity"
implementation: "audit-logs"
Tenant-Specific Audit Logs¶
Audit Log Configuration:
# tenants/tenant-a/resources/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
metadata:
name: tenant-a-audit
rules:
# Audit all API requests in the tenant namespace
# (audit policy rules do not support "*" wildcards; omitting verbs and
# resources matches every verb and resource)
- level: Metadata
  namespaces: ["atp-tenant-a"]
# Audit secret access
- level: RequestResponse
namespaces: ["atp-tenant-a"]
verbs: ["get", "list", "watch", "create", "update", "patch"]
resources:
- group: ""
resources: ["secrets"]
Audit Log Query for Tenant:
// Log Analytics: Query tenant-specific audit logs
AuditLogs
| where Namespace == "atp-tenant-a"
| where TimeGenerated >= ago(7d)
| summarize
EventCount = count(),
UniqueUsers = dcount(UserIdentity),
UniqueResources = dcount(ResourceName)
by Tenant = Namespace, bin(TimeGenerated, 1d)
| render timechart
Data Residency Enforcement¶
Data Residency Policy:
# tenants/tenant-eu/resources/data-residency-policy.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: tenant-eu-data-residency
namespace: atp-tenant-eu
data:
policy.yaml: |
data-residency:
region: "westeurope"
allowed-regions:
- "westeurope"
- "northeurope"
prohibited-regions:
- "eastus"
- "westus"
enforcement:
database: "required"
storage: "required"
backups: "required"
logs: "required"
Enforce Data Residency with Azure Policy:
# Azure Policy: Enforce EU data residency
# Azure Policy definitions are JSON documents managed through ARM/the Azure CLI,
# not Kubernetes resources. Rule sketch (policyRules.json):
# displayName: "Enforce EU Data Residency for Tenant EU"
# description: "Ensure all resources for tenant-eu are deployed in EU regions"
{
  "if": {
    "allOf": [
      {
        "field": "tags['tenant']",
        "equals": "tenant-eu"
      },
      {
        "not": {
          "field": "location",
          "in": ["westeurope", "northeurope"]
        }
      }
    ]
  },
  "then": {
    "effect": "deny"
  }
}
Retention Policies¶
Retention Policy Configuration:
# tenants/tenant-a/config/retention-policy.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: tenant-a-retention-policy
namespace: atp-tenant-a
data:
retention-policy.yaml: |
policies:
audit-logs:
retention-days: 2555 # 7 years
archive-after-days: 365
archive-location: "az://atp-audit-archive/tenant-a"
compliance-data:
retention-days: 3650 # 10 years
archive-after-days: 730
archive-location: "az://atp-compliance-archive/tenant-a"
operational-logs:
retention-days: 90
archive-after-days: 30
archive-location: "az://atp-ops-archive/tenant-a"
deletion:
automated: true
grace-period-days: 30
Retention Policy by Compliance Type:
| Compliance Type | Retention Period | Rationale |
|---|---|---|
| SOC 2 | 7 years | SOC 2 requirement |
| GDPR | 7 years | Legal/regulatory requirement |
| HIPAA | 6 years | HIPAA requirement |
| General | 1 year | Standard retention |
Summary: Multi-Tenancy in GitOps¶
- Tenant Isolation Strategies: Namespace per tenant (ATP approach), cluster per tenant (not used), shared namespace with labels (not recommended)
- Tenant-Specific Configurations: Resource limits per tenant (tier-based), data residency requirements (EU vs US), compliance controls (GDPR, HIPAA, SOC 2), custom ingestion rules
- GitOps Structure for Tenants: /tenants/{tenant-id}/ directory structure, tenant namespace manifest, tenant resource quota, tenant network policy, tenant RBAC
- Dynamic Tenant Provisioning: Onboarding script, automated manifest generation, Git commit for new tenant, FluxCD applies tenant resources
- Tenant Onboarding Automation: Step-by-step onboarding process, complete automation script, verification script
- Tenant-Specific Helm Values: values-tenant-{id}.yaml, override replicas/resources/endpoints, tenant-specific feature flags
- Multi-Tenancy and FluxCD: Per-tenant Kustomization, tenant isolation in reconciliation, failure isolation (one tenant doesn't affect others)
- Tenant Offboarding: Data deletion procedures, namespace cleanup, Git commit to remove tenant, compliance with GDPR (right to be forgotten)
- Tenant Cost Allocation: Namespace-level resource tagging, cost reporting per tenant, chargeback/showback models
- Compliance Per Tenant: SOC 2/GDPR/HIPAA configurations, tenant-specific audit logs, data residency enforcement, retention policies
Cost Optimization in GitOps¶
Purpose: Define cost optimization strategies, resource right-sizing, autoscaling configurations, automated shutdown procedures, Azure Cost Management integration, and FinOps practices for ATP's GitOps deployments, ensuring optimal resource utilization, cost efficiency, and cost transparency across all environments while maintaining performance and reliability requirements.
AKS Cost Optimization¶
Node Pool Sizing (Right-Sized VMs)¶
Node Pool Sizing Strategy:
| Environment | Node Pool Type | VM SKU | Node Count (Min/Max) | Monthly Cost (Est.) | Use Case |
|---|---|---|---|---|---|
| Production | System | Standard_D4s_v3 | 3/10 | $1,500 | System pods, monitoring |
| Production | User | Standard_D8s_v3 | 5/20 | $6,000 | Application workloads |
| Staging | User | Standard_D4s_v3 | 2/8 | $1,200 | Staging workloads |
| Test | User | Standard_D2s_v3 | 1/4 | $300 | Test workloads |
| Dev | User | Standard_D2s_v3 | 1/4 | $300 | Development workloads |
Pulumi C# Node Pool Configuration:
// infrastructure/NodePools.cs
using Pulumi;
using Pulumi.AzureNative.ContainerService;
using Pulumi.AzureNative.ContainerService.Inputs;
public class AKSNodePools
{
public static ManagedClusterAgentPoolProfileArgs CreateProductionSystemPool()
{
return new ManagedClusterAgentPoolProfileArgs
{
Name = "systempool",
Count = 3,
VmSize = "Standard_D4s_v3", // 4 vCPUs, 16 GiB
OsType = "Linux",
OsDiskSizeGB = 128,
Mode = AgentPoolMode.System,
EnableAutoScaling = true,
MinCount = 3,
MaxCount = 10,
MaxPods = 30,
EnableNodePublicIP = false,
ScaleSetPriority = ScaleSetPriority.Regular,
ScaleSetEvictionPolicy = ScaleSetEvictionPolicy.Delete,
Tags = new InputMap<string>
{
{ "Environment", "production" },
{ "NodePoolType", "system" },
{ "CostCenter", "infrastructure" }
}
};
}
public static ManagedClusterAgentPoolProfileArgs CreateProductionUserPool()
{
return new ManagedClusterAgentPoolProfileArgs
{
Name = "userpool",
Count = 5,
VmSize = "Standard_D8s_v3", // 8 vCPUs, 32 GiB
OsType = "Linux",
OsDiskSizeGB = 256,
Mode = AgentPoolMode.User,
EnableAutoScaling = true,
MinCount = 5,
MaxCount = 20,
MaxPods = 50,
EnableNodePublicIP = false,
Tags = new InputMap<string>
{
{ "Environment", "production" },
{ "NodePoolType", "user" },
{ "CostCenter", "applications" }
}
};
}
public static ManagedClusterAgentPoolProfileArgs CreateDevSpotPool()
{
return new ManagedClusterAgentPoolProfileArgs
{
Name = "spotpool",
Count = 1,
VmSize = "Standard_D2s_v3", // 2 vCPUs, 8 GiB
OsType = "Linux",
OsDiskSizeGB = 64,
Mode = AgentPoolMode.User,
EnableAutoScaling = true,
MinCount = 0, // Scale to zero
MaxCount = 4,
MaxPods = 30,
EnableNodePublicIP = false,
ScaleSetPriority = ScaleSetPriority.Spot,
ScaleSetEvictionPolicy = ScaleSetEvictionPolicy.Delete,
SpotMaxPrice = 0.05, // Max $0.05 per hour (80% discount)
Tags = new InputMap<string>
{
{ "Environment", "development" },
{ "NodePoolType", "spot" },
{ "CostCenter", "development" }
}
};
}
}
Spot Instances for Dev/Test¶
Spot Instance Configuration:
# clusters/production/node-pools/spot-pool.yaml
# Sketch in the style of Azure Service Operator (ASO); ASO's actual kind is
# ManagedClustersAgentPool with a versioned apiVersion (e.g. v1api20231001) —
# adjust to the CRDs installed in your cluster
apiVersion: containerservice.azure.com/v1
kind: ManagedClusterAgentPoolProfile
metadata:
name: spotpool
spec:
name: spotpool
count: 1
vmSize: Standard_D2s_v3
osType: Linux
osDiskSizeGB: 64
mode: User
enableAutoScaling: true
minCount: 0
maxCount: 4
scaleSetPriority: Spot
scaleSetEvictionPolicy: Delete
spotMaxPrice: 0.05
nodeLabels:
workload: non-production
cost-optimized: "true"
nodeTaints:
- key: kubernetes.azure.com/scalesetpriority
value: spot
effect: NoSchedule
Pod Tolerations for Spot Nodes:
# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
template:
spec:
tolerations:
- key: kubernetes.azure.com/scalesetpriority
operator: Equal
value: spot
effect: NoSchedule
nodeSelector:
workload: non-production
containers:
- name: atp-ingestion
# ...
Spot Instance Cost Savings:
| VM SKU | Regular Price | Spot Price (80% discount) | Monthly Savings |
|---|---|---|---|
| Standard_D2s_v3 | $0.096/hour | $0.019/hour | ~$55/month |
| Standard_D4s_v3 | $0.192/hour | $0.038/hour | ~$111/month |
| Standard_D8s_v3 | $0.384/hour | $0.077/hour | ~$221/month |
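The same spot pool can be added imperatively with the Azure CLI; a sketch mirroring the manifest above (AKS taints spot pools automatically, and scale-to-zero on user pools is supported with the cluster autoscaler):
Add Spot Node Pool via CLI:
az aks nodepool add \
  --resource-group atp-production-rg \
  --cluster-name atp-prod-eus-aks \
  --name spotpool \
  --mode User \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price 0.05 \
  --node-vm-size Standard_D2s_v3 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 4 \
  --labels workload=non-production cost-optimized=true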
Reserved Instances for Production¶
Reserved Instance Configuration:
#!/bin/bash
# scripts/purchase-reserved-instances.sh
# Reserved Instances are purchased through the Azure Reservations API ("az
# reservations" CLI extension) or the Azure Portal — there is no "az vm
# reservation" command. Sketch below; verify flags against your extension version.
az extension add --name reservations 2>/dev/null || true
# Price a 1-year reservation of 10x Standard_D8s_v3 before committing
az reservations reservation-order calculate \
  --applied-scope-type Shared \
  --billing-scope-id "/subscriptions/${SUBSCRIPTION_ID}" \
  --display-name "ATP Production AKS Nodes - D8s_v3" \
  --reserved-resource-type VirtualMachines \
  --sku Standard_D8s_v3 \
  --location eastus \
  --term P1Y \
  --billing-plan Upfront \
  --quantity 10
echo "✅ Review the calculated price, then purchase with 'az reservations reservation-order purchase' or the Azure Portal (up to 72% discount)"
Reserved Instance Cost Savings:
| Commitment | Discount | Monthly Cost (10x D8s_v3) | Savings vs Pay-as-you-go |
|---|---|---|---|
| 1 Year | ~42% | $2,227 | $1,653/month |
| 3 Years | ~72% | $1,282 | $2,598/month |
ATP Recommendation: Use 1-year reserved instances for production user pool nodes.
Cluster Autoscaler Configuration¶
Cluster Autoscaler Setup:
# clusters/production/kustomizations/cluster-autoscaler.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: cluster-autoscaler
namespace: flux-system
spec:
interval: 5m
path: ./platform/cluster-autoscaler
sourceRef:
kind: GitRepository
name: atp-gitops-production
Cluster Autoscaler Deployment:
# platform/cluster-autoscaler/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
spec:
replicas: 1
selector:
matchLabels:
app: cluster-autoscaler
template:
metadata:
labels:
app: cluster-autoscaler
spec:
serviceAccountName: cluster-autoscaler
containers:
- image: mcr.microsoft.com/oss/kubernetes/autoscaler/cluster-autoscaler:v1.27.3
name: cluster-autoscaler
resources:
limits:
cpu: 100m
memory: 600Mi
requests:
cpu: 100m
memory: 600Mi
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=azure
- --skip-nodes-with-local-storage=false
- --expander=least-waste # Prefer nodes that waste least resources
- --node-group-auto-discovery=label:cluster-autoscaler-enabled=true
- --balance-similar-node-groups
- --scale-down-enabled=true
- --scale-down-delay-after-add=10m
- --scale-down-unneeded-time=10m
- --scale-down-utilization-threshold=0.5
- --max-node-provision-time=15m
env:
- name: ARM_SUBSCRIPTION_ID
valueFrom:
secretKeyRef:
name: cluster-autoscaler-secrets
key: subscription-id
- name: ARM_RESOURCE_GROUP
value: atp-production-rg
- name: ARM_TENANT_ID
valueFrom:
secretKeyRef:
name: cluster-autoscaler-secrets
key: tenant-id
- name: ARM_CLIENT_ID
valueFrom:
secretKeyRef:
name: cluster-autoscaler-secrets
key: client-id
- name: ARM_CLIENT_SECRET
valueFrom:
secretKeyRef:
name: cluster-autoscaler-secrets
key: client-secret
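Note that on AKS the cluster autoscaler usually runs as a managed component rather than a self-hosted Deployment; the same cost-related knobs are then set through the cluster autoscaler profile (a sketch using the values above):
Managed Autoscaler Configuration:
# Enable autoscaling on the user pool and tune the managed autoscaler profile
az aks nodepool update \
  --resource-group atp-production-rg \
  --cluster-name atp-prod-eus-aks \
  --name userpool \
  --enable-cluster-autoscaler \
  --min-count 5 \
  --max-count 20
az aks update \
  --resource-group atp-production-rg \
  --name atp-prod-eus-aks \
  --cluster-autoscaler-profile \
    expander=least-waste \
    scale-down-delay-after-add=10m \
    scale-down-unneeded-time=10m \
    scale-down-utilization-threshold=0.5 \
    balance-similar-node-groups=true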
Cluster Autoscaler Cost Optimization Settings:
| Setting | Value | Rationale |
|---|---|---|
| expander | least-waste | Prefer node pools that waste least resources |
| scale-down-delay-after-add | 10m | Wait before scaling down after adding nodes |
| scale-down-unneeded-time | 10m | Node must be unneeded for 10m before removal |
| scale-down-utilization-threshold | 0.5 | Scale down if node utilization < 50% |
| balance-similar-node-groups | true | Balance pods across similar node groups |
Resource Right-Sizing¶
Analyzing Actual Resource Usage¶
Resource Usage Analysis Script:
#!/bin/bash
# scripts/analyze-resource-usage.sh
NAMESPACE="${1:-all}"
echo "📊 Resource Usage Analysis"
echo "=========================="
if [ "${NAMESPACE}" = "all" ]; then
echo "Analyzing all namespaces..."
kubectl top pods --all-namespaces --containers | \
awk 'NR>1 {cpu+=$3; memory+=$4} END {print "Total CPU: " cpu "m"; print "Total Memory: " memory "Mi"}'
else
echo "Analyzing namespace: ${NAMESPACE}"
kubectl top pods -n "${NAMESPACE}" --containers | \
awk 'NR>1 {cpu+=$3; memory+=$4} END {print "Total CPU: " cpu "m"; print "Total Memory: " memory "Mi"}'
fi
echo ""
echo "Resource Requests vs Limits:"
kubectl get pods -n "${NAMESPACE}" -o json | \
jq -r '.items[] | "\(.metadata.name): CPU req=\(.spec.containers[0].resources.requests.cpu // "none") limit=\(.spec.containers[0].resources.limits.cpu // "none"), Memory req=\(.spec.containers[0].resources.requests.memory // "none") limit=\(.spec.containers[0].resources.limits.memory // "none")"'
Azure Monitor Metrics Query:
// Log Analytics: Pod resource usage analysis (Container insights Perf table)
Perf
| where ObjectName == "K8SContainer"
| where CounterName in ("cpuUsageNanoCores", "memoryWorkingSetBytes")
| where TimeGenerated >= ago(7d)
// InstanceName identifies the container: <cluster-id>/<pod-uid>/<container-name>
| summarize AvgValue = avg(CounterValue) by CounterName, InstanceName
| extend
    CpuUsageCores = iff(CounterName == "cpuUsageNanoCores", AvgValue / 1e9, 0.0),
    MemoryUsageMiB = iff(CounterName == "memoryWorkingSetBytes", AvgValue / 1024 / 1024, 0.0)
| summarize
    AvgCpuCores = max(CpuUsageCores),
    AvgMemoryMiB = max(MemoryUsageMiB)
  by InstanceName
| order by AvgCpuCores desc
Adjusting Requests and Limits¶
Resource Right-Sizing Workflow:
graph LR
COLLECT[Collect Metrics<br/>7 days]
ANALYZE[Analyze Usage<br/>P95/P99]
RECOMMEND[Generate<br/>Recommendations]
UPDATE[Update Manifests<br/>in Git]
DEPLOY[Deploy via<br/>GitOps]
MONITOR[Monitor<br/>Performance]
COLLECT --> ANALYZE
ANALYZE --> RECOMMEND
RECOMMEND --> UPDATE
UPDATE --> DEPLOY
DEPLOY --> MONITOR
MONITOR --> COLLECT
style COLLECT fill:#FFE5B4
style DEPLOY fill:#90EE90
Resource Right-Sizing Recommendations Script:
#!/bin/bash
# scripts/generate-right-sizing-recommendations.sh
NAMESPACE="${1}"
OUTPUT_FILE="${2:-right-sizing-recommendations.yaml}"
echo "📊 Generating right-sizing recommendations for: ${NAMESPACE}"
# Reference PromQL queries (7-day window); this simplified script only records
# them — a full implementation would execute them against Prometheus
PROMETHEUS_URL="http://prometheus-kube-prometheus-prometheus.monitoring:9090"
cat > /tmp/prometheus-queries.txt <<EOF
# Average CPU usage (7 days)
avg_over_time(rate(container_cpu_usage_seconds_total{namespace="${NAMESPACE}"}[5m])[7d:1h])
# Average memory usage (7 days)
avg_over_time(container_memory_working_set_bytes{namespace="${NAMESPACE}"}[7d:1h])
# P95 CPU usage
quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{namespace="${NAMESPACE}"}[5m])[7d:1h])
# P95 Memory usage
quantile_over_time(0.95, container_memory_working_set_bytes{namespace="${NAMESPACE}"}[7d:1h])
EOF
# Generate recommendations (simplified)
cat > "${OUTPUT_FILE}" <<EOF
# Right-Sizing Recommendations for ${NAMESPACE}
# Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)
# Based on: 7-day average usage
recommendations:
- deployment: atp-ingestion
namespace: ${NAMESPACE}
containers:
- name: atp-ingestion
resources:
requests:
cpu: "500m" # Based on P95 usage: 400m
memory: "1Gi" # Based on P95 usage: 800Mi
limits:
cpu: "2000m" # 4x requests (burst capacity)
memory: "2Gi" # 2x requests
EOF
echo "✅ Recommendations generated: ${OUTPUT_FILE}"
Vertical Pod Autoscaler (VPA)¶
VPA Installation:
# Install VPA from the kubernetes/autoscaler repo (no single-file release
# manifest is published; hack/vpa-up.sh deploys all VPA components)
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
VPA Configuration:
# apps/atp-ingestion/base/vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: atp-ingestion-vpa
namespace: atp-production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
updatePolicy:
updateMode: "Auto" # or "Recreate" or "Off"
resourcePolicy:
containerPolicies:
- containerName: atp-ingestion
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 4000m
memory: 8Gi
controlledResources: ["cpu", "memory"]
controlledValues: RequestsAndLimits
VPA Modes:
| Mode | Description | Use Case |
|---|---|---|
| Auto | Automatically apply recommendations (currently by evicting pods) | Dev/Test (with restart) |
| Initial | Apply recommendations only at pod creation | Staging |
| Recreate | Apply recommendations by evicting and recreating pods | Staging |
| Off | Only generate recommendations | Production (manual review) |
ATP Recommendation: Use Off mode in production to generate recommendations, then manually review and apply via GitOps.
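In Off mode the recommendations live in the VPA status and are promoted via a Git commit; for example (output shape is illustrative):
Read VPA Recommendations:
# The target field holds the recommended requests, e.g. {"cpu":"487m","memory":"912Mi"}
kubectl get vpa atp-ingestion-vpa -n atp-production \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'
# Copy the values into the base manifest (or a Kustomize patch) and let FluxCD roll them out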
Recommendations from Azure Advisor¶
Azure Advisor Cost Recommendations:
#!/bin/bash
# scripts/get-azure-advisor-cost-recommendations.sh
echo "💰 Azure Advisor Cost Recommendations"
echo "======================================"
# Get cost recommendations
az advisor recommendation list \
--category Cost \
--output table
# Get high-impact cost recommendations for the production resource group
# (az advisor recommendation list has no --filter flag; use --resource-group)
az advisor recommendation list \
  --category Cost \
  --resource-group atp-production-rg \
  --query "[?impact=='High'].{Name:shortDescription.problem, Impact:impact, PotentialSavings:extendedProperties.potentialSavings}" \
  --output table
echo ""
echo "📊 Right-sizing recommendations:"
# NOTE: the recommendationTypeId below is a placeholder GUID — look up the
# actual ID for the right-sizing recommendation type in your tenant
az advisor recommendation list \
  --category Cost \
  --resource-group atp-production-rg \
  --query "[?recommendationTypeId=='b0b0a0a0-0a0a-0a0a-0a0a-0a0a0a0a0a0a'].{CurrentSKU:extendedProperties.currentSku, RecommendedSKU:extendedProperties.recommendedSku, EstimatedSavings:extendedProperties.estimatedMonthlySavings}" \
  --output table
Horizontal Pod Autoscaler (HPA)¶
CPU-Based Scaling¶
CPU-Based HPA Configuration:
# apps/atp-ingestion/base/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: atp-ingestion-hpa
namespace: atp-production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70 # Scale when CPU > 70%
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # 5 min stabilization
policies:
- type: Percent
value: 50 # Scale down by 50%
periodSeconds: 60
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Min # Use most conservative policy
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately
policies:
- type: Percent
value: 100 # Double pods
periodSeconds: 30
- type: Pods
value: 4
periodSeconds: 30
selectPolicy: Max # Use most aggressive policy
Cost-Optimized HPA Settings:
| Setting | Value | Rationale |
|---|---|---|
| averageUtilization | 70% | Allow higher CPU before scaling (cost efficiency) |
| scaleDown.stabilizationWindowSeconds | 300s | Prevent rapid scale-down (cost savings) |
| scaleDown.selectPolicy | Min | Use conservative scale-down (cost savings) |
| scaleUp.selectPolicy | Max | Aggressive scale-up (performance) |
Memory-Based Scaling¶
Memory-Based HPA:
# apps/atp-ingestion/base/hpa-memory.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: atp-ingestion-hpa-memory
namespace: atp-production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: atp-ingestion
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80 # Scale when memory > 80%
behavior:
scaleDown:
stabilizationWindowSeconds: 600 # 10 min (memory is sticky)
policies:
- type: Percent
value: 25 # Conservative scale-down
periodSeconds: 120
Custom Metrics with KEDA¶
KEDA ScaledObject for Cost Optimization:
# apps/atp-ingestion/base/keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: atp-ingestion-keda
namespace: atp-production
spec:
scaleTargetRef:
name: atp-ingestion
minReplicaCount: 3
maxReplicaCount: 20
cooldownPeriod: 300 # 5 min cooldown
idleReplicaCount: 0 # Scale to zero when idle (dev only)
triggers:
# CPU-based scaling
- type: cpu
metadata:
type: Utilization
value: "70"
# Memory-based scaling
- type: memory
metadata:
type: Utilization
value: "80"
# Custom metric: Queue length
- type: azure-servicebus
metadata:
queueName: atp-ingestion-queue
messageCount: "100" # Scale when > 100 messages
connectionFromEnv: SERVICEBUS_CONNECTION_STRING
# HTTP request rate
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
metricName: http_requests_per_second
threshold: "1000"
query: sum(rate(http_requests_total{service="atp-ingestion"}[1m]))
KEDA Cost Optimization Settings:
| Setting | Value | Use Case |
|---|---|---|
| cooldownPeriod | 300s | Prevent rapid scaling (cost savings) |
| idleReplicaCount | 0 | Dev environments (scale to zero) |
| minReplicaCount | 3 | Production (always available) |
Scaling Policies for Cost Efficiency¶
Cost-Efficient Scaling Strategy:
graph TB
METRICS[Pod Metrics<br/>CPU/Memory]
HPA[Horizontal Pod Autoscaler]
DECISION{Scale Decision}
SCALE_UP[Scale Up<br/>Aggressive]
SCALE_DOWN[Scale Down<br/>Conservative]
METRICS --> HPA
HPA --> DECISION
DECISION -->|High Load| SCALE_UP
DECISION -->|Low Load| SCALE_DOWN
SCALE_UP --> PERFORMANCE[Performance Priority]
SCALE_DOWN --> COST[Cost Priority]
style SCALE_UP fill:#90EE90
style SCALE_DOWN fill:#FFE5B4
style PERFORMANCE fill:#90EE90
style COST fill:#FFB6C1
Environment-Specific Scaling Policies:
| Environment | Min Replicas | Max Replicas | Scale-Down Delay | Rationale |
|---|---|---|---|---|
| Production | 3 | 50 | 10 min | Performance > Cost |
| Staging | 2 | 20 | 5 min | Balanced |
| Test | 1 | 10 | 3 min | Cost > Performance |
| Dev | 0 | 5 | 1 min | Cost optimization |
Cluster Autoscaler¶
Adding Nodes Based on Demand¶
Cluster Autoscaler Behavior:
sequenceDiagram
participant Pod as Pod (Pending)
participant CA as Cluster Autoscaler
participant AKS as AKS Node Pool
participant VM as New VM Node
Pod->>CA: Pod cannot be scheduled
CA->>CA: Check node pool capacity
CA->>AKS: Scale up node pool
AKS->>VM: Provision new VM
VM->>Pod: Pod scheduled on new node
Pod->>CA: Pod running
Cluster Autoscaler Configuration:
# platform/cluster-autoscaler/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-autoscaler-config
namespace: kube-system
data:
config.yaml: |
nodeGroups:
- name: userpool
minSize: 5
maxSize: 20
scaleDownDelayAfterAdd: 10m
scaleDownUnneededTime: 10m
scaleDownUtilizationThreshold: 0.5
scaleDownEnabled: true
maxNodeProvisionTime: 15m
balanceSimilarNodeGroups: true
expander: least-waste
Removing Idle Nodes¶
Scale-Down Conditions:
| Condition | Value | Rationale |
|---|---|---|
| scaleDownDelayAfterAdd | 10m | Wait before removing newly added nodes |
| scaleDownUnneededTime | 10m | Node must be unneeded for 10 minutes |
| scaleDownUtilizationThreshold | 0.5 | Node utilization < 50% before removal |
| maxEmptyBulkDelete | 10 | Remove up to 10 idle nodes at once |
Pod Disruption Budget Protection:
# apps/atp-ingestion/base/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: atp-ingestion-pdb
namespace: atp-production
spec:
minAvailable: 2 # Always keep 2 pods running
selector:
matchLabels:
app: atp-ingestion
Scale-Down Delays and Thresholds¶
Cost-Optimized Scale-Down Configuration:
# Cluster Autoscaler: Aggressive scale-down (cost savings)
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-autoscaler-cost-optimized
namespace: kube-system
data:
config.yaml: |
scaleDownDelayAfterAdd: 5m # Reduced from 10m
scaleDownUnneededTime: 5m # Reduced from 10m
scaleDownUtilizationThreshold: 0.4 # More aggressive (40%)
scaleDownGpuUtilizationThreshold: 0.4
maxEmptyBulkDelete: 20 # Remove more nodes at once
scaleDownEnabled: true
Node Affinity and Taints¶
Node Affinity for Cost Optimization:
# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
spec:
template:
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
# Prefer spot nodes (cost savings)
- weight: 100
preference:
matchExpressions:
- key: kubernetes.azure.com/scalesetpriority
operator: In
values:
- spot
# Prefer smaller nodes (cost efficiency)
- weight: 50
preference:
matchExpressions:
- key: node.kubernetes.io/instance-type
operator: In
values:
- Standard_D2s_v3
- Standard_D4s_v3
tolerations:
# Allow scheduling on spot nodes
- key: kubernetes.azure.com/scalesetpriority
operator: Equal
value: spot
effect: NoSchedule
Development Environment Auto-Shutdown¶
Schedule-Based Scaling to Zero¶
CronJob for Auto-Shutdown:
# platform/auto-shutdown/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: dev-auto-shutdown
namespace: kube-system
spec:
schedule: "0 20 * * 1-5" # 8 PM Monday-Friday
timeZone: "America/New_York"
jobTemplate:
spec:
template:
spec:
serviceAccountName: dev-shutdown-sa
containers:
- name: kubectl
image: bitnami/kubectl:latest
command:
- /bin/sh
- -c
- |
# Scale down dev namespaces
for ns in atp-dev atp-test; do
for deployment in $(kubectl get deployments -n $ns -o name); do
kubectl scale $deployment -n $ns --replicas=0
done
done
echo "✅ Dev environments scaled down at $(date)"
restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: dev-auto-startup
namespace: kube-system
spec:
schedule: "0 8 * * 1-5" # 8 AM Monday-Friday
timeZone: "America/New_York"
jobTemplate:
spec:
template:
spec:
serviceAccountName: dev-startup-sa
containers:
- name: kubectl
image: bitnami/kubectl:latest
command:
- /bin/sh
- -c
- |
# Scale up dev namespaces
for ns in atp-dev atp-test; do
for deployment in $(kubectl get deployments -n $ns -o name); do
kubectl scale $deployment -n $ns --replicas=1
done
done
echo "✅ Dev environments scaled up at $(date)"
restartPolicy: OnFailure
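Both CronJobs assume service accounts that are allowed to scale deployments in the dev namespaces; a minimal sketch of that RBAC (names match the manifests above):
Service Account Permissions:
# Grant the shutdown/startup service accounts permission to scale deployments
kubectl create clusterrole deployment-scaler \
  --verb=get,list,patch \
  --resource=deployments,deployments/scale
for SA in dev-shutdown-sa dev-startup-sa; do
  kubectl -n kube-system create serviceaccount "${SA}"
  kubectl create clusterrolebinding "${SA}-scaler" \
    --clusterrole=deployment-scaler \
    --serviceaccount="kube-system:${SA}"
done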
Scaling Down Replicas at Night/Weekends¶
Auto-Shutdown Script:
#!/bin/bash
# scripts/auto-shutdown-dev-environments.sh
NAMESPACES=("atp-dev" "atp-test")
SHUTDOWN_TIME="20:00" # 8 PM
STARTUP_TIME="08:00" # 8 AM
CURRENT_HOUR=$(date +%H)
CURRENT_DAY=$(date +%u) # 1=Monday, 7=Sunday
# Check if it's weekend
if [ "${CURRENT_DAY}" -eq 6 ] || [ "${CURRENT_DAY}" -eq 7 ]; then
echo "📴 Weekend: Scaling down all dev environments..."
for NS in "${NAMESPACES[@]}"; do
kubectl get deployments -n "${NS}" -o name | \
xargs -I {} kubectl scale {} -n "${NS}" --replicas=0
done
exit 0
fi
# Check if it's shutdown time (8 PM - 8 AM)
if [ "${CURRENT_HOUR}" -ge 20 ] || [ "${CURRENT_HOUR}" -lt 8 ]; then
echo "🌙 Night time: Scaling down dev environments..."
for NS in "${NAMESPACES[@]}"; do
kubectl get deployments -n "${NS}" -o name | \
xargs -I {} kubectl scale {} -n "${NS}" --replicas=0
done
else
echo "☀️ Day time: Ensuring dev environments are running..."
for NS in "${NAMESPACES[@]}"; do
kubectl get deployments -n "${NS}" -o name | \
xargs -I {} kubectl scale {} -n "${NS}" --replicas=1
done
fi
Wake-Up Procedures¶
Wake-Up Script:
#!/bin/bash
# scripts/wake-up-dev-environments.sh
NAMESPACES=("atp-dev" "atp-test")
echo "☀️ Waking up dev environments..."
for NS in "${NAMESPACES[@]}"; do
echo " - Scaling up namespace: ${NS}"
# Scale up deployments
kubectl get deployments -n "${NS}" -o name | \
xargs -I {} kubectl scale {} -n "${NS}" --replicas=1
# Wait for pods to be ready
echo " - Waiting for pods to be ready..."
kubectl wait --for=condition=available --timeout=5m \
deployment --all -n "${NS}"
done
echo "✅ Dev environments are ready"
Cost Savings Calculation¶
Auto-Shutdown Cost Savings:
| Environment | Daily Hours | Weekly Hours | Monthly Cost (Running) | Monthly Cost (Shutdown) | Savings |
|---|---|---|---|---|---|
| Dev | 12 hours | 60 hours | $300 | $120 | $180/month (60%) |
| Test | 12 hours | 60 hours | $300 | $120 | $180/month (60%) |
| Total | - | - | $600 | $240 | $360/month |
Cost Savings Formula:
Monthly Savings = (168 − Weekly running hours) / 168 × Monthly Cost
Monthly Savings = (168 − 60) / 168 × $300 ≈ $193/month per environment
(The table above rounds this down to a conservative ~60%, i.e. $180/month.)
Azure Cost Management Integration¶
Cost Tracking per Environment¶
Cost Tracking Dashboard Query:
// Log Analytics: Cost tracking per environment
// (assumes Cost Management exports are ingested as a custom "AzureCost" table)
AzureCost
| where TimeGenerated >= ago(30d)
| where Tags contains "Environment"
| extend Environment = tostring(Tags.Environment)
| extend Service = tostring(Tags.Service)
| summarize
TotalCost = sum(Cost),
AvgDailyCost = avg(Cost)
by Environment, bin(TimeGenerated, 1d)
| render timechart
Cost Tracking Script:
#!/bin/bash
# scripts/track-costs-by-environment.sh
ENVIRONMENT="${1:-all}"
START_DATE="${2:-$(date -d '30 days ago' +%Y-%m-%d)}"
END_DATE="${3:-$(date +%Y-%m-%d)}"
echo "💰 Cost Tracking: ${ENVIRONMENT}"
echo " Period: ${START_DATE} to ${END_DATE}"
echo ""
if [ "${ENVIRONMENT}" = "all" ]; then
ENVIRONMENTS=("production" "staging" "test" "development")
else
ENVIRONMENTS=("${ENVIRONMENT}")
fi
for ENV in "${ENVIRONMENTS[@]}"; do
echo "📊 ${ENV}:"
COST=$(az consumption usage list \
--start-date "${START_DATE}" \
--end-date "${END_DATE}" \
--query "[?tags.Environment=='${ENV}'].pretaxCost" \
--output tsv | \
awk '{sum+=$1} END {printf "%.2f", sum}')
echo " Total Cost: \$${COST}"
# Daily average
DAYS=$(( ($(date -d "${END_DATE}" +%s) - $(date -d "${START_DATE}" +%s)) / 86400 ))
AVG_DAILY=$(echo "scale=2; ${COST} / ${DAYS}" | bc)
echo " Avg Daily: \$${AVG_DAILY}"
echo ""
done
Budget Alerts¶
Budget Configuration:
#!/bin/bash
# scripts/create-budget-alert.sh
BUDGET_NAME="${1}"
AMOUNT="${2}"
RESOURCE_GROUP="${3}"
EMAIL="${4}"
# NOTE: the --notifications syntax below is illustrative; depending on your
# Azure CLI version, budget notifications may need the Cost Management API or an ARM template
az consumption budget create \
--budget-name "${BUDGET_NAME}" \
--amount "${AMOUNT}" \
--time-grain Monthly \
--start-date "$(date +%Y-%m-01)" \
--end-date "$(date -d '+1 year' +%Y-%m-01)" \
--category Cost \
--resource-group "${RESOURCE_GROUP}" \
--notifications threshold=50 threshold-type=Actual operator=GreaterThan contact-emails="${EMAIL}" \
--notifications threshold=80 threshold-type=Actual operator=GreaterThan contact-emails="${EMAIL}" \
--notifications threshold=100 threshold-type=Actual operator=GreaterThan contact-emails="${EMAIL}"
echo "✅ Budget created: ${BUDGET_NAME} (\$${AMOUNT}/month)"
Budget Alert Configuration:
// infrastructure/Budgets.cs (Pulumi C# example concept)
var productionBudget = new Budget("atp-production-budget", new BudgetArgs
{
BudgetName = "atp-production-monthly",
Amount = 10000.0, // $10,000/month
TimeGrain = "Monthly",
StartDate = DateTime.Now.ToString("yyyy-MM-01"),
Category = "Cost",
Notifications = new[]
{
new BudgetNotificationArgs
{
Threshold = 50, // 50% of budget
ThresholdType = "Actual",
Operator = "GreaterThan",
ContactEmails = new[] { "finance@example.com" }
},
new BudgetNotificationArgs
{
Threshold = 80, // 80% of budget
ThresholdType = "Actual",
Operator = "GreaterThan",
ContactEmails = new[] { "finance@example.com", "ops@example.com" }
},
new BudgetNotificationArgs
{
Threshold = 100, // 100% of budget
ThresholdType = "Actual",
Operator = "GreaterThan",
ContactEmails = new[] { "finance@example.com", "ops@example.com", "cto@example.com" }
}
}
});
Cost Anomaly Detection¶
Cost Anomaly Detection:
#!/bin/bash
# scripts/detect-cost-anomalies.sh
THRESHOLD="${1:-0.2}" # 20% increase threshold
echo "🔍 Detecting cost anomalies..."
# Get current month cost
CURRENT_MONTH=$(date +%Y-%m)
CURRENT_COST=$(az consumption usage list \
--start-date "${CURRENT_MONTH}-01" \
--end-date "$(date +%Y-%m-%d)" \
--query "[].pretaxCost" \
--output tsv | \
awk '{sum+=$1} END {print sum}')
# Get last month cost
LAST_MONTH=$(date -d '1 month ago' +%Y-%m)
LAST_MONTH_COST=$(az consumption usage list \
--start-date "${LAST_MONTH}-01" \
--end-date "${LAST_MONTH}-$(date -d "${LAST_MONTH}-01 +1 month -1 day" +%d)" \
--query "[].pretaxCost" \
--output tsv | \
awk '{sum+=$1} END {print sum}')
# Calculate increase percentage
INCREASE=$(echo "scale=2; (${CURRENT_COST} - ${LAST_MONTH_COST}) / ${LAST_MONTH_COST} * 100" | bc)
if (( $(echo "${INCREASE} > ${THRESHOLD} * 100" | bc -l) )); then
echo "⚠️ Cost anomaly detected!"
echo " Current month: \$${CURRENT_COST}"
echo " Last month: \$${LAST_MONTH_COST}"
echo " Increase: ${INCREASE}%"
echo " Threshold: $(echo "${THRESHOLD} * 100" | bc)%"
# Send alert
echo "Sending alert to finance@example.com..."
else
echo "✅ No cost anomalies detected"
echo " Increase: ${INCREASE}%"
fi
Cost Optimization Recommendations¶
Azure Advisor Cost Recommendations:
#!/bin/bash
# scripts/get-cost-optimization-recommendations.sh
echo "💰 Azure Advisor Cost Optimization Recommendations"
echo "=================================================="
# Get all cost recommendations
az advisor recommendation list \
--category Cost \
--query "[].{Name:shortDescription.problem, Impact:impact, ResourceGroup:resourceGroup, PotentialSavings:extendedProperties.potentialSavings}" \
--output table
echo ""
echo "📊 Top 10 Cost Savings Opportunities:"
az advisor recommendation list \
--category Cost \
--query "[?impact=='High' || impact=='Medium'].{Name:shortDescription.problem, Impact:impact, PotentialSavings:extendedProperties.potentialSavings, ResourceId:resourceId}" \
--output table | head -n 12 # 2 header lines + top 10 rows
Cost Allocation¶
Tags per Environment, Service, Tenant¶
Comprehensive Tagging Strategy:
# Resource tagging template
tags:
Environment: production | staging | test | development
Service: atp-ingestion | atp-query | atp-gateway | platform
Tenant: tenant-a | tenant-b | tenant-eu | shared
CostCenter: sales | engineering | operations
BusinessUnit: enterprise | smb
Project: audit-trail-platform
Owner: team-name@example.com
ManagedBy: gitops | terraform | pulumi
AutoShutdown: true | false
Criticality: critical | high | medium | low
Tagging in Pulumi:
// infrastructure/Tags.cs
public static class ResourceTags
{
public static InputMap<string> ProductionTags(string service, string costCenter)
{
return new InputMap<string>
{
{ "Environment", "production" },
{ "Service", service },
{ "CostCenter", costCenter },
{ "Project", "audit-trail-platform" },
{ "ManagedBy", "pulumi" },
{ "Criticality", "critical" },
{ "AutoShutdown", "false" }
};
}
public static InputMap<string> DevelopmentTags(string service)
{
return new InputMap<string>
{
{ "Environment", "development" },
{ "Service", service },
{ "CostCenter", "engineering" },
{ "Project", "audit-trail-platform" },
{ "ManagedBy", "pulumi" },
{ "Criticality", "low" },
{ "AutoShutdown", "true" }
};
}
}
Namespace-Level Cost Reporting¶
Namespace Cost Reporting:
#!/bin/bash
# scripts/namespace-cost-report.sh
NAMESPACE="${1}"
START_DATE="${2:-$(date -d '30 days ago' +%Y-%m-%d)}"
END_DATE="${3:-$(date +%Y-%m-%d)}"
echo "💰 Cost Report for Namespace: ${NAMESPACE}"
echo " Period: ${START_DATE} to ${END_DATE}"
echo ""
# Get pods in namespace
PODS=$(kubectl get pods -n "${NAMESPACE}" -o json | jq -r '.items[].metadata.name')
TOTAL_CPU=0
TOTAL_MEMORY=0
for POD in ${PODS}; do
# Get CPU and memory requests
CPU=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o json | \
jq -r '.spec.containers[].resources.requests.cpu' | \
sed 's/m//' | awk '{sum+=$1} END {print sum}')
MEMORY=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o json | \
jq -r '.spec.containers[].resources.requests.memory' | \
sed 's/Gi//' | awk '{sum+=$1} END {print sum}')
# Default to 0 when a pod has no requests set (avoids an arithmetic error)
TOTAL_CPU=$(( TOTAL_CPU + ${CPU:-0} ))
TOTAL_MEMORY=$(( TOTAL_MEMORY + ${MEMORY:-0} ))
done
echo "Resource Requests:"
echo " CPU: ${TOTAL_CPU}m cores"
echo " Memory: ${TOTAL_MEMORY}Gi"
echo ""
# Estimate cost (example pricing)
CPU_COST=$(echo "scale=2; ${TOTAL_CPU} / 1000 * 0.096 * 24 * 30" | bc)
MEMORY_COST=$(echo "scale=2; ${TOTAL_MEMORY} * 0.01 * 24 * 30" | bc)
TOTAL_COST=$(echo "scale=2; ${CPU_COST} + ${MEMORY_COST}" | bc)
echo "Estimated Monthly Cost:"
echo " CPU: \$${CPU_COST}"
echo " Memory: \$${MEMORY_COST}"
echo " Total: \$${TOTAL_COST}"
Chargeback to Teams¶
Team Chargeback Report:
#!/bin/bash
# scripts/team-chargeback-report.sh
TEAM="${1:-all}"
MONTH="${2:-$(date +%Y-%m)}"
echo "💰 Team Chargeback Report: ${TEAM}"
echo " Month: ${MONTH}"
echo ""
if [ "${TEAM}" = "all" ]; then
TEAMS=("engineering" "sales" "operations")
else
TEAMS=("${TEAM}")
fi
for T in "${TEAMS[@]}"; do
echo "📊 ${T}:"
# Get costs for team's resources
COST=$(az consumption usage list \
--start-date "${MONTH}-01" \
--end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
--query "[?tags.CostCenter=='${T}'].pretaxCost" \
--output tsv | \
awk '{sum+=$1} END {printf "%.2f", sum}')
echo " Total Cost: \$${COST}"
echo ""
done
Showback Reporting¶
Showback Dashboard Query:
// Log Analytics: Showback report
AzureCost
| where TimeGenerated >= ago(30d)
| extend CostCenter = tostring(Tags.CostCenter)
| extend Service = tostring(Tags.Service)
| extend Environment = tostring(Tags.Environment)
| summarize
TotalCost = sum(Cost),
ResourceCount = count()
by CostCenter, Service, Environment
| render barchart
Resource Cleanup Automation¶
Deleting Unused Images in ACR¶
ACR Cleanup Script:
#!/bin/bash
# scripts/cleanup-acr-images.sh
ACR_NAME="${1}"
KEEP_DAYS="${2:-30}" # Keep images from last 30 days
KEEP_TAGS="${3:-10}" # Keep 10 most recent tags per repository
echo "🧹 Cleaning up unused ACR images: ${ACR_NAME}"
echo " Keep days: ${KEEP_DAYS}"
echo " Keep tags per repo: ${KEEP_TAGS}"
echo ""
# Get all repositories
REPOS=$(az acr repository list --name "${ACR_NAME}" --output tsv)
for REPO in ${REPOS}; do
echo "📦 Repository: ${REPO}"
# Get all tags sorted by last update date
TAGS=$(az acr repository show-tags \
--name "${ACR_NAME}" \
--repository "${REPO}" \
--orderby time_desc \
--output tsv | head -n "${KEEP_TAGS}")
# Get tags older than KEEP_DAYS
OLD_TAGS=$(az acr repository show-tags \
  --name "${ACR_NAME}" \
  --repository "${REPO}" \
  --detail \
  --query "[?lastUpdateTime < '$(date -d "${KEEP_DAYS} days ago" -u +%Y-%m-%dT%H:%M:%SZ)'].name" \
  --output tsv) # --detail is required so lastUpdateTime is returned
# Delete old tags (but keep the KEEP_TAGS most recent)
for TAG in ${OLD_TAGS}; do
if ! echo "${TAGS}" | grep -qx "${TAG}"; then # exact-line match so v1 does not also match v10
echo " 🗑️ Deleting: ${REPO}:${TAG}"
az acr repository delete \
--name "${ACR_NAME}" \
--image "${REPO}:${TAG}" \
--yes || true
fi
done
done
echo "✅ ACR cleanup complete"
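Where ACR Tasks are available, the same retention policy can be expressed with the built-in acr purge command instead of a client-side loop (a sketch; the filter and retention values are illustrative):
#!/bin/bash
# scripts/cleanup-acr-images-purge.sh (alternative using ACR Tasks)
ACR_NAME="${1}"
az acr run \
  --registry "${ACR_NAME}" \
  --cmd "acr purge --filter '.*:.*' --ago 30d --keep 10 --untagged" \
  /dev/null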
Removing Old PersistentVolumes¶
PV Cleanup Script:
#!/bin/bash
# scripts/cleanup-old-pvs.sh
NAMESPACE="${1:-all}"
AGE_DAYS="${2:-30}" # Delete PVs older than 30 days
echo "🧹 Cleaning up old PersistentVolumes"
echo " Namespace: ${NAMESPACE}"
echo " Age threshold: ${AGE_DAYS} days"
echo ""
if [ "${NAMESPACE}" = "all" ]; then
PVS=$(kubectl get pv -o json | \
jq -r ".items[] | select(.status.phase == \"Released\" or .status.phase == \"Failed\") | .metadata.name")
else
PVS=$(kubectl get pv -o json | \
jq -r ".items[] | select(.spec.claimRef.namespace == \"${NAMESPACE}\" and (.status.phase == \"Released\" or .status.phase == \"Failed\")) | .metadata.name")
fi
for PV in ${PVS}; do
# Get PV creation timestamp
CREATED=$(kubectl get pv "${PV}" -o jsonpath='{.metadata.creationTimestamp}')
CREATED_EPOCH=$(date -d "${CREATED}" +%s)
AGE_EPOCH=$(date -d "${AGE_DAYS} days ago" +%s)
if [ "${CREATED_EPOCH}" -lt "${AGE_EPOCH}" ]; then
echo "🗑️ Deleting old PV: ${PV} (created: ${CREATED})"
kubectl delete pv "${PV}" || true
fi
done
echo "✅ PV cleanup complete"
Cleaning Up Completed Jobs¶
Job Cleanup Script:
#!/bin/bash
# scripts/cleanup-completed-jobs.sh
NAMESPACE="${1:-all}"
AGE_HOURS="${2:-24}" # Delete jobs older than 24 hours
echo "🧹 Cleaning up completed Jobs"
echo " Namespace: ${NAMESPACE}"
echo " Age threshold: ${AGE_HOURS} hours"
echo ""
if [ "${NAMESPACE}" = "all" ]; then
NAMESPACES=$(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}')
else
NAMESPACES="${NAMESPACE}" # plain string, matching the unquoted loop below
fi
for NS in ${NAMESPACES}; do
# Get completed/failed jobs
JOBS=$(kubectl get jobs -n "${NS}" -o json | \
jq -r ".items[] | select(.status.succeeded == 1 or .status.failed > 0) | .metadata.name")
for JOB in ${JOBS}; do
# Get job completion time
COMPLETION_TIME=$(kubectl get job "${JOB}" -n "${NS}" -o jsonpath='{.status.completionTime}')
if [ -n "${COMPLETION_TIME}" ]; then
COMPLETION_EPOCH=$(date -d "${COMPLETION_TIME}" +%s)
AGE_EPOCH=$(date -d "${AGE_HOURS} hours ago" +%s)
if [ "${COMPLETION_EPOCH}" -lt "${AGE_EPOCH}" ]; then
echo "🗑️ Deleting completed job: ${NS}/${JOB}"
kubectl delete job "${JOB}" -n "${NS}" || true
fi
fi
done
done
echo "✅ Job cleanup complete"
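A declarative alternative is to set ttlSecondsAfterFinished on Jobs so the TTL controller garbage-collects them without any script (a sketch; the job name and namespace are placeholders):
#!/bin/bash
# Apply a Job that Kubernetes deletes automatically 24 hours after it finishes
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: example-cleanup-demo
  namespace: atp-dev
spec:
  ttlSecondsAfterFinished: 86400
  template:
    spec:
      containers:
        - name: task
          image: busybox:1.36
          command: ["sh", "-c", "echo done"]
      restartPolicy: Never
EOF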
Snapshot Cleanup¶
Snapshot Cleanup Script:
#!/bin/bash
# scripts/cleanup-old-snapshots.sh
RESOURCE_GROUP="${1}"
AGE_DAYS="${2:-7}" # Keep snapshots from last 7 days
echo "🧹 Cleaning up old snapshots: ${RESOURCE_GROUP}"
echo " Age threshold: ${AGE_DAYS} days"
echo ""
# Get all snapshots older than AGE_DAYS
SNAPSHOTS=$(az snapshot list \
--resource-group "${RESOURCE_GROUP}" \
--query "[?timeCreated < '$(date -d "${AGE_DAYS} days ago" -u +%Y-%m-%dT%H:%M:%SZ)'].{Name:name, TimeCreated:timeCreated}" \
--output tsv)
# Read one TSV row at a time (a bare for-loop would also split on the tab
# between the name and timestamp columns)
while IFS=$'\t' read -r NAME TIME; do
  [ -z "${NAME}" ] && continue
  echo "🗑️ Deleting snapshot: ${NAME} (created: ${TIME})"
  az snapshot delete \
    --resource-group "${RESOURCE_GROUP}" \
    --name "${NAME}" || true
done <<< "${SNAPSHOTS}"
echo "✅ Snapshot cleanup complete"
Azure Advisor Recommendations¶
Reviewing Cost Recommendations¶
Review Azure Advisor Recommendations:
#!/bin/bash
# scripts/review-azure-advisor-recommendations.sh
echo "💰 Azure Advisor Cost Recommendations"
echo "======================================"
# Get all cost recommendations
az advisor recommendation list \
--category Cost \
--output table
echo ""
echo "📊 High Impact Recommendations:"
az advisor recommendation list \
  --category Cost \
  --query "[?impact=='High'].{Name:shortDescription.problem, ResourceGroup:resourceGroup, PotentialSavings:extendedProperties.potentialSavings}" \
  --output table
Implementing Right-Sizing Suggestions¶
Right-Sizing Implementation:
#!/bin/bash
# scripts/implement-right-sizing.sh
RECOMMENDATION_ID="${1}"
if [ -z "${RECOMMENDATION_ID}" ]; then
echo "Usage: $0 <recommendation-id>"
echo ""
echo "Available recommendations:"
az advisor recommendation list \
--category Cost \
--query "[].{ID:id, Name:shortDescription.problem, CurrentSKU:extendedProperties.currentSku, RecommendedSKU:extendedProperties.recommendedSku}" \
--output table
exit 1
fi
echo "📊 Implementing right-sizing recommendation: ${RECOMMENDATION_ID}"
# Get recommendation details (filter the list output by ID; az advisor has no 'show' subcommand)
RECOMMENDATION=$(az advisor recommendation list \
  --query "[?id=='${RECOMMENDATION_ID}'] | [0]" \
  --output json)
CURRENT_SKU=$(echo "${RECOMMENDATION}" | jq -r '.extendedProperties.currentSku')
RECOMMENDED_SKU=$(echo "${RECOMMENDATION}" | jq -r '.extendedProperties.recommendedSku')
RESOURCE_ID=$(echo "${RECOMMENDATION}" | jq -r '.resourceId')
echo " Current SKU: ${CURRENT_SKU}"
echo " Recommended SKU: ${RECOMMENDED_SKU}"
echo " Resource: ${RESOURCE_ID}"
echo ""
read -p "Apply this recommendation? (yes/no): " CONFIRM
if [ "${CONFIRM}" = "yes" ]; then
echo "🔧 Applying right-sizing..."
# Determine resource type and update
if echo "${RESOURCE_ID}" | grep -q "Microsoft.Compute/virtualMachines"; then
RESOURCE_GROUP=$(echo "${RESOURCE_ID}" | cut -d'/' -f5)
VM_NAME=$(echo "${RESOURCE_ID}" | cut -d'/' -f9)
echo " Updating VM: ${VM_NAME}"
az vm resize \
--resource-group "${RESOURCE_GROUP}" \
--name "${VM_NAME}" \
--size "${RECOMMENDED_SKU}"
else
echo " Resource type not yet supported for automatic resizing"
echo " Please apply manually: ${RESOURCE_ID}"
fi
else
echo "❌ Right-sizing not applied"
fi
SKU Optimization¶
SKU Optimization Analysis:
#!/bin/bash
# scripts/analyze-sku-optimization.sh
echo "📊 SKU Optimization Analysis"
echo "============================"
# Get all VMs and their current SKUs
az vm list \
--query "[].{Name:name, ResourceGroup:resourceGroup, Size:hardwareProfile.vmSize}" \
--output table
echo ""
echo "💰 Cost comparison (example VMs):"
echo " Standard_D4s_v3 (4 vCPU, 16 GiB): \$0.192/hour = \$140/month"
echo " Standard_D8s_v3 (8 vCPU, 32 GiB): \$0.384/hour = \$280/month"
echo " Standard_D16s_v3 (16 vCPU, 64 GiB): \$0.768/hour = \$561/month"
echo ""
echo "💡 Recommendations:"
echo " - Right-size based on actual usage (P95 metrics)"
echo " - Use Reserved Instances for production (up to 72% discount)"
echo " - Use Spot Instances for dev/test (up to 80% discount)"
FinOps Practices¶
Cost Monitoring Dashboards¶
FinOps Dashboard Configuration:
# monitoring/dashboards/finops-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: finops-dashboard
namespace: monitoring
data:
dashboard.json: |
{
"dashboard": {
"title": "ATP FinOps Dashboard",
"panels": [
{
"title": "Monthly Cost by Environment",
"targets": [
{
"expr": "sum(azure_cost_total{environment=~\"production|staging|test|development\"}) by (environment)",
"legendFormat": "{{environment}}"
}
]
},
{
"title": "Cost Trend (30 days)",
"targets": [
{
"expr": "sum(rate(azure_cost_total[1d]))",
"legendFormat": "Daily Cost"
}
]
},
{
"title": "Resource Utilization vs Cost",
"targets": [
{
"expr": "sum(container_cpu_usage_seconds_total) / sum(container_resource_requests_cpu_seconds_total) * 100",
"legendFormat": "CPU Utilization %"
},
{
"expr": "sum(container_memory_working_set_bytes) / sum(container_resource_requests_memory_bytes) * 100",
"legendFormat": "Memory Utilization %"
}
]
}
]
}
}
Monthly Cost Reviews¶
Monthly Cost Review Script:
#!/bin/bash
# scripts/monthly-cost-review.sh
MONTH="${1:-$(date -d '1 month ago' +%Y-%m)}"
echo "💰 Monthly Cost Review: ${MONTH}"
echo "================================"
echo ""
# Total cost
TOTAL_COST=$(az consumption usage list \
--start-date "${MONTH}-01" \
--end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
--query "[].pretaxCost" \
--output tsv | \
awk '{sum+=$1} END {printf "%.2f", sum}')
echo "📊 Total Cost: \$${TOTAL_COST}"
echo ""
# Cost by environment
echo "Cost by Environment:"
az consumption usage list \
--start-date "${MONTH}-01" \
--end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
--query "[].{Environment:tags.Environment, Cost:pretaxCost}" \
--output tsv | \
awk '{cost[$1]+=$2} END {for (env in cost) printf " %s: $%.2f\n", env, cost[env]}'
echo ""
# Cost by service
echo "Cost by Service:"
az consumption usage list \
--start-date "${MONTH}-01" \
--end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
--query "[].{Service:tags.Service, Cost:pretaxCost}" \
--output tsv | \
awk '{cost[$1]+=$2} END {for (svc in cost) printf " %s: $%.2f\n", svc, cost[svc]}'
echo ""
# Top 10 resources by cost
echo "Top 10 Resources by Cost:"
az consumption usage list \
--start-date "${MONTH}-01" \
--end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
--query "[].{Resource:instanceName, Cost:pretaxCost}" \
--output tsv | \
sort -k2 -nr | head -n 10
Budget Forecasting¶
Budget Forecast Script:
#!/bin/bash
# scripts/budget-forecast.sh
CURRENT_MONTH=$(date +%Y-%m)
echo "📈 Budget Forecast"
echo "=================="
echo ""
# Show the two previous months plus the current month to date
for i in {2..0}; do
MONTH=$(date -d "${i} months ago" +%Y-%m)
COST=$(az consumption usage list \
--start-date "${MONTH}-01" \
--end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
--query "[].pretaxCost" \
--output tsv | \
awk '{sum+=$1} END {printf "%.2f", sum}')
echo "${MONTH}: \$${COST}"
done
echo ""
# Forecast next month (simple average)
CURRENT_COST=$(az consumption usage list \
--start-date "${CURRENT_MONTH}-01" \
--end-date "$(date +%Y-%m-%d)" \
--query "[].pretaxCost" \
--output tsv | \
awk '{sum+=$1} END {printf "%.2f", sum}')
DAYS_IN_MONTH=$(date -d "$(date +%Y-%m-01) +1 month -1 day" +%d)
DAYS_ELAPSED=$(date +%d)
FORECAST=$(echo "scale=2; ${CURRENT_COST} / ${DAYS_ELAPSED} * ${DAYS_IN_MONTH}" | bc)
echo "📊 Projected total for ${CURRENT_MONTH}: \$${FORECAST}"
echo "   Linear extrapolation of the month-to-date spend"
Cost Optimization KPIs¶
Cost Optimization KPI Dashboard:
// Log Analytics: Cost Optimization KPIs
let CostData = AzureCost
| where TimeGenerated >= ago(30d)
| extend Environment = tostring(Tags.Environment)
| extend Service = tostring(Tags.Service)
| summarize TotalCost = sum(Cost) by Environment, Service, bin(TimeGenerated, 1d);
// KPI 1: Cost per Environment
CostData
| summarize
TotalCost = sum(TotalCost),
AvgDailyCost = avg(TotalCost)
by Environment
| extend KPI = "Cost per Environment";
// KPI 2: Resource Utilization vs Cost
union (
Perf
| where ObjectName == "K8SContainer"
| where CounterName == "cpuUsageNanoCores"
| summarize AvgCpu = avg(CounterValue) by Namespace, bin(TimeGenerated, 1d)
),
(
AzureCost
| where TimeGenerated >= ago(30d)
| extend Namespace = tostring(Tags.Namespace)
| summarize Cost = sum(Cost) by Namespace, bin(TimeGenerated, 1d)
)
| summarize
AvgCpu = max(AvgCpu),
TotalCost = max(Cost)
by Namespace, bin(TimeGenerated, 1d)
| extend Efficiency = TotalCost / (AvgCpu / 1000000000)
| extend KPI = "Cost Efficiency"
| render timechart
Cost Optimization KPIs:
| KPI | Target | Current | Status |
|---|---|---|---|
| Cost per Environment | < $5,000/month | $4,200 | ✅ |
| Resource Utilization | > 70% | 65% | ⚠️ |
| Cost per Transaction | < $0.01 | $0.008 | ✅ |
| Waste (Unused Resources) | < 10% | 12% | ⚠️ |
| Reserved Instance Coverage | > 80% | 75% | ⚠️ |
Summary: Cost Optimization in GitOps¶
- AKS Cost Optimization: Node pool sizing (right-sized VMs), spot instances for dev/test (80% discount), reserved instances for production (up to 72% discount), cluster autoscaler configuration
- Resource Right-Sizing: Analyzing actual resource usage (7-day metrics), adjusting requests and limits, Vertical Pod Autoscaler (VPA), recommendations from Azure Advisor
- Horizontal Pod Autoscaler (HPA): CPU-based scaling (70% threshold), memory-based scaling, custom metrics with KEDA, scaling policies for cost efficiency (conservative scale-down)
- Cluster Autoscaler: Adding nodes based on demand, removing idle nodes (50% utilization threshold), scale-down delays and thresholds, node affinity and taints for spot instances
- Development Environment Auto-Shutdown: Schedule-based scaling to zero (8 PM - 8 AM, weekends), scaling down replicas at night/weekends, wake-up procedures, cost savings calculation (60% savings)
- Azure Cost Management Integration: Cost tracking per environment, budget alerts (50%, 80%, 100% thresholds), cost anomaly detection, cost optimization recommendations
- Cost Allocation: Tags per environment/service/tenant, namespace-level cost reporting, chargeback to teams, showback reporting
- Resource Cleanup Automation: Deleting unused images in ACR (30-day retention), removing old PersistentVolumes, cleaning up completed jobs (24-hour retention), snapshot cleanup (7-day retention)
- Azure Advisor Recommendations: Reviewing cost recommendations, implementing right-sizing suggestions, SKU optimization
- FinOps Practices: Cost monitoring dashboards, monthly cost reviews, budget forecasting, cost optimization KPIs (utilization, waste, efficiency)
Networking & Service Mesh¶
Purpose: Define networking architecture, ingress controller configuration, certificate management, network policies, service mesh selection and implementation, mTLS, traffic management, and multi-cluster networking strategies for ATP's GitOps deployments, ensuring secure, scalable, and observable service-to-service communication across all environments.
AKS Networking Models¶
kubenet (Basic Networking)¶
kubenet Networking Overview:
graph TB
subgraph "AKS Cluster (kubenet)"
POD1[Pod 1<br/>10.244.0.0/24]
POD2[Pod 2<br/>10.244.1.0/24]
KUBENET[kubenet Plugin<br/>Overlay Network]
end
subgraph "Azure VNet"
VNET[VNet<br/>10.0.0.0/16]
SUBNET[Subnet<br/>10.0.1.0/24]
end
POD1 --> KUBENET
POD2 --> KUBENET
KUBENET --> SUBNET
SUBNET --> VNET
style KUBENET fill:#FFE5B4
style SUBNET fill:#90EE90
kubenet Characteristics:
| Aspect | kubenet | Description |
|---|---|---|
| Pod IP Addresses | Overlay network | Pods get IPs from overlay (10.244.0.0/16) |
| VNet Integration | Limited | Pod IPs not routable from VNet |
| IP Address Limit | Limited by nodes | 110 pods per node by default on AKS |
| Network Policies | ✅ Supported | NetworkPolicy resources |
| Azure Integration | ⚠️ Limited | Requires routing tables |
| Complexity | ✅ Simple | Easier to set up |
Azure CNI (Advanced Networking)¶
Azure CNI Networking Overview:
graph TB
subgraph "AKS Cluster (Azure CNI)"
POD1[Pod 1<br/>10.0.1.10]
POD2[Pod 2<br/>10.0.1.11]
AZCNI[Azure CNI<br/>Direct VNet Integration]
end
subgraph "Azure VNet"
VNET[VNet<br/>10.0.0.0/16]
SUBNET[Subnet<br/>10.0.1.0/24]
ROUTE[Route Tables]
NSG[Network Security Groups]
end
POD1 --> AZCNI
POD2 --> AZCNI
AZCNI --> SUBNET
SUBNET --> VNET
SUBNET --> ROUTE
SUBNET --> NSG
style AZCNI fill:#90EE90
style SUBNET fill:#87CEEB
Azure CNI Characteristics:
| Aspect | Azure CNI | Description |
|---|---|---|
| Pod IP Addresses | VNet IPs | Pods get IPs directly from VNet subnet |
| VNet Integration | ✅ Full | Pod IPs routable from VNet |
| IP Address Limit | Limited by subnet size | Large subnet required |
| Network Policies | ✅ Supported | Azure Network Policy or Calico |
| Azure Integration | ✅ Full | Direct integration with Azure services |
| Complexity | ⚠️ Complex | More configuration required |
Comparison and Selection¶
kubenet vs Azure CNI Comparison:
| Feature | kubenet | Azure CNI | ATP Selection |
|---|---|---|---|
| Pod IP Management | Overlay network | VNet IP addresses | ✅ Azure CNI (VNet integration) |
| VNet Integration | Limited | Full | ✅ Azure CNI (required for ATP) |
| IP Address Limits | 110 pods/node (default) | Subnet size | ✅ Azure CNI (more IPs) |
| Network Policies | ✅ Supported | ✅ Supported | ✅ Azure CNI |
| Azure Services | ⚠️ Routing required | ✅ Direct access | ✅ Azure CNI |
| Setup Complexity | ✅ Simple | ⚠️ Complex | ✅ Azure CNI (accept complexity) |
| Multi-Tenancy | ⚠️ Limited | ✅ Better isolation | ✅ Azure CNI |
ATP Decision: Azure CNI - Required for multi-tenancy, VNet integration, direct Azure service access, and network isolation per tenant namespace.
Pulumi C# AKS Configuration with Azure CNI:
// infrastructure/AKS.cs
var aksCluster = new ManagedCluster("atp-production-aks", new ManagedClusterArgs
{
ResourceGroupName = resourceGroup.Name,
Location = location,
DnsPrefix = "atp-prod",
KubernetesVersion = "1.27.3",
// Azure CNI networking
NetworkProfile = new ManagedClusterNetworkProfileArgs
{
NetworkPlugin = NetworkPlugin.Azure,
NetworkPolicy = NetworkPolicy.Azure,
ServiceCidr = "10.2.0.0/16", // Service CIDR
DnsServiceIP = "10.2.0.10",
PodCidr = null, // Not used with Azure CNI
LoadBalancerSku = LoadBalancerSku.Standard,
OutboundType = OutboundType.LoadBalancer,
LoadBalancerProfile = new ManagedClusterLoadBalancerProfileArgs
{
ManagedOutboundIPs = new ManagedClusterLoadBalancerProfileManagedOutboundIPsArgs
{
Count = 2
}
}
},
AgentPoolProfiles = new[]
{
new ManagedClusterAgentPoolProfileArgs
{
Name = "systempool",
VmSize = "Standard_D4s_v3",
Count = 3,
OsType = "Linux",
VnetSubnetId = subnet.Id, // Subnet for pods (large enough)
MaxPods = 50,
Mode = AgentPoolMode.System,
EnableAutoScaling = true,
MinCount = 3,
MaxCount = 10
}
}
});
Subnet Sizing for Azure CNI:
| Node Count | Pods per Node | Required IPs | Required Subnet Size | Example CIDR |
|---|---|---|---|---|
| 5 nodes | 50 pods | 260 | /23 (512 addresses) | 10.0.0.0/23 |
| 20 nodes | 50 pods | 1,025 | /21 (2,048 addresses) | 10.0.0.0/21 |
| 100 nodes | 50 pods | 5,105 | /19 (8,192 addresses) | 10.0.0.0/19 |
Subnet Calculation:
Required IPs = (Node count × Max pods per node) + Node count + 5 (Azure-reserved)
Example: (20 × 50) + 20 + 5 = 1,025 IPs → /21 subnet (2,048 addresses)
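The same calculation as a small helper (a sketch; it applies the formula above and rounds up to the next power-of-two subnet):
#!/bin/bash
# scripts/calculate-cni-subnet.sh (illustrative)
NODES="${1:?usage: $0 <nodes> [max-pods]}"
MAX_PODS="${2:-50}"
REQUIRED=$(( NODES * MAX_PODS + NODES + 5 ))
SIZE=1
PREFIX=32
# Double the subnet size until it covers the required IP count
while [ "${SIZE}" -lt "${REQUIRED}" ]; do
  SIZE=$(( SIZE * 2 ))
  PREFIX=$(( PREFIX - 1 ))
done
echo "Required IPs: ${REQUIRED} -> /${PREFIX} (${SIZE} addresses)"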
Ingress Controllers¶
NGINX Ingress Controller (ATP Choice)¶
NGINX Ingress Architecture:
graph TB
subgraph "Internet"
USER[Users]
end
subgraph "Azure Load Balancer"
LB[Load Balancer<br/>Public IP]
end
subgraph "AKS Cluster"
subgraph "ingress-nginx namespace"
NGINX1[NGINX Pod 1<br/>Replica 1]
NGINX2[NGINX Pod 2<br/>Replica 2]
NGINX_SVC[NGINX Service<br/>LoadBalancer]
end
subgraph "Application Namespaces"
APP1[ATP Ingestion<br/>Service]
APP2[ATP Query<br/>Service]
APP3[ATP Gateway<br/>Service]
end
end
USER --> LB
LB --> NGINX_SVC
NGINX_SVC --> NGINX1
NGINX_SVC --> NGINX2
NGINX1 --> APP1
NGINX1 --> APP2
NGINX2 --> APP3
NGINX2 --> APP1
style NGINX1 fill:#90EE90
style NGINX2 fill:#90EE90
style APP1 fill:#FFE5B4
NGINX Ingress Installation via Helm:
#!/bin/bash
# scripts/install-nginx-ingress.sh
# Add NGINX Ingress Helm repository
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
# Install NGINX Ingress Controller
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--create-namespace \
--set controller.replicaCount=2 \
--set controller.nodeSelector."kubernetes\.io/os"=linux \
--set controller.service.type=LoadBalancer \
--set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz \
--set controller.service.externalTrafficPolicy=Local \
--set controller.resources.requests.cpu=100m \
--set controller.resources.requests.memory=128Mi \
--set controller.resources.limits.cpu=500m \
--set controller.resources.limits.memory=512Mi \
--set controller.metrics.enabled=true \
--set controller.podSecurityPolicy.enabled=false
echo "✅ NGINX Ingress Controller installed"
echo " Waiting for LoadBalancer IP..."
kubectl wait --namespace ingress-nginx \
--for=condition=ready pod \
--selector=app.kubernetes.io/component=controller \
--timeout=300s
# Get LoadBalancer IP
EXTERNAL_IP=$(kubectl get svc ingress-nginx-controller -n ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo " External IP: ${EXTERNAL_IP}"
NGINX Ingress via FluxCD:
# platform/ingress-nginx/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
name: ingress-nginx
namespace: ingress-nginx
spec:
interval: 5m
chart:
spec:
chart: ingress-nginx
sourceRef:
kind: HelmRepository
name: ingress-nginx
interval: 1h
values:
controller:
replicaCount: 2
nodeSelector:
kubernetes.io/os: linux
service:
type: LoadBalancer
annotations:
service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz
externalTrafficPolicy: Local
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
metrics:
enabled: true
serviceMonitor:
enabled: true
podSecurityPolicy:
enabled: false
Azure Application Gateway Ingress (Alternative)¶
Azure Application Gateway Ingress Controller (AGIC):
# platform/application-gateway/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
name: ingress-appgw
namespace: ingress-appgw
spec:
interval: 5m
chart:
spec:
chart: ingress-azure
sourceRef:
kind: HelmRepository
name: application-gateway-kubernetes-ingress
interval: 1h
values:
appgw:
subscriptionId: ${AZURE_SUBSCRIPTION_ID}
resourceGroup: atp-production-rg
name: atp-prod-appgw
usePrivateIP: false
armAuth:
type: aadPodIdentity
identityResourceID: /subscriptions/${AZURE_SUBSCRIPTION_ID}/resourcegroups/${RESOURCE_GROUP}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/agic-identity
identityClientID: ${AGIC_IDENTITY_CLIENT_ID}
rbac:
enabled: true
AGIC vs NGINX Comparison:
| Feature | NGINX Ingress | Azure Application Gateway | ATP Selection |
|---|---|---|---|
| Cost | ✅ Lower | ⚠️ Higher (dedicated gateway) | ✅ NGINX |
| WAF | ⚠️ External (Cloudflare) | ✅ Built-in | ⚠️ NGINX (accept trade-off) |
| SSL Termination | ✅ Supported | ✅ Supported | ✅ Both |
| Path-based Routing | ✅ Supported | ✅ Supported | ✅ Both |
| Azure Integration | ⚠️ Basic | ✅ Full | ⚠️ NGINX (sufficient) |
| ATP Decision | ✅ Selected | ❌ Not selected | ✅ NGINX |
ATP Decision: NGINX Ingress Controller - Lower cost, sufficient features, simpler management, standard Kubernetes ingress.
Installation and Configuration¶
NGINX Ingress Configuration:
# platform/ingress-nginx/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: ingress-nginx-controller
namespace: ingress-nginx
data:
# Connection settings
worker-processes: "auto"
worker-connections: "16384"
max-worker-open-files: "65535"
# Timeouts
proxy-connect-timeout: "60"
proxy-send-timeout: "60"
proxy-read-timeout: "60"
# SSL
ssl-protocols: "TLSv1.2 TLSv1.3"
ssl-ciphers: "ECDHE-ECDSA-AES128-GCM-SHA256,ECDHE-RSA-AES128-GCM-SHA256"
ssl-prefer-server-ciphers: "true"
# Logging
log-format-upstream: '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_length $request_time [$proxy_upstream_name] [$proxy_alternative_upstream_name] $upstream_addr $upstream_response_length $upstream_response_time $upstream_status $req_id'
# Rate limiting
enable-brotli: "true"
use-forwarded-headers: "true"
compute-full-forwarded-for: "true"
TLS Termination¶
TLS Termination in NGINX:
# apps/atp-gateway/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: atp-gateway-ingress
namespace: atp-production
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
nginx.ingress.kubernetes.io/proxy-body-size: "10m"
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
ingressClassName: nginx
tls:
- hosts:
- api.atp.connectsoft.example
- gateway.atp.connectsoft.example
secretName: atp-gateway-tls
rules:
- host: api.atp.connectsoft.example
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: atp-gateway
port:
number: 80
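Once the certificate is issued, termination can be verified from outside the cluster (the hostname matches the example above):
#!/bin/bash
# Verify TLS termination and the HTTP-to-HTTPS redirect
HOST="api.atp.connectsoft.example"
echo | openssl s_client -connect "${HOST}:443" -servername "${HOST}" 2>/dev/null | \
  openssl x509 -noout -subject -issuer -dates
curl -sI "http://${HOST}/" | head -n 1   # expect a 301/308 redirect to HTTPS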
Certificate Management¶
cert-manager Overview¶
cert-manager Architecture:
graph TB
subgraph "Kubernetes Cluster"
INGRESS[Ingress<br/>TLS Secret]
CERT_MGR[cert-manager<br/>Controller]
CERT[cert-manager<br/>Certificate CRD]
CLUSTER_ISSUER[ClusterIssuer<br/>Let's Encrypt]
end
subgraph "Let's Encrypt"
LE[Let's Encrypt<br/>API]
CHALLENGE[HTTP-01 Challenge]
end
subgraph "DNS"
TXT[TXT Record<br/>DNS-01 Challenge]
end
INGRESS --> CERT
CERT --> CERT_MGR
CERT_MGR --> CLUSTER_ISSUER
CLUSTER_ISSUER --> LE
LE --> CHALLENGE
LE --> TXT
CERT_MGR --> CERT
CERT --> INGRESS
style CERT_MGR fill:#90EE90
style CLUSTER_ISSUER fill:#FFE5B4
cert-manager Installation:
#!/bin/bash
# scripts/install-cert-manager.sh
# CRDs are installed by the Helm chart below (--set installCRDs=true),
# so a separate kubectl apply of the CRD manifest is unnecessary
# Add cert-manager Helm repository
helm repo add jetstack https://charts.jetstack.io
helm repo update
# Install cert-manager
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--version v1.13.0 \
--set installCRDs=true \
--set global.leaderElection.namespace=cert-manager \
--set resources.requests.cpu=100m \
--set resources.requests.memory=128Mi
echo "✅ cert-manager installed"
kubectl wait --for=condition=ready pod \
--all -n cert-manager \
--timeout=300s
cert-manager via FluxCD:
# platform/cert-manager/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
name: cert-manager
namespace: cert-manager
spec:
interval: 5m
chart:
spec:
chart: cert-manager
sourceRef:
kind: HelmRepository
name: jetstack
version: v1.13.0
values:
installCRDs: true
global:
leaderElection:
namespace: cert-manager
resources:
requests:
cpu: 100m
memory: 128Mi
webhook:
resources:
requests:
cpu: 50m
memory: 64Mi
Let's Encrypt Integration¶
Let's Encrypt ClusterIssuer (HTTP-01 Challenge):
# platform/cert-manager/clusterissuer-letsencrypt-prod.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: devops@connectsoft.example
privateKeySecretRef:
name: letsencrypt-prod
solvers:
- http01:
ingress:
class: nginx
podTemplate:
spec:
nodeSelector:
kubernetes.io/os: linux
Let's Encrypt ClusterIssuer (DNS-01 Challenge for Wildcard):
# platform/cert-manager/clusterissuer-letsencrypt-dns.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-dns
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: devops@connectsoft.example
privateKeySecretRef:
name: letsencrypt-dns
solvers:
- dns01:
azureDNS:
clientID: ${AZURE_CLIENT_ID}
clientSecretSecretRef:
name: azure-dns-secret
key: client-secret
subscriptionID: ${AZURE_SUBSCRIPTION_ID}
tenantID: ${AZURE_TENANT_ID}
resourceGroupName: atp-production-rg
hostedZoneName: connectsoft.example
environment: AzurePublicCloud
Let's Encrypt Staging ClusterIssuer:
# platform/cert-manager/clusterissuer-letsencrypt-staging.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-staging
spec:
acme:
server: https://acme-staging-v02.api.letsencrypt.org/directory
email: devops@connectsoft.example
privateKeySecretRef:
name: letsencrypt-staging
solvers:
- http01:
ingress:
class: nginx
ClusterIssuer Configuration¶
ClusterIssuer Configuration Matrix:
| ClusterIssuer | Challenge Type | Use Case | Rate Limits |
|---|---|---|---|
| letsencrypt-prod | HTTP-01 | Production domains | 50 certs/week/domain |
| letsencrypt-staging | HTTP-01 | Testing | 300 certs/week/domain |
| letsencrypt-dns | DNS-01 | Wildcard certificates | 50 certs/week/domain |
Certificate Resource:
# apps/atp-gateway/certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: atp-gateway-tls
namespace: atp-production
spec:
secretName: atp-gateway-tls
issuerRef:
  name: letsencrypt-dns # DNS-01 issuer; required because a wildcard name is requested
  kind: ClusterIssuer
commonName: api.atp.connectsoft.example
dnsNames:
  - api.atp.connectsoft.example
  - gateway.atp.connectsoft.example
  - "*.atp.connectsoft.example" # quoted; a bare * is invalid YAML
duration: 2160h # 90 days
renewBefore: 720h # Renew 30 days before expiration
Automatic Certificate Renewal¶
Certificate Renewal Flow:
sequenceDiagram
participant Cert as Certificate
participant CM as cert-manager
participant LE as Let's Encrypt
participant NGINX as NGINX Ingress
Cert->>CM: Certificate expires in 30 days
CM->>LE: Request renewal
LE->>CM: Challenge request
CM->>NGINX: Create challenge ingress
NGINX->>LE: Serve challenge
LE->>CM: Validate challenge
CM->>LE: Get new certificate
LE->>CM: Issue certificate
CM->>Cert: Update TLS secret
Cert->>NGINX: Reload with new cert
Certificate Status Check:
#!/bin/bash
# scripts/check-certificate-status.sh
NAMESPACE="${1:-all}"
echo "🔍 Certificate Status Check"
echo "============================"
if [ "${NAMESPACE}" = "all" ]; then
CERTIFICATES=$(kubectl get certificates --all-namespaces -o json | \
jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"')
else
CERTIFICATES=$(kubectl get certificates -n "${NAMESPACE}" -o json | \
jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"')
fi
for CERT in ${CERTIFICATES}; do
NS=$(echo "${CERT}" | cut -d'/' -f1)
NAME=$(echo "${CERT}" | cut -d'/' -f2)
STATUS=$(kubectl get certificate "${NAME}" -n "${NS}" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
AGE=$(kubectl get certificate "${NAME}" -n "${NS}" -o jsonpath='{.metadata.creationTimestamp}')
NOT_AFTER=$(kubectl get certificate "${NAME}" -n "${NS}" -o jsonpath='{.status.notAfter}')
if [ "${STATUS}" = "True" ]; then
echo "✅ ${NS}/${NAME}: Ready"
if [ -n "${NOT_AFTER}" ]; then
DAYS_UNTIL_EXPIRY=$(( ($(date -d "${NOT_AFTER}" +%s) - $(date +%s)) / 86400 ))
echo " Expires in: ${DAYS_UNTIL_EXPIRY} days"
fi
else
echo "❌ ${NS}/${NAME}: Not Ready"
kubectl describe certificate "${NAME}" -n "${NS}" | grep -A 5 "Status:"
fi
done
Certificate Monitoring¶
Certificate Expiration Alert:
# monitoring/alerts/certificate-expiration.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: certificate-expiration
namespace: monitoring
spec:
groups:
- name: certificate
interval: 1h
rules:
- alert: CertificateExpiringSoon
expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 30
for: 1h
labels:
severity: warning
annotations:
summary: "Certificate expiring soon"
description: "Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} expires in {{ $value }} days"
- alert: CertificateExpiringVerySoon
expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
for: 1h
labels:
severity: critical
annotations:
summary: "Certificate expiring very soon"
description: "Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} expires in {{ $value }} days"
Network Policies¶
Default Deny All Policy¶
Default Deny All Network Policy:
# platform/network-policies/default-deny-all.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: atp-production
spec:
podSelector: {} # Match all pods
policyTypes:
- Ingress
- Egress
# No rules = deny all traffic
Apply Default Deny to All Namespaces:
#!/bin/bash
# scripts/apply-default-deny-policy.sh
NAMESPACES=("atp-production" "atp-staging" "atp-test")
for NS in "${NAMESPACES[@]}"; do
echo "Applying default deny policy to namespace: ${NS}"
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: ${NS}
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
EOF
done
echo "✅ Default deny policies applied"
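With deny-all in place, a scratch pod should fail to reach any service until an allow rule matches it. A quick verification sketch (service name and port follow the examples in this section):
#!/bin/bash
# Expect the request to time out while only default-deny is applied
kubectl run netpol-test --rm -it --restart=Never \
  --namespace atp-production \
  --image=busybox:1.36 -- \
  wget -qO- --timeout=5 http://atp-gateway:80 || echo "blocked (expected)"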
Service-to-Service Allow Rules¶
Service-to-Service Communication:
# apps/atp-gateway/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: atp-gateway-network-policy
namespace: atp-production
spec:
podSelector:
matchLabels:
app: atp-gateway
policyTypes:
- Ingress
- Egress
ingress:
# Allow from ingress controller
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
- podSelector:
matchLabels:
app.kubernetes.io/name: ingress-nginx
ports:
- protocol: TCP
port: 8080
# Allow from other ATP services
- from:
- podSelector:
matchLabels:
app: atp-ingestion
- podSelector:
matchLabels:
app: atp-query
ports:
- protocol: TCP
port: 8080
egress:
# Allow to ATP services
- to:
- podSelector:
matchLabels:
app: atp-ingestion
- podSelector:
matchLabels:
app: atp-query
ports:
- protocol: TCP
port: 8080
# Allow to external services (database, Redis, etc.)
- to:
- ipBlock:
cidr: 10.0.0.0/16 # Azure VNet
ports:
- protocol: TCP
port: 5432 # PostgreSQL
- protocol: TCP
port: 6380 # Redis
Ingress and Egress Rules¶
Ingress Allow Rules:
# apps/atp-ingestion/network-policy-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: atp-ingestion-allow-ingress
namespace: atp-production
spec:
podSelector:
matchLabels:
app: atp-ingestion
policyTypes:
- Ingress
ingress:
# Allow from gateway
- from:
- podSelector:
matchLabels:
app: atp-gateway
ports:
- protocol: TCP
port: 8080
# Allow from query service
- from:
- podSelector:
matchLabels:
app: atp-query
ports:
- protocol: TCP
port: 8080
Egress Allow Rules:
# apps/atp-ingestion/network-policy-egress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: atp-ingestion-allow-egress
namespace: atp-production
spec:
podSelector:
matchLabels:
app: atp-ingestion
policyTypes:
- Egress
egress:
# Allow DNS
- to:
- namespaceSelector:
matchLabels:
name: kube-system
ports:
- protocol: UDP
port: 53
# Allow to database
- to:
- ipBlock:
cidr: 10.0.2.0/24 # Database subnet
ports:
- protocol: TCP
port: 5432
# Allow to Redis
- to:
- ipBlock:
cidr: 10.0.3.0/24 # Redis subnet
ports:
- protocol: TCP
port: 6380
# Allow to Service Bus
- to:
- ipBlock:
cidr: 0.0.0.0/0 # Azure Service Bus (public IP)
ports:
  - protocol: TCP
    port: 5671 # AMQP over TLS
  - protocol: TCP
    port: 443 # AMQP over WebSockets (duplicate port keys in one entry are invalid YAML)
DNS Exceptions¶
DNS Exception in Network Policy:
# platform/network-policies/dns-exception.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns
namespace: atp-production
spec:
podSelector: {}
policyTypes:
- Egress
egress:
# Allow DNS queries
- to:
- namespaceSelector:
matchLabels:
name: kube-system
- podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
Monitoring and Logging Exceptions¶
Monitoring Exception:
# platform/network-policies/monitoring-exception.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-monitoring
namespace: atp-production
spec:
podSelector: {}
policyTypes:
- Egress
egress:
# Allow to Prometheus
- to:
- namespaceSelector:
matchLabels:
name: monitoring
- podSelector:
matchLabels:
app: prometheus
ports:
- protocol: TCP
port: 9090
# Allow to Grafana
- to:
- namespaceSelector:
matchLabels:
name: monitoring
- podSelector:
matchLabels:
app: grafana
ports:
- protocol: TCP
port: 3000
# Allow to Azure Monitor (Log Analytics)
- to:
- ipBlock:
cidr: 0.0.0.0/0
ports:
- protocol: TCP
port: 443
Service Mesh Options¶
Linkerd (Lightweight, ATP Preference)¶
Linkerd Architecture:
graph TB
subgraph "Service A Pod"
APP_A[Application]
PROXY_A[Linkerd Proxy<br/>sidecar]
end
subgraph "Service B Pod"
APP_B[Application]
PROXY_B[Linkerd Proxy<br/>sidecar]
end
subgraph "Linkerd Control Plane"
DEST[destination]
IDENTITY[identity]
PROXY_INJECTOR[proxy-injector]
end
APP_A <--> PROXY_A
APP_B <--> PROXY_B
PROXY_A <--mTLS--> PROXY_B
PROXY_A --> DEST
PROXY_B --> DEST
PROXY_A --> IDENTITY
PROXY_B --> IDENTITY
style PROXY_A fill:#90EE90
style PROXY_B fill:#90EE90
style DEST fill:#FFE5B4
Linkerd Installation:
#!/bin/bash
# scripts/install-linkerd.sh
# Install Linkerd CLI
curl -sL https://run.linkerd.io/install-edge | sh
export PATH=$PATH:$HOME/.linkerd2/bin
# Verify installation
linkerd version --client
# Check cluster prerequisites
linkerd check --pre
# Install Linkerd CRDs, then the control plane (two-step install since Linkerd 2.12)
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
# Wait for control plane to be ready
linkerd check
# Install Linkerd Viz (observability)
linkerd viz install | kubectl apply -f -
# Install Linkerd Multicluster (if needed)
# linkerd multicluster install | kubectl apply -f -
echo "✅ Linkerd installed"
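Existing workloads join the mesh once their namespace is annotated for injection and the pods are restarted (a sketch):
#!/bin/bash
# scripts/mesh-namespace.sh (illustrative)
NAMESPACE="${1:-atp-production}"
kubectl annotate namespace "${NAMESPACE}" linkerd.io/inject=enabled --overwrite
kubectl rollout restart deployment -n "${NAMESPACE}"
# Confirm the injected proxies are healthy
linkerd check --proxy -n "${NAMESPACE}"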
Linkerd via FluxCD:
# platform/linkerd/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
name: linkerd-control-plane
namespace: linkerd
spec:
interval: 5m
chart:
spec:
chart: linkerd-control-plane
sourceRef:
kind: HelmRepository
name: linkerd
version: 1.14.0
values:
  # Trust anchor and issuer credentials come from your CA tooling
  # (e.g. generated with the step CLI); shown as placeholders here
  identityTrustAnchorsPEM: |
    # Trust anchor certificate
  identity:
    issuer:
      tls:
        crtPEM: |
          # Issuer certificate
        keyPEM: |
          # Issuer private key
Istio (Feature-Rich, Complex)¶
Istio vs Linkerd Comparison:
| Feature | Linkerd | Istio | ATP Selection |
|---|---|---|---|
| Size | ✅ Lightweight (~50MB) | ⚠️ Heavy (~500MB) | ✅ Linkerd |
| Learning Curve | ✅ Simple | ⚠️ Complex | ✅ Linkerd |
| mTLS | ✅ Automatic | ✅ Automatic | ✅ Linkerd |
| Traffic Management | ✅ Supported | ✅ Rich features | ⚠️ Linkerd (sufficient) |
| Observability | ✅ Built-in | ✅ Built-in | ✅ Linkerd |
| Resource Usage | ✅ Low | ⚠️ High | ✅ Linkerd |
| ATP Decision | ✅ Selected | ❌ Not selected | ✅ Linkerd |
ATP Decision: Linkerd - Lightweight, simple, sufficient features, low resource usage, better fit for ATP's requirements.
Open Service Mesh (Azure-Native)¶
Open Service Mesh (OSM) Overview:
| Feature | OSM | Linkerd | ATP Selection |
|---|---|---|---|
| Azure Integration | ✅ Native | ⚠️ Generic | ⚠️ Linkerd (sufficient) |
| Maturity | ⚠️ Newer | ✅ Mature | ✅ Linkerd |
| Community | ⚠️ Smaller | ✅ Large | ✅ Linkerd |
| Features | ✅ Good | ✅ Good | ✅ Linkerd |
ATP Decision: Linkerd - More mature, larger community, proven in production, sufficient Azure integration.
Comparison and Selection¶
Service Mesh Selection Matrix:
| Criteria | Weight | Linkerd | Istio | OSM | Winner |
|---|---|---|---|---|---|
| Simplicity | High | 9 | 4 | 7 | ✅ Linkerd |
| Resource Usage | High | 9 | 5 | 7 | ✅ Linkerd |
| Features | Medium | 7 | 9 | 7 | ⚠️ Istio |
| Maturity | High | 9 | 9 | 6 | ✅ Linkerd/Istio |
| ATP Decision | - | Selected | - | - | ✅ Linkerd |
mTLS Between Services¶
Automatic mTLS with Service Mesh¶
Linkerd Automatic mTLS:
# Linkerd automatically enables mTLS for all injected services
# No configuration required - works out of the box
# Example: Service with Linkerd proxy injection
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
namespace: atp-production
annotations:
linkerd.io/inject: enabled # Enable automatic injection
spec:
template:
metadata:
annotations:
linkerd.io/inject: enabled
spec:
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:v1.2.3
# Linkerd proxy automatically injected as sidecar
Verify mTLS Status:
# Check mTLS status for all services
linkerd viz stat deploy -n atp-production
# Check mTLS percentage
linkerd viz edges deploy -n atp-production
# View service topology with mTLS
linkerd viz tap deploy/atp-gateway -n atp-production
Certificate Rotation¶
Linkerd Certificate Rotation (a sketch of Linkerd's documented trust-anchor rotation, assuming the step CLI is available):
#!/bin/bash
# scripts/rotate-linkerd-certificates.sh
echo "🔄 Rotating Linkerd trust anchor..."
# Generate a new trust anchor
step certificate create root.linkerd.cluster.local ca-new.crt ca-new.key \
  --profile root-ca --no-password --insecure
# Bundle the old and new anchors so proxies trust both during the rollover
cat ca-old.crt ca-new.crt > ca-bundle.crt
# Upgrade the control plane with the bundled trust anchors
linkerd upgrade --identity-trust-anchors-file=ca-bundle.crt | kubectl apply -f -
# Verify rotation
linkerd check --proxy
echo "✅ Linkerd trust anchor rotated"
Automatic Certificate Rotation:
Linkerd rotates proxy leaf certificates automatically before they expire (default validity: 24 hours). The issuer credentials consumed by the identity service live in a Secret, not a ConfigMap:
# Issuer credentials referenced by the Linkerd identity service
apiVersion: v1
kind: Secret
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
type: kubernetes.io/tls
data:
  ca.crt: <base64 trust anchor>
  tls.crt: <base64 issuer certificate>
  tls.key: <base64 issuer key>
Identity and Authorization¶
Linkerd Authorization Policy:
# apps/atp-gateway/authorization-policy.yaml
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
name: atp-gateway-server
namespace: atp-production
spec:
podSelector:
matchLabels:
app: atp-gateway
port: 8080
proxyProtocol: HTTP/1
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: atp-gateway-authz
  namespace: atp-production
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: atp-gateway-server
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: MeshTLSAuthentication
      name: atp-mesh-mtls
---
# Accept only mTLS identities from atp-production service accounts
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: atp-mesh-mtls
  namespace: atp-production
spec:
  identities:
    - "*.atp-production.serviceaccount.identity.linkerd.cluster.local"
Traffic Management¶
Canary Routing with Service Mesh¶
Linkerd TrafficSplit for Canary:
# apps/atp-ingestion/canary-trafficsplit.yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
name: atp-ingestion-canary
namespace: atp-production
spec:
service: atp-ingestion
backends:
- service: atp-ingestion-stable
weight: 90 # 90% traffic to stable
- service: atp-ingestion-canary
weight: 10 # 10% traffic to canary
Canary Deployment Strategy:
graph TB
INGRESS[Ingress<br/>100% Traffic]
TRAFFIC_SPLIT[TrafficSplit<br/>90/10 Split]
STABLE[Stable Service<br/>90% Traffic]
CANARY[Canary Service<br/>10% Traffic]
INGRESS --> TRAFFIC_SPLIT
TRAFFIC_SPLIT --> STABLE
TRAFFIC_SPLIT --> CANARY
style TRAFFIC_SPLIT fill:#FFE5B4
style STABLE fill:#90EE90
style CANARY fill:#FFB6C1
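Weight shifting can be scripted against the TrafficSplit while metrics are observed between steps (a manual sketch; Flagger automates this loop in the progressive-delivery setup):
#!/bin/bash
# scripts/shift-canary-weight.sh (illustrative)
NAMESPACE="atp-production"
SPLIT="atp-ingestion-canary"
for WEIGHT in 10 25 50 75 100; do
  STABLE=$(( 100 - WEIGHT ))
  echo "Shifting traffic: stable=${STABLE} canary=${WEIGHT}"
  kubectl patch trafficsplit "${SPLIT}" -n "${NAMESPACE}" --type=merge -p \
    "{\"spec\":{\"backends\":[{\"service\":\"atp-ingestion-stable\",\"weight\":${STABLE}},{\"service\":\"atp-ingestion-canary\",\"weight\":${WEIGHT}}]}}"
  sleep 300   # observe success rate and latency before the next step
done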
Circuit Breakers¶
Linkerd Circuit Breaking (Linkerd configures circuit breaking through failure-accrual annotations on the Service, not through ServiceProfile fields):
# apps/atp-ingestion/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion
  namespace: atp-production
  annotations:
    # Eject an endpoint after 5 consecutive failures, backing off 1s-30s
    balancer.linkerd.io/failure-accrual: consecutive
    balancer.linkerd.io/failure-accrual-consecutive-max-failures: "5"
    balancer.linkerd.io/failure-accrual-consecutive-min-penalty: "1s"
    balancer.linkerd.io/failure-accrual-consecutive-max-penalty: "30s"
spec:
  selector:
    app: atp-ingestion
  ports:
    - port: 8080
      targetPort: 8080
Retry Policies¶
Linkerd Retry Policy:
# apps/atp-gateway/retry-policy.yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: atp-ingestion.atp-production.svc.cluster.local # ServiceProfile names are service FQDNs
  namespace: atp-production
spec:
  routes:
    - name: default
      condition:
        method: POST
        pathRegex: "/api/ingestion"
      isRetryable: true
      timeout: 30s
  retryBudget: # the budget applies profile-wide, not per route
    retryRatio: 0.2 # max 20% additional load from retries
    minRetriesPerSecond: 10
    ttl: 10s
Timeout Configuration¶
Linkerd Timeout Policy:
# apps/atp-gateway/timeout-policy.yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: atp-query.atp-production.svc.cluster.local # ServiceProfile names are service FQDNs
  namespace: atp-production
spec:
routes:
- name: query-route
condition:
method: GET
pathRegex: "/api/query/.*"
timeout: 5s # 5 second timeout
- name: export-route
condition:
method: GET
pathRegex: "/api/export/.*"
timeout: 60s # 60 second timeout for exports
Observability with Service Mesh¶
Distributed Tracing¶
Linkerd Distributed Tracing (enabled through the linkerd-jaeger extension; the collector address points at the Jaeger deployment shown below):
#!/bin/bash
# scripts/install-linkerd-jaeger.sh
# Install the tracing extension, pointing proxies at an existing collector
linkerd jaeger install \
  --set collector.enabled=false \
  --set jaeger.enabled=false \
  --set webhook.collectorSvcAddr=jaeger-collector.monitoring:14268 | kubectl apply -f -
linkerd jaeger check
Linkerd + Jaeger Integration:
# platform/linkerd/jaeger-integration.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
  namespace: monitoring
spec:
  selector: # required by apps/v1 Deployments
    matchLabels:
      app: jaeger-collector
  template:
    metadata:
      labels:
        app: jaeger-collector
    spec:
      containers:
        - name: jaeger-collector
          image: jaegertracing/jaeger-collector:latest
          env:
            - name: SPAN_STORAGE_TYPE
              value: "elasticsearch"
            - name: ES_SERVER_URLS
              value: "http://elasticsearch.monitoring:9200"
Metrics and Dashboards¶
Linkerd Metrics:
# View service metrics
linkerd viz stat deploy -n atp-production
# View top services
linkerd viz top deploy -n atp-production
# View per-route metrics (requires a ServiceProfile for the service)
linkerd viz routes deploy/atp-ingestion -n atp-production
Linkerd Grafana Dashboard:
# platform/linkerd/grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: linkerd-dashboard
namespace: monitoring
data:
linkerd-dashboard.json: |
{
"dashboard": {
"title": "Linkerd Service Mesh",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "sum(rate(linkerd_proxy_http_requests_total{deployment=\"$deployment\"}[1m]))",
"legendFormat": "{{deployment}}"
}
]
},
{
"title": "P50 Latency",
"targets": [
{
"expr": "histogram_quantile(0.5, sum(rate(linkerd_proxy_http_request_duration_seconds_bucket{deployment=\"$deployment\"}[1m])) by (le, deployment))",
"legendFormat": "{{deployment}}"
}
]
}
]
}
}
Service Topology Visualization¶
Linkerd Viz (Topology View):
# Open Linkerd Viz dashboard
linkerd viz dashboard
# View service topology
linkerd viz edges deploy -n atp-production
# Tap live traffic
linkerd viz tap deploy/atp-gateway -n atp-production
Service Mesh GitOps Integration¶
Mesh Configuration in Git¶
Linkerd Configuration in GitOps:
atp-gitops/
├── platform/
│ ├── linkerd/
│ │ ├── kustomization.yaml
│ │ ├── control-plane.yaml
│ │ ├── service-profiles/
│ │ │ ├── atp-gateway.yaml
│ │ │ ├── atp-ingestion.yaml
│ │ │ └── atp-query.yaml
│ │ ├── authorization-policies/
│ │ │ └── default-policy.yaml
│ │ └── trafficsplits/
│ │ └── canary-split.yaml
Linkerd Kustomization:
# platform/linkerd/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- control-plane.yaml
- service-profiles/
- authorization-policies/
- trafficsplits/
commonLabels:
managed-by: kustomize
TrafficSplit Resources¶
TrafficSplit in GitOps:
# apps/atp-ingestion/overlays/production/trafficsplit.yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
name: atp-ingestion-split
namespace: atp-production
spec:
service: atp-ingestion
backends:
- service: atp-ingestion-v1
weight: 90
- service: atp-ingestion-v2
weight: 10
FluxCD Kustomization for TrafficSplit:
# clusters/production/kustomizations/apps-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
namespace: flux-system
spec:
interval: 5m
path: ./apps/atp-ingestion/overlays/production
sourceRef:
kind: GitRepository
name: atp-gitops-production
# TrafficSplit resources included in path
SMI (Service Mesh Interface)¶
SMI Resources Supported by Linkerd:
| SMI Resource | Linkerd Support | Use Case |
|---|---|---|
| TrafficSplit | ✅ Supported | Canary deployments |
| TrafficTarget | ✅ Supported | Access control |
| HTTPRouteGroup | ✅ Supported | HTTP routing rules |
| TCPRoute | ⚠️ Limited | TCP routing |
SMI TrafficTarget Example:
# apps/atp-gateway/smi-traffic-target.yaml
apiVersion: access.smi-spec.io/v1alpha3
kind: TrafficTarget
metadata:
name: atp-gateway-to-ingestion
namespace: atp-production
spec:
destination:
kind: ServiceAccount
name: atp-ingestion
namespace: atp-production
sources:
- kind: ServiceAccount
name: atp-gateway
namespace: atp-production
rules:
- kind: HTTPRouteGroup
name: atp-ingestion-routes
matches:
- ingestion-api
---
apiVersion: specs.smi-spec.io/v1alpha4
kind: HTTPRouteGroup
metadata:
name: atp-ingestion-routes
namespace: atp-production
spec:
matches:
- name: ingestion-api
methods:
- GET
- POST
pathRegex: "/api/ingestion/.*"
Multi-Cluster Networking¶
VNet Peering Between Environments¶
VNet Peering Configuration:
// infrastructure/VNetPeering.cs
using Pulumi;
using Pulumi.AzureNative.Network;
public class VNetPeering
{
public static VirtualNetworkPeering CreatePeering(
VirtualNetwork sourceVNet,
VirtualNetwork targetVNet,
ResourceGroup resourceGroup,
string peeringName)
{
return new VirtualNetworkPeering($"peering-{peeringName}", new VirtualNetworkPeeringArgs
{
ResourceGroupName = resourceGroup.Name,
VirtualNetworkName = sourceVNet.Name,
RemoteVirtualNetworkId = targetVNet.Id,
AllowVirtualNetworkAccess = true,
AllowForwardedTraffic = true,
AllowGatewayTransit = false,
UseRemoteGateways = false
});
}
}
VNet Peering Between Production and Staging:
#!/bin/bash
# scripts/create-vnet-peering.sh
SOURCE_RG="${1}"
SOURCE_VNET="${2}"
TARGET_RG="${3}"
TARGET_VNET="${4}"
echo "🔗 Creating VNet peering: ${SOURCE_VNET} <-> ${TARGET_VNET}"
# Get VNet IDs
SOURCE_VNET_ID=$(az network vnet show \
--resource-group "${SOURCE_RG}" \
--name "${SOURCE_VNET}" \
--query id -o tsv)
TARGET_VNET_ID=$(az network vnet show \
--resource-group "${TARGET_RG}" \
--name "${TARGET_VNET}" \
--query id -o tsv)
# Create peering from source to target
az network vnet peering create \
--resource-group "${SOURCE_RG}" \
--name "${SOURCE_VNET}-to-${TARGET_VNET}" \
--vnet-name "${SOURCE_VNET}" \
--remote-vnet "${TARGET_VNET_ID}" \
--allow-vnet-access \
--allow-forwarded-traffic
# Create peering from target to source
az network vnet peering create \
--resource-group "${TARGET_RG}" \
--name "${TARGET_VNET}-to-${SOURCE_VNET}" \
--vnet-name "${TARGET_VNET}" \
--remote-vnet "${SOURCE_VNET_ID}" \
--allow-vnet-access \
--allow-forwarded-traffic
echo "✅ VNet peering created"
Azure Virtual WAN¶
Virtual WAN Architecture:
graph TB
subgraph "Virtual WAN Hub"
VWAN[Azure Virtual WAN<br/>Hub]
end
subgraph "Production VNet"
PROD_VNET[Production VNet<br/>10.0.0.0/16]
PROD_AKS[Production AKS]
end
subgraph "Staging VNet"
STAGE_VNET[Staging VNet<br/>10.1.0.0/16]
STAGE_AKS[Staging AKS]
end
subgraph "On-Premises"
ONPREM[On-Premises<br/>Network]
end
PROD_VNET --> VWAN
STAGE_VNET --> VWAN
ONPREM --> VWAN
VWAN --> PROD_VNET
VWAN --> STAGE_VNET
VWAN --> ONPREM
style VWAN fill:#90EE90
Virtual WAN Configuration (Pulumi C# concept):
// infrastructure/VirtualWAN.cs
var virtualWan = new VirtualWan("atp-vwan", new VirtualWanArgs
{
ResourceGroupName = resourceGroup.Name,
Location = location,
Type = "Standard",
AllowBranchToBranchTraffic = true,
AllowVnetToVnetTraffic = true
});
var virtualHub = new VirtualHub("atp-vhub", new VirtualHubArgs
{
ResourceGroupName = resourceGroup.Name,
Location = location,
VirtualWanId = virtualWan.Id,
AddressPrefix = "10.100.0.0/24"
});
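The Pulumi snippet provisions the WAN and hub but not the spoke connections. A hedged az CLI sketch for attaching the VNets (hub, VNet, and resource group names are assumptions; the virtual-wan CLI extension may be required):
# Connect the production and staging VNets to the Virtual WAN hub
az network vhub connection create \
  --name prod-vnet-connection \
  --resource-group atp-production-rg \
  --vhub-name atp-vhub \
  --remote-vnet "$(az network vnet show -g atp-production-rg -n atp-production-vnet --query id -o tsv)"
az network vhub connection create \
  --name staging-vnet-connection \
  --resource-group atp-production-rg \
  --vhub-name atp-vhub \
  --remote-vnet "$(az network vnet show -g atp-staging-rg -n atp-staging-vnet --query id -o tsv)"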
Cross-Cluster Service Discovery¶
Linkerd Multi-Cluster Service Discovery:
#!/bin/bash
# scripts/setup-linkerd-multicluster.sh
# Install Linkerd Multicluster on production cluster
linkerd multicluster install | kubectl apply -f -
# Link staging cluster to production
linkerd multicluster link --cluster-name staging --api-server-address https://staging-api-server:6443 | kubectl apply -f -
# Verify multicluster status
linkerd multicluster check
echo "✅ Multi-cluster service discovery configured"
Service Export/Import (Kubernetes Multi-Cluster Services):
# apps/atp-gateway/service-export.yaml
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
name: atp-gateway
namespace: atp-production
spec: {}
---
# In staging cluster: Service Import
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceImport
metadata:
name: atp-gateway-production
namespace: atp-staging
spec:
type: ClusterSetIP
ports:
- port: 8080
protocol: TCP
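With an MCS-compliant controller and DNS integration in place, the import should resolve on the clusterset zone (KEP-1645 convention); a hedged check:
# Resolve the imported service via the clusterset DNS domain
kubectl run dns-test --rm -it --restart=Never -n atp-staging \
  --image=busybox:1.36 -- \
  nslookup atp-gateway-production.atp-staging.svc.clusterset.local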
Multi-Cluster Mesh¶
Linkerd Multi-Cluster Mesh:
graph TB
subgraph "Production Cluster"
PROD_CTRL[Linkerd Control Plane]
PROD_SVC[ATP Services]
end
subgraph "Staging Cluster"
STAGE_CTRL[Linkerd Control Plane]
STAGE_SVC[ATP Services]
end
subgraph "Linkerd Multicluster"
GATEWAY[Service Mirror<br/>Gateway]
end
PROD_CTRL <--> GATEWAY
STAGE_CTRL <--> GATEWAY
PROD_SVC <-->|mTLS| STAGE_SVC
style GATEWAY fill:#FFE5B4
Multi-Cluster Mesh Configuration:
#!/bin/bash
# platform/linkerd/multicluster/ — the link manifest is generated by the
# linkerd CLI and committed to Git rather than hand-written:
linkerd multicluster link \
  --cluster-name staging \
  --api-server-address https://staging-api-server:6443 \
  > platform/linkerd/multicluster/staging-link.yaml
# The generated file contains a Link resource (multicluster.linkerd.io/v1alpha1)
# plus the credentials Secret the service mirror uses to watch the staging
# cluster; the gateway itself listens on port 4143
Summary: Networking & Service Mesh¶
- AKS Networking Models: Azure CNI selected (VNet integration, multi-tenancy), kubenet comparison, subnet sizing for Azure CNI
- Ingress Controllers: NGINX Ingress Controller (ATP choice), Azure Application Gateway comparison, installation and configuration, TLS termination
- Certificate Management: cert-manager overview, Let's Encrypt integration (HTTP-01, DNS-01), ClusterIssuer configuration, automatic certificate renewal, certificate monitoring
- Network Policies: Default deny all policy, service-to-service allow rules, ingress and egress rules, DNS exceptions, monitoring and logging exceptions
- Service Mesh Options: Linkerd selected (lightweight, ATP preference), Istio comparison, Open Service Mesh comparison, selection matrix
- mTLS Between Services: Automatic mTLS with Linkerd, certificate rotation, identity and authorization policies
- Traffic Management: Canary routing with TrafficSplit, circuit breakers, retry policies, timeout configuration
- Observability with Service Mesh: Distributed tracing (Jaeger), metrics and dashboards (Linkerd Viz), service topology visualization
- Service Mesh GitOps Integration: Mesh configuration in Git, TrafficSplit resources, SMI (Service Mesh Interface) support
- Multi-Cluster Networking: VNet peering between environments, Azure Virtual WAN, cross-cluster service discovery, multi-cluster mesh with Linkerd
Storage & StatefulSets in GitOps¶
Purpose: Define storage architecture, PersistentVolumes and PersistentVolumeClaims, StatefulSet deployment patterns, database deployments, backup and restore procedures, volume snapshots, data migration strategies, and disaster recovery for persistent data in ATP's GitOps deployments, ensuring reliable, scalable, and recoverable stateful workloads.
Persistent Volumes (PV) and Claims (PVC)¶
PV and PVC Concepts¶
Persistent Volume (PV) vs Persistent Volume Claim (PVC):
graph TB
subgraph "Storage Provider"
AZDISK[Azure Disk<br/>or Azure Files]
end
subgraph "Kubernetes Cluster"
PV[PersistentVolume<br/>Cluster Resource]
PVC[PersistentVolumeClaim<br/>Namespace Resource]
POD[Pod<br/>Application]
end
AZDISK --> PV
PVC --> PV
POD --> PVC
style PV fill:#FFE5B4
style PVC fill:#90EE90
style POD fill:#87CEEB
PV and PVC Relationship:
| Resource | Scope | Purpose | Managed By |
|---|---|---|---|
| PersistentVolume (PV) | Cluster-wide | Represents actual storage | Admin/Storage Provisioner |
| PersistentVolumeClaim (PVC) | Namespace | Request for storage | Developer/Application |
| StorageClass | Cluster-wide | Defines storage provisioner | Admin |
PVC Lifecycle:
- Create PVC → Kubernetes matches with available PV or creates new PV
- Bind → PVC bound to PV
- Use → Pod mounts PVC
- Release → Pod terminates, PVC remains (Retain policy)
- Reclaim → PV reclaimed based on reclaim policy
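Inspecting the Lifecycle (a minimal kubectl sketch; the claim name assumes the dynamic-provisioning example below):
# Watch a claim move from Pending to Bound
kubectl -n atp-production get pvc atp-ingestion-data -w
# Inspect bound PVs, their claims, reclaim policies, and phase
kubectl get pv -o custom-columns=NAME:.metadata.name,CLAIM:.spec.claimRef.name,POLICY:.spec.persistentVolumeReclaimPolicy,STATUS:.status.phase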
Dynamic Provisioning¶
Dynamic Provisioning Flow:
sequenceDiagram
participant Dev as Developer
participant K8s as Kubernetes API
participant SC as StorageClass
participant Prov as Provisioner
participant Azure as Azure Disk/Files
participant Pod as Pod
Dev->>K8s: Create PVC
K8s->>SC: Match StorageClass
SC->>Prov: Provision volume
Prov->>Azure: Create Azure Disk/File
Azure-->>Prov: Volume created
Prov->>K8s: Create PV
K8s->>PVC: Bind PVC to PV
Dev->>K8s: Create Pod with PVC
K8s->>Pod: Mount volume
Dynamic Provisioning Example:
# apps/atp-ingestion/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: atp-ingestion-data
namespace: atp-production
spec:
accessModes:
- ReadWriteOnce
storageClassName: managed-premium # Triggers dynamic provisioning
resources:
requests:
storage: 100Gi
Static vs Dynamic Provisioning:
| Provisioning Type | Use Case | ATP Preference |
|---|---|---|
| Static | Pre-created PVs, manual management | ❌ Not used |
| Dynamic | On-demand PV creation via StorageClass | ✅ Preferred |
ATP Decision: Use dynamic provisioning for all workloads - simpler, scalable, automated.
Storage Classes¶
StorageClass Definition:
# platform/storage/storageclass-premium.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: managed-premium
annotations:
storageclass.kubernetes.io/is-default-class: "false"
provisioner: disk.csi.azure.com
parameters:
skuname: Premium_LRS # Premium SSD
kind: managed
cachingMode: ReadOnly
diskEncryptionSetID: /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.Compute/diskEncryptionSets/atp-disk-encryption
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer # Wait until pod is scheduled
StorageClass Options:
# Standard HDD
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: managed-standard
provisioner: disk.csi.azure.com
parameters:
skuname: Standard_LRS # Standard HDD
kind: managed
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
# Premium SSD
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: managed-premium
provisioner: disk.csi.azure.com
parameters:
skuname: Premium_LRS # Premium SSD
kind: managed
reclaimPolicy: Retain
---
# Azure Files (SMB)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: azurefile-csi
provisioner: file.csi.azure.com
parameters:
skuname: Premium_LRS # Premium Files
storageAccount: atpstorageaccount # Optional: specific storage account
reclaimPolicy: Delete
allowVolumeExpansion: true
---
# Azure Files (NFS)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: azurefile-csi-nfs
provisioner: file.csi.azure.com
parameters:
protocol: nfs
skuname: Premium_LRS
reclaimPolicy: Delete
Access Modes¶
PVC Access Modes:
| Access Mode | Description | Use Case | Supported by |
|---|---|---|---|
| ReadWriteOnce (RWO) | Single node read-write | Single pod, databases | Azure Disk |
| ReadOnlyMany (ROX) | Multiple nodes read-only | Shared config, readonly data | Azure Files |
| ReadWriteMany (RWX) | Multiple nodes read-write | Shared storage, file shares | Azure Files |
| ReadWriteOncePod (RWOP) | Single pod read-write | Kubernetes 1.22+ | Azure Disk |
Access Mode Selection:
# Single pod (database)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-data
spec:
accessModes:
- ReadWriteOnce # Single pod mount
storageClassName: managed-premium
resources:
requests:
storage: 500Gi
---
# Multiple pods (shared files)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: shared-storage
spec:
accessModes:
- ReadWriteMany # Multiple pods can mount
storageClassName: azurefile-csi
resources:
requests:
storage: 100Gi
Azure Disk vs Azure Files¶
Azure Disk (Block Storage, Single Mount)¶
Azure Disk Characteristics:
| Aspect | Azure Disk | Description |
|---|---|---|
| Type | Block storage | Direct-attached disk |
| Mount | Single pod | RWO (ReadWriteOnce) |
| Performance | ✅ High IOPS | Up to 20,000 IOPS (Premium SSD) |
| Latency | ✅ Low latency | < 1ms |
| Use Case | Databases, single-pod apps | PostgreSQL, MongoDB, Redis |
Azure Disk StorageClass:
# platform/storage/storageclass-premium-disk.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: managed-premium
annotations:
storageclass.kubernetes.io/is-default-class: "false"
provisioner: disk.csi.azure.com
parameters:
skuname: Premium_LRS
kind: managed
cachingMode: ReadOnly # Optimize for database workloads
diskEncryptionSetID: ${DISK_ENCRYPTION_SET_ID}
allowVolumeExpansion: true
reclaimPolicy: Retain # Keep data on PVC deletion
volumeBindingMode: WaitForFirstConsumer # Zone-aware scheduling
Azure Files (Shared Storage, Multi-Mount)¶
Azure Files Characteristics:
| Aspect | Azure Files | Description |
|---|---|---|
| Type | File storage | Network file share |
| Mount | Multiple pods | RWX (ReadWriteMany) |
| Protocol | SMB or NFS | Protocol selection |
| Performance | ⚠️ Network-bound | Up to 100,000 IOPS (Premium), higher per-operation latency than local disk |
| Latency | ⚠️ Higher latency | Network latency |
| Use Case | Shared files, config | Content storage, logs |
Azure Files StorageClass:
# platform/storage/storageclass-premium-files.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: azurefile-premium
provisioner: file.csi.azure.com
parameters:
skuname: Premium_LRS
protocol: smb # or "nfs"
# Optional: specific storage account
# storageAccount: atpstorageaccount
reclaimPolicy: Delete
allowVolumeExpansion: true
Performance Characteristics¶
Performance Comparison:
| Storage Type | SKU | IOPS | Throughput | Latency | ATP Use Case |
|---|---|---|---|---|---|
| Azure Disk (Premium SSD) | Premium_LRS | 20,000 IOPS | 900 MB/s | < 1ms | ✅ Databases |
| Azure Disk (Standard SSD) | StandardSSD_LRS | 6,000 IOPS | 750 MB/s | < 5ms | ⚠️ Dev/Test |
| Azure Files (Premium) | Premium_LRS | 100,000 IOPS | 10,240 MiB/s | < 10ms | ✅ Shared storage |
| Azure Files (Standard) | Standard_LRS | 1,000 IOPS | 60 MiB/s | < 20ms | ⚠️ Dev/Test |
ATP Performance Requirements:
- Database workloads: Premium SSD (Azure Disk) - High IOPS, low latency
- Shared files: Premium Files (Azure Files) - Multiple mounts, good performance
- Dev/Test: Standard SSD (Azure Disk) - Cost-effective
Cost Comparison¶
Storage Cost Comparison (per GB/month):
| Storage Type | SKU | Cost (East US) | Use Case |
|---|---|---|---|
| Azure Disk Premium SSD | Premium_LRS | $0.17/GB | Production databases |
| Azure Disk Standard SSD | StandardSSD_LRS | $0.06/GB | Dev/Test databases |
| Azure Files Premium | Premium_LRS | $0.19/GB | Production file shares |
| Azure Files Standard | Standard_LRS | $0.06/GB | Dev/Test file shares |
Cost Optimization Strategy:
- Production databases: Premium SSD (required for performance)
- Dev/Test databases: Standard SSD (cost savings)
- Shared storage: Premium Files for production, Standard for dev/test
Use Case Selection¶
Storage Selection Matrix:
| Use Case | Recommended Storage | Access Mode | Rationale |
|---|---|---|---|
| PostgreSQL | Azure Disk Premium | RWO | High IOPS, single pod |
| MongoDB | Azure Disk Premium | RWO | High IOPS, single pod |
| Redis | Azure Disk Premium | RWO | Low latency, single pod |
| Shared Logs | Azure Files Premium | RWX | Multiple pods, shared access |
| Config Files | Azure Files Standard | RWX | Low cost, shared access |
| Backups | Azure Files Premium | RWX | Multiple pods, shared access |
ATP Decision Matrix:
| Component | Storage Type | StorageClass | Size |
|---|---|---|---|
| PostgreSQL | Azure Disk | managed-premium | 500Gi |
| MongoDB | Azure Disk | managed-premium | 1Ti |
| Redis | Azure Disk | managed-premium | 100Gi |
| Shared Logs | Azure Files | azurefile-premium | 500Gi |
| Backups | Azure Files | azurefile-premium | 2Ti |
Storage Classes¶
Performance Tiers (Standard, Premium, Ultra)¶
Storage Class Performance Tiers:
# Standard HDD (Lowest cost, lowest performance)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: managed-standard
provisioner: disk.csi.azure.com
parameters:
skuname: Standard_LRS
kind: managed
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
# Standard SSD (Balanced)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: managed-standard-ssd
provisioner: disk.csi.azure.com
parameters:
skuname: StandardSSD_LRS
kind: managed
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
# Premium SSD (High performance)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: managed-premium
provisioner: disk.csi.azure.com
parameters:
skuname: Premium_LRS
kind: managed
cachingMode: ReadOnly
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
---
# Ultra SSD (Highest performance)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: managed-ultra
provisioner: disk.csi.azure.com
parameters:
skuname: UltraSSD_LRS
kind: managed
cachingMode: None # Ultra SSD doesn't support caching
diskIopsReadWrite: "5000" # IOPS limit
diskMbpsReadWrite: "200" # Throughput limit (MB/s)
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
Performance Tier Comparison:
| Tier | SKU | IOPS | Throughput | Latency | Cost | ATP Use Case |
|---|---|---|---|---|---|---|
| Standard HDD | Standard_LRS | 500 | 60 MB/s | Variable | $0.04/GB | ❌ Not used |
| Standard SSD | StandardSSD_LRS | 6,000 | 750 MB/s | < 5ms | $0.06/GB | ✅ Dev/Test |
| Premium SSD | Premium_LRS | 20,000 | 900 MB/s | < 1ms | $0.17/GB | ✅ Production |
| Ultra SSD | UltraSSD_LRS | 160,000 | 2,000 MB/s | < 0.5ms | $0.24/GB | ⚠️ High-performance only |
ATP Decision: Use Premium SSD for production databases, Standard SSD for dev/test.
Encryption Configuration¶
Encryption at Rest with Disk Encryption Set:
# platform/storage/storageclass-encrypted.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: managed-premium-encrypted
provisioner: disk.csi.azure.com
parameters:
skuname: Premium_LRS
kind: managed
diskEncryptionSetID: /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.Compute/diskEncryptionSets/atp-disk-encryption
cachingMode: ReadOnly
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
Pulumi C# Disk Encryption Set:
// infrastructure/DiskEncryption.cs
var diskEncryptionSet = new DiskEncryptionSet("atp-disk-encryption", new DiskEncryptionSetArgs
{
ResourceGroupName = resourceGroup.Name,
Location = location,
Identity = new EncryptionSetIdentityArgs
{
Type = "SystemAssigned"
},
ActiveKey = new KeyVaultAndKeyReferenceArgs
{
KeyUrl = keyVaultKey.Uri,
SourceVault = new SourceVaultArgs
{
Id = keyVault.Id
}
},
EncryptionType = "EncryptionAtRestWithCustomerKey"
});
// Grant Key Vault access to Disk Encryption Set
var keyVaultAccessPolicy = new KeyVaultAccessPolicyArgs
{
TenantId = tenantId,
ObjectId = diskEncryptionSet.Identity.PrincipalId,
Permissions = new KeyVaultPermissionsArgs
{
Keys = new[] { "Get", "WrapKey", "UnwrapKey" }
}
};
Snapshot Support¶
Volume Snapshot Class:
# platform/storage/volumesnapshotclass.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: azure-disk-snapshot
driver: disk.csi.azure.com
deletionPolicy: Retain # or Delete
parameters:
incremental: "true" # Incremental snapshots (cost-effective)
resourceGroup: atp-production-rg
storageAccount: atpsnapshots # Optional: specific storage account
Create Volume Snapshot:
# apps/atp-ingestion/volumesnapshot.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: postgres-data-snapshot-20240115
namespace: atp-production
spec:
volumeSnapshotClassName: azure-disk-snapshot
source:
persistentVolumeClaimName: postgres-data
Provisioner Settings¶
Azure Disk CSI Driver Provisioner Settings:
# platform/storage/storageclass-advanced.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: managed-premium-advanced
provisioner: disk.csi.azure.com
parameters:
skuname: Premium_LRS
kind: managed
cachingMode: ReadOnly # ReadOnly, ReadWrite, None
diskEncryptionSetID: ${DISK_ENCRYPTION_SET_ID}
diskIOPSReadWrite: "5000" # Optional: IOPS limit
diskMBpsReadWrite: "200" # Optional: Throughput limit
networkAccessPolicy: "DenyAll" # DenyAll, AllowPrivate, AllowAll
publicNetworkAccess: "Disabled"
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer # Zone-aware
Volume Binding Modes:
| Binding Mode | Description | Use Case | ATP Selection |
|---|---|---|---|
| Immediate | Bind immediately | Static provisioning | ⚠️ Not used |
| WaitForFirstConsumer | Bind when pod scheduled | Zone-aware, topology | ✅ Preferred |
ATP Recommendation: Use WaitForFirstConsumer for zone-aware scheduling and topology constraints.
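The practical effect of WaitForFirstConsumer is that a claim stays Pending until a pod that mounts it is scheduled, letting the provisioner pick the node's zone; a short sketch:
# A PVC on a WaitForFirstConsumer StorageClass stays Pending with no consumer
kubectl -n atp-production get pvc atp-ingestion-data
# The pending reason is surfaced as an event on the claim
kubectl -n atp-production get events --field-selector reason=WaitForFirstConsumer
# Once a pod mounting the claim is scheduled, provisioning runs in that pod's zone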
StatefulSet Deployment Patterns¶
Ordered Deployment and Scaling¶
StatefulSet Ordered Deployment:
# apps/postgresql/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgresql
namespace: atp-production
spec:
serviceName: postgresql
replicas: 3
podManagementPolicy: OrderedReady # Sequential creation (default)
# podManagementPolicy: Parallel # Parallel creation (optional)
selector:
matchLabels:
app: postgresql
template:
metadata:
labels:
app: postgresql
spec:
containers:
- name: postgresql
image: postgres:15
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: managed-premium
resources:
requests:
storage: 100Gi
StatefulSet Scaling Order:
sequenceDiagram
participant K8s as Kubernetes
participant Pod0 as postgresql-0
participant Pod1 as postgresql-1
participant Pod2 as postgresql-2
K8s->>Pod0: Create and wait for Ready
Pod0-->>K8s: Ready
K8s->>Pod1: Create and wait for Ready
Pod1-->>K8s: Ready
K8s->>Pod2: Create and wait for Ready
Pod2-->>K8s: Ready
Ordered Scaling Behavior:
- Scale Up: Creates pods sequentially (0, 1, 2...)
- Scale Down: Deletes pods in reverse order (2, 1, 0...)
- Ensures: Each pod is ready before creating the next
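Observing Ordered Scaling (a minimal sketch against the StatefulSet above):
# Scale up: pods are created one at a time, each waiting for the previous to be Ready
kubectl -n atp-production scale statefulset postgresql --replicas=5
kubectl -n atp-production get pods -l app=postgresql -w
# Scale down: highest ordinals terminate first (postgresql-4, then postgresql-3)
kubectl -n atp-production scale statefulset postgresql --replicas=3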
Stable Network Identity¶
Headless Service for StatefulSet:
# apps/postgresql/service.yaml
apiVersion: v1
kind: Service
metadata:
name: postgresql
namespace: atp-production
spec:
clusterIP: None # Headless service
selector:
app: postgresql
ports:
- port: 5432
name: postgresql
Stable Network Identity:
# StatefulSet pods get stable DNS names
# postgresql-0.postgresql.atp-production.svc.cluster.local
# postgresql-1.postgresql.atp-production.svc.cluster.local
# postgresql-2.postgresql.atp-production.svc.cluster.local
Accessing StatefulSet Pods:
# Access specific pod
psql -h postgresql-0.postgresql.atp-production.svc.cluster.local
# Access any pod via service
psql -h postgresql.atp-production.svc.cluster.local
Persistent Storage per Pod¶
StatefulSet with Volume Claim Templates:
# apps/postgresql/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgresql
namespace: atp-production
spec:
serviceName: postgresql
replicas: 3
selector:
matchLabels:
app: postgresql
template:
metadata:
labels:
app: postgresql
spec:
containers:
- name: postgresql
image: postgres:15
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
- name: config
mountPath: /etc/postgresql
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: managed-premium
resources:
requests:
storage: 100Gi
- metadata:
name: config
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: managed-premium
resources:
requests:
storage: 10Gi
PVCs Created Automatically:
data-postgresql-0 # Persistent volume for pod 0
data-postgresql-1 # Persistent volume for pod 1
data-postgresql-2 # Persistent volume for pod 2
config-postgresql-0
config-postgresql-1
config-postgresql-2
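These claims outlive the StatefulSet itself and are only removed explicitly; a quick listing sketch:
# PVCs from volumeClaimTemplates follow the <template>-<statefulset>-<ordinal> pattern
kubectl -n atp-production get pvc | grep -E '^(data|config)-postgresql-'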
Headless Service Configuration¶
Headless Service Pattern:
# apps/postgresql/service-headless.yaml
apiVersion: v1
kind: Service
metadata:
name: postgresql
namespace: atp-production
spec:
clusterIP: None # Headless - no load balancing
selector:
app: postgresql
ports:
- port: 5432
targetPort: 5432
name: postgresql
Service Discovery with Headless Service:
# StatefulSet pod discovery
apiVersion: v1
kind: Service
metadata:
name: postgresql-read
namespace: atp-production
spec:
selector:
app: postgresql
role: replica # Read replicas only
ports:
- port: 5432
name: postgresql
---
# StatefulSet pod discovery
apiVersion: v1
kind: Service
metadata:
name: postgresql-write
namespace: atp-production
spec:
selector:
app: postgresql
role: primary # Primary only
ports:
- port: 5432
name: postgresql
Database Deployments in Kubernetes¶
PostgreSQL Operator¶
PostgreSQL Operator (Crunchy Data):
#!/bin/bash
# scripts/install-postgres-operator.sh
# Install the Crunchy Data PostgreSQL Operator (PGO) from its Kustomize
# install manifests (https://github.com/CrunchyData/postgres-operator-examples);
# pin a release tag rather than the default branch for reproducible installs
kubectl apply -k "github.com/CrunchyData/postgres-operator-examples/kustomize/install/namespace"
kubectl apply --server-side -k "github.com/CrunchyData/postgres-operator-examples/kustomize/install/default"
echo "✅ PostgreSQL Operator installed"
PostgreSQL Cluster via Operator:
# apps/postgresql/postgrescluster.yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
name: atp-postgres
namespace: atp-production
spec:
image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-15.4-0
postgresVersion: 15
instances:
- name: instance1
replicas: 3
dataVolumeClaimSpec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 500Gi
storageClassName: managed-premium
resources:
requests:
cpu: 2000m
memory: 4Gi
limits:
cpu: 4000m
memory: 8Gi
backups:
pgbackrest:
image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.47-0
repos:
- name: repo1
volume:
volumeClaimSpec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Ti
storageClassName: managed-premium
global:
repo1-retention-full: "7"
repo1-retention-full-type: count
monitoring:
pgMonitor:
exporter:
image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-5.3.0-0
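PGO publishes connection credentials as a Secret named <cluster>-pguser-<user> (the default user matches the cluster name); a hedged retrieval sketch:
# Read the generated connection URI for the default user
kubectl -n atp-production get secret atp-postgres-pguser-atp-postgres \
  -o go-template='{{.data.uri | base64decode}}{{"\n"}}'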
MongoDB Operator¶
MongoDB Community Operator:
#!/bin/bash
# scripts/install-mongodb-operator.sh
# Install MongoDB Community Operator
kubectl apply -f https://raw.githubusercontent.com/mongodb/mongodb-kubernetes-operator/master/config/crd/bases/mongodbcommunity.mongodb.com_mongodbcommunity.yaml
# Install operator
kubectl create namespace mongodb-operator
kubectl apply -f https://raw.githubusercontent.com/mongodb/mongodb-kubernetes-operator/master/config/manager/manager.yaml -n mongodb-operator
echo "✅ MongoDB Operator installed"
MongoDB ReplicaSet via Operator:
# apps/mongodb/mongodbcommunity.yaml
apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
name: atp-mongodb
namespace: atp-production
spec:
members: 3
type: ReplicaSet
version: "7.0.0"
security:
authentication:
modes: ["SCRAM"]
users:
- name: atp-user
db: admin
passwordSecretRef:
name: mongodb-password
roles:
- name: readWriteAnyDatabase
db: admin
additionalMongodConfig:
storage.wiredTiger.engineConfig.journalCompressor: snappy
storage.wiredTiger.collectionConfig.blockCompressor: snappy
statefulSet:
spec:
volumeClaimTemplates:
- metadata:
name: data-volume
spec:
accessModes:
- ReadWriteOnce
storageClassName: managed-premium
resources:
requests:
storage: 500Gi
resources:
requests:
cpu: 2000m
memory: 4Gi
limits:
cpu: 4000m
memory: 8Gi
Redis Deployment¶
Redis StatefulSet:
# apps/redis/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: redis
namespace: atp-production
spec:
serviceName: redis
replicas: 3
selector:
matchLabels:
app: redis
template:
metadata:
labels:
app: redis
spec:
containers:
- name: redis
image: redis:7-alpine
command:
- redis-server
- /etc/redis/redis.conf
- --appendonly
- "yes"
ports:
- containerPort: 6379
name: redis
volumeMounts:
- name: data
mountPath: /data
- name: config
mountPath: /etc/redis
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 2000m
            memory: 2Gi
      volumes:
      - name: config
        configMap:
          name: redis-config # assumed ConfigMap providing redis.conf; the /etc/redis mount needs a backing volume
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes:
- ReadWriteOnce
storageClassName: managed-premium
resources:
requests:
storage: 100Gi
Redis Sentinel Configuration:
# apps/redis/redis-sentinel.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: redis-sentinel
namespace: atp-production
spec:
serviceName: redis-sentinel
replicas: 3
selector:
matchLabels:
app: redis-sentinel
template:
metadata:
labels:
app: redis-sentinel
spec:
containers:
- name: sentinel
image: redis:7-alpine
command:
- redis-sentinel
- /etc/redis/sentinel.conf
ports:
- containerPort: 26379
name: sentinel
        volumeMounts:
        - name: config
          mountPath: /etc/redis
      volumes:
      - name: config
        configMap:
          name: redis-sentinel-config # assumed ConfigMap providing sentinel.conf; without it the mount has no source
StatefulSet vs Managed Service Decision¶
Kubernetes vs Azure Managed Services:
| Aspect | Kubernetes (StatefulSet) | Azure Managed Service | ATP Decision |
|---|---|---|---|
| PostgreSQL | PostgreSQL Operator | Azure Database for PostgreSQL | ✅ Managed Service |
| MongoDB | MongoDB Operator | Azure Cosmos DB (MongoDB API) | ✅ Managed Service |
| Redis | Redis StatefulSet | Azure Cache for Redis | ✅ Managed Service |
| Control | ✅ Full control | ⚠️ Limited | Acceptable trade-off |
| Operations | ⚠️ Self-managed | ✅ Managed | ✅ Managed Service |
| Cost | ⚠️ Higher (infra + ops) | ✅ Lower (includes ops) | ✅ Managed Service |
| Environment fit | ⚠️ Dev/Test only | ✅ Production | Managed in production, StatefulSets in dev/test |
ATP Decision: Use Azure managed services for production databases (PostgreSQL, MongoDB, Redis) - lower operational overhead, better SLA, automated backups. Use Kubernetes StatefulSets for dev/test environments.
Backup and Restore Procedures¶
Velero for Cluster Backups¶
Velero Installation:
#!/bin/bash
# scripts/install-velero.sh
# Install Velero CLI
curl -fsSL -o velero-v1.12.0-linux-amd64.tar.gz \
https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar -xvf velero-v1.12.0-linux-amd64.tar.gz
sudo mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/
# Create Azure Storage Account for Velero backups
az storage account create \
--name atpvelerobackups \
--resource-group atp-production-rg \
--sku Standard_LRS \
--location eastus
# Create blob container
az storage container create \
--name velero \
--account-name atpvelerobackups
# Install Velero
velero install \
--provider azure \
--plugins velero/velero-plugin-for-microsoft-azure:v1.7.0 \
--bucket velero \
--secret-file ./credentials-velero \
--backup-location-config resourceGroup=atp-production-rg,storageAccount=atpvelerobackups \
--snapshot-location-config apiTimeout=5m,resourceGroup=atp-production-rg \
--use-volume-snapshots=true
echo "✅ Velero installed"
Velero via Helm:
# platform/velero/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
name: velero
namespace: velero
spec:
interval: 5m
chart:
spec:
chart: velero
sourceRef:
kind: HelmRepository
name: vmware-tanzu
version: 5.1.1
values:
configuration:
provider: azure
backupStorageLocation:
bucket: velero
config:
resourceGroup: atp-production-rg
storageAccount: atpvelerobackups
volumeSnapshotLocation:
config:
apiTimeout: 5m
resourceGroup: atp-production-rg
initContainers:
- name: velero-plugin-for-microsoft-azure
image: velero/velero-plugin-for-microsoft-azure:v1.7.0
volumeMounts:
- mountPath: /target
name: plugins
credentials:
secretContents:
cloud: |
# Azure credentials
Volume Snapshots¶
Velero Backup with Volume Snapshots:
#!/bin/bash
# scripts/create-velero-backup.sh
BACKUP_NAME="atp-production-backup-$(date +%Y%m%d-%H%M%S)"
NAMESPACE="atp-production"
echo "💾 Creating Velero backup: ${BACKUP_NAME}"
# Create backup
velero backup create "${BACKUP_NAME}" \
--namespace "${NAMESPACE}" \
--include-namespaces "${NAMESPACE}" \
--snapshot-volumes \
--wait
# Check backup status
velero backup describe "${BACKUP_NAME}"
echo "✅ Backup created: ${BACKUP_NAME}"
Scheduled Backups:
# platform/velero/backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: daily-production-backup
namespace: velero
spec:
schedule: "0 2 * * *" # 2 AM daily
template:
includedNamespaces:
- atp-production
snapshotVolumes: true
ttl: 720h # 30 days retention
metadata:
labels:
backup-type: daily
environment: production
Backup Scheduling¶
Backup Schedule Configuration:
# platform/velero/schedules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: velero-schedules
namespace: velero
data:
daily-backup.yaml: |
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: daily-production-backup
namespace: velero
spec:
schedule: "0 2 * * *" # 2 AM daily
template:
includedNamespaces:
- atp-production
snapshotVolumes: true
ttl: 720h # 30 days
weekly-backup.yaml: |
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: weekly-production-backup
namespace: velero
spec:
schedule: "0 3 * * 0" # 3 AM Sunday
template:
includedNamespaces:
- atp-production
snapshotVolumes: true
ttl: 2160h # 90 days
Restore Procedures¶
Velero Restore Procedure:
#!/bin/bash
# scripts/restore-from-velero.sh
BACKUP_NAME="${1}"
NAMESPACE="${2:-atp-production}"
if [ -z "${BACKUP_NAME}" ]; then
echo "Usage: $0 <backup-name> [namespace]"
echo ""
echo "Available backups:"
velero backup get
exit 1
fi
echo "🔄 Restoring from backup: ${BACKUP_NAME}"
# List backups
velero backup get
# Restore from backup
velero restore create "restore-${BACKUP_NAME}-$(date +%Y%m%d-%H%M%S)" \
--from-backup "${BACKUP_NAME}" \
--namespace-mappings "${NAMESPACE}:${NAMESPACE}-restored" \
--wait
echo "✅ Restore initiated"
echo " Check status: velero restore get"
Restore to Different Namespace:
# Restore production backup to test namespace
velero restore create restore-production-to-test \
--from-backup daily-production-backup-20240115 \
--namespace-mappings atp-production:atp-test \
--wait
Volume Snapshots¶
Creating Snapshots¶
Manual Volume Snapshot:
# apps/postgresql/volumesnapshot-manual.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: postgres-data-snapshot-20240115
namespace: atp-production
spec:
volumeSnapshotClassName: azure-disk-snapshot
source:
persistentVolumeClaimName: postgres-data-postgresql-0
Create Snapshot Script:
#!/bin/bash
# scripts/create-volume-snapshot.sh
PVC_NAME="${1}"
NAMESPACE="${2}"
SNAPSHOT_NAME="${3:-${PVC_NAME}-snapshot-$(date +%Y%m%d-%H%M%S)}"
if [ -z "${PVC_NAME}" ] || [ -z "${NAMESPACE}" ]; then
echo "Usage: $0 <pvc-name> <namespace> [snapshot-name]"
exit 1
fi
echo "📸 Creating volume snapshot: ${SNAPSHOT_NAME}"
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: ${SNAPSHOT_NAME}
namespace: ${NAMESPACE}
spec:
volumeSnapshotClassName: azure-disk-snapshot
source:
persistentVolumeClaimName: ${PVC_NAME}
EOF
# Wait for snapshot to be ready
kubectl wait volumesnapshot/${SNAPSHOT_NAME} \
-n "${NAMESPACE}" \
--for=condition=Ready \
--timeout=300s
echo "✅ Snapshot created: ${SNAPSHOT_NAME}"
Snapshot Classes¶
Snapshot Class Configuration:
# platform/storage/volumesnapshotclass-premium.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: azure-disk-snapshot-premium
driver: disk.csi.azure.com
deletionPolicy: Retain # Keep snapshot after PVC deletion
parameters:
incremental: "true" # Incremental snapshots
resourceGroup: atp-production-rg
# Optional: specific storage account for snapshots
# storageAccount: atpsnapshots
---
# platform/storage/volumesnapshotclass-standard.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: azure-disk-snapshot-standard
driver: disk.csi.azure.com
deletionPolicy: Delete # Delete snapshot when PVC is deleted
parameters:
incremental: "true"
resourceGroup: atp-production-rg
Snapshot Class Selection:
| SnapshotClass | Deletion Policy | Use Case | ATP Selection |
|---|---|---|---|
| azure-disk-snapshot-premium | Retain | Production backups | ✅ Production |
| azure-disk-snapshot-standard | Delete | Dev/Test snapshots | ✅ Dev/Test |
Azure Backup Integration¶
Azure Backup for AKS Volumes:
#!/bin/bash
# scripts/setup-azure-backup.sh
# Create Recovery Services Vault
az backup vault create \
--name atp-backup-vault \
--resource-group atp-production-rg \
--location eastus
# Enable backup for AKS volumes
az backup protection enable-for-azurefileshare \
  --vault-name atp-backup-vault \
  --resource-group atp-production-rg \
  --storage-account atpstorageaccount \
  --azure-file-share postgres-backup \
  --policy-name DefaultPolicy
echo "✅ Azure Backup configured"
Backup Policy:
# Create backup policy (daily, 30-day retention)
az backup policy create \
--vault-name atp-backup-vault \
--resource-group atp-production-rg \
--name daily-policy \
--backup-management-type AzureStorage \
--workload-type AzureFileShare \
--policy '{
"name": "daily-policy",
"recoveryPointType": "FileSystemConsistent",
"schedulePolicy": {
"scheduleRunFrequency": "Daily",
"scheduleRunTimes": ["02:00"]
},
"retentionPolicy": {
"dailySchedule": {
"retentionDuration": {
"count": 30,
"durationType": "Days"
}
}
}
}'
Snapshot Retention¶
Snapshot Retention Policies:
| Environment | Retention Period | Rationale |
|---|---|---|
| Production | 90 days | Long-term recovery |
| Staging | 30 days | Shorter retention |
| Test | 7 days | Minimal retention |
| Dev | 3 days | Very short retention |
Automated Snapshot Cleanup:
#!/bin/bash
# scripts/cleanup-old-snapshots.sh
NAMESPACE="${1:-atp-production}"
RETENTION_DAYS="${2:-30}"
echo "🧹 Cleaning up old snapshots (older than ${RETENTION_DAYS} days)..."
# Get all snapshots older than retention period
OLD_SNAPSHOTS=$(kubectl get volumesnapshot -n "${NAMESPACE}" -o json | \
jq -r ".items[] | select(.metadata.creationTimestamp < \"$(date -d "${RETENTION_DAYS} days ago" -u +%Y-%m-%dT%H:%M:%SZ)\") | .metadata.name")
for SNAPSHOT in ${OLD_SNAPSHOTS}; do
echo "🗑️ Deleting snapshot: ${SNAPSHOT}"
kubectl delete volumesnapshot "${SNAPSHOT}" -n "${NAMESPACE}" || true
done
echo "✅ Snapshot cleanup complete"
Data Migration Strategies¶
Migrating Data Between Versions¶
Database Migration Strategy:
sequenceDiagram
participant Old as Old Version<br/>PostgreSQL 14
participant Snapshot as Volume Snapshot
participant New as New Version<br/>PostgreSQL 15
participant Data as Data Migration
Old->>Snapshot: Create snapshot
Snapshot->>New: Clone volume
New->>Data: Mount snapshot
Data->>New: Migrate schema
New->>Data: Migrate data
Data->>New: Validate
PostgreSQL Version Migration:
#!/bin/bash
# scripts/migrate-postgres-version.sh
OLD_VERSION="14"
NEW_VERSION="15"
NAMESPACE="atp-production"
PVC_NAME="postgres-data-postgresql-0"
echo "🔄 Migrating PostgreSQL ${OLD_VERSION} → ${NEW_VERSION}"
# Step 1: Create snapshot of current data
echo "📸 Step 1: Creating snapshot..."
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: postgres-migration-snapshot
namespace: ${NAMESPACE}
spec:
volumeSnapshotClassName: azure-disk-snapshot-premium
source:
persistentVolumeClaimName: ${PVC_NAME}
EOF
kubectl wait volumesnapshot/postgres-migration-snapshot \
-n "${NAMESPACE}" \
--for=condition=Ready \
--timeout=300s
# Step 2: Scale down old StatefulSet
echo "⏸️ Step 2: Scaling down old StatefulSet..."
kubectl scale statefulset postgresql-${OLD_VERSION} --replicas=0 -n "${NAMESPACE}"
# Step 3: Create new StatefulSet from snapshot
echo "🆕 Step 3: Creating new StatefulSet from snapshot..."
# (Create new StatefulSet YAML with PostgreSQL ${NEW_VERSION})
# Step 4: Restore data from snapshot
echo "📥 Step 4: Restoring data from snapshot..."
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-data-postgresql-0-new
namespace: ${NAMESPACE}
spec:
dataSource:
name: postgres-migration-snapshot
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
storageClassName: managed-premium
resources:
requests:
storage: 500Gi
EOF
echo "✅ Migration initiated"
Volume Cloning¶
Volume Clone from Snapshot:
# apps/postgresql/pvc-from-snapshot.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-data-clone
namespace: atp-production
spec:
dataSource:
name: postgres-data-snapshot-20240115
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
storageClassName: managed-premium
resources:
requests:
storage: 500Gi # Must be >= snapshot size
Clone Volume Script:
#!/bin/bash
# scripts/clone-volume-from-snapshot.sh
SNAPSHOT_NAME="${1}"
NEW_PVC_NAME="${2}"
NAMESPACE="${3:-atp-production}"
STORAGE_SIZE="${4:-500Gi}"
if [ -z "${SNAPSHOT_NAME}" ] || [ -z "${NEW_PVC_NAME}" ]; then
echo "Usage: $0 <snapshot-name> <new-pvc-name> [namespace] [storage-size]"
exit 1
fi
echo "📋 Cloning volume from snapshot: ${SNAPSHOT_NAME}"
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ${NEW_PVC_NAME}
namespace: ${NAMESPACE}
spec:
dataSource:
name: ${SNAPSHOT_NAME}
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
storageClassName: managed-premium
resources:
requests:
storage: ${STORAGE_SIZE}
EOF
echo "✅ Volume clone created: ${NEW_PVC_NAME}"
Zero-Downtime Migrations¶
Zero-Downtime Migration Strategy:
sequenceDiagram
participant App as Application
participant OldDB as Old DB<br/>Primary
participant NewDB as New DB<br/>Replica
participant Sync as Data Sync
App->>OldDB: Write traffic
OldDB->>Sync: Stream changes
Sync->>NewDB: Apply changes
NewDB->>NewDB: Validate sync
NewDB->>App: Switch traffic
App->>NewDB: Write traffic
PostgreSQL Logical Replication for Zero-Downtime:
# apps/postgresql/migration-replica.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgresql-new
namespace: atp-production
spec:
serviceName: postgresql-new
replicas: 1
selector:
matchLabels:
app: postgresql-new
template:
metadata:
labels:
app: postgresql-new
spec:
containers:
      - name: postgresql
        image: postgres:15
        # The official postgres image has no replication-mode env vars; this
        # instance is attached to the primary as a logical-replication
        # subscriber after startup (see the publication/subscription sketch below)
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes:
- ReadWriteOnce
storageClassName: managed-premium
resources:
requests:
storage: 500Gi
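The manifests alone do not configure replication; the cutover relies on PostgreSQL logical replication. A hedged sketch of the publication/subscription setup (assumes wal_level=logical on the primary, superuser access, and that the schema already exists on the subscriber, since logical replication does not copy DDL):
# On the old primary: publish all tables
kubectl -n atp-production exec postgresql-0 -- psql -U postgres -c \
  "CREATE PUBLICATION atp_migration FOR ALL TABLES;"
# On the new instance: subscribe to the primary and start syncing
kubectl -n atp-production exec postgresql-new-0 -- psql -U postgres -c \
  "CREATE SUBSCRIPTION atp_migration_sub CONNECTION 'host=postgresql.atp-production.svc.cluster.local dbname=atp user=postgres' PUBLICATION atp_migration;"
# Once replication lag reaches zero, switch application traffic to postgresql-new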
GitOps Considerations for Stateful Apps¶
Careful Rollback Procedures¶
StatefulSet Rollback Strategy:
# StatefulSet with update strategy
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgresql
namespace: atp-production
spec:
updateStrategy:
type: RollingUpdate
rollingUpdate:
partition: 0 # Update all pods (reduce gradually for staged rollout)
# OR
# updateStrategy:
# type: OnDelete # Manual update control
Staged Rollout for StatefulSets:
#!/bin/bash
# scripts/staged-statefulset-rollout.sh
STATEFULSET="${1}"
NAMESPACE="${2:-atp-production}"
PARTITION="${3:-2}" # Keep 2 pods on old version
echo "🔄 Staged rollout for StatefulSet: ${STATEFULSET}"
echo " Keeping ${PARTITION} pods on old version"
# Set partition (only pods >= partition index will be updated)
kubectl patch statefulset "${STATEFULSET}" -n "${NAMESPACE}" \
--type='json' \
-p="[{\"op\": \"replace\", \"path\": \"/spec/updateStrategy/rollingUpdate/partition\", \"value\": ${PARTITION}}]"
# Update image
kubectl set image statefulset/${STATEFULSET} \
-n "${NAMESPACE}" \
postgresql=postgres:15
# Gradually reduce partition for staged rollout
for (( i=${PARTITION}; i>=0; i-- )); do
echo " Updating partition: ${i}"
kubectl patch statefulset "${STATEFULSET}" -n "${NAMESPACE}" \
--type='json' \
-p="[{\"op\": \"replace\", \"path\": \"/spec/updateStrategy/rollingUpdate/partition\", \"value\": ${i}}]"
# Wait for pod to be ready
kubectl wait --for=condition=ready pod/${STATEFULSET}-${i} \
-n "${NAMESPACE}" \
--timeout=300s
sleep 60 # Wait before next update
done
echo "✅ Staged rollout complete"
No Auto-Prune for PVCs¶
FluxCD Kustomization with Prune Safety:
# clusters/production/kustomizations/stateful-apps-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: stateful-apps-production
namespace: flux-system
spec:
interval: 5m
path: ./apps/postgresql/overlays/production
  prune: true # Enable pruning for app resources
  # There is no pruneOptions field: resources that must never be pruned
  # (e.g., PVCs) carry the label kustomize.toolkit.fluxcd.io/prune: disabled
  # on the manifest itself (see the PVC below)
sourceRef:
kind: GitRepository
name: atp-gitops-production
PVC Protection Labels:
# apps/postgresql/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-data
namespace: atp-production
labels:
app: postgresql
component: database
kustomize.toolkit.fluxcd.io/prune: disabled # Explicitly exclude from FluxCD pruning
managed-by: kustomize
spec:
accessModes:
- ReadWriteOnce
storageClassName: managed-premium
resources:
requests:
storage: 500Gi
StatefulSet Update Strategies¶
Update Strategy Options:
| Strategy | Description | Use Case | ATP Selection |
|---|---|---|---|
| RollingUpdate | Update pods sequentially | Production (controlled) | ✅ Production |
| OnDelete | Update only when pod deleted | Manual control | ⚠️ Staging (manual) |
StatefulSet Update Strategy Configuration:
# apps/postgresql/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgresql
namespace: atp-production
spec:
updateStrategy:
type: RollingUpdate
rollingUpdate:
partition: 0 # Start updating from index 0
# partition: 2 # Keep pods 0-1 on old version, update 2+
# OR for manual control
# updateStrategy:
# type: OnDelete
Data Persistence Across Deployments¶
PVC Retention Policy:
# StorageClass with Retain policy
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: managed-premium
provisioner: disk.csi.azure.com
parameters:
skuname: Premium_LRS
reclaimPolicy: Retain # Keep PV when PVC is deleted
volumeBindingMode: WaitForFirstConsumer
PVC Lifecycle with Retain Policy:
sequenceDiagram
participant Dev as Developer
participant PVC as PVC
participant PV as PV
participant Azure as Azure Disk
Dev->>PVC: Delete PVC
PVC->>PV: Release (Retain)
PV->>Azure: Keep disk (not deleted)
Azure->>PV: Data preserved
Dev->>PV: Reuse PV with new PVC
Reusing Retained PV:
# Reuse existing PV with new PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-data-restored
namespace: atp-production
spec:
volumeName: pv-abc123 # Reference existing PV
accessModes:
- ReadWriteOnce
storageClassName: managed-premium
resources:
requests:
storage: 500Gi
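A PV released under the Retain policy stays in the Released phase and will not rebind until its stale claim reference is cleared; a short sketch (PV name as in the example above):
# A retained volume shows Released after its original PVC is deleted
kubectl get pv pv-abc123 -o jsonpath='{.status.phase}{"\n"}'
# Clear the stale claimRef so the PV returns to Available and the new PVC can bind
kubectl patch pv pv-abc123 --type merge -p '{"spec":{"claimRef":null}}'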
Disaster Recovery for Persistent Data¶
Backup Frequency per Environment¶
Backup Schedule Matrix:
| Environment | Frequency | Retention | Rationale |
|---|---|---|---|
| Production | Every 6 hours | 90 days | High availability, long-term recovery |
| Staging | Daily | 30 days | Moderate retention |
| Test | Weekly | 14 days | Minimal retention |
| Dev | Manual only | 7 days | Cost optimization |
Automated Backup Schedules:
# platform/velero/schedules-production.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: production-backup-6h
namespace: velero
spec:
schedule: "0 */6 * * *" # Every 6 hours
template:
includedNamespaces:
- atp-production
snapshotVolumes: true
ttl: 2160h # 90 days
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: production-backup-daily
namespace: velero
spec:
schedule: "0 2 * * *" # 2 AM daily
template:
includedNamespaces:
- atp-production
snapshotVolumes: true
ttl: 720h # 30 days
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: production-backup-weekly
namespace: velero
spec:
schedule: "0 3 * * 0" # 3 AM Sunday
template:
includedNamespaces:
- atp-production
snapshotVolumes: true
ttl: 2160h # 90 days
Cross-Region Replication¶
Cross-Region Backup Replication:
#!/bin/bash
# scripts/setup-cross-region-backup.sh
PRIMARY_REGION="eastus"
SECONDARY_REGION="westeurope"
# Create backup storage in secondary region
az storage account create \
--name atpvelerobackupseu \
--resource-group atp-production-rg-eu \
--sku Standard_LRS \
--location "${SECONDARY_REGION}" \
--allow-blob-public-access false
# Configure blob replication
az storage blob service-properties update \
--account-name atpvelerobackups \
--enable-change-feed true \
--enable-versioning true
# Enable geo-redundant replication on the primary backup account
az storage account update \
  --name atpvelerobackups \
  --resource-group atp-production-rg \
  --sku Standard_RAGRS \
  --allow-blob-public-access false \
  --min-tls-version TLS1_2
echo "✅ Cross-region backup replication configured"
Velero with Multiple Backup Locations:
# platform/velero/backup-locations.yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
name: backup-primary
namespace: velero
spec:
provider: azure
objectStorage:
bucket: velero
prefix: primary
config:
resourceGroup: atp-production-rg
storageAccount: atpvelerobackups
---
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
name: backup-secondary
namespace: velero
spec:
provider: azure
objectStorage:
bucket: velero
prefix: secondary
config:
resourceGroup: atp-production-rg-eu
storageAccount: atpvelerobackupseu
RPO Targets for Databases¶
RPO (Recovery Point Objective) Targets:
| Environment | RPO Target | Backup Frequency | Actual RPO |
|---|---|---|---|
| Production | < 1 hour | Every 6 hours | 6 hours |
| Staging | < 24 hours | Daily | 24 hours |
| Test | < 7 days | Weekly | 7 days |
| Dev | N/A | Manual | N/A |
Note: the 6-hour Velero cadence alone cannot meet the < 1 hour production RPO target; database-native continuous backup (e.g., the pgBackRest repository configured for PostgreSQL above, with WAL archiving) is what closes that gap.
RPO Validation:
#!/bin/bash
# scripts/validate-rpo.sh
NAMESPACE="atp-production"
RPO_TARGET_HOURS=6
echo "🔍 Validating RPO compliance..."
# Get latest backup
LATEST_BACKUP=$(velero backup get --namespace velero | \
grep "${NAMESPACE}" | \
sort -k4 -r | \
head -n 1 | \
awk '{print $1}')
if [ -z "${LATEST_BACKUP}" ]; then
echo "❌ No backups found"
exit 1
fi
# Get backup creation time
BACKUP_TIME=$(velero backup describe "${LATEST_BACKUP}" --namespace velero | \
grep "Creation" | \
awk '{print $2, $3}')
BACKUP_EPOCH=$(date -d "${BACKUP_TIME}" +%s)
CURRENT_EPOCH=$(date +%s)
AGE_HOURS=$(( (CURRENT_EPOCH - BACKUP_EPOCH) / 3600 ))
if [ "${AGE_HOURS}" -gt "${RPO_TARGET_HOURS}" ]; then
echo "⚠️ RPO violation: Latest backup is ${AGE_HOURS} hours old (target: ${RPO_TARGET_HOURS}h)"
exit 1
else
echo "✅ RPO compliant: Latest backup is ${AGE_HOURS} hours old (target: ${RPO_TARGET_HOURS}h)"
fi
DR Testing for Stateful Apps¶
DR Test Procedure:
#!/bin/bash
# scripts/dr-test-stateful-apps.sh
BACKUP_NAME="${1}"
TEST_NAMESPACE="atp-production-dr-test"
if [ -z "${BACKUP_NAME}" ]; then
echo "Usage: $0 <backup-name>"
echo ""
echo "Available backups:"
velero backup get --namespace velero
exit 1
fi
echo "🧪 DR Test: Restoring backup ${BACKUP_NAME} to test namespace"
# Step 1: Restore backup to test namespace
echo "📥 Step 1: Restoring backup..."
velero restore create "dr-test-${BACKUP_NAME}-$(date +%Y%m%d-%H%M%S)" \
--from-backup "${BACKUP_NAME}" \
--namespace-mappings atp-production:${TEST_NAMESPACE} \
--wait
# Step 2: Verify restored resources
echo "✅ Step 2: Verifying restored resources..."
kubectl get statefulsets -n "${TEST_NAMESPACE}"
kubectl get pvc -n "${TEST_NAMESPACE}"
# Step 3: Test database connectivity
echo "🔌 Step 3: Testing database connectivity..."
kubectl run postgresql-test \
-n "${TEST_NAMESPACE}" \
--image=postgres:15 \
--rm -it --restart=Never \
-- psql -h postgresql.${TEST_NAMESPACE}.svc.cluster.local -U postgres -c "SELECT version();"
# Step 4: Cleanup
echo "🧹 Step 4: Cleaning up test namespace..."
read -p "Delete test namespace ${TEST_NAMESPACE}? (yes/no): " CONFIRM
if [ "${CONFIRM}" = "yes" ]; then
kubectl delete namespace "${TEST_NAMESPACE}"
echo "✅ DR test complete and cleaned up"
else
echo "⚠️ Test namespace retained: ${TEST_NAMESPACE}"
fi
DR Test Checklist:
## DR Test Checklist
### Pre-Test
- [ ] Backup exists and is valid
- [ ] Test namespace created
- [ ] Test resources allocated
### Test Execution
- [ ] Backup restored successfully
- [ ] StatefulSets recreated
- [ ] PVCs restored
- [ ] Pods running and healthy
- [ ] Database accessible
- [ ] Data integrity verified
### Post-Test
- [ ] Test results documented
- [ ] Test namespace cleaned up
- [ ] Lessons learned captured
Summary: Storage & StatefulSets in GitOps¶
- Persistent Volumes (PV) and Claims (PVC): PV and PVC concepts, dynamic provisioning (ATP preference), storage classes, access modes (RWO, RWX, ROX)
- Azure Disk vs Azure Files: Azure Disk (block storage, single mount) for databases, Azure Files (shared storage, multi-mount) for shared files, performance characteristics comparison, cost comparison, use case selection matrix
- Storage Classes: Performance tiers (Standard, Premium, Ultra), encryption configuration with Disk Encryption Set, snapshot support, provisioner settings (binding modes, expansion)
- StatefulSet Deployment Patterns: Ordered deployment and scaling, stable network identity with headless services, persistent storage per pod (volume claim templates), headless service configuration
- Database Deployments in Kubernetes: PostgreSQL operator (Crunchy Data), MongoDB operator, Redis deployment, StatefulSet vs managed service decision (ATP: managed services for production)
- Backup and Restore Procedures: Velero for cluster backups, volume snapshots, backup scheduling (6h/daily/weekly), restore procedures (to same/different namespace)
- Volume Snapshots: Creating snapshots (manual/automated), snapshot classes (Retain/Delete policies), Azure Backup integration, snapshot retention policies per environment
- Data Migration Strategies: Migrating data between versions (PostgreSQL 14→15 example), volume cloning from snapshots, zero-downtime migrations with logical replication
- GitOps Considerations for Stateful Apps: Careful rollback procedures (staged rollout), no auto-prune for PVCs (explicit labels), StatefulSet update strategies (RollingUpdate/OnDelete), data persistence across deployments (Retain policy)
- Disaster Recovery for Persistent Data: Backup frequency per environment (6h/daily/weekly), cross-region replication, RPO targets for databases (< 1 hour production), DR testing for stateful apps (restore validation)
Troubleshooting GitOps Issues¶
Purpose: Define comprehensive troubleshooting procedures, debugging tools, common error patterns, and escalation procedures for ATP's GitOps deployments, enabling rapid identification and resolution of issues across FluxCD, Kubernetes resources, networking, secrets, health checks, and performance.
FluxCD Sync Failures¶
Authentication Issues (Git Credentials)¶
Common Authentication Errors:
# Check GitRepository authentication status
kubectl get gitrepository -n flux-system -o yaml
# Describe GitRepository to see authentication errors
kubectl describe gitrepository atp-gitops-production -n flux-system
# Check secret for Git credentials
kubectl get secret git-credentials -n flux-system -o yaml
# View FluxCD logs for authentication errors
kubectl logs -n flux-system -l app=source-controller --tail=100 | grep -i "auth\|error\|failed"
Troubleshooting Git Authentication:
#!/bin/bash
# scripts/troubleshoot-git-auth.sh
GIT_REPO="${1:-atp-gitops-production}"
NAMESPACE="${2:-flux-system}"
echo "🔍 Troubleshooting Git authentication for: ${GIT_REPO}"
# Step 1: Check GitRepository status
echo "📋 Step 1: Checking GitRepository status..."
kubectl get gitrepository "${GIT_REPO}" -n "${NAMESPACE}" -o jsonpath='{.status.conditions[*]}' | jq
# Step 2: Check if secret exists
echo "🔐 Step 2: Checking Git credentials secret..."
SECRET_NAME=$(kubectl get gitrepository "${GIT_REPO}" -n "${NAMESPACE}" -o jsonpath='{.spec.secretRef.name}')
if [ -n "${SECRET_NAME}" ]; then
echo " Secret name: ${SECRET_NAME}"
kubectl get secret "${SECRET_NAME}" -n "${NAMESPACE}" || echo " ❌ Secret not found"
else
echo " ⚠️ No secret reference found"
fi
# Step 3: Check source controller logs
echo "📜 Step 3: Checking source controller logs..."
kubectl logs -n "${NAMESPACE}" -l app=source-controller --tail=50 | grep -i "${GIT_REPO}"
# Step 4: Test Git connectivity manually
echo "🌐 Step 4: Testing Git connectivity..."
GIT_URL=$(kubectl get gitrepository "${GIT_REPO}" -n "${NAMESPACE}" -o jsonpath='{.spec.url}')
echo " Git URL: ${GIT_URL}"
Fix Git Authentication:
# Regenerate the Git PAT in Azure DevOps (User Settings → Personal Access Tokens,
# scope: Code Read) — PAT creation is not exposed through the az devops CLI
PAT="<new-personal-access-token>"
# Update secret
kubectl create secret generic git-credentials \
--from-literal=username=${USERNAME} \
--from-literal=password=${PAT} \
-n flux-system \
--dry-run=client -o yaml | kubectl apply -f -
# Reconcile GitRepository
flux reconcile source git atp-gitops-production -n flux-system
Manifest Syntax Errors¶
Detecting Manifest Syntax Errors:
#!/bin/bash
# scripts/check-manifest-syntax.sh
PATH_TO_CHECK="${1:-.}"
echo "🔍 Checking manifest syntax in: ${PATH_TO_CHECK}"
# Check YAML syntax
find "${PATH_TO_CHECK}" -name "*.yaml" -o -name "*.yml" | while read -r file; do
echo "Checking: ${file}"
# Use yamllint or kubeval
yamllint "${file}" 2>/dev/null || echo " ⚠️ YAML syntax error in ${file}"
done
# Validate Kubernetes manifests
kubeval --directories "${PATH_TO_CHECK}" --ignore-missing-schemas || echo " ⚠️ Kubernetes manifest validation errors"
Common Syntax Errors:
| Error Type | Example | Fix |
|---|---|---|
| Indentation | key:value | Use proper YAML indentation (spaces, not tabs) |
| Missing colon | key value | Use key: value |
| Invalid type | replicas: "3" (string) | Use replicas: 3 (integer) |
| Invalid enum | type: Invalid | Use a valid Kubernetes enum value |
Fix Manifest Syntax:
# Validate before committing
kubectl apply --dry-run=client -f manifests/
# Use kubeval for validation
kubectl kustomize . | kubeval
# Use FluxCD validation
flux check --pre
Resource Conflicts (Already Exists)¶
Identifying Resource Conflicts:
#!/bin/bash
# scripts/check-resource-conflicts.sh
NAMESPACE="${1:-atp-production}"
echo "🔍 Checking for resource conflicts in namespace: ${NAMESPACE}"
# Check for resources not managed by FluxCD
echo "📋 Resources not managed by FluxCD:"
kubectl get all -n "${NAMESPACE}" -o json | \
jq -r '.items[] | select(.metadata.labels."kustomize.toolkit.fluxcd.io/name" == null) | "\(.kind)/\(.metadata.name)"'
# Check Kustomization status
echo "📦 Kustomization status:"
kubectl get kustomization -n flux-system -o wide | grep "${NAMESPACE}"
# Check for "already exists" errors in FluxCD logs
echo "🚨 Checking FluxCD logs for conflicts:"
kubectl logs -n flux-system -l app=kustomize-controller --tail=100 | \
grep -i "already exists\|conflict\|error"
Resolve Resource Conflicts:
# Option 1: Adopt existing resource (add FluxCD labels)
kubectl label <kind>/<name> kustomize.toolkit.fluxcd.io/name=atp-apps \
  kustomize.toolkit.fluxcd.io/namespace=flux-system \
  -n atp-production
# Option 2: Delete conflicting resource (if safe)
kubectl delete deployment conflicting-deployment -n atp-production
# Option 3: Suspend Kustomization, fix, then resume
flux suspend kustomization atp-apps-production -n flux-system
# Fix the conflict
flux resume kustomization atp-apps-production -n flux-system
Timeout Errors¶
Troubleshooting Timeout Errors:
#!/bin/bash
# scripts/troubleshoot-timeout.sh
RESOURCE="${1}"
NAMESPACE="${2:-flux-system}"
echo "⏱️ Troubleshooting timeout for: ${RESOURCE}"
# Check resource status
kubectl get "${RESOURCE}" -n "${NAMESPACE}" -o yaml | grep -A 5 "conditions:"
# Check reconciliation timeout settings
kubectl get kustomization "${RESOURCE}" -n "${NAMESPACE}" -o jsonpath='{.spec.timeout}'
# View detailed status
flux get kustomization "${RESOURCE}" -n "${NAMESPACE}" -o wide
# Check for stuck reconciliations
kubectl get kustomization -n "${NAMESPACE}" -o json | \
jq -r '.items[] | select(.status.conditions[].status == "False") | "\(.metadata.name): \(.status.conditions[].message)"'
Increase Timeout:
# clusters/production/kustomizations/apps-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
namespace: flux-system
spec:
interval: 5m
timeout: 10m # Increase timeout from default 5m
path: ./apps/atp-gateway/overlays/production
sourceRef:
kind: GitRepository
name: atp-gitops-production
Network Connectivity Issues¶
Check Network Connectivity:
#!/bin/bash
# scripts/troubleshoot-network.sh
echo "🌐 Troubleshooting network connectivity..."
# Test Git repository access
GIT_URL=$(kubectl get gitrepository atp-gitops-production -n flux-system -o jsonpath='{.spec.url}')
echo "Testing Git URL: ${GIT_URL}"
# Test from a debug container attached to the source controller pod
# (the official controller image is distroless, so it has no shell or wget)
SOURCE_POD=$(kubectl get pod -n flux-system -l app=source-controller -o jsonpath='{.items[0].metadata.name}')
kubectl debug "${SOURCE_POD}" -n flux-system -it --image=curlimages/curl -- curl -sI "${GIT_URL}"
# Check DNS resolution from an ephemeral pod in the same namespace
kubectl run dns-debug --image=busybox --rm -it --restart=Never -n flux-system -- nslookup dev.azure.com
# Check proxy settings
kubectl get deployment source-controller -n flux-system -o yaml | grep -i proxy
Common Network Issues:
| Issue | Symptom | Fix |
|---|---|---|
| DNS resolution | `could not resolve host` | Check CoreDNS, network policies |
| Firewall | `connection timeout` | Allow Git repository IPs in the NSG |
| Proxy | `proxy authentication required` | Configure proxy in the source controller |
| VNet peering | `network unreachable` | Verify VNet peering configuration |
Drift Detection and Resolution¶
Identifying Drifted Resources¶
Detect Drift:
#!/bin/bash
# scripts/detect-drift.sh
NAMESPACE="${1:-atp-production}"
KUSTOMIZATION="${2:-apps-production}"
echo "🔍 Detecting drift in namespace: ${NAMESPACE}"
# Check Kustomization drift status
flux get kustomization "${KUSTOMIZATION}" -n flux-system
# Force reconciliation and check for drift
flux reconcile kustomization "${KUSTOMIZATION}" -n flux-system --with-source
# List resources tracked in the Kustomization inventory (entries carry an id and version)
kubectl get kustomization "${KUSTOMIZATION}" -n flux-system -o json | \
  jq -r '.status.inventory.entries[] | "\(.id) (\(.v))"'
# Compare Git state with cluster state (run from a checkout of the GitOps repo;
# flux diff requires --path to the local manifests)
flux diff kustomization "${KUSTOMIZATION}" -n flux-system --path ./apps/atp-gateway/overlays/production
Drift Detection Query (KQL):
// Log Analytics: Detect FluxCD drift events
KubePodInventory
| where Namespace == "flux-system"
| where Name contains "kustomize-controller"
| join kind=inner (
ContainerLog
| where LogEntry contains "drift" or LogEntry contains "diff"
| project TimeGenerated, LogEntry, ContainerID
) on ContainerID
| project TimeGenerated, LogEntry
| order by TimeGenerated desc
Manual Changes Detection¶
Detect Manual Changes:
#!/bin/bash
# scripts/detect-manual-changes.sh
NAMESPACE="${1:-atp-production}"
echo "🔍 Detecting manually modified resources..."
# Find resources without FluxCD labels
kubectl get all -n "${NAMESPACE}" -o json | \
jq -r '.items[] | select(.metadata.labels."kustomize.toolkit.fluxcd.io/name" == null) |
"\(.kind)/\(.metadata.name) - Not managed by FluxCD"'
# Find resources carrying kubectl's last-applied-configuration annotation
# (FluxCD uses server-side apply, so this annotation usually indicates a manual kubectl apply)
kubectl get all -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.metadata.annotations."kubectl.kubernetes.io/last-applied-configuration" != null) |
  "\(.kind)/\(.metadata.name) - Likely modified with kubectl apply"'
# Check Git commit history for resource
RESOURCE="${2}"
if [ -n "${RESOURCE}" ]; then
git log --all --oneline --grep="${RESOURCE}" -- manifests/
fi
Revert Drift vs Accept Change¶
Decision Tree for Drift Resolution:
graph TD
START[Detect Drift] --> CHECK{Type of Change?}
CHECK -->|Critical Config| REVERT[Force Revert]
CHECK -->|Performance Tuning| ACCEPT[Accept & Commit]
CHECK -->|Debugging Change| DECIDE{Production?}
REVERT --> RECONCILE[Reconcile Resource]
ACCEPT --> COMMIT[Commit to Git]
DECIDE -->|Yes| REVERT
DECIDE -->|No| ACCEPT
RECONCILE --> VERIFY[Verify Fixed]
COMMIT --> VERIFY
VERIFY --> DONE[Complete]
style REVERT fill:#FFB6C1
style ACCEPT fill:#90EE90
Force Revert Drift:
#!/bin/bash
# scripts/revert-drift.sh
RESOURCE_TYPE="${1}" # e.g., deployment
RESOURCE_NAME="${2}"
NAMESPACE="${3:-atp-production}"
echo "🔄 Reverting drift for ${RESOURCE_TYPE}/${RESOURCE_NAME}"
# Preview what re-applying Git state will change (run from a repo checkout;
# --path points at the kustomization directory, not an individual resource)
flux diff kustomization apps-production -n flux-system --path ./apps/atp-gateway/overlays/production
# Force reconciliation
flux reconcile kustomization apps-production -n flux-system --with-source
# Verify reverted (compare the live object against the manifest in the Git checkout)
kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" -o yaml | \
  diff - "manifests/apps/atp-gateway/base/${RESOURCE_TYPE}-${RESOURCE_NAME}.yaml" || true
Accept and Commit Drift:
#!/bin/bash
# scripts/accept-drift.sh
RESOURCE_TYPE="${1}"
RESOURCE_NAME="${2}"
NAMESPACE="${3:-atp-production}"
echo "✅ Accepting drift for ${RESOURCE_TYPE}/${RESOURCE_NAME}"
# Export current state, stripping server-populated fields (status, uid,
# resourceVersion) before committing; kubectl-neat is a krew plugin that does this
kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" -o yaml | \
  kubectl neat > manifests/apps/atp-gateway/base/${RESOURCE_TYPE}-${RESOURCE_NAME}.yaml
# Commit to Git
git add manifests/
git commit -m "Accept drift: ${RESOURCE_TYPE}/${RESOURCE_NAME} in ${NAMESPACE}"
git push
# Reconcile to sync
flux reconcile source git atp-gitops-production -n flux-system
Investigating Drift Causes¶
Drift Investigation Workflow:
#!/bin/bash
# scripts/investigate-drift.sh
RESOURCE="${1}"
NAMESPACE="${2:-atp-production}"
echo "🔬 Investigating drift cause for: ${RESOURCE}"
# Step 1: Check resource history
echo "📜 Step 1: Resource change history..."
kubectl get events -n "${NAMESPACE}" --field-selector involvedObject.name="${RESOURCE}" --sort-by='.lastTimestamp'
# Step 2: Check audit logs
echo "📋 Step 2: Kubernetes audit logs..."
# Query Azure Monitor Log Analytics for audit logs
cat <<EOF
AzureLogAnalytics Query:
AzureActivity
| where ResourceProvider == "Microsoft.ContainerService"
| where OperationName contains "write"
| where Properties contains "${RESOURCE}"
| order by TimeGenerated desc
EOF
# Step 3: Check FluxCD reconciliation history
echo "🔄 Step 3: FluxCD reconciliation history..."
kubectl get kustomization -n flux-system -o json | \
  jq -r '.items[] | select(.status.inventory.entries[]?.id | contains("'"${RESOURCE}"'")) |
  "\(.metadata.name): Last applied revision \(.status.lastAppliedRevision)"'
# Step 4: Compare with Git
echo "📦 Step 4: Compare with Git state..."
# flux diff requires --path to a local checkout of the manifests
flux diff kustomization apps-production -n flux-system \
  --path ./apps/atp-gateway/overlays/production | grep "${RESOURCE}"
Image Pull Errors¶
ACR Authentication Failures¶
Troubleshooting ACR Authentication:
#!/bin/bash
# scripts/troubleshoot-acr-auth.sh
NAMESPACE="${1:-atp-production}"
POD_NAME="${2}"
echo "🔐 Troubleshooting ACR authentication..."
# Check image pull secrets
echo "📋 Image pull secrets:"
kubectl get secret -n "${NAMESPACE}" | grep -i "docker\|acr\|registry"
# Check pod's image pull secret
if [ -n "${POD_NAME}" ]; then
echo "🔍 Pod image pull secrets:"
kubectl get pod "${POD_NAME}" -n "${NAMESPACE}" -o jsonpath='{.spec.imagePullSecrets[*].name}'
# Pull failures surface as pod events (application containers have no docker CLI)
echo "🌐 Recent image pull events for the pod:"
kubectl get events -n "${NAMESPACE}" \
  --field-selector involvedObject.name="${POD_NAME}" | grep -i "pull" || true
fi
# Check ACR authentication (registry name, without the .azurecr.io suffix)
ACR_NAME="${3:-connectsoft}"
echo "🔑 Checking ACR access..."
az acr repository list --name "${ACR_NAME}" --output table
Fix ACR Authentication:
# Create an ACR pull secret from the registry admin credentials
# (prefer attaching ACR to the cluster or Workload Identity; admin credentials are a fallback)
az acr login --name connectsoft
# Create Kubernetes secret
kubectl create secret docker-registry acr-secret \
  --docker-server=connectsoft.azurecr.io \
  --docker-username=$(az acr credential show --name connectsoft --query username -o tsv) \
  --docker-password=$(az acr credential show --name connectsoft --query passwords[0].value -o tsv) \
-n atp-production \
--dry-run=client -o yaml | kubectl apply -f -
# Add to default service account
kubectl patch serviceaccount default -n atp-production -p '{"imagePullSecrets":[{"name":"acr-secret"}]}'
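A more durable fix, assuming the cluster's kubelet identity may be granted AcrPull, is to attach the registry to the cluster so image pulls need no pull secret at all:
# Grants the AKS kubelet identity AcrPull on the registry
az aks update \
  --resource-group atp-production-rg \
  --name atp-production-aks \
  --attach-acr connectsoft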
Image Not Found¶
Troubleshooting Missing Images:
#!/bin/bash
# scripts/troubleshoot-image-not-found.sh
IMAGE="${1}"
NAMESPACE="${2:-atp-production}"
echo "🔍 Troubleshooting image not found: ${IMAGE}"
# Check if image exists in ACR
LOGIN_SERVER=$(echo "${IMAGE}" | cut -d'/' -f1)
ACR_NAME="${LOGIN_SERVER%%.*}" # az acr expects the registry name, not the login server
REPO_TAG=$(echo "${IMAGE}" | cut -d'/' -f2-)
REPO=$(echo "${REPO_TAG}" | cut -d':' -f1)
TAG=$(echo "${REPO_TAG}" | cut -d':' -f2)
echo "Registry: ${ACR_NAME} (${LOGIN_SERVER})"
echo "Repository: ${REPO}"
echo "Tag: ${TAG}"
# Check ACR repository
az acr repository show --name "${ACR_NAME}" --repository "${REPO}" || \
echo "❌ Repository not found"
# List tags
az acr repository show-tags --name "${ACR_NAME}" --repository "${REPO}" --output table
# Check pods with ImagePullBackOff
echo "🚨 Pods with ImagePullBackOff:"
kubectl get pods -n "${NAMESPACE}" -o json | \
jq -r '.items[] | select(.status.containerStatuses[]?.state.waiting.reason == "ImagePullBackOff") |
"\(.metadata.name): \(.spec.containers[].image)"'
Fix Missing Image:
# Rebuild and push image
docker build -t connectsoft.azurecr.io/atp/gateway:v1.2.3 .
docker push connectsoft.azurecr.io/atp/gateway:v1.2.3
# Update manifest (run inside the overlay directory containing kustomization.yaml)
kustomize edit set image connectsoft.azurecr.io/atp/gateway:v1.2.3
# Commit and push
git add .
git commit -m "Fix: Update image tag to v1.2.3"
git push
ImagePullBackOff Troubleshooting¶
Diagnose ImagePullBackOff:
#!/bin/bash
# scripts/diagnose-imagepullbackoff.sh
NAMESPACE="${1:-atp-production}"
echo "🚨 Diagnosing ImagePullBackOff errors..."
# Find pods with ImagePullBackOff
kubectl get pods -n "${NAMESPACE}" -o json | \
jq -r '.items[] | select(.status.containerStatuses[]?.state.waiting.reason == "ImagePullBackOff") |
"Pod: \(.metadata.name)\n Image: \(.spec.containers[].image)\n Reason: \(.status.containerStatuses[].state.waiting.reason)\n Message: \(.status.containerStatuses[].state.waiting.message)\n---"'
# Describe pod for details
PODS=$(kubectl get pods -n "${NAMESPACE}" -o json | \
jq -r '.items[] | select(.status.containerStatuses[]?.state.waiting.reason == "ImagePullBackOff") | .metadata.name')
for POD in ${PODS}; do
echo "📋 Details for pod: ${POD}"
kubectl describe pod "${POD}" -n "${NAMESPACE}" | grep -A 10 "Events:"
done
# Check events
kubectl get events -n "${NAMESPACE}" --sort-by='.lastTimestamp' | grep -i "pull\|image\|backoff"
Common ImagePullBackOff Causes:
| Cause | Symptom | Fix |
|---|---|---|
| Image doesn't exist | `manifest unknown` | Rebuild and push the image |
| Authentication failed | `unauthorized` | Fix ACR credentials |
| Network issue | `timeout` | Check network policies, DNS |
| Wrong tag | `not found` | Update the image tag in the manifest |
Resource Conflicts¶
"Already Exists" Errors¶
Resolve "Already Exists" Errors:
#!/bin/bash
# scripts/resolve-already-exists.sh
RESOURCE_TYPE="${1}"
RESOURCE_NAME="${2}"
NAMESPACE="${3:-atp-production}"
echo "🔧 Resolving 'already exists' error for ${RESOURCE_TYPE}/${RESOURCE_NAME}"
# Check if resource exists
if kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" &>/dev/null; then
echo "✅ Resource exists"
# Check if managed by FluxCD
MANAGED=$(kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" -o jsonpath='{.metadata.labels.kustomize\.toolkit\.fluxcd\.io/name}')
if [ -z "${MANAGED}" ]; then
echo "⚠️ Resource not managed by FluxCD"
echo " Options:"
echo " 1. Adopt resource: kubectl label ${RESOURCE_TYPE} ${RESOURCE_NAME} kustomize.toolkit.fluxcd.io/name=apps-production -n ${NAMESPACE}"
echo " 2. Delete resource: kubectl delete ${RESOURCE_TYPE} ${RESOURCE_NAME} -n ${NAMESPACE}"
else
echo "✅ Resource managed by FluxCD: ${MANAGED}"
echo " Force reconciliation: flux reconcile kustomization ${MANAGED} -n flux-system"
fi
else
echo "❌ Resource does not exist"
fi
Immutable Field Errors¶
Handle Immutable Field Changes:
#!/bin/bash
# scripts/handle-immutable-fields.sh
RESOURCE_TYPE="${1}"
RESOURCE_NAME="${2}"
NAMESPACE="${3:-atp-production}"
echo "🔒 Handling immutable field changes for ${RESOURCE_TYPE}/${RESOURCE_NAME}"
# Common immutable fields
# - Deployment: selector, template.labels
# - Service: clusterIP (if set)
# - StatefulSet: volumeClaimTemplates
# For immutable fields, delete and recreate
echo "⚠️ Immutable field detected. Need to delete and recreate."
# Step 1: Export current resource
kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" -o yaml > \
backup-${RESOURCE_NAME}.yaml
# Step 2: Delete resource
kubectl delete "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}"
# Step 3: Reconcile to recreate
flux reconcile kustomization apps-production -n flux-system --with-source
echo "✅ Resource recreated"
Owner Reference Conflicts¶
Resolve Owner Reference Conflicts:
#!/bin/bash
# scripts/resolve-owner-conflicts.sh
RESOURCE="${1}"
NAMESPACE="${2:-atp-production}"
echo "🔗 Resolving owner reference conflicts for: ${RESOURCE}"
# Check owner references
kubectl get "${RESOURCE}" -n "${NAMESPACE}" -o jsonpath='{.metadata.ownerReferences[*].kind}'
# Remove conflicting owner reference
kubectl patch "${RESOURCE}" -n "${NAMESPACE}" --type=json \
-p='[{"op": "remove", "path": "/metadata/ownerReferences"}]'
# Or adopt resource properly
kubectl label "${RESOURCE}" -n "${NAMESPACE}" \
kustomize.toolkit.fluxcd.io/name=apps-production \
kustomize.toolkit.fluxcd.io/namespace=flux-system
Secret Access Failures¶
Workload Identity Misconfiguration¶
Troubleshoot Workload Identity:
#!/bin/bash
# scripts/troubleshoot-workload-identity.sh
NAMESPACE="${1:-atp-production}"
SERVICE_ACCOUNT="${2:-default}"
echo "🔐 Troubleshooting Workload Identity..."
# Check ServiceAccount annotations
echo "📋 ServiceAccount annotations:"
kubectl get serviceaccount "${SERVICE_ACCOUNT}" -n "${NAMESPACE}" -o json | jq '.metadata.annotations'
# Check federated credentials in Azure AD
AZURE_CLIENT_ID=$(kubectl get serviceaccount "${SERVICE_ACCOUNT}" -n "${NAMESPACE}" -o jsonpath='{.metadata.annotations.azure\.workload\.identity/client-id}')
echo "Azure Client ID: ${AZURE_CLIENT_ID}"
# Check pods opted into Workload Identity (selected via the azure.workload.identity/use label)
echo "📦 Pods using Workload Identity:"
kubectl get pods -n "${NAMESPACE}" -l azure.workload.identity/use=true \
  -o custom-columns=NAME:.metadata.name,SERVICEACCOUNT:.spec.serviceAccountName
# Test token acquisition from pod
POD=$(kubectl get pod -n "${NAMESPACE}" -l app=atp-gateway -o jsonpath='{.items[0].metadata.name}')
if [ -n "${POD}" ]; then
echo "🧪 Testing token acquisition from pod: ${POD}"
kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
cat /var/run/secrets/azure/tokens/azure-identity-token 2>&1 || echo "❌ Token not available"
fi
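For comparison, a correctly configured ServiceAccount looks like the sketch below (client and tenant IDs are placeholders); pods must additionally carry the azure.workload.identity/use: "true" label for the webhook to inject the token volume:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: atp-gateway
  namespace: atp-production
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
    azure.workload.identity/tenant-id: "<azure-tenant-id>"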
Key Vault Permission Issues¶
Check Key Vault Permissions:
#!/bin/bash
# scripts/check-keyvault-permissions.sh
KEY_VAULT="${1:-atp-keyvault}"
IDENTITY="${2}" # Managed identity client ID
echo "🔑 Checking Key Vault permissions..."
# Check access policies
az keyvault show --name "${KEY_VAULT}" --query "properties.accessPolicies[].objectId"
# Check RBAC permissions (assumes SUBSCRIPTION_ID and RG are exported in the environment)
if [ -n "${IDENTITY}" ]; then
  echo "Checking RBAC for identity: ${IDENTITY}"
  az role assignment list --assignee "${IDENTITY}" \
    --scope "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.KeyVault/vaults/${KEY_VAULT}"
fi
# Test secret access
SECRET_NAME="test-secret"
az keyvault secret show --vault-name "${KEY_VAULT}" --name "${SECRET_NAME}" || \
echo "❌ Cannot access secret: ${SECRET_NAME}"
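If the vault uses Azure RBAC and the assignment is missing, grant the identity read access to secrets (variables as in the script above):
az role assignment create \
  --assignee "${IDENTITY}" \
  --role "Key Vault Secrets User" \
  --scope "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.KeyVault/vaults/${KEY_VAULT}"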
ExternalSecret Sync Failures¶
Troubleshoot ExternalSecret:
#!/bin/bash
# scripts/troubleshoot-externalsecret.sh
SECRET_NAME="${1}"
NAMESPACE="${2:-atp-production}"
echo "🔍 Troubleshooting ExternalSecret: ${SECRET_NAME}"
# Check ExternalSecret status
kubectl get externalsecret "${SECRET_NAME}" -n "${NAMESPACE}" -o yaml | \
grep -A 20 "status:"
# Check ClusterSecretStore
STORE=$(kubectl get externalsecret "${SECRET_NAME}" -n "${NAMESPACE}" -o jsonpath='{.spec.secretStoreRef.name}')
echo "SecretStore: ${STORE}"
kubectl get clustersecretstore "${STORE}" -o yaml
# Check external-secrets operator logs
kubectl logs -n external-secrets-system -l app.kubernetes.io/name=external-secrets --tail=100 | \
grep -i "${SECRET_NAME}"
# Force refresh
kubectl annotate externalsecret "${SECRET_NAME}" -n "${NAMESPACE}" \
force-sync=$(date +%s) --overwrite
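For reference, a healthy ExternalSecret typically looks like this sketch (store, secret, and key names are illustrative):
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: atp-gateway-db
  namespace: atp-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: azure-keyvault
  target:
    name: atp-gateway-db
  data:
    - secretKey: connection-string
      remoteRef:
        key: atp-db-connection-string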
Health Check Failures¶
Readiness Probe Timeouts¶
Troubleshoot Readiness Probes:
#!/bin/bash
# scripts/troubleshoot-readiness.sh
POD="${1}"
NAMESPACE="${2:-atp-production}"
echo "🏥 Troubleshooting readiness probe for pod: ${POD}"
# Check pod status
kubectl get pod "${POD}" -n "${NAMESPACE}" -o yaml | \
grep -A 10 "readinessProbe:"
# Check probe configuration
kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.spec.containers[*].readinessProbe}' | jq
# Test probe endpoint manually
ENDPOINT=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.spec.containers[0].readinessProbe.httpGet.path}')
PORT=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.spec.containers[0].readinessProbe.httpGet.port}')
echo "Testing endpoint: http://localhost:${PORT}${ENDPOINT}"
kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
curl -f http://localhost:${PORT}${ENDPOINT} || echo "❌ Probe endpoint failed"
# Check events
kubectl describe pod "${POD}" -n "${NAMESPACE}" | grep -A 5 "Events:"
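When the endpoint is healthy but slow to warm up, the fix is usually probe tuning rather than application changes. A hedged starting point (values are illustrative, not tuned for ATP workloads):
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3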
Liveness Probe Failures¶
Troubleshoot Liveness Probes:
#!/bin/bash
# scripts/troubleshoot-liveness.sh
POD="${1}"
NAMESPACE="${2:-atp-production}"
echo "💓 Troubleshooting liveness probe for pod: ${POD}"
# Check if pod is restarting
RESTARTS=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.status.containerStatuses[0].restartCount}')
echo "Restart count: ${RESTARTS}"
# Check liveness probe config
kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.spec.containers[*].livenessProbe}' | jq
# Check previous container logs (if restarted)
if [ "${RESTARTS}" -gt 0 ]; then
echo "📜 Previous container logs:"
kubectl logs "${POD}" -n "${NAMESPACE}" --previous --tail=50
fi
# Check current logs
echo "📜 Current container logs:"
kubectl logs "${POD}" -n "${NAMESPACE}" --tail=50
Debugging Health Endpoints¶
Health Endpoint Debugging:
#!/bin/bash
# scripts/debug-health-endpoint.sh
POD="${1}"
NAMESPACE="${2:-atp-production}"
ENDPOINT="${3:-/health}"
echo "🔍 Debugging health endpoint: ${ENDPOINT}"
# Get pod IP
POD_IP=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.status.podIP}')
echo "Pod IP: ${POD_IP}"
# Test from within pod
kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
curl -v http://localhost:8080${ENDPOINT} 2>&1
# Test from another pod
kubectl run debug-pod --image=curlimages/curl --rm -it --restart=Never -n "${NAMESPACE}" -- \
curl -v http://${POD_IP}:8080${ENDPOINT}
# Check application logs
kubectl logs "${POD}" -n "${NAMESPACE}" --tail=100 | grep -i "health\|ready\|startup"
Networking Issues¶
Service Discovery Failures¶
Troubleshoot Service Discovery:
#!/bin/bash
# scripts/troubleshoot-service-discovery.sh
SERVICE="${1}"
NAMESPACE="${2:-atp-production}"
echo "🌐 Troubleshooting service discovery for: ${SERVICE}"
# Check service exists
kubectl get service "${SERVICE}" -n "${NAMESPACE}"
# Check endpoints
kubectl get endpoints "${SERVICE}" -n "${NAMESPACE}"
# Test DNS resolution
POD=$(kubectl get pod -n "${NAMESPACE}" -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
nslookup "${SERVICE}.${NAMESPACE}.svc.cluster.local"
# Test service connectivity
kubectl run test-pod --image=curlimages/curl --rm -it --restart=Never -n "${NAMESPACE}" -- \
curl -v http://${SERVICE}.${NAMESPACE}.svc.cluster.local:8080
DNS Resolution Problems¶
Troubleshoot DNS:
#!/bin/bash
# scripts/troubleshoot-dns.sh
NAMESPACE="${1:-atp-production}"
echo "🔍 Troubleshooting DNS resolution..."
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# Test DNS from pod
POD=$(kubectl get pod -n "${NAMESPACE}" -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
nslookup kubernetes.default.svc.cluster.local
# Check DNS configuration
kubectl get configmap coredns -n kube-system -o yaml
Network Policy Blocking Traffic¶
Troubleshoot Network Policies:
#!/bin/bash
# scripts/troubleshoot-network-policy.sh
NAMESPACE="${1:-atp-production}"
echo "🔒 Troubleshooting network policies..."
# List network policies
kubectl get networkpolicies -n "${NAMESPACE}"
# Check if default deny is blocking
kubectl get networkpolicy default-deny-all -n "${NAMESPACE}" && \
echo "⚠️ Default deny policy found"
# Test connectivity between pods (pod DNS names use the dashed pod IP,
# not the pod name, so curl the target pod's IP directly)
SOURCE_POD=$(kubectl get pod -n "${NAMESPACE}" -l app=atp-gateway -o jsonpath='{.items[0].metadata.name}')
TARGET_POD=$(kubectl get pod -n "${NAMESPACE}" -l app=atp-ingestion -o jsonpath='{.items[0].metadata.name}')
TARGET_IP=$(kubectl get pod "${TARGET_POD}" -n "${NAMESPACE}" -o jsonpath='{.status.podIP}')
if [ -n "${SOURCE_POD}" ] && [ -n "${TARGET_IP}" ]; then
echo "Testing connectivity from ${SOURCE_POD} to ${TARGET_POD} (${TARGET_IP})"
kubectl exec -it "${SOURCE_POD}" -n "${NAMESPACE}" -- \
curl -v "http://${TARGET_IP}:8080" || \
echo "❌ Connection blocked"
fi
# Removing policies is a last resort for non-production testing only
echo "To test without network policies (non-production only):"
echo "kubectl delete networkpolicy -n ${NAMESPACE} --all"
Performance Issues¶
Slow Reconciliation¶
Troubleshoot Slow Reconciliation:
#!/bin/bash
# scripts/troubleshoot-slow-reconciliation.sh
KUSTOMIZATION="${1:-apps-production}"
echo "⏱️ Troubleshooting slow reconciliation..."
# Check reconciliation duration
kubectl get kustomization "${KUSTOMIZATION}" -n flux-system -o json | \
jq -r '.status.conditions[] | select(.type == "Ready") |
"Last reconciliation: \(.lastTransitionTime)\nMessage: \(.message)"'
# Check reconciliation interval
INTERVAL=$(kubectl get kustomization "${KUSTOMIZATION}" -n flux-system -o jsonpath='{.spec.interval}')
echo "Reconciliation interval: ${INTERVAL}"
# Check number of resources
RESOURCE_COUNT=$(kubectl get kustomization "${KUSTOMIZATION}" -n flux-system -o json | \
jq '.status.inventory.entries | length')
echo "Number of resources: ${RESOURCE_COUNT}"
# Check for large manifests
echo "Checking manifest sizes..."
find manifests/ -name "*.yaml" -exec wc -l {} \; | sort -rn | head -5
# Force reconciliation and measure time
echo "Forcing reconciliation and measuring time..."
time flux reconcile kustomization "${KUSTOMIZATION}" -n flux-system --with-source
Resource Contention¶
Check Resource Contention:
#!/bin/bash
# scripts/check-resource-contention.sh
NAMESPACE="${1:-atp-production}"
echo "📊 Checking resource contention..."
# Check node resources
kubectl top nodes
# Check pod resources
kubectl top pods -n "${NAMESPACE}"
# Check for resource quotas
kubectl get resourcequota -n "${NAMESPACE}"
# Check for limit ranges
kubectl get limitrange -n "${NAMESPACE}"
# Find pods requesting too many resources
kubectl get pods -n "${NAMESPACE}" -o json | \
jq -r '.items[] | select(.spec.containers[].resources.requests.cpu != null) |
"\(.metadata.name): CPU=\(.spec.containers[].resources.requests.cpu) Memory=\(.spec.containers[].resources.requests.memory)"'
Debugging Tools¶
kubectl Commands¶
Essential kubectl Commands:
# Get resources
kubectl get all -n atp-production
kubectl get pods -n atp-production -o wide
kubectl get events -n atp-production --sort-by='.lastTimestamp'
# Describe resources
kubectl describe pod <pod-name> -n atp-production
kubectl describe deployment <deployment> -n atp-production
# View logs
kubectl logs <pod-name> -n atp-production
kubectl logs <pod-name> -n atp-production --previous # Previous container
kubectl logs -l app=atp-gateway -n atp-production --tail=100
# Execute commands
kubectl exec -it <pod-name> -n atp-production -- /bin/sh
kubectl exec <pod-name> -n atp-production -- env
# Port forwarding
kubectl port-forward <pod-name> 8080:8080 -n atp-production
# Debugging
kubectl run debug-pod --image=curlimages/curl --rm -it --restart=Never -n atp-production
kubectl debug <pod-name> -n atp-production -it --image=busybox
Flux CLI Commands¶
Essential Flux Commands:
# Check FluxCD status
flux check
flux get all -A
# Get resources
flux get sources git -A
flux get kustomizations -A
flux get helmreleases -A
# Reconcile resources
flux reconcile source git atp-gitops-production -n flux-system
flux reconcile kustomization apps-production -n flux-system --with-source
flux reconcile helmrelease ingress-nginx -n ingress-nginx
# Suspend/Resume
flux suspend kustomization apps-production -n flux-system
flux resume kustomization apps-production -n flux-system
# Diff and dry-run (both require --path to a local checkout of the manifests)
flux diff kustomization apps-production -n flux-system --path ./apps/atp-gateway/overlays/production
flux build kustomization apps-production -n flux-system --path ./apps/atp-gateway/overlays/production
# View logs
flux logs --kind=Kustomization --name=apps-production -n flux-system
Azure CLI for AKS Debugging¶
Azure CLI AKS Commands:
# Get cluster credentials
az aks get-credentials --resource-group atp-production-rg --name atp-production-aks
# Get cluster info
az aks show --resource-group atp-production-rg --name atp-production-aks
# List node pools
az aks nodepool list --resource-group atp-production-rg --cluster-name atp-production-aks
# Scale node pool
az aks nodepool scale \
--resource-group atp-production-rg \
--cluster-name atp-production-aks \
--name systempool \
--node-count 5
# Get admin credentials (bypasses Azure AD; break-glass debugging only)
az aks get-credentials --resource-group atp-production-rg --name atp-production-aks --admin
kubectl get nodes
Log Analysis in Log Analytics¶
KQL Queries for Troubleshooting:
// FluxCD reconciliation failures
ContainerLog
| where Namespace == "flux-system"
| where LogEntry contains "error" or LogEntry contains "failed"
| where LogEntry contains "reconcile"
| project TimeGenerated, PodName, LogEntry
| order by TimeGenerated desc
// Pod restart analysis
KubePodInventory
| where Namespace == "atp-production"
| where ContainerRestartCount > 0
| project TimeGenerated, Namespace, PodName, ContainerRestartCount
| order by ContainerRestartCount desc
// Image pull errors
ContainerLog
| where LogEntry contains "ImagePullBackOff" or LogEntry contains "ErrImagePull"
| project TimeGenerated, Namespace, PodName, LogEntry
| order by TimeGenerated desc
// Health check failures
ContainerLog
| where LogEntry contains "readiness probe failed" or LogEntry contains "liveness probe failed"
| project TimeGenerated, Namespace, PodName, LogEntry
| order by TimeGenerated desc
Common Error Patterns¶
Error Catalog¶
Common Errors and Solutions:
| Error | Cause | Solution |
|---|---|---|
| `ImagePullBackOff` | Image not found or auth failed | Check ACR credentials, verify image exists |
| `CrashLoopBackOff` | Application crashing | Check application logs, health endpoints |
| Pending pod | Insufficient resources | Check node capacity, resource quotas |
| `ErrImagePull` | Cannot pull image | Fix ACR authentication, network policies |
| `CreateContainerConfigError` | Secret/config not found | Check secret exists, mount paths |
| Readiness probe failed | Health endpoint not ready | Check application startup, probe config |
| Network policy blocking | Traffic blocked | Update network policy rules |
| PVC pending | Storage class not found | Check StorageClass exists |
| Reconcile timeout | Too many resources | Increase timeout, optimize manifests |
Decision Tree for Common Errors:
graph TD
START[Pod Not Running] --> CHECK{Error Type?}
CHECK -->|ImagePullBackOff| IMAGE[Check ACR Auth<br/>Verify Image Exists]
CHECK -->|CrashLoopBackOff| LOGS[Check Logs<br/>Health Endpoints]
CHECK -->|Pending| RESOURCES[Check Resources<br/>Node Capacity]
CHECK -->|NotReady| PROBE[Check Probes<br/>Application Health]
IMAGE --> FIX1[Fix Credentials<br/>Rebuild Image]
LOGS --> FIX2[Fix Application<br/>Update Config]
RESOURCES --> FIX3[Scale Nodes<br/>Adjust Requests]
PROBE --> FIX4[Fix Endpoints<br/>Adjust Probes]
FIX1 --> RECONCILE[Reconcile]
FIX2 --> RECONCILE
FIX3 --> RECONCILE
FIX4 --> RECONCILE
RECONCILE --> DONE[Verify Fixed]
style START fill:#FFB6C1
style DONE fill:#90EE90
Escalation Procedures¶
When to Escalate¶
Escalation Triggers:
| Severity | Criteria | Response Time |
|---|---|---|
| P0 - Critical | Production down, data loss | Immediate (15 min) |
| P1 - High | Partial outage, degraded performance | 1 hour |
| P2 - Medium | Non-critical issue, workaround available | 4 hours |
| P3 - Low | Minor issue, cosmetic | Next business day |
Escalation Decision Tree:
graph TD
START[Issue Detected] --> IMPACT{Impact?}
IMPACT -->|Production Down| P0[P0 - Escalate Immediately]
IMPACT -->|Degraded Service| P1[P1 - Escalate within 1h]
IMPACT -->|Workaround Available| P2[P2 - Escalate within 4h]
IMPACT -->|Minor Issue| P3[P3 - Next Business Day]
P0 --> ONCALL[Page On-Call Engineer]
P1 --> TEAM[Notify Team Lead]
P2 --> TICKET[Create Ticket]
P3 --> BACKLOG[Add to Backlog]
style P0 fill:#FF0000
style P1 fill:#FFA500
style P2 fill:#FFFF00
style P3 fill:#90EE90
Information to Collect¶
Pre-Escalation Checklist:
#!/bin/bash
# scripts/collect-debug-info.sh
ISSUE="${1}"
NAMESPACE="${2:-atp-production}"
echo "📋 Collecting debug information for escalation..."
# Create debug directory
DEBUG_DIR="debug-$(date +%Y%m%d-%H%M%S)"
mkdir -p "${DEBUG_DIR}"
# Cluster info
kubectl cluster-info > "${DEBUG_DIR}/cluster-info.txt"
kubectl get nodes -o wide > "${DEBUG_DIR}/nodes.txt"
# Resource status
kubectl get all -n "${NAMESPACE}" > "${DEBUG_DIR}/resources.txt"
kubectl get events -n "${NAMESPACE}" --sort-by='.lastTimestamp' > "${DEBUG_DIR}/events.txt"
# FluxCD status
flux get all -A > "${DEBUG_DIR}/flux-status.txt"
kubectl get kustomization -A -o yaml > "${DEBUG_DIR}/kustomizations.yaml"
# Logs
kubectl logs -n flux-system -l app=kustomize-controller --tail=200 > "${DEBUG_DIR}/flux-logs.txt"
# Application logs (kubectl logs needs a pod or selector; adjust the label to the affected app)
kubectl logs -n "${NAMESPACE}" -l app=atp-gateway --all-containers --tail=100 > "${DEBUG_DIR}/app-logs.txt"
# Network policies
kubectl get networkpolicies -n "${NAMESPACE}" -o yaml > "${DEBUG_DIR}/network-policies.yaml"
# Secrets (sanitized)
kubectl get secrets -n "${NAMESPACE}" -o json | \
jq '.items[] | {name: .metadata.name, type: .type}' > "${DEBUG_DIR}/secrets-list.json"
# Package debug info
tar -czf "${DEBUG_DIR}.tar.gz" "${DEBUG_DIR}"
echo "✅ Debug information collected: ${DEBUG_DIR}.tar.gz"
Incident Severity Levels¶
Severity Level Definitions:
| Level | Description | Examples | Response |
|---|---|---|---|
| P0 - Critical | Complete service outage, data loss risk | All pods down, database inaccessible | Immediate, on-call escalation |
| P1 - High | Partial outage, significant impact | 50% pods down, slow response times | 1 hour, team notification |
| P2 - Medium | Degraded service, workaround available | Single service down, minor features broken | 4 hours, ticket creation |
| P3 - Low | Minor issue, no user impact | Documentation issue, cosmetic bug | Next business day, backlog |
Summary: Troubleshooting GitOps Issues¶
- FluxCD Sync Failures: Authentication issues (Git credentials), manifest syntax errors, resource conflicts (already exists), timeout errors, network connectivity issues
- Drift Detection and Resolution: Identifying drifted resources, manual changes detection, revert drift vs accept change decision tree, investigating drift causes
- Image Pull Errors: ACR authentication failures, image not found, ImagePullBackOff troubleshooting with diagnostic scripts
- Resource Conflicts: "already exists" errors, immutable field errors, owner reference conflicts, resolution strategies
- Secret Access Failures: Workload Identity misconfiguration, Key Vault permission issues, ExternalSecret sync failures
- Health Check Failures: Readiness probe timeouts, liveness probe failures, debugging health endpoints
- Networking Issues: Service discovery failures, DNS resolution problems, network policy blocking traffic
- Performance Issues: Slow reconciliation, resource contention, high CPU/memory usage, throttling and rate limiting
- Debugging Tools: kubectl commands (get, describe, logs, exec), Flux CLI commands (get, reconcile, suspend), Azure CLI for AKS, Log Analytics KQL queries
- Common Error Patterns: Error catalog with solutions, decision trees for troubleshooting, known issues and workarounds
- Escalation Procedures: When to escalate (severity levels), who to escalate to, information to collect (pre-escalation checklist), incident severity levels (P0-P3)
Day 2 Operations & Maintenance¶
Purpose: Define comprehensive day 2 operations, maintenance procedures, upgrade processes, capacity planning, security patching, performance tuning, and operational excellence practices for ATP's GitOps deployments, ensuring reliable, secure, and efficient long-term platform operations.
Routine Maintenance Tasks¶
Daily: Monitoring Checks, Alert Review¶
Daily Maintenance Checklist:
#!/bin/bash
# scripts/daily-maintenance-check.sh
echo "📋 Daily Maintenance Checklist - $(date +%Y-%m-%d)"
echo "=============================================="
# Check cluster health
echo "🏥 1. Cluster Health Check"
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
# Check FluxCD status
echo "🔄 2. FluxCD Status"
flux get all -A | grep -v Ready || echo " ⚠️ Some FluxCD resources not ready"
# Review critical alerts
echo "🚨 3. Critical Alerts Review"
# Query Azure Monitor for critical alerts from last 24 hours
cat <<EOF
Azure Monitor Query:
AzureMetrics
| where TimeGenerated > ago(24h)
| where MetricName contains "error" or MetricName contains "failure"
| where Value > 0
| summarize count() by MetricName
| order by count_ desc
EOF
# Check certificate expiration
echo "🔐 4. Certificate Expiration Check"
kubectl get certificates --all-namespaces -o json | \
jq -r '.items[] | select(.status.conditions[]?.status == "True") |
"\(.metadata.namespace)/\(.metadata.name): Expires \(.status.notAfter)"' | \
while read cert; do
EXPIRY=$(echo "${cert}" | cut -d' ' -f3-)
DAYS_LEFT=$(( ($(date -d "${EXPIRY}" +%s) - $(date +%s)) / 86400 ))
if [ "${DAYS_LEFT}" -lt 30 ]; then
echo " ⚠️ ${cert} (${DAYS_LEFT} days remaining)"
fi
done
# Check backup status
echo "💾 5. Backup Status"
velero backup get -n velero | head -6 # header plus the first five backups
# Check resource utilization
echo "📊 6. Resource Utilization"
kubectl top nodes
kubectl top pods -n atp-production --sort-by=memory | head -10
echo "✅ Daily checks complete"
Daily Alert Review Procedure:
## Daily Alert Review Process
### Critical Alerts (P0/P1)
1. Review all critical alerts from last 24 hours
2. Verify alerts are actionable (not false positives)
3. Document any new alert patterns
4. Escalate unresolved critical alerts
### Warning Alerts
1. Review warning alerts weekly (not daily)
2. Tune alert thresholds if needed
3. Document patterns for capacity planning
### Alert Noise Reduction
1. Disable or adjust noisy alerts
2. Add alert grouping rules
3. Update alert runbooks
Weekly: Capacity Planning, Cost Review¶
Weekly Maintenance Checklist:
#!/bin/bash
# scripts/weekly-maintenance-check.sh
echo "📋 Weekly Maintenance Checklist - Week $(date +%V)"
echo "=============================================="
# Capacity planning
echo "📈 1. Capacity Planning Review"
# Check node utilization trends
kubectl top nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory
# Check pod density
POD_COUNT=$(kubectl get pods --all-namespaces --field-selector=status.phase=Running --no-headers | wc -l)
NODE_COUNT=$(kubectl get nodes --no-headers | wc -l)
AVG_PODS_PER_NODE=$((POD_COUNT / NODE_COUNT))
echo " Average pods per node: ${AVG_PODS_PER_NODE}"
# Storage growth trend
echo "💾 2. Storage Growth Analysis"
kubectl get pvc --all-namespaces -o json | \
jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): \(.status.capacity.storage)"' | \
sort | uniq -c
# Cost review
echo "💰 3. Cost Review"
cat <<EOF
Azure Cost Management Query:
UsageDetails
| where TimeGenerated > ago(7d)
| where ResourceGroup contains "atp-production"
| summarize TotalCost=sum(Cost) by ResourceType
| order by TotalCost desc
EOF
# Review pending updates (compare listed revisions against the latest commits/charts)
echo "🔄 4. Pending Updates Review"
flux get sources all -A
flux get helmreleases -A
# Review failed deployments
echo "❌ 5. Failed Deployment Review"
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded -o wide
echo "✅ Weekly checks complete"
Weekly Capacity Planning Report:
#!/bin/bash
# scripts/capacity-planning-report.sh
OUTPUT_FILE="capacity-report-$(date +%Y%m%d).md"
cat > "${OUTPUT_FILE}" <<EOF
# Capacity Planning Report - $(date +%Y-%m-%d)
## Node Utilization
\`\`\`
$(kubectl top nodes)
\`\`\`
## Pod Distribution
\`\`\`
$(kubectl get pods --all-namespaces -o wide | awk '{print $1, $8}' | sort | uniq -c)
\`\`\`
## Storage Usage
\`\`\`
$(kubectl get pvc --all-namespaces -o json | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): \(.status.capacity.storage)"')
\`\`\`
## Recommendations
- Review node pool sizes based on utilization trends
- Plan for expected growth in next quarter
- Consider right-sizing underutilized resources
EOF
echo "✅ Report generated: ${OUTPUT_FILE}"
Monthly: Security Patches, Access Reviews¶
Monthly Maintenance Checklist:
#!/bin/bash
# scripts/monthly-maintenance-check.sh
echo "📋 Monthly Maintenance Checklist - $(date +%Y-%m)"
echo "=============================================="
# Security patches
echo "🔒 1. Security Patch Review"
# Check for available Kubernetes version upgrades
az aks get-upgrades --resource-group atp-production-rg --name atp-production-aks
# Check container image vulnerabilities
echo " Scanning for vulnerabilities..."
# Use Trivy or Azure Defender to scan images
# Access reviews
echo "👥 2. Access Reviews"
# List all RBAC bindings (subjects can be absent on some bindings, hence []?)
kubectl get rolebindings --all-namespaces -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): \(.subjects[]?.name)"'
kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | "\(.metadata.name): \(.subjects[]?.name)"'
# Review ServiceAccount usage
echo " ServiceAccount review..."
kubectl get serviceaccounts --all-namespaces -o json | \
jq -r '.items[] | select(.metadata.name != "default") | "\(.metadata.namespace)/\(.metadata.name)"'
# Backup verification
echo "💾 3. Backup Verification"
# Test restore from latest backup
velero backup get --namespace velero | head -5
# Compliance check
echo "✅ 4. Compliance Check"
# Check network policies are applied
kubectl get networkpolicies --all-namespaces --no-headers | wc -l
# Check pod security standards
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(.spec.securityContext == null) | "\(.metadata.namespace)/\(.metadata.name): Missing security context"'
echo "✅ Monthly checks complete"
Quarterly: DR Drills, Policy Updates¶
Quarterly Maintenance Schedule:
gantt
title Quarterly Maintenance Calendar
dateFormat YYYY-MM-DD
section Q1
DR Drill Production :2024-01-15, 1d
Policy Review :2024-01-20, 2d
Security Audit :2024-01-25, 3d
section Q2
DR Drill Production :2024-04-15, 1d
Policy Review :2024-04-20, 2d
Security Audit :2024-04-25, 3d
section Q3
DR Drill Production :2024-07-15, 1d
Policy Review :2024-07-20, 2d
Security Audit :2024-07-25, 3d
section Q4
DR Drill Production :2024-10-15, 1d
Policy Review :2024-10-20, 2d
Security Audit :2024-10-25, 3d
Quarterly DR Drill Procedure:
#!/bin/bash
# scripts/quarterly-dr-drill.sh
QUARTER="${1:-Q1}"
YEAR="${2:-2024}"
echo "🧪 Quarterly DR Drill - ${QUARTER} ${YEAR}"
echo "========================================="
# Step 1: Select random backup
echo "📥 Step 1: Selecting test backup..."
BACKUP=$(velero backup get --namespace velero | grep atp-production | tail -5 | shuf -n 1 | awk '{print $1}')
echo " Using backup: ${BACKUP}"
# Step 2: Restore to test namespace (namespace and restore names must be lowercase RFC 1123)
echo "🔄 Step 2: Restoring to test namespace..."
QUARTER_LC=$(echo "${QUARTER}" | tr '[:upper:]' '[:lower:]')
TEST_NS="atp-production-dr-test-${QUARTER_LC}-${YEAR}"
velero restore create "dr-drill-${QUARTER_LC}-${YEAR}-$(date +%Y%m%d)" \
  --from-backup "${BACKUP}" \
  --namespace-mappings atp-production:${TEST_NS} \
  --wait
# Step 3: Validate restore
echo "✅ Step 3: Validating restore..."
kubectl get all -n "${TEST_NS}"
kubectl get pvc -n "${TEST_NS}"
# Step 4: Test application functionality
echo "🧪 Step 4: Testing application..."
# Run smoke tests against restored environment
# Step 5: Document results
echo "📝 Step 5: Documenting results..."
cat > "dr-drill-report-${QUARTER}-${YEAR}.md" <<EOF
# DR Drill Report - ${QUARTER} ${YEAR}
**Date**: $(date +%Y-%m-%d)
**Backup Used**: ${BACKUP}
**Test Namespace**: ${TEST_NS}
## Results
- Restore: ✅ Success
- Application Functionality: ✅ Verified
- Data Integrity: ✅ Verified
## Lessons Learned
- [Add lessons learned here]
## Recommendations
- [Add recommendations here]
EOF
# Step 6: Cleanup
read -p "Delete test namespace ${TEST_NS}? (yes/no): " CONFIRM
if [ "${CONFIRM}" = "yes" ]; then
kubectl delete namespace "${TEST_NS}"
echo "✅ Test namespace deleted"
fi
echo "✅ DR drill complete"
FluxCD Upgrades¶
Upgrade Planning¶
FluxCD Upgrade Planning Checklist:
## FluxCD Upgrade Planning
### Pre-Upgrade
- [ ] Review FluxCD release notes
- [ ] Check breaking changes
- [ ] Test in dev environment first
- [ ] Schedule maintenance window
- [ ] Notify stakeholders
- [ ] Prepare rollback plan
### Upgrade Steps
1. Backup current FluxCD configuration
2. Upgrade CLI tools
3. Test upgrade in dev
4. Schedule production upgrade
5. Execute upgrade
6. Validate functionality
7. Monitor for issues
### Post-Upgrade
- [ ] Verify all Kustomizations working
- [ ] Check GitRepository connections
- [ ] Validate HelmReleases
- [ ] Review reconciliation logs
- [ ] Update documentation
FluxCD Version Compatibility Matrix:
| FluxCD Version | Kubernetes Min | Kubernetes Max | Breaking Changes |
|---|---|---|---|
| 2.1.x | 1.24+ | 1.27 | None |
| 2.2.x | 1.24+ | 1.28 | CRD changes |
| 2.3.x | 1.25+ | 1.29 | API version updates |
Testing in Dev/Test First¶
Test Upgrade Procedure:
#!/bin/bash
# scripts/test-fluxcd-upgrade.sh
TARGET_VERSION="${1:-v2.2.0}" # flux install expects a v-prefixed version
NAMESPACE="${2:-flux-system}"
echo "🧪 Testing FluxCD upgrade to ${TARGET_VERSION}"
# Step 1: Backup current configuration
echo "💾 Step 1: Backing up current configuration..."
kubectl get gitrepository,kustomization,helmrelease -n "${NAMESPACE}" -o yaml > \
flux-backup-$(date +%Y%m%d).yaml
# Step 2: Install new FluxCD CLI
echo "⬇️ Step 2: Installing FluxCD CLI ${TARGET_VERSION}..."
curl -s https://fluxcd.io/install.sh | sudo bash
flux version
# Step 3: Upgrade FluxCD
echo "🔄 Step 3: Upgrading FluxCD..."
flux install --version=${TARGET_VERSION} --namespace="${NAMESPACE}"
# Step 4: Wait for controllers to be ready
echo "⏳ Step 4: Waiting for controllers..."
kubectl wait --for=condition=ready pod -l app=source-controller -n "${NAMESPACE}" --timeout=300s
kubectl wait --for=condition=ready pod -l app=kustomize-controller -n "${NAMESPACE}" --timeout=300s
kubectl wait --for=condition=ready pod -l app=helm-controller -n "${NAMESPACE}" --timeout=300s
# Step 5: Validate functionality
echo "✅ Step 5: Validating functionality..."
flux check
flux get all -A
# Step 6: Test reconciliation
echo "🔄 Step 6: Testing reconciliation..."
flux reconcile source git atp-gitops-dev -n "${NAMESPACE}"
echo "✅ Upgrade test complete"
Upgrade Procedure¶
Production Upgrade Runbook:
#!/bin/bash
# scripts/upgrade-fluxcd-production.sh
TARGET_VERSION="${1}"
MAINTENANCE_WINDOW="${2}" # e.g., "2024-01-15 02:00"
if [ -z "${TARGET_VERSION}" ]; then
echo "Usage: $0 <target-version> [maintenance-window]"
exit 1
fi
echo "🔄 FluxCD Production Upgrade to ${TARGET_VERSION}"
echo "Maintenance Window: ${MAINTENANCE_WINDOW}"
# Pre-upgrade checklist
echo "📋 Pre-Upgrade Checklist"
echo "1. Backup all FluxCD resources"
kubectl get gitrepository,kustomization,helmrelease -A -o yaml > \
flux-production-backup-$(date +%Y%m%d-%H%M%S).yaml
echo "2. Verify all Kustomizations are healthy"
flux get kustomizations -A | grep -v Ready && echo "⚠️ Some Kustomizations not ready" && exit 1
echo "3. Suspend auto-reconciliation for critical resources"
# flux suspend kustomization critical-apps-production -n flux-system
# Upgrade
echo "🔄 Upgrading FluxCD..."
flux install --version=${TARGET_VERSION} --namespace=flux-system
# Wait for readiness
echo "⏳ Waiting for controllers to be ready..."
kubectl wait --for=condition=ready pod -l app=source-controller -n flux-system --timeout=300s
kubectl wait --for=condition=ready pod -l app=kustomize-controller -n flux-system --timeout=300s
kubectl wait --for=condition=ready pod -l app=helm-controller -n flux-system --timeout=300s
# Resume reconciliation
# flux resume kustomization critical-apps-production -n flux-system
# Validate
echo "✅ Validating upgrade..."
flux check
flux get all -A
# Force reconciliation of the key resources (flux reconcile takes a name, not -A)
echo "🔄 Forcing reconciliation..."
flux reconcile source git atp-gitops-production -n flux-system
flux reconcile kustomization apps-production -n flux-system --with-source
echo "✅ Upgrade complete"
Rollback Plan¶
FluxCD Rollback Procedure:
#!/bin/bash
# scripts/rollback-fluxcd.sh
PREVIOUS_VERSION="${1}"
BACKUP_FILE="${2}"
if [ -z "${PREVIOUS_VERSION}" ] || [ -z "${BACKUP_FILE}" ]; then
echo "Usage: $0 <previous-version> <backup-file>"
exit 1
fi
echo "⏪ Rolling back FluxCD to ${PREVIOUS_VERSION}"
# Step 1: Suspend all reconciliation
echo "⏸️ Suspending reconciliation..."
flux suspend kustomization -A
flux suspend helmrelease -A
# Step 2: Uninstall current FluxCD
echo "🗑️ Uninstalling current FluxCD..."
flux uninstall --silent
# Step 3: Install previous version
echo "⬇️ Installing previous version..."
flux install --version=${PREVIOUS_VERSION} --namespace=flux-system
# Step 4: Restore configuration from backup
echo "📥 Restoring configuration..."
kubectl apply -f "${BACKUP_FILE}"
# Step 5: Resume reconciliation
echo "▶️ Resuming reconciliation..."
flux resume kustomization -A
flux resume helmrelease -A
# Step 6: Validate
echo "✅ Validating rollback..."
flux check
flux get all -A
echo "✅ Rollback complete"
Post-Upgrade Validation¶
Post-Upgrade Validation Checklist:
#!/bin/bash
# scripts/validate-fluxcd-upgrade.sh
echo "✅ Post-Upgrade Validation"
# Check FluxCD version
echo "📋 1. FluxCD Version"
flux version
# Check all controllers are ready
echo "🏥 2. Controller Health"
flux check
# Verify all sources are ready (the READY column reports True/False)
echo "📦 3. Source Status"
flux get sources all -A | grep -E "False|Unknown" && echo "⚠️ Some sources not ready"
# Verify all Kustomizations are ready
echo "🔄 4. Kustomization Status"
flux get kustomizations -A | grep -E "False|Unknown" && echo "⚠️ Some Kustomizations not ready"
# Verify all HelmReleases are ready
echo "📦 5. HelmRelease Status"
flux get helmreleases -A | grep -E "False|Unknown" && echo "⚠️ Some HelmReleases not ready"
# Test reconciliation
echo "🔄 6. Testing Reconciliation"
flux reconcile source git atp-gitops-production -n flux-system
flux reconcile kustomization apps-production -n flux-system --with-source
# Check for errors in logs
echo "📜 7. Checking for Errors"
kubectl logs -n flux-system -l app=kustomize-controller --tail=100 | grep -i error
echo "✅ Validation complete"
AKS Cluster Patching¶
Kubernetes Version Upgrades¶
AKS Upgrade Planning:
#!/bin/bash
# scripts/plan-aks-upgrade.sh
CLUSTER="${1:-atp-production-aks}"
RESOURCE_GROUP="${2:-atp-production-rg}"
echo "📋 AKS Upgrade Planning for ${CLUSTER}"
# Check current version
CURRENT_VERSION=$(az aks show \
--resource-group "${RESOURCE_GROUP}" \
--name "${CLUSTER}" \
--query kubernetesVersion -o tsv)
echo "Current version: ${CURRENT_VERSION}"
# Check available upgrades
echo "Available upgrades:"
az aks get-upgrades \
--resource-group "${RESOURCE_GROUP}" \
--name "${CLUSTER}" \
--output table
# Check node pool versions
echo "Node pool versions:"
az aks nodepool list \
--resource-group "${RESOURCE_GROUP}" \
--cluster-name "${CLUSTER}" \
--query '[].{Name:name, Version:orchestratorVersion}' \
--output table
AKS Upgrade Procedure:
#!/bin/bash
# scripts/upgrade-aks-cluster.sh
CLUSTER="${1:-atp-production-aks}"
RESOURCE_GROUP="${2:-atp-production-rg}"
TARGET_VERSION="${3}"
if [ -z "${TARGET_VERSION}" ]; then
echo "Usage: $0 <cluster> <resource-group> <target-version>"
exit 1
fi
echo "🔄 Upgrading AKS cluster to ${TARGET_VERSION}"
# Step 1: Pre-upgrade validation
echo "📋 Step 1: Pre-upgrade validation..."
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
# Step 2: Upgrade control plane
echo "⬆️ Step 2: Upgrading control plane..."
az aks upgrade \
--resource-group "${RESOURCE_GROUP}" \
--name "${CLUSTER}" \
--kubernetes-version "${TARGET_VERSION}" \
--control-plane-only
# Step 3: Wait for control plane upgrade
echo "⏳ Step 3: Waiting for control plane..."
az aks show \
--resource-group "${RESOURCE_GROUP}" \
--name "${CLUSTER}" \
--query "{Status:powerState.code, Version:kubernetesVersion}" \
--output table
# Step 4: Upgrade node pools
echo "⬆️ Step 4: Upgrading node pools..."
NODEPOOLS=$(az aks nodepool list \
--resource-group "${RESOURCE_GROUP}" \
--cluster-name "${CLUSTER}" \
--query '[].name' -o tsv)
for POOL in ${NODEPOOLS}; do
echo " Upgrading node pool: ${POOL}"
az aks nodepool upgrade \
--resource-group "${RESOURCE_GROUP}" \
--cluster-name "${CLUSTER}" \
--name "${POOL}" \
--kubernetes-version "${TARGET_VERSION}" \
--max-surge 33%
done
# Step 5: Post-upgrade validation
echo "✅ Step 5: Post-upgrade validation..."
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
echo "✅ Upgrade complete"
Node OS Patching¶
Node OS Patching Schedule:
# platform/node-patching/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: node-os-patch-check
namespace: kube-system
spec:
schedule: "0 2 * * 0" # Every Sunday at 2 AM
jobTemplate:
spec:
template:
spec:
containers:
- name: patch-check
# The image must provide the Azure CLI, and the pod needs Azure credentials
# (e.g. Workload Identity); mcr.microsoft.com/azure-cli is an illustrative choice
image: mcr.microsoft.com/azure-cli:latest
command:
- /bin/sh
- -c
- |
echo "Checking for available OS patches..."
az aks nodepool show \
--resource-group ${RESOURCE_GROUP} \
--cluster-name ${CLUSTER_NAME} \
--name systempool \
--query "nodeImageVersion" -o tsv
restartPolicy: OnFailure
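When a newer node image is available, it can be rolled out without changing the Kubernetes version. A sketch using the node-image-only upgrade (resource names as used above):
az aks nodepool upgrade \
  --resource-group atp-production-rg \
  --cluster-name atp-production-aks \
  --name systempool \
  --node-image-only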
Node Pool Rotation¶
Node Pool Rotation for Zero-Downtime Patching:
#!/bin/bash
# scripts/rotate-nodepool.sh
CLUSTER="${1:-atp-production-aks}"
RESOURCE_GROUP="${2:-atp-production-rg}"
NODEPOOL="${3:-systempool}"
echo "🔄 Rotating node pool: ${NODEPOOL}"
# Step 1: Create new node pool
echo "➕ Step 1: Creating new node pool..."
NEW_POOL="${NODEPOOL}2" # node pool names must be lowercase alphanumeric (max 12 chars)
az aks nodepool add \
--resource-group "${RESOURCE_GROUP}" \
--cluster-name "${CLUSTER}" \
--name "${NEW_POOL}" \
--node-count 3 \
--node-vm-size Standard_D4s_v3 \
--max-surge 33%
# Step 2: Cordon old nodes
echo "🚫 Step 2: Cordoning old nodes..."
OLD_NODES=$(kubectl get nodes -l agentpool=${NODEPOOL} -o jsonpath='{.items[*].metadata.name}')
for NODE in ${OLD_NODES}; do
kubectl cordon "${NODE}"
done
# Step 3: Drain old nodes
echo "💧 Step 3: Draining old nodes..."
for NODE in ${OLD_NODES}; do
kubectl drain "${NODE}" --ignore-daemonsets --delete-emptydir-data --grace-period=300
done
# Step 4: Delete old node pool
echo "🗑️ Step 4: Deleting old node pool..."
az aks nodepool delete \
--resource-group "${RESOURCE_GROUP}" \
--cluster-name "${CLUSTER}" \
--name "${NODEPOOL}"
# Step 5: Adjust the new pool to steady-state capacity
# (AKS node pools cannot be renamed; keep the new name, or update
# any nodeSelector/agentpool references to point at it)
echo "🏷️ Step 5: Scaling new pool to steady state..."
az aks nodepool scale \
  --resource-group "${RESOURCE_GROUP}" \
  --cluster-name "${CLUSTER}" \
  --name "${NEW_POOL}" \
  --node-count 3
echo "✅ Node pool rotation complete"
Certificate Renewals¶
Monitoring Certificate Expiration¶
Certificate Expiration Monitoring:
#!/bin/bash
# scripts/monitor-certificate-expiration.sh
WARNING_DAYS="${1:-30}"
CRITICAL_DAYS="${2:-7}"
echo "🔐 Monitoring Certificate Expiration"
kubectl get certificates --all-namespaces -o json | \
  jq -r '.items[] | select(any(.status.conditions[]?; .type == "Ready" and .status == "True")) |
  "\(.metadata.namespace)/\(.metadata.name)|\(.status.notAfter)"' | \
while IFS='|' read -r CERT EXPIRY; do
if [ -n "${EXPIRY}" ]; then
EXPIRY_EPOCH=$(date -d "${EXPIRY}" +%s)
CURRENT_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - CURRENT_EPOCH) / 86400 ))
if [ "${DAYS_LEFT}" -lt "${CRITICAL_DAYS}" ]; then
echo "🔴 CRITICAL: ${CERT} expires in ${DAYS_LEFT} days"
elif [ "${DAYS_LEFT}" -lt "${WARNING_DAYS}" ]; then
echo "🟡 WARNING: ${CERT} expires in ${DAYS_LEFT} days"
else
echo "✅ OK: ${CERT} expires in ${DAYS_LEFT} days"
fi
fi
done
Certificate Expiration Alert (PrometheusRule):
# monitoring/alerts/certificate-expiration.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: certificate-expiration
namespace: monitoring
spec:
groups:
- name: certificate
interval: 1h
rules:
- alert: CertificateExpiringSoon
expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 30
for: 1h
labels:
severity: warning
annotations:
summary: "Certificate expiring soon"
description: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in {{ $value }} days"
- alert: CertificateExpiringVerySoon
expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
for: 1h
labels:
severity: critical
annotations:
summary: "Certificate expiring very soon"
description: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in {{ $value }} days"
Automatic Renewal with cert-manager¶
cert-manager Automatic Renewal Configuration:
# apps/atp-gateway/certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: atp-gateway-tls
namespace: atp-production
spec:
secretName: atp-gateway-tls
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuer
commonName: api.atp.connectsoft.example
dnsNames:
- api.atp.connectsoft.example
duration: 2160h # 90 days
renewBefore: 720h # Renew 30 days before expiration (automatic)
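The issuerRef above assumes a letsencrypt-prod ClusterIssuer exists in the cluster. A minimal sketch, assuming HTTP-01 challenges via the ingress-nginx controller referenced earlier (email is illustrative):
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@connectsoft.example
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx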
Manual Renewal Procedures¶
Manual Certificate Renewal:
#!/bin/bash
# scripts/manual-certificate-renewal.sh
CERT_NAME="${1}"
NAMESPACE="${2:-atp-production}"
echo "🔄 Manually renewing certificate: ${CERT_NAME}"
# Delete existing certificate (will trigger renewal)
kubectl delete certificate "${CERT_NAME}" -n "${NAMESPACE}"
# Wait for renewal
echo "⏳ Waiting for renewal..."
sleep 30
# Check new certificate status
kubectl get certificate "${CERT_NAME}" -n "${NAMESPACE}"
# Force cert-manager to reconcile
kubectl annotate certificate "${CERT_NAME}" -n "${NAMESPACE}" \
cert-manager.io/issue-temporary-certificate="true" --overwrite
Monitoring and Alerting Review¶
Reviewing Alert Noise¶
Alert Noise Analysis:
// Log Analytics: Analyze alert frequency
AzureActivity
| where OperationName == "Microsoft.Insights/metricAlerts/write"
| where TimeGenerated > ago(30d)
| summarize AlertCount=count() by Resource, AlertName
| order by AlertCount desc
| take 20
Alert Tuning Script:
#!/bin/bash
# scripts/tune-alerts.sh
ALERT_NAME="${1}"
echo "🎚️ Tuning alert: ${ALERT_NAME}"
# Query alert frequency
echo "📊 Alert frequency (last 30 days):"
# Use Azure Monitor API or Azure CLI
# Identify false positives
echo "❌ False positives to address:"
# Manual review required
# Adjust threshold
echo "⚙️ Current threshold: [threshold]"
echo "Suggested threshold: [new-threshold]"
# Update alert rule (on update, conditions are changed via --add-condition /
# --remove-conditions; --condition is only accepted at create time)
az monitor metrics alert update \
  --name "${ALERT_NAME}" \
  --resource-group atp-production-rg \
  --add-condition "avg Percentage CPU > 80" # Example
Adding New Alerts¶
Alert Creation Template:
# monitoring/alerts/template.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: atp-application-alerts
namespace: monitoring
spec:
groups:
- name: application
interval: 1m
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 5m
labels:
severity: warning
team: atp
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value }} errors/sec for {{ $labels.service }}"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 10m
labels:
severity: warning
team: atp
annotations:
summary: "High latency detected"
description: "P95 latency is {{ $value }}s for {{ $labels.service }}"
Capacity Planning¶
Monitoring Resource Usage Trends¶
Resource Usage Trend Analysis:
// Log Analytics: Node CPU utilization trend
Perf
| where ObjectName == "K8SNode"
| where CounterName == "cpuUsageNanoCores"
| where TimeGenerated > ago(90d)
| summarize AvgCPU=avg(CounterValue), MaxCPU=max(CounterValue) by bin(TimeGenerated, 1d), Computer
| render timechart
// Pod memory usage trend
Perf
| where ObjectName == "K8SContainer"
| where CounterName == "memoryWorkingSetBytes"
| where TimeGenerated > ago(90d)
| summarize AvgMemory=avg(CounterValue) by bin(TimeGenerated, 1d), InstanceName
| render timechart
Capacity Planning Report:
#!/bin/bash
# scripts/capacity-planning-report-monthly.sh (distinct from the weekly report script)
OUTPUT="capacity-planning-$(date +%Y%m).md"
cat > "${OUTPUT}" <<EOF
# Capacity Planning Report - $(date "+%B %Y")
## Current Utilization
### Nodes
\`\`\`
$(kubectl top nodes)
\`\`\`
### Pods per Node
- Average: $(kubectl get pods --all-namespaces --field-selector=status.phase=Running --no-headers | wc -l) pods / $(kubectl get nodes --no-headers | wc -l) nodes
## Trends (Last 90 Days)
[Insert trend charts from Log Analytics]
## Projections (Next 6 Months)
Based on current growth trends:
- Expected pod growth: X%
- Expected storage growth: Y%
- Expected cost increase: Z%
## Recommendations
1. **Node Pool Scaling**: [Recommendation]
2. **Storage**: [Recommendation]
3. **Resource Right-Sizing**: [Recommendation]
4. **Cost Optimization**: [Recommendation]
EOF
echo "✅ Report generated: ${OUTPUT}"
Security Patching¶
Container Base Image Updates¶
Base Image Update Procedure:
#!/bin/bash
# scripts/update-base-images.sh
echo "🔒 Scanning for base image updates..."
# Enumerate repositories (tsv output so the loop reads bare names, not table headers)
az acr repository list --name connectsoft -o tsv | \
while read -r repo; do
echo "Scanning: ${repo}"
done
# Trigger the rebuild task once; it is not repository-specific
az acr task run \
--registry connectsoft \
--name update-base-images \
--context https://github.com/connectsoft/atp.git
# Check manifest metadata for vulnerability review (gateway shown as an example)
az acr repository show-manifests \
--name connectsoft \
--repository atp/gateway \
-o json
Vulnerability Remediation¶
Vulnerability Remediation Workflow:
graph TD
START[Vulnerability Detected] --> SCAN[Scan Images]
SCAN --> SEVERITY{Severity?}
SEVERITY -->|Critical| IMMEDIATE[Immediate Remediation]
SEVERITY -->|High| PRIORITY[Priority Remediation]
SEVERITY -->|Medium| SCHEDULED[Scheduled Remediation]
SEVERITY -->|Low| BACKLOG[Add to Backlog]
IMMEDIATE --> PATCH[Apply Patch]
PRIORITY --> PATCH
SCHEDULED --> PATCH
PATCH --> TEST[Test in Dev/Test]
TEST --> DEPLOY[Deploy to Production]
DEPLOY --> VERIFY[Verify Fix]
style IMMEDIATE fill:#FF0000
style PRIORITY fill:#FFA500
style SCHEDULED fill:#FFFF00
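The severity gates above can also be enforced mechanically in CI. A minimal sketch using Trivy (assuming the scanner is installed; the image reference is illustrative):
#!/bin/bash
# Fail the pipeline on Critical/High findings, matching the Immediate and
# Priority remediation branches of the workflow above
trivy image --severity CRITICAL,HIGH --exit-code 1 \
connectsoft.azurecr.io/atp/gateway:latest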
Configuration Drift Audits¶
Scheduled Drift Detection Runs¶
Automated Drift Detection:
# platform/drift-detection/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: drift-detection
namespace: flux-system
spec:
schedule: "0 2 * * *" # Daily at 2 AM
jobTemplate:
spec:
template:
spec:
containers:
- name: drift-detection
image: fluxcd/flux-cli:latest
command:
- /bin/sh
- -c
- |
echo "Running drift detection..."
# flux diff builds from a local checkout, so this assumes the GitOps repo
# is cloned or mounted into the job at /workspace; it exits non-zero on drift
flux diff kustomization apps-production --path /workspace/apps/atp-gateway/overlays/production -n flux-system > /tmp/drift-report.txt || true
if [ -s /tmp/drift-report.txt ]; then
echo "⚠️ Drift detected!"
cat /tmp/drift-report.txt
# Send alert
else
echo "✅ No drift detected"
fi
restartPolicy: OnFailure
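Ad-Hoc Drift Check: The same comparison can be run on demand from a local clone of the GitOps repository (paths match the Kustomization definitions used throughout this document):
# Build the production overlay locally and diff it against live cluster state
flux diff kustomization apps-production \
--path ./apps/atp-gateway/overlays/production \
-n flux-system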
Performance Tuning¶
Reconciliation Interval Optimization¶
Optimize Reconciliation Intervals:
# Production: Less frequent (reduce load)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-production
namespace: flux-system
spec:
interval: 10m # Production: 10 minutes
path: ./apps/atp-gateway/overlays/production
---
# Dev: More frequent (faster feedback)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps-dev
namespace: flux-system
spec:
interval: 1m # Dev: 1 minute
path: ./apps/atp-gateway/overlays/dev
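A longer production interval does not delay urgent changes, because a reconciliation can always be forced on demand:
# Refresh the Git source and reconcile immediately, bypassing the 10m interval
flux reconcile kustomization apps-production -n flux-system --with-source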
Documentation Updates¶
Documentation Maintenance Checklist¶
## Documentation Maintenance
### Weekly
- [ ] Update runbooks with lessons learned
- [ ] Document new procedures
### Monthly
- [ ] Review and update architecture diagrams
- [ ] Update troubleshooting guides
- [ ] Review and archive outdated docs
### Quarterly
- [ ] Comprehensive documentation audit
- [ ] Update all procedures
- [ ] Knowledge base cleanup
Team Training¶
Onboarding Checklist¶
## New Team Member Onboarding
### Week 1
- [ ] Access to Azure DevOps
- [ ] Access to AKS clusters
- [ ] GitOps repository access
- [ ] Review architecture documentation
### Week 2
- [ ] Hands-on GitOps exercises
- [ ] Troubleshooting practice
- [ ] Shadow on-call rotation
### Week 3
- [ ] Independent task assignment
- [ ] Code review participation
- [ ] Documentation contribution
On-Call Procedures¶
On-Call Rotation Schedule¶
gantt
title On-Call Rotation Schedule
dateFormat YYYY-MM-DD
section Team A
Engineer 1 On-Call :2024-01-01, 7d
Engineer 2 On-Call :2024-01-08, 7d
section Team B
Engineer 3 On-Call :2024-01-15, 7d
Engineer 4 On-Call :2024-01-22, 7d
On-Call Handoff Procedure¶
## On-Call Handoff Checklist
### Daily Handoff
- [ ] Review incidents from last 24 hours
- [ ] Check for unresolved issues
- [ ] Review scheduled maintenance
- [ ] Verify alert configurations
### Weekly Handoff
- [ ] Review week's incidents
- [ ] Document lessons learned
- [ ] Update runbooks
- [ ] Share knowledge with team
Post-Incident Review Template¶
## Post-Incident Review (PIR)
**Incident ID**: [ID]
**Date**: [Date]
**Severity**: [P0/P1/P2/P3]
**Duration**: [Duration]
**Impact**: [Description]
### Timeline
- [Time] - Issue detected
- [Time] - Escalation
- [Time] - Resolution
### Root Cause
[Root cause analysis]
### Actions Taken
[Steps taken to resolve]
### Lessons Learned
- [Lesson 1]
- [Lesson 2]
### Action Items
- [ ] [Action item 1]
- [ ] [Action item 2]
### Prevention
[How to prevent similar incidents]
Summary: Day 2 Operations & Maintenance¶
- Routine Maintenance Tasks: Daily (monitoring checks, alert review), weekly (capacity planning, cost review), monthly (security patches, access reviews), quarterly (DR drills, policy updates) with automated checklists
- FluxCD Upgrades: Upgrade planning, testing in dev/test first, upgrade procedure, rollback plan, post-upgrade validation
- AKS Cluster Patching: Kubernetes version upgrades, node OS patching, upgrade scheduling, node pool rotation, validation and rollback
- Certificate Renewals: Monitoring certificate expiration, automatic renewal with cert-manager, manual renewal procedures, certificate rotation testing
- Monitoring and Alerting Review: Reviewing alert noise, tuning thresholds, disabling false positives, adding new alerts
- Capacity Planning: Monitoring resource usage trends, node pool scaling decisions, storage growth planning, cost forecasting
- Security Patching: OS security updates, container base image updates, dependency updates, vulnerability remediation workflow
- Configuration Drift Audits: Scheduled drift detection runs, comparing Git to live state, identifying configuration inconsistencies, remediation procedures
- Performance Tuning: Reconciliation interval optimization, resource request/limit tuning, autoscaling adjustments, database query optimization
- Documentation Updates: Keeping runbooks current, updating architecture diagrams, recording lessons learned, knowledge base maintenance
- Team Training: Onboarding new team members, knowledge sharing sessions, hands-on exercises, certification paths
- On-Call Procedures: On-call rotation schedule, handoff procedures, escalation paths, post-incident reviews with templates
Compliance & Audit Evidence Collection¶
Purpose: Define comprehensive compliance controls, audit evidence collection procedures, SOC 2 Type II control mappings, GDPR compliance workflows, HIPAA audit trail requirements, Change Advisory Board (CAB) processes, deployment receipts, and automated compliance reporting for ATP's GitOps deployments, ensuring regulatory compliance and providing complete audit trails for all platform changes.
SOC 2 Type II Controls¶
CC8.1: Change Management¶
Change Management Control Requirements:
| Requirement | GitOps Implementation | Evidence |
|---|---|---|
| Authorized Changes | PR approval required | PR approval records in Azure DevOps |
| Change Testing | Automated tests in CI | Test results in Azure Pipelines |
| Change Documentation | Git commit messages, PR descriptions | Git history, PR records |
| Change Approval | Required approvals before merge | Approval timestamps and identities |
| Change Review | Code review process | Review comments and approvals |
GitOps Workflow Mapping to CC8.1:
graph LR
START[Developer Creates PR] --> REVIEW[Code Review]
REVIEW --> APPROVE{Approval<br/>Required?}
APPROVE -->|Yes| CAB[CAB Approval]
APPROVE -->|No| AUTO[Automated Tests]
CAB --> AUTO
AUTO --> MERGE[Merge to Main]
MERGE --> DEPLOY[FluxCD Deploys]
DEPLOY --> EVIDENCE[Audit Evidence<br/>Generated]
style CAB fill:#FFE5B4
style EVIDENCE fill:#90EE90
Evidence Collection for CC8.1:
#!/bin/bash
# scripts/collect-change-management-evidence.sh
PR_ID="${1}"
DATE_RANGE="${2:-30d}"
echo "📋 Collecting Change Management Evidence for PR ${PR_ID}"
# Get PR details
az repos pr show \
--id "${PR_ID}" \
--organization ${ORG} \
--project ${PROJECT} \
--output json > "change-evidence-pr-${PR_ID}.json"
# Extract evidence
cat "change-evidence-pr-${PR_ID}.json" | jq '{
pr_id: .pullRequestId,
title: .title,
created_by: .createdBy.uniqueName,
created_date: .creationDate,
reviewers: [.reviewers[] | {name: .uniqueName, vote: .vote, date: .votedForDate}],
status: .status,
merge_status: .mergeStatus,
closed_date: .closedDate,
linked_work_items: [.workItemRefs[]?.id]
}'
# Get commit history
echo "📜 Commit History:"
az repos pr commits \
--id "${PR_ID}" \
--organization ${ORG} \
--project ${PROJECT} \
--output table
# Get build/test results
echo "🧪 Build/Test Results:"
az pipelines runs list \
--organization ${ORG} \
--project ${PROJECT} \
--query "[?sourceVersion == '${PR_COMMIT_SHA}']" \
--output table
CC6.1: Logical and Physical Access¶
Access Control Requirements:
| Requirement | Implementation | Evidence |
|---|---|---|
| Access Reviews | Quarterly RBAC reviews | Access review reports |
| Least Privilege | RBAC in Kubernetes, Azure AD | RBAC manifests in Git |
| Access Logging | Kubernetes audit logs, Azure AD logs | Log Analytics queries |
| Access Termination | Automated offboarding | Offboarding logs |
Access Review Evidence Collection:
#!/bin/bash
# scripts/collect-access-review-evidence.sh
REVIEW_DATE="${1:-$(date +%Y-%m-%d)}"
echo "👥 Collecting Access Review Evidence - ${REVIEW_DATE}"
# Review Kubernetes RBAC
echo "📋 Kubernetes RBAC Review:"
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json > \
"rbac-review-${REVIEW_DATE}.json"
# Review Azure AD groups
echo "🔐 Azure AD Group Memberships:"
az ad group member list \
--group "atp-developers" \
--output table > "azure-ad-access-${REVIEW_DATE}.txt"
# Review Key Vault access
echo "🔑 Key Vault Access Policies:"
az keyvault show \
--name atp-keyvault \
--query "properties.accessPolicies" \
-o json > "keyvault-access-${REVIEW_DATE}.json"
# Generate access review report
cat > "access-review-report-${REVIEW_DATE}.md" <<EOF
# Access Review Report - ${REVIEW_DATE}
## Kubernetes RBAC
\`\`\`
$(kubectl get rolebindings,clusterrolebindings --all-namespaces --no-headers | wc -l) bindings reviewed
\`\`\`
## Azure AD Access
\`\`\`
$(az ad group list --query "length([])") groups reviewed
\`\`\`
## Key Vault Access
\`\`\`
$(az keyvault show --name atp-keyvault --query "length(properties.accessPolicies)" -o tsv) access policies reviewed
\`\`\`
## Findings
- [ ] All access is justified
- [ ] No orphaned accounts
- [ ] Least privilege enforced
- [ ] Access terminated for offboarded users
## Reviewer
**Name**: [Reviewer Name]
**Date**: ${REVIEW_DATE}
**Signature**: [Digital Signature]
EOF
echo "✅ Access review evidence collected"
CC7.2: System Monitoring¶
System Monitoring Requirements:
| Requirement | Implementation | Evidence |
|---|---|---|
| Monitoring Coverage | Azure Monitor, Prometheus | Monitoring dashboards |
| Alert Configuration | Alert rules in Git | Alert manifests |
| Log Retention | 7-year retention in Log Analytics | Retention policies |
| Incident Response | Automated alerts, on-call | Incident logs |
Monitoring Evidence Collection:
// Log Analytics: System Monitoring Evidence
// Query for monitoring coverage evidence
Perf
| where TimeGenerated > ago(30d)
| summarize
MetricCount=count_distinct(CounterName),
ResourceCount=count_distinct(Computer),
DataPoints=count()
| extend EvidenceType="Monitoring Coverage"
| project EvidenceType, MetricCount, ResourceCount, DataPoints, TimeGenerated=now()
// Alert configuration evidence
union *
| where TimeGenerated > ago(30d)
| where Category == "Alert"
| summarize AlertCount=count(), UniqueAlerts=dcount(AlertName)
| extend EvidenceType="Alert Configuration"
| project EvidenceType, AlertCount, UniqueAlerts, TimeGenerated=now()
GitOps Workflow Mapping to Controls¶
SOC 2 Control Mapping Matrix:
| Control | GitOps Workflow | Evidence Source | Retention |
|---|---|---|---|
| CC8.1 - Change Management | PR approval, code review | Azure DevOps PR records | 7 years |
| CC6.1 - Access Control | RBAC manifests in Git | Git history, access reviews | 7 years |
| CC7.2 - Monitoring | Monitoring manifests in Git | Log Analytics, dashboards | 7 years |
| CC7.3 - System Operations | GitOps reconciliation | FluxCD logs, deployment receipts | 7 years |
| CC6.6 - Logical Access | Workload Identity, RBAC | Kubernetes audit logs | 7 years |
SOC 2 Evidence Collection Dashboard:
# monitoring/compliance/soc2-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: soc2-evidence-dashboard
namespace: monitoring
data:
dashboard.json: |
{
"dashboard": {
"title": "SOC 2 Compliance Evidence",
"panels": [
{
"title": "Change Management (CC8.1)",
"targets": [
{
"expr": "count(azure_devops_pr_approvals_total)",
"legendFormat": "PR Approvals"
}
]
},
{
"title": "Access Reviews (CC6.1)",
"targets": [
{
"expr": "count(kubernetes_rbac_bindings_total)",
"legendFormat": "RBAC Bindings"
}
]
},
{
"title": "Monitoring Coverage (CC7.2)",
"targets": [
{
"expr": "count(azure_monitor_metrics_total)",
"legendFormat": "Monitored Resources"
}
]
}
]
}
}
GDPR Compliance¶
Right to be Forgotten (Tenant Offboarding)¶
GDPR Right to be Forgotten Procedure:
#!/bin/bash
# scripts/gdpr-tenant-offboarding.sh
TENANT_ID="${1}"
REQUEST_DATE="${2:-$(date +%Y-%m-%d)}"
REQUESTOR="${3}"
if [ -z "${TENANT_ID}" ] || [ -z "${REQUESTOR}" ]; then
echo "Usage: $0 <tenant-id> [request-date] <requestor-email>"
exit 1
fi
echo "🗑️ GDPR Right to be Forgotten Request"
echo "Tenant: ${TENANT_ID}"
echo "Request Date: ${REQUEST_DATE}"
echo "Requestor: ${REQUESTOR}"
# Step 1: Verify request authorization
echo "✅ Step 1: Verifying request authorization..."
# Verify requestor has authority to request deletion
# Step 2: Export tenant data (for record keeping)
echo "📥 Step 2: Exporting tenant data..."
kubectl get all -n "tenant-${TENANT_ID}" -o yaml > \
"gdpr-export-${TENANT_ID}-${REQUEST_DATE}.yaml"
# Step 3: Delete tenant data
echo "🗑️ Step 3: Deleting tenant data..."
# Delete tenant namespace
kubectl delete namespace "tenant-${TENANT_ID}"
# Delete tenant secrets from Key Vault
az keyvault secret list \
--vault-name atp-keyvault \
--query "[?contains(name, 'tenant-${TENANT_ID}')].name" -o tsv | \
while read secret; do
az keyvault secret delete --vault-name atp-keyvault --name "${secret}"
done
# Delete tenant data from databases
# (Specific implementation depends on database type)
# Step 4: Remove from GitOps
echo "📝 Step 4: Removing tenant from GitOps..."
git rm -r "tenants/${TENANT_ID}/"
git commit -m "GDPR: Remove tenant ${TENANT_ID} per request on ${REQUEST_DATE}"
git push
# Step 5: Generate deletion certificate
echo "📜 Step 5: Generating deletion certificate..."
cat > "gdpr-deletion-certificate-${TENANT_ID}-${REQUEST_DATE}.md" <<EOF
# GDPR Data Deletion Certificate
**Tenant ID**: ${TENANT_ID}
**Request Date**: ${REQUEST_DATE}
**Requestor**: ${REQUESTOR}
**Completion Date**: $(date +%Y-%m-%d)
## Deletion Confirmation
✅ Tenant namespace deleted: tenant-${TENANT_ID}
✅ Secrets deleted from Key Vault
✅ Data deleted from databases
✅ GitOps configuration removed
✅ Backup data purged (where applicable)
## Data Retention Exception
The following data is retained for legal/compliance purposes:
- Audit logs (7-year retention)
- Financial transaction records (as required by law)
## Certification
I certify that all tenant data has been deleted per GDPR Article 17 (Right to be Forgotten) requirements, except where retention is required by law.
**Signed**: [Authorized Person]
**Date**: $(date +%Y-%m-%d)
EOF
echo "✅ GDPR deletion complete"
Data Residency Enforcement¶
Data Residency Configuration:
# tenants/tenant-eu/labels.yaml
apiVersion: v1
kind: Namespace
metadata:
name: tenant-eu
labels:
data-residency: "EU"
gdpr: "true"
region: "westeurope"
annotations:
compliance/data-residency: "EU Only"
compliance/gdpr: "true"
Data Residency Policy Enforcement:
#!/bin/bash
# scripts/verify-data-residency.sh
TENANT_ID="${1}"
REQUIRED_REGION="${2:-EU}"
echo "🌍 Verifying Data Residency for Tenant: ${TENANT_ID}"
# Check namespace labels
RESIDENCY=$(kubectl get namespace "tenant-${TENANT_ID}" \
-o jsonpath='{.metadata.labels.data-residency}')
if [ "${RESIDENCY}" != "${REQUIRED_REGION}" ]; then
echo "❌ Data residency violation: Expected ${REQUIRED_REGION}, found ${RESIDENCY}"
exit 1
fi
# Check Pod placement (node labels)
NODES=$(kubectl get nodes -l region=${REQUIRED_REGION} -o jsonpath='{.items[*].metadata.name}')
if [ -z "${NODES}" ]; then
echo "⚠️ No nodes in region ${REQUIRED_REGION}"
fi
# Check PersistentVolume placement (the selected-node annotation records
# where each volume was bound)
PVC_NODES=$(kubectl get pvc -n "tenant-${TENANT_ID}" -o json | \
jq -r '.items[].metadata.annotations."volume.kubernetes.io/selected-node" // empty')
echo "PVC-bound nodes: ${PVC_NODES:-none}"
echo "✅ Data residency verified: ${RESIDENCY}"
Audit Logs and Retention¶
GDPR Audit Log Retention:
# platform/compliance/audit-log-retention.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: audit-log-retention
namespace: monitoring
data:
retention-policy.yaml: |
# GDPR Audit Log Retention Policy
retention:
default: 7y # 7-year retention per ATP compliance policy (GDPR itself mandates minimization, not a fixed term)
compliance:
gdpr: 7y
soc2: 7y
hipaa: 7y
storage:
backend: azure-blob
account: atpauditlogs
container: audit-logs
immutability: true # Immutable storage
encryption: true
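The immutability flag in the policy above maps to Azure Blob immutable storage. A hedged sketch locking the container with a time-based retention policy (resource group is an assumption; 7 years ≈ 2,557 days):
#!/bin/bash
az storage container immutability-policy create \
--resource-group atp-production-rg \
--account-name atpauditlogs \
--container-name audit-logs \
--period 2557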
Audit Log Export for GDPR:
#!/bin/bash
# scripts/export-gdpr-audit-logs.sh
TENANT_ID="${1}"
START_DATE="${2}"
END_DATE="${3}"
echo "📥 Exporting GDPR Audit Logs for Tenant: ${TENANT_ID}"
# Query Log Analytics for tenant-specific audit logs
az monitor log-analytics query \
--workspace ${LOG_ANALYTICS_WORKSPACE_ID} \
--analytics-query "
KubernetesAudit
| where Namespace == 'tenant-${TENANT_ID}'
| where TimeGenerated between (datetime('${START_DATE}') .. datetime('${END_DATE}'))
| project TimeGenerated, User, Action, Resource, ResponseCode
| order by TimeGenerated asc
" \
--output table > "gdpr-audit-logs-${TENANT_ID}-${START_DATE}-${END_DATE}.csv"
echo "✅ Audit logs exported"
Privacy by Design¶
Privacy by Design Implementation:
# platform/compliance/privacy-by-design.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: privacy-by-design-config
namespace: atp-production
data:
principles.yaml: |
# Privacy by Design Principles
principles:
- principle: Proactive not Reactive
implementation: Default privacy settings, data minimization
- principle: Privacy as Default
implementation: Encryption at rest and in transit, minimal data collection
- principle: Privacy Embedded into Design
implementation: Privacy considerations in architecture
- principle: Full Functionality
implementation: Privacy without sacrificing functionality
- principle: End-to-End Security
implementation: Encryption, access controls, audit logging
- principle: Visibility and Transparency
implementation: Audit logs, privacy notices, data subject rights
- principle: Respect for User Privacy
implementation: User consent, data deletion, portability
HIPAA Audit Trail¶
Access Logs¶
HIPAA Access Log Configuration:
# platform/compliance/hipaa-audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Audit policy namespace matching is exact (no wildcards), so each HIPAA
# tenant namespace must be listed explicitly; names below are examples
- level: Metadata
namespaces: ["tenant-hipaa-production", "tenant-hipaa-staging"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# Omitting a resources stanza matches all API groups and resources
- level: RequestResponse
namespaces: ["tenant-hipaa-production", "tenant-hipaa-staging"]
verbs: ["create", "update", "patch", "delete"]
resources:
- group: "" # secrets, configmaps, and PVCs are in the core API group
resources: ["secrets", "configmaps", "persistentvolumeclaims"]
HIPAA Access Log Query:
// Log Analytics: HIPAA Access Logs
KubernetesAudit
| where Namespace startswith "tenant-hipaa"
| where TimeGenerated > ago(30d)
| where Verb in ("get", "list", "watch", "create", "update", "delete")
| project
TimeGenerated,
User,
Verb,
Resource,
Namespace,
ResponseCode,
RequestObject,
ResponseObject
| order by TimeGenerated desc
Deployment Logs¶
HIPAA Deployment Audit Trail:
#!/bin/bash
# scripts/generate-hipaa-deployment-log.sh
DEPLOYMENT="${1}"
NAMESPACE="${2:-tenant-hipaa-production}"
echo "📋 Generating HIPAA Deployment Audit Trail"
# Collect deployment evidence
cat > "hipaa-deployment-${DEPLOYMENT}-$(date +%Y%m%d).md" <<EOF
# HIPAA Deployment Audit Trail
**Deployment**: ${DEPLOYMENT}
**Namespace**: ${NAMESPACE}
**Date**: $(date +%Y-%m-%d)
**Time**: $(date +%H:%M:%S)
## Pre-Deployment Verification
- [ ] Change approved by authorized personnel
- [ ] Security scan passed
- [ ] Encryption verified
- [ ] Access controls verified
## Deployment Details
**Image**: $(kubectl get deployment ${DEPLOYMENT} -n ${NAMESPACE} -o jsonpath='{.spec.template.spec.containers[0].image}')
**Git Commit**: $(git rev-parse HEAD)
**PR Number**: $(git log -1 --pretty=format:"%s" | grep -oP 'PR #\K\d+')
**Deployed By**: $(az ad signed-in-user show --query userPrincipalName -o tsv)
## Post-Deployment Verification
- [ ] Deployment successful
- [ ] Health checks passing
- [ ] Encryption operational
- [ ] Access logs enabled
## HIPAA Compliance
- [ ] Audit logging enabled
- [ ] Encryption at rest verified
- [ ] Encryption in transit verified
- [ ] Access controls enforced
- [ ] PHI data handling verified
EOF
echo "✅ HIPAA deployment audit trail generated"
Encryption Verification¶
HIPAA Encryption Verification:
#!/bin/bash
# scripts/verify-hipaa-encryption.sh
NAMESPACE="${1:-tenant-hipaa-production}"
echo "🔐 Verifying HIPAA Encryption Requirements"
# Check PVC encryption
echo "💾 Persistent Volume Encryption:"
kubectl get pvc -n "${NAMESPACE}" -o json | \
jq -r '.items[] | "\(.metadata.name): \(.spec.storageClassName)"' | \
while read pvc; do
SC=$(echo "${pvc}" | cut -d':' -f2 | xargs)
ENCRYPTED=$(kubectl get storageclass "${SC}" -o jsonpath='{.parameters.diskEncryptionSetID}')
if [ -n "${ENCRYPTED}" ]; then
echo " ✅ ${pvc}: Encrypted"
else
echo " ❌ ${pvc}: Not encrypted"
fi
done
# Check TLS/in-transit encryption
echo "🔒 In-Transit Encryption:"
kubectl get ingress -n "${NAMESPACE}" -o json | \
jq -r '.items[] | select(.spec.tls == null) | "\(.metadata.name): Missing TLS"'
# List non-Opaque secret types for review (etcd encryption is a cluster-level
# setting and cannot be verified from the secret objects alone)
echo "🔑 Secret types (non-Opaque):"
kubectl get secrets -n "${NAMESPACE}" -o json | \
jq -r '.items[] | select(.type != "Opaque") | "\(.metadata.name): \(.type)"'
echo "✅ Encryption verification complete"
Incident Response Documentation¶
HIPAA Incident Response Template:
## HIPAA Incident Report
**Incident ID**: [ID]
**Date Discovered**: [Date]
**Date Reported**: [Date] (within 60 days)
**Severity**: [Low/Medium/High/Critical]
### Incident Description
[Description of the incident]
### PHI Impact Assessment
- [ ] No PHI affected
- [ ] PHI accessed but not compromised
- [ ] PHI compromised (breach)
### Affected Systems
- [List affected systems]
### Actions Taken
1. [Action 1]
2. [Action 2]
### Remediation
[Remediation steps]
### Breach Notification
- [ ] HHS notified (within 60 days if ≥ 500 individuals; annual log submission otherwise)
- [ ] Affected individuals notified (within 60 days of discovery, if breach)
- [ ] Media notified (if breach affects > 500 residents of a state or jurisdiction)
### Lessons Learned
[Lessons learned]
### Prevention
[Prevention measures]
Change Advisory Board (CAB) Process¶
When CAB Approval is Required¶
CAB Approval Requirements:
| Change Type | CAB Required? | Rationale |
|---|---|---|
| Production Deployment | ✅ Yes | Production impact |
| Infrastructure Changes | ✅ Yes | Platform stability |
| Security Updates | ⚠️ Expedited | Security risk |
| Hotfixes | ⚠️ Post-deployment | Urgency |
| Dev/Test Changes | ❌ No | No production impact |
CAB Approval Decision Tree:
graph TD
START[Change Request] --> ENV{Environment?}
ENV -->|Production| CAB_REQUIRED[CAB Approval Required]
ENV -->|Staging| REVIEW[Team Lead Review]
ENV -->|Dev/Test| AUTO[No Approval Needed]
CAB_REQUIRED --> SEVERITY{Severity?}
SEVERITY -->|Critical| EXPEDITED[Expedited CAB]
SEVERITY -->|Normal| REGULAR[Regular CAB]
REGULAR --> MEETING[CAB Meeting]
EXPEDITED --> APPROVAL[Expedited Approval]
style CAB_REQUIRED fill:#FFE5B4
style EXPEDITED fill:#FFB6C1
CAB Meeting Schedule¶
CAB Meeting Schedule:
| Meeting Type | Frequency | Day | Time |
|---|---|---|---|
| Regular CAB | Weekly | Tuesday | 10:00 AM |
| Expedited CAB | As needed | Any | Within 24 hours |
| Emergency CAB | As needed | Any | Immediate |
Change Request Template¶
CAB Change Request Template:
## Change Request Form
**CR Number**: CR-YYYY-XXX
**Date**: [Date]
**Requestor**: [Name, Email]
**Change Type**: [Standard/Emergency/Expedited]
### Change Summary
**Title**: [Change title]
**Description**: [Detailed description]
### Business Justification
[Why is this change needed?]
### Technical Details
- **Services Affected**: [List services]
- **Environments**: [Dev/Test/Staging/Production]
- **Expected Duration**: [Duration]
- **Rollback Plan**: [Rollback procedure]
### Risk Assessment
- **Risk Level**: [Low/Medium/High/Critical]
- **Potential Impact**: [Impact description]
- **Mitigation**: [Mitigation steps]
### Testing
- [ ] Tested in Dev
- [ ] Tested in Test
- [ ] Tested in Staging
- [ ] Rollback tested
### Approval
- [ ] Technical Lead Approval
- [ ] CAB Approval
- [ ] Change Manager Approval
### Implementation
**Scheduled Date**: [Date]
**Scheduled Time**: [Time]
**Change Window**: [Window]
### Post-Implementation
- [ ] Implementation successful
- [ ] Verification completed
- [ ] Documentation updated
CAB Review Criteria¶
CAB Review Criteria Checklist:
## CAB Review Criteria
### Change Completeness
- [ ] Change request form complete
- [ ] Technical details provided
- [ ] Testing completed
- [ ] Rollback plan documented
### Risk Assessment
- [ ] Risk level appropriate
- [ ] Impact assessment complete
- [ ] Mitigation plan adequate
### Compliance
- [ ] Change documented in Git
- [ ] Approval trail maintained
- [ ] Audit requirements met
### Schedule
- [ ] Change window appropriate
- [ ] Stakeholders notified
- [ ] Resources available
Approval Documentation¶
CAB Approval Record:
# changes/cr-2024-001-approval.yaml
apiVersion: compliance.atp.connectsoft.io/v1
kind: ChangeApproval
metadata:
name: cr-2024-001
namespace: atp-production
spec:
changeRequest:
number: CR-2024-001
title: "Upgrade PostgreSQL to version 15"
requestor: "john.doe@connectsoft.example"
date: "2024-01-15"
cabApproval:
approved: true
approvalDate: "2024-01-18"
approvedBy:
- name: "Jane Smith"
role: "CAB Chair"
signature: "[Digital Signature]"
- name: "Bob Johnson"
role: "Technical Lead"
signature: "[Digital Signature]"
implementation:
scheduledDate: "2024-01-25"
scheduledTime: "02:00 UTC"
changeWindow: "02:00-04:00 UTC"
Deployment Approval Records¶
PR Approvals in Azure DevOps¶
Extract PR Approval Records:
#!/bin/bash
# scripts/extract-pr-approvals.sh
PR_ID="${1}"
PROJECT="${2:-atp-gitops}"
echo "📋 Extracting PR Approval Records for PR ${PR_ID}"
# Get PR details with approvals
az repos pr show \
--id "${PR_ID}" \
--organization ${ORG} \
--project ${PROJECT} \
--include-work-item-refs \
--output json | \
jq '{
pr_id: .pullRequestId,
title: .title,
created_by: .createdBy.displayName,
created_date: .creationDate,
status: .status,
reviewers: [.reviewers[] | {
name: .displayName,
email: .uniqueName,
vote: .vote,
vote_date: .votedForDate,
is_required: .isRequired
}],
closed_date: .closedDate,
completion_options: .completionOptions,
work_item_refs: [.workItemRefs[] | {
id: .id,
title: .title,
url: .url
}]
}' > "pr-approval-${PR_ID}.json"
# Generate approval certificate
cat > "pr-approval-certificate-${PR_ID}.md" <<EOF
# PR Approval Certificate
**PR Number**: ${PR_ID}
**Title**: $(jq -r '.title' pr-approval-${PR_ID}.json)
**Created**: $(jq -r '.created_date' pr-approval-${PR_ID}.json)
**Merged**: $(jq -r '.closed_date' "pr-approval-${PR_ID}.json")
## Approvers
$(jq -r '.reviewers[] | "- **\(.name)** (\(.email)) - Vote: \(.vote) - Date: \(.vote_date)"' pr-approval-${PR_ID}.json)
## Approval Status
$(jq -r 'if .reviewers | all(.vote >= 10) then "✅ Approved" else "❌ Not Approved" end' pr-approval-${PR_ID}.json)
## Linked Work Items
$(jq -r '.work_item_refs[] | "- [\(.id)] \(.title) - \(.url)"' pr-approval-${PR_ID}.json)
## Audit Trail
This PR approval record is maintained for 7 years per SOC 2 and GDPR requirements.
EOF
echo "✅ Approval records extracted"
Approver Identity and Timestamp¶
Approval Evidence Structure:
{
"approval_record": {
"pr_id": 12345,
"pr_title": "Deploy ATP Gateway v1.2.3 to Production",
"approvals": [
{
"approver": {
"name": "Jane Smith",
"email": "jane.smith@connectsoft.example",
"azure_ad_id": "a1b2c3d4-..."
},
"approval": {
"vote": 10,
"vote_date": "2024-01-20T10:30:00Z",
"comment": "Approved after review",
"timestamp": "2024-01-20T10:30:15Z"
},
"signature": {
"method": "Azure DevOps",
"hash": "sha256:abc123...",
"verified": true
}
}
],
"merged_by": {
"name": "John Doe",
"email": "john.doe@connectsoft.example",
"merge_date": "2024-01-20T11:00:00Z"
}
}
}
Justification and Risk Assessment¶
PR Justification Template:
## PR Justification
**PR**: #12345
**Title**: Deploy ATP Gateway v1.2.3 to Production
### Business Justification
[Why is this deployment needed?]
### Technical Justification
[Technical reasons for the change]
### Risk Assessment
- **Risk Level**: Medium
- **Potential Impact**: Service restart (5 minutes downtime)
- **Mitigation**: Rolling update, health checks
### Testing Completed
- [x] Unit tests passed
- [x] Integration tests passed
- [x] Staging deployment successful
- [x] Smoke tests passed
### Rollback Plan
[Rollback procedure if deployment fails]
### Approval Required
- [x] Technical Lead
- [x] CAB (for production)
Work Item Linking¶
Link PR to Work Items:
#!/bin/bash
# scripts/link-pr-to-workitems.sh
PR_ID="${1}"
WORK_ITEM_IDS="${2}" # Space-separated work item IDs
echo "🔗 Linking PR ${PR_ID} to work items: ${WORK_ITEM_IDS}"
for WI_ID in ${WORK_ITEM_IDS}; do
echo " Linking to work item: ${WI_ID}"
az repos pr work-item add \
--id "${PR_ID}" \
--work-items "${WI_ID}" \
--organization ${ORG} \
--project ${PROJECT}
done
# Verify links
echo "✅ Verifying links..."
az repos pr work-item list \
--id "${PR_ID}" \
--organization ${ORG} \
--output table
Git Commit History as Audit Evidence¶
Signed Commits (GPG)¶
GPG Signing Configuration:
#!/bin/bash
# scripts/setup-gpg-signing.sh
echo "🔐 Setting up GPG signing for Git commits"
# Generate GPG key (if not exists)
if ! gpg --list-secret-keys --keyid-format LONG | grep -q "sec"; then
echo "Generating new GPG key..."
gpg --full-generate-key
fi
# Get GPG key ID
GPG_KEY_ID=$(gpg --list-secret-keys --keyid-format LONG | \
grep "^sec" | \
sed -n 's/.*\/\([A-Z0-9]\{16\}\).*/\1/p' | \
head -1)
echo "GPG Key ID: ${GPG_KEY_ID}"
# Configure Git to use GPG signing
git config --global user.signingkey "${GPG_KEY_ID}"
git config --global commit.gpgsign true
# Add GPG key to GitHub/Azure DevOps
echo "Add this public key to Azure DevOps:"
gpg --armor --export "${GPG_KEY_ID}"
echo "✅ GPG signing configured"
Verify Signed Commits:
#!/bin/bash
# scripts/verify-signed-commits.sh
COMMIT_RANGE="${1:-HEAD~10..HEAD}"
echo "✅ Verifying signed commits in range: ${COMMIT_RANGE}"
# Author is read last so names containing spaces are not split across fields
git log --pretty="format:%H %G? %aN" "${COMMIT_RANGE}" | \
while read -r commit signature author; do
case "${signature}" in
"G")
echo "✅ ${commit}: Good signature (${author})"
;;
"B")
echo "❌ ${commit}: Bad signature (${author})"
;;
"X")
echo "⚠️ ${commit}: Expired key (${author})"
;;
"Y")
echo "⚠️ ${commit}: Expired signature (${author})"
;;
"R")
echo "❌ ${commit}: Revoked key (${author})"
;;
"E")
echo "❌ ${commit}: Cannot verify (${author})"
;;
"N")
echo "❌ ${commit}: No signature (${author})"
;;
esac
done
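Signature verification can also run as a CI gate so unsigned commits never reach main. A minimal sketch for a pipeline step (the branch name is an assumption):
#!/bin/bash
# %G? prints G only for a good signature; any other status fails the build
if git log --pretty="%G?" origin/main..HEAD | grep -qv '^G$'; then
echo "❌ Unsigned or invalid commits found"
exit 1
fi
echo "✅ All commits signed"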
Commit Message Standards¶
Conventional Commits for Audit Trail: