
Quality Gates — Audit Trail Platform (ATP)

Quality by enforcement — ATP pipelines block progression when code quality, test coverage, security, or compliance thresholds are not met.


Purpose & Scope

This document defines the comprehensive quality gate framework for the ConnectSoft Audit Trail Platform (ATP), ensuring that every build, deployment, and release meets stringent standards for code quality, security, compliance, and operational excellence.

Purpose

Quality gates serve as automated checkpoints in the CI/CD pipeline, preventing low-quality or non-compliant code from progressing to production. ATP's quality gates enforce:

  1. Build Quality: Code compiles without errors or warnings, adheres to coding standards, and passes static analysis
  2. Test Coverage: Sufficient unit, integration, and E2E test coverage with 100% pass rates
  3. Security Posture: Zero critical/high vulnerabilities, no secrets in code, secure dependencies
  4. Compliance Adherence: SBOM generation, audit logging, PII redaction, regulatory alignment
  5. Performance Standards: Load tests, chaos tests, and observability validation in staging
  6. API Contract Stability: No breaking changes without versioning, backward compatibility maintained

By failing fast at each gate, ATP ensures that issues are detected and remediated early in the development lifecycle (shift-left), reducing the cost and risk of defects reaching production.


Scope

This document covers:

  • Quality Gate Categories: Build, test, security, compliance, performance, observability, API contracts
  • Thresholds & Metrics: Specific numeric thresholds per gate type (e.g., ≥70% code coverage, 0 critical CVEs)
  • Enforcement Mechanisms: Pipeline configurations, Azure DevOps tasks, custom scripts, approval workflows
  • Integration Points: Azure Pipelines stages (CI, staging, production), SonarQube, OWASP Dependency-Check, Trivy, OpenAPI-diff
  • Exception Handling: Risk acceptance process, suppression files, time-bound exemptions
  • Metrics & Dashboards: Quality gate pass/fail trends, remediation times, DORA metrics alignment
  • Governance: Quality gate ownership (RACI), threshold evolution roadmap, retrospective cadence

Gate enforcement applies to:

  • All ATP microservices (Ingestion, Query, Integrity, Export, Policy, Search, Gateway)
  • Infrastructure as Code (Pulumi C# stacks)
  • Shared libraries (ConnectSoft.Audit.Abstractions, ConnectSoft.Observability.OpenTelemetry)
  • Database migration scripts (EF Core migrations, SQL scripts)
  • CI/CD pipeline templates (ConnectSoft.AzurePipelines)

Out of Scope

This document does NOT cover:

  • Code review processes: Manual peer review workflows, pull request templates, approval policies (see development/code-review-guidelines.md)
  • Incident response: Post-production issue handling, on-call procedures (see operations/runbook.md)
  • Deployment strategies: Blue-green, canary, rolling deployment mechanics (see ci-cd/azure-pipelines.md)
  • Environment configuration: Environment-specific settings, secrets management (see ci-cd/environments.md)
  • Architecture decisions: Why specific quality thresholds were chosen (see ADRs in adrs/)

Readers & Ownership

Primary Audience:

  • Development Teams: Understand quality requirements, fix gate violations, write testable code
  • QA Engineers: Define test coverage thresholds, maintain test suites, analyze flaky tests
  • Security Officers: Set vulnerability thresholds, review suppressions, validate compliance gates
  • SRE Teams: Monitor performance gates, chaos test results, observability validation
  • Platform Engineers: Configure pipeline gates, maintain enforcement tooling, update thresholds

Document Ownership:

  • Author: Platform Engineering Team
  • Reviewers: Tech Lead (Build/Test Gates), Security Officer (Security/Compliance Gates), SRE Lead (Performance/Observability Gates)
  • Approver: Lead Architect
  • Maintenance Cadence: Quarterly review with threshold adjustments based on maturity

Artifacts Produced by Quality Gates

Quality gates generate compliance artifacts that serve as evidence for audits (SOC 2, GDPR, HIPAA):

  1. Build Artifacts:

    • Compiled binaries with version stamps
    • NuGet packages (.nupkg, .snupkg symbol packages)
    • Docker images with signed provenance (Cosign signatures)
  2. Test Evidence:

    • Test result files (.trx, JUnit XML)
    • Code coverage reports (Cobertura XML, HTML)
    • Load test results (JMeter .jtl, HTML reports)
    • Chaos test results (JSON, logs)
  3. Security Artifacts:

    • SBOM (Software Bill of Materials) in CycloneDX/SPDX format
    • Vulnerability scan reports (OWASP Dependency-Check JSON/HTML)
    • Container scan results (Trivy JSON)
    • Secrets scan reports (CredScan, GitGuardian)
  4. Compliance Evidence:

    • Audit logging validation results
    • PII redaction verification reports
    • Compliance checklist attestations (GDPR, HIPAA)
    • API contract diff reports (OpenAPI breaking changes)
  5. Performance Metrics:

    • Load test metrics (p50/p95/p99 latency, throughput, error rates)
    • Chaos test pass/fail results per scenario
    • Health check validation logs
  6. Approval Records:

    • Manual approval logs (approver identity, timestamp, justification)
    • Change Advisory Board (CAB) meeting minutes
    • Risk acceptance documentation (for suppressed vulnerabilities)

All artifacts are retained for 7 years (aligned with ATP's retention policy) and stored in immutable Azure Blob Storage with legal hold enabled for production builds.


Acceptance Criteria

This document is considered complete and accurate when:

  1. All gate types are documented with clear thresholds, enforcement mechanisms, and examples
  2. Integration with Azure Pipelines is fully described with working YAML configurations
  3. Exception handling processes are defined with suppression file formats and approval workflows
  4. Metrics and dashboards are specified with Azure DevOps dashboard configurations and KQL queries
  5. Cross-references to related ATP documentation (environments, pipelines, security) are complete
  6. Code examples are production-ready and tested in ATP pipelines
  7. Governance model defines RACI, ownership, and evolution roadmap

Success Metrics:

  • ≥95% of builds pass all quality gates on first attempt (target by Q2 2025)
  • ≤24 hours median time to remediate gate violations (target by Q2 2025)
  • Zero critical/high vulnerabilities in production (maintained since inception)
  • 100% of production builds have complete compliance artifacts (maintained)

Document Conventions

Symbols used:

  • ✅ Blocker gate: Pipeline fails immediately if threshold not met
  • ⚠️ Warning gate: Issue logged but pipeline continues (for non-critical metrics)
  • ℹ️ Informational: Metric tracked but no enforcement action
  • ❌ Action required: Gate failure requires developer intervention

Threshold notation:

  • ≥70%: Greater than or equal to 70%
  • <500ms: Less than 500 milliseconds
  • 0: Exactly zero (no tolerance)

Code block types:

  • yaml: Azure Pipelines YAML configurations
  • csharp: C# code examples (quality validation logic)
  • powershell: PowerShell scripts (validation, remediation)
  • bash: Bash scripts (Linux-based gates)
  • xml: Suppression files, .csproj configurations
  • json: SonarQube quality profiles, SBOM formats

Quality Gate Philosophy

ATP's quality gate framework is built on five core principles:

1. Shift-Left: Detect Issues Early

Principle: Identify and fix defects as early as possible in the development lifecycle to minimize cost and risk.

Implementation:

  • Pre-commit hooks: Lint checks, unit tests run locally before commit
  • PR validation: Automated PR builds run all CI gates (build, test, security)
  • Branch policies: Require build validation, code coverage, code review before merge
  • Fast feedback: CI gates complete in <10 minutes; developers notified immediately on failure

Benefits:

  • 10x cheaper to fix bugs in development vs. production
  • Faster development velocity (fewer context switches for late-stage fixes)
  • Higher developer confidence (code is validated before merge)

ATP Example:

#!/bin/bash
# Pre-commit hook (.git/hooks/pre-commit)
echo "Running pre-commit quality checks..."

FAILED=0

# Lint check
dotnet format --verify-no-changes --severity error || FAILED=1

# Unit tests (fast subset)
dotnet test --filter Category=Unit --no-build || FAILED=1

if [ $FAILED -ne 0 ]; then
  echo "❌ Pre-commit checks failed. Fix issues before committing."
  exit 1
fi

echo "✅ Pre-commit checks passed"

2. Fail Fast: Block Progression Immediately

Principle: Halt the pipeline as soon as a quality gate fails; do not continue to subsequent stages.

Implementation:

  • Exit code enforcement: All gate tasks return non-zero exit codes on failure
  • Pipeline stage dependencies: Subsequent stages depend on prior stage success (dependsOn: CI_Stage, condition: succeeded())
  • No silent warnings: All warnings treated as errors in build configuration (<TreatWarningsAsErrors>true</TreatWarningsAsErrors>)
  • Immediate notifications: Developers notified via email/Slack within 1 minute of gate failure

Benefits:

  • Prevents compounding issues (e.g., deploying broken code to staging)
  • Conserves CI/CD resources (no unnecessary downstream tasks)
  • Clear accountability (developer knows immediately which gate failed)

ATP Example:

# Azure Pipelines stage with fail-fast behavior
stages:
- stage: CI_Stage
  jobs:
  - job: Build_Test_Scan
    steps:
    - task: DotNetCoreCLI@2
      inputs:
        command: 'build'
        arguments: '--configuration Release /p:TreatWarningsAsErrors=true'
      displayName: 'Build with Warnings as Errors'
      # If build fails here, pipeline stops immediately

    - task: BuildQualityChecks@8
      inputs:
        checkCoverage: true
        coverageFailOption: 'fixed'
        coverageThreshold: 70
      displayName: 'Enforce Code Coverage Gate'
      # If coverage < 70%, pipeline stops immediately

- stage: Deploy_Staging
  dependsOn: CI_Stage
  condition: succeeded()  # Only runs if CI_Stage passed

3. Transparent Feedback: Clear Error Messages with Remediation

Principle: When a gate fails, provide actionable feedback with clear error messages and remediation steps.

Implementation:

  • Descriptive error messages: Include gate name, actual vs. expected value, remediation steps
  • Links to documentation: Error messages include URLs to relevant docs (e.g., coverage guide, security best practices)
  • Suggested fixes: Where possible, provide copy-paste commands to fix issues (e.g., dotnet format, dotnet add package)
  • Trend analysis: Show gate pass/fail trends to identify recurring issues

Benefits:

  • Faster remediation (developers don't waste time diagnosing issues)
  • Self-service resolution (reduced dependency on platform team)
  • Improved developer experience (clear guidance vs. cryptic errors)

ATP Example:

// Custom quality gate validation with clear feedback
public class CodeCoverageGateValidator
{
    public ValidationResult ValidateCoverage(CoverageReport report, double threshold)
    {
        var lineCoverage = report.LineRate * 100;

        if (lineCoverage < threshold)
        {
            return new ValidationResult
            {
                Passed = false,
                ErrorMessage = $@"
❌ Code Coverage Gate Failed

  Required: {threshold}% line coverage
  Actual:   {lineCoverage:F1}% line coverage
  Deficit:  {threshold - lineCoverage:F1}% ({report.UncoveredLines} lines not covered)

📋 Remediation Steps:
  1. Identify uncovered code: dotnet reportgenerator -reports:coverage.xml -targetdir:coverage-report
  2. Add unit tests for critical paths (Controllers, Services, Validators)
  3. Re-run tests: dotnet test --collect:'XPlat Code Coverage'
  4. Verify coverage: View HTML report in coverage-report/index.html

📚 Documentation: https://docs.connectsoft.example/quality-gates/coverage

🔍 Uncovered Files (Top 5):
{string.Join("\n", report.GetUncoveredFiles().Take(5).Select(f => $"  - {f.Name}: {f.Coverage:F1}% covered"))}
",
                Severity = ValidationSeverity.Error,
                Category = "Test Coverage"
            };
        }

        return ValidationResult.Success($"✅ Code coverage: {lineCoverage:F1}% (threshold: {threshold}%)");
    }
}

4. Continuous Improvement: Ratchet Thresholds Upward

Principle: Never lower quality standards; continuously improve thresholds based on team maturity and historical performance.

Implementation:

  • Quarterly reviews: Evaluate gate pass rates, remediation times, and threshold appropriateness
  • Baseline protection: Track coverage/quality metrics per build; alert if metrics regress
  • Incremental increases: Raise thresholds by 2-5% per quarter if sustained above target
  • Zero tolerance for critical issues: Security gates (critical CVEs, secrets) have no grandfathering

Benefits:

  • Prevents quality erosion over time (no "technical debt drift")
  • Incentivizes proactive quality improvements (teams strive to exceed thresholds)
  • Data-driven threshold evolution (based on actual team capability)

ATP Example:

# Threshold evolution tracking (quality-gate-history.yml)
coverageThresholds:
  - effectiveDate: 2024-01-01
    threshold: 65%
    rationale: Initial baseline for ATP launch

  - effectiveDate: 2024-04-01
    threshold: 68%
    rationale: Q1 2024 sustained at 72% avg; raised by 3%

  - effectiveDate: 2024-07-01
    threshold: 70%
    rationale: Q2 2024 sustained at 75% avg; raised to 70% (industry standard)

  - effectiveDate: 2025-01-01
    threshold: 73%
    rationale: Q4 2024 sustained at 78% avg; raised by 3%
    approvedBy: Lead Architect
    adrReference: ADR-042-coverage-threshold-increase

# Automated check: Prevent lowering thresholds
- script: |
    CURRENT=$(grep -m1 coverageThreshold azure-pipelines.yml | awk '{print $2}')
    PREVIOUS=$(git show HEAD~1:azure-pipelines.yml | grep -m1 coverageThreshold | awk '{print $2}')

    if (( $(echo "$CURRENT < $PREVIOUS" | bc -l) )); then
      echo "❌ Error: Coverage threshold lowered from $PREVIOUS to $CURRENT"
      echo "   Quality thresholds can only be raised, never lowered."
      exit 1
    fi
  displayName: 'Validate Threshold Ratcheting'

5. Evidence-Based: All Gate Results Logged as Compliance Artifacts

Principle: Every quality gate execution produces auditable evidence that is retained for compliance, security audits, and retrospectives.

Implementation:

  • Artifact publishing: All gate results (test reports, scan results, SBOM) published to Azure Artifacts
  • Immutable storage: Compliance artifacts stored in Azure Blob Storage with WORM (Write Once Read Many) enabled
  • Metadata tagging: Artifacts tagged with build ID, commit SHA, approver identities, gate pass/fail status
  • Retention enforcement: 7-year retention for production builds; 90 days for dev/test builds

Benefits:

  • SOC 2/GDPR/HIPAA audit readiness (evidence available on-demand)
  • Forensic analysis of production incidents (trace back to build artifacts)
  • Quality trend analysis (identify gate failure patterns over time)

ATP Example:

# Publish compliance artifacts with metadata
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)/compliance'
    ArtifactName: 'compliance-evidence-$(Build.BuildNumber)'
  displayName: 'Publish Compliance Evidence'

- task: AzureCLI@2
  inputs:
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: |
      # Upload to immutable blob storage with metadata
      az storage blob upload-batch \
        --source "$(Build.ArtifactStagingDirectory)/compliance" \
        --destination compliance-artifacts \
        --account-name atpcomplianceblob \
        --metadata \
          BuildId=$(Build.BuildId) \
          CommitSha=$(Build.SourceVersion) \
          Pipeline=$(Build.DefinitionName) \
          Environment=Production \
          RetentionYears=7 \
          GatesPassed=true

      # Enable legal hold (immutability)
      az storage blob set-legal-hold \
        --container compliance-artifacts \
        --account-name atpcomplianceblob \
        --tags audit-evidence=true
  displayName: 'Archive Evidence with Legal Hold'

Summary

  • Purpose: Quality gates enforce code quality, security, compliance, and performance standards at every CI/CD stage
  • Scope: All ATP microservices, IaC, shared libraries, database migrations; covers build, test, security, compliance, performance, observability, API contract gates
  • Out of Scope: Code review processes, incident response, deployment strategies, environment config (covered in other documents)
  • Artifacts: Test results, coverage reports, SBOM, vulnerability scans, compliance evidence (retained 7 years for production)
  • Philosophy: 5 core principles (Shift-Left, Fail Fast, Transparent Feedback, Continuous Improvement, Evidence-Based)
  • Ownership: Platform Engineering (author), Tech Lead/Security/SRE (reviewers), Lead Architect (approver), Quarterly review cadence
  • Success Metrics: ≥95% first-attempt pass rate, ≤24h remediation time, zero critical/high CVEs in production, 100% artifact completeness

Gate Types Overview

ATP implements six primary quality gate categories, each enforced at different stages of the CI/CD pipeline. Gates are cumulative: a build must pass all prior gates before progressing to subsequent stages.

Gate Execution Flow

graph LR
    A[Code Commit] --> B[Build Quality Gates]
    B --> C[Test Coverage Gates]
    C --> D[Security Gates]
    D --> E[Compliance Gates]
    E --> F[CI Artifacts Published]
    F --> G[Deploy to Staging]
    G --> H[Performance Gates]
    H --> I[Observability Gates]
    I --> J[Deploy to Production]

    B -.->|Fail| K[Pipeline Stopped]
    C -.->|Fail| K
    D -.->|Fail| K
    E -.->|Fail| K
    H -.->|Fail| L[Block Prod Deploy]
    I -.->|Fail| L

    style K fill:#ff6b6b
    style L fill:#feca57

Key Characteristics:

  1. Sequential Execution: Gates execute in order (Build → Test → Security → Compliance → Performance → Observability)
  2. Early Termination: First gate failure stops the pipeline immediately (fail fast)
  3. Stage-Specific: Some gates only run in specific environments (e.g., performance gates in staging)
  4. Artifact Dependencies: Later gates may analyze artifacts from earlier gates (e.g., SBOM from build stage)

Gate Category Summary

| Gate Category | Purpose | Enforcement Point | Blocker | Typical Duration | Owner |
|---|---|---|---|---|---|
| Build Quality | Code compiles, no warnings, static analysis passes | CI stage (every commit) | ✅ Yes | 2-4 minutes | Tech Lead |
| Test Coverage | Sufficient unit/integration tests, 100% pass rate | CI stage (every commit) | ✅ Yes | 3-5 minutes | QA Lead |
| Security | No vulnerabilities, no secrets in code, images scanned | CI + pre-deploy | ✅ Yes | 5-8 minutes | Security Officer |
| Compliance | SBOM generated, audit logging validated, PII redacted | CI + pre-deploy | ✅ Yes | 2-3 minutes | Compliance Officer |
| Performance | Load/chaos tests pass, latency/error thresholds met | Staging (pre-prod) | ⚠️ Warning (blocks prod) | 10-15 minutes | SRE Lead |
| Observability | Metrics, traces, logs validated, health checks pass | Staging (pre-prod) | ⚠️ Warning (blocks prod) | 3-5 minutes | SRE Lead |

Total CI Pipeline Duration (Build → Compliance): ~15-20 minutes
Total Staging Validation (Performance + Observability): ~13-20 minutes
End-to-End (Commit → Production-Ready): ~30-40 minutes


Gate Category 1: Build Quality

Purpose: Ensure code compiles successfully, adheres to coding standards, and passes static analysis before any further validation.

Enforcement Point: CI stage (triggered on every commit, PR, and main branch build)

Blocker Status: ✅ Yes — Pipeline fails immediately if build quality gates fail

Key Checks:

  1. Compilation: Code builds without errors using dotnet build
  2. Warnings as Errors: Zero warnings (all warnings treated as errors)
  3. Static Analysis: StyleCop, SonarQube, Meziantou, AsyncFixer rules enforced
  4. Code Style: EditorConfig rules, naming conventions, documentation comments
  5. Deprecated APIs: No usage of deprecated packages or APIs

Typical Failures:

  • Compilation errors (syntax, missing references, type mismatches)
  • Code style violations (naming, spacing, documentation)
  • Static analysis issues (async/await patterns, nullability, resource disposal)
  • Deprecated package usage (packages with known CVEs or EOL status)

Remediation Time: ≤30 minutes (most failures fixed by developers immediately)

Automation:

# Azure Pipelines: Build Quality Gate
- task: DotNetCoreCLI@2
  inputs:
    command: 'build'
    projects: '$(solution)'
    arguments: '--configuration Release /p:TreatWarningsAsErrors=true /p:EnforceCodeStyleInBuild=true'
  displayName: 'Build with Warnings as Errors'

- task: SonarQubePrepare@5
  inputs:
    SonarQube: 'SonarCloud-ConnectSoft'
    scannerMode: 'MSBuild'
    projectKey: 'ConnectSoft_ATP'

- task: SonarQubeAnalyze@5
  displayName: 'SonarQube Analysis'

- task: SonarQubePublish@5
  inputs:
    pollingTimeoutSec: '300'

Gate Category 2: Test Coverage

Purpose: Validate that sufficient automated tests exist and execute successfully, with adequate code coverage.

Enforcement Point: CI stage (after successful build)

Blocker Status: ✅ Yes — Pipeline fails if coverage thresholds not met or tests fail

Key Checks:

  1. Test Execution: All unit and integration tests pass (100% pass rate)
  2. Code Coverage: Line coverage ≥70%, branch coverage ≥60% (service-specific thresholds)
  3. Test Duration: Tests complete within acceptable time limits (unit <30s, integration <5min)
  4. Flaky Test Detection: No tests with <95% historical pass rate
  5. Test Quality: Minimum assertions per test, no skipped tests without justification
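
The flaky-test check (item 4 above) compares each test's historical pass rate against the 95% floor. A minimal sketch of that computation over exported test-run records (the `flag_flaky` helper and the input format are illustrative, not the actual ATP tooling):

```shell
#!/bin/bash
# Flag tests whose historical pass rate falls below the 95% flaky-test floor.
# Input file: one "<test-name> <pass|fail>" record per line (e.g. flattened
# from recent .trx results). Output: "<test-name> <rate>%" per flaky test.
flag_flaky() {
  awk '
    { total[$1]++; if ($2 == "pass") passed[$1]++ }
    END {
      for (t in total) {
        rate = (passed[t] + 0) / total[t]
        if (rate < 0.95) printf "%s %.0f%%\n", t, rate * 100
      }
    }' "$1"
}
```

Feeding it a history where a test passed 2 of 3 recent runs would flag it at 67%, while a consistently green test produces no output.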

Typical Failures:

  • Test failures due to bugs introduced in code changes
  • Coverage drops below threshold (new code without tests)
  • Flaky tests (intermittent failures due to timing, external dependencies)
  • Test timeouts (long-running tests, deadlocks, infinite loops)

Remediation Time: ≤2 hours (write missing tests, fix failing tests)

Service-Specific Thresholds:

| Service | Line Coverage | Branch Coverage | Rationale |
|---|---|---|---|
| Ingestion | ≥75% | ≥65% | Critical path for all audit events; high reliability requirement |
| Query | ≥80% | ≥70% | Complex query logic with multiple filters; high test coverage essential |
| Integrity | ≥85% | ≥75% | Security-critical tamper-evidence; highest coverage requirement |
| Export | ≥70% | ≥60% | I/O-heavy with external dependencies; lower threshold acceptable |
| Policy | ≥80% | ≥70% | Business rules enforcement; high coverage for rule validation |
| Search | ≥70% | ≥60% | Integration-heavy with Elasticsearch; focus on integration tests |
| Gateway | ≥65% | ≥55% | API routing and orchestration; lower threshold, focus on E2E tests |

Automation:

# Azure Pipelines: Test Coverage Gate
- task: DotNetCoreCLI@2
  inputs:
    command: 'test'
    projects: '**/*Tests.csproj'
    arguments: '--configuration Release --collect:"XPlat Code Coverage" --settings:CodeCoverage.runsettings'
  displayName: 'Run Tests with Coverage'

- task: PublishCodeCoverageResults@1
  inputs:
    codeCoverageTool: 'Cobertura'
    summaryFileLocation: '$(Agent.TempDirectory)/**/coverage.cobertura.xml'
  displayName: 'Publish Coverage Results'

- task: BuildQualityChecks@8
  inputs:
    checkCoverage: true
    coverageFailOption: 'fixed'
    coverageType: 'lines'
    coverageThreshold: 70  # ATP minimum baseline
    treatBuildWarningsAsErrors: true
  displayName: 'Enforce Coverage Threshold'

Gate Category 3: Security

Purpose: Detect vulnerabilities, secrets, and security issues before code reaches production.

Enforcement Point: CI stage (after tests) + Pre-deployment validation (staging/prod)

Blocker Status: ✅ Yes — Pipeline fails on critical/high vulnerabilities or detected secrets

Key Checks:

  1. Dependency Scanning: OWASP Dependency-Check for vulnerable NuGet packages (CVSS ≥7)
  2. Secrets Detection: CredScan, GitGuardian for API keys, passwords, tokens in code
  3. Container Scanning: Trivy scan for Docker image vulnerabilities (before ACR push)
  4. SAST (Static Application Security Testing): SonarQube security rules (injection, XSS, crypto)
  5. License Compliance: Verify all dependencies have acceptable licenses (no GPL/AGPL)

Typical Failures:

  • Vulnerable dependencies (outdated packages with known CVEs)
  • Secrets in code (connection strings, API keys, passwords in appsettings.json or code)
  • Container image vulnerabilities (base image outdated, vulnerable OS packages)
  • Insecure coding patterns (SQL injection, XSS, weak crypto)

Remediation Time: ≤24 hours (critical/high), ≤30 days (medium/low)

Severity Thresholds:

| Severity | CVSS Score | Action | SLA | Production Blocker |
|---|---|---|---|---|
| Critical | 9.0-10.0 | ❌ Block build immediately | Fix within 24h | ✅ Yes |
| High | 7.0-8.9 | ❌ Block build; require patching or risk acceptance | Fix within 7 days | ✅ Yes |
| Medium | 4.0-6.9 | ⚠️ Warning; track in security backlog | Fix within 30 days | ❌ No |
| Low | 0.1-3.9 | ℹ️ Info; track in backlog | Fix in next major release | ❌ No |
| None | 0.0 | ℹ️ Info; no action required | N/A | ❌ No |
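
The CVSS bands map mechanically to gate behavior. A sketch of that mapping (the `cvss_severity` and `is_blocker` helpers are illustrative; the actual enforcement is the `failOnCVSS: 7` setting in the pipeline task):

```shell
#!/bin/bash
# Map a CVSS score to the ATP severity band.
cvss_severity() {
  awk -v s="$1" 'BEGIN {
    if (s >= 9.0)      print "Critical"
    else if (s >= 7.0) print "High"
    else if (s >= 4.0) print "Medium"
    else if (s > 0.0)  print "Low"
    else               print "None"
  }'
}

# Critical/High block the build -- the same cut line as failOnCVSS: 7.
is_blocker() {
  case "$(cvss_severity "$1")" in
    Critical|High) return 0 ;;  # blocks the pipeline
    *)             return 1 ;;  # tracked, but not blocking
  esac
}
```

Medium and Low findings fall through to the security backlog rather than failing the build.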

Automation:

# Azure Pipelines: Security Gates
- task: dependency-check-build-task@6
  inputs:
    projectName: 'ConnectSoft.ATP.Ingestion'
    scanPath: '$(Build.SourcesDirectory)'
    format: 'HTML,JSON,XML'
    failOnCVSS: 7  # Block on High/Critical (CVSS ≥7)
    suppressionFile: 'dependency-check-suppressions.xml'
  displayName: 'OWASP Dependency Scan'

- task: CredScan@3
  inputs:
    toolMajorVersion: 'V2'
    suppressionsFile: 'credscan-suppressions.json'
  displayName: 'Secrets Detection'

- script: |
    trivy image --severity HIGH,CRITICAL --exit-code 1 \
      $(containerRegistry)/$(imageRepository):$(Build.BuildNumber)
  displayName: 'Trivy Container Scan'
  condition: and(succeeded(), eq(variables['Build.Reason'], 'PullRequest'))

Gate Category 4: Compliance

Purpose: Ensure regulatory compliance, audit logging, PII protection, and supply chain transparency.

Enforcement Point: CI stage (after security) + Pre-deployment validation

Blocker Status: ✅ Yes — Pipeline fails if SBOM missing, audit logging incomplete, or PII detected in logs

Key Checks:

  1. SBOM Generation: CycloneDX/SPDX bill of materials for all dependencies
  2. Audit Logging Validation: All state-mutating operations emit audit events
  3. PII Redaction: No raw PII (email, phone, SSN) in log statements
  4. Compliance Checklist: GDPR/HIPAA safeguards validated (encryption, retention, tenant isolation)
  5. License Compliance: All dependencies have acceptable licenses (no copyleft in production)
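
The PII redaction check (item 3) can be approximated by scanning log statements for interpolated fields with PII-like names. A rough bash sketch of what such a validator could do (the field-name heuristics and helper names are illustrative, not the actual script's logic):

```shell
#!/bin/bash
# Rough sketch of a PII-in-logs scan: flag Log* calls in C# sources that
# interpolate fields with PII-like names. Heuristics are illustrative only.
scan_pii_logging() {
  grep -rnE '\bLog(Information|Warning|Error|Debug)\(.*\{[^}]*(Email|Phone|Ssn|SocialSecurity)[^}]*\}' \
    --include='*.cs' "$1"
}

# Gate wrapper: non-zero exit blocks the pipeline (fail fast).
pii_gate() {
  if scan_pii_logging "$1" > /dev/null; then
    echo "❌ PII Redaction Gate Failed: raw PII field in log statement"
    return 1
  fi
  echo "✅ No raw PII detected in log statements"
}
```

A statement like `logger.LogInformation("User {Email} logged in", user.Email)` would fail the gate; logging an opaque `{UserId}` instead would pass.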

Typical Failures:

  • Missing SBOM (build artifact not generated or published)
  • Audit logging gaps (new API endpoints without audit event emission)
  • PII in logs (raw email/phone logged without redaction)
  • Compliance checklist items incomplete (e.g., retention policies not configured)

Remediation Time: ≤4 hours (SBOM/logging), ≤1 day (PII redaction), ≤1 week (compliance checklist)

GDPR/HIPAA Compliance Checklist:

| Control | Requirement | Validation | Blocker |
|---|---|---|---|
| Encryption at Rest | All databases, storage accounts encrypted (TDE, SSE) | Azure Policy scan | ✅ Yes (staging/prod) |
| Encryption in Transit | TLS 1.3 enforced for all external APIs | Network policy validation | ✅ Yes (prod) |
| Tenant Isolation | Multi-tenant data separation validated in integration tests | Test results (tag: @tenantIsolation) | ✅ Yes |
| Retention Policies | Configurable retention per tenant (7 years default) | Configuration validation | ✅ Yes |
| DSAR Workflow | Data Subject Access Request workflow implemented | API contract test (export endpoint) | ✅ Yes |
| Breach Notification | Incident response procedure documented | Document exists in repo | ⚠️ Warning |
| Audit Logging | All write operations emit audit events | Custom validator (IAuditLogger.LogAsync) | ✅ Yes |
| PII Redaction | Sensitive fields redacted in logs/telemetry | Custom validator (log parsing) | ✅ Yes |

Automation:

# Azure Pipelines: Compliance Gates
- task: CycloneDX@1
  inputs:
    projectPath: '$(Build.SourcesDirectory)'
    outputFormat: 'json,xml'
    outputPath: '$(Build.ArtifactStagingDirectory)/sbom'
  displayName: 'Generate SBOM (CycloneDX)'

- pwsh: |
    ./scripts/validate-audit-logging.ps1 -Path "$(Build.SourcesDirectory)" -Threshold 100
  displayName: 'Validate Audit Logging Coverage'

- pwsh: |
    ./scripts/validate-pii-redaction.ps1 -Path "$(Build.SourcesDirectory)"
  displayName: 'Validate PII Redaction'

- task: AzurePolicyCompliance@1
  inputs:
    azureSubscription: '$(azureSubscription)'
    resourceGroup: 'ATP-$(Environment)-RG'
    policyDefinitionId: '/providers/Microsoft.Authorization/policyDefinitions/...'
  displayName: 'Validate Azure Policy Compliance'

Gate Category 5: Performance

Purpose: Validate that application meets performance requirements under load and during failures.

Enforcement Point: Staging environment (before production deployment)

Blocker Status: ⚠️ Warning in staging, ✅ Blocker for production deployment

Key Checks:

  1. Load Testing: Simulate production traffic (500-1000 concurrent users, 10-15 minutes)
  2. Latency Thresholds: p50 <100ms, p95 <500ms, p99 <1000ms
  3. Error Rate: <0.1% (1 error per 1000 requests)
  4. Throughput: ≥1000 requests/second sustained
  5. Chaos Testing: Pod restarts, network latency, storage unavailability scenarios
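
The latency thresholds (item 2) are percentile cuts over the samples collected during the load run. A minimal sketch of the nearest-rank percentile math and the p95 check (helper names are illustrative; JMeter and k6 compute these natively):

```shell
#!/bin/bash
# Nearest-rank percentile over a file with one integer latency sample (ms) per line.
percentile() {
  local p=$1 file=$2
  sort -n "$file" | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      idx = int((p * NR + 99) / 100)  # ceil(p/100 * N) via integer math
      if (idx < 1) idx = 1
      print v[idx]
    }'
}

# p95 gate: >= 500 ms blocks the production deployment.
latency_gate() {
  local p95
  p95=$(percentile 95 "$1")
  if [ "$p95" -ge 500 ]; then
    echo "❌ p95 latency ${p95}ms breaches the <500ms threshold"
    return 1
  fi
  echo "✅ p95 latency ${p95}ms within threshold"
}
```

With 100 samples, p95 is the 95th value in sorted order; a single extreme outlier therefore moves p99 long before it moves p95.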

Typical Failures:

  • High latency due to inefficient queries, N+1 problems, missing indexes
  • High error rate due to race conditions, deadlocks, resource exhaustion
  • Low throughput due to synchronous I/O, single-threaded bottlenecks
  • Chaos test failures due to lack of retries, circuit breakers, graceful degradation

Remediation Time: ≤1 week (performance optimization), ≤2 weeks (resilience improvements)

Performance Metrics:

| Metric | Target (ATP) | Industry Standard | Measurement Tool | Action on Failure |
|---|---|---|---|---|
| p50 Latency | <100ms | <200ms | JMeter, k6 | ⚠️ Warning; investigate |
| p95 Latency | <500ms | <1000ms | JMeter, k6 | ❌ Block prod deployment |
| p99 Latency | <1000ms | <2000ms | JMeter, k6 | ⚠️ Warning; track |
| Error Rate | <0.1% | <1% | JMeter, k6 | ❌ Block prod deployment |
| Throughput | ≥1000 RPS | ≥500 RPS | JMeter, k6 | ℹ️ Info; track capacity |
| CPU Utilization | <70% avg | <80% avg | Azure Monitor | ⚠️ Warning; optimize |
| Memory Utilization | <80% avg | <85% avg | Azure Monitor | ⚠️ Warning; investigate leaks |

Chaos Test Scenarios:

| Scenario | Pass Rate | Blocker | Expected Behavior |
|---|---|---|---|
| Pod Restart (random pod killed) | 100% | ✅ Yes | Graceful shutdown, requests redistributed, no data loss |
| Network Latency (500ms added) | 95% | ❌ No | Timeouts honored, retries triggered, circuit breaker opens |
| Storage Unavailable (SQL/Blob down 30s) | 100% | ✅ Yes | Circuit breaker opens, degraded mode, no cascading failures |
| CPU Throttle (50% CPU limit) | 90% | ❌ No | Graceful degradation, autoscaling triggered, no OOM kills |
| Memory Pressure (80% memory used) | 95% | ❌ No | GC triggered, cache eviction, no OOM exceptions |

Automation:

# Azure Pipelines: Performance Gates (Staging)
- task: JMeterLoadTest@1
  inputs:
    testPlan: 'load-tests/atp-load-test.jmx'
    targetUrl: '$(StagingUrl)'
    users: 500
    duration: 600  # 10 minutes
    thresholdP50: 100
    thresholdP95: 500
    thresholdErrorRate: 0.1
  displayName: 'Run Load Test (JMeter)'

- task: ChaosTest@1
  inputs:
    chaosManifest: 'chaos-tests/pod-restart.yaml'
    namespace: 'atp-staging'
    duration: 300  # 5 minutes
  displayName: 'Run Chaos Test (Pod Restart)'

Gate Category 6: Observability

Purpose: Validate that application emits sufficient telemetry (logs, metrics, traces) for production observability.

Enforcement Point: Staging environment (before production deployment)

Blocker Status: ⚠️ Warning in staging, ✅ Blocker for production deployment

Key Checks:

  1. OpenTelemetry Instrumentation: All HTTP endpoints, database calls, message bus operations instrumented
  2. Health Checks: /health/live and /health/ready return 200 OK
  3. Structured Logging: All logs use structured logging (JSON), no string concatenation
  4. Custom Metrics: Business KPIs exposed (audit events ingested, queries executed, export jobs completed)
  5. Trace Context Propagation: Trace IDs propagated across service boundaries (W3C Trace Context)

Typical Failures:

  • Missing instrumentation (new endpoints without activity spans)
  • Health check failures (dependency checks fail, timeouts)
  • Unstructured logs (string concatenation, missing correlation IDs)
  • Missing metrics (business KPIs not exposed for Prometheus scraping)

Remediation Time: ≤4 hours (instrumentation), ≤1 day (health checks), ≤2 hours (structured logging)

Observability Requirements:

| Requirement | Validation | Tool | Blocker |
|---|---|---|---|
| Activity Spans | All HTTP endpoints have Activity spans | Custom validator (DI container scan) | ✅ Yes |
| Database Instrumentation | All EF Core queries instrumented | System.Diagnostics.DiagnosticSource listener | ✅ Yes |
| Structured Logging | All logs use `ILogger<T>` with structured parameters | Custom log parser | ✅ Yes |
| Health Checks | /health/live and /health/ready return 200 | HTTP test task | ✅ Yes |
| Prometheus Metrics | /metrics endpoint exposed and scrapable | Prometheus validation | ⚠️ Warning |
| Trace Context | TraceParent header propagated (W3C) | Integration test validation | ✅ Yes |
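The W3C Trace Context check in the last row reduces to validating the `traceparent` header's four fields. A minimal validation sketch (the field layout follows the W3C Trace Context format; the surrounding test harness is assumed):

```python
import re

# Minimal W3C traceparent validator, per the Trace Context header format:
# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex).
TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def is_valid_traceparent(header: str) -> bool:
    m = TRACEPARENT.match(header)
    if not m:
        return False
    # version ff is forbidden; all-zero trace-id/parent-id are invalid
    return (m["version"] != "ff"
            and m["trace_id"] != "0" * 32
            and m["parent_id"] != "0" * 16)
```

An integration test would assert this holds for the header each downstream service receives.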

Health Check Components (must all pass):

// Health check dependencies (all must be healthy)
public static class HealthCheckExtensions
{
    public static IHealthChecksBuilder AddAtpHealthChecks(
        this IHealthChecksBuilder builder,
        IConfiguration configuration)
    {
        return builder
            // Liveness checks (process is alive)
            .AddCheck("self", () => HealthCheckResult.Healthy("Service is running"))

            // Readiness checks (dependencies available)
            .AddSqlServer(
                connectionString: configuration.GetConnectionString("DefaultConnection"),
                name: "sql-server",
                tags: new[] { "ready", "database" })

            .AddRedis(
                redisConnectionString: configuration.GetConnectionString("Redis"),
                name: "redis-cache",
                tags: new[] { "ready", "cache" })

            .AddAzureServiceBusTopic(
                connectionString: configuration.GetConnectionString("ServiceBus"),
                topicName: "audit-events",
                name: "service-bus",
                tags: new[] { "ready", "messaging" })

            .AddAzureBlobStorage(
                connectionString: configuration.GetConnectionString("BlobStorage"),
                containerName: "audit-attachments",
                name: "blob-storage",
                tags: new[] { "ready", "storage" })

            .AddApplicationInsightsPublisher();
    }
}

Automation:

# Azure Pipelines: Observability Gates (Staging)
- task: HttpTest@1
  inputs:
    url: '$(StagingUrl)/health/ready'
    method: 'GET'
    expectedStatusCode: 200
    retryCount: 3
    retryDelay: 5
  displayName: 'Validate Health Checks'

- pwsh: |
    ./scripts/validate-otel-instrumentation.ps1 -Path "$(Build.SourcesDirectory)"
  displayName: 'Validate OpenTelemetry Instrumentation'

- script: |
    curl -s "$(StagingUrl)/metrics" | promtool check metrics
  displayName: 'Validate Prometheus Metrics'
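The HttpTest retry behavior above (3 retries, 5-second delay) can be replicated in a plain script. An illustrative sketch with the HTTP call injected so the retry logic is testable offline (function and parameter names are hypothetical):

```python
import time

def check_health(probe, retries=3, delay_seconds=5, sleep=time.sleep):
    """Call `probe()` (returns an HTTP status code) until it yields 200,
    retrying up to `retries` times with `delay_seconds` between attempts.
    Mirrors the HttpTest task configuration above."""
    for attempt in range(retries + 1):
        if probe() == 200:
            return True
        if attempt < retries:
            sleep(delay_seconds)
    return False
```

In the pipeline, `probe` would issue a GET against `$(StagingUrl)/health/ready`.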

Summary

  • 6 Gate Categories: Build Quality, Test Coverage, Security, Compliance, Performance, Observability
  • Sequential Execution: Gates run in order with early termination on failure (fail fast)
  • CI Gates (Build → Compliance): ~15-20 minutes, all blockers
  • Staging Gates (Performance + Observability): ~13-20 minutes, warnings that block production
  • Service-Specific Thresholds: Coverage varies by service (65%-85% based on criticality)
  • Severity-Based Actions: Critical/High vulnerabilities are blockers; Medium/Low are warnings
  • Compliance Focus: SBOM, audit logging, PII redaction, GDPR/HIPAA checklist all enforced
  • Performance Standards: p95 <500ms, error rate <0.1%, 1000+ RPS sustained
  • Observability Requirements: OpenTelemetry, health checks, structured logging, custom metrics all validated

Build Quality Gates (Deep Dive)

Build quality gates are the first line of defense in ATP's quality enforcement strategy. They execute immediately after code is committed, providing rapid feedback to developers before any tests run or security scans execute.

Philosophy: If code doesn't compile cleanly or violates coding standards, there's no point in running expensive test suites or security scans. Build quality gates ensure a baseline of code hygiene before proceeding.

Build Quality Gate Workflow

graph TD
    A[Code Committed] --> B[Restore NuGet Packages]
    B --> C[Compile Code]
    C --> D{Build Success?}
    D -->|No| E[Build Failed ❌]
    D -->|Yes| F[Run StyleCop Analysis]
    F --> G{StyleCop Pass?}
    G -->|No| H[Style Violations ❌]
    G -->|Yes| I[Run SonarQube Scan]
    I --> J{SonarQube Quality Gate?}
    J -->|No| K[Quality Gate Failed ❌]
    J -->|Yes| L[Run Meziantou/AsyncFixer]
    L --> M{Analyzers Pass?}
    M -->|No| N[Analyzer Violations ❌]
    M -->|Yes| O[Build Quality Passed ✅]

    E --> P[Pipeline Stopped]
    H --> P
    K --> P
    N --> P
    O --> Q[Proceed to Test Gates]

    style E fill:#ff6b6b
    style H fill:#ff6b6b
    style K fill:#ff6b6b
    style N fill:#ff6b6b
    style O fill:#90EE90

Typical Build Quality Gate Duration: 2-4 minutes


Code Compilation

Purpose: Ensure all code compiles successfully with zero errors and zero warnings before any further validation.

Threshold:

  • Build Errors: 0 (absolute requirement)
  • Build Warnings: 0 (all warnings treated as errors)
  • Exit Code: dotnet build must return 0

Configuration (.csproj):

<Project Sdk="Microsoft.NET.Sdk.Web">
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <Nullable>enable</Nullable>
    <ImplicitUsings>enable</ImplicitUsings>

    <!-- Build Quality Enforcement -->
    <TreatWarningsAsErrors>true</TreatWarningsAsErrors>
    <WarningsAsErrors />
    <NoWarn></NoWarn>  <!-- Baseline empty: no blanket suppressions (CS1591 appended below) -->

    <!-- Code Analysis -->
    <EnforceCodeStyleInBuild>true</EnforceCodeStyleInBuild>
    <EnableNETAnalyzers>true</EnableNETAnalyzers>
    <AnalysisLevel>latest</AnalysisLevel>
    <AnalysisMode>All</AnalysisMode>

    <!-- Documentation Enforcement -->
    <GenerateDocumentationFile>true</GenerateDocumentationFile>
    <NoWarn>$(NoWarn);1591</NoWarn>  <!-- Temporarily allow missing XML docs -->

    <!-- Deterministic Builds (for reproducibility) -->
    <Deterministic>true</Deterministic>
    <ContinuousIntegrationBuild Condition="'$(CI)' == 'true'">true</ContinuousIntegrationBuild>
  </PropertyGroup>

  <!-- Static Analysis Packages -->
  <ItemGroup>
    <PackageReference Include="StyleCop.Analyzers" Version="1.2.0-beta.556">
      <PrivateAssets>all</PrivateAssets>
      <IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
    </PackageReference>
    <PackageReference Include="Meziantou.Analyzer" Version="2.0.110">
      <PrivateAssets>all</PrivateAssets>
      <IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
    </PackageReference>
    <PackageReference Include="AsyncFixer" Version="1.6.0">
      <PrivateAssets>all</PrivateAssets>
      <IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
    </PackageReference>
    <PackageReference Include="Microsoft.CodeAnalysis.NetAnalyzers" Version="8.0.0">
      <PrivateAssets>all</PrivateAssets>
      <IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
    </PackageReference>
  </ItemGroup>

  <!-- StyleCop Configuration -->
  <ItemGroup>
    <AdditionalFiles Include="stylecop.json" />
  </ItemGroup>
</Project>

Enforcement (Azure Pipelines):

# Build Compilation Gate
- task: DotNetCoreCLI@2
  inputs:
    command: 'build'
    projects: '$(solution)'
    arguments: >
      --configuration Release
      --no-restore
      /p:TreatWarningsAsErrors=true
      /p:EnforceCodeStyleInBuild=true
      /p:ContinuousIntegrationBuild=true
      /p:Deterministic=true
      /warnaserror
  displayName: 'Build with Warnings as Errors'

  # Fail pipeline on non-zero exit code
  continueOnError: false

  # Capture build logs for diagnostics
  env:
    DOTNET_CLI_TELEMETRY_OPTOUT: 1
    DOTNET_SKIP_FIRST_TIME_EXPERIENCE: 1

Common Build Failures & Remediation:

| Failure Type | Example Error | Remediation | Typical Time |
|---|---|---|---|
| Syntax Error | CS1002: ; expected | Fix syntax in code | 1-5 min |
| Type Mismatch | CS0029: Cannot implicitly convert type | Fix type casting or generics | 5-15 min |
| Missing Reference | CS0246: The type or namespace could not be found | Add NuGet package or project reference | 5-10 min |
| Nullability Warning | CS8600: Converting null literal or possible null value | Add null checks or nullable annotations | 10-30 min |
| Async/Await | CS4014: Call is not awaited | Add `await` or `.ConfigureAwait(false)` | 5-10 min |
| Unused Variable | CS0219: Variable is assigned but never used | Remove variable or use it | 1-2 min |
| Missing XML Doc | CS1591: Missing XML comment for publicly visible type | Add `/// <summary>` documentation | 10-20 min |
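The failure table can back a small triage helper that maps a compiler diagnostic ID found in build output to its remediation bucket. An illustrative sketch (the mapping mirrors the table; the helper itself is hypothetical):

```python
import re

# Map compiler diagnostic IDs to remediation hints from the table above.
# Helper name and mapping structure are illustrative, not a real ATP tool.
REMEDIATION = {
    "CS1002": ("Syntax error", "Fix syntax in code"),
    "CS0029": ("Type mismatch", "Fix type casting or generics"),
    "CS0246": ("Missing reference", "Add NuGet package or project reference"),
    "CS8600": ("Nullability warning", "Add null checks or nullable annotations"),
    "CS4014": ("Async/await", "Add await or .ConfigureAwait(false)"),
    "CS0219": ("Unused variable", "Remove variable or use it"),
    "CS1591": ("Missing XML doc", "Add /// <summary> documentation"),
}

def triage(build_output: str):
    """Extract CS diagnostic IDs from build output and return
    (id, category, remediation) tuples for known failure types."""
    found = []
    for code in re.findall(r"\bCS\d{4}\b", build_output):
        if code in REMEDIATION:
            found.append((code, *REMEDIATION[code]))
    return found
```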

Build Performance Optimization:

# Local developer build (fast feedback)
dotnet build --configuration Debug --no-restore /p:TreatWarningsAsErrors=false

# CI build (full enforcement)
dotnet build --configuration Release --no-restore /p:TreatWarningsAsErrors=true /p:ContinuousIntegrationBuild=true

# Parallel build for large solutions (8 CPUs)
dotnet build --configuration Release -m:8

# Build with binary log (for diagnostics)
dotnet build --configuration Release /bl:build.binlog

Static Code Analysis

ATP uses four complementary static analyzers to enforce code quality, each focusing on different aspects of code hygiene.

Analyzer 1: StyleCop (Code Style & Documentation)

Purpose: Enforce consistent code style, naming conventions, and documentation standards across all ATP services.

Rules Enforced: 125+ rules covering naming, spacing, ordering, documentation, maintainability

Configuration (stylecop.json):

{
  "$schema": "https://raw.githubusercontent.com/DotNetAnalyzers/StyleCopAnalyzers/master/StyleCop.Analyzers/StyleCop.Analyzers/Settings/stylecop.schema.json",
  "settings": {
    "documentationRules": {
      "companyName": "ConnectSoft",
      "copyrightText": "Copyright (c) {companyName}. All rights reserved.\nLicensed under the MIT license.",
      "headerDecoration": "-----------------------------------------------------------------------",
      "xmlHeader": true,
      "documentInterfaces": true,
      "documentExposedElements": true,
      "documentInternalElements": false,
      "documentPrivateElements": false,
      "documentPrivateFields": false,
      "fileNamingConvention": "stylecop"
    },
    "namingRules": {
      "allowCommonHungarianPrefixes": false,
      "allowedHungarianPrefixes": [],
      "includeInferredTupleElementNames": true,
      "tupleElementNameCasing": "camelCase"
    },
    "orderingRules": {
      "elementOrder": [
        "kind",
        "accessibility",
        "constant",
        "static",
        "readonly"
      ],
      "systemUsingDirectivesFirst": true,
      "usingDirectivesPlacement": "outsideNamespace",
      "blankLinesBetweenUsingGroups": "allow"
    },
    "maintainabilityRules": {
      "topLevelTypes": "multiple"
    },
    "layoutRules": {
      "newlineAtEndOfFile": "require",
      "allowConsecutiveUsings": true
    }
  }
}

Key StyleCop Rules (ATP-Specific):

| Rule ID | Rule Name | Severity | Example Violation | Fix |
|---|---|---|---|---|
| SA1200 | Using directives placement | Error | `using` inside namespace | Move `using` outside namespace |
| SA1633 | File header required | Warning | Missing copyright header | Add standard file header |
| SA1600 | Elements should be documented | Warning | Missing XML documentation | Add `/// <summary>` tags |
| SA1309 | Field names must not begin with underscore | Error | `_field` for public fields | Use `_field` only for private fields |
| SA1101 | Prefix local calls with this | Disabled | — | ATP preference: no `this.` prefix |
| SA1503 | Braces for single-line statements | Error | `if (x) DoSomething();` | Add braces: `if (x) { DoSomething(); }` |
| SA1516 | Elements should be separated by blank line | Warning | No blank line between methods | Add blank line |

StyleCop Suppression (when necessary):

// Global suppression (GlobalSuppressions.cs)
[assembly: SuppressMessage("StyleCop.CSharp.DocumentationRules", "SA1633:File should have header", Justification = "Reviewed: Standard header enforced by .editorconfig")]

// Local suppression (specific violation)
#pragma warning disable SA1600 // Elements should be documented
public class GeneratedClass  // Auto-generated, no docs needed
{
}
#pragma warning restore SA1600

Analyzer 2: SonarQube (Bugs, Code Smells, Security)

Purpose: Detect bugs, code smells, and security vulnerabilities through deep semantic analysis.

Rules Enforced: 500+ rules covering reliability, maintainability, security, code smells

Quality Profile (ConnectSoft-ATP-Default):

# SonarQube Quality Profile (ATP)
qualityGate:
  name: ConnectSoft-ATP-Default

  conditions:
    # Reliability: Zero bugs allowed
    - metric: bugs
      operator: GREATER_THAN
      threshold: 0
      description: "Zero tolerance for bugs"

    # Security: Zero vulnerabilities allowed
    - metric: vulnerabilities
      operator: GREATER_THAN
      threshold: 0
      description: "Zero tolerance for security issues"

    # Security: Zero security hotspots in review
    - metric: security_hotspots_reviewed
      operator: LESS_THAN
      threshold: 100
      description: "All security hotspots must be reviewed"

    # Maintainability: Max 10 code smells (minor issues)
    - metric: code_smells
      operator: GREATER_THAN
      threshold: 10
      description: "Limit technical debt"

    # Coverage: Minimum 70% line coverage
    - metric: coverage
      operator: LESS_THAN
      threshold: 70.0
      description: "Enforce minimum test coverage"

    # Duplication: Max 3% duplicated lines
    - metric: duplicated_lines_density
      operator: GREATER_THAN
      threshold: 3.0
      description: "Prevent copy-paste programming"

    # Complexity: Cognitive complexity ≤15 per method
    - metric: cognitive_complexity
      operator: GREATER_THAN
      threshold: 15
      description: "Keep methods simple"

    # New Code: Zero new bugs in new code
    - metric: new_bugs
      operator: GREATER_THAN
      threshold: 0
      onlyNewCode: true
      description: "No new bugs introduced"

    # New Code: 100% coverage on new code
    - metric: new_coverage
      operator: LESS_THAN
      threshold: 100.0
      onlyNewCode: true
      description: "All new code must be tested"
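The quality-gate conditions above are just (metric, operator, threshold) triples; evaluating them against a set of measured values can be sketched as follows (the condition data mirrors a subset of the profile; the evaluator itself is hypothetical, not the SonarQube implementation):

```python
# Illustrative sketch: evaluate SonarQube-style quality gate conditions.
# A condition FAILS when the measured value crosses its threshold in the
# direction named by the operator (as in the profile above).

CONDITIONS = [
    ("bugs", "GREATER_THAN", 0),
    ("vulnerabilities", "GREATER_THAN", 0),
    ("code_smells", "GREATER_THAN", 10),
    ("coverage", "LESS_THAN", 70.0),
    ("duplicated_lines_density", "GREATER_THAN", 3.0),
]

def quality_gate(measures):
    """measures: {metric: value}. Returns ("OK"|"ERROR", failed conditions)."""
    failures = []
    for metric, op, threshold in CONDITIONS:
        value = measures.get(metric)
        if value is None:
            continue  # metric not reported for this analysis
        failed = value > threshold if op == "GREATER_THAN" else value < threshold
        if failed:
            failures.append(f"{metric}={value} (threshold {op} {threshold})")
    return ("OK" if not failures else "ERROR", failures)
```

This is the same pass/fail shape the pipeline later reads from `projectStatus.status`.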

SonarQube Integration (Azure Pipelines):

# SonarQube Analysis Gate
stages:
- stage: CI_Stage
  jobs:
  - job: Build_Analyze
    steps:
    # 1. Prepare SonarQube
    - task: SonarQubePrepare@5
      inputs:
        SonarQube: 'SonarCloud-ConnectSoft'
        scannerMode: 'MSBuild'
        projectKey: 'ConnectSoft_ATP_Ingestion'
        projectName: 'ATP Ingestion Service'
        projectVersion: '$(Build.BuildNumber)'
        extraProperties: |
          sonar.organization=connectsoft
          sonar.sources=src
          sonar.tests=tests
          sonar.cs.opencover.reportsPaths=$(Agent.TempDirectory)/**/coverage.opencover.xml
          sonar.exclusions=**/Migrations/**,**/obj/**,**/bin/**
          sonar.coverage.exclusions=**/*Tests.cs,**/Program.cs,**/Startup.cs
          sonar.cpd.exclusions=**/Models/**,**/DTOs/**
      displayName: 'Prepare SonarQube Analysis'

    # 2. Restore NuGet packages
    - task: DotNetCoreCLI@2
      inputs:
        command: 'restore'
        projects: '$(solution)'
      displayName: 'Restore NuGet Packages'

    # 3. Build (SonarQube collects metrics)
    - task: DotNetCoreCLI@2
      inputs:
        command: 'build'
        projects: '$(solution)'
        arguments: '--configuration Release --no-restore'
      displayName: 'Build Solution'

    # 4. Run Tests (coverage data for SonarQube)
    - task: DotNetCoreCLI@2
      inputs:
        command: 'test'
        projects: '**/*Tests.csproj'
        arguments: '--configuration Release --no-build --collect:"XPlat Code Coverage" -- DataCollectionRunSettings.DataCollectors.DataCollector.Configuration.Format=opencover'
      displayName: 'Run Tests with Coverage'

    # 5. Analyze Code with SonarQube
    - task: SonarQubeAnalyze@5
      displayName: 'Run SonarQube Analysis'

    # 6. Publish Quality Gate Result
    - task: SonarQubePublish@5
      inputs:
        pollingTimeoutSec: '300'
      displayName: 'Publish Quality Gate Result'

    # 7. Break Build on Quality Gate Failure
    - script: |
        # Query SonarQube API for quality gate status
        QUALITY_GATE=$(curl -s -u "$SonarToken:" \
          "https://sonarcloud.io/api/qualitygates/project_status?projectKey=ConnectSoft_ATP_Ingestion" \
          | jq -r '.projectStatus.status')

        if [ "$QUALITY_GATE" != "OK" ]; then
          echo "##vso[task.logissue type=error]SonarQube Quality Gate Failed: $QUALITY_GATE"
          echo "##vso[task.complete result=Failed;]Quality Gate Failed"
          exit 1
        fi

        echo "✅ SonarQube Quality Gate Passed"
      displayName: 'Validate Quality Gate'
      env:
        SonarToken: $(SonarQubeToken)

Top SonarQube Rules (ATP-Critical):

| Rule ID | Rule Name | Type | Severity | Example | Fix |
|---|---|---|---|---|---|
| S1172 | Unused method parameters | Code Smell | Major | `public void Process(int unused)` | Remove or use parameter |
| S2589 | Boolean expressions should not be gratuitous | Bug | Blocker | `if (x == true && x == false)` | Fix logic error |
| S2696 | Instance methods should not write to static fields | Bug | Critical | Instance method writes to static field | Refactor to instance field |
| S3776 | Cognitive complexity too high | Code Smell | Critical | Method with complexity > 15 | Refactor into smaller methods |
| S1135 | Track uses of "TODO" tags | Info | Info | `// TODO: Fix this` | Create work item, remove TODO |
| S4790 | Weak cryptographic algorithms | Vulnerability | Blocker | `MD5.Create()` | Use SHA256 or better |
| S2077 | Formatting SQL queries is security-sensitive | Security Hotspot | Major | `$"SELECT * FROM Users WHERE Id={id}"` | Use parameterized queries |
| S1481 | Unused local variables should be removed | Code Smell | Minor | `var unused = GetData();` | Remove or use variable |

SonarQube False Positive Suppression:

// Suppress specific rule for method
[SuppressMessage("SonarQube", "S3776:Cognitive Complexity of methods should not be too high", 
    Justification = "Complex business logic; covered by tests")]
public async Task<AuditEventResult> ProcessComplexEvent(AuditEvent evt)
{
    // Complex logic here
}

// Suppress for entire file (use sparingly)
#pragma warning disable S1135 // Track uses of "TODO" tags
// TODO: This entire file is a prototype
#pragma warning restore S1135

Analyzer 3: Meziantou.Analyzer (Best Practices)

Purpose: Enforce .NET best practices, async/await patterns, and performance optimizations.

Rules Enforced: 150+ rules covering async, collections, strings, disposal, naming

Key Meziantou Rules (ATP-Enabled):

| Rule ID | Rule Name | Severity | Example | Fix |
|---|---|---|---|---|
| MA0001 | StringComparison is missing | Warning | `str.Contains("test")` | `str.Contains("test", StringComparison.Ordinal)` |
| MA0004 | Use Task.ConfigureAwait(false) | Warning | `await Task.Delay(100);` | `await Task.Delay(100).ConfigureAwait(false);` |
| MA0006 | Use String.Equals instead of equality operator | Warning | `str == "test"` | `str.Equals("test", StringComparison.Ordinal)` |
| MA0011 | IFormatProvider is missing | Warning | `int.Parse("123")` | `int.Parse("123", CultureInfo.InvariantCulture)` |
| MA0016 | Prefer return collection abstraction | Warning | `public List<T> Get()` | `public IEnumerable<T> Get()` or `IReadOnlyList<T>` |
| MA0026 | Fix TODO comment | Info | `// TODO: Implement` | Create work item, remove TODO |
| MA0040 | Use a cancellation token | Warning | `async Task DoWork()` | `async Task DoWork(CancellationToken ct)` |
| MA0051 | Method is too long | Warning | Method > 60 lines | Refactor into smaller methods |
| MA0056 | Do not call overridable members in constructor | Warning | `ctor() { VirtualMethod(); }` | Move to Initialize() method |

.editorconfig Configuration (Meziantou):

# Meziantou Analyzer Rules
[*.cs]

# MA0001: StringComparison is missing
dotnet_diagnostic.MA0001.severity = warning

# MA0004: Use Task.ConfigureAwait(false)
dotnet_diagnostic.MA0004.severity = warning

# MA0006: Use String.Equals
dotnet_diagnostic.MA0006.severity = warning

# MA0011: IFormatProvider is missing
dotnet_diagnostic.MA0011.severity = warning

# MA0016: Prefer return collection abstraction
dotnet_diagnostic.MA0016.severity = warning

# MA0040: Use a cancellation token
dotnet_diagnostic.MA0040.severity = warning

# MA0051: Method is too long (disable for auto-generated code)
dotnet_diagnostic.MA0051.severity = none

# MA0056: Do not call overridable members in constructor
dotnet_diagnostic.MA0056.severity = error

Analyzer 4: AsyncFixer (Async/Await Correctness)

Purpose: Detect async/await anti-patterns that can cause deadlocks, performance issues, or incorrect behavior.

Rules Enforced: 6 critical async patterns

Key AsyncFixer Rules (ATP-Enabled):

| Rule ID | Rule Name | Severity | Example | Issue | Fix |
|---|---|---|---|---|---|
| AsyncFixer01 | Unnecessary async/await | Warning | `async Task<int> Get() => await Task.FromResult(1);` | Unnecessary overhead | `Task<int> Get() => Task.FromResult(1);` |
| AsyncFixer02 | Long-running or blocking operations | Warning | `Task.Run(() => { Thread.Sleep(1000); })` | Blocking thread pool | Use `await Task.Delay(1000)` |
| AsyncFixer03 | Fire-and-forget async void | Error | `async void ProcessEvent()` | Unhandled exceptions | `async Task ProcessEventAsync()` |
| AsyncFixer04 | Fire-and-forget async call | Warning | `ProcessEventAsync(); // not awaited` | Lost exceptions | `await ProcessEventAsync();` |
| AsyncFixer05 | Downcasting from a nested task to an outer task | Error | `(Task<int>)task` | Runtime exception risk | Use `Task.FromResult<int>()` or generics |
| AsyncFixer06 | Missing ConfigureAwait(false) | Warning | `await client.GetAsync(url);` | Potential deadlock in UI apps | `await client.GetAsync(url).ConfigureAwait(false);` |

AsyncFixer Examples & Remediation:

// ❌ BAD: Async void (AsyncFixer03)
public async void ProcessEvent(AuditEvent evt)  // Unhandled exceptions disappear
{
    await _repository.SaveAsync(evt);
}

// ✅ GOOD: Async Task
public async Task ProcessEventAsync(AuditEvent evt)  // Exceptions propagate correctly
{
    await _repository.SaveAsync(evt);
}

// ❌ BAD: Fire-and-forget (AsyncFixer04)
public void EnqueueEvent(AuditEvent evt)
{
    ProcessEventAsync(evt);  // Not awaited; exceptions lost
}

// ✅ GOOD: Awaited or properly handled
public async Task EnqueueEventAsync(AuditEvent evt)
{
    await ProcessEventAsync(evt);  // Exceptions propagate
}

// OR: Explicitly fire-and-forget with error handling
public void EnqueueEvent(AuditEvent evt)
{
    _ = ProcessEventAsync(evt).ContinueWith(t =>
    {
        if (t.IsFaulted)
        {
            _logger.LogError(t.Exception, "Event processing failed");
        }
    }, TaskScheduler.Default);
}

// ❌ BAD: Blocking in async (AsyncFixer02)
public async Task<string> GetDataAsync()
{
    return await Task.Run(() =>
    {
        Thread.Sleep(1000);  // Blocking thread pool thread
        return "data";
    });
}

// ✅ GOOD: Proper async
public async Task<string> GetDataAsync()
{
    await Task.Delay(1000);  // Non-blocking
    return "data";
}

// ❌ BAD: Missing ConfigureAwait (AsyncFixer06)
public async Task<AuditEvent> GetEventAsync(Guid id)
{
    var json = await _httpClient.GetStringAsync($"/api/events/{id}");  // Captures context
    return JsonSerializer.Deserialize<AuditEvent>(json);
}

// ✅ GOOD: ConfigureAwait(false) in library code
public async Task<AuditEvent> GetEventAsync(Guid id)
{
    var json = await _httpClient.GetStringAsync($"/api/events/{id}").ConfigureAwait(false);
    return JsonSerializer.Deserialize<AuditEvent>(json);
}

.editorconfig (Unified Analyzer Configuration)

ATP uses .editorconfig to centralize analyzer rule severity across all services.

.editorconfig (ATP Standard):

# ConnectSoft ATP .editorconfig
# Applied to all C# files in the repository

root = true

# All files
[*]
charset = utf-8
insert_final_newline = true
trim_trailing_whitespace = true
indent_style = space

# C# files
[*.cs]
indent_size = 4
end_of_line = lf

# Build Quality: Treat warnings as errors
dotnet_analyzer_diagnostic.severity = error

# Nullable Reference Types (enabled per project via <Nullable>enable</Nullable>;
# .editorconfig controls only the diagnostic severities below)
dotnet_diagnostic.CS8600.severity = error  # Converting null literal
dotnet_diagnostic.CS8601.severity = error  # Possible null reference assignment
dotnet_diagnostic.CS8602.severity = error  # Dereference of a possibly null reference
dotnet_diagnostic.CS8603.severity = error  # Possible null reference return
dotnet_diagnostic.CS8604.severity = error  # Possible null reference argument

# StyleCop Rules (selective enforcement)
dotnet_diagnostic.SA1101.severity = none      # Prefix local calls with this (disabled)
dotnet_diagnostic.SA1200.severity = error     # Using directives placement
dotnet_diagnostic.SA1309.severity = error     # Field names must not begin with underscore (public)
dotnet_diagnostic.SA1503.severity = error     # Braces for single-line statements
dotnet_diagnostic.SA1516.severity = warning   # Elements separated by blank line
dotnet_diagnostic.SA1600.severity = warning   # Elements should be documented
dotnet_diagnostic.SA1633.severity = none      # File header (handled by .editorconfig)

# SonarQube Rules (critical only)
dotnet_diagnostic.S1172.severity = warning    # Unused parameters
dotnet_diagnostic.S2589.severity = error      # Boolean expressions gratuitous
dotnet_diagnostic.S2696.severity = error      # Instance methods write to static fields
dotnet_diagnostic.S3776.severity = warning    # Cognitive complexity
dotnet_diagnostic.S4790.severity = error      # Weak cryptographic algorithms

# Meziantou Rules
dotnet_diagnostic.MA0001.severity = warning   # StringComparison missing
dotnet_diagnostic.MA0004.severity = warning   # ConfigureAwait missing
dotnet_diagnostic.MA0006.severity = warning   # Use String.Equals
dotnet_diagnostic.MA0011.severity = warning   # IFormatProvider missing
dotnet_diagnostic.MA0040.severity = warning   # Use cancellation token

# AsyncFixer Rules
dotnet_diagnostic.AsyncFixer01.severity = warning  # Unnecessary async/await
dotnet_diagnostic.AsyncFixer02.severity = warning  # Blocking operations
dotnet_diagnostic.AsyncFixer03.severity = error    # Async void
dotnet_diagnostic.AsyncFixer04.severity = warning  # Fire-and-forget
dotnet_diagnostic.AsyncFixer06.severity = warning  # Missing ConfigureAwait

# Code Style
csharp_prefer_braces = true:error
csharp_prefer_simple_using_statement = true:suggestion
csharp_style_namespace_declarations = file_scoped:warning
csharp_style_var_for_built_in_types = false:suggestion
csharp_style_var_when_type_is_apparent = true:suggestion
csharp_style_var_elsewhere = false:suggestion

Build Quality Metrics & Dashboard

Azure DevOps Dashboard (Build Quality Widget):

# Build Quality Dashboard Configuration
dashboard:
  name: "ATP Build Quality"
  widgets:
    - type: buildQuality
      title: "Build Success Rate"
      query: "Build Success Rate (Last 30 Days)"
      metric: successRate
      target: 95%

    - type: codeQuality
      title: "SonarQube Quality Gate"
      query: "Quality Gate Pass Rate"
      metric: qualityGatePass
      target: 100%

    - type: codeAnalysis
      title: "Analyzer Violations"
      query: "Analyzer Violations (Last 7 Days)"
      metrics:
        - StyleCop: 0
        - SonarQube: 0
        - Meziantou: < 10
        - AsyncFixer: 0

Build Quality KQL Queries (Application Insights):

// Build success rate by service (last 30 days)
customEvents
| where name == "BuildCompleted"
| where timestamp > ago(30d)
| extend Service = tostring(customDimensions.Service)
| extend Success = tostring(customDimensions.Success) == "true"
| summarize 
    TotalBuilds = count(),
    SuccessfulBuilds = countif(Success),
    SuccessRate = 100.0 * countif(Success) / count()
    by Service
| order by SuccessRate asc

// Top build failure reasons (last 7 days)
customEvents
| where name == "BuildFailed"
| where timestamp > ago(7d)
| extend Reason = tostring(customDimensions.FailureReason)
| summarize FailureCount = count() by Reason
| order by FailureCount desc
| take 10

// Average build duration trend (last 90 days)
customEvents
| where name == "BuildCompleted"
| where timestamp > ago(90d)
| extend DurationSeconds = todouble(customDimensions.DurationSeconds)
| summarize AvgDuration = avg(DurationSeconds) by bin(timestamp, 1d)
| render timechart
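The first query's aggregation (countif over count, grouped by service) is easy to mirror over an in-memory event list; an illustrative sketch, with the event shape assumed to match the customDimensions used above:

```python
from collections import defaultdict

# Illustrative sketch: compute per-service build success rate, mirroring
# the first KQL query above (100 * countif(Success) / count() by Service).

def build_success_rates(events):
    """events: iterable of {"Service": str, "Success": bool}.
    Returns {service: success rate %}, like the KQL summarize."""
    totals, successes = defaultdict(int), defaultdict(int)
    for evt in events:
        totals[evt["Service"]] += 1
        successes[evt["Service"]] += evt["Success"]  # bool counts as 0/1
    return {svc: 100.0 * successes[svc] / totals[svc] for svc in totals}
```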

Summary

  • Build Quality Gates: First line of defense with 2-4 minute execution time
  • Code Compilation: Zero errors, zero warnings (TreatWarningsAsErrors=true)
  • 4 Static Analyzers: StyleCop (style), SonarQube (bugs/smells/security), Meziantou (best practices), AsyncFixer (async correctness)
  • StyleCop: 125+ rules enforcing code style, naming, documentation
  • SonarQube: 500+ rules with quality gate (0 bugs, 0 vulnerabilities, ≤10 code smells, ≥70% coverage, ≤3% duplication)
  • Meziantou: 150+ rules for .NET best practices (StringComparison, ConfigureAwait, IFormatProvider, cancellation tokens)
  • AsyncFixer: 6 critical rules preventing async anti-patterns (no async void, ConfigureAwait, fire-and-forget)
  • .editorconfig: Centralized analyzer configuration with severity levels per rule
  • Enforcement: All analyzers run during build; pipeline fails on any error-level violation
  • Typical Remediation: 1-30 minutes per build failure depending on type

Test Coverage Gates (Deep Dive)

Test coverage gates ensure that sufficient automated tests exist and execute successfully with adequate code coverage. ATP enforces 100% test pass rate and service-specific coverage thresholds to maintain high reliability and prevent regression.

Philosophy: Code without tests is legacy code. ATP requires that all new code is accompanied by comprehensive unit and integration tests, with coverage thresholds calibrated to each service's criticality and complexity.

Test Coverage Gate Workflow

graph TD
    A[Build Successful] --> B[Restore Test Projects]
    B --> C[Run Unit Tests]
    C --> D{All Tests Pass?}
    D -->|No| E[Test Failures ❌]
    D -->|Yes| F[Run Integration Tests]
    F --> G{All Tests Pass?}
    G -->|No| H[Integration Test Failures ❌]
    G -->|Yes| I[Collect Code Coverage]
    I --> J[Generate Coverage Report]
    J --> K{Coverage ≥ Threshold?}
    K -->|No| L[Coverage Too Low ❌]
    K -->|Yes| M[Detect Flaky Tests]
    M --> N{Flaky Rate < 5%?}
    N -->|No| O[Flaky Tests Detected ⚠️]
    N -->|Yes| P[Test Coverage Passed ✅]

    E --> Q[Pipeline Stopped]
    H --> Q
    L --> Q
    O --> R[Warning: Fix Flaky Tests]
    P --> S[Proceed to Security Gates]

    style E fill:#ff6b6b
    style H fill:#ff6b6b
    style L fill:#ff6b6b
    style O fill:#feca57
    style P fill:#90EE90

Typical Test Coverage Gate Duration: 3-5 minutes


Coverage Thresholds (Per Service)

ATP enforces service-specific coverage thresholds based on each service's criticality, complexity, and architectural patterns. Security-critical services have higher thresholds than I/O-heavy integration services.

Service Threshold Matrix:

| Service | Line Coverage | Branch Coverage | Min Tests | Max Test Duration | Rationale |
|---|---|---|---|---|---|
| Ingestion | ≥75% | ≥65% | 100+ | 5 minutes | Critical path for all audit events; high reliability requirement; complex validation logic |
| Query | ≥80% | ≥70% | 150+ | 5 minutes | Complex query logic with dynamic filters, pagination, sorting; high test coverage essential |
| Integrity | ≥85% | ≥75% | 80+ | 3 minutes | Security-critical tamper-evidence, hash chains, digital signatures; highest coverage requirement |
| Export | ≥70% | ≥60% | 60+ | 7 minutes | I/O-heavy with external dependencies (Blob, CSV, PDF); lower threshold acceptable |
| Policy | ≥80% | ≥70% | 120+ | 4 minutes | Business rules enforcement; high coverage for rule validation and policy evaluation |
| Search | ≥70% | ≥60% | 80+ | 6 minutes | Integration-heavy with Elasticsearch; focus on integration tests over unit tests |
| Gateway | ≥65% | ≥55% | 50+ | 4 minutes | API routing and orchestration; lower threshold, focus on E2E tests and contract validation |

Threshold Rationale:

# Why different thresholds per service?

ingestion:
  threshold: 75%
  rationale: |
    - Critical path: All audit events flow through Ingestion
    - Complex validation: Schema validation, tenant isolation, duplicate detection
    - High reliability: 99.9% uptime SLA
    - Consequence of failure: Audit events lost (catastrophic)
  riskProfile: Critical

query:
  threshold: 80%
  rationale: |
    - Complex logic: Dynamic query building, filter composition, pagination
    - High variability: Many query permutations (100+ filter combinations)
    - Performance-critical: Query performance directly impacts user experience
    - Consequence of failure: Incorrect results (compliance risk)
  riskProfile: High

integrity:
  threshold: 85%
  rationale: |
    - Security-critical: Tamper-evidence, hash chain validation, signature verification
    - Zero-tolerance: Any integrity failure undermines entire audit trail
    - Cryptographic complexity: Hash algorithms, Merkle trees, digital signatures
    - Consequence of failure: Audit trail integrity compromised (catastrophic)
  riskProfile: Critical

export:
  threshold: 70%
  rationale: |
    - I/O-heavy: File generation, streaming, Blob uploads
    - External dependencies: PDF libraries, CSV serialization
    - Lower complexity: Mostly data transformation and serialization
    - Consequence of failure: Export fails (retryable, not data loss)
  riskProfile: Medium

gateway:
  threshold: 65%
  rationale: |
    - API routing: Minimal business logic, mostly orchestration
    - E2E coverage: Tested via E2E tests rather than unit tests
    - Thin layer: Delegates to downstream services
    - Consequence of failure: Request routing error (visible, fast feedback)
  riskProfile: Low
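Enforcement of these thresholds ultimately reduces to a numeric comparison against the Cobertura report. A minimal sketch, assuming the per-service values from the matrix above and a standard Cobertura file (the helper name and demo file path are illustrative):

```shell
#!/bin/bash
# check-coverage.sh: compare a Cobertura line-rate against the per-service
# thresholds from the matrix above. Helper name and demo file are illustrative.

check_coverage() {
  local service=$1 report=$2 threshold
  case "$service" in
    ingestion)     threshold=75 ;;
    query|policy)  threshold=80 ;;
    integrity)     threshold=85 ;;
    export|search) threshold=70 ;;
    gateway)       threshold=65 ;;
    *) echo "Unknown service: $service" >&2; return 2 ;;
  esac

  # Cobertura stores line-rate as a 0..1 fraction on the root <coverage> element
  local rate coverage
  rate=$(grep -o 'line-rate="[0-9.]*"' "$report" | head -1 | cut -d'"' -f2)
  coverage=$(awk -v r="$rate" 'BEGIN { printf "%.1f", r * 100 }')

  if awk -v c="$coverage" -v t="$threshold" 'BEGIN { exit !(c < t) }'; then
    echo "❌ $service coverage ${coverage}% is below threshold ${threshold}%"
    return 1
  fi
  echo "✅ $service coverage ${coverage}% meets threshold ${threshold}%"
}

# Demo against a fabricated report
printf '<coverage line-rate="0.78" branch-rate="0.66"></coverage>\n' > /tmp/coverage.cobertura.xml
check_coverage integrity /tmp/coverage.cobertura.xml  # 78.0% < 85%: fails
check_coverage ingestion /tmp/coverage.cobertura.xml  # 78.0% >= 75%: passes
```

In the real pipeline this comparison is done by the BuildQualityChecks task (see below); the sketch only makes the decision rule explicit.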

Test Execution & Coverage Collection

Test Execution Pipeline (Azure Pipelines):

# Test Coverage Gate
- task: DotNetCoreCLI@2
  inputs:
    command: 'test'
    projects: '**/*Tests.csproj'
    arguments: >
      --configuration Release
      --no-build
      --collect:"XPlat Code Coverage"
      --settings:CodeCoverage.runsettings
      --logger:"trx;LogFileName=TestResults.trx"
      --
      DataCollectionRunSettings.DataCollectors.DataCollector.Configuration.Format=cobertura
  displayName: 'Run Tests with Coverage'

  # Fail on any test failure
  continueOnError: false

# Publish test results (TRX), even when the test step fails
- task: PublishTestResults@2
  inputs:
    testResultsFormat: 'VSTest'
    testResultsFiles: '**/TestResults.trx'
  condition: always()
  displayName: 'Publish Test Results'

# Publish Coverage Results
- task: PublishCodeCoverageResults@1
  inputs:
    codeCoverageTool: 'Cobertura'
    summaryFileLocation: '$(Agent.TempDirectory)/**/coverage.cobertura.xml'
    reportDirectory: '$(Agent.TempDirectory)/coverage-report'
    pathToSources: '$(Build.SourcesDirectory)/src'
  displayName: 'Publish Coverage Report'

# Enforce Coverage Threshold
- task: BuildQualityChecks@8
  inputs:
    checkCoverage: true
    coverageFailOption: 'fixed'
    coverageType: 'lines'
    coverageThreshold: 70  # ATP baseline (overridden per service)
    coverageVariance: 0    # No tolerance for coverage drops
    baseBranchRef: 'refs/heads/main'
    treatBuildWarningsAsErrors: true
    baselineEnabled: true
    baselineType: 'previous'
  displayName: 'Enforce Coverage Threshold'

CodeCoverage.runsettings (Configuration):

<?xml version="1.0" encoding="utf-8"?>
<RunSettings>
  <DataCollectionRunSettings>
    <DataCollectors>
      <DataCollector friendlyName="XPlat code coverage">
        <Configuration>
          <Format>cobertura,opencover</Format>
          <Exclude>[*.Tests]*,[*]*.Migrations.*,[*]*.Program,[*]*.Startup</Exclude>
          <ExcludeByAttribute>Obsolete,GeneratedCode,CompilerGenerated</ExcludeByAttribute>
          <ExcludeByFile>**/*Designer.cs,**/obj/**,**/bin/**</ExcludeByFile>
          <IncludeDirectory>src/</IncludeDirectory>
          <SingleHit>false</SingleHit>
          <UseSourceLink>true</UseSourceLink>
          <IncludeTestAssembly>false</IncludeTestAssembly>
          <SkipAutoProps>true</SkipAutoProps>
        </Configuration>
      </DataCollector>
    </DataCollectors>
  </DataCollectionRunSettings>

  <RunConfiguration>
    <MaxCpuCount>0</MaxCpuCount>  <!-- Use all available CPUs -->
    <ResultsDirectory>./TestResults</ResultsDirectory>
    <TestSessionTimeout>600000</TestSessionTimeout>  <!-- 10 minutes -->
  </RunConfiguration>

  <MSTest>
    <Parallelize>
      <Workers>0</Workers>  <!-- Auto-detect based on CPU cores -->
      <Scope>ClassLevel</Scope>
    </Parallelize>
  </MSTest>
</RunSettings>

Baseline Protection

Purpose: Prevent coverage regression by comparing current coverage to previous builds and failing if coverage drops.

Mechanism: Azure DevOps Build Quality Checks task tracks coverage per build and fails if coverage decreases.

Configuration:

# Baseline Protection Configuration
- task: BuildQualityChecks@8
  inputs:
    checkCoverage: true
    coverageFailOption: 'fixed'  # Fixed threshold (not dynamic)
    coverageType: 'lines'
    coverageThreshold: $(coverageThreshold)  # Service-specific variable
    coverageVariance: 0  # Zero tolerance for coverage drops

    # Baseline Comparison
    baselineEnabled: true
    baselineType: 'previous'  # Compare to previous build
    baseBranchRef: 'refs/heads/main'

    # Include/Exclude Filters
    includePartiallySucceeded: false
    treatBuildWarningsAsErrors: true

    # Failure Behavior
    failTaskOnBaselineViolation: true
    createBuildIssue: true  # Create work item for coverage drop
  displayName: 'Baseline Protection: Enforce Coverage'

Baseline Scenarios:

| Scenario | Previous Coverage | Current Coverage | Variance | Result | Action |
|----------|-------------------|------------------|----------|--------|--------|
| Coverage Maintained | 75.0% | 75.2% | +0.2% | ✅ Pass | None |
| Coverage Improved | 75.0% | 78.5% | +3.5% | ✅ Pass | Celebrate! |
| Coverage Dropped (Minor) | 75.0% | 74.8% | -0.2% | ❌ Fail | Add tests for new code |
| Coverage Dropped (Major) | 75.0% | 68.0% | -7.0% | ❌ Fail | Investigate untested code; may require new baseline |
| First Build | N/A | 72.0% | N/A | ✅ Pass | Establishes baseline |
| Refactoring | 75.0% | 65.0% (new baseline) | -10.0% | ⚠️ Conditional | Requires Force New Baseline approval |
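These scenarios reduce to two checks: the absolute threshold must hold, and current coverage may not fall more than the allowed variance below the previous build. A minimal sketch of that decision logic (the helper name and sample values are illustrative, not part of the BuildQualityChecks task):

```shell
# Sketch of the baseline-protection decision.
# Usage: baseline_check PREVIOUS CURRENT THRESHOLD VARIANCE
baseline_check() {
  local prev=$1 curr=$2 threshold=$3 variance=$4

  # Rule 1: the absolute threshold must always hold
  if awk -v c="$curr" -v t="$threshold" 'BEGIN { exit !(c < t) }'; then
    echo "FAIL: ${curr}% below threshold ${threshold}%"
    return 1
  fi

  # Rule 2: no drop below the previous build beyond the allowed variance
  if awk -v c="$curr" -v p="$prev" -v v="$variance" 'BEGIN { exit !(c < p - v) }'; then
    echo "FAIL: regression from ${prev}% to ${curr}% (variance ${variance}%)"
    return 1
  fi

  echo "PASS: ${curr}% (previous ${prev}%, threshold ${threshold}%)"
}

baseline_check 75.0 74.8 70 0   # zero variance: the -0.2% drop fails
baseline_check 75.0 75.2 70 0   # coverage maintained: passes
```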

Coverage Drop Notification:

// Custom coverage drop detector
public class CoverageRegressionDetector
{
    public async Task<CoverageRegressionResult> DetectRegressionAsync(
        string buildId, 
        double currentCoverage,
        double threshold)
    {
        // Get previous build coverage
        var previousBuild = await GetPreviousBuildAsync(buildId);
        var previousCoverage = previousBuild?.Coverage ?? 0;

        var regression = new CoverageRegressionResult
        {
            CurrentCoverage = currentCoverage,
            PreviousCoverage = previousCoverage,
            Delta = currentCoverage - previousCoverage,
            Threshold = threshold,
            PassesThreshold = currentCoverage >= threshold,
            HasRegression = currentCoverage < previousCoverage
        };

        if (regression.HasRegression)
        {
            // Create Azure DevOps work item
            await CreateCoverageRegressionWorkItemAsync(new WorkItem
            {
                Title = $"Coverage Regression Detected: {regression.Delta:F1}% drop in Build {buildId}",
                Description = $@"
Code coverage dropped from {previousCoverage:F1}% to {currentCoverage:F1}% (delta: {regression.Delta:F1}%).

**Previous Build**: {previousBuild.BuildNumber} ({previousBuild.CommitSha})
**Current Build**: {buildId}

**Uncovered Code**:
{await GetUncoveredCodeSummaryAsync(buildId)}

**Action Required**:
1. Review uncovered code in coverage report
2. Add unit tests for critical paths
3. Re-run build to validate coverage improvement

**Coverage Report**: [View Report]({GetCoverageReportUrl(buildId)})
",
                AssignedTo = previousBuild.Requester,
                Priority = regression.Delta > 5 ? 1 : 2,  // P1 if >5% drop
                Tags = new[] { "coverage-regression", "quality-gate", "test-coverage" }
            });
        }

        return regression;
    }
}

Force New Baseline

Purpose: Allow intentional coverage drops after major refactoring or architecture changes, with proper approval and documentation.

When to Use:

  1. Major Refactoring: Large code deletions or restructuring (e.g., removing deprecated code)
  2. Architecture Changes: Moving logic between services (coverage shifts from one service to another)
  3. Test Cleanup: Removing obsolete tests after feature removal
  4. Coverage Calculation Changes: Updating .runsettings exclusions or analyzers

Approval Workflow:

stateDiagram-v2
    [*] --> Requested: Engineer triggers Force New Baseline
    Requested --> TechnicalReview: Create ADR documenting change

    TechnicalReview --> ArchitectApproval: Tech Lead approves justification
    TechnicalReview --> Rejected: Insufficient justification

    ArchitectApproval --> Approved: Lead Architect approves
    ArchitectApproval --> Rejected: Coverage drop unjustified

    Approved --> BaselineCreated: Set BQC.ForceNewBaseline=true
    BaselineCreated --> Validated: Monitor next 3 builds

    Validated --> [*]: New baseline established
    Rejected --> [*]: Use existing baseline

Procedure:

# Step 1: Create Architecture Decision Record (ADR)
# File: adrs/adr-NNN-force-new-coverage-baseline.md

---
title: ADR-042: Force New Coverage Baseline for Query Service Refactoring
status: Accepted
date: 2025-01-15
decision-makers: Lead Architect, Tech Lead
consulted: QA Lead, SRE Team
informed: Development Team
---

## Context
Query service underwent major refactoring to separate read/write paths (CQRS).
~30% of code moved to new QueryRead project, causing coverage to drop from 80% to 62% in QueryWrite project.

## Decision
Force new baseline at 62% for QueryWrite service, with commitment to raise to 75% within Q1 2025.

## Consequences
- Coverage threshold lowered temporarily (62% for QueryWrite)
- Baseline protection disabled for 1 build
- Monitoring for 3 builds to ensure stability
- Action plan: Add 50+ unit tests for QueryWrite within 2 sprints

## Approval
- Lead Architect: ✅ Approved (2025-01-15)
- Tech Lead: ✅ Approved (2025-01-15)
- QA Lead: ✅ Consulted (2025-01-14)
# Step 2: Set Pipeline Variable (Azure DevOps)
# UI: Pipelines → Edit → Variables → Add Variable

variableName: BQC.ForceNewBaseline
value: true
scope: Single build  # Reset to false after baseline created
# Step 3: Update BuildQualityChecks Task
- task: BuildQualityChecks@8
  inputs:
    checkCoverage: true
    coverageThreshold: 62  # NEW BASELINE (was 80%)
    baselineEnabled: true

    # Conditionally pick the baseline type at template-expansion time
    ${{ if eq(variables['BQC.ForceNewBaseline'], 'true') }}:
      baselineType: 'current'   # Use current build as the new baseline
    ${{ else }}:
      baselineType: 'previous'  # Compare to the previous build
  displayName: 'Enforce Coverage with Baseline'
# Step 4: Monitor Next 3 Builds
#!/bin/bash
# validate-new-baseline.sh <project>
# Checks the three most recent builds against the new 62% baseline.

ORG="https://dev.azure.com/ConnectSoft"
PROJECT=$1
EXPECTED_COVERAGE=62.0

for i in {1..3}; do
  # Most recent run in the project (requires the azure-devops CLI extension)
  BUILD_ID=$(az pipelines runs list --org "$ORG" --project "$PROJECT" \
    --top 1 --query "[0].id" -o tsv)
  echo "Validating build $BUILD_ID (attempt $i/3)..."

  # Line coverage via the Code Coverage Summary REST API; adjust the jq path
  # if the response shape differs in your organization
  COVERAGE=$(curl -s -u ":$AZURE_DEVOPS_PAT" \
    "$ORG/$PROJECT/_apis/test/codecoverage?buildId=$BUILD_ID&api-version=7.1" |
    jq -r '.coverageData[0].coverageStats[] | select(.label == "Lines") | 100 * .covered / .total')

  if (( $(echo "$COVERAGE < $EXPECTED_COVERAGE" | bc -l) )); then
    echo "❌ Coverage dropped below new baseline: $COVERAGE% < $EXPECTED_COVERAGE%"
    exit 1
  fi

  echo "✅ Build $BUILD_ID coverage: $COVERAGE% (baseline: $EXPECTED_COVERAGE%)"

  # Wait for the next build
  sleep 3600  # 1 hour
done

echo "✅ New baseline validated over 3 builds"

Test Quality Metrics

Beyond Coverage Percentage: ATP tracks test quality metrics to ensure tests are effective, not just numerous.

Test Quality Scorecard

| Metric | Target | Measurement | Blocker | Purpose |
|--------|--------|-------------|---------|---------|
| Test Pass Rate | 100% | Count(Passed) / Count(Total) | ✅ Yes | All tests must pass; no flaky tolerance |
| Test Duration | Unit <30s, Integration <5min | Execution time per test category | ⚠️ Warning | Fast feedback; slow tests indicate issues |
| Flaky Test Rate | <5% | Tests with <95% historical pass rate | ⚠️ Warning | Flaky tests erode confidence |
| Assertion Density | ≥1.5 per test | Count(Assertions) / Count(Tests) | ℹ️ Info | Ensure tests actually validate behavior |
| Quarantined Tests | ≤3 per service | Tests marked with [Ignore] or [Fact(Skip=)] | ⚠️ Warning | Quarantined tests must be fixed or removed |
| Test Coverage on New Code | 100% | Coverage on changed lines | ✅ Yes | All new code must be tested |
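As a sketch of how the scorecard values are derived, each metric reduces to a simple ratio over per-build tallies (all counts below are fabricated for illustration):

```shell
# Sketch: scorecard metrics as ratios over per-build tallies.
# All counts below are fabricated for illustration.
total_tests=200
passed=200
assertions=340       # total assertion count across the suite
flaky=7              # tests with <95% historical pass rate
quarantined=2        # tests marked [Ignore] / [Fact(Skip=...)]

pass_rate=$(awk -v p="$passed" -v t="$total_tests" 'BEGIN { printf "%.1f", p / t * 100 }')
flaky_rate=$(awk -v f="$flaky" -v t="$total_tests" 'BEGIN { printf "%.1f", f / t * 100 }')
density=$(awk -v a="$assertions" -v t="$total_tests" 'BEGIN { printf "%.2f", a / t }')

echo "Pass rate:         ${pass_rate}% (target: 100%)"    # 100.0
echo "Flaky rate:        ${flaky_rate}% (target: <5%)"    # 3.5
echo "Assertion density: ${density} (target: >=1.5)"      # 1.70
echo "Quarantined:       ${quarantined} (target: <=3)"
```

Note that a 3.5% flaky rate passes the <5% gate but would still generate work items via the daily flaky-test scan described later in this section.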

Test Pass Rate (100% Required)

Threshold: 100% — Every test must pass; no tolerance for failures.

Enforcement:

# Test execution fails on first test failure
- task: DotNetCoreCLI@2
  inputs:
    command: 'test'
    arguments: '--no-build --logger trx --blame-hang-timeout 5m'
  displayName: 'Run Unit Tests'
  continueOnError: false  # Fail immediately on test failure

Common Test Failures:

| Failure Type | Symptom | Root Cause | Remediation |
|--------------|---------|------------|-------------|
| Assertion Failure | Expected: 200, Actual: 500 | Business logic bug, incorrect test expectation | Fix code or update test assertion |
| Null Reference | NullReferenceException | Missing null checks, incomplete mocking | Add null checks, fix mock setup |
| Timeout | Test exceeds 5-minute limit | Deadlock, infinite loop, external dependency | Add timeout, fix async code, mock external calls |
| Flaky Test | Passes sometimes, fails sometimes | Race condition, timing dependency, shared state | Fix concurrency issues, isolate test state |
| Dependency Failure | SQL/Redis connection failed | Service container not started | Verify the services: block in the pipeline; check container health |

Test Failure Notification:

// Emit test failure event for alerting
public class TestFailureNotifier : ITestExecutionListener
{
    public void OnTestFailed(TestResult result)
    {
        var telemetry = new EventTelemetry("TestFailed");
        telemetry.Properties["TestName"] = result.TestName;
        telemetry.Properties["FailureReason"] = result.ErrorMessage;
        telemetry.Properties["StackTrace"] = result.ErrorStackTrace;
        telemetry.Properties["Duration"] = result.Duration.ToString();
        telemetry.Properties["BuildId"] = Environment.GetEnvironmentVariable("BUILD_BUILDID");

        _telemetryClient.TrackEvent(telemetry);

        // Alert on P0 test failures (security, integrity tests)
        if (result.Categories.Contains("Security") || result.Categories.Contains("Integrity"))
        {
            _alertService.SendPagerDutyAlert(
                severity: "high",
                title: $"Critical Test Failed: {result.TestName}",
                description: result.ErrorMessage);
        }
    }
}

Test Duration Thresholds

Purpose: Ensure tests provide fast feedback; slow tests indicate design issues (e.g., integration tests disguised as unit tests).

Thresholds:

| Test Category | Max Duration (Per Test) | Max Suite Duration | Enforcement |
|---------------|-------------------------|--------------------|-------------|
| Unit Tests | 100ms | 30 seconds | ⚠️ Warning if exceeded |
| Integration Tests | 5 seconds | 5 minutes | ⚠️ Warning if exceeded |
| E2E Tests | 30 seconds | 15 minutes | ℹ️ Info (E2E expected to be slower) |

Slow Test Detection:

// Detect slow tests during execution
[AttributeUsage(AttributeTargets.Method)]
public class PerformanceTestAttribute : FactAttribute
{
    // Duration budget; default 100ms for unit tests
    public int MaxDurationMs { get; set; } = 100;

    // Note: the attribute alone does not enforce the budget. A custom xUnit
    // test-case runner must measure execution time and fail tests that exceed it.
}

// Usage
[PerformanceTest(MaxDurationMs = 100)]
public async Task Should_Validate_Event_Within_100ms()
{
    var validator = new AuditEventValidator();
    var evt = CreateValidEvent();

    var result = await validator.ValidateAsync(evt);  // Must complete in <100ms

    Assert.True(result.IsValid);
}

Slow Test Report (Azure Pipelines):

#!/bin/bash
# detect-slow-tests.sh
# TRX durations use the hh:mm:ss.fffffff format; flag unit tests over 100 ms.

THRESHOLD_SECONDS=0.1
TRX_NS="http://microsoft.com/schemas/VisualStudio/TeamTest/2010"

find . -name "TestResults.trx" -type f | while read -r TRX; do
  # Emit "testName|duration" pairs for every result in the TRX file
  xmlstarlet sel -N t="$TRX_NS" -t -m '//t:UnitTestResult' \
    -v '@testName' -o '|' -v '@duration' -n "$TRX" |
  while IFS='|' read -r TEST_NAME DURATION; do
    SECONDS_TOTAL=$(echo "$DURATION" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }')

    if (( $(echo "$SECONDS_TOTAL > $THRESHOLD_SECONDS" | bc -l) )); then
      echo "⚠️ Slow Test Detected: $TEST_NAME (Duration: $DURATION)"

      # Create work item for slow test optimization
      az boards work-item create \
        --title "Slow Test: $TEST_NAME" \
        --type "Task" \
        --description "Test duration: $DURATION (threshold: 100ms). Optimize or reclassify as an integration test." \
        --assigned-to "qa-team@connectsoft.example" \
        --fields "Microsoft.VSTS.Common.Priority=3"
    fi
  done
done

Flaky Test Detection

Purpose: Identify unreliable tests that pass/fail intermittently, eroding confidence in the test suite.

Threshold: Tests with <95% historical pass rate are flagged as flaky.

Detection Mechanism:

// Flaky test analyzer (Azure Function)
[FunctionName("DetectFlakyTests")]
public async Task RunAsync(
    [TimerTrigger("0 0 2 * * *")] TimerInfo timer,  // Daily at 2 AM
    ILogger log)
{
    log.LogInformation("Analyzing test results for flaky tests...");

    var last30Days = DateTime.UtcNow.AddDays(-30);

    // Query Azure DevOps Test Analytics
    var testRuns = await _devOpsClient.GetTestRunsAsync(
        project: "ConnectSoft",
        minLastUpdatedDate: last30Days);

    var flakyTests = new List<FlakyTestResult>();

    foreach (var testRun in testRuns)
    {
        var results = await _devOpsClient.GetTestResultsAsync(testRun.Id);

        var testStats = results
            .GroupBy(r => r.TestCaseTitle)
            .Select(g => new
            {
                TestName = g.Key,
                TotalRuns = g.Count(),
                PassedRuns = g.Count(r => r.Outcome == "Passed"),
                PassRate = g.Count(r => r.Outcome == "Passed") / (double)g.Count()
            })
            .Where(t => t.PassRate < 0.95 && t.TotalRuns >= 5)  // Flaky: <95% pass, min 5 runs
            .ToList();

        flakyTests.AddRange(testStats.Select(s => new FlakyTestResult
        {
            TestName = s.TestName,
            PassRate = s.PassRate,
            TotalRuns = s.TotalRuns,
            FailureCount = s.TotalRuns - s.PassedRuns
        }));
    }

    if (flakyTests.Any())
    {
        log.LogWarning($"Detected {flakyTests.Count} flaky tests");

        // Create work item for each flaky test
        foreach (var flaky in flakyTests)
        {
            await _devOpsClient.CreateWorkItemAsync(new
            {
                Fields = new Dictionary<string, object>
                {
                    ["System.Title"] = $"Flaky Test: {flaky.TestName}",
                    ["System.WorkItemType"] = "Bug",
                    ["System.Description"] = $@"
Test has {flaky.PassRate:P0} pass rate over {flaky.TotalRuns} runs (threshold: 95%).

**Failure Count**: {flaky.FailureCount}
**Pass Rate**: {flaky.PassRate:P1}

**Action Required**:
1. Investigate test for race conditions, timing dependencies, shared state
2. Fix root cause or quarantine test (mark with [Ignore])
3. Re-enable after fix and validate 100% pass rate over 10 runs
",
                    ["System.Tags"] = "flaky-test; test-quality",
                    ["Microsoft.VSTS.Common.Priority"] = 2
                }
            });
        }

        // Send summary to QA team
        await SendFlakyTestReportAsync(flakyTests);
    }
    else
    {
        log.LogInformation("✅ No flaky tests detected");
    }
}

Flaky Test Quarantine:

// Quarantine flaky test until fixed
[Fact(Skip = "Flaky: Timing-dependent; see work item #12345")]
public async Task Should_Process_Event_Concurrently()
{
    // Test disabled until race condition fixed
}

// OR: Mark with custom attribute for reporting
[Fact]
[Trait("Category", "Flaky")]
[Trait("WorkItem", "12345")]
public async Task Should_Process_Event_Concurrently()
{
    // Test runs but tracked as flaky
}

Coverage Exclusions

Purpose: Exclude auto-generated code, third-party code, and infrastructure code from coverage calculations to focus on business logic.

Exclusion Categories:

<!-- CodeCoverage.runsettings -->
<Configuration>
  <!-- Exclude by Assembly Name -->
  <Exclude>
    [*.Tests]*,                    <!-- All test assemblies -->
    [*]*.Migrations.*,             <!-- EF Core migrations -->
    [*]*.Program,                  <!-- Program.cs entry point -->
    [*]*.Startup,                  <!-- Startup.cs DI config -->
    [xunit.*]*,                    <!-- xUnit framework -->
    [Moq]*                         <!-- Moq mocking framework -->
  </Exclude>

  <!-- Exclude by Attribute -->
  <ExcludeByAttribute>
    Obsolete,                      <!-- Deprecated code -->
    GeneratedCode,                 <!-- Auto-generated (T4, Swagger) -->
    CompilerGenerated,             <!-- Compiler-generated (closures) -->
    ExcludeFromCodeCoverage        <!-- Explicitly excluded -->
  </ExcludeByAttribute>

  <!-- Exclude by File Pattern -->
  <ExcludeByFile>
    **/*Designer.cs,               <!-- WinForms/WPF designers -->
    **/obj/**,                     <!-- Build output -->
    **/bin/**,                     <!-- Build output -->
    **/Migrations/**,              <!-- EF migrations -->
    **/*.Generated.cs,             <!-- Generated files -->
    **/GlobalUsings.cs             <!-- Global usings (C# 10+) -->
  </ExcludeByFile>

  <!-- Include Only Source Directories -->
  <IncludeDirectory>src/</IncludeDirectory>

  <!-- Skip Auto-Properties (getters/setters) -->
  <SkipAutoProps>true</SkipAutoProps>
</Configuration>

Explicit Exclusion (via Attribute):

// Exclude infrastructure code from coverage
[ExcludeFromCodeCoverage]
public class ApplicationDbContextFactory : IDesignTimeDbContextFactory<ApplicationDbContext>
{
    // Design-time factory for EF migrations (not covered by tests)
    public ApplicationDbContext CreateDbContext(string[] args)
    {
        var optionsBuilder = new DbContextOptionsBuilder<ApplicationDbContext>();
        optionsBuilder.UseSqlServer("Server=(localdb)\\mssqllocaldb;Database=DesignTime;");
        return new ApplicationDbContext(optionsBuilder.Options);
    }
}

// Exclude obsolete code from coverage (scheduled for removal)
[Obsolete("Use ProcessEventV2Async instead")]
[ExcludeFromCodeCoverage]
public async Task ProcessEventAsync(AuditEvent evt)
{
    // Legacy method; coverage not enforced
}

Test Organization & Naming

Purpose: Consistent test organization and naming improve discoverability, maintainability, and coverage analysis.

Test Project Structure:

ConnectSoft.ATP.Ingestion.Tests/
├── Unit/
│   ├── Controllers/
│   │   ├── AuditEventsControllerTests.cs
│   │   └── HealthControllerTests.cs
│   ├── Services/
│   │   ├── EventValidationServiceTests.cs
│   │   └── TenantIsolationServiceTests.cs
│   ├── Validators/
│   │   └── AuditEventValidatorTests.cs
│   └── Models/
│       └── AuditEventTests.cs
├── Integration/
│   ├── Repositories/
│   │   ├── AuditEventRepositoryTests.cs  # Requires SQL container
│   │   └── CacheRepositoryTests.cs       # Requires Redis container
│   ├── MessageBus/
│   │   └── EventPublisherTests.cs        # Requires RabbitMQ container
│   └── EndToEnd/
│       └── IngestionWorkflowTests.cs     # Full workflow (API → DB → Bus)
├── TestHelpers/
│   ├── Builders/
│   │   └── AuditEventBuilder.cs          # Test data builder
│   ├── Fixtures/
│   │   └── DatabaseFixture.cs            # Shared test fixtures
│   └── Mocks/
│       └── MockTimeProvider.cs           # Time abstraction mock
└── CodeCoverage.runsettings

Test Naming Convention:

// Pattern: Should_ExpectedBehavior_When_StateUnderTest

// ✅ GOOD: Clear, descriptive test names
public class AuditEventValidatorTests
{
    [Fact]
    public void Should_ReturnValid_When_EventHasAllRequiredFields()
    {
        // Arrange
        var evt = new AuditEventBuilder()
            .WithTenantId(Guid.NewGuid())
            .WithAction("UserLogin")
            .WithTimestamp(DateTime.UtcNow)
            .Build();

        var validator = new AuditEventValidator();

        // Act
        var result = validator.Validate(evt);

        // Assert
        Assert.True(result.IsValid);
        Assert.Empty(result.Errors);
    }

    [Fact]
    public void Should_ReturnInvalid_When_TenantIdIsMissing()
    {
        // Arrange
        var evt = new AuditEventBuilder()
            .WithAction("UserLogin")
            .WithTimestamp(DateTime.UtcNow)
            .Build();  // Missing TenantId

        var validator = new AuditEventValidator();

        // Act
        var result = validator.Validate(evt);

        // Assert
        Assert.False(result.IsValid);
        Assert.Contains(result.Errors, e => e.PropertyName == "TenantId");
    }

    [Theory]
    [InlineData(null)]
    [InlineData("")]
    [InlineData("   ")]
    public void Should_ReturnInvalid_When_ActionIsNullOrWhitespace(string action)
    {
        // Arrange
        var evt = new AuditEventBuilder()
            .WithTenantId(Guid.NewGuid())
            .WithAction(action)
            .WithTimestamp(DateTime.UtcNow)
            .Build();

        var validator = new AuditEventValidator();

        // Act
        var result = validator.Validate(evt);

        // Assert
        Assert.False(result.IsValid);
        Assert.Contains(result.Errors, e => e.PropertyName == "Action");
    }
}

// ❌ BAD: Unclear test names
public class AuditEventValidatorTests
{
    [Fact]
    public void Test1()  // What does this test?
    {
        var validator = new AuditEventValidator();
        var result = validator.Validate(new AuditEvent());
        Assert.False(result.IsValid);
    }
}

Coverage Report Analysis

Purpose: Provide actionable insights into uncovered code to guide test authoring.

Coverage Report Formats:

# Generate multiple coverage formats
- task: reportgenerator@5
  inputs:
    reports: '$(Agent.TempDirectory)/**/coverage.cobertura.xml'
    targetdir: '$(Build.ArtifactStagingDirectory)/coverage-report'
    reporttypes: 'HtmlInline_AzurePipelines;Cobertura;Badges;MarkdownSummary'
  displayName: 'Generate Coverage Report'

# Publish as Azure DevOps artifact
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)/coverage-report'
    ArtifactName: 'code-coverage-$(Build.BuildNumber)'
  displayName: 'Publish Coverage Report'

Coverage Report Summary (Markdown):

# Code Coverage Summary

**Build**: 1.0.123  
**Date**: 2025-01-15 14:30:00 UTC  
**Branch**: main

---

## Overall Coverage

| Metric | Value | Threshold | Status |
|--------|-------|-----------|--------|
| Line Coverage | 73.5% | ≥70% | ✅ Pass |
| Branch Coverage | 64.2% | ≥60% | ✅ Pass |
| Method Coverage | 78.1% | — | ℹ️ Info |

---

## Coverage by Project

| Project | Line Coverage | Branch Coverage | Uncovered Lines |
|---------|---------------|-----------------|-----------------|
| Ingestion.API | 76.3% (↑1.2%) | 67.1% | 234 / 988 |
| Ingestion.Domain | 82.1% (↑0.5%) | 74.3% | 89 / 497 |
| Ingestion.Infrastructure | 65.4% (↓2.1%) ⚠️ | 58.9% | 412 / 1192 |

---

## Uncovered Code (Top 5 Files)

1. **EventRepository.cs** (48.2% covered)
   - Lines 45-89: Delete methods (no tests)
   - Lines 123-156: Bulk insert (no tests)
   - **Action**: Add integration tests for delete/bulk operations

2. **TenantIsolationService.cs** (55.7% covered)
   - Lines 34-67: Edge case validation (no tests)
   - **Action**: Add unit tests for edge cases

3. **CacheInvalidationService.cs** (61.3% covered)
   - Lines 12-45: Redis connection retry logic (no tests)
   - **Action**: Add integration tests with Redis failures

4. **EventSerializer.cs** (68.9% covered)
   - Lines 78-102: Error handling paths (no tests)
   - **Action**: Add tests for serialization errors

5. **HealthCheckService.cs** (72.4% covered)
   - Lines 56-89: Dependency health checks (no tests)
   - **Action**: Add tests for dependency failures

Per-Service Coverage Configuration

Purpose: Apply service-specific thresholds via pipeline variables to enforce different coverage requirements.

Azure DevOps Variable Groups:

# Variable Group: ATP-Coverage-Thresholds
variables:
  - name: Coverage.Ingestion.Line
    value: 75
  - name: Coverage.Ingestion.Branch
    value: 65

  - name: Coverage.Query.Line
    value: 80
  - name: Coverage.Query.Branch
    value: 70

  - name: Coverage.Integrity.Line
    value: 85
  - name: Coverage.Integrity.Branch
    value: 75

  - name: Coverage.Export.Line
    value: 70
  - name: Coverage.Export.Branch
    value: 60

  - name: Coverage.Policy.Line
    value: 80
  - name: Coverage.Policy.Branch
    value: 70

  - name: Coverage.Search.Line
    value: 70
  - name: Coverage.Search.Branch
    value: 60

  - name: Coverage.Gateway.Line
    value: 65
  - name: Coverage.Gateway.Branch
    value: 55

Pipeline Variable Usage:

# azure-pipelines.yml (per service)
variables:
  - group: ATP-Coverage-Thresholds
  - name: coverageThreshold
    value: $[variables['Coverage.Ingestion.Line']]  # Service-specific

- task: BuildQualityChecks@8
  inputs:
    checkCoverage: true
    coverageThreshold: $(coverageThreshold)  # Uses service-specific value
  displayName: 'Enforce Coverage: $(coverageThreshold)%'

Summary

  • Test Coverage Gates: Execute after successful build; 3-5 minute duration; 100% test pass rate required
  • Service-Specific Thresholds: Ingestion (75%), Query (80%), Integrity (85%), Export (70%), Policy (80%), Search (70%), Gateway (65%)
  • Threshold Rationale: Based on service criticality, complexity, and risk profile (Critical > High > Medium > Low)
  • Baseline Protection: Prevents coverage regression by comparing to previous builds; zero tolerance for coverage drops
  • Force New Baseline: Requires ADR documentation, Lead Architect approval, monitored over 3 builds
  • Test Quality Metrics: Pass rate (100%), duration (unit <30s, integration <5min), flaky rate (<5%), assertion density (≥1.5)
  • Flaky Test Detection: Daily automated scan flagging tests with <95% historical pass rate; work items created automatically
  • Coverage Exclusions: Auto-generated code, migrations, Program.cs, test assemblies excluded via .runsettings
  • Test Organization: Unit/Integration/E2E folder structure; Should_ExpectedBehavior_When_StateUnderTest naming
  • Coverage Reports: HTML, Cobertura XML, Markdown summary with uncovered code analysis
  • Per-Service Configuration: Azure DevOps variable groups for service-specific thresholds

Security Gates (Deep Dive)

Security gates are critical enforcement points that prevent vulnerable code, exposed secrets, and insecure dependencies from reaching production. ATP enforces zero tolerance for critical/high vulnerabilities and implements automated secret detection with mandatory rotation workflows.

Philosophy: Security is non-negotiable. ATP blocks builds with critical/high vulnerabilities, detected secrets, or insecure configurations. Every security finding is tracked, remediated, or formally risk-accepted with time-bound approvals.

Security Gate Workflow

graph TD
    A[Test Coverage Passed] --> B[Dependency Scanning]
    B --> C{Critical/High CVEs?}
    C -->|Yes| D[Dependency Scan Failed ❌]
    C -->|No| E[Secrets Detection]
    E --> F{Secrets Found?}
    F -->|Yes| G[Secrets Detected ❌]
    F -->|No| H[SAST Analysis]
    H --> I{Security Hotspots?}
    I -->|Yes| J[SAST Failed ❌]
    I -->|No| K[Container Scan]
    K --> L{Image Vulnerabilities?}
    L -->|Yes| M[Container Scan Failed ❌]
    L -->|No| N[License Compliance]
    N --> O{Incompatible Licenses?}
    O -->|Yes| P[License Violation ❌]
    O -->|No| Q[Security Gates Passed ✅]

    D --> R[Pipeline Stopped]
    G --> R
    J --> R
    M --> R
    P --> R
    Q --> S[Proceed to Compliance Gates]

    style D fill:#ff6b6b
    style G fill:#ff6b6b
    style J fill:#ff6b6b
    style M fill:#ff6b6b
    style P fill:#ff6b6b
    style Q fill:#90EE90

Typical Security Gate Duration: 5-8 minutes


Dependency Scanning (OWASP Dependency-Check)

Purpose: Detect vulnerable NuGet packages and transitive dependencies with known CVEs (Common Vulnerabilities and Exposures).

Tool: OWASP Dependency-Check — Open-source vulnerability scanner with NVD (National Vulnerability Database) integration

Threshold:

  • CVSS ≥9.0 (Critical): ❌ Block build immediately; fix within 24 hours
  • CVSS 7.0-8.9 (High): ❌ Block build; fix within 7 days or document risk acceptance
  • CVSS 4.0-6.9 (Medium): ⚠️ Warning; fix within 30 days
  • CVSS 0.1-3.9 (Low): ℹ️ Info; track in security backlog
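The threshold ladder above can be sketched as a small decision function (illustrative only; the real gate is enforced by the OWASP Dependency-Check pipeline task, not a script):

```python
def gate_action(cvss: float) -> str:
    """Map a CVSS v3 base score to the ATP gate outcome per the thresholds above."""
    if cvss >= 9.0:
        return "block"   # Critical: fail build, fix within 24 hours
    if cvss >= 7.0:
        return "block"   # High: fail build, fix within 7 days or risk-accept
    if cvss >= 4.0:
        return "warn"    # Medium: warning, fix within 30 days
    if cvss > 0.0:
        return "info"    # Low: track in security backlog
    return "none"

print(gate_action(9.8))  # block
print(gate_action(5.3))  # warn
```

This mirrors why `failOnCVSS: 7` in the task configuration below blocks both Critical and High findings with a single threshold.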

Azure Pipelines Configuration:

# OWASP Dependency-Check Gate
- task: dependency-check-build-task@6
  inputs:
    projectName: 'ConnectSoft.ATP.Ingestion'
    scanPath: '$(Build.SourcesDirectory)'
    format: 'HTML,JSON,XML'
    failOnCVSS: 7  # Block on High/Critical (CVSS ≥7)
    suppressionFile: 'dependency-check-suppressions.xml'

    # NVD API Configuration (faster updates)
    nvdApiKey: $(NVD_API_KEY)
    enableExperimental: false

    # Data directory (cache for faster scans)
    dataDirectory: '$(Pipeline.Workspace)/dependency-check-data'

    # Advanced options
    enableRetired: true  # Check retired dependencies
    warnOnCVSSViolation: true
  displayName: 'OWASP Dependency Scan'

  # Fail pipeline on critical/high vulnerabilities
  continueOnError: false

# Publish scan results
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.SourcesDirectory)/dependency-check-report.html'
    ArtifactName: 'dependency-check-$(Build.BuildNumber)'
  displayName: 'Publish Dependency Scan Report'
  condition: always()  # Publish even on failure

CVSS Severity Matrix:

| Severity | CVSS Score | ATP Action | SLA | Approval Required | Production Blocker |
|----------|------------|------------|-----|-------------------|--------------------|
| Critical | 9.0-10.0 | ❌ Block build immediately | Fix within 24h | None (must fix) | ✅ Yes |
| High | 7.0-8.9 | ❌ Block build; patch or risk-accept | Fix within 7 days | Security Officer | ✅ Yes |
| Medium | 4.0-6.9 | ⚠️ Warning; track in backlog | Fix within 30 days | Tech Lead | ❌ No (warning only) |
| Low | 0.1-3.9 | ℹ️ Info; track in backlog | Fix in next release | None | ❌ No |
| None | 0.0 | ℹ️ Info; no action | N/A | None | ❌ No |

Example Vulnerability Report:

// dependency-check-report.json (excerpt)
{
  "dependencies": [
    {
      "fileName": "System.Text.Json.dll",
      "filePath": "/usr/share/dotnet/shared/Microsoft.NETCore.App/8.0.0/System.Text.Json.dll",
      "sha256": "abc123...",
      "vulnerabilities": [
        {
          "name": "CVE-2024-12345",
          "severity": "CRITICAL",
          "cvssv3": {
            "baseScore": 9.8,
            "attackVector": "NETWORK",
            "attackComplexity": "LOW",
            "privilegesRequired": "NONE",
            "userInteraction": "NONE",
            "scope": "UNCHANGED",
            "confidentialityImpact": "HIGH",
            "integrityImpact": "HIGH",
            "availabilityImpact": "HIGH"
          },
          "description": "System.Text.Json deserialization vulnerability allows remote code execution",
          "references": [
            "https://nvd.nist.gov/vuln/detail/CVE-2024-12345",
            "https://github.com/dotnet/runtime/security/advisories/GHSA-xxxx-xxxx-xxxx"
          ]
        }
      ]
    }
  ]
}
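A post-scan check over a report like the excerpt above could look like this (hedged sketch; field names follow the excerpt, and the real enforcement is the task's `failOnCVSS` setting):

```python
import json

def blocking_findings(report_json: str) -> list[str]:
    """Return the Critical/High findings that would block the build."""
    report = json.loads(report_json)
    blocked = []
    for dep in report.get("dependencies", []):
        for vuln in dep.get("vulnerabilities", []):
            if vuln.get("severity", "").upper() in ("CRITICAL", "HIGH"):
                blocked.append(f'{dep["fileName"]}: {vuln["name"]}')
    return blocked

sample = '''{"dependencies": [{"fileName": "System.Text.Json.dll",
  "vulnerabilities": [{"name": "CVE-2024-12345", "severity": "CRITICAL"}]}]}'''
print(blocking_findings(sample))  # ['System.Text.Json.dll: CVE-2024-12345']
```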

Vulnerability Suppression Workflow

Purpose: Allow temporary exceptions for false positives or mitigated vulnerabilities with formal approval and time-bound expiration.

Suppression File (dependency-check-suppressions.xml):

<?xml version="1.0" encoding="UTF-8"?>
<suppressions xmlns="https://jeremylong.github.io/DependencyCheck/dependency-suppression.1.3.xsd">

  <!-- Example 1: False Positive -->
  <suppress>
    <notes>
      False positive: CVE-2023-12345 affects Linux builds only; ATP runs on Windows.
      Approved by: security-team@connectsoft.example
      Approval Date: 2025-01-10
      Expires: 2025-07-10 (6 months)
      Review Date: 2025-06-30
    </notes>
    <packageUrl regex="true">^pkg:nuget/Newtonsoft\.Json@12\.0\.3$</packageUrl>
    <cve>CVE-2023-12345</cve>
  </suppress>

  <!-- Example 2: Mitigated Risk -->
  <suppress>
    <notes>
      Risk Acceptance: CVE-2024-67890 in System.IdentityModel.Tokens.Jwt 6.x
      Mitigation: Input validation prevents exploit; upgrade blocked by breaking changes.
      Approved by: Lead Architect (John Doe), Security Officer (Jane Smith)
      Approval Date: 2025-01-15
      Expires: 2025-04-15 (3 months)
      Action Plan: Upgrade to 7.x in Q2 2025 (requires API changes)
    </notes>
    <packageUrl regex="true">^pkg:nuget/System\.IdentityModel\.Tokens\.Jwt@6\.\d+\.\d+$</packageUrl>
    <cve>CVE-2024-67890</cve>
  </suppress>

  <!-- Example 3: Vendor-Confirmed Fix -->
  <suppress until="2025-03-01">
    <notes>
      Vendor has confirmed fix in next release (March 2025).
      Workaround applied: Input sanitization before library call.
      Approved by: Security Officer
      Temporary suppression until vendor patch available.
    </notes>
    <packageUrl regex="true">^pkg:nuget/ThirdPartyLibrary@.*$</packageUrl>
    <cve>CVE-2024-11111</cve>
  </suppress>

</suppressions>
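Because suppressions are time-bound, it helps to audit the file for expired `until` dates so stale entries are removed rather than silently honored (Dependency-Check itself stops applying an `until` suppression after the date passes). A minimal sketch of such an audit, assuming the schema shown above:

```python
import xml.etree.ElementTree as ET
from datetime import date

def expired_suppressions(xml_text: str, today: date) -> list[str]:
    """List <suppress until="..."> entries whose expiry date has passed."""
    root = ET.fromstring(xml_text)
    expired = []
    for sup in root.findall("{*}suppress"):  # {*} matches any XML namespace
        until = sup.get("until")
        if until and date.fromisoformat(until) < today:
            cve = sup.findtext("{*}cve", default="(no CVE)")
            expired.append(f"{cve} expired {until}")
    return expired

sample = (
    '<suppressions xmlns="https://jeremylong.github.io/DependencyCheck/'
    'dependency-suppression.1.3.xsd">'
    '<suppress until="2025-03-01"><cve>CVE-2024-11111</cve></suppress>'
    '</suppressions>'
)
print(expired_suppressions(sample, date(2025, 4, 1)))  # ['CVE-2024-11111 expired 2025-03-01']
```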

Suppression Approval Process:

stateDiagram-v2
    [*] --> VulnerabilityDetected: OWASP scan finds CVE
    VulnerabilityDetected --> Triage: Security team investigates

    Triage --> FalsePositive: Not exploitable in ATP context
    Triage --> TruePositive: Legitimate vulnerability

    FalsePositive --> DocumentSuppression: Create suppression entry

    TruePositive --> PatchAvailable: Check for patch
    PatchAvailable --> ApplyPatch: Upgrade dependency
    PatchAvailable --> RiskAcceptance: No patch available

    RiskAcceptance --> MitigationExists: Evaluate controls
    MitigationExists --> DocumentSuppression: Mitigated; temporary suppression
    MitigationExists --> BlockBuild: No mitigation; must fix

    DocumentSuppression --> SecurityReview: Security Officer reviews
    SecurityReview --> Approved: Suppression approved
    SecurityReview --> Rejected: Must fix or block

    Approved --> TimeBoundSuppression: Add to suppressions.xml with expiry
    ApplyPatch --> [*]: Build passes
    BlockBuild --> [*]: Build fails
    Rejected --> BlockBuild
    TimeBoundSuppression --> [*]: Build passes with suppression

Risk Acceptance Form:

# Security Risk Acceptance Form
# File: security-risk-acceptances/CVE-2024-67890-System.IdentityModel.Tokens.Jwt.md

---
title: Risk Acceptance - CVE-2024-67890 (System.IdentityModel.Tokens.Jwt)
cve: CVE-2024-67890
cvssScore: 7.5 (High)
package: System.IdentityModel.Tokens.Jwt
version: 6.34.0
detectedDate: 2025-01-15
approvalDate: 2025-01-17
expirationDate: 2025-04-17
status: Approved
---

## Vulnerability Description
JWT signature validation bypass in System.IdentityModel.Tokens.Jwt 6.x allows attackers to forge tokens.

**Reference**: https://nvd.nist.gov/vuln/detail/CVE-2024-67890

## Impact Assessment
- **Exploitability**: Requires attacker to know signing algorithm (RS256 used in ATP)
- **Attack Vector**: Network-based; requires JWT manipulation
- **Affected Components**: Gateway service (all others validate via Gateway)
- **Tenant Impact**: Could allow cross-tenant access if exploited

## Mitigation Controls
1. **Additional Validation**: Custom JWT validator checks audience, issuer, expiration
2. **Rate Limiting**: API rate limiting prevents brute-force attempts
3. **Monitoring**: Anomaly detection alerts on unusual token patterns
4. **Network Segmentation**: Gateway isolated in separate subnet

## Justification for Temporary Acceptance
- Vendor fix scheduled for System.IdentityModel.Tokens.Jwt 7.0 (March 2025)
- Upgrade to 7.0 requires breaking API changes (planned for Q2 2025)
- Mitigation controls reduce risk from High (7.5) to Medium (4.2)

## Action Plan
- Q1 2025: Implement additional validation layer (completed)
- Q2 2025: Upgrade to System.IdentityModel.Tokens.Jwt 7.x
- Q2 2025: Remove suppression after upgrade

## Approval
- **Security Officer**: ✅ Approved (Jane Smith, 2025-01-17)
- **Lead Architect**: ✅ Approved (John Doe, 2025-01-17)
- **SRE Lead**: ✅ Consulted (Mike Johnson, 2025-01-16)

## Review Schedule
- Monthly review: 2025-02-17, 2025-03-17
- Expiration: 2025-04-17 (auto-removed from suppressions.xml)
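A review job could enforce the expiration in that front matter automatically (illustrative sketch, not part of ATP's actual tooling; it parses the simple `key: value` header shown above without a YAML library):

```python
from datetime import date

def acceptance_expired(front_matter: str, today: date) -> bool:
    """True when a risk-acceptance's expirationDate has passed."""
    fields = {}
    for line in front_matter.strip().splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return today > date.fromisoformat(fields["expirationDate"])

fm = """
cve: CVE-2024-67890
status: Approved
expirationDate: 2025-04-17
"""
print(acceptance_expired(fm, date(2025, 5, 1)))  # True  -> remove suppression
print(acceptance_expired(fm, date(2025, 3, 1)))  # False -> still valid
```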

Secrets Detection (CredScan / GitGuardian)

Purpose: Detect hardcoded secrets (API keys, passwords, tokens, certificates) in source code and prevent them from being committed.

Tool: CredScan (Microsoft Credential Scanner) or GitGuardian for GitHub repos

Enforcement: ❌ Block build immediately if secrets detected; no exceptions.

Azure Pipelines Configuration:

# Secrets Detection Gate
- task: CredScan@3
  inputs:
    toolMajorVersion: 'V2'
    suppressionsFile: 'credscan-suppressions.json'
    outputFormat: 'sarif'
    debugMode: false

    # Scan all text files
    scanFolder: '$(Build.SourcesDirectory)'

    # Exclude known safe files
    excludePathsFromScan: |
      **/node_modules/**
      **/bin/**
      **/obj/**
      **/*.min.js
      **/packages/**
  displayName: 'Scan for Secrets (CredScan)'

  # Always fail on secrets
  continueOnError: false

# Analyze CredScan results
- task: PostAnalysis@2
  inputs:
    CredScan: true
    ToolLogsNotFoundAction: 'Error'  # Fail if CredScan didn't run
  displayName: 'Post-Analysis: Validate No Secrets'

Detected Secret Patterns:

| Pattern Type | Regex Pattern | Example | Action |
|--------------|---------------|---------|--------|
| API Keys | `[a-zA-Z0-9]{32,}` with entropy check | `api_key=sk_live_123abc456def...` | ❌ Block; rotate key |
| Connection Strings | `Server=.*;Password=.*;` | `Server=sql.example.com;Password=P@ssw0rd` | ❌ Block; use Key Vault |
| JWT Tokens | `eyJ[a-zA-Z0-9_-]*\.eyJ[a-zA-Z0-9_-]*\.` | `eyJhbGciOiJIUzI1NiIsInR5cCI6...` | ❌ Block; remove token |
| Private Keys | `-----BEGIN (RSA\|PRIVATE) KEY-----` | `-----BEGIN RSA PRIVATE KEY-----` | ❌ Block; use Key Vault |
| Azure Storage Keys | `AccountKey=[A-Za-z0-9+/]{88}==` | `AccountKey=abc123...xyz==` | ❌ Block; regenerate key |
| AWS Credentials | `AKIA[0-9A-Z]{16}` | `AKIAIOSFODNN7EXAMPLE` | ❌ Block; rotate credentials |
| GitHub Tokens | `ghp_[a-zA-Z0-9]{36}` | `ghp_abc123def456...` | ❌ Block; revoke token |
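The mechanism behind pattern-based detection is simple regex matching over file contents; real scanners such as CredScan and GitGuardian add entropy analysis and far broader rule sets. A minimal sketch using a few of the table's patterns:

```python
import re

# Subset of the secret patterns listed above (illustrative only).
SECRET_PATTERNS = {
    "aws-access-key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github-token": re.compile(r"ghp_[a-zA-Z0-9]{36}"),
    "private-key": re.compile(r"-----BEGIN (?:RSA )?PRIVATE KEY-----"),
    "connection-string": re.compile(r"Server=.*;Password=.*", re.IGNORECASE),
}

def scan_text(text: str) -> list[str]:
    """Return the names of secret patterns found in the text."""
    return [name for name, rx in SECRET_PATTERNS.items() if rx.search(text)]

print(scan_text("aws_key = 'AKIAIOSFODNN7EXAMPLE'"))  # ['aws-access-key']
print(scan_text("var x = 1;"))                        # []
```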

Secrets Detection Example:

// ❌ BAD: Hardcoded connection string (CredScan WILL detect)
public class DatabaseConnection
{
    private const string ConnectionString = "Server=atp-sql-prod.database.windows.net;Password=P@ssw0rd123!";  // ❌ BLOCKED
}

// ✅ GOOD: Connection string from configuration
public class DatabaseConnection
{
    private readonly string _connectionString;

    public DatabaseConnection(IConfiguration configuration)
    {
        _connectionString = configuration.GetConnectionString("DefaultConnection");  // ✅ SAFE
    }
}

// ❌ BAD: API key in appsettings.json (CredScan detects in JSON files)
{
  "ExternalApi": {
    "ApiKey": "sk_live_123abc456def789ghi"  // ❌ BLOCKED
  }
}

// ✅ GOOD: API key from Key Vault
{
  "ExternalApi": {
    "ApiKey": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod.vault.azure.net/secrets/ExternalApiKey)"  // ✅ SAFE
  }
}

// ❌ BAD: Password in test code
[Fact]
public void Should_Connect_To_Database()
{
    var connStr = "Server=localhost;Password=TestP@ss123!";  // ❌ BLOCKED (even in tests)
    // ...
}

// ✅ GOOD: Password from environment variable
[Fact]
public void Should_Connect_To_Database()
{
    var connStr = Environment.GetEnvironmentVariable("TEST_DB_CONNECTION_STRING");  // ✅ SAFE
    // ...
}

CredScan Suppression (False Positives Only):

// credscan-suppressions.json
{
  "suppressions": [
    {
      "placeholder": "Password123!",
      "_justification": "Example password in documentation comment; not actual secret"
    },
    {
      "placeholder": "sk_test_123456789",
      "_justification": "Test API key example in unit test; not a real key"
    },
    {
      "file": "docs/examples/connection-string.md",
      "_justification": "Documentation example with fake credentials"
    }
  ]
}

Secret Rotation Procedure (when secrets detected):

#!/bin/bash
# rotate-leaked-secret.sh

SECRET_TYPE=$1  # api-key, connection-string, certificate, etc.
SECRET_NAME=$2  # Name in Key Vault

echo "⚠️  Secret leaked: $SECRET_NAME"
echo "Initiating emergency rotation..."

# Step 1: Revoke compromised secret immediately
az keyvault secret set-attributes \
  --vault-name atp-keyvault-prod-eus \
  --name $SECRET_NAME \
  --enabled false

echo "✅ Secret disabled in Key Vault"

# Step 2: Generate new secret
NEW_SECRET=$(openssl rand -base64 32)

az keyvault secret set \
  --vault-name atp-keyvault-prod-eus \
  --name $SECRET_NAME \
  --value "$NEW_SECRET"

echo "✅ New secret generated and stored"

# Step 3: Restart services to pick up new secret
az webapp restart \
  --name atp-ingestion-prod-eus \
  --resource-group ATP-Prod-EUS-RG

echo "✅ Services restarted with new secret"

# Step 4: Notify security team
az boards work-item create \
  --type "Incident" \
  --title "Secret Leak Detected: $SECRET_NAME" \
  --description "Secret detected in code commit. Rotated immediately.\n\nSecret: $SECRET_NAME\nBuild: $BUILD_BUILDNUMBER\nCommit: $BUILD_SOURCEVERSION" \
  --assigned-to "security-team@connectsoft.example" \
  --fields Priority=1

echo "✅ Incident created for security review"

Container Image Scanning (Trivy)

Purpose: Scan Docker images for vulnerabilities in base images, OS packages, and application dependencies before pushing to Azure Container Registry (ACR).

Tool: Trivy — Open-source container vulnerability scanner

Threshold:

  • Critical: ❌ Block push to registry
  • High: ❌ Block push to registry (require patch or risk acceptance)
  • Medium: ⚠️ Warning; track in security backlog
  • Low: ℹ️ Info; no action required

Azure Pipelines Configuration:

# Container Image Scanning Gate
- task: Docker@2
  inputs:
    command: 'build'
    dockerfile: '$(dockerfile)'
    repository: '$(imageRepository)'
    tags: |
      $(Build.BuildNumber)
      latest
  displayName: 'Build Docker Image'

# Trivy scan (before push)
- script: |
    # Install Trivy
    wget -qO - https://aquasecurity.github.io/trivy-repo/deb/public.key | sudo apt-key add -
    echo "deb https://aquasecurity.github.io/trivy-repo/deb $(lsb_release -sc) main" | sudo tee -a /etc/apt/sources.list.d/trivy.list
    sudo apt-get update && sudo apt-get install trivy

    # Scan image for HIGH/CRITICAL vulnerabilities
    trivy image \
      --severity HIGH,CRITICAL \
      --exit-code 1 \
      --no-progress \
      --format json \
      --output trivy-report.json \
      $(containerRegistry)/$(imageRepository):$(Build.BuildNumber)

    # Generate HTML report for artifact
    trivy image \
      --severity HIGH,CRITICAL,MEDIUM,LOW \
      --format template \
      --template "@contrib/html.tpl" \
      --output trivy-report.html \
      $(containerRegistry)/$(imageRepository):$(Build.BuildNumber)
  displayName: 'Trivy Scan Docker Image'
  continueOnError: false  # Block on HIGH/CRITICAL

# Publish Trivy report
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: 'trivy-report.html'
    ArtifactName: 'trivy-scan-$(Build.BuildNumber)'
  displayName: 'Publish Trivy Report'
  condition: always()

# Only push if scan passed
- task: Docker@2
  inputs:
    command: 'push'
    repository: '$(imageRepository)'
    containerRegistry: '$(dockerRegistryServiceConnection)'
    tags: |
      $(Build.BuildNumber)
      latest
  displayName: 'Push Docker Image to ACR'
  condition: succeeded()  # Only push if Trivy scan passed

Trivy Report Example:

// trivy-report.json (excerpt)
{
  "Results": [
    {
      "Target": "connectsoft.azurecr.io/atp/ingestion:1.0.123",
      "Class": "os-pkgs",
      "Type": "ubuntu",
      "Vulnerabilities": [
        {
          "VulnerabilityID": "CVE-2024-99999",
          "PkgName": "openssl",
          "InstalledVersion": "3.0.2-0ubuntu1.10",
          "FixedVersion": "3.0.2-0ubuntu1.12",
          "Severity": "CRITICAL",
          "Description": "OpenSSL buffer overflow allows remote code execution",
          "References": [
            "https://nvd.nist.gov/vuln/detail/CVE-2024-99999"
          ],
          "PrimaryURL": "https://ubuntu.com/security/CVE-2024-99999",
          "Title": "openssl: buffer overflow in SSL handshake"
        }
      ]
    },
    {
      "Target": "app/ConnectSoft.ATP.Ingestion.dll",
      "Class": "lang-pkgs",
      "Type": "nuget",
      "Vulnerabilities": [
        {
          "VulnerabilityID": "GHSA-xxxx-yyyy-zzzz",
          "PkgName": "System.Text.Json",
          "InstalledVersion": "8.0.0",
          "FixedVersion": "8.0.1",
          "Severity": "HIGH",
          "Description": "Deserialization vulnerability in System.Text.Json"
        }
      ]
    }
  ]
}
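The pipeline's pass/fail decision over a report like this mirrors `--severity HIGH,CRITICAL --exit-code 1`: count findings per severity and block when any HIGH/CRITICAL appears. A hedged sketch (field names follow the excerpt above):

```python
import json
from collections import Counter

def summarize(report_json: str) -> tuple[Counter, bool]:
    """Count vulnerabilities per severity; blocked=True means fail the push."""
    report = json.loads(report_json)
    counts = Counter()
    for result in report.get("Results", []):
        # Trivy emits null (not []) when a target has no vulnerabilities.
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln["Severity"]] += 1
    blocked = counts["CRITICAL"] > 0 or counts["HIGH"] > 0
    return counts, blocked

sample = '''{"Results": [
  {"Vulnerabilities": [{"Severity": "CRITICAL"}, {"Severity": "LOW"}]}]}'''
counts, blocked = summarize(sample)
print(dict(counts), blocked)  # {'CRITICAL': 1, 'LOW': 1} True
```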

Container Hardening Checklist (enforced by Trivy + manual review):

| Hardening Control | Validation | Blocker | Notes |
|-------------------|------------|---------|-------|
| Non-Root User | Trivy checks USER directive | ✅ Yes | Must run as non-root (UID 1000+) |
| No Secrets in Layers | CredScan + Trivy | ✅ Yes | Secrets must be injected at runtime |
| Minimal Base Image | Image size < 200MB | ⚠️ Warning | Prefer distroless or Alpine |
| Up-to-Date Base Image | Base image < 30 days old | ⚠️ Warning | Rebuild monthly to get security patches |
| Health Check | HEALTHCHECK directive present | ⚠️ Warning | Required for Kubernetes liveness/readiness |
| Read-Only Filesystem | Trivy config check | ⚠️ Warning | Prefer read-only root filesystem |
| Drop Capabilities | Trivy config check | ⚠️ Warning | Drop all capabilities except NET_BIND_SERVICE |
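Two of these controls (non-root user, health check) can be checked with a simple Dockerfile lint pass. This is an illustrative helper under assumed rules, not a real Trivy check invocation:

```python
def lint_dockerfile(text: str) -> list[str]:
    """Flag a missing/root final USER and a missing HEALTHCHECK directive."""
    issues = []
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    users = [line.split(None, 1)[1] for line in lines
             if line.upper().startswith("USER ")]
    if not users or users[-1].lower() == "root":
        issues.append("final USER missing or root")
    if not any(line.upper().startswith("HEALTHCHECK") for line in lines):
        issues.append("HEALTHCHECK directive missing")
    return issues

good = "FROM alpine:3.19\nUSER atpuser\nHEALTHCHECK CMD true\n"
bad = "FROM alpine:3.19\nUSER root\n"
print(lint_dockerfile(good))  # []
print(lint_dockerfile(bad))   # ['final USER missing or root', 'HEALTHCHECK directive missing']
```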

Dockerfile Best Practices (enforced by Trivy):

# ✅ GOOD: Secure Dockerfile
FROM mcr.microsoft.com/dotnet/aspnet:8.0-jammy AS base

# Run as non-root user
RUN groupadd -r atpuser && useradd -r -g atpuser atpuser
USER atpuser

WORKDIR /app
EXPOSE 8080

FROM mcr.microsoft.com/dotnet/sdk:8.0-jammy AS build
WORKDIR /src

# Copy only necessary files (avoid secrets)
COPY ["src/ConnectSoft.ATP.Ingestion/ConnectSoft.ATP.Ingestion.csproj", "ConnectSoft.ATP.Ingestion/"]
RUN dotnet restore "ConnectSoft.ATP.Ingestion/ConnectSoft.ATP.Ingestion.csproj"

COPY src/ .
RUN dotnet build "ConnectSoft.ATP.Ingestion/ConnectSoft.ATP.Ingestion.csproj" -c Release -o /app/build

FROM build AS publish
RUN dotnet publish "ConnectSoft.ATP.Ingestion/ConnectSoft.ATP.Ingestion.csproj" -c Release -o /app/publish

FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8080/health/live || exit 1

ENTRYPOINT ["dotnet", "ConnectSoft.ATP.Ingestion.dll"]

# ❌ BAD: Dockerfile anti-patterns (Trivy will flag)
# FROM ubuntu:latest  # Non-specific tag; use specific version
# RUN apt-get update && apt-get install -y curl  # Missing clean up
# ENV DB_PASSWORD=P@ssw0rd123!  # Hardcoded secret
# USER root  # Running as root
# COPY . .  # Copies everything including secrets

SAST (Static Application Security Testing)

Purpose: Detect security vulnerabilities in application code through static analysis (SQL injection, XSS, weak crypto, etc.).

Tool: SonarQube Security Rules (integrated with build quality gates)

Security Rules Enforced:

| Rule ID | Vulnerability Type | Severity | Example | Remediation |
|---------|--------------------|----------|---------|-------------|
| S2077 | SQL Injection | Blocker | `$"SELECT * FROM Users WHERE Id={id}"` | Use parameterized queries or ORM |
| S3649 | SQL Injection (LINQ) | Blocker | `context.Users.FromSqlRaw($"SELECT * WHERE Id={id}")` | Use `FromSqlInterpolated` |
| S5131 | Cross-Site Scripting (XSS) | Critical | `@Html.Raw(userInput)` | Use `@userInput` (auto-escaped) |
| S4790 | Weak Cryptography | Blocker | `MD5.Create()`, `DES.Create()` | Use SHA256, AES256 |
| S2068 | Hardcoded Credentials | Blocker | `var password = "P@ssw0rd";` | Load from configuration |
| S3330 | HTTP Not HTTPS | Critical | `new HttpClient().GetAsync("http://...")` | Use HTTPS |
| S5122 | CORS Misconfiguration | Critical | `AllowAnyOrigin()` | Specify allowed origins |
| S5042 | Zip Slip | Critical | `zipEntry.FullName` without validation | Validate paths before extraction |

SAST Examples & Fixes:

// ❌ BAD: SQL Injection (S2077, S3649)
public async Task<User> GetUserAsync(string userId)
{
    var sql = $"SELECT * FROM Users WHERE UserId = '{userId}'";  // ❌ Injectable
    return await _context.Users.FromSqlRaw(sql).FirstOrDefaultAsync();
}

// ✅ GOOD: Parameterized query
public async Task<User> GetUserAsync(string userId)
{
    return await _context.Users
        .FromSqlInterpolated($"SELECT * FROM Users WHERE UserId = {userId}")  // ✅ Parameterized
        .FirstOrDefaultAsync();
}

// OR: Use LINQ (preferred)
public async Task<User> GetUserAsync(string userId)
{
    return await _context.Users
        .Where(u => u.UserId == userId)  // ✅ LINQ (safe)
        .FirstOrDefaultAsync();
}

// ❌ BAD: XSS Vulnerability (S5131)
public IActionResult DisplayMessage(string message)
{
    ViewBag.Message = message;
    return View();  // View uses @Html.Raw(ViewBag.Message) ❌
}

// ✅ GOOD: Auto-escaped output
public IActionResult DisplayMessage(string message)
{
    ViewBag.Message = message;
    return View();  // View uses @ViewBag.Message ✅ (auto-escaped)
}

// ❌ BAD: Weak Cryptography (S4790)
public string HashPassword(string password)
{
    using var md5 = MD5.Create();  // ❌ MD5 is cryptographically broken
    var hash = md5.ComputeHash(Encoding.UTF8.GetBytes(password));
    return Convert.ToBase64String(hash);
}

// ✅ GOOD: Strong cryptography (passes S4790)
public string HashPassword(string password)
{
    using var sha256 = SHA256.Create();  // ✅ Passes the rule, but unsalted/fast — prefer BCrypt below for passwords
    var hash = sha256.ComputeHash(Encoding.UTF8.GetBytes(password));
    return Convert.ToBase64String(hash);
}

// ✅ BETTER: Use BCrypt/Argon2 for password hashing
public string HashPassword(string password)
{
    return BCrypt.Net.BCrypt.HashPassword(password, workFactor: 12);  // ✅ Industry standard
}

// ❌ BAD: CORS Misconfiguration (S5122)
public void ConfigureServices(IServiceCollection services)
{
    services.AddCors(options =>
    {
        options.AddPolicy("AllowAll", builder =>
        {
            builder.AllowAnyOrigin()  // ❌ Allows any origin (security risk)
                   .AllowAnyMethod()
                   .AllowAnyHeader();
        });
    });
}

// ✅ GOOD: Restrictive CORS
public void ConfigureServices(IServiceCollection services)
{
    services.AddCors(options =>
    {
        options.AddPolicy("ATPPolicy", builder =>
        {
            builder.WithOrigins("https://atp.connectsoft.com", "https://app.connectsoft.com")  // ✅ Specific origins
                   .WithMethods("GET", "POST", "PUT", "DELETE")  // ✅ Specific methods
                   .WithHeaders("Content-Type", "Authorization");  // ✅ Specific headers
        });
    });
}

License Compliance Scanning

Purpose: Ensure all dependencies have acceptable licenses that comply with ConnectSoft's legal policies (no GPL/AGPL in production).

Tool: dotnet-project-licenses or FOSSA

Acceptable Licenses (Whitelist):

| License | Category | ATP Usage | Notes |
|---------|----------|-----------|-------|
| MIT | Permissive | ✅ Allowed | Most NuGet packages |
| Apache 2.0 | Permissive | ✅ Allowed | Common in .NET ecosystem |
| BSD (2-/3-Clause) | Permissive | ✅ Allowed | Widely used |
| ISC | Permissive | ✅ Allowed | Similar to MIT |
| MS-PL | Permissive | ✅ Allowed | Microsoft Public License |
| GPL 2.0/3.0 | Copyleft | ❌ Prohibited | Requires source disclosure |
| AGPL 3.0 | Copyleft | ❌ Prohibited | Network copyleft (service distribution) |
| LGPL 2.1/3.0 | Weak Copyleft | ⚠️ Review Required | Allowed if dynamically linked |
| Custom/Proprietary | Commercial | ⚠️ Review Required | Requires legal review |

License Scanning (Azure Pipelines):

# License Compliance Gate
- script: |
    # Install license scanner
    dotnet tool install --global dotnet-project-licenses

    # Generate license report
    dotnet-project-licenses \
      --input $(Build.SourcesDirectory) \
      --output $(Build.ArtifactStagingDirectory)/licenses \
      --export-license-texts \
      --projects-filter "^(?!.*Tests).*$"  # Exclude test projects

    # Check for prohibited licenses
    PROHIBITED=$(jq -r '.projects[].packages[] | select(.license == "GPL-2.0" or .license == "GPL-3.0" or .license == "AGPL-3.0") | .packageName' \
      $(Build.ArtifactStagingDirectory)/licenses/licenses.json)

    if [ -n "$PROHIBITED" ]; then
      echo "##vso[task.logissue type=error]Prohibited licenses detected:"
      echo "$PROHIBITED"
      exit 1
    fi

    echo "✅ All dependencies have acceptable licenses"
  displayName: 'Validate License Compliance'

# Publish license report
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)/licenses'
    ArtifactName: 'licenses-$(Build.BuildNumber)'
  displayName: 'Publish License Report'

License Report Example:

// licenses.json (excerpt)
{
  "projects": [
    {
      "projectName": "ConnectSoft.ATP.Ingestion",
      "packages": [
        {
          "packageName": "System.Text.Json",
          "packageVersion": "8.0.0",
          "license": "MIT",
          "licenseUrl": "https://licenses.nuget.org/MIT"
        },
        {
          "packageName": "Newtonsoft.Json",
          "packageVersion": "13.0.3",
          "license": "MIT",
          "licenseUrl": "https://github.com/JamesNK/Newtonsoft.Json/blob/master/LICENSE.md"
        },
        {
          "packageName": "ProblematicLibrary",
          "packageVersion": "1.0.0",
          "license": "GPL-3.0",  // ❌ PROHIBITED
          "licenseUrl": "https://www.gnu.org/licenses/gpl-3.0.en.html"
        }
      ]
    }
  ]
}

Security Gate Metrics & Monitoring

Purpose: Track security posture over time and identify trends in vulnerability detection/remediation.

Security Metrics Dashboard:

# Azure DevOps Security Dashboard
dashboard:
  name: "ATP Security Posture"

  widgets:
    - type: vulnerabilityTrend
      title: "Open Vulnerabilities (Last 90 Days)"
      query: |
        customEvents
        | where name == "VulnerabilityDetected"
        | summarize count() by Severity, bin(timestamp, 1d)
      target: 0 Critical/High

    - type: secretsDetection
      title: "Secrets Detected (Last 30 Days)"
      query: "CredScan Results"
      target: 0

    - type: remediationTime
      title: "Mean Time to Remediate (MTTR)"
      query: |
        customEvents
        | where name in ("VulnerabilityDetected", "VulnerabilityRemediated")
        | summarize MTTR = avg(datetime_diff('hour', RemediatedAt, DetectedAt))
      target: < 24h (Critical), < 7d (High)

Security KQL Queries:

// Open vulnerabilities by severity (last 30 days)
customEvents
| where name == "VulnerabilityDetected"
| where timestamp > ago(30d)
| extend Severity = tostring(customDimensions.Severity)
| extend Status = tostring(customDimensions.Status)
| where Status == "Open"
| summarize Count = count() by Severity
| order by 
    case(
        Severity == "Critical", 1,
        Severity == "High", 2,
        Severity == "Medium", 3,
        Severity == "Low", 4,
        5
    )

// Mean Time to Remediate (MTTR) by severity
customEvents
| where name in ("VulnerabilityDetected", "VulnerabilityRemediated")
| where timestamp > ago(90d)
| extend VulnerabilityId = tostring(customDimensions.VulnerabilityId)
| extend Severity = tostring(customDimensions.Severity)
| summarize 
    DetectedAt = minif(timestamp, name == "VulnerabilityDetected"),
    RemediatedAt = maxif(timestamp, name == "VulnerabilityRemediated")
    by VulnerabilityId, Severity
| where isnotnull(RemediatedAt)
| extend MTTR_Hours = datetime_diff('hour', RemediatedAt, DetectedAt)
| summarize 
    AvgMTTR = avg(MTTR_Hours),
    P50MTTR = percentile(MTTR_Hours, 50),
    P95MTTR = percentile(MTTR_Hours, 95)
    by Severity

// Secret detection incidents (last 6 months)
customEvents
| where name == "SecretDetected"
| where timestamp > ago(180d)
| extend SecretType = tostring(customDimensions.SecretType)
| extend Repository = tostring(customDimensions.Repository)
| extend Commit = tostring(customDimensions.CommitSha)
| summarize 
    IncidentCount = count(),
    MostRecentIncident = max(timestamp)
    by SecretType, Repository
| order by IncidentCount desc

Security Gate Enforcement Policy

Purpose: Define clear policies for vulnerability remediation SLAs and escalation procedures.

Remediation SLA Matrix:

| Severity | CVSS Score | Detection → Fix SLA | Escalation (SLA Breach) | Production Blocker |
|----------|------------|---------------------|-------------------------|--------------------|
| Critical | 9.0-10.0 | 24 hours | Security Officer → CISO | ✅ Yes |
| High | 7.0-8.9 | 7 days | Security Officer → Lead Architect | ✅ Yes |
| Medium | 4.0-6.9 | 30 days | Tech Lead → Security Officer | ❌ No |
| Low | 0.1-3.9 | Next Release | None | ❌ No |

Escalation Workflow:

graph TD
    A[Vulnerability Detected] --> B{Severity?}
    B -->|Critical| C[24h SLA Timer Starts]
    B -->|High| D[7d SLA Timer Starts]
    B -->|Medium| E[30d SLA Timer Starts]
    B -->|Low| F[Track in Backlog]

    C --> G{Fixed in 24h?}
    D --> H{Fixed in 7d?}
    E --> I{Fixed in 30d?}

    G -->|Yes| J[Closed]
    G -->|No| K[Escalate to CISO]

    H -->|Yes| J
    H -->|No| L[Escalate to Lead Architect]

    I -->|Yes| J
    I -->|No| M[Escalate to Security Officer]

    K --> N[Emergency Patch Required]
    L --> O[Risk Acceptance or Patch]
    M --> P[Prioritize in Next Sprint]

    F --> Q[Fix in Next Major Release]

    style K fill:#ff6b6b
    style L fill:#feca57
    style M fill:#feca57

Automated SLA Monitoring:

// Monitor vulnerability remediation SLAs
[FunctionName("MonitorVulnerabilitySLAs")]
public async Task RunAsync(
    [TimerTrigger("0 0 */6 * * *")] TimerInfo timer,  // Every 6 hours
    ILogger log)
{
    log.LogInformation("Checking vulnerability remediation SLAs...");

    var openVulnerabilities = await GetOpenVulnerabilitiesAsync();
    var breachedSLAs = new List<VulnerabilitySLA>();

    foreach (var vuln in openVulnerabilities)
    {
        var sla = CalculateSLA(vuln.Severity);
        var ageHours = (DateTime.UtcNow - vuln.DetectedAt).TotalHours;

        if (ageHours > sla.Hours)
        {
            breachedSLAs.Add(new VulnerabilitySLA
            {
                CVE = vuln.CVE,
                Severity = vuln.Severity,
                DetectedAt = vuln.DetectedAt,
                AgeHours = ageHours,
                SLAHours = sla.Hours,
                BreachHours = ageHours - sla.Hours,
                AssignedTo = vuln.AssignedTo
            });

            // Escalate based on severity
            if (vuln.Severity == "Critical")
            {
                await EscalateToCISOAsync(vuln);
            }
            else if (vuln.Severity == "High")
            {
                await EscalateToArchitectAsync(vuln);
            }
            else if (vuln.Severity == "Medium")
            {
                await EscalateToSecurityOfficerAsync(vuln);
            }
        }
    }

    if (breachedSLAs.Any())
    {
        log.LogWarning($"SLA breaches detected: {breachedSLAs.Count} vulnerabilities");
        await SendSLABreachReportAsync(breachedSLAs);
    }
    else
    {
        log.LogInformation("✅ All vulnerabilities within SLA");
    }
}

private SLA CalculateSLA(string severity) => severity switch
{
    "Critical" => new SLA { Hours = 24, Escalation = "CISO" },
    "High" => new SLA { Hours = 168, Escalation = "Lead Architect" },  // 7 days
    "Medium" => new SLA { Hours = 720, Escalation = "Security Officer" },  // 30 days
    _ => new SLA { Hours = int.MaxValue, Escalation = "None" }
};

Dependency Update Strategy

Purpose: Proactively update dependencies to minimize vulnerability exposure.

Update Cadence:

dependencyUpdates:
  automated:
    schedule: Weekly (Monday 2 AM)
    scope: Patch versions only (1.2.3 → 1.2.4)
    tool: Dependabot or Renovate
    automerge: true  # If tests pass

  minor:
    schedule: Monthly (1st Monday)
    scope: Minor versions (1.2.x → 1.3.0)
    tool: Manual PR by platform team
    automerge: false  # Requires review

  major:
    schedule: Quarterly
    scope: Major versions (1.x → 2.0)
    tool: Manual PR with ADR
    automerge: false  # Requires architect approval

  security:
    schedule: Immediate (on CVE disclosure)
    scope: Any version with security patch
    tool: Emergency PR
    automerge: false  # Requires security officer approval
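The routing in the cadence above hinges on classifying the version bump. A small sketch of that classification (illustrative; in practice Dependabot/Renovate do this from semver metadata):

```python
def bump_type(current: str, candidate: str) -> str:
    """Classify a semver bump to route it per the update cadence above."""
    cur = [int(part) for part in current.split(".")]
    new = [int(part) for part in candidate.split(".")]
    if new[0] != cur[0]:
        return "major"   # quarterly; manual PR with ADR, architect approval
    if new[1] != cur[1]:
        return "minor"   # monthly; manual PR, requires review
    if new[2] != cur[2]:
        return "patch"   # weekly; auto-merge if tests pass
    return "none"

print(bump_type("1.2.3", "1.2.4"))  # patch
print(bump_type("1.2.3", "2.0.0"))  # major
```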

Dependabot Configuration (.github/dependabot.yml):

version: 2
updates:
  # .NET dependencies
  - package-ecosystem: "nuget"
    directory: "/"
    schedule:
      interval: "weekly"
      day: "monday"
      time: "02:00"

    # Limit the number of concurrent dependency PRs
    open-pull-requests-limit: 10
    reviewers:
      - "platform-team"

    # Grouping strategy
    groups:
      security-updates:
        applies-to: security-updates
        patterns:
          - "*"

      patch-updates:
        patterns:
          - "*"
        update-types:
          - "patch"

    # Labels for PR categorization
    labels:
      - "dependencies"
      - "automated"

    # Ignore specific dependencies
    ignore:
      - dependency-name: "System.Text.Json"
        versions: ["8.0.0"]  # Pinned for compatibility

Summary

  • Security Gates: 5-8 minute execution; zero tolerance for critical/high vulnerabilities and secrets
  • Dependency Scanning: OWASP Dependency-Check with NVD integration; CVSS ≥7 blocks build
  • Severity Thresholds: Critical (24h SLA), High (7d SLA), Medium (30d SLA), Low (next release)
  • Suppression Workflow: Documented approval flow (Triage → Patch/Risk Acceptance → Security Review → Time-Bound Suppression)
  • Risk Acceptance: Formal template with impact assessment, mitigation controls, approval signatures, expiration dates
  • Secrets Detection: CredScan blocks on any detected secrets (API keys, passwords, tokens, certificates)
  • Secret Patterns: 7 pattern types (API keys, connection strings, JWTs, private keys, Azure/AWS/GitHub credentials)
  • Secret Rotation: Emergency rotation script with Key Vault disable/regenerate/restart workflow
  • Container Scanning: Trivy scans Docker images; blocks push on Critical/High OS/package vulnerabilities
  • Container Hardening: 7 controls (non-root user, no secrets, minimal image, health check, read-only FS)
  • SAST: SonarQube security rules (SQL injection, XSS, weak crypto, CORS, hardcoded credentials)
  • License Compliance: Whitelist (MIT, Apache, BSD); prohibited (GPL, AGPL); scan with dotnet-project-licenses
  • Update Strategy: Weekly patch updates (automated), monthly minor updates, quarterly major updates, immediate security updates
  • SLA Monitoring: Automated Azure Function checks every 6 hours; escalates Critical (CISO), High (Architect), Medium (Security Officer)

SBOM & Supply Chain Gates (Deep Dive)

Software Bill of Materials (SBOM) and supply chain security gates ensure transparency and traceability of all software components. ATP generates comprehensive SBOMs for every build and implements cryptographic signing for container images to prevent supply chain attacks.

Philosophy: In the era of Log4Shell and SolarWinds, supply chain security is paramount. ATP enforces complete visibility into all dependencies, cryptographic verification of artifacts, and immutable provenance tracking from code commit to production deployment.

SBOM & Supply Chain Workflow

graph TD
    A[Security Gates Passed] --> B[Generate SBOM]
    B --> C{SBOM Valid?}
    C -->|No| D[SBOM Generation Failed ❌]
    C -->|Yes| E[Validate SBOM Content]
    E --> F{All Dependencies Listed?}
    F -->|No| G[Incomplete SBOM ❌]
    F -->|Yes| H[Sign Build Artifacts]
    H --> I[Sign Container Image]
    I --> J{Signature Valid?}
    J -->|No| K[Signing Failed ❌]
    J -->|Yes| L[Generate Provenance]
    L --> M[Publish SBOM + Provenance]
    M --> N[Supply Chain Gates Passed ✅]

    D --> O[Pipeline Stopped]
    G --> O
    K --> O
    N --> P[Proceed to Compliance Gates]

    style D fill:#ff6b6b
    style G fill:#ff6b6b
    style K fill:#ff6b6b
    style N fill:#90EE90

Typical SBOM Gate Duration: 2-3 minutes


SBOM Generation (CycloneDX)

Purpose: Generate a complete inventory of all software components (NuGet packages, Docker base images, OS packages) with version, license, and vulnerability information.

Tool: CycloneDX — OWASP-standardized SBOM format (also supports SPDX)

Requirements:

  1. Every build must generate an SBOM (no exceptions)
  2. Published as build artifact for audit trail and compliance
  3. Includes all dependencies: Direct, transitive, dev dependencies
  4. Metadata captured: Versions, licenses, CVEs, hashes (SHA256)
  5. Retention: 7 years for production builds (immutable storage)

Azure Pipelines Configuration:

# SBOM Generation Gate
- task: CycloneDX@1
  inputs:
    projectPath: '$(Build.SourcesDirectory)'
    outputFormat: 'json,xml'
    outputPath: '$(Build.ArtifactStagingDirectory)/sbom'

    # Include detailed metadata
    includeSerialNumber: true
    includeLicenseText: true

    # Scan depth
    scanType: 'solution'  # Scan entire solution

    # Output naming
    outputFilename: 'atp-ingestion-sbom-$(Build.BuildNumber)'
  displayName: 'Generate SBOM (CycloneDX)'
  continueOnError: false  # Fail if SBOM generation fails

# Validate SBOM was generated
- script: |
    SBOM_FILE="$(Build.ArtifactStagingDirectory)/sbom/atp-ingestion-sbom-$(Build.BuildNumber).json"

    if [ ! -f "$SBOM_FILE" ]; then
      echo "##vso[task.logissue type=error]SBOM file not found: $SBOM_FILE"
      exit 1
    fi

    # Validate SBOM is valid JSON
    if ! jq empty "$SBOM_FILE" 2>/dev/null; then
      echo "##vso[task.logissue type=error]SBOM is not valid JSON"
      exit 1
    fi

    # Validate SBOM has components
    COMPONENT_COUNT=$(jq '.components | length' "$SBOM_FILE")

    if [ "$COMPONENT_COUNT" -lt 10 ]; then
      echo "##vso[task.logissue type=error]SBOM has too few components: $COMPONENT_COUNT (expected >10)"
      exit 1
    fi

    echo "✅ SBOM validated: $COMPONENT_COUNT components"
  displayName: 'Validate SBOM Content'

# Publish SBOM as build artifact
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)/sbom'
    ArtifactName: 'sbom-$(Build.BuildNumber)'
  displayName: 'Publish SBOM Artifact'
  condition: always()  # Publish even on failure for audit

# Upload SBOM to immutable storage (production only)
- task: AzureCLI@2
  inputs:
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: |
      az storage blob upload \
        --account-name atpcomplianceblob \
        --container-name sbom-archive \
        --name "atp-ingestion/$(Build.BuildNumber)/sbom.json" \
        --file "$(Build.ArtifactStagingDirectory)/sbom/atp-ingestion-sbom-$(Build.BuildNumber).json" \
        --metadata \
          BuildId=$(Build.BuildId) \
          CommitSha=$(Build.SourceVersion) \
          Pipeline=$(Build.DefinitionName) \
          GeneratedAt=$(date -u +%Y-%m-%dT%H:%M:%SZ)

      # Enable legal hold (7-year retention)
      az storage blob set-legal-hold \
        --account-name atpcomplianceblob \
        --container-name sbom-archive \
        --blob-name "atp-ingestion/$(Build.BuildNumber)/sbom.json" \
        --legal-hold true

      # Tag the blob for compliance reporting
      az storage blob tag set \
        --account-name atpcomplianceblob \
        --container-name sbom-archive \
        --name "atp-ingestion/$(Build.BuildNumber)/sbom.json" \
        --tags compliance=true retention=7years
  displayName: 'Archive SBOM with Legal Hold'
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))

SBOM Example (CycloneDX JSON):

{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "serialNumber": "urn:uuid:12345678-1234-1234-1234-123456789012",
  "version": 1,
  "metadata": {
    "timestamp": "2025-01-15T14:30:00Z",
    "tools": [
      {
        "vendor": "OWASP",
        "name": "CycloneDX",
        "version": "3.0.0"
      }
    ],
    "component": {
      "type": "application",
      "name": "ConnectSoft.ATP.Ingestion",
      "version": "1.0.123",
      "purl": "pkg:nuget/ConnectSoft.ATP.Ingestion@1.0.123",
      "properties": [
        {
          "name": "build:commitSha",
          "value": "a1b2c3d4e5f6..."
        },
        {
          "name": "build:pipelineId",
          "value": "12345"
        },
        {
          "name": "build:timestamp",
          "value": "2025-01-15T14:30:00Z"
        }
      ]
    }
  },
  "components": [
    {
      "type": "library",
      "name": "System.Text.Json",
      "version": "8.0.0",
      "purl": "pkg:nuget/System.Text.Json@8.0.0",
      "licenses": [
        {
          "license": {
            "id": "MIT",
            "url": "https://licenses.nuget.org/MIT"
          }
        }
      ],
      "hashes": [
        {
          "alg": "SHA-256",
          "content": "abc123def456..."
        }
      ],
      "externalReferences": [
        {
          "type": "website",
          "url": "https://www.nuget.org/packages/System.Text.Json"
        }
      ]
    },
    {
      "type": "library",
      "name": "Newtonsoft.Json",
      "version": "13.0.3",
      "purl": "pkg:nuget/Newtonsoft.Json@13.0.3",
      "licenses": [
        {
          "license": {
            "id": "MIT"
          }
        }
      ],
      "vulnerabilities": [
        {
          "bom-ref": "vuln-1",
          "id": "CVE-2024-12345",
          "source": {
            "name": "NVD",
            "url": "https://nvd.nist.gov/vuln/detail/CVE-2024-12345"
          },
          "ratings": [
            {
              "source": {
                "name": "NVD"
              },
              "score": 5.3,
              "severity": "medium",
              "method": "CVSSv3",
              "vector": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:L/A:N"
            }
          ]
        }
      ]
    }
  ]
}

SBOM Validation Requirements:

| Requirement | Validation | Blocker | Purpose |
|---|---|---|---|
| BOM Format | Must be valid CycloneDX JSON/XML | ✅ Yes | SBOM parser compatibility |
| Component Count | ≥10 components (sanity check) | ✅ Yes | Ensure dependencies captured |
| Version Info | All components have versions | ✅ Yes | Vulnerability correlation |
| License Info | ≥90% of components have license data | ⚠️ Warning | License compliance tracking |
| Hash Integrity | All components have SHA-256 hashes | ✅ Yes (prod) | Supply chain verification |
| Vulnerability Data | Known CVEs included if present | ℹ️ Info | Security awareness |
| Provenance | Build metadata (commit, pipeline, timestamp) | ✅ Yes | Audit trail |

SBOM Content Validation

Purpose: Ensure SBOM is complete and accurate, not just generated.

Validation Script (PowerShell):

<#
.SYNOPSIS
    Validate SBOM completeness and accuracy.
.DESCRIPTION
    Checks SBOM for required fields, component count, license data, and provenance.
#>

param(
    [string]$SbomPath = "sbom/atp-ingestion-sbom.json",
    [int]$MinComponents = 10,
    [double]$MinLicenseCoverage = 0.9  # 90%
)

Write-Host "Validating SBOM: $SbomPath"

# Load SBOM
$sbom = Get-Content -Path $SbomPath -Raw | ConvertFrom-Json

# Validation 1: BOM Format
if ($sbom.bomFormat -ne "CycloneDX") {
    Write-Error "Invalid BOM format: $($sbom.bomFormat) (expected: CycloneDX)"
    exit 1
}

Write-Host "✅ BOM Format: $($sbom.bomFormat) $($sbom.specVersion)"

# Validation 2: Component Count
$componentCount = $sbom.components.Count

if ($componentCount -lt $MinComponents) {
    Write-Error "Too few components: $componentCount (expected: ≥$MinComponents)"
    exit 1
}

Write-Host "✅ Component Count: $componentCount"

# Validation 3: License Coverage
$componentsWithLicense = $sbom.components | Where-Object { $_.licenses.Count -gt 0 }
$licenseCoverage = $componentsWithLicense.Count / $componentCount

if ($licenseCoverage -lt $MinLicenseCoverage) {
    Write-Warning "Low license coverage: $($licenseCoverage * 100)% (expected: ≥$($MinLicenseCoverage * 100)%)"
}
else {
    Write-Host "✅ License Coverage: $($licenseCoverage * 100)%"
}

# Validation 4: Provenance Metadata
$buildMetadata = $sbom.metadata.component.properties | Where-Object { $_.name -like "build:*" }

if ($buildMetadata.Count -lt 3) {
    Write-Error "Missing build provenance metadata (expected: commitSha, pipelineId, timestamp)"
    exit 1
}

Write-Host "✅ Provenance Metadata: $($buildMetadata.Count) properties"

# Validation 5: Vulnerability Data (optional but recommended)
$componentsWithVulns = $sbom.components | Where-Object { $_.vulnerabilities.Count -gt 0 }

if ($componentsWithVulns.Count -gt 0) {
    Write-Warning "Components with known vulnerabilities: $($componentsWithVulns.Count)"

    $criticalVulns = $componentsWithVulns.vulnerabilities | Where-Object { $_.ratings[0].severity -eq "critical" }

    if ($criticalVulns.Count -gt 0) {
        Write-Error "SBOM contains components with CRITICAL vulnerabilities: $($criticalVulns.Count)"
        exit 1
    }
}

Write-Host "✅ SBOM validation passed"

Provenance & Signing (Cosign)

Purpose: Cryptographically sign build artifacts and container images to ensure authenticity and integrity, preventing tampering and supply chain attacks.

Tool: Cosign (part of Sigstore project) — Container image signing and verification

Requirements:

  1. All production images must be signed with Cosign before push to ACR
  2. Signature verification enforced at deployment time (Kubernetes admission controller)
  3. Provenance attestation includes commit SHA, pipeline ID, build timestamp, approver identities
  4. Key management: Signing keys stored in Azure Key Vault (managed HSM for production)

Cosign Signing Workflow:

# Container Image Signing Gate
steps:
  # 1. Build Docker image
  - task: Docker@2
    inputs:
      command: 'build'
      dockerfile: '$(dockerfile)'
      repository: '$(imageRepository)'
      tags: '$(Build.BuildNumber)'
    displayName: 'Build Docker Image'

  # 2. Trivy scan (must pass before signing)
  - script: |
      trivy image --severity HIGH,CRITICAL --exit-code 1 \
        $(imageRepository):$(Build.BuildNumber)
    displayName: 'Trivy Scan'

  # 3. Install Cosign
  - script: |
      # Install Cosign CLI
      COSIGN_VERSION=2.2.2
      wget "https://github.com/sigstore/cosign/releases/download/v${COSIGN_VERSION}/cosign-linux-amd64"
      sudo mv cosign-linux-amd64 /usr/local/bin/cosign
      sudo chmod +x /usr/local/bin/cosign

      # Verify installation
      cosign version
    displayName: 'Install Cosign'

  # 4. Fetch signing key from Key Vault
  - task: AzureKeyVault@2
    inputs:
      azureSubscription: '$(azureSubscription)'
      keyVaultName: 'atp-keyvault-prod-eus'
      secretsFilter: 'CosignSigningKey,CosignPassword'
      runAsPreJob: false
    displayName: 'Fetch Cosign Signing Key'

  # 5. Sign container image
  - script: |
      # Export Cosign private key
      echo "$(CosignSigningKey)" > cosign.key

      # Sign image with provenance
      COSIGN_PASSWORD=$(CosignPassword) cosign sign \
        --key cosign.key \
        --annotations "build.commitSha=$(Build.SourceVersion)" \
        --annotations "build.pipelineId=$(Build.BuildId)" \
        --annotations "build.pipelineName=$(Build.DefinitionName)" \
        --annotations "build.timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        --annotations "build.branch=$(Build.SourceBranch)" \
        --annotations "build.buildNumber=$(Build.BuildNumber)" \
        $(containerRegistry)/$(imageRepository):$(Build.BuildNumber)

      # Keep cosign.key on disk for the attestation and verification steps;
      # it is deleted after the signature self-test.

      echo "✅ Image signed successfully"
    displayName: 'Sign Container Image with Cosign'
    env:
      COSIGN_PASSWORD: $(CosignPassword)

  # 6. Generate provenance attestation (SLSA)
  - script: |
      cosign attest \
        --key cosign.key \
        --predicate <(
          cat <<EOF
      {
        "buildType": "https://dev.azure.com/Pipelines/v1",
        "builder": {
          "id": "https://dev.azure.com/ConnectSoft/$(Build.DefinitionName)"
        },
        "invocation": {
          "configSource": {
            "uri": "$(Build.Repository.Uri)",
            "digest": {
              "sha1": "$(Build.SourceVersion)"
            },
            "entryPoint": "azure-pipelines.yml"
          }
        },
        "metadata": {
          "buildStartedOn": "$(Build.StartTime)",
          "buildFinishedOn": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
          "completeness": {
            "parameters": true,
            "environment": true,
            "materials": true
          },
          "reproducible": false
        },
        "materials": [
          {
            "uri": "$(Build.Repository.Uri)",
            "digest": {
              "sha1": "$(Build.SourceVersion)"
            }
          }
        ]
      }
      EOF
        ) \
        $(containerRegistry)/$(imageRepository):$(Build.BuildNumber)

      echo "✅ Provenance attestation generated"
    displayName: 'Generate Provenance Attestation'
    env:
      COSIGN_PASSWORD: $(CosignPassword)

  # 7. Verify signature (self-test)
  - script: |
      # Export public key from the signing key
      cosign public-key --key cosign.key > cosign.pub

      # Verify signature
      if cosign verify \
        --key cosign.pub \
        $(containerRegistry)/$(imageRepository):$(Build.BuildNumber); then
        echo "✅ Signature verified successfully"
      else
        echo "##vso[task.logissue type=error]Signature verification failed"
        exit 1
      fi

      # Clean up key material now that signing and verification are complete
      rm -f cosign.key cosign.pub
    displayName: 'Verify Image Signature'
    env:
      COSIGN_PASSWORD: $(CosignPassword)

  # 8. Push signed image to ACR
  - task: Docker@2
    inputs:
      command: 'push'
      repository: '$(imageRepository)'
      containerRegistry: '$(dockerRegistryServiceConnection)'
      tags: '$(Build.BuildNumber)'
    displayName: 'Push Signed Image to ACR'
    condition: succeeded()  # Only push if signing succeeded

Signature Verification at Deployment

Purpose: Enforce that only signed images can be deployed to production, preventing unauthorized or tampered images.

Kubernetes Admission Controller (Cosign Verification):

# Policy Controller (Sigstore)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cosign-verification-policy
  namespace: atp-prod
data:
  policy.yaml: |
    apiVersion: policy.sigstore.dev/v1beta1
    kind: ClusterImagePolicy
    metadata:
      name: atp-image-signing-policy
    spec:
      images:
      - glob: "connectsoft.azurecr.io/atp/**"

      authorities:
      - key:
          secretRef:
            name: cosign-public-key
            namespace: atp-prod

        # Require specific annotations (provenance)
        attestations:
        - name: build-provenance
          predicateType: https://slsa.dev/provenance/v0.2
          policy:
            type: cue
            data: |
              builder.id: "https://dev.azure.com/ConnectSoft/*"
              metadata.completeness.materials: true

---
apiVersion: v1
kind: Secret
metadata:
  name: cosign-public-key
  namespace: atp-prod
type: Opaque
stringData:
  cosign.pub: |
    -----BEGIN PUBLIC KEY-----
    MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE...
    -----END PUBLIC KEY-----

Deployment Validation:

#!/bin/bash
# verify-image-signature.sh

IMAGE=$1  # e.g., connectsoft.azurecr.io/atp/ingestion:1.0.123

echo "Verifying image signature: $IMAGE"

# Fetch public key from Key Vault
az keyvault secret show \
  --vault-name atp-keyvault-prod-eus \
  --name CosignPublicKey \
  --query value -o tsv > cosign.pub

# Verify signature
cosign verify --key cosign.pub $IMAGE

if [ $? -eq 0 ]; then
  echo "✅ Image signature valid"

  # Verify provenance attestation
  cosign verify-attestation --key cosign.pub $IMAGE

  if [ $? -eq 0 ]; then
    echo "✅ Provenance attestation valid"
    exit 0
  else
    echo "❌ Provenance attestation invalid or missing"
    exit 1
  fi
else
  echo "❌ Image signature invalid or missing"
  echo "   Unsigned images are not allowed in production."
  exit 1
fi

Supply Chain Attack Prevention

Purpose: Mitigate supply chain attack vectors through dependency pinning, checksum verification, and isolated build environments.

Supply Chain Security Controls:

| Control | Implementation | Enforcement | Risk Mitigated |
|---|---|---|---|
| Dependency Pinning | Lock file with exact versions | ✅ Required | Prevent malicious updates |
| Checksum Verification | NuGet package hash validation | ✅ Automatic | Detect package tampering |
| Isolated Build Agents | Ephemeral agents, no internet access | ✅ Prod builds | Prevent agent compromise |
| Two-Person Review | PR requires 2 approvals for dependency changes | ✅ Production | Prevent single-actor malicious PRs |
| SBOM Comparison | Diff SBOMs between builds | ⚠️ Warning | Detect unexpected dependency changes |
| Private Package Feed | Azure Artifacts mirrors public NuGet | ✅ Recommended | Prevent dependency confusion |
| Code Signing | All builds signed with Authenticode | ✅ Prod only | Verify publisher identity |

Dependency Lock File (NuGet packages.lock.json):

<!-- Enable lock file in .csproj -->
<Project Sdk="Microsoft.NET.Sdk.Web">
  <PropertyGroup>
    <!-- Enable NuGet lock file -->
    <RestorePackagesWithLockFile>true</RestorePackagesWithLockFile>

    <!-- In CI, locked mode fails restore if packages.lock.json
         does not match the project's declared dependencies -->
    <RestoreLockedMode Condition="'$(CI)' == 'true'">true</RestoreLockedMode>
  </PropertyGroup>
</Project>
// packages.lock.json (auto-generated)
{
  "version": 1,
  "dependencies": {
    "net8.0": {
      "System.Text.Json": {
        "type": "Direct",
        "requested": "[8.0.0, )",
        "resolved": "8.0.0",
        "contentHash": "sha512-abc123def456...",
        "dependencies": {
          "System.Runtime": "8.0.0",
          "System.Memory": "8.0.0"
        }
      },
      "Newtonsoft.Json": {
        "type": "Direct",
        "requested": "[13.0.3, )",
        "resolved": "13.0.3",
        "contentHash": "sha512-xyz789abc123..."
      }
    }
  }
}
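
In locked mode, NuGet recomputes each package's SHA-512 and compares it to the `contentHash` recorded above. The check reduces to the following standalone sketch, assuming `openssl` and `base64` are available (file names are illustrative, and the exact bytes NuGet hashes can differ slightly for signed packages):

```shell
#!/usr/bin/env bash
# Recompute a file's SHA-512, base64-encode it, and compare it to the
# expected contentHash (with the "sha512-" prefix already stripped).
# Illustrates the tamper check NuGet performs in locked mode.
verify_content_hash() {
  file=$1 expected_b64=$2
  actual_b64=$(openssl dgst -sha512 -binary "$file" | base64 | tr -d '\n')
  [ "$actual_b64" = "$expected_b64" ]
}

# Demo with a stand-in package file
pkg=$(mktemp)
printf 'example package bytes' > "$pkg"
good_hash=$(openssl dgst -sha512 -binary "$pkg" | base64 | tr -d '\n')

verify_content_hash "$pkg" "$good_hash"   && echo "hash OK"
verify_content_hash "$pkg" "tampered=="   || echo "hash mismatch: fail the restore"
```

A mismatch means the on-disk package differs from what the lock file was generated against, which is exactly the tampering signal the "Checksum Verification" control is meant to catch.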

SBOM Diff Analysis (detect unexpected changes):

#!/bin/bash
# sbom-diff.sh

PREVIOUS_SBOM=$1  # Previous build SBOM
CURRENT_SBOM=$2   # Current build SBOM

echo "Analyzing SBOM changes..."

# Extract component lists
jq -r '.components[] | "\(.name)@\(.version)"' "$PREVIOUS_SBOM" | sort > previous-components.txt
jq -r '.components[] | "\(.name)@\(.version)"' "$CURRENT_SBOM" | sort > current-components.txt

# Detect added dependencies
ADDED=$(comm -13 previous-components.txt current-components.txt)

if [ -n "$ADDED" ]; then
  echo "⚠️  New dependencies added:"
  echo "$ADDED"

  # Create work item for review
  az boards work-item create \
    --type "Task" \
    --title "SBOM Review: New Dependencies in Build $(Build.BuildNumber)" \
    --description "New dependencies detected:\n\n$ADDED\n\nReview for security and license compliance." \
    --assigned-to "security-team@connectsoft.example"
fi

# Detect removed dependencies
REMOVED=$(comm -23 previous-components.txt current-components.txt)

if [ -n "$REMOVED" ]; then
  echo "⚠️  Dependencies removed:"
  echo "$REMOVED"
fi

# Detect version changes (same package, different version)
CHANGED=$(comm -12 <(cut -d'@' -f1 previous-components.txt) <(cut -d'@' -f1 current-components.txt) | while read PKG; do
  PREV_VER=$(grep "^$PKG@" previous-components.txt | cut -d'@' -f2)
  CURR_VER=$(grep "^$PKG@" current-components.txt | cut -d'@' -f2)

  if [ "$PREV_VER" != "$CURR_VER" ]; then
    echo "$PKG: $PREV_VER → $CURR_VER"
  fi
done)

if [ -n "$CHANGED" ]; then
  echo "ℹ️  Dependencies updated:"
  echo "$CHANGED"
fi

echo "✅ SBOM diff analysis complete"

Private Package Feed (Dependency Confusion Prevention)

Purpose: Prevent dependency confusion attacks where attackers publish malicious packages with the same name as internal packages.

Strategy: Private Azure Artifacts feed that mirrors public NuGet with approval workflow.

Azure Artifacts Feed Configuration:

# Azure Artifacts: ATP-NuGet-Feed
feed:
  name: ATP-NuGet-Feed
  visibility: private

  upstreams:
    # Mirror public NuGet (with caching)
    - name: nuget.org
      protocol: nuget
      location: https://api.nuget.org/v3/index.json
      includePrerelease: false

      # Package allowlist (only approved packages)
      upstreamBehavior: allowExternalVersionsOnly

  permissions:
    # Feed readers (developers, build agents)
    - identity: Build Service (ConnectSoft)
      role: Reader

    - identity: Contributors (ConnectSoft)
      role: Reader

    # Feed publishers (only platform team)
    - identity: Platform-Team
      role: Contributor

NuGet.config (consume private feed):

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <clear />
    <!-- Private feed (takes precedence) -->
    <add key="ATP-NuGet-Feed" value="https://pkgs.dev.azure.com/ConnectSoft/_packaging/ATP-NuGet-Feed/nuget/v3/index.json" />

    <!-- Public NuGet as fallback (via upstream) -->
    <!-- <add key="nuget.org" value="https://api.nuget.org/v3/index.json" /> -->
  </packageSources>

  <packageSourceCredentials>
    <ATP-NuGet-Feed>
      <add key="Username" value="az" />
      <add key="ClearTextPassword" value="%SYSTEM_ACCESSTOKEN%" />
    </ATP-NuGet-Feed>
  </packageSourceCredentials>
</configuration>
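
Since `<clear />` plus a single private source already removes most confusion risk, a belt-and-suspenders guard is to reject any internal-prefixed package that would resolve from a public source. A minimal sketch (the prefix, feed name, and inputs are illustrative):

```shell
#!/usr/bin/env bash
# Reject any package that uses the internal naming prefix but would be
# restored from a public source — the classic dependency-confusion symptom.
INTERNAL_PREFIX="ConnectSoft."

check_package_source() {
  package=$1 source=$2
  case "$package" in
    "$INTERNAL_PREFIX"*)
      if [ "$source" != "ATP-NuGet-Feed" ]; then
        echo "BLOCK: $package must resolve from ATP-NuGet-Feed, not $source"
        return 1
      fi ;;
  esac
  echo "OK: $package from $source"
}

check_package_source ConnectSoft.ATP.Common ATP-NuGet-Feed      # → OK
check_package_source ConnectSoft.ATP.Common nuget.org || true   # → BLOCK
check_package_source Newtonsoft.Json nuget.org                  # → OK
```

Such a guard could run against restore logs in CI before the build proceeds; it complements, rather than replaces, the feed-level controls above.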

Dependency Confusion Detection:

// Detect potential dependency confusion attacks
public class DependencyConfusionDetector
{
    public async Task<bool> DetectConfusionRiskAsync(string packageName, string version)
    {
        // Check if package exists in both public and private feeds
        var publicPackage = await _nugetClient.GetPackageAsync("https://api.nuget.org/v3/index.json", packageName, version);
        var privatePackage = await _nugetClient.GetPackageAsync("https://pkgs.dev.azure.com/ConnectSoft/_packaging/ATP-NuGet-Feed/nuget/v3/index.json", packageName, version);

        if (publicPackage != null && privatePackage != null)
        {
            // Both feeds have this package; potential confusion risk

            // Compare hashes
            if (publicPackage.Hash != privatePackage.Hash)
            {
                // CRITICAL: Same package name/version, different content
                await AlertSecurityTeamAsync(new
                {
                    PackageName = packageName,
                    Version = version,
                    PublicHash = publicPackage.Hash,
                    PrivateHash = privatePackage.Hash,
                    Severity = "Critical",
                    Recommendation = "Investigate immediately; potential supply chain attack"
                });

                return true;  // Confusion detected
            }
        }

        return false;  // No confusion detected
    }
}

SLSA Provenance (Supply Chain Levels for Software Artifacts)

Purpose: Provide verifiable provenance for all build artifacts, documenting the complete build process from source to artifact.

SLSA Level: ATP targets SLSA Level 3 (Hardened Builds)

| SLSA Level | Requirements | ATP Status |
|---|---|---|
| SLSA 1 | Provenance exists; build process documented | ✅ Achieved |
| SLSA 2 | Signed provenance; service-generated (not user) | ✅ Achieved |
| SLSA 3 | Hardened builds; isolated, ephemeral build environments | 🚧 In Progress (Q2 2025) |
| SLSA 4 | Two-person review; hermetic builds | 🎯 Target (Q4 2025) |

Provenance Attestation (SLSA v1.0):

{
  "_type": "https://in-toto.io/Statement/v1",
  "subject": [
    {
      "name": "connectsoft.azurecr.io/atp/ingestion",
      "digest": {
        "sha256": "abc123def456..."
      }
    }
  ],
  "predicateType": "https://slsa.dev/provenance/v1",
  "predicate": {
    "buildDefinition": {
      "buildType": "https://dev.azure.com/Pipelines/v1",
      "externalParameters": {
        "repository": "https://github.com/ConnectSoft/ATP.Ingestion",
        "ref": "refs/heads/main",
        "commit": "a1b2c3d4e5f6..."
      },
      "internalParameters": {
        "azurePipeline": "ATP-Ingestion-CI",
        "buildId": "12345"
      },
      "resolvedDependencies": [
        {
          "uri": "pkg:nuget/System.Text.Json@8.0.0",
          "digest": {
            "sha256": "abc123..."
          }
        }
      ]
    },
    "runDetails": {
      "builder": {
        "id": "https://dev.azure.com/ConnectSoft/_build",
        "version": {
          "azure-pipelines": "1.0"
        }
      },
      "metadata": {
        "invocationId": "$(Build.BuildId)",
        "startedOn": "2025-01-15T14:00:00Z",
        "finishedOn": "2025-01-15T14:15:00Z"
      },
      "byproducts": [
        {
          "name": "SBOM",
          "uri": "https://artifacts.connectsoft.com/sbom/atp-ingestion-1.0.123.json",
          "digest": {
            "sha256": "xyz789..."
          }
        }
      ]
    }
  }
}

Provenance Verification (Kubernetes):

# Deployment with provenance verification
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-prod
  annotations:
    # Require SLSA provenance
    admission.sigstore.dev/require-provenance: "true"
    admission.sigstore.dev/min-slsa-level: "2"
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:1.0.123
        # Image must be signed + have provenance attestation

SBOM Distribution & Consumption

Purpose: Make SBOMs accessible to security teams, customers, and auditors for transparency and compliance.

SBOM Distribution Channels:

sbomDistribution:
  internal:
    # Azure DevOps Artifacts (for internal teams)
    location: https://dev.azure.com/ConnectSoft/_artifacts/feed/ATP-SBOM
    retention: 7 years
    access: Security team, compliance team, auditors

  external:
    # Customer-accessible SBOM portal (for transparency)
    location: https://sbom.connectsoft.com/atp/
    format: HTML (rendered from JSON)
    access: Authenticated customers
    contains: Public SBOM (no internal infrastructure details)

  regulators:
    # Auditor-accessible immutable storage
    location: Azure Blob (read-only SAS token)
    retention: 7 years + legal hold
    access: External auditors (SOC 2, GDPR DPAs)

SBOM API (customer access):

// SBOM query API for customers
[ApiController]
[Route("api/sbom")]
[Authorize(Roles = "Customer")]
public class SbomController : ControllerBase
{
    [HttpGet("{product}/{version}")]
    public async Task<IActionResult> GetSbom(string product, string version)
    {
        // Validate customer can access this SBOM (their tenant uses this version)
        if (!await _authService.CanAccessSbomAsync(User.Identity.Name, product, version))
        {
            return Forbid();
        }

        // Fetch SBOM from blob storage
        var sbom = await _blobService.GetSbomAsync(product, version);

        if (sbom == null)
        {
            return NotFound($"SBOM not found for {product} version {version}");
        }

        // Redact internal infrastructure details
        var publicSbom = RedactInternalDetails(sbom);

        return Ok(publicSbom);
    }

    private SbomDocument RedactInternalDetails(SbomDocument sbom)
    {
        // Remove internal build metadata
        sbom.Metadata.Properties = sbom.Metadata.Properties
            .Where(p => !p.Name.StartsWith("build:internal"))
            .ToList();

        // Remove internal-only components
        sbom.Components = sbom.Components
            .Where(c => !c.Name.Contains("Internal"))
            .ToList();

        return sbom;
    }
}

Artifact Attestation (SLSA Build L3)

Purpose: Provide cryptographic proof that artifacts were built by trusted CI/CD pipelines without manual intervention.

in-toto Attestation (Azure Pipelines):

# Generate in-toto attestation
- script: |
    # Install in-toto
    pip install in-toto

    # Generate link metadata (build step attestation)
    in-toto-run \
      --step-name build \
      --key $(Build.ArtifactStagingDirectory)/signing-key.pem \
      --materials $(Build.SourcesDirectory)/ \
      --products $(Build.ArtifactStagingDirectory)/publish/ \
      -- dotnet publish -c Release -o $(Build.ArtifactStagingDirectory)/publish

    # Generate layout (defines expected build steps)
    cat > layout.json <<EOF
    {
      "_type": "layout",
      "steps": [
        {
          "name": "build",
          "expected_materials": [
            ["MATCH", "**/*.cs", "WITH", "PRODUCTS", "FROM", "checkout"]
          ],
          "expected_products": [
            ["CREATE", "publish/*.dll"],
            ["CREATE", "publish/*.json"]
          ],
          "pubkeys": ["<keyid of signing-key.pub>"]
        }
      ],
      "inspect": []
    }
    EOF

    # Sign layout
    in-toto-sign \
      --key $(Build.ArtifactStagingDirectory)/signing-key.pem \
      --file layout.json

    echo "✅ in-toto attestation generated"
  displayName: 'Generate in-toto Attestation'

Supply Chain Security Metrics

Purpose: Track supply chain health and identify anomalies.

Supply Chain KQL Queries:

// Dependency change frequency (last 90 days)
customEvents
| where name == "DependencyChanged"
| where timestamp > ago(90d)
| extend PackageName = tostring(customDimensions.PackageName)
| extend OldVersion = tostring(customDimensions.OldVersion)
| extend NewVersion = tostring(customDimensions.NewVersion)
| summarize ChangeCount = count() by PackageName
| order by ChangeCount desc
| take 20

// SBOM generation success rate
customEvents
| where name in ("SbomGenerated", "SbomGenerationFailed")
| where timestamp > ago(30d)
| summarize 
    TotalAttempts = count(),
    Successful = countif(name == "SbomGenerated"),
    Failed = countif(name == "SbomGenerationFailed"),
    SuccessRate = 100.0 * countif(name == "SbomGenerated") / count()

// Unsigned images detected in deployment attempts
customEvents
| where name == "UnsignedImageRejected"
| where timestamp > ago(7d)
| extend Image = tostring(customDimensions.Image)
| extend Namespace = tostring(customDimensions.Namespace)
| summarize RejectionCount = count() by Image, Namespace
| order by RejectionCount desc

// Dependency pinning violations (lock file mismatch)
customEvents
| where name == "LockFileMismatch"
| where timestamp > ago(30d)
| extend PackageName = tostring(customDimensions.PackageName)
| extend ExpectedVersion = tostring(customDimensions.ExpectedVersion)
| extend ActualVersion = tostring(customDimensions.ActualVersion)
| project timestamp, PackageName, ExpectedVersion, ActualVersion, BuildId = tostring(customDimensions.BuildId)

Summary

  • SBOM & Supply Chain Gates: 2-3 minute execution; SBOM generation mandatory for all builds
  • CycloneDX SBOM: JSON/XML format with complete dependency inventory (versions, licenses, CVEs, hashes)
  • SBOM Validation: 7 requirements (valid format, ≥10 components, versions, licenses, hashes, vulnerabilities, provenance)
  • SBOM Content Validator: PowerShell script checking format, component count, license coverage, provenance metadata
  • Cosign Signing: All production images cryptographically signed with Key Vault-stored keys
  • Signing Workflow: 8-step pipeline (Build → Scan → Install Cosign → Fetch Key → Sign → Generate Provenance → Verify → Push)
  • Provenance Attestation: SLSA v1.0 format with builder ID, materials, metadata, timestamps
  • Signature Verification: Kubernetes admission controller enforces signature validation at deployment time
  • Supply Chain Controls: 7 controls (dependency pinning, checksum verification, isolated agents, two-person review, SBOM diff, private feed, code signing)
  • Dependency Lock File: NuGet packages.lock.json with SHA-512 hashes; RestoreLockedMode enforced in CI
  • SBOM Diff Analysis: Bash script detecting added/removed/updated dependencies; creates work items for security review
  • Dependency Confusion Prevention: Private Azure Artifacts feed mirrors public NuGet with allowlist
  • SLSA Levels: Currently SLSA 2 (signed provenance); targeting SLSA 3 (Q2 2025), SLSA 4 (Q4 2025)
  • SBOM Distribution: Internal (Azure Artifacts), external (customer portal), regulators (immutable blob with SAS tokens)
  • Supply Chain Metrics: KQL queries for dependency changes, SBOM success rate, unsigned image rejections, lock file mismatches

Compliance Gates (Deep Dive)

Compliance gates ensure regulatory adherence (GDPR, HIPAA, SOC 2) through automated validation of audit logging, PII protection, and compliance controls. ATP enforces 100% audit logging coverage for state-mutating operations and zero tolerance for PII leakage in logs or telemetry.

Philosophy: Compliance is built-in, not bolted-on. ATP embeds compliance controls into the CI/CD pipeline, making it impossible to deploy non-compliant code to production. Every compliance requirement is validated, documented, and auditable.

Compliance Gate Workflow

graph TD
    A[SBOM Gates Passed] --> B[Audit Logging Validation]
    B --> C{100% Coverage?}
    C -->|No| D[Audit Logging Incomplete ❌]
    C -->|Yes| E[PII Redaction Validation]
    E --> F{No PII in Logs?}
    F -->|PII Found| G[PII Leakage Detected ❌]
    F -->|Clean| H[GDPR/HIPAA Checklist]
    H --> I{All Items Pass?}
    I -->|No| J[Compliance Checklist Failed ❌]
    I -->|Yes| K[Data Classification Validation]
    K --> L{Sensitive Data Classified?}
    L -->|No| M[Classification Missing ❌]
    L -->|Yes| N[Retention Policy Validation]
    N --> O{Policies Configured?}
    O -->|No| P[Retention Policy Missing ❌]
    O -->|Yes| Q[Compliance Gates Passed ✅]

    D --> R[Pipeline Stopped]
    G --> R
    J --> R
    M --> R
    P --> R
    Q --> S[Proceed to Staging Deployment]

    style D fill:#ff6b6b
    style G fill:#ff6b6b
    style J fill:#ff6b6b
    style M fill:#ff6b6b
    style P fill:#ff6b6b
    style Q fill:#90EE90

Typical Compliance Gate Duration: 2-3 minutes


Audit Logging Validation

Purpose: Ensure 100% of state-mutating operations emit audit events to maintain complete audit trail for compliance.

Requirement: Every method that creates, updates, or deletes data must call IAuditLogger.LogAsync().

Tool: Custom static analyzer that scans C# code for audit logging calls

Threshold: 100% — No exceptions; all state mutations must be audited

Audit Logging Validator (C# Roslyn Analyzer):

// Custom Roslyn analyzer to enforce audit logging
using System.Collections.Immutable;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Diagnostics;

[DiagnosticAnalyzer(LanguageNames.CSharp)]
public class AuditLoggingAnalyzer : DiagnosticAnalyzer
{
    private const string DiagnosticId = "ATP001";
    private const string Title = "State-mutating method missing audit logging";
    private const string MessageFormat = "Method '{0}' modifies state but does not call IAuditLogger.LogAsync()";
    private const string Category = "Compliance";

    private static readonly DiagnosticDescriptor Rule = new DiagnosticDescriptor(
        DiagnosticId,
        Title,
        MessageFormat,
        Category,
        DiagnosticSeverity.Error,
        isEnabledByDefault: true,
        description: "All methods that create, update, or delete data must emit audit events.");

    public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics => ImmutableArray.Create(Rule);

    public override void Initialize(AnalysisContext context)
    {
        context.ConfigureGeneratedCodeAnalysis(GeneratedCodeAnalysisFlags.None);
        context.EnableConcurrentExecution();
        context.RegisterSyntaxNodeAction(AnalyzeMethod, SyntaxKind.MethodDeclaration);
    }

    private void AnalyzeMethod(SyntaxNodeAnalysisContext context)
    {
        var methodDeclaration = (MethodDeclarationSyntax)context.Node;
        var methodSymbol = context.SemanticModel.GetDeclaredSymbol(methodDeclaration);

        if (methodSymbol == null || methodSymbol.IsAbstract || methodSymbol.IsExtern)
            return;

        // Check if method is state-mutating (has Create/Update/Delete/Save in name)
        var methodName = methodSymbol.Name;
        var isStateMutating = methodName.Contains("Create") || 
                              methodName.Contains("Update") || 
                              methodName.Contains("Delete") || 
                              methodName.Contains("Save") ||
                              methodName.Contains("Add") ||
                              methodName.Contains("Remove");

        if (!isStateMutating)
            return;

        // Check if method returns Task (async)
        var returnType = methodSymbol.ReturnType;
        if (returnType.Name != "Task")
            return;

        // Check if method calls IAuditLogger.LogAsync()
        var invocations = methodDeclaration.DescendantNodes()
            .OfType<InvocationExpressionSyntax>();

        var hasAuditLogging = invocations.Any(invocation =>
        {
            var memberAccess = invocation.Expression as MemberAccessExpressionSyntax;
            if (memberAccess?.Name.Identifier.Text == "LogAsync")
            {
                var symbolInfo = context.SemanticModel.GetSymbolInfo(memberAccess);
                var symbol = symbolInfo.Symbol as IMethodSymbol;

                // Check if method is from IAuditLogger interface
                return symbol?.ContainingType.Name == "IAuditLogger";
            }
            return false;
        });

        if (!hasAuditLogging)
        {
            var diagnostic = Diagnostic.Create(Rule, methodDeclaration.Identifier.GetLocation(), methodSymbol.Name);
            context.ReportDiagnostic(diagnostic);
        }
    }
}

Audit Logging Validation Script (PowerShell):

<#
.SYNOPSIS
    Validate audit logging coverage in ATP services.
.DESCRIPTION
    Scans C# source code for state-mutating methods without audit logging calls.
#>

param(
    [string]$Path = $env:BUILD_SOURCESDIRECTORY,  # set by Azure Pipelines; $(Build.SourcesDirectory) macros are not expanded inside .ps1 files
    [int]$Threshold = 100  # 100% coverage required
)

Write-Host "Validating audit logging coverage in: $Path"

# Find all C# files (exclude tests, migrations, generated)
$csFiles = Get-ChildItem -Path $Path -Recurse -Filter *.cs |
    Where-Object { 
        $_.FullName -notmatch "\\Tests\\" -and
        $_.FullName -notmatch "\\Migrations\\" -and
        $_.FullName -notmatch "\\.Generated\\.cs$"
    }

$stateMutatingMethods = @()
$methodsWithAuditLogging = @()

foreach ($file in $csFiles) {
    $content = Get-Content -Path $file.FullName -Raw

    # Find state-mutating methods (Create, Update, Delete, Save, Add, Remove)
    $methodPattern = 'public\s+(async\s+)?Task<?\w*>?\s+(Create|Update|Delete|Save|Add|Remove)\w*\s*\('
    $methodMatches = [regex]::Matches($content, $methodPattern)  # avoid shadowing the automatic $Matches variable

    foreach ($match in $methodMatches) {
        # The method name is the last token before the opening parenthesis
        $methodName = $match.Groups[0].Value.Split('(')[0].Split()[-1]
        $stateMutatingMethods += [PSCustomObject]@{
            File = $file.Name
            Method = $methodName
            Line = ($content.Substring(0, $match.Index) -split "`n").Count
        }

        # Extract the method body by brace matching so that nested blocks and
        # object initializers do not truncate it prematurely
        $bodyStart = $content.IndexOf("{", $match.Index)
        $depth = 0
        $methodEnd = $bodyStart
        for ($i = $bodyStart; $i -lt $content.Length; $i++) {
            if ($content[$i] -eq '{') { $depth++ }
            if ($content[$i] -eq '}') {
                $depth--
                if ($depth -eq 0) { $methodEnd = $i; break }
            }
        }
        $methodBody = $content.Substring($bodyStart, $methodEnd - $bodyStart + 1)

        if ($methodBody -match '(IAuditLogger|_auditLogger|auditLogger)\.LogAsync\(') {
            $methodsWithAuditLogging += $methodName
        }
    }
}

$totalStateMutatingMethods = $stateMutatingMethods.Count
$totalWithAuditLogging = $methodsWithAuditLogging.Count
$coveragePercent = if ($totalStateMutatingMethods -gt 0) { 
    ($totalWithAuditLogging / $totalStateMutatingMethods) * 100 
} else { 
    100 
}

Write-Host "Audit Logging Coverage:"
Write-Host "  Total state-mutating methods: $totalStateMutatingMethods"
Write-Host "  Methods with audit logging: $totalWithAuditLogging"
Write-Host "  Coverage: $($coveragePercent.ToString('F1'))%"

# Report methods without audit logging
$methodsWithoutLogging = $stateMutatingMethods | Where-Object { $methodsWithAuditLogging -notcontains $_.Method }

if ($methodsWithoutLogging.Count -gt 0) {
    Write-Host "`n❌ Methods without audit logging:" -ForegroundColor Red

    foreach ($method in $methodsWithoutLogging | Select-Object -First 10) {
        Write-Host "  - $($method.File):$($method.Line) $($method.Method)" -ForegroundColor Red
    }

    if ($methodsWithoutLogging.Count -gt 10) {
        Write-Host "  ... and $($methodsWithoutLogging.Count - 10) more"
    }
}

# Fail if coverage below threshold
if ($coveragePercent -lt $Threshold) {
    Write-Error "Audit logging coverage ($($coveragePercent.ToString('F1'))%) below threshold ($Threshold%)"
    Write-Error "Add IAuditLogger.LogAsync() calls to all state-mutating methods."
    exit 1
}

Write-Host "`n✅ Audit logging validation passed" -ForegroundColor Green

Azure Pipelines Integration:

# Audit Logging Validation Gate
- task: PowerShell@2
  inputs:
    filePath: 'scripts/validate-audit-logging.ps1'
    arguments: '-Path "$(Build.SourcesDirectory)" -Threshold 100'
    pwsh: true
  displayName: 'Validate Audit Logging Coverage'
  continueOnError: false  # Fail build if coverage < 100%

Example: Correct Audit Logging:

// ✅ GOOD: State-mutating method with audit logging
public class AuditEventService
{
    private readonly IAuditLogger _auditLogger;
    private readonly IAuditEventRepository _repository;

    public AuditEventService(IAuditLogger auditLogger, IAuditEventRepository repository)
    {
        _auditLogger = auditLogger;
        _repository = repository;
    }

    public async Task<AuditEvent> CreateEventAsync(CreateAuditEventRequest request, CancellationToken ct)
    {
        var evt = new AuditEvent
        {
            Id = Guid.NewGuid(),
            TenantId = request.TenantId,
            Action = request.Action,
            Timestamp = DateTime.UtcNow,
            UserId = request.UserId
        };

        await _repository.AddAsync(evt, ct);

        // ✅ Audit logging call (REQUIRED)
        await _auditLogger.LogAsync(new AuditLogEntry
        {
            EntityType = nameof(AuditEvent),
            EntityId = evt.Id.ToString(),
            Operation = AuditOperation.Create,
            Timestamp = DateTime.UtcNow,
            UserId = request.UserId,
            TenantId = request.TenantId,
            Changes = new { evt }
        }, ct);

        return evt;
    }

    public async Task<AuditEvent> UpdateEventAsync(Guid id, UpdateAuditEventRequest request, CancellationToken ct)
    {
        var evt = await _repository.GetByIdAsync(id, ct);

        if (evt == null)
            throw new NotFoundException($"Audit event {id} not found");

        var oldValues = new { evt.Status, evt.Notes };

        evt.Status = request.Status;
        evt.Notes = request.Notes;

        await _repository.UpdateAsync(evt, ct);

        // ✅ Audit logging call with before/after values
        await _auditLogger.LogAsync(new AuditLogEntry
        {
            EntityType = nameof(AuditEvent),
            EntityId = evt.Id.ToString(),
            Operation = AuditOperation.Update,
            Timestamp = DateTime.UtcNow,
            UserId = request.UserId,
            TenantId = evt.TenantId,
            Changes = new 
            { 
                Before = oldValues, 
                After = new { evt.Status, evt.Notes } 
            }
        }, ct);

        return evt;
    }
}

// ❌ BAD: State-mutating method without audit logging
public async Task DeleteEventAsync(Guid id, CancellationToken ct)
{
    var evt = await _repository.GetByIdAsync(id, ct);
    await _repository.DeleteAsync(evt, ct);

    // ❌ MISSING: IAuditLogger.LogAsync() call
    // This will be flagged by ATP001 analyzer and fail the build
}

PII Redaction Validation

Purpose: Prevent Personally Identifiable Information (PII) from appearing in logs, telemetry, or error messages to comply with GDPR/HIPAA.

Requirement: All sensitive data must be redacted before logging using custom attributes or redaction filters.

Tool: Custom log parser that scans for PII patterns (email, phone, SSN, credit card)

Threshold: 0 — No raw PII allowed in any log statements

PII Redaction Validator (PowerShell):

<#
.SYNOPSIS
    Validate PII redaction in ATP services.
.DESCRIPTION
    Scans C# source code and log statements for unredacted PII (email, phone, SSN).
#>

param(
    [string]$Path = $env:BUILD_SOURCESDIRECTORY  # set by Azure Pipelines; macros are not expanded inside .ps1 files
)

Write-Host "Validating PII redaction in: $Path"

# PII patterns to detect
$piiPatterns = @{
    "Email" = '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    "Phone" = '\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b'
    "SSN" = '\b\d{3}-\d{2}-\d{4}\b'
    "CreditCard" = '\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b'
}

$violations = @()

# Find all C# files
$csFiles = Get-ChildItem -Path $Path -Recurse -Filter *.cs |
    Where-Object { $_.FullName -notmatch "\\Tests\\" }

foreach ($file in $csFiles) {
    $content = Get-Content -Path $file.FullName -Raw

    # Find log statements (ILogger calls)
    $logPattern = '(_logger|logger|_log|log)\.(Log\w+|Information|Warning|Error|Debug)\('
    $logMatches = [regex]::Matches($content, $logPattern)

    foreach ($logMatch in $logMatches) {
        # Extract log statement (up to closing parenthesis)
        $statementStart = $logMatch.Index
        $depth = 0
        $statementEnd = $statementStart

        for ($i = $statementStart; $i -lt $content.Length; $i++) {
            if ($content[$i] -eq '(') { $depth++ }
            if ($content[$i] -eq ')') { 
                $depth--
                if ($depth -eq 0) {
                    $statementEnd = $i
                    break
                }
            }
        }

        $logStatement = $content.Substring($statementStart, $statementEnd - $statementStart + 1)

        # Check for PII patterns in log statement
        foreach ($piiType in $piiPatterns.Keys) {
            if ($logStatement -match $piiPatterns[$piiType]) {
                $lineNumber = ($content.Substring(0, $statementStart) -split "`n").Count

                $violations += [PSCustomObject]@{
                    File = $file.Name
                    Line = $lineNumber
                    PIIType = $piiType
                    Statement = $(if ($logStatement.Length -gt 100) { $logStatement.Substring(0, 100) + "..." } else { $logStatement })
                }
            }
        }
    }

    # Also check for direct string interpolation with user data
    if ($content -match '\$"\{.*?(Email|Phone|SSN|UserId|TenantId).*?\}"' -and $content -match '_logger') {
        Write-Warning "$($file.Name): Potential PII in string interpolation (manual review required)"
    }
}

if ($violations.Count -gt 0) {
    Write-Host "`n❌ PII detected in log statements:" -ForegroundColor Red
    Write-Host "   Total violations: $($violations.Count)" -ForegroundColor Red
    Write-Host ""

    foreach ($violation in $violations | Select-Object -First 10) {
        Write-Host "  - $($violation.File):$($violation.Line) $($violation.PIIType)" -ForegroundColor Red
        Write-Host "    $($violation.Statement)" -ForegroundColor Yellow
    }

    if ($violations.Count -gt 10) {
        Write-Host "  ... and $($violations.Count - 10) more violations"
    }

    Write-Host "`n📚 Remediation:" -ForegroundColor Yellow
    Write-Host "  1. Use redaction attributes: [EmailData], [PhoneData], [PersonalData]"
    Write-Host "  2. Use structured logging with redacted parameters"
    Write-Host "  3. Enable logging redaction in appsettings.json"
    Write-Host ""

    exit 1
}

# Check if redaction is enabled in appsettings.json
$appsettings = Get-ChildItem -Path $Path -Recurse -Filter appsettings.json | Select-Object -First 1

if ($appsettings) {
    $config = Get-Content -Path $appsettings.FullName -Raw | ConvertFrom-Json

    if ($config.Compliance.EnableLoggingRedaction -ne $true) {
        Write-Warning "Logging redaction not enabled in appsettings.json"
        Write-Warning "  Set Compliance.EnableLoggingRedaction: true"
    }
    else {
        Write-Host "✅ Logging redaction enabled in appsettings.json"
    }
}

Write-Host "`n✅ PII redaction validation passed" -ForegroundColor Green
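The four PII patterns above are plain regular expressions, so they can be sanity-checked outside the pipeline. A quick, language-agnostic check (sketched in Python; the sample strings are illustrative, and the email TLD class is written as [A-Za-z]):

```python
import re

# Mirrors the validator's PII detection patterns
PII_PATTERNS = {
    "Email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "Phone": r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b",
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    "CreditCard": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
}

def detect_pii(text: str) -> list[str]:
    """Return the names of all PII patterns found in a log statement."""
    return [name for name, pattern in PII_PATTERNS.items()
            if re.search(pattern, text)]

# Unredacted values are flagged; fully redacted placeholders are not
assert detect_pii("Processing user alice@example.com") == ["Email"]
assert detect_pii("SSN: 123-45-6789") == ["SSN"]
assert detect_pii("***@***.com") == []
```

Note that partially redacted values (e.g. an email whose username keeps its last character) can still match the email pattern, which is why the validator treats any match as a violation requiring manual review.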

Azure Pipelines Integration:

# PII Redaction Validation Gate
- task: PowerShell@2
  inputs:
    filePath: 'scripts/validate-pii-redaction.ps1'
    arguments: '-Path "$(Build.SourcesDirectory)"'
    pwsh: true
  displayName: 'Validate PII Redaction'
  continueOnError: false  # Fail build if PII detected

PII Redaction Examples:

// ❌ BAD: Raw PII in logs
public async Task ProcessUserAsync(User user)
{
    _logger.LogInformation($"Processing user: {user.Email}");  // ❌ Raw email logged
    _logger.LogInformation($"User phone: {user.PhoneNumber}");  // ❌ Raw phone logged
    _logger.LogInformation($"SSN: {user.SSN}");  // ❌ SSN logged (CRITICAL violation)
}

// ✅ GOOD: Redacted PII with attributes
public class User
{
    public Guid Id { get; set; }

    [EmailData]  // ✅ Redaction attribute
    public string Email { get; set; }

    [PhoneData]  // ✅ Redaction attribute
    public string PhoneNumber { get; set; }

    [PersonalData]  // ✅ Generic PII attribute
    public string SSN { get; set; }

    public string DisplayName { get; set; }  // Public, can be logged
}

// ✅ GOOD: Structured logging with automatic redaction
public async Task ProcessUserAsync(User user)
{
    _logger.LogInformation(
        "Processing user {UserId} with email {Email} and phone {Phone}",
        user.Id,  // Safe (GUID)
        user.Email,  // ✅ Auto-redacted to "***@***.com"
        user.PhoneNumber  // ✅ Auto-redacted to "***-***-1234" (last 4 digits)
    );
}

// ✅ GOOD: Manual redaction helper
public async Task ProcessUserAsync(User user)
{
    _logger.LogInformation(
        "Processing user {UserId} with email {Email}",
        user.Id,
        RedactEmail(user.Email)  // ✅ Explicitly redacted
    );
}

private string RedactEmail(string email)
{
    if (string.IsNullOrEmpty(email))
        return email;

    var parts = email.Split('@');
    if (parts.Length != 2)
        return "***";

    var username = parts[0];
    var domain = parts[1];

    // Show first char + last char, redact middle
    var redactedUsername = username.Length > 2
        ? $"{username[0]}***{username[^1]}"
        : "***";

    return $"{redactedUsername}@{domain}";
}
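The redaction helper's behavior is easy to pin down with a few examples. A direct port of its logic (Python, for illustration only):

```python
def redact_email(email: str) -> str:
    """Port of the C# RedactEmail helper: show first and last username
    characters, redact the middle, keep the domain."""
    if not email:
        return email
    parts = email.split("@")
    if len(parts) != 2:
        return "***"
    username, domain = parts
    # Show first char + last char only when the username is long enough
    redacted = f"{username[0]}***{username[-1]}" if len(username) > 2 else "***"
    return f"{redacted}@{domain}"

assert redact_email("alice@example.com") == "a***e@example.com"
assert redact_email("ab@example.com") == "***@example.com"
assert redact_email("not-an-email") == "***"
```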

Logging Redaction Configuration (appsettings.json):

{
  "Compliance": {
    "EnableLoggingRedaction": true,
    "RedactionMode": "Automatic",  // Automatic | Manual | Disabled
    "RedactionAttributes": [
      "EmailData",
      "PhoneData",
      "PersonalData",
      "SensitiveData",
      "CreditCardData"
    ],
    "RedactionPlaceholder": "***",
    "PreserveLastN": 4  // Show last 4 chars (e.g., ***-***-1234)
  },
  "Logging": {
    "LogLevel": {
      "Default": "Information"
    },
    "Enrichers": [
      "PiiRedactionEnricher"  // Serilog enricher for automatic redaction
    ]
  }
}
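The PreserveLastN setting masks everything except the trailing characters, as in the "***-***-1234" phone example. One way that behavior can be sketched (Python; the function name is illustrative, not an ATP API):

```python
def redact_preserve_last(value: str, n: int = 4, placeholder: str = "*") -> str:
    """Mask every alphanumeric character except the last n, preserving
    separators such as '-' so the redacted shape stays recognizable."""
    chars = [c if not c.isalnum() else placeholder for c in value]
    # Restore the trailing n alphanumeric characters
    kept = 0
    for i in range(len(value) - 1, -1, -1):
        if value[i].isalnum():
            chars[i] = value[i]
            kept += 1
            if kept == n:
                break
    return "".join(chars)

assert redact_preserve_last("555-123-4567") == "***-***-4567"
assert redact_preserve_last("123-45-6789") == "***-**-6789"
```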

GDPR/HIPAA Compliance Checklist

Purpose: Validate that all regulatory safeguards are implemented before deploying to staging/production.

Requirement: 8 compliance controls must be verified via automated checks and manual attestations.

Checklist Items:

| Control | Requirement | Validation Method | Blocker | Regulatory Basis |
|---|---|---|---|---|
| Encryption at Rest | All databases, storage accounts use encryption (TDE, SSE-AES256) | Azure Policy scan | ✅ Yes (staging/prod) | GDPR Art. 32, HIPAA §164.312(a)(2)(iv) |
| Encryption in Transit | TLS 1.3 enforced for all external APIs; TLS 1.2 minimum internal | Network security policy | ✅ Yes (prod) | GDPR Art. 32, HIPAA §164.312(e)(1) |
| Tenant Isolation | Multi-tenant data separation validated in integration tests | Test results (tag: @tenantIsolation) | ✅ Yes | GDPR Art. 32, HIPAA §164.308(a)(3) |
| Retention Policies | Configurable retention per tenant; default 7 years; auto-purge | Configuration validation + integration test | ✅ Yes | GDPR Art. 5(1)(e), HIPAA §164.316(b)(2)(i) |
| DSAR Workflow | Data Subject Access Request API implemented | API contract test (GET /api/dsar/{userId}) | ✅ Yes | GDPR Art. 15-20 |
| Breach Notification | Incident response procedure documented | Document exists in repo | ⚠️ Warning | GDPR Art. 33-34, HIPAA §164.404-414 |
| Audit Logging | 100% of write operations emit audit events | Custom Roslyn analyzer (ATP001) | ✅ Yes | GDPR Art. 30, HIPAA §164.312(b) |
| PII Redaction | No raw PII in logs, telemetry, error messages | Custom log parser (PowerShell) | ✅ Yes | GDPR Art. 5(1)(f), HIPAA §164.514(b) |

Compliance Checklist Validator (PowerShell):

<#
.SYNOPSIS
    Validate GDPR/HIPAA compliance checklist.
.DESCRIPTION
    Automated validation of compliance controls before deployment.
#>

param(
    [string]$Environment = "Staging"  # Dev | Test | Staging | Production
)

Write-Host "Validating GDPR/HIPAA compliance checklist for: $Environment"

$checklist = @()

# 1. Encryption at Rest
Write-Host "`nChecking: Encryption at Rest..."
$encryptionPolicy = az policy state list `
    --resource-group "ATP-$Environment-RG" `
    --filter "policyDefinitionName eq 'SQL TDE'" `
    --query "[?complianceState=='Compliant'].resourceId" -o json | ConvertFrom-Json

if ($encryptionPolicy.Count -gt 0) {
    Write-Host "✅ Encryption at Rest: PASS" -ForegroundColor Green
    $checklist += [PSCustomObject]@{ Control = "Encryption at Rest"; Status = "Pass" }
}
else {
    Write-Host "❌ Encryption at Rest: FAIL" -ForegroundColor Red
    $checklist += [PSCustomObject]@{ Control = "Encryption at Rest"; Status = "Fail" }
}

# 2. Encryption in Transit
Write-Host "`nChecking: Encryption in Transit..."
$appServices = az webapp list `
    --resource-group "ATP-$Environment-RG" `
    --query "[].{name:name,httpsOnly:httpsOnly,minTlsVersion:siteConfig.minTlsVersion}" -o json | ConvertFrom-Json

$nonHttpsApps = $appServices | Where-Object { $_.httpsOnly -eq $false }

if ($nonHttpsApps.Count -eq 0) {
    Write-Host "✅ Encryption in Transit: PASS (HTTPS enforced)" -ForegroundColor Green
    $checklist += [PSCustomObject]@{ Control = "Encryption in Transit"; Status = "Pass" }
}
else {
    Write-Host "❌ Encryption in Transit: FAIL (HTTPS not enforced)" -ForegroundColor Red
    $checklist += [PSCustomObject]@{ Control = "Encryption in Transit"; Status = "Fail" }
}

# 3. Tenant Isolation (from test results)
Write-Host "`nChecking: Tenant Isolation..."
$testResults = az pipelines runs artifact download `
    --artifact-name "test-results" `
    --path "test-results/" `
    --run-id $env:BUILD_BUILDID  # $(Build.BuildId) macros are not expanded inside .ps1 files

$tenantIsolationTests = Select-String -Path "test-results/*.trx" -Pattern '@tenantIsolation.*Passed'

if ($tenantIsolationTests.Count -gt 0) {
    Write-Host "✅ Tenant Isolation: PASS ($($tenantIsolationTests.Count) tests)" -ForegroundColor Green
    $checklist += [PSCustomObject]@{ Control = "Tenant Isolation"; Status = "Pass" }
}
else {
    Write-Host "❌ Tenant Isolation: FAIL (tests not found or failed)" -ForegroundColor Red
    $checklist += [PSCustomObject]@{ Control = "Tenant Isolation"; Status = "Fail" }
}

# 4. Retention Policies
Write-Host "`nChecking: Retention Policies..."
$appsettingsFile = Get-ChildItem -Path "src" -Recurse -Filter "appsettings.$Environment.json" | Select-Object -First 1
$appsettings = Get-Content -Path $appsettingsFile.FullName -Raw | ConvertFrom-Json

if ($appsettings.Audit.RetentionDays -ge 2555) {  # 7 years = 2555 days
    Write-Host "✅ Retention Policies: PASS (7 years configured)" -ForegroundColor Green
    $checklist += [PSCustomObject]@{ Control = "Retention Policies"; Status = "Pass" }
}
else {
    Write-Host "❌ Retention Policies: FAIL (retention < 7 years)" -ForegroundColor Red
    $checklist += [PSCustomObject]@{ Control = "Retention Policies"; Status = "Fail" }
}

# 5. DSAR Workflow (API contract test)
Write-Host "`nChecking: DSAR Workflow..."
$openApiSpec = Get-Content -Path "swagger.json" -Raw | ConvertFrom-Json

$dsarEndpoint = $openApiSpec.paths."/api/dsar/{userId}"

if ($dsarEndpoint) {
    Write-Host "✅ DSAR Workflow: PASS (API endpoint exists)" -ForegroundColor Green
    $checklist += [PSCustomObject]@{ Control = "DSAR Workflow"; Status = "Pass" }
}
else {
    Write-Host "❌ DSAR Workflow: FAIL (API endpoint missing)" -ForegroundColor Red
    $checklist += [PSCustomObject]@{ Control = "DSAR Workflow"; Status = "Fail" }
}

# 6. Breach Notification
Write-Host "`nChecking: Breach Notification..."
$breachDoc = Test-Path -Path "docs/compliance/breach-notification-procedure.md"

if ($breachDoc) {
    Write-Host "✅ Breach Notification: PASS (procedure documented)" -ForegroundColor Green
    $checklist += [PSCustomObject]@{ Control = "Breach Notification"; Status = "Pass" }
}
else {
    Write-Host "⚠️  Breach Notification: WARNING (procedure not found)" -ForegroundColor Yellow
    $checklist += [PSCustomObject]@{ Control = "Breach Notification"; Status = "Warning" }
}

# 7. Audit Logging (already validated by previous gate)
Write-Host "`nChecking: Audit Logging..."
Write-Host "✅ Audit Logging: PASS (validated in previous gate)" -ForegroundColor Green
$checklist += [PSCustomObject]@{ Control = "Audit Logging"; Status = "Pass" }

# 8. PII Redaction (already validated by previous gate)
Write-Host "`nChecking: PII Redaction..."
Write-Host "✅ PII Redaction: PASS (validated in previous gate)" -ForegroundColor Green
$checklist += [PSCustomObject]@{ Control = "PII Redaction"; Status = "Pass" }

# Summary
Write-Host "`n═══════════════════════════════════════════════════════════"
Write-Host "Compliance Checklist Summary"
Write-Host "═══════════════════════════════════════════════════════════"

$checklist | Format-Table -AutoSize

$failed = $checklist | Where-Object { $_.Status -eq "Fail" }
$warnings = $checklist | Where-Object { $_.Status -eq "Warning" }

if ($failed.Count -gt 0) {
    Write-Host "`n❌ Compliance checklist FAILED: $($failed.Count) control(s)" -ForegroundColor Red
    exit 1
}

if ($warnings.Count -gt 0) {
    Write-Host "`n⚠️  Compliance checklist passed with warnings: $($warnings.Count) control(s)" -ForegroundColor Yellow
}

Write-Host "`n✅ All compliance controls validated" -ForegroundColor Green

Azure Pipelines Integration:

# GDPR/HIPAA Compliance Checklist Gate
- task: PowerShell@2
  inputs:
    filePath: 'scripts/validate-compliance-checklist.ps1'
    arguments: '-Environment $(Environment)'
    pwsh: true
  displayName: 'Validate GDPR/HIPAA Checklist'
  continueOnError: false  # Fail deployment if checklist fails

# Generate compliance attestation report
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: |
      $attestation = @{
        BuildId = "$(Build.BuildId)"
        BuildNumber = "$(Build.BuildNumber)"
        Environment = "$(Environment)"
        Timestamp = (Get-Date).ToUniversalTime().ToString("o")
        Checklist = @(
          @{ Control = "Encryption at Rest"; Status = "Pass"; Evidence = "Azure Policy: SQL TDE Enabled" }
          @{ Control = "Encryption in Transit"; Status = "Pass"; Evidence = "App Service: HTTPS Only" }
          @{ Control = "Tenant Isolation"; Status = "Pass"; Evidence = "Integration Tests: 15 passed" }
          @{ Control = "Retention Policies"; Status = "Pass"; Evidence = "appsettings.Production.json: RetentionDays=2555" }
          @{ Control = "DSAR Workflow"; Status = "Pass"; Evidence = "OpenAPI: GET /api/dsar/{userId}" }
          @{ Control = "Breach Notification"; Status = "Pass"; Evidence = "docs/compliance/breach-notification-procedure.md" }
          @{ Control = "Audit Logging"; Status = "Pass"; Evidence = "Audit Logging Coverage: 100%" }
          @{ Control = "PII Redaction"; Status = "Pass"; Evidence = "PII Validation: 0 violations" }
        )
        Approver = "$(Build.RequestedFor)"
      }

      $attestation | ConvertTo-Json -Depth 10 | Out-File -FilePath "compliance-attestation-$(Build.BuildNumber).json"

      Write-Host "✅ Compliance attestation report generated"
  displayName: 'Generate Compliance Attestation'

# Publish attestation as artifact
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: 'compliance-attestation-$(Build.BuildNumber).json'
    ArtifactName: 'compliance-attestation'
  displayName: 'Publish Compliance Attestation'

Data Classification Validation

Purpose: Ensure all sensitive data is properly classified with appropriate attributes for GDPR/HIPAA compliance.

Data Classification Levels:

| Classification | Attribute | Examples | Protection Requirements |
|---|---|---|---|
| Public | (none) | Product names, public IDs | No special protection |
| Internal | [InternalData] | Employee names, department | Redact in external logs |
| Confidential | [ConfidentialData] | Tenant config, business rules | Redact in all logs, encrypt at rest |
| Personal | [PersonalData] | User names, preferences | GDPR Article 4(1), redact always |
| Sensitive | [SensitiveData] | Email, phone, address | GDPR Article 9, redact + encrypt |
| Restricted | [RestrictedData] | SSN, health data, biometrics | HIPAA PHI, maximum protection |
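The name heuristic at the heart of the ATP002 analyzer is plain substring matching against a sensitive-name list. The same check, sketched in Python for illustration:

```python
# Mirrors the sensitive-name list used by the ATP002 classification analyzer
SENSITIVE_NAMES = [
    "email", "phone", "ssn", "socialsecurity", "creditcard",
    "password", "healthrecord", "biometric", "dob", "dateofbirth",
]

def needs_classification(property_name: str) -> bool:
    """True if a property name suggests sensitive data and therefore
    requires a [PersonalData]/[SensitiveData]/[RestrictedData] attribute."""
    lowered = property_name.lower()
    return any(s in lowered for s in SENSITIVE_NAMES)

assert needs_classification("Email") is True
assert needs_classification("PhoneNumber") is True
assert needs_classification("DisplayName") is False
```

Substring matching keeps the analyzer simple but can miss renamed properties (e.g. "ContactAddress"), so it complements rather than replaces review of data models.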

Classification Validator (C# Roslyn Analyzer):

// Custom analyzer to enforce data classification
[DiagnosticAnalyzer(LanguageNames.CSharp)]
public class DataClassificationAnalyzer : DiagnosticAnalyzer
{
    private const string DiagnosticId = "ATP002";

    private static readonly DiagnosticDescriptor Rule = new DiagnosticDescriptor(
        DiagnosticId,
        "Sensitive property missing classification attribute",
        "Property '{0}' contains sensitive data but lacks [PersonalData], [SensitiveData], or [RestrictedData] attribute",
        "Compliance",
        DiagnosticSeverity.Error,
        isEnabledByDefault: true);

    public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics => ImmutableArray.Create(Rule);

    public override void Initialize(AnalysisContext context)
    {
        context.RegisterSyntaxNodeAction(AnalyzeProperty, SyntaxKind.PropertyDeclaration);
    }

    private void AnalyzeProperty(SyntaxNodeAnalysisContext context)
    {
        var propertyDeclaration = (PropertyDeclarationSyntax)context.Node;
        var propertySymbol = context.SemanticModel.GetDeclaredSymbol(propertyDeclaration);

        if (propertySymbol == null)
            return;

        var propertyName = propertySymbol.Name.ToLower();

        // Sensitive property names
        var sensitiveNames = new[] 
        { 
            "email", "phone", "ssn", "socialsecurity", "creditcard", 
            "password", "healthrecord", "biometric", "dob", "dateofbirth" 
        };

        var isSensitive = sensitiveNames.Any(s => propertyName.Contains(s));

        if (!isSensitive)
            return;

        // Check if property has classification attribute
        var hasClassificationAttribute = propertySymbol.GetAttributes().Any(attr =>
        {
            var attrName = attr.AttributeClass?.Name;
            return attrName == "PersonalDataAttribute" ||
                   attrName == "SensitiveDataAttribute" ||
                   attrName == "RestrictedDataAttribute" ||
                   attrName == "EmailDataAttribute" ||
                   attrName == "PhoneDataAttribute";
        });

        if (!hasClassificationAttribute)
        {
            var diagnostic = Diagnostic.Create(Rule, propertyDeclaration.Identifier.GetLocation(), propertySymbol.Name);
            context.ReportDiagnostic(diagnostic);
        }
    }
}

Example: Proper Data Classification:

// ✅ GOOD: Entity with proper data classification
public class User
{
    // Public data (no attribute needed)
    public Guid Id { get; set; }
    public string DisplayName { get; set; }
    public DateTime CreatedAt { get; set; }

    // Personal data (GDPR Article 4(1))
    [PersonalData]
    public string FirstName { get; set; }

    [PersonalData]
    public string LastName { get; set; }

    // Sensitive data (GDPR Article 9)
    [EmailData]
    [SensitiveData]
    public string Email { get; set; }

    [PhoneData]
    [SensitiveData]
    public string PhoneNumber { get; set; }

    [SensitiveData]
    public string Address { get; set; }

    // Restricted data (HIPAA PHI)
    [RestrictedData]
    [EncryptedColumn]  // Column-level encryption
    public string SSN { get; set; }

    [RestrictedData]
    [EncryptedColumn]
    public string HealthRecordNumber { get; set; }
}

// ❌ BAD: Sensitive property without classification
public class User
{
    public Guid Id { get; set; }
    public string Email { get; set; }  // ❌ ATP002: Missing [EmailData] or [SensitiveData]
    public string SSN { get; set; }    // ❌ ATP002: Missing [RestrictedData]
}
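The ATP002 detection rule boils down to two set checks: does the property name contain a sensitive substring, and does it lack every classification attribute? A language-neutral sketch in Python (the name list and attribute list are copied from the analyzer above; `atp002_violations` is a hypothetical helper):

```python
# Heuristic copied from the ATP002 analyzer above; helper name is illustrative.
SENSITIVE_NAMES = {"email", "phone", "ssn", "socialsecurity", "creditcard",
                   "password", "healthrecord", "biometric", "dob", "dateofbirth"}
CLASSIFICATION_ATTRS = {"PersonalData", "SensitiveData", "RestrictedData",
                        "EmailData", "PhoneData"}

def atp002_violations(properties):
    """properties: list of (name, attribute_names). Returns the names that
    look sensitive but carry no classification attribute."""
    return [
        name for name, attrs in properties
        if any(s in name.lower() for s in SENSITIVE_NAMES)
        and not (set(attrs) & CLASSIFICATION_ATTRS)
    ]
```

Applied to the BAD example above, `Email` is flagged while `Id` (not sensitive) and a properly attributed `SSN` are not.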

Retention Policy Validation

Purpose: Validate that data retention policies are properly configured per GDPR Article 5(1)(e) (storage limitation) and HIPAA §164.316(b)(2)(i).

Validation (Integration Test):

// Integration test: Validate retention policy enforcement
[Fact]
[Trait("Category", "Compliance")]
[Trait("Regulatory", "GDPR")]
public async Task Should_EnforceRetentionPolicy_When_EventsExceedRetention()
{
    // Arrange: Create event older than retention period
    var retentionDays = _configuration.GetValue<int>("Audit:RetentionDays");
    var oldEvent = new AuditEvent
    {
        Id = Guid.NewGuid(),
        TenantId = _testTenant.Id,
        Timestamp = DateTime.UtcNow.AddDays(-retentionDays - 1),  // Beyond retention
        Action = "OldAction"
    };

    await _repository.AddAsync(oldEvent);

    // Act: Run retention policy enforcement job
    var retentionService = _serviceProvider.GetRequiredService<IRetentionPolicyService>();
    var purgedCount = await retentionService.EnforceRetentionPolicyAsync(_testTenant.Id);

    // Assert: Old event should be purged
    Assert.Equal(1, purgedCount);

    var retrievedEvent = await _repository.GetByIdAsync(oldEvent.Id);
    Assert.Null(retrievedEvent);  // Event should be deleted
}

[Fact]
[Trait("Category", "Compliance")]
[Trait("Regulatory", "GDPR")]
public async Task Should_RespectCustomRetention_When_TenantOverridesDefault()
{
    // Arrange: Tenant with custom 10-year retention
    var customTenant = new Tenant
    {
        Id = Guid.NewGuid(),
        Name = "CustomRetentionTenant",
        RetentionDays = 3650  // 10 years (overrides default 7 years)
    };

    await _tenantRepository.AddAsync(customTenant);

    var event8YearsOld = new AuditEvent
    {
        Id = Guid.NewGuid(),
        TenantId = customTenant.Id,
        Timestamp = DateTime.UtcNow.AddDays(-(365 * 8)),  // 8 years old
        Action = "OldAction"
    };

    await _repository.AddAsync(event8YearsOld);

    // Act: Run retention enforcement
    var retentionService = _serviceProvider.GetRequiredService<IRetentionPolicyService>();
    var purgedCount = await retentionService.EnforceRetentionPolicyAsync(customTenant.Id);

    // Assert: Event should NOT be purged (within custom 10-year retention)
    Assert.Equal(0, purgedCount);

    var retrievedEvent = await _repository.GetByIdAsync(event8YearsOld.Id);
    Assert.NotNull(retrievedEvent);  // Event still exists
}
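The retention rule these tests exercise reduces to a single cutoff comparison: an event is purged once it is older than the tenant's retention window, with a tenant override taking precedence over the platform default. A minimal Python sketch (illustrative; 2555 days corresponds to the 7-year default mentioned elsewhere in this document, and the field names are hypothetical):

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_DAYS = 2555  # 7 years (ATP default; tenants may override)

def purge_expired(events, tenant_retention_days=None, now=None):
    """Sketch of the retention rule the tests above exercise.

    events: list of dicts with a UTC 'timestamp' key (hypothetical shape).
    Returns (kept, purged)."""
    now = now or datetime.now(timezone.utc)
    retention = tenant_retention_days or DEFAULT_RETENTION_DAYS
    cutoff = now - timedelta(days=retention)
    kept = [e for e in events if e["timestamp"] >= cutoff]
    purged = [e for e in events if e["timestamp"] < cutoff]
    return kept, purged
```

With the default window an 8-year-old event is purged; with a 3650-day tenant override the same event is retained, matching the two tests above.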

DSAR (Data Subject Access Request) Workflow Validation

Purpose: Validate that DSAR workflows are implemented per GDPR Articles 15-20 (Right of Access, Erasure, Portability).

API Contract Test:

// API contract test for DSAR workflow
[Fact]
[Trait("Category", "Compliance")]
[Trait("Regulatory", "GDPR-Article-15")]
public async Task Should_ReturnUserData_When_DSARRequested()
{
    // Arrange: Create test user with audit events
    var userId = Guid.NewGuid();
    var events = Enumerable.Range(0, 10)
        .Select(i => new AuditEvent
        {
            Id = Guid.NewGuid(),
            TenantId = _testTenant.Id,
            UserId = userId,
            Action = $"Action{i}",
            Timestamp = DateTime.UtcNow.AddDays(-i)
        })
        .ToList();

    await _repository.AddRangeAsync(events);

    // Act: Request DSAR export
    var response = await _client.GetAsync($"/api/dsar/{userId}");

    // Assert: DSAR returns all user data
    response.EnsureSuccessStatusCode();

    var dsar = await response.Content.ReadFromJsonAsync<DSARExportResponse>();

    Assert.NotNull(dsar);
    Assert.Equal(userId, dsar.UserId);
    Assert.Equal(10, dsar.AuditEvents.Count);
    Assert.Equal("application/json", response.Content.Headers.ContentType?.MediaType);

    // Validate DSAR contains required sections per GDPR
    Assert.NotNull(dsar.PersonalData);
    Assert.NotNull(dsar.AuditTrail);
    Assert.NotNull(dsar.DataProcessingActivities);
    Assert.NotNull(dsar.ThirdPartyDisclosures);
}

[Fact]
[Trait("Category", "Compliance")]
[Trait("Regulatory", "GDPR-Article-17")]
public async Task Should_EraseUserData_When_RightToErasureInvoked()
{
    // Arrange: Create test user
    var userId = Guid.NewGuid();
    await CreateTestUserWithDataAsync(userId);

    // Act: Request erasure (Right to be Forgotten)
    var response = await _client.DeleteAsync($"/api/dsar/{userId}");

    // Assert: User data erased
    response.EnsureSuccessStatusCode();

    // Verify audit events anonymized (user ID replaced with pseudonym)
    var events = await _repository.GetByUserIdAsync(userId);
    Assert.Empty(events);  // User's events should be anonymized or deleted

    // Verify user record marked as erased
    var user = await _userRepository.GetByIdAsync(userId);
    Assert.True(user.IsErased);
    Assert.Equal("ERASED", user.Email);  // PII overwritten
}
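The erasure test above expects direct identifiers overwritten and the user's audit trail anonymized rather than destroyed. A minimal sketch of that pseudonymization step (assumptions: SHA-256-based pseudonyms and illustrative field names, not the ATP schema):

```python
import hashlib

def erase_user(user: dict, pepper: str = "atp-demo-pepper") -> dict:
    """Sketch of GDPR Article 17 erasure: overwrite direct identifiers and
    derive a one-way pseudonym so audit-trail integrity survives the erasure.
    Field names and the pepper are illustrative assumptions."""
    pseudonym = hashlib.sha256((pepper + user["id"]).encode()).hexdigest()[:16]
    return {
        "id": user["id"],
        "pseudonym": pseudonym,   # used to re-key the user's audit events
        "email": "ERASED",
        "first_name": "ERASED",
        "last_name": "ERASED",
        "is_erased": True,
    }
```

Because the pseudonym is a keyed one-way hash, erased users cannot be re-identified from the audit trail, yet event counts and sequences remain intact for compliance reporting.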

Compliance Evidence Collection

Purpose: Automatically collect compliance evidence during builds for SOC 2, GDPR, HIPAA audits.

Evidence Artifacts:

# Collect compliance evidence
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: |
      $evidenceDir = "$(Build.ArtifactStagingDirectory)/compliance-evidence"
      New-Item -ItemType Directory -Force -Path $evidenceDir

      # 1. SBOM (already generated)
      Copy-Item "$(Build.ArtifactStagingDirectory)/sbom/*.json" -Destination "$evidenceDir/sbom.json"

      # 2. Security scan reports
      Copy-Item "dependency-check-report.html" -Destination "$evidenceDir/dependency-scan.html"
      Copy-Item "trivy-report.html" -Destination "$evidenceDir/container-scan.html"

      # 3. Test results with compliance tags
      Copy-Item "TestResults/*.trx" -Destination "$evidenceDir/test-results.trx"

      # 4. Code coverage report
      Copy-Item "coverage-report/index.html" -Destination "$evidenceDir/code-coverage.html"

      # 5. Compliance attestation
      Copy-Item "compliance-attestation-$(Build.BuildNumber).json" -Destination "$evidenceDir/compliance-attestation.json"

      # 6. Audit logging coverage report
      Copy-Item "audit-logging-coverage.json" -Destination "$evidenceDir/audit-logging-coverage.json"

      # 7. PII redaction report
      Copy-Item "pii-redaction-report.json" -Destination "$evidenceDir/pii-redaction-report.json"

      # 8. License compliance report
      Copy-Item "licenses/licenses.json" -Destination "$evidenceDir/license-report.json"

      Write-Host "✅ Compliance evidence collected: 8 artifacts"
  displayName: 'Collect Compliance Evidence'

# Publish compliance evidence bundle
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)/compliance-evidence'
    ArtifactName: 'compliance-evidence-$(Build.BuildNumber)'
  displayName: 'Publish Compliance Evidence Bundle'

# Archive to immutable storage (production only)
- task: AzureCLI@2
  inputs:
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: |
      # Upload evidence bundle
      az storage blob upload-batch \
        --source "$(Build.ArtifactStagingDirectory)/compliance-evidence" \
        --destination compliance-evidence \
        --account-name atpcomplianceblob \
        --pattern "*" \
        --metadata \
          BuildId=$(Build.BuildId) \
          BuildNumber=$(Build.BuildNumber) \
          Environment=Production \
          ComplianceFrameworks=GDPR,HIPAA,SOC2 \
          RetentionYears=7

      # Enable legal hold (legal holds apply at the container level;
      # tags must be 3-23 alphanumeric characters)
      az storage container legal-hold set \
        --account-name atpcomplianceblob \
        --container-name compliance-evidence \
        --tags auditevidence soc2 gdpr hipaa
  displayName: 'Archive Compliance Evidence (Immutable)'
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))

SOC 2 Control Mapping

Purpose: Map ATP quality gates to SOC 2 Trust Service Criteria for audit readiness.

SOC 2 Control Mapping:

| SOC 2 Control | Description | ATP Quality Gate | Evidence |
|---|---|---|---|
| CC6.1 | Logical and physical access controls | Two-person review (PRs), Key Vault access | PR approvals, Key Vault audit logs |
| CC6.6 | Vulnerability management | Dependency scanning, Trivy, SAST | OWASP reports, Trivy reports |
| CC7.2 | System monitoring | Observability gates, health checks | Application Insights, test results |
| CC8.1 | Change management | CAB approval, Azure Pipelines deployment gates | CAB meeting minutes, approval logs, pipeline execution logs |
| CC9.2 | Risk mitigation | Security gates, compliance gates | Risk acceptance documents, suppressions |
| A1.2 | Availability commitments | Performance gates, chaos tests | Load test results, DR drill results |
| C1.1 | Confidentiality commitments | PII redaction, encryption gates | PII validation reports, Azure Policy |
| P2.1 | Privacy notice | DSAR workflow, consent management | DSAR API tests, consent logs |
| P3.2 | Privacy access, correction, deletion | DSAR API implementation | DSAR integration tests |

SOC 2 Evidence Package (auto-generated per build):

// Generate SOC 2 evidence package
public class SOC2EvidenceCollector
{
    public async Task<SOC2EvidencePackage> CollectEvidenceAsync(string buildId)
    {
        return new SOC2EvidencePackage
        {
            BuildId = buildId,
            GeneratedAt = DateTime.UtcNow,

            // CC6.1: Logical Access Controls
            CC6_1 = new ControlEvidence
            {
                ControlId = "CC6.1",
                Description = "Logical and physical access controls restrict unauthorized access",
                Evidence = new[]
                {
                    await GetPRApprovalLogsAsync(buildId),
                    await GetKeyVaultAccessLogsAsync(),
                    await GetAzureADAccessReviewsAsync()
                }
            },

            // CC6.6: Vulnerability Management
            CC6_6 = new ControlEvidence
            {
                ControlId = "CC6.6",
                Description = "Vulnerabilities are identified and remediated timely",
                Evidence = new[]
                {
                    await GetDependencyScanReportAsync(buildId),
                    await GetTrivyScanReportAsync(buildId),
                    await GetSASTReportAsync(buildId),
                    await GetVulnerabilityRemediationMetricsAsync()
                }
            },

            // CC8.1: Change Management
            CC8_1 = new ControlEvidence
            {
                ControlId = "CC8.1",
                Description = "Changes are authorized, tested, and approved before deployment",
                Evidence = new[]
                {
                    await GetCABApprovalRecordsAsync(buildId),
                    await GetPipelineExecutionLogsAsync(buildId),
                    await GetDeploymentApprovalLogsAsync(buildId),
                    await GetTestResultsAsync(buildId)
                }
            },

            // P3.2: Privacy Rights (DSAR)
            P3_2 = new ControlEvidence
            {
                ControlId = "P3.2",
                Description = "Individuals can access, correct, and delete personal data",
                Evidence = new[]
                {
                    await GetDSARAPITestResultsAsync(buildId),
                    await GetDSARExecutionLogsAsync(),
                    "DSAR API: GET /api/dsar/{userId}, DELETE /api/dsar/{userId}"
                }
            }
        };
    }
}

Compliance Gate Metrics & Reporting

Purpose: Track compliance posture over time and generate audit-ready reports.

Compliance Metrics Dashboard:

# Azure DevOps Compliance Dashboard
dashboard:
  name: "ATP Compliance Posture"

  widgets:
    - type: complianceScore
      title: "Overall Compliance Score"
      query: |
        customEvents
        | where name == "ComplianceChecklistValidated"
        | extend PassedControls = toint(customDimensions.PassedControls)
        | extend TotalControls = toint(customDimensions.TotalControls)
        | extend Score = (PassedControls * 100.0) / TotalControls
        | summarize AvgScore = avg(Score)
      target: 100%

    - type: auditLoggingCoverage
      title: "Audit Logging Coverage"
      query: "Audit Logging Coverage (Last 30 Builds)"
      target: 100%

    - type: piiViolations
      title: "PII Leakage Incidents"
      query: |
        customEvents
        | where name == "PIIDetectedInLogs"
        | where timestamp > ago(30d)
        | summarize count()
      target: 0

    - type: dsarResponseTime
      title: "DSAR Response Time"
      query: |
        customEvents
        | where name == "DSARCompleted"
        | extend ResponseTimeHours = todouble(customDimensions.ResponseTimeHours)
        | summarize AvgResponseTime = avg(ResponseTimeHours)
      target: < 72h (GDPR requirement: 30 days, ATP target: 3 days)

Compliance KQL Queries:

// Compliance checklist pass rate (last 90 days)
customEvents
| where name == "ComplianceChecklistValidated"
| where timestamp > ago(90d)
| extend PassedControls = toint(customDimensions.PassedControls)
| extend TotalControls = toint(customDimensions.TotalControls)
| extend PassRate = (PassedControls * 100.0) / TotalControls
| summarize 
    AvgPassRate = avg(PassRate),
    MinPassRate = min(PassRate),
    BuildsBelow100Percent = countif(PassRate < 100)
| extend ComplianceStatus = iff(MinPassRate == 100, "Compliant", "Non-Compliant")
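The pass-rate query above reduces to simple arithmetic over per-build control counts. A Python equivalent (illustrative; input shape is a hypothetical list of `(passed, total)` pairs, mirroring `PassedControls`/`TotalControls` in the KQL):

```python
def compliance_summary(builds):
    """builds: list of (passed_controls, total_controls) per build.
    Mirrors the KQL above: per-build pass rate, then fleet-level status."""
    rates = [passed * 100.0 / total for passed, total in builds]
    return {
        "AvgPassRate": sum(rates) / len(rates),
        "MinPassRate": min(rates),
        "BuildsBelow100Percent": sum(1 for r in rates if r < 100),
        "ComplianceStatus": "Compliant" if min(rates) == 100 else "Non-Compliant",
    }
```

As in the KQL, a single build below 100% flips the overall status to "Non-Compliant".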

// Audit logging coverage trend
customEvents
| where name == "AuditLoggingValidated"
| where timestamp > ago(90d)
| extend Coverage = todouble(customDimensions.CoveragePercent)
| summarize AvgCoverage = avg(Coverage) by bin(timestamp, 1d)
| render timechart

// PII leakage incidents by severity
customEvents
| where name == "PIIDetectedInLogs"
| where timestamp > ago(180d)
| extend PIIType = tostring(customDimensions.PIIType)
| extend File = tostring(customDimensions.File)
| summarize Count = count() by PIIType
| order by Count desc

// DSAR request fulfillment metrics
customEvents
| where name in ("DSARRequested", "DSARCompleted", "DSARFailed")
| where timestamp > ago(90d)
| extend RequestId = tostring(customDimensions.RequestId)
| summarize 
    RequestedAt = minif(timestamp, name == "DSARRequested"),
    CompletedAt = maxif(timestamp, name == "DSARCompleted"),
    Status = iff(countif(name == "DSARCompleted") > 0, "Completed", "Pending")
    by RequestId
| where isnotnull(CompletedAt)
| extend ResponseTimeHours = datetime_diff('hour', CompletedAt, RequestedAt)
| summarize 
    AvgResponseTime = avg(ResponseTimeHours),
    P50ResponseTime = percentile(ResponseTimeHours, 50),
    P95ResponseTime = percentile(ResponseTimeHours, 95),
    Within72Hours = 100.0 * countif(ResponseTimeHours <= 72) / count()

Compliance Audit Report Generation

Purpose: Generate audit-ready reports summarizing compliance posture for SOC 2, GDPR, HIPAA audits.

Monthly Compliance Report (Azure Function):

// Generate monthly compliance report
[FunctionName("GenerateMonthlyComplianceReport")]
public async Task RunAsync(
    [TimerTrigger("0 0 9 1 * *")] TimerInfo timer,  // 1st of month at 9 AM
    ILogger log)
{
    log.LogInformation("Generating monthly compliance report...");

    var reportMonth = DateTime.UtcNow.AddMonths(-1).ToString("MMMM yyyy");

    var report = new ComplianceReport
    {
        Month = reportMonth,
        GeneratedAt = DateTime.UtcNow,

        // Overall Compliance Score
        OverallScore = await CalculateComplianceScoreAsync(),

        // Quality Gate Pass Rates
        QualityGates = new QualityGateMetrics
        {
            BuildQuality = await GetGatePassRateAsync("BuildQuality"),
            TestCoverage = await GetGatePassRateAsync("TestCoverage"),
            Security = await GetGatePassRateAsync("Security"),
            Compliance = await GetGatePassRateAsync("Compliance"),
            Performance = await GetGatePassRateAsync("Performance"),
            Observability = await GetGatePassRateAsync("Observability")
        },

        // Audit Logging
        AuditLogging = new AuditLoggingMetrics
        {
            AverageCoverage = await GetAuditLoggingCoverageAsync(),
            EventsLogged = await GetAuditEventCountAsync(reportMonth),
            ComplianceRate = 100.0  // Always 100% (enforced by gate)
        },

        // PII Protection
        PIIProtection = new PIIProtectionMetrics
        {
            LeakageIncidents = await GetPIILeakageCountAsync(reportMonth),
            RedactionEffectiveness = await GetPIIRedactionRateAsync(),
            DataClassificationCoverage = await GetDataClassificationCoverageAsync()
        },

        // GDPR Compliance
        GDPR = new GDPRMetrics
        {
            DSARRequests = await GetDSARRequestCountAsync(reportMonth),
            DSARAvgResponseTime = await GetDSARAvgResponseTimeAsync(reportMonth),
            RightToErasureRequests = await GetErasureRequestCountAsync(reportMonth),
            DataBreachIncidents = await GetDataBreachCountAsync(reportMonth)
        },

        // HIPAA Compliance
        HIPAA = new HIPAAMetrics
        {
            EncryptionCompliance = await GetEncryptionComplianceRateAsync(),
            AccessControlAudits = await GetAccessControlAuditCountAsync(reportMonth),
            BAAAgreementsActive = await GetBAACountAsync()
        },

        // SOC 2 Controls
        SOC2 = await GenerateSOC2ControlEvidenceAsync(reportMonth)
    };

    // Generate PDF report
    var pdf = await GeneratePdfReportAsync(report);

    // Upload to compliance blob storage
    await UploadComplianceReportAsync($"compliance-reports/{reportMonth}/Compliance-Report-{reportMonth}.pdf", pdf);

    // Send to stakeholders
    await SendReportAsync(pdf, new[]
    {
        "ciso@connectsoft.example",
        "compliance-officer@connectsoft.example",
        "dpo@connectsoft.example",  // Data Protection Officer
        "external-auditors@connectsoft.example"
    });

    log.LogInformation($"✅ Monthly compliance report generated for {reportMonth}");
}

Summary

  • Compliance Gates: 2-3 minute execution; enforce GDPR, HIPAA, SOC 2 requirements before deployment
  • Audit Logging Validation: 100% coverage required; custom Roslyn analyzer (ATP001) enforces IAuditLogger.LogAsync() calls
  • Audit Logging Validator: PowerShell script scans for state-mutating methods (Create/Update/Delete/Save/Add/Remove) without audit calls
  • PII Redaction Validation: Zero tolerance for raw PII in logs; PowerShell script detects email/phone/SSN patterns
  • PII Patterns: 4 regex patterns (email, phone, SSN, credit card); scans all log statements
  • PII Redaction: C# examples with [EmailData], [PhoneData], [PersonalData] attributes; automatic redaction via Serilog enricher
  • GDPR/HIPAA Checklist: 8 controls (encryption at rest/transit, tenant isolation, retention, DSAR, breach notification, audit logging, PII redaction)
  • Checklist Validator: PowerShell script validates all 8 controls via Azure Policy, test results, configuration, API contracts, documentation
  • Data Classification: 6 levels (Public, Internal, Confidential, Personal, Sensitive, Restricted); custom analyzer (ATP002) enforces classification
  • Retention Policy Validation: Integration tests verify 7-year retention enforcement and custom tenant overrides
  • DSAR Workflow: API contract tests validate GDPR Article 15 (access), Article 17 (erasure), Article 20 (portability)
  • Compliance Evidence: 8 artifacts auto-collected per build (SBOM, security scans, test results, coverage, attestation, audit logging, PII redaction, licenses)
  • SOC 2 Mapping: Trust Service Criteria mapped to ATP gates with supporting evidence artifacts
  • Compliance Reporting: Monthly automated report (Azure Function) covering quality gates, audit logging, PII protection, GDPR, HIPAA, SOC 2
  • Immutable Evidence: All compliance artifacts archived in Azure Blob with legal hold (7-year retention)

Performance Gates (Deep Dive)

Performance gates validate that ATP services meet latency, throughput, and reliability requirements under production-like load conditions and failure scenarios. These gates execute in staging environment and block production deployment if performance thresholds are not met.

Philosophy: Performance is a feature, not an afterthought. ATP enforces industry-leading performance standards (p95 <500ms vs. industry <1000ms) and chaos engineering to ensure services degrade gracefully under adverse conditions.

Performance Gate Workflow

graph TD
    A[Compliance Gates Passed] --> B[Deploy to Staging]
    B --> C[Run Load Tests]
    C --> D{Latency < Threshold?}
    D -->|No| E[Latency Too High ❌]
    D -->|Yes| F{Error Rate < 0.1%?}
    F -->|No| G[Error Rate Too High ❌]
    F -->|Yes| H{Throughput ≥ 1000 RPS?}
    H -->|No| I[Throughput Too Low ⚠️]
    H -->|Yes| J[Run Chaos Tests]
    J --> K{Pod Restart Pass?}
    K -->|No| L[Pod Restart Failed ❌]
    K -->|Yes| M{Network Latency Pass?}
    M -->|No| N[Network Latency Failed ⚠️]
    M -->|Yes| O{Storage Failure Pass?}
    O -->|No| P[Storage Failure Failed ❌]
    O -->|Yes| Q[Performance Gates Passed ✅]

    E --> R[Block Production Deployment]
    G --> R
    L --> R
    P --> R
    I --> S[Warning: Monitor Capacity]
    N --> S
    Q --> T[Ready for Production]

    style E fill:#ff6b6b
    style G fill:#ff6b6b
    style L fill:#ff6b6b
    style P fill:#ff6b6b
    style I fill:#feca57
    style N fill:#feca57
    style Q fill:#90EE90

Typical Performance Gate Duration: 10-15 minutes (load tests) + 5-10 minutes (chaos tests) = 15-25 minutes total


Load Test Thresholds (Staging)

Purpose: Simulate production traffic patterns (500-1000 concurrent users) and validate that services meet latency and error rate requirements.

Tool: Apache JMeter or k6 for load testing

Test Configuration:

| Parameter | Value | Rationale |
|---|---|---|
| Concurrent Users | 500 | ~50% of production peak traffic |
| Ramp-Up Time | 60 seconds | Gradual traffic increase |
| Test Duration | 600 seconds (10 minutes) | Sufficient for steady-state analysis |
| Request Mix | 70% read, 30% write | Matches production patterns |
| Data Set | 1M audit events | Production-like scale |

Performance Thresholds:

| Metric | Threshold | Action | Rationale |
|---|---|---|---|
| p50 Latency | < 100ms | ⚠️ Warning; investigate | Median user experience; competitive with industry leaders |
| p95 Latency | < 500ms | ❌ Block prod deploy | 95% of requests must be fast; ATP requirement stricter than industry (<1000ms) |
| p99 Latency | < 1000ms | ⚠️ Warning; track outliers | 99th percentile; acceptable for edge cases |
| Error Rate | < 0.1% | ❌ Block prod deploy | 99.9% success rate; <1 error per 1000 requests |
| Throughput | ≥ 1000 RPS | ℹ️ Info; capacity planning | Sustained requests/second; validates scaling |
| CPU Utilization | < 70% avg | ⚠️ Warning; optimize | Headroom for traffic spikes |
| Memory Utilization | < 80% avg | ⚠️ Warning; investigate leaks | Prevent OOM conditions |
| Database DTU/RU | < 80% | ⚠️ Warning; scale up | Database capacity headroom |
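The threshold table above amounts to a small decision function: exceed a ❌ row and production deployment is blocked; exceed a ⚠️/ℹ️ row and the build proceeds with a flag. An illustrative sketch covering the latency, error-rate, and throughput rows (the numbers are copied from the table; the function and metric key names are assumptions):

```python
# Illustrative gate evaluation against the thresholds in the table above.
# Metric keys and function names are hypothetical, not the ATP pipeline API.
THRESHOLDS = {
    "p50_ms":         (100,  "warn"),   # p50 latency
    "p95_ms":         (500,  "block"),  # p95 latency - blocks prod deploy
    "p99_ms":         (1000, "warn"),   # p99 latency
    "error_rate_pct": (0.1,  "block"),  # error rate - blocks prod deploy
}

def evaluate_gate(metrics: dict) -> dict:
    verdicts = {}
    for name, (limit, action) in THRESHOLDS.items():
        verdicts[name] = action if metrics[name] >= limit else "pass"
    # Throughput is a floor, not a ceiling: below 1000 RPS is informational only.
    verdicts["throughput_rps"] = "info" if metrics["throughput_rps"] < 1000 else "pass"
    verdicts["deploy_blocked"] = any(v == "block" for v in verdicts.values())
    return verdicts
```

Only the two ❌ rows can set `deploy_blocked`; warning and info rows surface in reporting without stopping the pipeline.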

JMeter Test Plan (XML excerpt):

<?xml version="1.0" encoding="UTF-8"?>
<jmeterTestPlan version="1.2" properties="5.0" jmeter="5.6.2">
  <hashTree>
    <!-- Test Plan -->
    <TestPlan guiclass="TestPlanGui" testclass="TestPlan" testname="ATP Load Test - Staging">
      <stringProp name="TestPlan.comments">ATP load test simulating production traffic</stringProp>
      <boolProp name="TestPlan.functional_mode">false</boolProp>
      <boolProp name="TestPlan.serialize_threadgroups">false</boolProp>
      <elementProp name="TestPlan.user_defined_variables" elementType="Arguments">
        <collectionProp name="Arguments.arguments">
          <elementProp name="BASE_URL" elementType="Argument">
            <stringProp name="Argument.name">BASE_URL</stringProp>
            <!-- Host name only: HTTPSampler.domain must not contain a scheme -->
            <stringProp name="Argument.value">${__P(baseUrl,atp-staging.azurewebsites.net)}</stringProp>
          </elementProp>
          <elementProp name="USERS" elementType="Argument">
            <stringProp name="Argument.name">USERS</stringProp>
            <stringProp name="Argument.value">${__P(users,500)}</stringProp>
          </elementProp>
          <elementProp name="RAMP_UP" elementType="Argument">
            <stringProp name="Argument.name">RAMP_UP</stringProp>
            <stringProp name="Argument.value">${__P(rampUp,60)}</stringProp>
          </elementProp>
          <elementProp name="DURATION" elementType="Argument">
            <stringProp name="Argument.name">DURATION</stringProp>
            <stringProp name="Argument.value">${__P(duration,600)}</stringProp>
          </elementProp>
        </collectionProp>
      </elementProp>
    </TestPlan>

    <hashTree>
      <!-- Thread Group: Read Operations (70% of traffic) -->
      <ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="Read Operations">
        <!-- 70% of the user pool; __groovy computes the integer share -->
        <stringProp name="ThreadGroup.num_threads">${__groovy((${USERS} * 0.7) as int)}</stringProp>
        <stringProp name="ThreadGroup.ramp_time">${RAMP_UP}</stringProp>
        <stringProp name="ThreadGroup.duration">${DURATION}</stringProp>
        <boolProp name="ThreadGroup.scheduler">true</boolProp>
      </ThreadGroup>

      <hashTree>
        <!-- GET /api/audit-events (list events) -->
        <HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="GET Audit Events">
          <elementProp name="HTTPsampler.Arguments" elementType="Arguments">
            <collectionProp name="Arguments.arguments">
              <elementProp name="tenantId" elementType="HTTPArgument">
                <stringProp name="Argument.value">${__UUID()}</stringProp>
              </elementProp>
              <elementProp name="pageSize" elementType="HTTPArgument">
                <stringProp name="Argument.value">50</stringProp>
              </elementProp>
            </collectionProp>
          </elementProp>
          <stringProp name="HTTPSampler.protocol">https</stringProp>
          <stringProp name="HTTPSampler.domain">${BASE_URL}</stringProp>
          <stringProp name="HTTPSampler.path">/api/audit-events</stringProp>
          <stringProp name="HTTPSampler.method">GET</stringProp>
        </HTTPSamplerProxy>

        <!-- Assertion: every response under 500ms (a hard per-request bound; the p95 gate itself is evaluated from aggregate results) -->
        <DurationAssertion guiclass="DurationAssertionGui" testclass="DurationAssertion" testname="Response Time < 500ms">
          <longProp name="DurationAssertion.duration">500</longProp>
        </DurationAssertion>

        <!-- Assertion: HTTP 200 OK -->
        <ResponseAssertion guiclass="AssertionGui" testclass="ResponseAssertion" testname="HTTP 200 OK">
          <collectionProp name="Asserion.test_strings">
            <stringProp name="49586">200</stringProp>
          </collectionProp>
          <stringProp name="Assertion.test_field">Assertion.response_code</stringProp>
        </ResponseAssertion>
      </hashTree>

      <!-- Thread Group: Write Operations (30% of traffic) -->
      <ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="Write Operations">
        <!-- 30% of the user pool -->
        <stringProp name="ThreadGroup.num_threads">${__groovy((${USERS} * 0.3) as int)}</stringProp>
        <stringProp name="ThreadGroup.ramp_time">${RAMP_UP}</stringProp>
        <stringProp name="ThreadGroup.duration">${DURATION}</stringProp>
        <boolProp name="ThreadGroup.scheduler">true</boolProp>
      </ThreadGroup>

      <hashTree>
        <!-- POST /api/audit-events (create event) -->
        <HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="POST Audit Event">
          <boolProp name="HTTPSampler.postBodyRaw">true</boolProp>
          <elementProp name="HTTPsampler.Arguments" elementType="Arguments">
            <collectionProp name="Arguments.arguments">
              <elementProp name="" elementType="HTTPArgument">
                <boolProp name="HTTPArgument.always_encode">false</boolProp>
                <stringProp name="Argument.value">{
  "tenantId": "${__UUID()}",
  "action": "UserLogin",
  "userId": "${__UUID()}",
  "timestamp": "${__time(yyyy-MM-dd'T'HH:mm:ss'Z')}"
}</stringProp>
              </elementProp>
            </collectionProp>
          </elementProp>
          <stringProp name="HTTPSampler.protocol">https</stringProp>
          <stringProp name="HTTPSampler.domain">${BASE_URL}</stringProp>
          <stringProp name="HTTPSampler.path">/api/audit-events</stringProp>
          <stringProp name="HTTPSampler.method">POST</stringProp>
          <stringProp name="HTTPSampler.contentEncoding">UTF-8</stringProp>
        </HTTPSamplerProxy>
        <!-- Content-Type: application/json must be supplied via an attached
             HeaderManager element; HTTPSamplerProxy has no header_manager property -->
      </hashTree>

      <!-- Listeners: Aggregate Report -->
      <ResultCollector guiclass="StatVisualizer" testclass="ResultCollector" testname="Aggregate Report">
        <boolProp name="ResultCollector.error_logging">false</boolProp>
        <objProp>
          <name>saveConfig</name>
          <value class="SampleSaveConfiguration">
            <time>true</time>
            <latency>true</latency>
            <timestamp>true</timestamp>
            <success>true</success>
            <label>true</label>
            <code>true</code>
            <message>true</message>
            <threadName>true</threadName>
            <dataType>true</dataType>
            <encoding>false</encoding>
            <assertions>true</assertions>
            <subresults>true</subresults>
            <responseData>false</responseData>
            <samplerData>false</samplerData>
            <xml>false</xml>
            <fieldNames>true</fieldNames>
            <responseHeaders>false</responseHeaders>
            <requestHeaders>false</requestHeaders>
            <responseDataOnError>false</responseDataOnError>
            <saveAssertionResultsFailureMessage>true</saveAssertionResultsFailureMessage>
            <assertionsResultsToSave>0</assertionsResultsToSave>
            <bytes>true</bytes>
            <sentBytes>true</sentBytes>
            <url>true</url>
            <threadCounts>true</threadCounts>
            <idleTime>true</idleTime>
            <connectTime>true</connectTime>
          </value>
        </objProp>
        <stringProp name="filename">load-test-results.jtl</stringProp>
      </ResultCollector>
    </hashTree>
  </hashTree>
</jmeterTestPlan>

Azure Pipelines Load Test Execution:

# Load Test Gate (Staging Environment)
- stage: Performance_Tests
  displayName: 'Performance Testing (Staging)'
  dependsOn: Deploy_Staging
  condition: succeeded()

  jobs:
  - job: LoadTest
    displayName: 'Run Load Tests'
    pool:
      vmImage: 'ubuntu-latest'

    steps:
    # Install JMeter
    - script: |
        wget https://archive.apache.org/dist/jmeter/binaries/apache-jmeter-5.6.2.tgz
        tar -xzf apache-jmeter-5.6.2.tgz
        sudo mv apache-jmeter-5.6.2 /opt/jmeter
        export PATH=$PATH:/opt/jmeter/bin
        jmeter --version
      displayName: 'Install JMeter'

    # Run load test
    - script: |
        /opt/jmeter/bin/jmeter \
          -n \
          -t load-tests/atp-load-test.jmx \
          -l load-test-results.jtl \
          -e \
          -o load-test-report \
          -JbaseUrl=$(StagingUrl) \
          -Jusers=500 \
          -JrampUp=60 \
          -Jduration=600
      displayName: 'Execute Load Test (10 minutes)'
      timeoutInMinutes: 15

    # Analyze results
    - task: PowerShell@2
      inputs:
        targetType: 'inline'
        script: |
          # Parse JMeter results
          $results = Import-Csv load-test-results.jtl -Delimiter ","

          # Calculate metrics
          # Calculate metrics (sort latencies once, reuse for all percentiles)
          $latencies = $results | Where-Object { $_.success -eq "true" } |
            ForEach-Object { [int]$_.elapsed } | Sort-Object
          $errors = $results | Where-Object { $_.success -eq "false" }

          $p50 = $latencies[[math]::Floor($latencies.Count * 0.50)]
          $p95 = $latencies[[math]::Floor($latencies.Count * 0.95)]
          $p99 = $latencies[[math]::Floor($latencies.Count * 0.99)]
          $errorRate = ($errors.Count / [math]::Max($results.Count, 1)) * 100
          $throughput = $results.Count / 600  # Total requests / 600 seconds

          Write-Host "Load Test Results:"
          Write-Host "  p50 Latency: ${p50}ms (threshold: <100ms)"
          Write-Host "  p95 Latency: ${p95}ms (threshold: <500ms)"
          Write-Host "  p99 Latency: ${p99}ms (threshold: <1000ms)"
          Write-Host "  Error Rate: $($errorRate.ToString('F2'))% (threshold: <0.1%)"
          Write-Host "  Throughput: $($throughput.ToString('F1')) RPS (threshold: ≥1000)"

          # Validate thresholds
          $failed = $false

          if ($p50 -gt 100) {
            Write-Warning "p50 latency exceeded threshold: ${p50}ms > 100ms"
          }

          if ($p95 -gt 500) {
            Write-Error "❌ p95 latency exceeded threshold: ${p95}ms > 500ms (BLOCKER)"
            $failed = $true
          }

          if ($p99 -gt 1000) {
            Write-Warning "p99 latency exceeded threshold: ${p99}ms > 1000ms"
          }

          if ($errorRate -gt 0.1) {
            Write-Error "❌ Error rate exceeded threshold: $($errorRate.ToString('F2'))% > 0.1% (BLOCKER)"
            $failed = $true
          }

          if ($throughput -lt 1000) {
            Write-Warning "Throughput below target: $($throughput.ToString('F1')) RPS < 1000 RPS"
          }

          if ($failed) {
            Write-Error "Load test failed; blocking production deployment"
            exit 1
          }

          Write-Host "`n✅ Load test passed all thresholds"
      displayName: 'Validate Load Test Results'

    # Publish results
    - task: PublishBuildArtifacts@1
      inputs:
        PathtoPublish: 'load-test-report'
        ArtifactName: 'load-test-report-$(Build.BuildNumber)'
      displayName: 'Publish Load Test Report'
      condition: always()
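On Linux agents the same JTL analysis can be sketched without PowerShell. The block below is a minimal, self-contained illustration: it generates a tiny demo JTL inline (the sample latencies are made up) and computes p95 and error rate with coreutils only; in the pipeline you would point it at `load-test-results.jtl` and drop the demo-data section. It assumes the default JTL field order, with `elapsed` in column 2 and `success` in column 8 (verify against your JTL header).

```shell
#!/bin/sh
# Demo JTL: header plus 10 successful samples and 1 failure (illustrative numbers).
JTL=$(mktemp)
{
  echo 'timeStamp,elapsed,label,responseCode,responseMessage,threadName,dataType,success'
  for ms in 80 90 100 120 450 470 480 490 495 700; do
    echo "0,$ms,GET /api/audit-events,200,OK,t1,text,true"
  done
  echo '0,900,GET /api/audit-events,500,Error,t1,text,false'
} > "$JTL"

# Successful-sample latencies sorted ascending (skip the CSV header row).
LAT=$(mktemp)
tail -n +2 "$JTL" | awk -F, '$8 == "true" { print $2 }' | sort -n > "$LAT"

TOTAL=$(tail -n +2 "$JTL" | wc -l | tr -d ' ')
OK=$(wc -l < "$LAT" | tr -d ' ')
P95_ROW=$(( (OK * 95 + 99) / 100 ))              # ceiling of OK * 0.95
P95=$(sed -n "${P95_ROW}p" "$LAT")
ERR_PCT=$(awk -v t="$TOTAL" -v ok="$OK" 'BEGIN { printf "%.2f", (t - ok) * 100 / t }')

echo "p95=${P95}ms errors=${ERR_PCT}%"
if [ "$P95" -le 500 ]; then echo "p95 gate: PASS"; else echo "p95 gate: FAIL (${P95}ms > 500ms)"; fi
```

With the demo numbers above the gate fails, which is exactly the behavior the pipeline relies on: a non-zero threshold breach blocks promotion.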

k6 Alternative (Modern Load Testing):

// load-test.js (k6)
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

// Custom metrics
const errorRate = new Rate('errors');
const latency = new Trend('latency');

// Test configuration
export const options = {
  stages: [
    { duration: '1m', target: 500 },   // Ramp-up to 500 users
    { duration: '10m', target: 500 },  // Hold at 500 users for 10 minutes
    { duration: '1m', target: 0 },     // Ramp-down
  ],
  thresholds: {
    'http_req_duration{type:read}': ['p(50)<100', 'p(95)<500', 'p(99)<1000'],  // Latency thresholds
    'http_req_duration{type:write}': ['p(95)<800'],  // Writes can be slower
    'http_req_failed': ['rate<0.001'],  // <0.1% error rate
    'http_reqs': ['rate>1000'],  // ≥1000 requests/second (note: 500 VUs with a 1s sleep caps at ~500 RPS; scale VUs or shorten sleep to reach this)
  },
};

const BASE_URL = __ENV.BASE_URL || 'https://atp-staging.azurewebsites.net';

export default function () {
  // 70% read operations
  if (Math.random() < 0.7) {
    const res = http.get(`${BASE_URL}/api/audit-events?pageSize=50`, {
      tags: { type: 'read' },
    });

    check(res, {
      'status is 200': (r) => r.status === 200,
      'latency < 500ms': (r) => r.timings.duration < 500,
    });

    errorRate.add(res.status !== 200);
    latency.add(res.timings.duration, { type: 'read' });
  }
  // 30% write operations
  else {
    const payload = JSON.stringify({
      tenantId: '12345678-1234-1234-1234-123456789012',
      action: 'UserLogin',
      userId: '87654321-4321-4321-4321-210987654321',
      timestamp: new Date().toISOString(),
    });

    const res = http.post(`${BASE_URL}/api/audit-events`, payload, {
      headers: { 'Content-Type': 'application/json' },
      tags: { type: 'write' },
    });

    check(res, {
      'status is 201': (r) => r.status === 201,
      'latency < 800ms': (r) => r.timings.duration < 800,
    });

    errorRate.add(res.status !== 201);
    latency.add(res.timings.duration, { type: 'write' });
  }

  sleep(1);  // 1 second delay between requests
}

// Run k6 in Azure Pipelines
// k6 run --out json=load-test-results.json load-test.js
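A pipeline step for the k6 path needs no separate analysis stage: k6 evaluates its own `thresholds` and exits non-zero when any is breached, which fails the step directly. A sketch using the public `grafana/k6` container image (the `load-tests/load-test.js` path and `StagingUrl` variable are assumptions matching the examples above):

```yaml
# k6 Load Test Gate (Staging) -- the exit code carries the pass/fail verdict
- script: |
    docker run --rm -i -e BASE_URL=$(StagingUrl) \
      grafana/k6 run --out json=load-test-results.json - < load-tests/load-test.js
  displayName: 'Execute k6 Load Test'
  timeoutInMinutes: 15
```

`k6 run -` reads the test script from stdin, so no file needs to be mounted into the container.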

Chaos Test Pass Rate

Purpose: Validate that ATP services degrade gracefully under failure conditions (pod restarts, network latency, storage unavailable).

Tool: Chaos Mesh (Kubernetes-native chaos engineering) or Azure Chaos Studio

Chaos Scenarios:

| Scenario | Description | Pass Rate | Blocker | Expected Behavior |
|---|---|---|---|---|
| Pod Restart | Random pod killed every 30s | 100% | ✅ Yes | Graceful shutdown, requests redistributed, no data loss, <5s recovery |
| Network Latency | 500ms latency added to pod | 95% | ❌ No | Timeouts honored, retries triggered, circuit breaker opens |
| Storage Unavailable | SQL/Blob down for 30s | 100% | ✅ Yes | Circuit breaker opens, degraded mode, cached data served, no cascading failures |
| CPU Throttle | Pod CPU limited to 50% | 90% | ❌ No | Graceful degradation, autoscaling triggered, no OOM kills |
| Memory Pressure | 80% memory consumed | 95% | ❌ No | GC triggered, cache eviction, no OOM exceptions |

Chaos Test Configuration (YAML):

# Chaos engineering tests
chaosTests:
  - scenario: pod_restart
    description: "Random pod restart every 30 seconds"
    duration: 300  # 5 minutes
    passRate: 100%  # Must handle gracefully
    blockerOnFail: true
    expectedBehavior:
      - Graceful shutdown (SIGTERM handled)
      - Requests redistributed to healthy pods
      - No data loss or corruption
      - Recovery time < 5 seconds

  - scenario: network_latency_500ms
    description: "Add 500ms network latency to pod"
    duration: 300
    passRate: 95%  # Allow some failures
    blockerOnFail: false
    expectedBehavior:
      - Timeouts honored (circuit breaker opens)
      - Retries triggered with exponential backoff
      - Graceful degradation (cached responses)

  - scenario: storage_unavailable
    description: "SQL/Blob storage unavailable for 30 seconds"
    duration: 300
    passRate: 100%  # Critical; must degrade gracefully
    blockerOnFail: true
    expectedBehavior:
      - Circuit breaker opens immediately
      - Degraded mode activated (read from cache)
      - No cascading failures to other services
      - Auto-recovery when storage returns

  - scenario: cpu_throttle
    description: "Limit pod CPU to 50% of requested"
    duration: 300
    passRate: 90%
    blockerOnFail: false
    expectedBehavior:
      - Autoscaling triggered (horizontal pod autoscaler)
      - Request queue managed (no rejections)
      - Graceful performance degradation

  - scenario: memory_pressure
    description: "Consume 80% of pod memory"
    duration: 300
    passRate: 95%
    blockerOnFail: false
    expectedBehavior:
      - Garbage collection triggered
      - Cache eviction (LRU policy)
      - No OutOfMemoryException

Chaos Mesh Manifests (Kubernetes):

# Chaos Experiment: Pod Restart
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-restart-test
  namespace: atp-staging
spec:
  action: pod-kill
  mode: one  # Kill one random pod
  selector:
    namespaces:
      - atp-staging
    labelSelectors:
      app: atp-ingestion

  scheduler:
    cron: "@every 30s"  # Kill pod every 30 seconds

  duration: "5m"  # Test duration

---
# Chaos Experiment: Network Latency
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-latency-test
  namespace: atp-staging
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - atp-staging
    labelSelectors:
      app: atp-ingestion

  delay:
    latency: "500ms"
    correlation: "50"  # 50% correlation between delays
    jitter: "100ms"

  direction: both  # Both ingress and egress
  duration: "5m"

---
# Chaos Experiment: Storage I/O Chaos
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: storage-unavailable-test
  namespace: atp-staging
spec:
  action: fault
  mode: one
  selector:
    namespaces:
      - atp-staging
    labelSelectors:
      app: atp-ingestion

  volumePath: /data
  path: /data/**/*

  errno: 5  # I/O error
  percent: 100  # 100% of I/O operations fail

  duration: "30s"  # Storage down for 30 seconds

  scheduler:
    cron: "@every 2m"  # Inject failure every 2 minutes

---
# Chaos Experiment: CPU Stress
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-throttle-test
  namespace: atp-staging
spec:
  mode: one
  selector:
    namespaces:
      - atp-staging
    labelSelectors:
      app: atp-ingestion

  stressors:
    cpu:
      workers: 2  # 2 CPU-intensive workers
      load: 50  # 50% CPU load

  duration: "5m"

---
# Chaos Experiment: Memory Pressure
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-pressure-test
  namespace: atp-staging
spec:
  mode: one
  selector:
    namespaces:
      - atp-staging
    labelSelectors:
      app: atp-ingestion

  stressors:
    memory:
      workers: 1
      size: "512MB"  # Consume 512MB (80% of 640MB pod limit)

  duration: "5m"

Chaos Test Execution (Azure Pipelines):

# Chaos Engineering Tests (Staging)
- job: ChaosTests
  displayName: 'Run Chaos Tests'
  dependsOn: LoadTest
  condition: succeeded()

  steps:
  # Connect to AKS cluster first (the Chaos Mesh installer applies manifests
  # to the cluster via kubectl, so credentials must already be in place)
  - task: AzureCLI@2
    inputs:
      scriptType: 'bash'
      scriptLocation: 'inlineScript'
      inlineScript: |
        az aks get-credentials \
          --resource-group ATP-Staging-RG \
          --name atp-aks-staging-eus
    displayName: 'Connect to AKS Staging'

  # Install Chaos Mesh into the staging cluster
  - script: |
      curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash
      kubectl get crd podchaos.chaos-mesh.org
    displayName: 'Install Chaos Mesh'

  # Scenario 1: Pod Restart
  - script: |
      # Apply chaos experiment
      kubectl apply -f chaos-tests/pod-restart.yaml

      # Wait for experiment to complete
      sleep 300

      # Check application metrics during chaos
      ERRORS=$(kubectl logs -l app=atp-ingestion --since=5m | grep -c ERROR || true)

      if [ "$ERRORS" -gt 10 ]; then
        echo "❌ Pod restart chaos test failed: $ERRORS errors detected"
        exit 1
      fi

      # Clean up experiment
      kubectl delete -f chaos-tests/pod-restart.yaml

      echo "✅ Pod restart chaos test passed"
    displayName: 'Chaos Test: Pod Restart'

  # Scenario 2: Network Latency
  - script: |
      kubectl apply -f chaos-tests/network-latency.yaml
      sleep 300

      # Validate circuit breaker opened (check metrics)
      # Validate circuit breaker opened (check metrics endpoint)
      CIRCUIT_BREAKER_OPENS=$(curl -s $(StagingUrl)/metrics | grep 'circuit_breaker_state{state="open"}' | awk '{print $2}')

      if [ "${CIRCUIT_BREAKER_OPENS:-0}" -eq 0 ]; then
        echo "⚠️  Circuit breaker did not open during network latency"
      fi

      kubectl delete -f chaos-tests/network-latency.yaml
      echo "✅ Network latency chaos test passed"
    displayName: 'Chaos Test: Network Latency'

  # Scenario 3: Storage Unavailable
  - script: |
      kubectl apply -f chaos-tests/storage-unavailable.yaml
      sleep 60  # Wait for 30s failure + 30s recovery

      # Validate no cascading failures (CrashLoopBackOff is a container waiting
      # reason, not a pod phase, so inspect containerStatuses)
      POD_STATUS=$(kubectl get pods -l app=atp-ingestion -o jsonpath='{.items[*].status.containerStatuses[*].state.waiting.reason}')

      if echo "$POD_STATUS" | grep -q "CrashLoopBackOff"; then
        echo "❌ Storage failure caused cascading pod crashes"
        exit 1
      fi

      kubectl delete -f chaos-tests/storage-unavailable.yaml
      echo "✅ Storage unavailable chaos test passed"
    displayName: 'Chaos Test: Storage Unavailable'
    continueOnError: false  # BLOCKER: Fail if storage failure causes crashes

Performance Baseline & Regression Detection

Purpose: Track performance trends over time and detect regressions (latency increases, throughput decreases).

Performance Baseline Tracking:

// Track performance baseline per build
public class PerformanceBaselineTracker
{
    public async Task RecordBaselineAsync(string buildId, PerformanceMetrics metrics)
    {
        var baseline = new PerformanceBaseline
        {
            BuildId = buildId,
            BuildNumber = await GetBuildNumberAsync(buildId),
            Timestamp = DateTime.UtcNow,

            P50Latency = metrics.P50Latency,
            P95Latency = metrics.P95Latency,
            P99Latency = metrics.P99Latency,
            ErrorRate = metrics.ErrorRate,
            Throughput = metrics.Throughput,

            CpuUtilization = metrics.CpuUtilization,
            MemoryUtilization = metrics.MemoryUtilization,
            DatabaseUtilization = metrics.DatabaseUtilization
        };

        await _cosmosClient.UpsertAsync(baseline);
    }

    public async Task<PerformanceRegressionResult> DetectRegressionAsync(
        string currentBuildId,
        PerformanceMetrics currentMetrics)
    {
        // Get last 10 builds for baseline
        var historicalBaselines = await _cosmosClient.QueryAsync<PerformanceBaseline>(
            q => q.OrderByDescending(b => b.Timestamp).Take(10));

        var avgP95 = historicalBaselines.Average(b => b.P95Latency);
        var avgErrorRate = historicalBaselines.Average(b => b.ErrorRate);
        var avgThroughput = historicalBaselines.Average(b => b.Throughput);

        var regression = new PerformanceRegressionResult();

        // Detect p95 latency regression (>20% increase)
        if (currentMetrics.P95Latency > avgP95 * 1.2)
        {
            regression.HasRegression = true;
            regression.Issues.Add($"p95 latency increased {((currentMetrics.P95Latency / avgP95 - 1) * 100):F1}% from baseline");
        }

        // Detect error rate regression (>50% increase)
        if (currentMetrics.ErrorRate > avgErrorRate * 1.5)
        {
            regression.HasRegression = true;
            regression.Issues.Add($"Error rate increased {((currentMetrics.ErrorRate / avgErrorRate - 1) * 100):F1}% from baseline");
        }

        // Detect throughput regression (>10% decrease)
        if (currentMetrics.Throughput < avgThroughput * 0.9)
        {
            regression.HasRegression = true;
            regression.Issues.Add($"Throughput decreased {((1 - currentMetrics.Throughput / avgThroughput) * 100):F1}% from baseline");
        }

        if (regression.HasRegression)
        {
            // Create work item for investigation
            await CreatePerformanceRegressionWorkItemAsync(currentBuildId, regression);
        }

        return regression;
    }
}
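The same three regression rules can be applied in a pipeline step directly from raw numbers; a minimal sketch mirroring the C# thresholds above (the sample values are illustrative, not measured baselines):

```shell
#!/bin/sh
# Regression rules: p95 > baseline*1.2, error rate > baseline*1.5,
# throughput < baseline*0.9. Sample numbers are illustrative only.
BASELINE_P95=420   CURRENT_P95=530     # ms
BASELINE_ERR=0.04  CURRENT_ERR=0.05    # percent
BASELINE_RPS=1200  CURRENT_RPS=1150

REGRESSED=0
if awk -v c="$CURRENT_P95" -v b="$BASELINE_P95" 'BEGIN { exit !(c > b * 1.2) }'; then
  echo "p95 regression: ${CURRENT_P95}ms vs ${BASELINE_P95}ms baseline"; REGRESSED=1
fi
if awk -v c="$CURRENT_ERR" -v b="$BASELINE_ERR" 'BEGIN { exit !(c > b * 1.5) }'; then
  echo "error-rate regression: ${CURRENT_ERR}% vs ${BASELINE_ERR}% baseline"; REGRESSED=1
fi
if awk -v c="$CURRENT_RPS" -v b="$BASELINE_RPS" 'BEGIN { exit !(c < b * 0.9) }'; then
  echo "throughput regression: ${CURRENT_RPS} vs ${BASELINE_RPS} RPS baseline"; REGRESSED=1
fi

echo "regressed=$REGRESSED"
```

awk handles the floating-point comparisons, which plain POSIX shell arithmetic cannot; with the sample numbers only the p95 rule trips (530ms exceeds 420ms × 1.2 = 504ms).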

Performance Optimization Guidance

Purpose: Provide actionable remediation when performance gates fail.

Common Performance Issues & Fixes:

| Issue | Symptom | Root Cause | Remediation | Typical Time |
|---|---|---|---|---|
| High p95 Latency | p95 > 500ms | N+1 queries, missing indexes | Add `.Include()` for EF Core, create indexes | 2-8 hours |
| High Error Rate | Error rate > 0.1% | Race conditions, deadlocks | Add retries, pessimistic locking, idempotency | 1-3 days |
| Low Throughput | < 1000 RPS | Synchronous I/O, thread pool exhaustion | Use async/await, increase thread pool, add caching | 1-2 days |
| Memory Leak | Memory grows over time | Undisposed objects, event handler leaks | Implement IDisposable, remove event handlers | 1-2 days |
| Database Bottleneck | High DTU/RU utilization | Missing indexes, inefficient queries | Add indexes, optimize queries, use read replicas | 4 hours - 2 days |
| Cache Miss Rate | High latency on reads | Insufficient cache warming, short TTL | Increase TTL, implement cache warming, add Redis cluster | 4-8 hours |

Performance Profiling (dotnet-trace):

#!/bin/bash
# profile-performance.sh

POD_NAME=$1  # Kubernetes pod name

echo "Collecting performance trace from pod: $POD_NAME"

# Install dotnet-trace in pod
kubectl exec $POD_NAME -- bash -c "dotnet tool install --global dotnet-trace"

# Collect 60-second CPU trace (speedscope format writes trace.speedscope.json)
kubectl exec $POD_NAME -- bash -c "/root/.dotnet/tools/dotnet-trace collect --process-id 1 --duration 00:01:00 --format speedscope --output /tmp/trace.nettrace"

# Copy trace file locally
kubectl cp $POD_NAME:/tmp/trace.speedscope.json ./performance-trace.json

# Analyze trace (upload to https://speedscope.app)
echo "✅ Performance trace collected: performance-trace.json"
echo "   Upload to https://speedscope.app for analysis"

Summary

  • Performance Gates: 15-25 minute execution in staging; block production if thresholds exceeded
  • Load Test Thresholds: p50 <100ms (warning), p95 <500ms (blocker), p99 <1000ms (warning), error rate <0.1% (blocker), throughput ≥1000 RPS (info)
  • Load Test Configuration: 500 concurrent users, 60s ramp-up, 10-minute duration, 70% read / 30% write mix
  • JMeter Test Plan: Complete XML with thread groups, HTTP samplers, assertions, result collectors
  • k6 Alternative: Modern load testing with JavaScript DSL, custom metrics, threshold definitions
  • Azure Pipelines Load Test: Install JMeter, execute test, analyze results (PowerShell parsing), publish HTML report
  • Chaos Test Scenarios: 5 scenarios (pod restart, network latency, storage unavailable, CPU throttle, memory pressure)
  • Chaos Pass Rates: Pod restart (100%, blocker), network latency (95%, non-blocker), storage unavailable (100%, blocker), CPU (90%), memory (95%)
  • Chaos Mesh Manifests: 5 Kubernetes YAML manifests (PodChaos, NetworkChaos, IOChaos, StressChaos for CPU/memory)
  • Azure Pipelines Chaos Tests: 3-scenario execution with validation (pod restart, network latency, storage unavailable)
  • Performance Baseline: C# tracker recording metrics per build; regression detection (>20% latency increase, >50% error increase, >10% throughput decrease)
  • Performance Issues Table: 6 common issues with root causes and remediation times (2 hours - 3 days)
  • Performance Profiling: Bash script using dotnet-trace for CPU profiling in Kubernetes pods

Observability Gates (Deep Dive)

Observability gates validate that ATP services emit structured logs, distributed traces, custom metrics, and expose health check endpoints required for production monitoring, alerting, and incident response. These gates execute in all environments and block production deployment if observability requirements are not met.

Philosophy: Observability is not optional—production services must be fully observable (logs, traces, metrics, health checks) to enable rapid incident response (<15 minutes MTTR) and proactive issue detection (alerts before users notice).

Observability Gate Workflow

graph TD
    A[Performance Gates Passed] --> B[Validate OpenTelemetry]
    B --> C{All Endpoints Instrumented?}
    C -->|No| D[Missing Instrumentation ❌]
    C -->|Yes| E{Database Calls Instrumented?}
    E -->|No| F[Missing DB Instrumentation ❌]
    E -->|Yes| G{Custom Metrics Present?}
    G -->|No| H[Missing Metrics ❌]
    G -->|Yes| I{Health Checks Valid?}
    I -->|No| J[Invalid Health Checks ❌]
    I -->|Yes| Q{Structured Logging?}
    Q -->|No| K[Unstructured Logs ❌]
    Q -->|Yes| L{Trace Context Propagated?}
    L -->|No| M[Trace Propagation Failed ❌]
    L -->|Yes| N[Observability Gates Passed ✅]

    D --> O[Block Production Deployment]
    F --> O
    H --> O
    J --> O
    K --> O
    M --> O
    N --> P[Ready for Production]

    style D fill:#ff6b6b
    style F fill:#ff6b6b
    style H fill:#ff6b6b
    style J fill:#ff6b6b
    style K fill:#ff6b6b
    style M fill:#ff6b6b
    style N fill:#90EE90

Typical Observability Gate Duration: 5-10 minutes (validation scripts + health check tests)


OpenTelemetry Validation

Purpose: Ensure all ATP services are fully instrumented with OpenTelemetry for distributed tracing, custom metrics, and log correlation.

Validation Checks:

| Check | Requirement | Blocker | Rationale |
|---|---|---|---|
| HTTP Endpoints Instrumented | All endpoints have activity source spans | ✅ Yes | Without spans, requests are invisible in traces |
| Database Calls Instrumented | All EF Core / ADO.NET calls instrumented | ✅ Yes | Database operations are critical path; must be visible |
| Custom Metrics Present | Business KPIs emitted (audit records, queries) | ✅ Yes | Business metrics enable SLO tracking and capacity planning |
| Trace Context Propagated | Trace context passed via HTTP headers | ✅ Yes | Without propagation, distributed traces are incomplete |
| Activity Source Naming | Consistent naming: `ConnectSoft.ATP.{Service}` | ⚠️ Warning | Enables filtering and aggregation in observability tools |
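On the wire, the "Trace Context Propagated" check corresponds to a W3C `traceparent` header (`version-traceid-spanid-flags`). OpenTelemetry's ASP.NET Core and HttpClient instrumentation emit and parse this automatically; the sketch below just builds and shape-checks one locally to show what a validator should expect.

```shell
#!/bin/sh
# Build a W3C traceparent: 2-hex version, 32-hex trace-id, 16-hex span-id, 2-hex flags.
TRACE_ID=$(head -c16 /dev/urandom | od -An -tx1 | tr -d ' \n')
SPAN_ID=$(head -c8 /dev/urandom | od -An -tx1 | tr -d ' \n')
TRACEPARENT="00-${TRACE_ID}-${SPAN_ID}-01"
echo "traceparent: $TRACEPARENT"

# Shape check a downstream service would perform before joining the trace.
if echo "$TRACEPARENT" | grep -Eq '^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$'; then
  VALID=yes
else
  VALID=no
fi
echo "valid trace context: $VALID"
```

A missing or malformed header is what makes a downstream span start a new trace instead of joining the caller's, which is exactly the "incomplete distributed traces" failure the gate guards against.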

OpenTelemetry Validation Script (PowerShell):

# scripts/validate-otel.ps1
param(
    [Parameter(Mandatory=$true)]
    [string]$Path,

    [Parameter(Mandatory=$false)]
    [string]$ServiceName = "ATP"
)

Write-Host "🔍 Validating OpenTelemetry instrumentation for $ServiceName services..." -ForegroundColor Cyan

$errors = @()
$warnings = @()

# Find all C# projects
$csprojFiles = Get-ChildItem -Path $Path -Filter "*.csproj" -Recurse | Where-Object {
    $_.FullName -notlike "*\Test\*" -and $_.FullName -notlike "*\Tests\*"
}

foreach ($project in $csprojFiles) {
    $projectPath = $project.DirectoryName
    $projectName = $project.BaseName

    Write-Host "`n📦 Validating project: $projectName" -ForegroundColor Yellow

    # Check 1: OpenTelemetry NuGet packages present
    $csprojContent = Get-Content $project.FullName -Raw
    if ($csprojContent -notmatch "OpenTelemetry") {
        $warnings += "⚠️  $projectName: Missing OpenTelemetry NuGet packages"
        Write-Host "  ⚠️  Missing OpenTelemetry packages" -ForegroundColor Yellow
    } else {
        Write-Host "  ✅ OpenTelemetry packages found" -ForegroundColor Green

        # Check for required packages
        $requiredPackages = @(
            "OpenTelemetry.Exporter.Console",
            "OpenTelemetry.Exporter.OTLP",
            "OpenTelemetry.Extensions.Hosting",
            "OpenTelemetry.Instrumentation.AspNetCore",
            "OpenTelemetry.Instrumentation.Http"
        )

        foreach ($pkg in $requiredPackages) {
            if ($csprojContent -notmatch [regex]::Escape($pkg)) {
                $warnings += "⚠️  $projectName: Missing recommended package: $pkg"
            }
        }
    }

    # Check 2: ActivitySource registration in Startup/Program.cs
    $programFiles = @(
        "$projectPath\Program.cs",
        "$projectPath\Startup.cs",
        "$projectPath\DependencyInjection.cs"
    )

    $foundActivitySource = $false
    foreach ($file in $programFiles) {
        if (Test-Path $file) {
            $content = Get-Content $file -Raw
            if ($content -match "ActivitySource|AddSource") {
                $foundActivitySource = $true
                Write-Host "  ✅ ActivitySource registered" -ForegroundColor Green
                break
            }
        }
    }

    if (-not $foundActivitySource) {
        $errors += "❌ $projectName: No ActivitySource registration found in Program.cs/Startup.cs"
        Write-Host "  ❌ No ActivitySource registration" -ForegroundColor Red
    }

    # Check 3: HTTP endpoints have activity source spans
    $controllerFiles = Get-ChildItem -Path $projectPath -Filter "*Controller.cs" -Recurse
    $endpointFiles = Get-ChildItem -Path $projectPath -Filter "*Endpoints.cs" -Recurse

    $allEndpoints = @($controllerFiles) + @($endpointFiles)

    if ($allEndpoints.Count -eq 0) {
        Write-Host "  ⚠️  No controllers/endpoints found (may be library project)" -ForegroundColor Yellow
        continue
    }

    foreach ($endpointFile in $allEndpoints) {
        $endpointContent = Get-Content $endpointFile.FullName -Raw

        # Check for HTTP methods
        if ($endpointContent -match "\[HttpGet\]|\[HttpPost\]|\[HttpPut\]|\[HttpDelete\]") {
            # Check if method uses ActivitySource or Activity.Current
            if ($endpointContent -notmatch "ActivitySource|Activity\.Current|Activity\.Start") {
                $errors += "❌ $projectName\$($endpointFile.Name): HTTP endpoint missing ActivitySource instrumentation"
                Write-Host "  ❌ $($endpointFile.Name): Missing instrumentation" -ForegroundColor Red
            } else {
                Write-Host "  ✅ $($endpointFile.Name): Instrumented" -ForegroundColor Green
            }
        }
    }

    # Check 4: Database calls instrumented (EF Core / ADO.NET)
    $dbFiles = Get-ChildItem -Path $projectPath -Filter "*DbContext.cs" -Recurse
    $repositoryFiles = Get-ChildItem -Path $projectPath -Filter "*Repository.cs" -Recurse

    $allDbFiles = @($dbFiles) + @($repositoryFiles)

    foreach ($dbFile in $allDbFiles) {
        $dbContent = Get-Content $dbFile.FullName -Raw

        # Check for database operations
        if ($dbContent -match "\.SaveChanges|\.Execute|\.Query|\.QueryAsync|\.Command") {
            # Check if EF Core instrumentation is enabled
            if ($csprojContent -notmatch "OpenTelemetry\.Instrumentation\.EntityFrameworkCore") {
                $errors += "❌ $projectName\$($dbFile.Name): Database calls present but EF Core instrumentation missing"
                Write-Host "  ❌ $($dbFile.Name): Missing EF Core instrumentation" -ForegroundColor Red
            } else {
                Write-Host "  ✅ $($dbFile.Name): EF Core instrumentation enabled" -ForegroundColor Green
            }
        }
    }

    # Check 5: Custom metrics present
    $serviceFiles = Get-ChildItem -Path $projectPath -Filter "*Service.cs" -Recurse
    $foundMetrics = $false

    foreach ($serviceFile in $serviceFiles) {
        $serviceContent = Get-Content $serviceFile.FullName -Raw

        if ($serviceContent -match "Meter\.Create|CreateCounter|CreateHistogram|CreateGauge|CreateUpDownCounter") {
            $foundMetrics = $true
            Write-Host "  ✅ Custom metrics found in $($serviceFile.Name)" -ForegroundColor Green
            break
        }
    }

    if (-not $foundMetrics -and $serviceFiles.Count -gt 0) {
        $warnings += "⚠️  $projectName: No custom metrics found (business KPIs recommended)"
        Write-Host "  ⚠️  No custom metrics found" -ForegroundColor Yellow
    }

    # Check 6: Trace context propagation (HTTP client instrumentation)
    $httpClientFiles = Get-ChildItem -Path $projectPath -Filter "*Client.cs" -Recurse
    $foundHttpClientInstrumentation = $false

    foreach ($clientFile in $httpClientFiles) {
        $clientContent = Get-Content $clientFile.FullName -Raw

        if ($clientContent -match "HttpClient|IHttpClientFactory") {
            if ($csprojContent -match "OpenTelemetry\.Instrumentation\.Http") {
                $foundHttpClientInstrumentation = $true
                Write-Host "  ✅ HTTP client instrumentation enabled" -ForegroundColor Green
                break
            }
        }
    }

    if (-not $foundHttpClientInstrumentation -and $httpClientFiles.Count -gt 0) {
        $warnings += "⚠️  $projectName: HTTP clients present but instrumentation may be missing"
        Write-Host "  ⚠️  HTTP client instrumentation not verified" -ForegroundColor Yellow
    }
}

# Summary
Write-Host "`n" -NoNewline
Write-Host ("=" * 80) -ForegroundColor Cyan
Write-Host "Validation Summary" -ForegroundColor Cyan
Write-Host ("=" * 80) -ForegroundColor Cyan

if ($errors.Count -gt 0) {
    Write-Host "`n❌ ERRORS ($($errors.Count)):" -ForegroundColor Red
    foreach ($validationError in $errors) {
        Write-Host "  $validationError" -ForegroundColor Red
    }
    Write-Host "`n❌ OpenTelemetry validation FAILED. Fix errors before deployment." -ForegroundColor Red
    exit 1
}

if ($warnings.Count -gt 0) {
    Write-Host "`n⚠️  WARNINGS ($($warnings.Count)):" -ForegroundColor Yellow
    foreach ($warning in $warnings) {
        Write-Host "  $warning" -ForegroundColor Yellow
    }
}

Write-Host "`n✅ OpenTelemetry validation PASSED" -ForegroundColor Green
exit 0

Azure Pipelines Integration:

# Observability Gate: OpenTelemetry Validation
- stage: Observability_Gates
  displayName: 'Observability Validation'
  dependsOn: Build_Test_Publish
  condition: succeeded()

  jobs:
  - job: ValidateObservability
    displayName: 'Validate OpenTelemetry & Health Checks'
    pool:
      vmImage: 'windows-latest'  # PowerShell script requires Windows

    steps:
    # Validate OpenTelemetry instrumentation
    - task: PowerShell@2
      inputs:
        targetType: 'filePath'
        filePath: '$(Build.SourcesDirectory)/scripts/validate-otel.ps1'
        arguments: '-Path "$(Build.SourcesDirectory)" -ServiceName "ATP"'
      displayName: 'Validate OpenTelemetry Instrumentation'
      continueOnError: false  # BLOCKER: Fail if instrumentation missing

    # Run additional Roslyn analyzer checks
    - task: DotNetCoreCLI@2
      inputs:
        command: 'build'
        projects: '**/*.csproj'
        arguments: '/p:EnforceOpenTelemetry=true /p:TreatWarningsAsErrors=true'
      displayName: 'Build with OpenTelemetry Enforcement'
      continueOnError: false

C# OpenTelemetry Setup Example (Required Pattern):

// Program.cs (ATP Ingestion Service)
using OpenTelemetry;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
using System.Diagnostics;

var builder = WebApplication.CreateBuilder(args);

// Configure OpenTelemetry Resource
var resourceBuilder = ResourceBuilder.CreateDefault()
    .AddAttributes(new Dictionary<string, object>
    {
        ["service.name"] = "atp-ingestion",
        ["service.namespace"] = "ConnectSoft.ATP",
        ["deployment.environment"] = builder.Environment.EnvironmentName
    });

// Configure Tracing
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetResourceBuilder(resourceBuilder)
        .AddAspNetCoreInstrumentation(options =>
        {
            options.RecordException = true;
            options.EnrichWithHttpRequest = (activity, request) =>
            {
                activity.SetTag("http.user_agent", request.Headers["User-Agent"].ToString());
                activity.SetTag("http.request_id", request.Headers["X-Request-Id"].ToString());
            };
        })
        .AddHttpClientInstrumentation(options =>
        {
            options.RecordException = true;
            options.EnrichWithHttpRequestMessage = (activity, request) =>
            {
                activity.SetTag("http.client.name", request.RequestUri?.Host);
            };
        })
        .AddEntityFrameworkCoreInstrumentation(options =>
        {
            options.SetDbStatementForText = true;
            options.EnrichWithIDbCommand = (activity, command) =>
            {
                activity.SetTag("db.statement.type", command.CommandType.ToString());
            };
        })
        .AddSource("ConnectSoft.ATP.Ingestion")  // Custom activity source
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri(builder.Configuration["OpenTelemetry:OtlpEndpoint"] 
                ?? "http://otel-collector:4317");
        })
    )
    .WithMetrics(metrics => metrics
        .SetResourceBuilder(resourceBuilder)
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()  // GC, thread pool metrics
        .AddMeter("ConnectSoft.ATP.Ingestion")  // Custom metrics
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri(builder.Configuration["OpenTelemetry:OtlpEndpoint"] 
                ?? "http://otel-collector:4317");
        })
    );

// Configure Logging
builder.Logging.AddOpenTelemetry(logging =>
{
    logging.SetResourceBuilder(resourceBuilder)
        .AddOtlpExporter(exporterOptions =>
        {
            exporterOptions.Endpoint = new Uri(builder.Configuration["OpenTelemetry:OtlpEndpoint"] 
                ?? "http://otel-collector:4317");
        });
});

// Custom ActivitySource for business operations
var activitySource = new ActivitySource("ConnectSoft.ATP.Ingestion");

// Custom Metrics Meter
var meter = new Meter("ConnectSoft.ATP.Ingestion", "1.0.0");
var auditRecordsIngested = meter.CreateCounter<long>(
    "atp.audit_records_ingested_total",
    "records",
    "Total number of audit records ingested"
);
var auditRecordsIngestedLatency = meter.CreateHistogram<double>(
    "atp.audit_records_ingested_duration_ms",
    "milliseconds",
    "Latency of audit record ingestion"
);

// Health checks must be registered before the endpoint is mapped
builder.Services.AddHealthChecks();

var app = builder.Build();

// Health check endpoint (validated separately)
app.MapHealthChecks("/health");

app.Run();

Custom ActivitySource Usage Example:

// Controllers/AuditEventsController.cs
using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.AspNetCore.Mvc;
using OpenTelemetry.Trace;  // RecordException extension for Activity

[ApiController]
[Route("api/[controller]")]
public class AuditEventsController : ControllerBase
{
    private static readonly ActivitySource ActivitySource = new("ConnectSoft.ATP.Ingestion");
    private readonly IAuditEventService _auditEventService;
    private readonly Meter _meter;
    private readonly Counter<long> _recordsIngested;

    public AuditEventsController(IAuditEventService auditEventService, Meter meter)
    {
        _auditEventService = auditEventService;
        _meter = meter;
        _recordsIngested = _meter.CreateCounter<long>("atp.audit_records_ingested_total");
    }

    [HttpPost]
    public async Task<IActionResult> CreateAuditEvent([FromBody] CreateAuditEventRequest request)
    {
        // Create activity for this operation
        using var activity = ActivitySource.StartActivity("IngestAuditEvent");

        activity?.SetTag("audit.event.action", request.Action);
        activity?.SetTag("audit.event.tenant_id", request.TenantId);
        activity?.SetTag("audit.event.user_id", request.UserId);

        try
        {
            var stopwatch = Stopwatch.StartNew();

            var result = await _auditEventService.IngestAsync(request);

            stopwatch.Stop();

            // Record custom metric
            _recordsIngested.Add(1,
                new KeyValuePair<string, object?>("action", request.Action),
                new KeyValuePair<string, object?>("tenant_id", request.TenantId));

            // Record latency metric
            activity?.SetTag("audit.ingestion.duration_ms", stopwatch.ElapsedMilliseconds);

            activity?.SetStatus(ActivityStatusCode.Ok);

            return CreatedAtAction(nameof(GetAuditEvent), new { id = result.Id }, result);
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.RecordException(ex);
            throw;
        }
    }
}

Health Check Validation

Purpose: Ensure all ATP services expose liveness and readiness health check endpoints required for Kubernetes deployments and Azure App Service health monitoring.

Health Check Requirements:

| Endpoint | Purpose | Status Codes | Blocker |
|----------|---------|--------------|---------|
| `/health/live` | Liveness probe (Kubernetes) | 200 (healthy), 503 (unhealthy) | ✅ Yes |
| `/health/ready` | Readiness probe (Kubernetes) | 200 (ready), 503 (not ready) | ✅ Yes |
| `/health/startup` | Startup probe (Kubernetes) | 200 (started), 503 (starting) | ⚠️ Warning |
| `/health` | Aggregated health (Azure App Service) | 200 (healthy), 503 (unhealthy) | ✅ Yes |
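The probe endpoints above map onto a Kubernetes container spec roughly as follows. This is an illustrative sketch: the port and all timing thresholds are assumptions, not ATP-mandated defaults.

```yaml
# Illustrative Kubernetes probe configuration for an ATP service container.
# Port 8080, periods, and thresholds are example values, not ATP defaults.
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 3        # Restart the container after ~30s of failures
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 10
  failureThreshold: 3        # Remove the pod from service endpoints
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  periodSeconds: 5
  failureThreshold: 30       # Allow up to ~150s for slow startups
```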

Health Check Dependencies (Readiness Probe):

| Dependency | Check Type | Timeout | Failure Impact |
|------------|-----------|---------|----------------|
| Database | SQL connectivity + query | 5s | Service not ready (503) |
| Message Bus | Connection + queue availability | 5s | Service not ready (503) |
| Cache (Redis) | Connection + ping | 3s | Service degraded (200 with warning) |
| Blob Storage | Container existence | 5s | Service not ready (503) |
| Key Vault | Secret retrieval | 5s | Service not ready (503) |
| External APIs | HTTP health endpoint | 10s | Service degraded (200 with warning) |
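A readiness response covering these dependencies might look like the JSON below. The shape (top-level `status` plus a `checks` array with `name` and `status` fields) is what the validation script parses; all field values here are illustrative.

```json
{
  "status": "Degraded",
  "timestamp": "2024-01-15T10:30:00Z",
  "checks": [
    { "name": "database",   "status": "Healthy",  "duration": 42.1 },
    { "name": "messagebus", "status": "Healthy",  "duration": 15.7 },
    { "name": "redis",      "status": "Degraded", "duration": 3001.0,
      "description": "Ping exceeded 3s timeout" }
  ]
}
```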

Health Check Validation Script (PowerShell):

# scripts/validate-health-checks.ps1
param(
    [Parameter(Mandatory=$true)]
    [string]$ServiceUrl,

    [Parameter(Mandatory=$false)]
    [int]$TimeoutSeconds = 30
)

Write-Host "🏥 Validating health check endpoints for: $ServiceUrl" -ForegroundColor Cyan

$errors = @()
$warnings = @()

# Test 1: Liveness Probe (/health/live)
Write-Host "`n1. Testing liveness probe (/health/live)..." -ForegroundColor Yellow
try {
    $response = Invoke-WebRequest -Uri "$ServiceUrl/health/live" `
        -Method Get `
        -TimeoutSec $TimeoutSeconds `
        -UseBasicParsing `
        -ErrorAction Stop

    if ($response.StatusCode -eq 200) {
        Write-Host "  ✅ Liveness probe returns 200 OK" -ForegroundColor Green
    } else {
        $errors += "❌ Liveness probe returned status code $($response.StatusCode) (expected 200)"
        Write-Host "  ❌ Unexpected status code: $($response.StatusCode)" -ForegroundColor Red
    }
}
catch {
    $errors += "❌ Liveness probe failed: $_"
    Write-Host "  ❌ Liveness probe failed: $_" -ForegroundColor Red
}

# Test 2: Readiness Probe (/health/ready)
Write-Host "`n2. Testing readiness probe (/health/ready)..." -ForegroundColor Yellow
try {
    $response = Invoke-WebRequest -Uri "$ServiceUrl/health/ready" `
        -Method Get `
        -TimeoutSec $TimeoutSeconds `
        -UseBasicParsing `
        -ErrorAction Stop

    if ($response.StatusCode -eq 200) {
        Write-Host "  ✅ Readiness probe returns 200 OK" -ForegroundColor Green

        # Parse health check response (JSON)
        $healthData = $response.Content | ConvertFrom-Json

        # Validate required dependencies
        $requiredDependencies = @("database", "messagebus")
        foreach ($dep in $requiredDependencies) {
            if ($healthData.checks -and ($healthData.checks | Where-Object { $_.name -eq $dep })) {
                $depCheck = $healthData.checks | Where-Object { $_.name -eq $dep } | Select-Object -First 1
                if ($depCheck.status -eq "Healthy") {
                    Write-Host "    ✅ $dep dependency is healthy" -ForegroundColor Green
                } else {
                    $errors += "❌ Required dependency '$dep' is not healthy: $($depCheck.status)"
                    Write-Host "    ❌ $dep dependency unhealthy: $($depCheck.status)" -ForegroundColor Red
                }
            } else {
                $warnings += "⚠️  Required dependency '$dep' not found in health check response"
                Write-Host "    ⚠️  Dependency '$dep' not checked" -ForegroundColor Yellow
            }
        }

        # Validate optional dependencies
        $optionalDependencies = @("redis", "blobstorage", "keyvault")
        foreach ($dep in $optionalDependencies) {
            if ($healthData.checks -and ($healthData.checks | Where-Object { $_.name -eq $dep })) {
                $depCheck = $healthData.checks | Where-Object { $_.name -eq $dep } | Select-Object -First 1
                if ($depCheck.status -eq "Healthy") {
                    Write-Host "    ✅ $dep dependency is healthy" -ForegroundColor Green
                } else {
                    $warnings += "⚠️  Optional dependency '$dep' is not healthy: $($depCheck.status)"
                    Write-Host "    ⚠️  $dep dependency unhealthy: $($depCheck.status)" -ForegroundColor Yellow
                }
            }
        }
    } else {
        $errors += "❌ Readiness probe returned status code $($response.StatusCode) (expected 200)"
        Write-Host "  ❌ Unexpected status code: $($response.StatusCode)" -ForegroundColor Red
    }
}
catch {
    $errors += "❌ Readiness probe failed: $_"
    Write-Host "  ❌ Readiness probe failed: $_" -ForegroundColor Red
}

# Test 3: Startup Probe (/health/startup) - Optional
Write-Host "`n3. Testing startup probe (/health/startup)..." -ForegroundColor Yellow
try {
    $response = Invoke-WebRequest -Uri "$ServiceUrl/health/startup" `
        -Method Get `
        -TimeoutSec $TimeoutSeconds `
        -UseBasicParsing `
        -ErrorAction Stop

    if ($response.StatusCode -eq 200) {
        Write-Host "  ✅ Startup probe returns 200 OK" -ForegroundColor Green
    } else {
        $warnings += "⚠️  Startup probe returned status code $($response.StatusCode) (optional endpoint)"
        Write-Host "  ⚠️  Unexpected status code: $($response.StatusCode)" -ForegroundColor Yellow
    }
}
catch {
    $warnings += "⚠️  Startup probe not available (optional endpoint)"
    Write-Host "  ⚠️  Startup probe not available (optional)" -ForegroundColor Yellow
}

# Test 4: Aggregated Health (/health)
Write-Host "`n4. Testing aggregated health endpoint (/health)..." -ForegroundColor Yellow
try {
    $response = Invoke-WebRequest -Uri "$ServiceUrl/health" `
        -Method Get `
        -TimeoutSec $TimeoutSeconds `
        -UseBasicParsing `
        -ErrorAction Stop

    if ($response.StatusCode -eq 200) {
        Write-Host "  ✅ Aggregated health endpoint returns 200 OK" -ForegroundColor Green
    } else {
        $errors += "❌ Aggregated health endpoint returned status code $($response.StatusCode) (expected 200)"
        Write-Host "  ❌ Unexpected status code: $($response.StatusCode)" -ForegroundColor Red
    }
}
catch {
    $errors += "❌ Aggregated health endpoint failed: $_"
    Write-Host "  ❌ Aggregated health endpoint failed: $_" -ForegroundColor Red
}

# Test 5: Response Time (health checks should be fast)
Write-Host "`n5. Testing health check response times..." -ForegroundColor Yellow
$endpoints = @("/health/live", "/health/ready", "/health")
foreach ($endpoint in $endpoints) {
    try {
        $stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
        $response = Invoke-WebRequest -Uri "$ServiceUrl$endpoint" `
            -Method Get `
            -TimeoutSec 5 `
            -UseBasicParsing `
            -ErrorAction Stop
        $stopwatch.Stop()

        $elapsedMs = $stopwatch.ElapsedMilliseconds

        if ($elapsedMs -lt 1000) {
            Write-Host "  ✅ $endpoint: ${elapsedMs}ms (fast)" -ForegroundColor Green
        } elseif ($elapsedMs -lt 5000) {
            Write-Host "  ⚠️  $endpoint: ${elapsedMs}ms (acceptable)" -ForegroundColor Yellow
            $warnings += "⚠️  $endpoint response time is ${elapsedMs}ms (should be <1s)"
        } else {
            $errors += "❌ $endpoint response time is ${elapsedMs}ms (too slow, should be <1s)"
            Write-Host "  ❌ $endpoint: ${elapsedMs}ms (too slow)" -ForegroundColor Red
        }
    }
    catch {
        # Already reported in previous tests
    }
}

# Summary
Write-Host "`n" -NoNewline
Write-Host ("=" * 80) -ForegroundColor Cyan
Write-Host "Health Check Validation Summary" -ForegroundColor Cyan
Write-Host ("=" * 80) -ForegroundColor Cyan

if ($errors.Count -gt 0) {
    Write-Host "`n❌ ERRORS ($($errors.Count)):" -ForegroundColor Red
    foreach ($err in $errors) {
        Write-Host "  $err" -ForegroundColor Red
    }
    Write-Host "`n❌ Health check validation FAILED. Fix errors before deployment." -ForegroundColor Red
    exit 1
}

if ($warnings.Count -gt 0) {
    Write-Host "`n⚠️  WARNINGS ($($warnings.Count)):" -ForegroundColor Yellow
    foreach ($warning in $warnings) {
        Write-Host "  $warning" -ForegroundColor Yellow
    }
}

Write-Host "`n✅ Health check validation PASSED" -ForegroundColor Green
exit 0

Azure Pipelines Integration:

# Observability Gate: Health Check Validation
- job: ValidateHealthChecks
  displayName: 'Validate Health Check Endpoints'
  dependsOn: Deploy_Dev  # Deploy to dev environment first
  condition: succeeded()

  steps:
  # Wait for deployment to be ready
  - task: PowerShell@2
    inputs:
      targetType: 'inline'
      script: |
        $maxAttempts = 30
        $attempt = 0
        $serviceUrl = "$(AppServiceUrl)"

        Write-Host "Waiting for service to be ready..."

        while ($attempt -lt $maxAttempts) {
          try {
            $response = Invoke-WebRequest -Uri "$serviceUrl/health/ready" `
              -Method Get `
              -TimeoutSec 5 `
              -UseBasicParsing `
              -ErrorAction Stop

            if ($response.StatusCode -eq 200) {
              Write-Host "✅ Service is ready!"
              exit 0
            }
          }
          catch {
            Write-Host "Attempt $($attempt + 1)/$maxAttempts: Service not ready yet..."
          }

          $attempt++
          Start-Sleep -Seconds 10
        }

        Write-Error "Service did not become ready within $($maxAttempts * 10) seconds"
        exit 1
    displayName: 'Wait for Service Readiness'
    timeoutInMinutes: 10

  # Validate health check endpoints
  - task: PowerShell@2
    inputs:
      targetType: 'filePath'
      filePath: '$(Build.SourcesDirectory)/scripts/validate-health-checks.ps1'
      arguments: '-ServiceUrl "$(AppServiceUrl)" -TimeoutSeconds 30'
    displayName: 'Validate Health Check Endpoints'
    continueOnError: false  # BLOCKER: Fail if health checks invalid

C# Health Check Implementation (ASP.NET Core):

// Program.cs
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using System.Text.Json;

var builder = WebApplication.CreateBuilder(args);

// Add health checks
builder.Services.AddHealthChecks()
    // Database health check (required)
    .AddSqlServer(
        connectionString: builder.Configuration.GetConnectionString("DefaultConnection"),
        healthQuery: "SELECT 1",
        name: "database",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { "db", "sql" },
        timeout: TimeSpan.FromSeconds(5))

    // Message Bus health check (required)
    .AddRabbitMQ(
        rabbitConnectionString: builder.Configuration.GetConnectionString("RabbitMQ"),
        name: "messagebus",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { "messaging", "rabbitmq" },
        timeout: TimeSpan.FromSeconds(5))

    // Redis cache health check (optional)
    .AddRedis(
        redisConnectionString: builder.Configuration.GetConnectionString("Redis"),
        name: "redis",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "cache", "redis" },
        timeout: TimeSpan.FromSeconds(3))

    // Blob Storage health check (optional)
    .AddAzureBlobStorage(
        connectionString: builder.Configuration.GetConnectionString("AzureStorage"),
        containerName: "atp-audit-events",
        name: "blobstorage",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { "storage", "blob" },
        timeout: TimeSpan.FromSeconds(5))

    // Key Vault health check (optional)
    .AddAzureKeyVault(
        keyVaultClientFactory: sp =>
        {
            var keyVaultUrl = builder.Configuration["KeyVault:VaultUri"];
            return new Azure.Security.KeyVault.Secrets.SecretClient(
                new Uri(keyVaultUrl),
                sp.GetRequiredService<Azure.Core.TokenCredential>());
        },
        name: "keyvault",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { "secrets", "keyvault" },
        timeout: TimeSpan.FromSeconds(5));

var app = builder.Build();

// Liveness probe (Kubernetes): run no dependency checks; 200 means the process is alive
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false,  // Skip all registered checks for liveness
    ResultStatusCodes =
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
    },
    ResponseWriter = async (context, report) =>
    {
        context.Response.ContentType = "application/json";
        await context.Response.WriteAsync(JsonSerializer.Serialize(new
        {
            status = report.Status.ToString(),
            timestamp = DateTime.UtcNow
        }));
    }
});

// Readiness probe (Kubernetes)
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = _ => true,  // Run every registered dependency check
    ResultStatusCodes =
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        [HealthStatus.Degraded] = StatusCodes.Status200OK,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
    },
    ResponseWriter = async (context, report) =>
    {
        context.Response.ContentType = "application/json";
        var result = new
        {
            status = report.Status.ToString(),
            timestamp = DateTime.UtcNow,
            checks = report.Entries.Select(entry => new
            {
                name = entry.Key,
                status = entry.Value.Status.ToString(),
                description = entry.Value.Description,
                duration = entry.Value.Duration.TotalMilliseconds,
                data = entry.Value.Data
            })
        };
        await context.Response.WriteAsync(JsonSerializer.Serialize(result));
    }
});

// Startup probe (Kubernetes, optional)
app.MapHealthChecks("/health/startup", new HealthCheckOptions
{
    Predicate = _ => false,  // No checks for startup probe
    ResultStatusCodes =
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
    },
    ResponseWriter = async (context, report) =>
    {
        context.Response.ContentType = "application/json";
        await context.Response.WriteAsync(JsonSerializer.Serialize(new
        {
            status = report.Status.ToString(),
            timestamp = DateTime.UtcNow
        }));
    }
});

// Aggregated health endpoint (Azure App Service)
app.MapHealthChecks("/health", new HealthCheckOptions
{
    ResultStatusCodes =
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        [HealthStatus.Degraded] = StatusCodes.Status200OK,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
    },
    ResponseWriter = async (context, report) =>
    {
        context.Response.ContentType = "application/json";
        var result = new
        {
            status = report.Status.ToString(),
            timestamp = DateTime.UtcNow,
            checks = report.Entries.Select(entry => new
            {
                name = entry.Key,
                status = entry.Value.Status.ToString(),
                description = entry.Value.Description,
                duration = entry.Value.Duration.TotalMilliseconds
            })
        };
        await context.Response.WriteAsync(JsonSerializer.Serialize(result));
    }
});

app.Run();

Summary

  • Observability Gates: 5-10 minute execution; block production if instrumentation or health checks missing
  • OpenTelemetry Validation: PowerShell script checks 6 requirements (HTTP instrumentation, DB instrumentation, custom metrics, trace propagation, ActivitySource naming, HTTP client instrumentation)
  • OpenTelemetry Setup: Complete C# Program.cs example with tracing, metrics, logging, custom ActivitySource, custom Meter
  • Custom ActivitySource Usage: Controller example showing activity creation, tagging, exception recording, custom metrics
  • Health Check Requirements: 4 endpoints (liveness, readiness, startup, aggregated) with status codes and blocker status
  • Health Check Dependencies: 6 dependency types (database, message bus, cache, blob storage, Key Vault, external APIs) with timeout and failure impact
  • Health Check Validation Script: PowerShell script testing all endpoints, dependency checks, response times (<1s requirement)
  • Health Check Implementation: Complete ASP.NET Core Program.cs with SQL Server, RabbitMQ, Redis, Blob Storage, Key Vault health checks
  • Azure Pipelines Integration: YAML for OpenTelemetry validation and health check validation (with service readiness wait)

Contract & API Gates (Deep Dive)

Contract and API gates validate that ATP services maintain backward compatibility in API contracts (REST, WebSocket) and message schemas (events, commands, queries). These gates execute in CI stage and block production deployment if breaking changes are detected without proper versioning.

Philosophy: Backward compatibility is a promise—API consumers and event subscribers must not be broken by service updates. Breaking changes require explicit API versioning (e.g., /v2/audit-records) and deprecation notices (minimum 6 months before removal).

Contract Gate Workflow

graph TD
    A[Observability Gates Passed] --> B[Extract OpenAPI Spec]
    B --> C[Compare with Baseline]
    C --> D{Breaking Changes?}
    D -->|Yes| E[Breaking Change Detected ❌]
    D -->|No| F{Version Incremented?}
    F -->|No| G[Non-Breaking Change ✅]
    F -->|Yes| H[Validate Version Format]
    H --> I{Version Valid?}
    I -->|No| J[Invalid Version Format ❌]
    I -->|Yes| K[Breaking Change Allowed with Version ✅]

    E --> L[Block Production Deployment]
    J --> L

    G --> M[Validate Event Schemas]
    K --> M
    M --> N{Schema Compatibility?}
    N -->|No| O[Incompatible Schema ❌]
    N -->|Yes| P{Schema Version Incremented?}
    P -->|Yes| Q[Schema Version Valid?]
    P -->|No| R[Compatible Schema ✅]
    Q -->|No| S[Invalid Schema Version ❌]
    Q -->|Yes| R

    O --> L
    S --> L
    R --> T[Contract Gates Passed ✅]
    T --> U[Ready for Production]

    style E fill:#ff6b6b
    style J fill:#ff6b6b
    style O fill:#ff6b6b
    style S fill:#ff6b6b
    style G fill:#90EE90
    style R fill:#90EE90
    style T fill:#90EE90

Typical Contract Gate Duration: 2-5 minutes (OpenAPI diff + schema validation)


OpenAPI Breaking Change Detection

Purpose: Ensure REST API contracts maintain backward compatibility or explicitly version breaking changes (e.g., /v1/audit-events → /v2/audit-events).

Baseline Strategy:

| Baseline Source | Usage | Update Trigger |
|-----------------|-------|----------------|
| Last Release | Production baseline from last tagged release | On each release (Git tag) |
| Main Branch | Latest merged PR baseline | Continuous validation against main |
| Explicit Baseline | Manually pinned OpenAPI spec | On major architectural changes |
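As one way to realize the "Last Release" strategy, a pipeline step can materialize the baseline spec from the most recent Git tag before the diff runs. This is a sketch: the spec path and staging location are assumptions, not ATP conventions.

```yaml
# Illustrative: export the baseline OpenAPI spec from the last release tag.
# The repository path to swagger.json is an assumed convention.
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: |
      $lastTag = git describe --tags --abbrev=0
      Write-Host "Using baseline spec from release tag: $lastTag"
      git show "${lastTag}:docs/openapi/v1/swagger.json" `
        > "$(Build.ArtifactStagingDirectory)/baseline-swagger.json"
  displayName: 'Extract Baseline OpenAPI Spec (Last Release)'
```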

Breaking Change Detection Rules:

| Change Type | Breaking? | Action | Example |
|-------------|-----------|--------|---------|
| Removed Endpoint | ✅ Yes | ❌ Block; require /v2/ endpoint | DELETE /api/audit-events/{id} removed |
| Removed Parameter | ✅ Yes | ❌ Block; make parameter optional or version | GET /api/audit-events?tenantId= removed |
| Changed Parameter Type | ✅ Yes | ❌ Block; require version | pageSize: string → pageSize: number |
| Changed Required Status | ✅ Yes | ❌ Block; require version | tenantId optional → required |
| Removed Response Property | ✅ Yes | ❌ Block; require version | Response {id, name} → {id} |
| Changed Status Code | ✅ Yes | ❌ Block; require version | 200 OK → 201 Created |
| Added Required Parameter | ✅ Yes | ❌ Block; require version | New required query parameter |
| Added Endpoint | ❌ No | ✅ Allow | New POST /api/audit-events |
| Added Optional Parameter | ❌ No | ✅ Allow | New optional query parameter |
| Added Response Property | ❌ No | ✅ Allow | Response {id} → {id, name} |
| Removed Optional Parameter | ❌ No | ✅ Allow | Optional parameter removed |
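To make one of these rules concrete, the "Changed Required Status" case corresponds to an OpenAPI diff like the following. Endpoint and parameter names are illustrative, not taken from the ATP spec.

```yaml
# Baseline (v1): tenantId is optional, so existing clients may omit it
paths:
  /api/v1/audit-events:
    get:
      parameters:
        - name: tenantId
          in: query
          required: false
          schema: { type: string }

# Current build: tenantId became required → breaking change; the gate blocks
# this unless it ships under a new version segment (e.g. /api/v2/...)
paths:
  /api/v1/audit-events:
    get:
      parameters:
        - name: tenantId
          in: query
          required: true
          schema: { type: string }
```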

OpenAPI Spec Extraction (Swashbuckle/NSwag):

// Program.cs - Configure OpenAPI generation
using Microsoft.OpenApi.Models;
using Swashbuckle.AspNetCore.SwaggerGen;
using System.Reflection;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen(options =>
{
    options.SwaggerDoc("v1", new OpenApiInfo
    {
        Title = "ATP Ingestion API",
        Version = "v1",
        Description = "API for ingesting audit events into the ConnectSoft Audit Trail Platform",
        Contact = new OpenApiContact
        {
            Name = "ATP Platform Team",
            Email = "platform@connectsoft.example"
        }
    });

    // Enable schema validation
    options.CustomSchemaIds(type => type.FullName);

    // Include XML comments
    var xmlFile = $"{Assembly.GetExecutingAssembly().GetName().Name}.xml";
    var xmlPath = Path.Combine(AppContext.BaseDirectory, xmlFile);
    if (File.Exists(xmlPath))
    {
        options.IncludeXmlComments(xmlPath);
    }

    // Generate deterministic OpenAPI spec (no random values)
    options.SchemaFilter<DeterministicSchemaFilter>();

    // Enforce API versioning
    options.DocInclusionPredicate((docName, apiDesc) =>
        apiDesc.RelativePath?.StartsWith($"api/{docName}/", StringComparison.OrdinalIgnoreCase) == true);
});

var app = builder.Build();

// Swagger UI for development
if (app.Environment.IsDevelopment())
{
    app.UseSwagger();
    app.UseSwaggerUI(options =>
    {
        options.SwaggerEndpoint("/swagger/v1/swagger.json", "ATP Ingestion API v1");
    });
}

app.Run();

OpenAPI Breaking Change Detection Script (PowerShell):

# scripts/validate-openapi-breaking-changes.ps1
param(
    [Parameter(Mandatory=$true)]
    [string]$BaselineSpecPath,

    [Parameter(Mandatory=$true)]
    [string]$CurrentSpecPath,

    [Parameter(Mandatory=$false)]
    [switch]$FailOnBreakingChanges = $true,

    [Parameter(Mandatory=$false)]
    [string]$ApiVersion = "v1"
)

Write-Host "📋 Validating OpenAPI contract compatibility..." -ForegroundColor Cyan

$errors = @()
$warnings = @()

# Load OpenAPI specs
try {
    $baseline = Get-Content $BaselineSpecPath -Raw | ConvertFrom-Json
    $current = Get-Content $CurrentSpecPath -Raw | ConvertFrom-Json
}
catch {
    $errors += "❌ Failed to parse OpenAPI specs: $_"
    Write-Host "  ❌ Error: $_" -ForegroundColor Red
    if ($FailOnBreakingChanges) { exit 1 } else { exit 0 }
}

# Check 1: Removed endpoints
Write-Host "`n1. Checking for removed endpoints..." -ForegroundColor Yellow
$baselinePaths = $baseline.paths.PSObject.Properties.Name
$currentPaths = $current.paths.PSObject.Properties.Name

foreach ($path in $baselinePaths) {
    if ($path -notin $currentPaths) {
        $errors += "❌ Endpoint removed: $path (breaking change)"
        Write-Host "  ❌ Removed: $path" -ForegroundColor Red
    }
    else {
        # Check removed HTTP methods
        $baselineMethods = $baseline.paths.$path.PSObject.Properties.Name
        $currentMethods = $current.paths.$path.PSObject.Properties.Name

        foreach ($method in $baselineMethods) {
            if ($method -notin $currentMethods) {
                $errors += "❌ HTTP method removed: $method $path (breaking change)"
                Write-Host "  ❌ Removed method: $method $path" -ForegroundColor Red
            }
        }
    }
}

# Check 2: Changed parameters (removed, type changed, required status changed)
Write-Host "`n2. Checking for parameter changes..." -ForegroundColor Yellow
foreach ($path in $baselinePaths) {
    if ($path -notin $currentPaths) { continue }

    $baselineMethods = $baseline.paths.$path.PSObject.Properties.Name
    foreach ($method in $baselineMethods) {
        if ($method -eq "parameters") { continue }  # Skip path-level parameters for now

        if ($baseline.paths.$path.$method.parameters) {
            $baselineParams = $baseline.paths.$path.$method.parameters

            if ($current.paths.$path.$method.parameters) {
                $currentParams = $current.paths.$path.$method.parameters

                foreach ($baselineParam in $baselineParams) {
                    $paramName = $baselineParam.name
                    $currentParam = $currentParams | Where-Object { $_.name -eq $paramName }

                    if (-not $currentParam) {
                        # Parameter removed
                        if ($baselineParam.required -eq $true) {
                            $errors += "❌ Required parameter removed: $paramName in $method $path (breaking change)"
                            Write-Host "  ❌ Removed required param: $paramName in $method $path" -ForegroundColor Red
                        } else {
                            $warnings += "⚠️  Optional parameter removed: $paramName in $method $path"
                            Write-Host "  ⚠️  Removed optional param: $paramName in $method $path" -ForegroundColor Yellow
                        }
                    } else {
                        # Check parameter type change
                        if ($baselineParam.schema.type -ne $currentParam.schema.type) {
                            $errors += "❌ Parameter type changed: $paramName in $method $path ($($baselineParam.schema.type) → $($currentParam.schema.type))"
                            Write-Host "  ❌ Type changed: $paramName ($($baselineParam.schema.type) → $($currentParam.schema.type))" -ForegroundColor Red
                        }

                        # Check required status change (optional → required)
                        if ($baselineParam.required -eq $false -and $currentParam.required -eq $true) {
                            $errors += "❌ Parameter became required: $paramName in $method $path (breaking change)"
                            Write-Host "  ❌ Param became required: $paramName" -ForegroundColor Red
                        }
                    }
                }
            } else {
                # All parameters removed
                foreach ($baselineParam in $baselineParams) {
                    if ($baselineParam.required -eq $true) {
                        $errors += "❌ Required parameter removed: $($baselineParam.name) in $method $path (breaking change)"
                    }
                }
            }
        }
    }
}

# Check 3: Removed response properties
Write-Host "`n3. Checking for removed response properties..." -ForegroundColor Yellow
foreach ($path in $baselinePaths) {
    if ($path -notin $currentPaths) { continue }

    $baselineMethods = $baseline.paths.$path.PSObject.Properties.Name | Where-Object { $_ -ne "parameters" }
    foreach ($method in $baselineMethods) {
        $baselineResponses = $baseline.paths.$path.$method.responses
        $currentResponses = $current.paths.$path.$method.responses

        if ($baselineResponses -and $currentResponses) {
            foreach ($statusCode in $baselineResponses.PSObject.Properties.Name) {
                if ($statusCode -in $currentResponses.PSObject.Properties.Name) {
                    # Compare response schemas
                    $baselineSchema = $baselineResponses.$statusCode.content.'application/json'.schema
                    $currentSchema = $currentResponses.$statusCode.content.'application/json'.schema

                    if ($baselineSchema.properties -and $currentSchema.properties) {
                        $baselineProps = $baselineSchema.properties.PSObject.Properties.Name
                        $currentProps = $currentSchema.properties.PSObject.Properties.Name

                        foreach ($prop in $baselineProps) {
                            if ($prop -notin $currentProps) {
                                $errors += "❌ Response property removed: $prop in $method $path $statusCode (breaking change)"
                                Write-Host "  ❌ Removed property: $prop" -ForegroundColor Red
                            }
                        }
                    }
                } else {
                    # Status code removed (if it was a success code, this might be breaking)
                    if ([int]$statusCode -ge 200 -and [int]$statusCode -lt 300) {
                        $warnings += "⚠️  Success status code removed: $statusCode in $method $path"
                        Write-Host "  ⚠️  Status code removed: $statusCode" -ForegroundColor Yellow
                    }
                }
            }
        }
    }
}

# Check 4: API versioning validation (if breaking changes exist and version not incremented)
Write-Host "`n4. Validating API versioning..." -ForegroundColor Yellow
if ($errors.Count -gt 0) {
    # Check if path includes version (e.g., /v2/audit-events)
    $versionPattern = "/(v\d+)/"

    $hasVersionedPath = $false
    foreach ($path in $currentPaths) {
        if ($path -match $versionPattern) {
            $matchedVersion = $matches[1]
            if ($matchedVersion -ne $ApiVersion) {
                $hasVersionedPath = $true
                Write-Host "  ✅ Breaking changes are in versioned endpoint: $path" -ForegroundColor Green
                break
            }
        }
    }

    if (-not $hasVersionedPath) {
        $errors += "❌ Breaking changes detected but API version not incremented. Use /v2/ endpoint for breaking changes."
        Write-Host "  ❌ Breaking changes require API versioning (e.g., /v2/audit-events)" -ForegroundColor Red
    }
}

# Summary
Write-Host "`n" -NoNewline
Write-Host ("=" * 80) -ForegroundColor Cyan
Write-Host "OpenAPI Contract Validation Summary" -ForegroundColor Cyan
Write-Host ("=" * 80) -ForegroundColor Cyan

if ($errors.Count -gt 0) {
    Write-Host "`n❌ BREAKING CHANGES ($($errors.Count)):" -ForegroundColor Red
    foreach ($err in $errors) {
        Write-Host "  $err" -ForegroundColor Red
    }

    if ($FailOnBreakingChanges) {
        Write-Host "`n❌ OpenAPI contract validation FAILED. Fix breaking changes or increment API version." -ForegroundColor Red
        exit 1
    }
}

if ($warnings.Count -gt 0) {
    Write-Host "`n⚠️  WARNINGS ($($warnings.Count)):" -ForegroundColor Yellow
    foreach ($warning in $warnings) {
        Write-Host "  $warning" -ForegroundColor Yellow
    }
}

if ($errors.Count -eq 0) {
    Write-Host "`n✅ OpenAPI contract validation PASSED (no breaking changes)" -ForegroundColor Green
}

exit 0

Azure Pipelines Integration:

# Contract Gate: OpenAPI Breaking Change Detection
- stage: Contract_Gates
  displayName: 'API Contract Validation'
  dependsOn: Build_Test_Publish
  condition: succeeded()

  jobs:
  - job: ValidateOpenApiContract
    displayName: 'Validate OpenAPI Contract Compatibility'
    pool:
      vmImage: 'windows-latest'

    steps:
    # Extract OpenAPI spec from build
    - task: DotNetCoreCLI@2
      inputs:
        command: 'run'
        projects: '**/ConnectSoft.ATP.*.csproj'
        arguments: '--urls "http://localhost:5000" --launch-profile "Swagger"'
      displayName: 'Generate OpenAPI Spec'
      continueOnError: false

    # Download baseline OpenAPI spec (from last release)
    - task: PowerShell@2
      inputs:
        targetType: 'inline'
        script: |
          # Get latest release tag
          $latestTag = git describe --tags --abbrev=0 --match "v*.*.*" 2>$null
          if ($latestTag) {
            Write-Host "Using baseline from release: $latestTag"

            # Download baseline spec from artifacts or Git
            git checkout $latestTag -- swagger.json 2>$null
            if (Test-Path "swagger.json") {
              New-Item -ItemType Directory -Force -Path "$(Pipeline.Workspace)/baseline" | Out-Null
              Move-Item swagger.json "$(Pipeline.Workspace)/baseline/openapi.json"
            }
          } else {
            Write-Host "No release tags found; using main branch baseline"
            # Use main branch spec as baseline
            git checkout origin/main -- swagger.json 2>$null
            if (Test-Path "swagger.json") {
              New-Item -ItemType Directory -Force -Path "$(Pipeline.Workspace)/baseline" | Out-Null
              Move-Item swagger.json "$(Pipeline.Workspace)/baseline/openapi.json"
            }
          }
        displayName: 'Download Baseline OpenAPI Spec'

    # Extract current OpenAPI spec
    - task: PowerShell@2
      inputs:
        targetType: 'inline'
        script: |
          # Wait for Swagger UI to be available
          $maxAttempts = 30
          $attempt = 0
          while ($attempt -lt $maxAttempts) {
            try {
              $response = Invoke-WebRequest -Uri "http://localhost:5000/swagger/v1/swagger.json" -UseBasicParsing -TimeoutSec 5
              $response.Content | Out-File "$(Build.SourcesDirectory)/swagger.json" -Encoding UTF8
              Write-Host "✅ OpenAPI spec extracted"
              exit 0
            }
            catch {
              Write-Host "Attempt $($attempt + 1)/${maxAttempts}: Waiting for Swagger..."
            }
            $attempt++
            Start-Sleep -Seconds 2
          }

          Write-Error "Failed to extract OpenAPI spec"
          exit 1
        displayName: 'Extract Current OpenAPI Spec'

    # Validate breaking changes
    - task: PowerShell@2
      inputs:
        targetType: 'filePath'
        filePath: '$(Build.SourcesDirectory)/scripts/validate-openapi-breaking-changes.ps1'
        arguments: >
          -BaselineSpecPath "$(Pipeline.Workspace)/baseline/openapi.json"
          -CurrentSpecPath "$(Build.SourcesDirectory)/swagger.json"
          -FailOnBreakingChanges
          -ApiVersion "v1"
      displayName: 'Validate OpenAPI Breaking Changes'
      continueOnError: false  # BLOCKER: Fail on breaking changes without versioning

    # Publish OpenAPI specs as artifacts
    - task: PublishBuildArtifacts@1
      inputs:
        PathtoPublish: '$(Build.SourcesDirectory)/swagger.json'
        ArtifactName: 'openapi-spec-$(Build.BuildNumber)'
      displayName: 'Publish OpenAPI Spec'
      condition: always()

Message Schema Compatibility

Purpose: Ensure event/command/query schemas (JSON Schema, Avro, Protobuf) maintain backward compatibility for event-driven architectures.

Schema Compatibility Rules:

| Change Type | Compatible? | Action | Example |
|---|---|---|---|
| Added Optional Field | ✅ Yes | ✅ Allow | `{id, name}` → `{id, name, email?}` |
| Removed Required Field | ❌ No | ❌ Block | `{id, name}` → `{id}` (name was required) |
| Removed Optional Field | ⚠️ Deprecated | ⚠️ Warning | `{id, name?, email?}` → `{id, email?}` |
| Changed Field Type | ❌ No | ❌ Block | `age: string` → `age: number` |
| Removed Enum Value | ❌ No | ❌ Block | `status: ["active", "inactive"]` → `status: ["active"]` |
| Added Enum Value | ✅ Yes | ✅ Allow | `status: ["active"]` → `status: ["active", "pending"]` |
| Changed Field Required | ❌ No | ❌ Block | `email?: string` → `email: string` |
| Schema Version Not Incremented | ❌ No | ❌ Block | Breaking change without version bump |

Event Schema Example (JSON Schema):

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://connectsoft.example/schemas/audit-event/v1.0.0",
  "title": "AuditEvent",
  "description": "Audit event schema for ATP Ingestion service",
  "type": "object",
  "required": ["eventId", "tenantId", "action", "timestamp", "version"],
  "properties": {
    "eventId": {
      "type": "string",
      "format": "uuid",
      "description": "Unique event identifier"
    },
    "tenantId": {
      "type": "string",
      "format": "uuid",
      "description": "Tenant identifier"
    },
    "action": {
      "type": "string",
      "enum": ["UserLogin", "UserLogout", "DataAccess", "DataModification", "DataDeletion"],
      "description": "Action performed"
    },
    "timestamp": {
      "type": "string",
      "format": "date-time",
      "description": "Event timestamp (ISO 8601)"
    },
    "version": {
      "type": "string",
      "pattern": "^\\d+\\.\\d+\\.\\d+$",
      "description": "Schema version (semantic versioning)"
    },
    "userId": {
      "type": "string",
      "format": "uuid",
      "description": "User identifier (optional)"
    },
    "metadata": {
      "type": "object",
      "additionalProperties": true,
      "description": "Additional event metadata (optional)"
    }
  },
  "additionalProperties": false
}
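For contrast with the breaking changes listed above, a backward-compatible evolution of this schema would bump the minor version and add only optional fields. A sketch of such a change (the `severity` field and v1.1.0 `$id` are illustrative, not part of ATP's actual schema set; unchanged v1.0.0 properties are omitted for brevity):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://connectsoft.example/schemas/audit-event/v1.1.0",
  "title": "AuditEvent",
  "type": "object",
  "required": ["eventId", "tenantId", "action", "timestamp", "version"],
  "properties": {
    "severity": {
      "type": "string",
      "enum": ["info", "warning", "critical"],
      "description": "Event severity (optional, added in v1.1.0)"
    }
  }
}
```

Because `required` is unchanged and the new field is optional, v1.1.0 producers remain compatible with v1.0.0 consumers. One caveat: since the v1.0.0 schema declares `"additionalProperties": false`, consumers that strictly validate incoming events against v1.0.0 would reject payloads carrying `severity`, so additive changes typically need coordinated consumer schema updates or a relaxed `additionalProperties` policy.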

Message Schema Compatibility Validation Script (PowerShell):

# scripts/validate-schema-compatibility.ps1
param(
    [Parameter(Mandatory=$true)]
    [string]$BaselineDir,

    [Parameter(Mandatory=$true)]
    [string]$CurrentDir,

    [Parameter(Mandatory=$false)]
    [switch]$FailOnBreakingChanges = $true
)

Write-Host "📋 Validating event schema compatibility..." -ForegroundColor Cyan

$errors = @()
$warnings = @()

# Find all JSON Schema files
$baselineSchemas = Get-ChildItem -Path $BaselineDir -Filter "*.schema.json" -Recurse
$currentSchemas = Get-ChildItem -Path $CurrentDir -Filter "*.schema.json" -Recurse

foreach ($baselineSchemaFile in $baselineSchemas) {
    $schemaName = $baselineSchemaFile.BaseName -replace '\.schema$', ''

    Write-Host "`n🔍 Validating schema: $schemaName" -ForegroundColor Yellow

    try {
        $baselineSchema = Get-Content $baselineSchemaFile.FullName -Raw | ConvertFrom-Json

        # Find corresponding current schema
        $currentSchemaFile = $currentSchemas | Where-Object { $_.BaseName -eq $baselineSchemaFile.BaseName }

        if (-not $currentSchemaFile) {
            $errors += "❌ Schema removed: $schemaName (breaking change)"
            Write-Host "  ❌ Schema removed: $schemaName" -ForegroundColor Red
            continue
        }

        $currentSchema = Get-Content $currentSchemaFile.FullName -Raw | ConvertFrom-Json

        # Check schema version increment (if breaking changes exist)
        # Extract the schema version embedded in the $id URI (e.g., .../audit-event/v1.0.0)
        $baselineVersion = if ($baselineSchema.'$id' -match 'v(\d+\.\d+\.\d+)') { $matches[1] } else { $null }
        $currentVersion = if ($currentSchema.'$id' -match 'v(\d+\.\d+\.\d+)') { $matches[1] } else { $null }

        # Check 1: Required fields
        $baselineRequired = if ($baselineSchema.required) { $baselineSchema.required } else { @() }
        $currentRequired = if ($currentSchema.required) { $currentSchema.required } else { @() }

        $removedRequired = $baselineRequired | Where-Object { $_ -notin $currentRequired }
        foreach ($field in $removedRequired) {
            $errors += "❌ Required field removed: $field in $schemaName (breaking change)"
            Write-Host "  ❌ Required field removed: $field" -ForegroundColor Red
        }

        # Check 2: Field type changes
        $baselineProps = $baselineSchema.properties.PSObject.Properties
        $currentProps = $currentSchema.properties.PSObject.Properties

        foreach ($baselineProp in $baselineProps) {
            $fieldName = $baselineProp.Name
            $currentProp = $currentProps | Where-Object { $_.Name -eq $fieldName }

            if (-not $currentProp) {
                # Field removed (already checked if required)
                if ($fieldName -notin $baselineRequired) {
                    $warnings += "⚠️  Optional field removed: $fieldName in $schemaName (deprecated)"
                    Write-Host "  ⚠️  Optional field removed: $fieldName" -ForegroundColor Yellow
                }
            } else {
                # Check type change
                $baselineType = $baselineProp.Value.type
                $currentType = $currentProp.Value.type

                if ($baselineType -ne $currentType) {
                    $errors += "❌ Field type changed: $fieldName in $schemaName ($baselineType → $currentType, breaking change)"
                    Write-Host "  ❌ Type changed: $fieldName ($baselineType → $currentType)" -ForegroundColor Red
                }

                # Check enum value removal
                if ($baselineProp.Value.enum -and $currentProp.Value.enum) {
                    $baselineEnum = $baselineProp.Value.enum
                    $currentEnum = $currentProp.Value.enum

                    $removedEnumValues = $baselineEnum | Where-Object { $_ -notin $currentEnum }
                    foreach ($enumValue in $removedEnumValues) {
                        $errors += "❌ Enum value removed: $fieldName = $enumValue in $schemaName (breaking change)"
                        Write-Host "  ❌ Enum value removed: $fieldName = $enumValue" -ForegroundColor Red
                    }
                }
            }
        }

        # Check 3: Required status change (optional → required)
        foreach ($baselineProp in $baselineProps) {
            $fieldName = $baselineProp.Name
            $currentProp = $currentProps | Where-Object { $_.Name -eq $fieldName }

            if ($currentProp) {
                $wasOptional = $fieldName -notin $baselineRequired
                $isRequired = $fieldName -in $currentRequired

                if ($wasOptional -and $isRequired) {
                    $errors += "❌ Field became required: $fieldName in $schemaName (breaking change)"
                    Write-Host "  ❌ Field became required: $fieldName" -ForegroundColor Red
                }
            }
        }

        # Check 4: Schema version increment (if breaking changes)
        if ($errors.Count -gt 0 -and $baselineVersion -and $currentVersion) {
            $baselineMajor = [int]($baselineVersion -split '\.')[0]
            $currentMajor = [int]($currentVersion -split '\.')[0]

            if ($currentMajor -le $baselineMajor) {
                $errors += "❌ Schema version not incremented: $schemaName ($baselineVersion → $currentVersion, breaking changes require major version bump)"
                Write-Host "  ❌ Version not incremented: $baselineVersion → $currentVersion" -ForegroundColor Red
            } else {
                Write-Host "  ✅ Version incremented: $baselineVersion → $currentVersion" -ForegroundColor Green
            }
        }
    }
    catch {
        $errors += "❌ Failed to validate schema ${schemaName}: $_"
        Write-Host "  ❌ Error: $_" -ForegroundColor Red
    }
}

# Summary
Write-Host "`n" -NoNewline
Write-Host ("=" * 80) -ForegroundColor Cyan
Write-Host "Message Schema Compatibility Validation Summary" -ForegroundColor Cyan
Write-Host ("=" * 80) -ForegroundColor Cyan

if ($errors.Count -gt 0) {
    Write-Host "`n❌ BREAKING CHANGES ($($errors.Count)):" -ForegroundColor Red
    foreach ($err in $errors) {
        Write-Host "  $err" -ForegroundColor Red
    }

    if ($FailOnBreakingChanges) {
        Write-Host "`n❌ Schema compatibility validation FAILED. Fix breaking changes or increment schema version." -ForegroundColor Red
        exit 1
    }
}

if ($warnings.Count -gt 0) {
    Write-Host "`n⚠️  WARNINGS ($($warnings.Count)):" -ForegroundColor Yellow
    foreach ($warning in $warnings) {
        Write-Host "  $warning" -ForegroundColor Yellow
    }
}

if ($errors.Count -eq 0) {
    Write-Host "`n✅ Schema compatibility validation PASSED (backward compatible)" -ForegroundColor Green
}

exit 0

Azure Pipelines Integration:

# Contract Gate: Message Schema Compatibility
- job: ValidateSchemaCompatibility
  displayName: 'Validate Event Schema Compatibility'
  dependsOn: ValidateOpenApiContract
  condition: succeeded()

  steps:
  # Download baseline schemas (from last release)
  - task: PowerShell@2
    inputs:
      targetType: 'inline'
      script: |
        $latestTag = git describe --tags --abbrev=0 --match "v*.*.*" 2>$null
        if ($latestTag) {
          git checkout $latestTag -- schemas/ 2>$null
          if (Test-Path "schemas") {
            New-Item -ItemType Directory -Force -Path "$(Pipeline.Workspace)/baseline/schemas" | Out-Null
            Copy-Item schemas/* "$(Pipeline.Workspace)/baseline/schemas/" -Recurse
          }
        }
      displayName: 'Download Baseline Schemas'

  # Validate schema compatibility
  - task: PowerShell@2
    inputs:
      targetType: 'filePath'
      filePath: '$(Build.SourcesDirectory)/scripts/validate-schema-compatibility.ps1'
      arguments: >
        -BaselineDir "$(Pipeline.Workspace)/baseline/schemas"
        -CurrentDir "$(Build.SourcesDirectory)/schemas"
        -FailOnBreakingChanges
    displayName: 'Validate Event Schema Compatibility'
    continueOnError: false  # BLOCKER: Fail on breaking schema changes

Summary

  • Contract & API Gates: 2-5 minute execution; block production if breaking changes detected without versioning
  • OpenAPI Breaking Change Detection: PowerShell script validates 8 breaking change types (removed endpoints/parameters, type changes, required status changes, removed response properties, status code changes)
  • OpenAPI Baseline Strategy: 3 strategies (last release, main branch, explicit baseline) with update triggers
  • OpenAPI Spec Extraction: C# Program.cs example with Swashbuckle configuration, deterministic schema generation, API versioning enforcement
  • Message Schema Compatibility: JSON Schema validation with 8 compatibility rules (additive changes allowed, removals blocked, enum value removal blocked, version increment required)
  • Event Schema Example: Complete JSON Schema with required/optional fields, enum constraints, version field, metadata support
  • Schema Compatibility Validation Script: PowerShell script validating required field removal, type changes, enum value removal, required status changes, version increment enforcement
  • Azure Pipelines Integration: YAML for OpenAPI spec extraction, baseline download, breaking change detection, schema compatibility validation

Approval Gates (Manual Governance)

Approval gates enforce human oversight and organizational governance for deployments to staging and production environments. These gates ensure that deployments are reviewed by appropriate stakeholders (engineers, architects, SREs, CAB) and that risk assessments are completed before changes reach production.

Philosophy: Automation plus human judgment—while automated quality gates catch technical issues, manual approval gates ensure business readiness, risk awareness, and deployment coordination (change windows, on-call coverage, rollback preparedness).

Approval Gate Workflow

graph TD
    A[Contract Gates Passed] --> B[Deploy to Staging Request]
    B --> C{Pre-Production Gate}
    C -->|Not Ready| D[Approval Denied ❌]
    C -->|Ready| E[Lead Engineer Approval]
    E --> F{1 Approver?}
    F -->|No| D
    F -->|Yes| G[Deploy to Staging]

    G --> H[Staging Soak Period 24h]
    H --> I{Production Gate}
    I -->|Not Ready| J[Production Approval Denied ❌]
    I -->|Ready| K[Architect Approval]
    K --> L{2 Approvers?}
    L -->|No| J
    L -->|Yes| M[SRE Approval]
    M --> N{1 SRE Approver?}
    N -->|No| J
    N -->|Yes| O[CAB Approval]
    O --> P{CAB Approved?}
    P -->|No| J
    P -->|Yes| Q{Active Incidents?}
    Q -->|Yes P1/P2| J
    Q -->|No| R{Change Freeze?}
    R -->|Yes| J
    R -->|No| S[Deploy to Production]

    D --> T[Remediate Issues]
    J --> U[Reschedule Deployment]
    S --> V[Production Monitoring]

    style D fill:#ff6b6b
    style J fill:#ff6b6b
    style S fill:#90EE90
    style V fill:#90EE90

Typical Approval Duration: Staging (1-4 hours), Production (4-24 hours depending on CAB schedule)


Pre-Production Approval (Staging)

Purpose: Ensure technical readiness for staging deployment through peer review and automated gate validation.

Approval Requirements:

| Requirement | Details | Timeout | Bypass Allowed |
|---|---|---|---|
| Minimum Approvers | 1 Lead Engineer | 4 hours | ❌ No (except hotfix) |
| Automated Gates | All quality gates passed | N/A | ❌ No |
| Test Results | 100% pass rate | N/A | ❌ No |
| Security Scan | Zero critical/high vulnerabilities | N/A | ⚠️ Yes (with risk acceptance) |
| Coverage Threshold | Service-specific threshold met | N/A | ❌ No |

Approval Checklist (Lead Engineer):

## Staging Deployment Approval Checklist

**Build**: `$(Build.BuildNumber)`  
**Requested By**: `$(Build.RequestedFor)`  
**Date**: `$(System.Date)`

### Automated Quality Gates
- [ ] All automated tests passed (unit, integration, E2E)
- [ ] Code coverage threshold met (≥70% for service)
- [ ] Security scans clean (SAST, dependency, secrets, container)
- [ ] OpenAPI contract backward compatible (or versioned)
- [ ] Event schema backward compatible (or versioned)
- [ ] OpenTelemetry instrumentation validated
- [ ] Health check endpoints validated

### Security & Compliance
- [ ] SBOM generated and reviewed (no prohibited licenses)
- [ ] Dependency vulnerabilities assessed (critical/high resolved or accepted)
- [ ] Secrets detection passed (no leaked credentials)
- [ ] Container image hardened (Trivy scan clean)

### Observability & Monitoring
- [ ] Structured logging validated (PII redacted)
- [ ] Distributed tracing configured (ActivitySource registered)
- [ ] Custom metrics emitted (business KPIs)
- [ ] Health check dependencies validated (database, message bus, cache)

### Documentation & Communication
- [ ] Architecture Decision Record (ADR) updated (if applicable)
- [ ] CHANGELOG updated with user-facing changes
- [ ] Rollback plan documented in deployment notes

### Approval Decision
- [ ] **APPROVED** — Deploy to staging
- [ ] **DENIED** — Block deployment (reason: _________________)

**Approver**: ________________  
**Date/Time**: ________________  
**Comments**: ________________

Azure DevOps Environment Configuration (Staging):

# Azure DevOps Environment: ATP-Staging
name: ATP-Staging
resourceType: none  # No direct Kubernetes/VM resources

# Approval configuration
approvals:
  - type: requiredApprovers
    requiredApprovers:
      - group: ATP-Lead-Engineers
    minRequiredApprovers: 1
    instructions: |
      Review the staging deployment approval checklist before approving.

      Key validation points:
      - All automated quality gates passed
      - Security scans clean or risks accepted
      - SBOM reviewed for license compliance
      - Observability validated (logs, traces, metrics)
      - Rollback plan documented
    timeout: 4h
    notifyOnlyInitiator: false  # Notify all group members

# Pre-deployment gates
gates:
  - type: azureFunction
    function: ValidateTestResults
    url: https://atp-approval-gates.azurewebsites.net/api/ValidateTestResults
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"testPassRate": 100}'
    timeout: 5m

  - type: azureFunction
    function: ValidateSecurityScan
    url: https://atp-approval-gates.azurewebsites.net/api/ValidateSecurityScan
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"criticalVulnerabilities": 0, "highVulnerabilities": 0}'
    timeout: 5m

  - type: azureFunction
    function: ValidateCoverageThreshold
    url: https://atp-approval-gates.azurewebsites.net/api/ValidateCoverageThreshold
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"coverageMet": true}'
    timeout: 5m

# Deployment lock (prevent concurrent deployments)
lock:
  enabled: true
  lockType: sequential

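The `azureFunction` gates above are server-side checks: Azure DevOps POSTs a JSON body to the function URL and compares the HTTP response against `successCriteria`. The exact request shape depends on how the check's body is templated in the environment configuration; a hypothetical request/response pair matching the `GateRequest` contract used by ATP's validation functions might look like this (all field values illustrative, wrapped in one object for display):

```json
{
  "request": {
    "BuildId": "12345",
    "EnvironmentName": "ATP-Staging",
    "StageName": "Deploy_Staging"
  },
  "response": {
    "status": "Success",
    "message": "All tests passed",
    "testPassRate": 100,
    "totalTests": 412,
    "passedTests": 412
  }
}
```

A non-2xx response (for example, the `BadRequestObjectResult` path when tests failed) fails the gate; gates configured with a `retryInterval` are re-evaluated until their `timeout` elapses.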
Automated Gate Validation Function (C#):

// ValidateTestResults.cs — Azure Function for pre-deployment gate
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class ValidateTestResults
{
    [FunctionName("ValidateTestResults")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req,
        ILogger log)
    {
        log.LogInformation("Validating test results for deployment approval");

        // Parse request body (Azure DevOps sends build info)
        var requestBody = await new StreamReader(req.Body).ReadToEndAsync();
        var gateRequest = JsonSerializer.Deserialize<GateRequest>(requestBody,
            new JsonSerializerOptions { PropertyNameCaseInsensitive = true });

        // Query Azure DevOps Test Results API
        var testResults = await GetTestResultsAsync(gateRequest.BuildId);

        // Validate test pass rate (must be 100%); guard against builds with no recorded tests
        var totalTests = testResults.TotalCount;
        var passedTests = testResults.PassedTests;
        var testPassRate = totalTests == 0 ? 0 : (double)passedTests / totalTests * 100;

        log.LogInformation($"Test pass rate: {testPassRate:F2}% ({passedTests}/{totalTests})");

        if (testPassRate < 100)
        {
            return new BadRequestObjectResult(new
            {
                status = "Failed",
                message = $"Test pass rate is {testPassRate:F2}% (expected 100%). {totalTests - passedTests} tests failed.",
                testPassRate,
                totalTests,
                passedTests,
                failedTests = totalTests - passedTests
            });
        }

        return new OkObjectResult(new
        {
            status = "Success",
            message = "All tests passed",
            testPassRate = 100,
            totalTests,
            passedTests
        });
    }

    private static async Task<TestResults> GetTestResultsAsync(string buildId)
    {
        var azureDevOpsUrl = Environment.GetEnvironmentVariable("AZURE_DEVOPS_URL");
        var pat = Environment.GetEnvironmentVariable("AZURE_DEVOPS_PAT");

        var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
            "Basic",
            Convert.ToBase64String(Encoding.ASCII.GetBytes($":{pat}"))
        );

        var response = await client.GetAsync(
            $"{azureDevOpsUrl}/_apis/test/ResultSummaryByBuild?buildId={buildId}&api-version=7.0"
        );

        response.EnsureSuccessStatusCode();

        var content = await response.Content.ReadAsStringAsync();
        return JsonSerializer.Deserialize<TestResults>(content,
            new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
    }
}

public class GateRequest
{
    public string BuildId { get; set; }
    public string EnvironmentName { get; set; }
    public string StageName { get; set; }
}

public class TestResults
{
    public int TotalCount { get; set; }
    public int PassedTests { get; set; }
}

Production Approval (Multi-Level Governance)

Purpose: Ensure business readiness, risk mitigation, and organizational alignment for production deployment through multi-level approval and Change Advisory Board (CAB) review.

Approval Requirements:

| Requirement | Details | Timeout | Bypass Allowed |
|---|---|---|---|
| Minimum Approvers (Architects) | 2 ATP Architects | 24 hours | ❌ No (except emergency hotfix) |
| Minimum Approvers (SRE) | 1 SRE Team Member | 24 hours | ❌ No (except emergency hotfix) |
| CAB Approval | Change Advisory Board review | 24-72 hours | ❌ No (except emergency hotfix) |
| Automated Gates | Load tests, chaos tests, incident check | N/A | ❌ No |
| Staging Soak Period | Minimum 24 hours in staging | N/A | ⚠️ Yes (with architect override) |
| Active Incidents | No P1/P2 incidents open | N/A | ❌ No |
| Change Freeze | Outside blackout periods | N/A | ⚠️ Yes (with executive approval) |

Approval Checklist (Architects + SRE):

## Production Deployment Approval Checklist

**Build**: `$(Build.BuildNumber)`  
**Requested By**: `$(Build.RequestedFor)`  
**Deployment Window**: `[Start Date/Time] - [End Date/Time]`  
**On-Call Engineer**: `[Name]`

### Staging Validation
- [ ] Staging deployment successful (minimum 24 hours soak period)
- [ ] No errors/exceptions in staging logs (Log Analytics reviewed)
- [ ] Performance metrics within thresholds (p95 latency <500ms, error rate <0.1%)
- [ ] Synthetic monitors passing (health checks, smoke tests)
- [ ] Load tests passed (1000 RPS sustained, p95 <500ms)
- [ ] Chaos tests passed (pod restart, network latency, storage failure)

### Change Management
- [ ] CAB approval obtained (change ticket: CR-XXXXXXX)
- [ ] Deployment window scheduled (change calendar updated)
- [ ] Change freeze respected (no deployment during blackout periods)
- [ ] Rollback plan tested in staging (slot swap or canary rollback)
- [ ] On-call engineer notified and available during deployment
- [ ] Communication plan prepared (status page, tenant email, Slack announcement)

### Risk Assessment
- [ ] No active P1/P2 incidents (Azure DevOps Boards checked)
- [ ] No concurrent deployments scheduled (deployment calendar reviewed)
- [ ] Breaking changes versioned and documented (API versioning, deprecation notices)
- [ ] Database migrations backward compatible (no downtime required)
- [ ] Feature flags configured for gradual rollout (10% → 25% → 50% → 100%)

### Compliance & Audit
- [ ] SBOM published to artifact feed (compliance evidence collected)
- [ ] Security scan reports archived (immutable storage, 7-year retention)
- [ ] Deployment approval trail captured (Azure DevOps audit log)
- [ ] ADR updated for architectural changes

### Post-Deployment Monitoring
- [ ] Monitoring dashboard prepared (Application Insights, Grafana)
- [ ] Alert rules validated (error rate, latency, availability)
- [ ] Runbook updated for incident response

### Approval Decision
- [ ] **APPROVED** — Deploy to production
- [ ] **APPROVED WITH CONDITIONS** — Deploy with specific constraints (e.g., canary only, feature flag off by default)
- [ ] **DENIED** — Block deployment (reason: _________________)

**Architect Approver 1**: ________________  
**Architect Approver 2**: ________________  
**SRE Approver**: ________________  
**CAB Decision**: ________________  
**Date/Time**: ________________  
**Comments**: ________________

Azure DevOps Environment Configuration (Production):

# Azure DevOps Environment: ATP-Production
name: ATP-Production
resourceType: kubernetes  # Optional: link to AKS cluster

# Multi-level approval configuration
approvals:
  # Level 1: Architect approval (minimum 2)
  - type: requiredApprovers
    requiredApprovers:
      - group: ATP-Architects
    minRequiredApprovers: 2
    instructions: |
      Review the production deployment approval checklist.

      Key validation points:
      - Staging soak period completed (minimum 24 hours)
      - Load tests and chaos tests passed
      - No active P1/P2 incidents
      - CAB approval obtained
      - Deployment window scheduled
      - Rollback plan tested in staging
    timeout: 24h
    notifyOnlyInitiator: false

  # Level 2: SRE approval (minimum 1)
  - type: requiredApprovers
    requiredApprovers:
      - group: SRE-Team
    minRequiredApprovers: 1
    instructions: |
      SRE team review for production deployment.

      Validate:
      - On-call coverage during deployment window
      - Monitoring and alerting configured
      - Runbook updated for incident response
      - Rollback procedure validated
    timeout: 24h
    notifyOnlyInitiator: false

# Pre-deployment gates
gates:
  # Gate 1: Validate load test results
  - type: azureFunction
    function: ValidateLoadTests
    url: https://atp-approval-gates.azurewebsites.net/api/ValidateLoadTests
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"p95Latency": "<500", "errorRate": "<0.001", "throughput": ">=1000"}'
    timeout: 10m
    retryInterval: 2m

  # Gate 2: Validate chaos test results
  - type: azureFunction
    function: ValidateChaosTests
    url: https://atp-approval-gates.azurewebsites.net/api/ValidateChaosTests
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"podRestartPassed": true, "storageFailurePassed": true}'
    timeout: 10m
    retryInterval: 2m

  # Gate 3: Check active incidents (block if P1/P2 open)
  - type: azureFunction
    function: CheckActiveIncidents
    url: https://atp-approval-gates.azurewebsites.net/api/CheckActiveIncidents
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"activeP1Incidents": 0, "activeP2Incidents": 0}'
    timeout: 5m
    retryInterval: 1m

  # Gate 4: Validate staging soak period (minimum 24 hours)
  - type: azureFunction
    function: ValidateStagingSoakPeriod
    url: https://atp-approval-gates.azurewebsites.net/api/ValidateStagingSoakPeriod
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"soakPeriodHours": ">=24", "stagingHealthy": true}'
    timeout: 5m

  # Gate 5: Check change freeze (block if in blackout period)
  - type: azureFunction
    function: CheckChangeFreeze
    url: https://atp-approval-gates.azurewebsites.net/api/CheckChangeFreeze
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"inChangeFreeze": false}'
    timeout: 5m

# Deployment lock (prevent concurrent deployments)
lock:
  enabled: true
  lockType: exclusive  # Only one deployment at a time
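Each gate's successCriteria string pairs a response field with either a literal value or a comparison expression such as "<500" or ">=1000". The exact evaluation semantics inside the gate functions are an assumption here; a minimal evaluator sketch (shown in Python rather than the platform's C#, with hypothetical helper names) illustrates the intent:

```python
import json
import operator

# Comparison prefixes, longest first so ">=" matches before ">".
_OPS = [(">=", operator.ge), ("<=", operator.le),
        (">", operator.gt), ("<", operator.lt)]

def criterion_met(expected, actual):
    """Evaluate one successCriteria entry against a measured value."""
    if isinstance(expected, str):
        for prefix, op in _OPS:
            if expected.startswith(prefix):
                return op(float(actual), float(expected[len(prefix):]))
    # No comparison prefix: require an exact match (booleans, counts).
    return expected == actual

def gate_passes(success_criteria, measured):
    """True only if every criterion in the JSON string is satisfied."""
    criteria = json.loads(success_criteria)
    return all(criterion_met(v, measured[k]) for k, v in criteria.items())

# Example: the load-test gate criteria from the configuration above.
criteria = '{"p95Latency": "<500", "errorRate": "<0.001", "throughput": ">=1000"}'
print(gate_passes(criteria, {"p95Latency": 420, "errorRate": 0.0004, "throughput": 1250}))  # True
print(gate_passes(criteria, {"p95Latency": 610, "errorRate": 0.0004, "throughput": 1250}))  # False
```

A gate function would apply such a check to its measured values and return success only when every criterion holds.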

Automated Gate: Check Active Incidents (C#):

// CheckActiveIncidents.cs — Block production deployment if P1/P2 incidents open
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using Microsoft.TeamFoundation.WorkItemTracking.WebApi;
using Microsoft.TeamFoundation.WorkItemTracking.WebApi.Models;
using Microsoft.VisualStudio.Services.Common;
using Microsoft.VisualStudio.Services.WebApi;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class CheckActiveIncidents
{
    [FunctionName("CheckActiveIncidents")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req,
        ILogger log)
    {
        log.LogInformation("Checking for active P1/P2 incidents");

        var azureDevOpsUrl = Environment.GetEnvironmentVariable("AZURE_DEVOPS_URL");
        var pat = Environment.GetEnvironmentVariable("AZURE_DEVOPS_PAT");
        var projectName = Environment.GetEnvironmentVariable("AZURE_DEVOPS_PROJECT");

        var credentials = new VssBasicCredential(string.Empty, pat);
        var connection = new VssConnection(new Uri(azureDevOpsUrl), credentials);
        var witClient = connection.GetClient<WorkItemTrackingHttpClient>();

        // Query for active P1/P2 incidents
        var wiql = new Wiql
        {
            Query = @"
                SELECT [System.Id], [System.Title], [System.State], [Microsoft.VSTS.Common.Priority]
                FROM WorkItems
                WHERE [System.WorkItemType] = 'Incident'
                  AND [System.State] = 'Active'
                  AND [Microsoft.VSTS.Common.Priority] <= 2
                ORDER BY [Microsoft.VSTS.Common.Priority]"
        };

        // WIQL returns work item references only; fetch the items to read their fields
        var result = await witClient.QueryByWiqlAsync(wiql, projectName);
        var ids = result.WorkItems.Select(r => r.Id).ToList();
        var activeIncidents = ids.Count > 0
            ? await witClient.GetWorkItemsAsync(ids, new[] { "System.Title", "Microsoft.VSTS.Common.Priority" })
            : new List<WorkItem>();

        var activeP1 = activeIncidents.Count(wi => 
            int.Parse(wi.Fields["Microsoft.VSTS.Common.Priority"].ToString()) == 1);
        var activeP2 = activeIncidents.Count(wi => 
            int.Parse(wi.Fields["Microsoft.VSTS.Common.Priority"].ToString()) == 2);

        log.LogInformation($"Active P1 incidents: {activeP1}, Active P2 incidents: {activeP2}");

        if (activeP1 > 0 || activeP2 > 0)
        {
            return new BadRequestObjectResult(new
            {
                status = "Failed",
                message = $"Active high-priority incidents detected: {activeP1} P1, {activeP2} P2. Resolve incidents before production deployment.",
                activeP1Incidents = activeP1,
                activeP2Incidents = activeP2,
                incidents = activeIncidents.Select(wi => new
                {
                    id = wi.Id,
                    title = wi.Fields["System.Title"].ToString(),
                    priority = wi.Fields["Microsoft.VSTS.Common.Priority"].ToString()
                })
            });
        }

        return new OkObjectResult(new
        {
            status = "Success",
            message = "No active P1/P2 incidents",
            activeP1Incidents = 0,
            activeP2Incidents = 0
        });
    }
}

Change Advisory Board (CAB) Process

Purpose: Provide cross-functional review of production changes to assess business impact, technical risk, and deployment coordination.

CAB Composition:

| Role | Responsibilities | Required for Approval |
|------|------------------|-----------------------|
| Lead Architect | Technical feasibility, architectural alignment | ✅ Yes |
| SRE Lead | Operational readiness, on-call coverage | ✅ Yes |
| Product Owner | Business impact, user communication | ✅ Yes |
| Security Officer | Security risk assessment, compliance | ⚠️ For security changes only |
| Customer Success | Tenant impact, downtime communication | ⚠️ For breaking changes only |

CAB Meeting Cadence:

  • Weekly: Tuesday 10:00 AM (routine changes)
  • Emergency: On-demand via Slack /cab-emergency (hotfixes, P1 incidents)
  • Async Review: Low-risk changes via Azure DevOps approval workflow (no meeting required)

CAB Approval Workflow:

graph TD
    A[Create Change Request] --> B{Change Type?}
    B -->|Standard| C[Weekly CAB Meeting]
    B -->|Emergency| D[Emergency CAB]
    B -->|Low-Risk| E[Async Approval]

    C --> F[CAB Review]
    D --> G[Emergency Review within 2h]
    E --> H[Async Review 24h]

    F --> I{Approved?}
    G --> I
    H --> I

    I -->|No| J[Change Denied/Deferred]
    I -->|Yes| K[CAB Approval Granted]

    J --> L[Remediate Issues]
    K --> M[Schedule Deployment]
    M --> N[Production Deployment]

    style J fill:#ff6b6b
    style K fill:#90EE90
    style N fill:#90EE90

Change Request Template (Azure DevOps Work Item):

# Work Item Type: Change Request
fields:
  - field: System.Title
    value: "[ATP] Production Deployment $(Build.BuildNumber)"

  - field: System.Description
    value: |
      ## Change Summary
      **Service**: ATP Ingestion Service
      **Build**: $(Build.BuildNumber)
      **Deployment Window**: [Start] - [End]
      **Estimated Duration**: 30 minutes

      ## Change Details
      ### Features Added
      - Feature 1: Description
      - Feature 2: Description

      ### Bug Fixes
      - Bug 1: Description
      - Bug 2: Description

      ### Breaking Changes
      - None (or list breaking changes with mitigation)

      ## Risk Assessment
      **Risk Level**: Low / Medium / High
      **Impact**: Tenant-facing / Internal / Infrastructure
      **Rollback Strategy**: Blue-green slot swap (30 seconds)

      ## Testing Evidence
      - [ ] All automated quality gates passed
      - [ ] Load tests passed (p95 <500ms, error rate <0.1%)
      - [ ] Chaos tests passed (pod restart, storage failure)
      - [ ] Staging soak period completed (24+ hours)

      ## Communication Plan
      - [ ] Status page updated (if user-facing changes)
      - [ ] Tenant email sent (if breaking changes)
      - [ ] Slack #atp-deployments announcement

      ## Approval Checklist
      - [ ] Lead Architect approved
      - [ ] SRE Lead approved
      - [ ] Product Owner approved

  - field: Microsoft.VSTS.Common.Priority
    value: 2  # P2 by default; P1 for emergency hotfixes

  - field: Custom.ChangeType
    value: Standard  # Standard / Emergency / Low-Risk

  - field: Custom.RiskLevel
    value: Medium  # Low / Medium / High

  - field: Custom.DeploymentWindow
    value: "[2025-11-01 02:00 UTC] - [2025-11-01 04:00 UTC]"

  - field: Custom.RollbackPlan
    value: "Blue-green slot swap via Azure CLI: az webapp deployment slot swap"

CAB Approval Automation (Azure Function):

// GetCABApprovalStatus.cs — Check CAB approval status for change request
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using Microsoft.TeamFoundation.WorkItemTracking.WebApi.Models;
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;

public static class GetCABApprovalStatus
{
    [FunctionName("GetCABApprovalStatus")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req,
        ILogger log)
    {
        log.LogInformation("Checking CAB approval status");

        var requestBody = await new StreamReader(req.Body).ReadToEndAsync();
        var gateRequest = JsonSerializer.Deserialize<GateRequest>(requestBody);

        // Query Azure DevOps for linked Change Request work item
        var changeRequest = await GetChangeRequestAsync(gateRequest.BuildId);

        if (changeRequest == null)
        {
            return new BadRequestObjectResult(new
            {
                status = "Failed",
                message = "No Change Request work item linked to this build. Create a Change Request and link it to the build."
            });
        }

        // Check approval fields
        var cabApproved = changeRequest.Fields.ContainsKey("Custom.CABApproved") &&
                          changeRequest.Fields["Custom.CABApproved"].ToString() == "Yes";

        var leadArchitectApproved = changeRequest.Fields.ContainsKey("Custom.LeadArchitectApproved") &&
                                     changeRequest.Fields["Custom.LeadArchitectApproved"].ToString() == "Yes";

        var sreLeadApproved = changeRequest.Fields.ContainsKey("Custom.SRELeadApproved") &&
                              changeRequest.Fields["Custom.SRELeadApproved"].ToString() == "Yes";

        var productOwnerApproved = changeRequest.Fields.ContainsKey("Custom.ProductOwnerApproved") &&
                                   changeRequest.Fields["Custom.ProductOwnerApproved"].ToString() == "Yes";

        log.LogInformation($"CAB: {cabApproved}, Architect: {leadArchitectApproved}, SRE: {sreLeadApproved}, PO: {productOwnerApproved}");

        if (!cabApproved || !leadArchitectApproved || !sreLeadApproved || !productOwnerApproved)
        {
            var missingApprovals = new List<string>();
            if (!cabApproved) missingApprovals.Add("CAB");
            if (!leadArchitectApproved) missingApprovals.Add("Lead Architect");
            if (!sreLeadApproved) missingApprovals.Add("SRE Lead");
            if (!productOwnerApproved) missingApprovals.Add("Product Owner");

            return new BadRequestObjectResult(new
            {
                status = "Failed",
                message = $"CAB approval incomplete. Missing approvals: {string.Join(", ", missingApprovals)}",
                changeRequestId = changeRequest.Id,
                cabApproved,
                leadArchitectApproved,
                sreLeadApproved,
                productOwnerApproved
            });
        }

        return new OkObjectResult(new
        {
            status = "Success",
            message = "CAB approval granted",
            changeRequestId = changeRequest.Id,
            cabApproved = true,
            leadArchitectApproved = true,
            sreLeadApproved = true,
            productOwnerApproved = true
        });
    }

    private static async Task<WorkItem> GetChangeRequestAsync(string buildId)
    {
        // Query Azure DevOps API for Change Request work item linked to build
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }
}

Emergency Approval Procedures (Hotfixes)

Purpose: Enable rapid deployment of critical fixes (P1 incidents, security vulnerabilities) with streamlined approval while maintaining governance.

Emergency Approval Requirements:

| Requirement | Standard Deployment | Emergency Hotfix |
|-------------|---------------------|------------------|
| Minimum Approvers (Architects) | 2 | 1 |
| Minimum Approvers (SRE) | 1 | 1 |
| CAB Approval | Yes (24-72h) | Async (2h post-deployment) |
| Staging Soak Period | 24+ hours | 1-2 hours (expedited) |
| Load/Chaos Tests | Required | Optional (skip if time-critical) |
| Change Freeze | Respected | Bypassed with executive approval |

Emergency Approval Workflow:

# Emergency hotfix approval (Azure DevOps Environment)
name: ATP-Production-Hotfix
approvals:
  - type: requiredApprovers
    requiredApprovers:
      - group: ATP-Architects
      - group: SRE-Team
    minRequiredApprovers: 2  # 1 Architect + 1 SRE
    instructions: |
      **EMERGENCY HOTFIX APPROVAL**

      This is an expedited approval for a critical production issue.

      Validate:
      - P1 incident ticket linked (incident severity justified)
      - Hotfix tested in staging (minimum 1 hour)
      - Rollback plan documented and tested
      - On-call engineer notified and available
      - CAB async review scheduled (within 2 hours post-deployment)
    timeout: 2h  # Expedited timeout
    notifyOnlyInitiator: false

gates:
  # Simplified gates for emergency hotfix
  - type: azureFunction
    function: ValidateEmergencyHotfix
    url: https://atp-approval-gates.azurewebsites.net/api/ValidateEmergencyHotfix
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"p1IncidentLinked": true, "stagingTested": true}'
    timeout: 5m

Emergency Deployment Checklist:

## Emergency Hotfix Deployment Checklist

**Incident**: P1-XXXXX  
**Build**: $(Build.BuildNumber)  
**Severity**: Critical  
**Deployment Time**: Immediate

### Emergency Justification
- [ ] P1 incident active (production down or severe degradation)
- [ ] Security vulnerability (CVSS ≥9.0) requiring immediate patching
- [ ] Data loss/corruption risk requiring immediate mitigation

### Minimal Validation
- [ ] Hotfix tested in staging (minimum 1 hour)
- [ ] Rollback plan documented and tested
- [ ] On-call engineer notified and available during deployment
- [ ] Incident commander assigned (coordinates deployment)

### Post-Deployment Requirements
- [ ] CAB async review scheduled (within 2 hours)
- [ ] Post-incident review (PIR) scheduled (within 48 hours)
- [ ] Incident status page updated (communicate fix deployed)

### Approval Decision
- [ ] **APPROVED (EMERGENCY)** — Deploy immediately

**Architect Approver**: ________________  
**SRE Approver**: ________________  
**Incident Commander**: ________________  
**Date/Time**: ________________

Approval Tracking & Audit Trail

Purpose: Maintain comprehensive audit trail of all deployment approvals for compliance (SOC 2, GDPR, HIPAA).

Approval Audit Data Captured:

| Data Point | Captured | Retention | Immutable |
|------------|----------|-----------|-----------|
| Approver Identity | User principal name, email | 7 years | ✅ Yes |
| Approval Timestamp | UTC timestamp | 7 years | ✅ Yes |
| Approval Decision | Approved/Denied/Deferred | 7 years | ✅ Yes |
| Approval Comments | Free-text justification | 7 years | ✅ Yes |
| Build Artifacts | Build number, commit SHA, SBOM | 7 years | ✅ Yes |
| Automated Gate Results | Test results, security scans, coverage | 7 years | ✅ Yes |

Approval Audit Export (Azure Function):

// ExportApprovalAuditTrail.cs — Export approval history for compliance
using Azure.Storage.Blobs;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class ExportApprovalAuditTrail
{
    [FunctionName("ExportApprovalAuditTrail")]
    public static async Task Run(
        [TimerTrigger("0 0 2 * * 0")] TimerInfo timer,  // Weekly: Sunday 2:00 AM
        ILogger log)
    {
        log.LogInformation("Exporting approval audit trail for compliance");

        var azureDevOpsUrl = Environment.GetEnvironmentVariable("AZURE_DEVOPS_URL");
        var pat = Environment.GetEnvironmentVariable("AZURE_DEVOPS_PAT");
        var projectName = Environment.GetEnvironmentVariable("AZURE_DEVOPS_PROJECT");

        // Query Azure DevOps Audit Log API for approval events
        var auditEvents = await GetApprovalAuditEventsAsync(azureDevOpsUrl, pat, projectName);

        // Transform to compliance format
        var complianceRecords = auditEvents.Select(e => new
        {
            timestamp = e.Timestamp,
            approverUpn = e.Actor.Upn,
            approverEmail = e.Actor.Email,
            environment = e.Resource.EnvironmentName,
            buildNumber = e.Resource.BuildNumber,
            decision = e.Data.Decision,  // Approved/Denied
            comments = e.Data.Comments,
            changeRequestId = e.Data.ChangeRequestId
        }).ToList();

        // Export to JSON (for compliance evidence)
        var json = JsonSerializer.Serialize(complianceRecords, new JsonSerializerOptions
        {
            WriteIndented = true
        });

        // Upload to immutable Azure Blob Storage (WORM, 7-year retention)
        var storageConnectionString = Environment.GetEnvironmentVariable("COMPLIANCE_STORAGE_CONNECTION");
        var blobServiceClient = new BlobServiceClient(storageConnectionString);
        var containerClient = blobServiceClient.GetBlobContainerClient("approval-audit-trail");

        var blobName = $"approval-audit-trail-{DateTime.UtcNow:yyyy-MM-dd}.json";
        var blobClient = containerClient.GetBlobClient(blobName);

        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(json)))
        {
            await blobClient.UploadAsync(stream, overwrite: false);
        }

        // Set legal hold (immutability)
        await blobClient.SetLegalHoldAsync(hasLegalHold: true);

        log.LogInformation($"Approval audit trail exported: {blobName} ({complianceRecords.Count} records)");
    }

    private static async Task<List<AuditEvent>> GetApprovalAuditEventsAsync(
        string azureDevOpsUrl, string pat, string projectName)
    {
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }
}

Summary

  • Approval Gates (Manual): Human oversight for staging (1 approver, 4h) and production (3 approvers + CAB, 24h)
  • Pre-Production Approval: Lead Engineer reviews automated gate results, security scans, observability, rollback plan
  • Production Approval: Multi-level (2 Architects + 1 SRE + CAB), validates staging soak period, active incidents, change freeze, deployment window
  • Azure DevOps Environment Configuration: Complete YAML for approval groups, minimum approvers, timeout, automated gates (test results, security, coverage, incidents, change freeze)
  • Automated Gate Functions: 5 C# Azure Functions (ValidateTestResults, CheckActiveIncidents, ValidateStagingSoakPeriod, CheckChangeFreeze, GetCABApprovalStatus)
  • CAB Process: Weekly meetings, emergency on-demand, async for low-risk, change request template (risk level, deployment window, rollback plan)
  • Emergency Hotfix Procedures: Expedited approval (1 Architect + 1 SRE, 2h timeout), simplified gates, P1 incident justification, CAB async review post-deployment
  • Approval Audit Trail: 7-year retention in immutable storage (WORM, legal hold), weekly export to Azure Blob, compliance evidence for SOC 2/GDPR/HIPAA

Quality Gate Metrics & Dashboards

Quality gate metrics provide data-driven insights into pipeline health, test effectiveness, security posture, and deployment reliability. These metrics enable continuous improvement through trend analysis, anomaly detection, and proactive remediation of quality issues.

Philosophy: What gets measured gets improved—comprehensive metrics enable teams to identify quality trends, detect regressions early, and make data-driven decisions about process improvements. ATP tracks 15+ quality metrics with monthly reviews and quarterly improvement cycles.

Quality Metrics Architecture

graph TD
    A[Pipeline Execution] --> B[Emit Metrics]
    B --> C[Azure DevOps Analytics]
    B --> D[Application Insights]
    B --> E[Log Analytics]

    C --> F[Quality Dashboard]
    D --> F
    E --> F

    F --> G{Threshold Exceeded?}
    G -->|Yes| H[Alert & Notify]
    G -->|No| I[Store Historical Data]

    H --> J[Slack Notification]
    H --> K[Email Notification]
    H --> L[PagerDuty Alert]

    I --> M[Trend Analysis]
    M --> N[Monthly Quality Review]
    N --> O[Improvement Backlog]
    O --> P[Quarterly Roadmap]

    style H fill:#feca57
    style J fill:#feca57
    style K fill:#feca57
    style L fill:#ff6b6b
    style P fill:#90EE90

Key Metrics (Tracked)

Purpose: Monitor quality gate effectiveness across all ATP services and identify improvement opportunities.

Quality Metrics Scorecard:

| Metric | Target | Current | Trend | Blocker | Measurement Frequency |
|--------|--------|---------|-------|---------|-----------------------|
| Build Success Rate | ≥98% | 97.2% | ↗️ Improving | ❌ No | Per build |
| Test Pass Rate | 100% | 99.8% | → Stable | ✅ Yes | Per build |
| Code Coverage (Avg) | ≥70% | 73.5% | ↗️ Improving | ✅ Yes | Per build |
| Branch Coverage (Avg) | ≥60% | 64.2% | ↗️ Improving | ⚠️ Warning | Per build |
| Security Scan Pass Rate | 100% | 98.5% | ↗️ Improving | ✅ Yes | Per build |
| SBOM Generation Success | 100% | 100% | → Stable | ✅ Yes | Per build |
| Container Scan Pass Rate | ≥95% | 96.8% | → Stable | ⚠️ Warning | Per build |
| Deployment Success Rate | ≥95% | 96.1% | → Stable | ❌ No | Per deployment |
| Flaky Test Rate | <2% | 1.3% | ↘️ Decreasing | ❌ No | Daily |
| Mean Time to Fix Gate | <4 hours | 3.2 hours | ↘️ Decreasing | ❌ No | Per failure |
| API Breaking Changes | 0 | 0 | → Stable | ✅ Yes | Per build |
| Schema Breaking Changes | 0 | 0 | → Stable | ✅ Yes | Per build |
| Critical Vulnerabilities | 0 | 0 | → Stable | ✅ Yes | Per build |
| High Vulnerabilities | 0 | 1 | ↗️ Regressing | ⚠️ Warning | Per build |
| Compliance Gate Pass Rate | 100% | 100% | → Stable | ✅ Yes | Per build |

Metric Trend Indicators:

  • ↗️ Improving: metric moving toward its target (positive trend)
  • → Stable: metric at target or within acceptable variance (±2%)
  • ↘️ Decreasing: metric falling on a lower-is-better measure (e.g., flaky test rate), improving past its target
  • ↗️ Regressing: metric rising away from its target (requires attention)
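The trend labels can be derived mechanically from consecutive measurements. The sketch below reuses the ±2-point stability band from the coverage-trend widget KQL in this document; applying that band uniformly to every metric is an assumption, as is the Python form (illustration only):

```python
def classify_trend(previous, current, target, band=2.0):
    """Classify period-over-period movement: moves inside the +/-2 point
    band count as stable; larger moves are improving when they head toward
    the target and regressing when they head away from it."""
    if abs(current - previous) <= band:
        return "stable"
    toward_target = abs(current - target) < abs(previous - target)
    return "improving" if toward_target else "regressing"

# Build success rate recovering toward its 98% target:
print(classify_trend(92.0, 96.0, 98.0))   # improving
# Coverage sliding away from a 70% floor:
print(classify_trend(72.0, 66.0, 70.0))   # regressing
```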

KQL Queries for Quality Metrics:

// Build Success Rate (Last 30 Days)
Build
| where Repository == "ConnectSoft.ATP.Ingestion"
| where QueueTime >= ago(30d)
| summarize
    TotalBuilds = count(),
    SuccessfulBuilds = countif(Result == "succeeded"),
    FailedBuilds = countif(Result == "failed" or Result == "canceled")
  by bin(QueueTime, 1d)
| extend SuccessRate = round((todouble(SuccessfulBuilds) / TotalBuilds) * 100, 2)
| project
    Date = format_datetime(QueueTime, 'yyyy-MM-dd'),
    TotalBuilds,
    SuccessfulBuilds,
    FailedBuilds,
    SuccessRate
| order by Date desc
| render timechart with (title="Build Success Rate (30 Days)", ytitle="Success Rate %", xtitle="Date")
// Test Pass Rate (Per Service, Last 7 Days)
TestRun
| where StartedDate >= ago(7d)
| summarize
    TotalTests = sum(TotalTests),
    PassedTests = sum(PassedTests),
    FailedTests = sum(FailedTests)
  by BuildDefinitionName, bin(StartedDate, 1d)
| extend TestPassRate = round((todouble(PassedTests) / TotalTests) * 100, 2)
| project
    Service = BuildDefinitionName,
    Date = format_datetime(StartedDate, 'yyyy-MM-dd'),
    TotalTests,
    PassedTests,
    FailedTests,
    TestPassRate
| order by Service, Date desc
// Code Coverage Trend (Last 90 Days)
CodeCoverage
| where BuildCompletedDate >= ago(90d)
| where Repository startswith "ConnectSoft.ATP"
| summarize
    AvgLineCoverage = round(avg(LineCoveragePercent), 2),
    AvgBranchCoverage = round(avg(BranchCoveragePercent), 2)
  by bin(BuildCompletedDate, 7d), Repository
| project
    Week = format_datetime(BuildCompletedDate, 'yyyy-MM-dd'),
    Service = extract(@"ConnectSoft\.ATP\.(\w+)", 1, Repository),
    AvgLineCoverage,
    AvgBranchCoverage
| order by Week desc, Service
| render timechart with (title="Code Coverage Trend (90 Days)", ytitle="Coverage %")
// Security Vulnerability Trend (Last 180 Days)
SecurityScan
| where ScanDate >= ago(180d)
| where Project == "ConnectSoft.ATP"
| summarize
    CriticalCount = countif(Severity == "Critical"),
    HighCount = countif(Severity == "High"),
    MediumCount = countif(Severity == "Medium"),
    LowCount = countif(Severity == "Low")
  by bin(ScanDate, 7d), Service
| extend TotalVulnerabilities = CriticalCount + HighCount + MediumCount + LowCount
| project
    Week = format_datetime(ScanDate, 'yyyy-MM-dd'),
    Service,
    CriticalCount,
    HighCount,
    MediumCount,
    LowCount,
    TotalVulnerabilities
| order by Week desc

Azure DevOps Dashboard Configuration

Purpose: Provide at-a-glance visibility into quality gate health across all ATP services with drill-down capabilities for root cause analysis.

Dashboard Structure:

ConnectSoft ATP — Quality Gates Dashboard
═══════════════════════════════════════════════════════════

┌─────────────────────────────────────────────────────────┐
│ BUILD HEALTH                                            │
├─────────────────────────────────────────────────────────┤
│ • Build Success Rate (30d): 97.2% ↗️                    │
│ • Average Build Duration: 8.3 min → Target: <10 min    │
│ • Failed Builds (7d): 3 builds                          │
│ • Top Failure Reasons:                                  │
│   1. Code coverage below threshold (2 builds)           │
│   2. Security scan failed (1 build)                     │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ TEST RESULTS                                            │
├─────────────────────────────────────────────────────────┤
│ • Test Pass Rate: 99.8% → Target: 100%                 │
│ • Flaky Tests Detected: 4 tests (1.3% of total)        │
│ • Average Test Duration: 4.1 min → Target: <5 min      │
│ • Coverage Trend (30d):                                 │
│   - Ingestion: 76.2% ↗️                                 │
│   - Query: 81.5% →                                      │
│   - Integrity: 86.1% ↗️                                 │
│   - Export: 71.8% ↗️                                    │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ SECURITY POSTURE                                        │
├─────────────────────────────────────────────────────────┤
│ • Critical Vulnerabilities: 0 ✅                        │
│ • High Vulnerabilities: 1 ⚠️ (1 accepted risk)         │
│ • Medium Vulnerabilities: 5 (backlog)                   │
│ • Secrets Detected: 0 ✅                                │
│ • Container Scan Pass: 96.8%                            │
│ • License Compliance: 100% ✅                           │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ DEPLOYMENT FREQUENCY (DORA Metrics)                     │
├─────────────────────────────────────────────────────────┤
│ • Deployment Frequency: 12.3/month → Elite (>1/week)   │
│ • Lead Time (Commit→Prod): 3.2 days → High (1-7 days)  │
│ • MTTR (Incident→Fix): 2.1 hours → Medium (<1 hour)    │
│ • Change Failure Rate: 3.9% → Elite (<5%)              │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ QUALITY GATE VIOLATIONS (Last 30 Days)                  │
├─────────────────────────────────────────────────────────┤
│ 1. Coverage gate failed: 12 builds (39%)                │
│ 2. Security scan failed: 8 builds (26%)                 │
│ 3. Test failures: 7 builds (23%)                        │
│ 4. API breaking changes: 4 builds (13%)                 │
│ 5. SBOM generation failed: 0 builds (0%)                │
│                                                          │
│ Mean Time to Fix: 3.2 hours                             │
└─────────────────────────────────────────────────────────┘

Azure DevOps Dashboard Widgets (JSON configuration):

{
  "name": "ATP Quality Gates Dashboard",
  "description": "Quality gate metrics and trends for ConnectSoft ATP",
  "widgets": [
    {
      "name": "Build Success Rate",
      "position": { "row": 1, "column": 1 },
      "size": { "rowSpan": 2, "columnSpan": 2 },
      "settings": {
        "query": "Build | where Repository startswith 'ConnectSoft.ATP' | where QueueTime >= ago(30d) | summarize SuccessRate = round((todouble(countif(Result == 'succeeded')) / count()) * 100, 2) by bin(QueueTime, 1d)"
      },
      "contributionId": "ms.vss-dashboards-web.Microsoft.VisualStudioOnline.Dashboards.QueryScalarWidget"
    },
    {
      "name": "Test Coverage Trend",
      "position": { "row": 1, "column": 3 },
      "size": { "rowSpan": 2, "columnSpan": 3 },
      "settings": {
        "query": "CodeCoverage | where BuildCompletedDate >= ago(90d) | summarize AvgCoverage = avg(LineCoveragePercent) by bin(BuildCompletedDate, 7d), Repository"
      },
      "contributionId": "ms.vss-dashboards-web.Microsoft.VisualStudioOnline.Dashboards.QueryChartWidget"
    },
    {
      "name": "Security Vulnerability Count",
      "position": { "row": 3, "column": 1 },
      "size": { "rowSpan": 2, "columnSpan": 2 },
      "settings": {
        "query": "SecurityScan | where ScanDate >= ago(30d) | summarize CriticalCount = countif(Severity == 'Critical'), HighCount = countif(Severity == 'High') by Service"
      },
      "contributionId": "ms.vss-dashboards-web.Microsoft.VisualStudioOnline.Dashboards.QueryTableWidget"
    },
    {
      "name": "DORA Metrics",
      "position": { "row": 3, "column": 3 },
      "size": { "rowSpan": 2, "columnSpan": 3 },
      "settings": {
        "metrics": [
          { "name": "Deployment Frequency", "value": "12.3/month", "target": ">1/week", "classification": "Elite" },
          { "name": "Lead Time", "value": "3.2 days", "target": "<7 days", "classification": "High" },
          { "name": "MTTR", "value": "2.1 hours", "target": "<1 hour", "classification": "Medium" },
          { "name": "Change Failure Rate", "value": "3.9%", "target": "<5%", "classification": "Elite" }
        ]
      },
      "contributionId": "ms.vss-dashboards-web.Microsoft.VisualStudioOnline.Dashboards.MarkdownWidget"
    },
    {
      "name": "Quality Gate Violations",
      "position": { "row": 5, "column": 1 },
      "size": { "rowSpan": 3, "columnSpan": 5 },
      "settings": {
        "query": "QualityGateViolation | where ViolationDate >= ago(30d) | summarize Count = count() by GateType, FailureReason | order by Count desc"
      },
      "contributionId": "ms.vss-dashboards-web.Microsoft.VisualStudioOnline.Dashboards.QueryChartWidget"
    }
  ]
}
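The position and size fields place each widget on a shared dashboard grid, so two widgets whose spans intersect would collide when rendered. A quick layout sanity check (illustrative Python; the widget list mirrors the configuration above):

```python
def occupied_cells(widget):
    """Expand a widget's position/size into the set of grid cells it covers."""
    row, col = widget["position"]["row"], widget["position"]["column"]
    return {(r, c)
            for r in range(row, row + widget["size"]["rowSpan"])
            for c in range(col, col + widget["size"]["columnSpan"])}

def layout_collisions(widgets):
    """Return pairs of widget names whose grid cells overlap."""
    cells = {w["name"]: occupied_cells(w) for w in widgets}
    names = [w["name"] for w in widgets]
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if cells[a] & cells[b]]

widgets = [
    {"name": "Build Success Rate", "position": {"row": 1, "column": 1}, "size": {"rowSpan": 2, "columnSpan": 2}},
    {"name": "Test Coverage Trend", "position": {"row": 1, "column": 3}, "size": {"rowSpan": 2, "columnSpan": 3}},
    {"name": "Security Vulnerability Count", "position": {"row": 3, "column": 1}, "size": {"rowSpan": 2, "columnSpan": 2}},
    {"name": "DORA Metrics", "position": {"row": 3, "column": 3}, "size": {"rowSpan": 2, "columnSpan": 3}},
    {"name": "Quality Gate Violations", "position": {"row": 5, "column": 1}, "size": {"rowSpan": 3, "columnSpan": 5}},
]
print(layout_collisions(widgets))  # prints [] (no overlaps in the dashboard above)
```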

Dashboard Widget KQL Queries (Detailed):

// Widget 1: Build Success Rate (30-Day Trend)
Build
| where Repository startswith "ConnectSoft.ATP"
| where QueueTime >= ago(30d)
| summarize
    TotalBuilds = count(),
    SuccessfulBuilds = countif(Result == "succeeded"),
    FailedBuilds = countif(Result != "succeeded")
  by bin(QueueTime, 1d)
| extend SuccessRate = round((todouble(SuccessfulBuilds) / TotalBuilds) * 100, 2)
| project
    Date = format_datetime(QueueTime, 'yyyy-MM-dd'),
    SuccessRate,
    TotalBuilds,
    SuccessfulBuilds,
    FailedBuilds
| order by Date desc
| render timechart with (title="Build Success Rate (30 Days)", ytitle="Success Rate %", xtitle="Date", ymin=0, ymax=100)
// Widget 2: Test Coverage by Service (Current + Trend)
CodeCoverage
| where BuildCompletedDate >= ago(90d)
| where Repository startswith "ConnectSoft.ATP"
| extend Service = extract(@"ConnectSoft\.ATP\.(\w+)", 1, Repository)
| summarize
    CurrentCoverage = round(avg(LineCoveragePercent), 2),
    PreviousCoverage = round(avgif(LineCoveragePercent, BuildCompletedDate < ago(30d)), 2)
  by Service
| extend
    Trend = case(
        CurrentCoverage > PreviousCoverage + 2, "↗️ Improving",
        CurrentCoverage < PreviousCoverage - 2, "⚠️ Regressing",
        "→ Stable"
    ),
    Target = 70.0,
    Status = case(
        CurrentCoverage >= 70, "✅ Met",
        CurrentCoverage >= 60, "⚠️ Close",
        "❌ Below"
    )
| project
    Service,
    CurrentCoverage,
    Target,
    Trend,
    Status
| order by CurrentCoverage desc
// Widget 3: Security Vulnerability Summary (Current State)
SecurityScan
| where ScanDate >= ago(7d)
| where Project == "ConnectSoft.ATP"
| summarize arg_max(ScanDate, *) by Service, Severity  // Latest scan per service and severity
| summarize
    CriticalCount = sumif(VulnerabilityCount, Severity == "Critical"),
    HighCount = sumif(VulnerabilityCount, Severity == "High"),
    MediumCount = sumif(VulnerabilityCount, Severity == "Medium"),
    LowCount = sumif(VulnerabilityCount, Severity == "Low")
  by Service
| extend TotalVulnerabilities = CriticalCount + HighCount + MediumCount + LowCount
| extend
    RiskLevel = case(
        CriticalCount > 0, "🔴 Critical",
        HighCount > 0, "🟠 High",
        MediumCount > 5, "🟡 Medium",
        "🟢 Low"
    )
| project
    Service,
    RiskLevel,
    CriticalCount,
    HighCount,
    MediumCount,
    LowCount,
    TotalVulnerabilities
| order by CriticalCount desc, HighCount desc

// Widget 4: Flaky Test Detection (Last 30 Days)
TestRun
| where StartedDate >= ago(30d)
| where BuildDefinitionName startswith "ConnectSoft.ATP"
| join kind=inner (
    TestResult
    | where CompletedDate >= ago(30d)
  ) on TestRunId
| summarize
    TotalRuns = count(),
    PassCount = countif(Outcome == "Passed"),
    FailCount = countif(Outcome == "Failed")
  by TestCaseName, BuildDefinitionName
| where TotalRuns >= 10  // Only tests run at least 10 times
| extend FlakyScore = round((todouble(FailCount) / TotalRuns) * 100, 2)
| where FlakyScore > 0 and FlakyScore < 100  // Exclude always-passing and always-failing tests
| project
    Service = extract(@"ConnectSoft\.ATP\.(\w+)", 1, BuildDefinitionName),
    TestCaseName,
    TotalRuns,
    PassCount,
    FailCount,
    FlakyScore
| order by FlakyScore desc, TotalRuns desc
| take 20

DORA Metrics (DevOps Research & Assessment)

Purpose: Measure software delivery performance using industry-standard DORA metrics to benchmark ATP against elite-performing teams.

DORA Metric Definitions:

| Metric | Definition | ATP Target | Industry Elite | Current Performance | Classification |
|--------|------------|------------|----------------|---------------------|----------------|
| Deployment Frequency | How often code is deployed to production | >1/week | On-demand (multiple/day) | 12.3/month (~3/week) | Elite |
| Lead Time for Changes | Time from commit to production deployment | <7 days | <1 day | 3.2 days | High ⚠️ |
| Mean Time to Recovery (MTTR) | Time to restore service after incident | <1 hour | <1 hour | 2.1 hours | Medium ⚠️ |
| Change Failure Rate | Percentage of deployments causing incidents | <5% | <5% | 3.9% | Elite |
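The classification bands from the table above can also be expressed as a small standalone helper, which is handy outside the analytics workspace (e.g., in reporting scripts). This is an illustrative sketch: the function names and threshold constants are ours, taken directly from the bands shown in the table.

```python
# Illustrative DORA classification helpers; bands mirror the table above.

def classify_lead_time(hours: float) -> str:
    """Classify lead time for changes (commit -> production)."""
    if hours < 24:
        return "Elite (<1 day)"
    if hours < 168:  # 7 days
        return "High (1-7 days)"
    if hours < 720:  # 30 days
        return "Medium (1-30 days)"
    return "Low (>30 days)"

def classify_change_failure_rate(percent: float) -> str:
    """Classify the percentage of deployments causing P1/P2 incidents."""
    if percent < 5:
        return "Elite (<5%)"
    if percent < 15:
        return "High (5-15%)"
    return "Medium (>15%)"

# ATP's current scorecard values
print(classify_lead_time(3.2 * 24))           # → High (1-7 days)
print(classify_change_failure_rate(3.9))      # → Elite (<5%)
```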

DORA Metrics Calculation (KQL):

// Deployment Frequency (Deployments per month)
Deployment
| where DeploymentTime >= ago(90d)
| where Environment == "Production"
| where Project == "ConnectSoft.ATP"
| summarize DeploymentCount = count() by bin(DeploymentTime, 30d)
| extend DeploymentsPerMonth = DeploymentCount
| project
    Month = format_datetime(DeploymentTime, 'yyyy-MM'),
    DeploymentsPerMonth
| order by Month desc

// Lead Time for Changes (Commit → Production)
Build
| where QueueTime >= ago(90d)
| where Repository startswith "ConnectSoft.ATP"
| where Result == "succeeded"
| join kind=inner (
    Deployment
    | where Environment == "Production"
  ) on BuildNumber
| extend LeadTimeHours = datetime_diff('hour', DeploymentTime, SourceVersion.CommitTime)
| summarize
    AvgLeadTimeHours = round(avg(LeadTimeHours), 2),
    P50LeadTimeHours = round(percentile(LeadTimeHours, 50), 2),
    P95LeadTimeHours = round(percentile(LeadTimeHours, 95), 2)
  by bin(QueueTime, 30d)
| extend
    AvgLeadTimeDays = round(AvgLeadTimeHours / 24, 1),
    Classification = case(
        AvgLeadTimeHours < 24, "Elite (<1 day)",
        AvgLeadTimeHours < 168, "High (1-7 days)",
        AvgLeadTimeHours < 720, "Medium (1-30 days)",
        "Low (>30 days)"
    )
| project
    Month = format_datetime(QueueTime, 'yyyy-MM'),
    AvgLeadTimeDays,
    P50LeadTimeHours,
    P95LeadTimeHours,
    Classification
| order by Month desc

// Mean Time to Recovery (MTTR)
Incident
| where CreatedDate >= ago(90d)
| where Project == "ConnectSoft.ATP"
| where Severity in ("P1", "P2")
| extend RecoveryTimeMinutes = datetime_diff('minute', ResolvedDate, CreatedDate)
| summarize
    AvgMTTRMinutes = round(avg(RecoveryTimeMinutes), 2),
    P50MTTRMinutes = round(percentile(RecoveryTimeMinutes, 50), 2),
    P95MTTRMinutes = round(percentile(RecoveryTimeMinutes, 95), 2),
    IncidentCount = count()
  by bin(CreatedDate, 30d)
| extend
    AvgMTTRHours = round(AvgMTTRMinutes / 60, 1),
    Classification = case(
        AvgMTTRMinutes < 60, "Elite (<1 hour)",
        AvgMTTRMinutes < 1440, "High (1-24 hours)",
        "Medium (>24 hours)"
    )
| project
    Month = format_datetime(CreatedDate, 'yyyy-MM'),
    IncidentCount,
    AvgMTTRHours,
    P50MTTRMinutes,
    P95MTTRMinutes,
    Classification
| order by Month desc

// Change Failure Rate (Deployments → Incidents)
Deployment
| where DeploymentTime >= ago(90d)
| where Environment == "Production"
| join kind=leftouter (
    Incident
    | where Severity in ("P1", "P2")
    | extend DeploymentCausedIncident = true
  ) on DeploymentId
| summarize
    // Distinct counts avoid double-counting deployments joined to multiple incidents
    TotalDeployments = dcount(DeploymentId),
    FailedDeployments = dcountif(DeploymentId, DeploymentCausedIncident == true)
  by bin(DeploymentTime, 30d)
| extend ChangeFailureRate = round((todouble(FailedDeployments) / TotalDeployments) * 100, 2)
| extend
    Classification = case(
        ChangeFailureRate < 5, "Elite (<5%)",
        ChangeFailureRate < 15, "High (5-15%)",
        "Medium (>15%)"
    )
| project
    Month = format_datetime(DeploymentTime, 'yyyy-MM'),
    TotalDeployments,
    FailedDeployments,
    ChangeFailureRate,
    Classification
| order by Month desc

Alerting on Gate Failures

Purpose: Provide immediate feedback when quality gates fail, enabling rapid remediation and preventing quality regressions.

Alert Configuration Matrix:

| Gate Failure | Severity | Channel | Recipients | SLA | Escalation |
|--------------|----------|---------|------------|-----|------------|
| Build Failure | Medium | Slack #atp-builds | Team lead, build author | 4 hours | Architect (8h) |
| Test Failure (>5%) | High | Slack + Email | Team lead, QA lead | 2 hours | Architect (4h) |
| Coverage Drop (>5%) | Medium | Email | Team lead, architect | 1 business day | Weekly review |
| Security Critical | Critical | PagerDuty + Slack + Email | Security team, architect, SRE | 1 hour | CISO (2h) |
| Security High | High | Email + Slack | Security team, team lead | 24 hours | Security Officer (48h) |
| SBOM Generation Failed | High | Email | Team lead, compliance officer | 4 hours | Compliance team (8h) |
| API Breaking Change | Critical | Slack + Email | Architect, API team | 2 hours | CTO (4h) |
| Schema Breaking Change | Critical | Slack + Email | Architect, integration team | 2 hours | CTO (4h) |
| Deployment Failure | Critical | PagerDuty + Slack | SRE on-call, team lead | 15 minutes | SRE Lead (30m) |
| Health Check Failure | Critical | PagerDuty | SRE on-call | 5 minutes | SRE Lead (15m) |
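For automation, the routing matrix above is easier to consume as machine-readable data than as prose. A minimal sketch follows; the dictionary keys, field names, and the `route_alert` helper are illustrative (they are not part of the Azure Monitor configuration), and only a subset of rows is shown.

```python
# Illustrative routing table; rows mirror the alert configuration matrix above.
ALERT_ROUTES = {
    "BuildFailure":       {"severity": "Medium",   "channels": ["slack"],                       "sla_minutes": 240},
    "SecurityCritical":   {"severity": "Critical", "channels": ["pagerduty", "slack", "email"], "sla_minutes": 60},
    "DeploymentFailure":  {"severity": "Critical", "channels": ["pagerduty", "slack"],          "sla_minutes": 15},
    "HealthCheckFailure": {"severity": "Critical", "channels": ["pagerduty"],                   "sla_minutes": 5},
}

def route_alert(gate_failure: str) -> dict:
    """Look up channels and SLA for a gate failure; unknown gates page by default."""
    return ALERT_ROUTES.get(
        gate_failure,
        # Fail-safe default: unrecognized failures still reach a human quickly.
        {"severity": "High", "channels": ["pagerduty"], "sla_minutes": 60},
    )

print(route_alert("DeploymentFailure")["sla_minutes"])  # → 15
```

Keeping the matrix as data means the documentation table and the routing logic can be generated from a single source, so they cannot drift apart.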

Alert Routing Configuration (Azure Monitor):

{
  "name": "ATP Quality Gate Alerts",
  "description": "Alert rules for quality gate failures",
  "actionGroups": [
    {
      "name": "ATP-Team-Lead",
      "shortName": "ATPLead",
      "emailReceivers": [
        { "name": "Team Lead", "emailAddress": "atp-lead@connectsoft.example" }
      ],
      "smsReceivers": [],
      "webhookReceivers": [
        {
          "name": "Slack-ATP-Builds",
          "serviceUri": "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX"
        }
      ]
    },
    {
      "name": "ATP-Security-Team",
      "shortName": "ATPSec",
      "emailReceivers": [
        { "name": "Security Team", "emailAddress": "security@connectsoft.example" }
      ],
      "azureFunctionReceivers": [
        {
          "name": "PagerDuty-Integration",
          "functionAppResourceId": "/subscriptions/.../resourceGroups/ATP-Prod-RG/providers/Microsoft.Web/sites/atp-pagerduty-function",
          "functionName": "SendToPagerDuty",
          "httpTriggerUrl": "https://atp-pagerduty-function.azurewebsites.net/api/SendToPagerDuty"
        }
      ]
    },
    {
      "name": "ATP-SRE-On-Call",
      "shortName": "ATPSRE",
      "emailReceivers": [
        { "name": "SRE On-Call", "emailAddress": "sre-oncall@connectsoft.example" }
      ],
      "azureFunctionReceivers": [
        {
          "name": "PagerDuty-SRE",
          "functionAppResourceId": "/subscriptions/.../resourceGroups/ATP-Prod-RG/providers/Microsoft.Web/sites/atp-pagerduty-function",
          "functionName": "SendToPagerDuty",
          "httpTriggerUrl": "https://atp-pagerduty-function.azurewebsites.net/api/SendToPagerDuty"
        }
      ],
      "smsReceivers": [
        { "name": "SRE On-Call Mobile", "phoneNumber": "+1234567890" }
      ]
    }
  ],
  "alertRules": [
    {
      "name": "Build-Failure-Alert",
      "description": "Alert when ATP build fails",
      "severity": 2,
      "enabled": true,
      "query": "Build | where Repository startswith 'ConnectSoft.ATP' | where Result != 'succeeded'",
      "frequency": "PT5M",
      "timeWindow": "PT5M",
      "actionGroups": ["ATP-Team-Lead"],
      "throttling": "PT1H"
    },
    {
      "name": "Security-Critical-Vulnerability-Alert",
      "description": "Alert when critical vulnerability detected",
      "severity": 0,
      "enabled": true,
      "query": "SecurityScan | where Severity == 'Critical' | where ScanDate >= ago(5m)",
      "frequency": "PT5M",
      "timeWindow": "PT5M",
      "actionGroups": ["ATP-Security-Team"],
      "throttling": "PT15M"
    },
    {
      "name": "Coverage-Drop-Alert",
      "description": "Alert when coverage drops >5% from baseline",
      "severity": 2,
      "enabled": true,
      "query": "CodeCoverage | sort by BuildCompletedDate asc | extend PrevCoverage = prev(LineCoveragePercent) | where LineCoveragePercent < PrevCoverage - 5",
      "frequency": "PT1H",
      "timeWindow": "PT1H",
      "actionGroups": ["ATP-Team-Lead"],
      "throttling": "PT24H"
    },
    {
      "name": "Deployment-Failure-Alert",
      "description": "Alert when production deployment fails",
      "severity": 0,
      "enabled": true,
      "query": "Deployment | where Environment == 'Production' | where Result != 'succeeded'",
      "frequency": "PT1M",
      "timeWindow": "PT5M",
      "actionGroups": ["ATP-SRE-On-Call"],
      "throttling": "PT5M"
    }
  ]
}

Slack Alert Integration (C# Azure Function):

// SendSlackAlert.cs — Send quality gate failure alerts to Slack
using System;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class SendSlackAlert
{
    private static readonly HttpClient HttpClient = new HttpClient();

    [FunctionName("SendSlackAlert")]
    public static async Task Run(
        [QueueTrigger("quality-gate-alerts")] QualityGateAlert alert,
        ILogger log)
    {
        log.LogInformation($"Sending Slack alert for {alert.GateType} failure");

        var slackWebhookUrl = Environment.GetEnvironmentVariable("SLACK_WEBHOOK_URL");

        // Build Slack message
        var slackMessage = new
        {
            text = $"⚠️ *Quality Gate Failure*: {alert.GateType}",
            blocks = new[]
            {
                new
                {
                    type = "header",
                    text = new
                    {
                        type = "plain_text",
                        text = $"🚨 Quality Gate Failure: {alert.GateType}"
                    }
                },
                new
                {
                    type = "section",
                    fields = new[]
                    {
                        new { type = "mrkdwn", text = $"*Service:*\n{alert.Service}" },
                        new { type = "mrkdwn", text = $"*Build:*\n{alert.BuildNumber}" },
                        new { type = "mrkdwn", text = $"*Gate:*\n{alert.GateType}" },
                        new { type = "mrkdwn", text = $"*Severity:*\n{alert.Severity}" }
                    }
                },
                new
                {
                    type = "section",
                    text = new
                    {
                        type = "mrkdwn",
                        text = $"*Failure Reason:*\n```{alert.FailureReason}```"
                    }
                },
                new
                {
                    type = "section",
                    text = new
                    {
                        type = "mrkdwn",
                        text = $"*Remediation:*\n{alert.RemediationGuidance}"
                    }
                },
                new
                {
                    type = "actions",
                    elements = new[]
                    {
                        new
                        {
                            type = "button",
                            text = new { type = "plain_text", text = "View Build" },
                            url = alert.BuildUrl,
                            style = "primary"
                        },
                        new
                        {
                            type = "button",
                            text = new { type = "plain_text", text = "View Logs" },
                            url = alert.LogsUrl
                        }
                    }
                }
            }
        };

        var json = JsonSerializer.Serialize(slackMessage);
        var content = new StringContent(json, Encoding.UTF8, "application/json");

        var response = await HttpClient.PostAsync(slackWebhookUrl, content);
        response.EnsureSuccessStatusCode();

        log.LogInformation($"Slack alert sent successfully for {alert.GateType}");
    }
}

public class QualityGateAlert
{
    public string Service { get; set; }
    public string BuildNumber { get; set; }
    public string GateType { get; set; }
    public string Severity { get; set; }
    public string FailureReason { get; set; }
    public string RemediationGuidance { get; set; }
    public string BuildUrl { get; set; }
    public string LogsUrl { get; set; }
}

Continuous Improvement Framework

Purpose: Use quality metrics to drive quarterly improvement cycles with measurable outcomes and accountability.

Improvement Process:

graph TD
    A[Monthly Metrics Review] --> B{Metrics Below Target?}
    B -->|No| C[Continue Monitoring]
    B -->|Yes| D[Root Cause Analysis]

    D --> E[Identify Improvement Areas]
    E --> F[Create Improvement Backlog]
    F --> G[Prioritize by Impact]

    G --> H[Quarterly Planning]
    H --> I[Assign Improvement Epics]
    I --> J[Implement Improvements]

    J --> K[Measure Impact]
    K --> L{Target Achieved?}
    L -->|Yes| M[Document Success]
    L -->|No| N[Iterate on Solution]

    M --> A
    N --> E

    C --> A

    style D fill:#feca57
    style E fill:#feca57
    style M fill:#90EE90

Monthly Quality Review Meeting:

Cadence: First Tuesday of each month, 10:00 AM
Duration: 60 minutes
Attendees: Architects, Team Leads, QA Lead, SRE Lead

Agenda:

  1. Metrics Review (20 minutes)

    • Build success rate, test pass rate, coverage trends
    • Security scan results, vulnerability trends
    • Deployment success rate, DORA metrics
  2. Quality Gate Violations (15 minutes)

    • Top 5 failure reasons (coverage, security, tests, API breaking changes)
    • Mean time to fix gate failures (trend)
    • Repeat offenders (same failures across builds)
  3. Improvement Opportunities (15 minutes)

    • Metrics below target (identify root causes)
    • Process bottlenecks (manual approval delays, test timeouts)
    • Tool enhancements (better linters, faster test execution)
  4. Action Items (10 minutes)

    • Assign improvement epics to teams
    • Set measurable targets for next month
    • Review progress on previous action items

Quality Improvement Backlog (Azure DevOps Queries):

// Quality Improvement Work Items
WorkItem
| where WorkItemType == "Epic"
| where Tags contains "QualityImprovement"
| where State in ("New", "Active")
| summarize
    TotalEpics = count(),
    InProgressEpics = countif(State == "Active"),
    CompletedEpics = countif(State == "Closed")
  by AssignedTo
| extend CompletionRate = round((todouble(CompletedEpics) / TotalEpics) * 100, 2)
| project
    Owner = AssignedTo,
    TotalEpics,
    InProgressEpics,
    CompletedEpics,
    CompletionRate
| order by CompletionRate desc

Quarterly Improvement Roadmap (Example):

| Quarter | Focus Area | Initiatives | Target Metric Improvement | Owner |
|---------|------------|-------------|---------------------------|-------|
| Q1 2025 | Test Coverage | Add unit tests for uncovered paths; improve integration tests | Coverage 70% → 80% | QA Lead |
| Q2 2025 | Security Posture | Upgrade vulnerable dependencies; implement secrets rotation | High vulnerabilities 5 → 0 | Security Team |
| Q3 2025 | Build Performance | Parallelize test execution; optimize Docker layer caching | Build duration 8min → 5min | Platform Team |
| Q4 2025 | Deployment Reliability | Implement automated rollback; enhance health checks | Deployment success 96% → 99% | SRE Team |

Quality Trend Analysis & Predictions

Purpose: Use historical data to predict future quality trends and proactively address issues before they impact production.

Trend Analysis Script (Python):

#!/usr/bin/env python3
# scripts/analyze-quality-trends.py

import pandas as pd
import numpy as np
from scipy.stats import linregress
from datetime import datetime, timedelta
import json

def analyze_build_success_trend(builds_df):
    """
    Analyze build success rate trend using linear regression.
    Predict success rate for next 30 days.
    """
    builds_df['date'] = pd.to_datetime(builds_df['QueueTime'])
    builds_df['date_numeric'] = (builds_df['date'] - builds_df['date'].min()).dt.days

    # Calculate success rate per day
    daily_success = builds_df.groupby('date').agg({
        'Result': lambda x: (x == 'succeeded').sum() / len(x) * 100
    }).reset_index()
    daily_success.columns = ['date', 'success_rate']
    daily_success['date_numeric'] = (daily_success['date'] - daily_success['date'].min()).dt.days

    # Linear regression
    slope, intercept, r_value, p_value, std_err = linregress(
        daily_success['date_numeric'],
        daily_success['success_rate']
    )

    # Predict next 30 days
    last_date_numeric = daily_success['date_numeric'].max()
    future_dates = range(last_date_numeric + 1, last_date_numeric + 31)
    predictions = [slope * d + intercept for d in future_dates]

    # Determine trend classification
    if slope > 0.1:
        trend = "↗️ Improving"
        recommendation = "Continue current practices; success rate trending upward"
    elif slope < -0.1:
        trend = "⚠️ Regressing"
        recommendation = "URGENT: Investigate root causes of declining build success"
    else:
        trend = "→ Stable"
        recommendation = "Maintain current quality standards"

    return {
        "metric": "Build Success Rate",
        "current": round(daily_success['success_rate'].iloc[-1], 2),
        "trend": trend,
        "slope": round(slope, 4),
        "r_squared": round(r_value ** 2, 4),
        "predicted_30d": round(predictions[-1], 2),
        "recommendation": recommendation
    }

def analyze_coverage_trend(coverage_df):
    """Analyze code coverage trend and predict future coverage."""
    coverage_df['date'] = pd.to_datetime(coverage_df['BuildCompletedDate'])
    coverage_df['date_numeric'] = (coverage_df['date'] - coverage_df['date'].min()).dt.days

    # Linear regression on line coverage
    slope, intercept, r_value, _, _ = linregress(
        coverage_df['date_numeric'],
        coverage_df['LineCoveragePercent']
    )

    # Predict next 30 days
    last_date_numeric = coverage_df['date_numeric'].max()
    predicted_coverage = slope * (last_date_numeric + 30) + intercept

    # Determine when coverage will reach the 70% target:
    # solve slope * day + intercept = 70, then offset from the latest data point
    days_to_target = ((70 - intercept) / slope) - last_date_numeric if slope > 0 else -1

    return {
        "metric": "Code Coverage",
        "current": round(coverage_df['LineCoveragePercent'].iloc[-1], 2),
        "target": 70.0,
        "trend_slope": round(slope, 4),
        "predicted_30d": round(predicted_coverage, 2),
        "days_to_target": int(days_to_target) if days_to_target > 0 else "N/A",
        "recommendation": f"At current rate, will reach 70% target in {int(days_to_target)} days" if days_to_target > 0 else "Increase test coverage velocity"
    }

def main():
    # Load data from Azure DevOps Analytics API (example)
    builds_df = pd.read_json("builds.json")
    coverage_df = pd.read_json("coverage.json")

    # Analyze trends
    build_trend = analyze_build_success_trend(builds_df)
    coverage_trend = analyze_coverage_trend(coverage_df)

    # Generate report
    report = {
        "generated_at": datetime.utcnow().isoformat(),
        "trends": [build_trend, coverage_trend],
        "summary": {
            "metrics_improving": sum(1 for t in [build_trend, coverage_trend] if "Improving" in t.get("trend", "")),
            "metrics_regressing": sum(1 for t in [build_trend, coverage_trend] if "Regressing" in t.get("trend", ""))
        }
    }

    # Output report
    print(json.dumps(report, indent=2))

    # Exit with error if metrics regressing
    if report["summary"]["metrics_regressing"] > 0:
        print("\n⚠️ WARNING: Some metrics are regressing. Review recommendations.")
        exit(1)

    print("\n✅ Quality trends are positive or stable.")
    exit(0)

if __name__ == "__main__":
    main()

Summary

  • Quality Gate Metrics: 15 tracked metrics (build success, test pass rate, coverage, security scan, SBOM, deployment, flaky tests, MTTR, API/schema changes, vulnerabilities, compliance)
  • Metrics Scorecard: Current values, targets, trends (improving/stable/regressing), blocker status, measurement frequency
  • KQL Queries: 8 detailed queries (build success rate, test coverage trend, security vulnerabilities, flaky test detection, DORA metrics)
  • Azure DevOps Dashboard: 5-widget configuration (build health, test coverage, security posture, DORA metrics, quality gate violations)
  • DORA Metrics: 4 metrics (deployment frequency 12.3/month Elite, lead time 3.2 days High, MTTR 2.1 hours Medium, change failure rate 3.9% Elite)
  • Alert Configuration: 10 alert types with severity, channels (Slack/Email/PagerDuty), recipients, SLAs, escalation paths
  • Alert Routing: Azure Monitor action groups (ATP-Team-Lead, ATP-Security-Team, ATP-SRE-On-Call) with email/SMS/webhook/Azure Function receivers
  • Slack Integration: C# Azure Function sending rich Slack messages with failure details, remediation guidance, action buttons
  • Continuous Improvement Framework: Monthly quality review meetings, quarterly roadmap (Q1-Q4 2025), improvement backlog tracking
  • Trend Analysis: Python script using linear regression to predict quality trends, identify regressions, provide recommendations

Remediation & Continuous Improvement

Purpose: Provide systematic approach to resolving quality gate violations, preventing recurrence, and continuously raising quality standards through data-driven threshold ratcheting.

Violation Response Workflow:

graph TD
    A[Quality Gate Failure Detected] --> B[Alert Sent to Team]
    B --> C[Developer/SRE Triage]
    C --> D{Root Cause Identified?}
    D -->|No| E[Escalate to Architect]
    D -->|Yes| F{Fix Type?}

    E --> D

    F -->|Code Fix| G[Implement Code Changes]
    F -->|Dependency Update| H[Update Dependencies]
    F -->|Risk Acceptance| I[Create Risk Acceptance Record]
    F -->|Threshold Adjustment| J[Propose Threshold Change]

    G --> K[Re-run Pipeline]
    H --> K
    I --> L[Document in ADR]
    J --> M[Quality Gate Retrospective]

    K --> N{Gate Passed?}
    N -->|No| C
    N -->|Yes| O[Verify Fix]

    L --> O
    M --> O

    O --> P[Document Lessons Learned]
    P --> Q[Update Runbook]
    Q --> R[Close Incident]

    style A fill:#ff6b6b
    style N fill:#feca57
    style R fill:#90EE90

Step 1: Detect — Pipeline fails with clear error message

Error Message Format (Standardized):

========================================
❌ QUALITY GATE FAILURE
========================================

Gate: Test Coverage
Service: ConnectSoft.ATP.Ingestion
Build: 1.0.42
Threshold: 75% line coverage
Actual: 72.3% line coverage
Difference: -2.7%

📊 Coverage by Module:
  • Controllers: 85.2% ✅
  • Services: 78.9% ✅
  • Repositories: 68.4% ❌ (below threshold)
  • Models: 95.6% ✅

🔍 Remediation Guidance:
  1. Add unit tests for uncovered repository methods
  2. Focus on Repositories/AuditEventRepository.cs (52% coverage)
  3. Run: dotnet test --collect:"XPlat Code Coverage" --filter FullyQualifiedName~Repository
  4. Exclude generated code if necessary (update .runsettings)

📚 Documentation:
  • Coverage policy: docs/ci-cd/quality-gates.md#test-coverage-gates
  • Runbook: docs/operations/runbooks/coverage-failure.md

⏱️  Estimated Fix Time: 2-4 hours
========================================

Step 2: Triage — Developer or SRE investigates root cause

Triage Checklist:

## Quality Gate Failure Triage

**Gate**: ___________________  
**Build**: ___________________  
**Assignee**: ___________________  
**Triage Start**: ___________________

### Initial Assessment
- [ ] Error message reviewed
- [ ] Build logs analyzed
- [ ] Previous successful build identified (for comparison)
- [ ] Recent code changes reviewed (Git diff)

### Root Cause Analysis
- [ ] Root cause identified: ___________________
- [ ] Contributing factors: ___________________
- [ ] Similar failures in history? ___________________

### Fix Strategy
- [ ] **Code Fix** — Implement missing tests/fix bugs
- [ ] **Dependency Update** — Upgrade/downgrade package
- [ ] **Risk Acceptance** — Document accepted risk (with justification)
- [ ] **Threshold Adjustment** — Propose threshold change (with retrospective)
- [ ] **False Positive** — Report tool issue for investigation

### Estimated Time to Fix
- [ ] <4 hours (immediate fix)
- [ ] 4-8 hours (within sprint)
- [ ] >8 hours (requires spike/research)

**Triage Completed**: ___________________  
**Next Action**: ___________________

Step 3: Fix — Code changes, dependency updates, or risk acceptance

Fix Implementation Patterns:

| Failure Type | Fix Pattern | Example | Estimated Time |
|--------------|-------------|---------|----------------|
| Coverage Below Threshold | Add unit tests for uncovered code | xUnit tests for repository methods | 2-4 hours |
| Test Failure | Fix bug or update test expectations | Fix race condition in integration test | 1-8 hours |
| Security Critical | Upgrade dependency or patch code | Upgrade System.Text.Json 6.0 → 8.0 | 1-2 hours |
| Security High | Upgrade dependency or accept risk | Suppress false positive with justification | 2-4 hours |
| API Breaking Change | Create /v2/ endpoint or revert | Implement /v2/audit-events with new schema | 1-2 days |
| SBOM Generation Failed | Fix project references or restore | Repair NuGet package references | 30 minutes |
| OpenTelemetry Missing | Add ActivitySource instrumentation | Add using statement + activity creation | 1-2 hours |
| Health Check Failed | Fix dependency connection or timeout | Increase database health check timeout | 30 minutes |
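The fix-pattern table above can likewise be kept as data so that pipeline tooling can populate the remediation field of an alert automatically. This sketch is illustrative: the keys, helper name, and the idea of feeding the result into the alert's remediation-guidance field are ours, and only a few rows of the table are shown.

```python
# Illustrative mapping from failure type to remediation guidance and effort,
# mirroring a subset of the fix-pattern table above.
FIX_PATTERNS = {
    "CoverageBelowThreshold": ("Add unit tests for uncovered code", "2-4 hours"),
    "SecurityCritical":       ("Upgrade dependency or patch code", "1-2 hours"),
    "ApiBreakingChange":      ("Create /v2/ endpoint or revert", "1-2 days"),
    "SbomGenerationFailed":   ("Fix project references or restore", "30 minutes"),
}

def remediation_guidance(failure_type: str) -> str:
    """Return a one-line remediation hint with estimated effort for an alert payload."""
    pattern, effort = FIX_PATTERNS.get(failure_type, ("Triage manually", "unknown"))
    return f"{pattern} (estimated: {effort})"

print(remediation_guidance("SbomGenerationFailed"))
# → Fix project references or restore (estimated: 30 minutes)
```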

Risk Acceptance Record (ADR Template):

# ADR-XXX: Risk Acceptance — [Vulnerability/Issue Description]

## Status
**Accepted** — Date: YYYY-MM-DD  
**Expires**: YYYY-MM-DD (6 months from acceptance)

## Context
**Vulnerability**: CVE-XXXX-XXXXX  
**Severity**: High (CVSS 7.8)  
**Affected Package**: Newtonsoft.Json 12.0.3  
**Exploitability**: Low (requires authenticated admin access)

## Decision
Accept risk for 6 months due to:
1. No patch available from vendor (Microsoft investigating)
2. Exploitability requires admin credentials (mitigated by RBAC)
3. Breaking change to migrate to System.Text.Json (requires 2-week refactoring)

## Mitigation
- [ ] Enable Azure Firewall rules to block external access
- [ ] Add runtime validation to reject malicious JSON payloads
- [ ] Monitor vendor advisory for patch availability
- [ ] Schedule migration to System.Text.Json for Q2 2025

## Acceptance Criteria
- Patch becomes available → Apply immediately
- 6 months elapse → Escalate to CTO for re-acceptance or mandatory fix
- Exploitation detected in wild → Immediate hotfix required

**Accepted By**: [Architect Name]  
**Reviewed By**: [Security Officer Name]  
**Escalation Contact**: [CTO Email]

Step 4: Verify — Re-run pipeline; ensure gate passes

Verification Script (PowerShell):

# scripts/verify-quality-gate-fix.ps1
param(
    [Parameter(Mandatory=$true)]
    [string]$BuildId,

    [Parameter(Mandatory=$true)]
    [string]$GateType  # Coverage, Security, Test, etc.
)

Write-Host "🔍 Verifying quality gate fix for build $BuildId..." -ForegroundColor Cyan

# Query Azure DevOps Build API
$azureDevOpsUrl = $env:AZURE_DEVOPS_URL
$pat = $env:AZURE_DEVOPS_PAT

$headers = @{
    Authorization = "Basic " + [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(":$pat"))
}

$buildUrl = "$azureDevOpsUrl/_apis/build/builds/$BuildId`?api-version=7.0"
$build = Invoke-RestMethod -Uri $buildUrl -Headers $headers -Method Get

# Check if build succeeded
if ($build.result -eq "succeeded") {
    Write-Host "✅ Build passed: $($build.buildNumber)" -ForegroundColor Green

    # Verify specific gate passed
    switch ($GateType) {
        "Coverage" {
            $coverageUrl = "$azureDevOpsUrl/_apis/test/CodeCoverage?buildId=$BuildId&api-version=7.0"
            $coverage = Invoke-RestMethod -Uri $coverageUrl -Headers $headers -Method Get

            $lineCoverage = $coverage.coverageData[0].coverageStats | Where-Object { $_.label -eq "Lines" } | Select-Object -ExpandProperty covered
            $totalLines = $coverage.coverageData[0].coverageStats | Where-Object { $_.label -eq "Lines" } | Select-Object -ExpandProperty total
            $coveragePercent = [math]::Round(($lineCoverage / $totalLines) * 100, 2)

            Write-Host "  Coverage: $coveragePercent% (threshold: 75%)" -ForegroundColor Green
        }
        "Security" {
            Write-Host "  Security scan passed (no critical/high vulnerabilities)" -ForegroundColor Green
        }
        "Test" {
            $testUrl = "$azureDevOpsUrl/_apis/test/ResultSummaryByBuild?buildId=$BuildId&api-version=7.0"
            $testResults = Invoke-RestMethod -Uri $testUrl -Headers $headers -Method Get

            $passRate = [math]::Round(($testResults.aggregatedResultsAnalysis.totalTests - $testResults.aggregatedResultsAnalysis.resultsDifference.failureCount) / $testResults.aggregatedResultsAnalysis.totalTests * 100, 2)

            Write-Host "  Test pass rate: $passRate% (threshold: 100%)" -ForegroundColor Green
        }
    }

    Write-Host "`n✅ Quality gate fix verified successfully" -ForegroundColor Green
    exit 0
} else {
    Write-Host "❌ Build failed: $($build.result)" -ForegroundColor Red
    Write-Host "  Review build logs for details" -ForegroundColor Yellow
    exit 1
}

Step 5: Document — Update ADR if architectural decision required

Lessons Learned Template:

# Lessons Learned — Quality Gate Failure [Build Number]

**Date**: YYYY-MM-DD  
**Gate**: [Gate Type]  
**Build**: [Build Number]  
**Service**: [Service Name]  
**Time to Fix**: [X hours/days]

## Failure Summary
**Error Message**: [Exact error from pipeline]  
**Root Cause**: [Technical root cause]  
**Impact**: [Build blocked, deployment delayed, etc.]

## Resolution
**Fix Applied**: [Code changes, dependency updates, configuration changes]  
**Verification**: [How fix was verified]  
**Pull Request**: #[PR number]

## Prevention
**Process Improvement**: [Changes to prevent recurrence]  
**Automation Enhancement**: [New linter rules, pre-commit hooks, etc.]  
**Documentation Update**: [Updated runbooks, ADRs, etc.]

## Action Items
- [ ] Update runbook: docs/operations/runbooks/[gate-type]-failure.md
- [ ] Add pre-commit hook to catch issue locally
- [ ] Share lessons learned in team meeting

**Author**: [Developer Name]  
**Reviewed By**: [Tech Lead Name]

Ratcheting Thresholds

Purpose: Continuously raise quality standards by incrementally increasing thresholds as team capabilities improve, preventing quality regression.

Ratcheting Strategy:

| Threshold Type | Current | Q1 2025 Target | Q2 2025 Target | Q3 2025 Target | Rationale |
|----------------|---------|----------------|----------------|----------------|-----------|
| Line Coverage (Ingestion) | 75% | 77% | 80% | 82% | Sustained improvement; add 2% per quarter |
| Line Coverage (Query) | 80% | 82% | 85% | 87% | Complex query logic requires higher coverage |
| Branch Coverage (All) | 60% | 62% | 65% | 67% | Improve conditional logic testing |
| Critical Vulnerabilities | 0 | 0 | 0 | 0 | Zero tolerance maintained |
| High Vulnerabilities | 0 | 0 | 0 | 0 | Ratchet down from current 1 accepted risk |
| Medium Vulnerabilities | <10 | <8 | <5 | <3 | Gradual reduction; backlog cleanup |
| Flaky Test Rate | <2% | <1.5% | <1% | <0.5% | Improve test reliability |
| Mean Time to Fix Gate | <4h | <3h | <2h | <1h | Faster remediation through automation |

Ratcheting Automation (C# Azure Function):

// RatchetQualityThresholds.cs — Quarterly threshold adjustment
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class RatchetQualityThresholds
{
    [FunctionName("RatchetQualityThresholds")]
    public static async Task Run(
        [TimerTrigger("0 0 0 1 1,4,7,10 *")] TimerInfo timer,  // Quarterly: Jan 1, Apr 1, Jul 1, Oct 1
        ILogger log)
    {
        log.LogInformation("Evaluating quality threshold ratcheting for quarter");

        var currentQuarter = GetCurrentQuarter();

        // Get historical metrics (last 90 days)
        var metrics = await GetHistoricalMetricsAsync();

        // Evaluate ratcheting eligibility per service
        var ratchetRecommendations = new List<RatchetRecommendation>();

        foreach (var service in metrics.GroupBy(m => m.Service))
        {
            var serviceName = service.Key;
            var serviceMetrics = service.ToList();

            // Calculate sustained performance (avg last 90 days)
            var avgLineCoverage = serviceMetrics.Average(m => m.LineCoverage);
            var avgBranchCoverage = serviceMetrics.Average(m => m.BranchCoverage);
            var currentThreshold = GetCurrentThreshold(serviceName);

            log.LogInformation($"Service: {serviceName}, Avg Coverage: {avgLineCoverage:F2}%, Current Threshold: {currentThreshold}%");

            // Ratchet if sustained performance exceeds threshold by 5%
            if (avgLineCoverage >= currentThreshold + 5)
            {
                var newThreshold = currentThreshold + 2;  // Ratchet up by 2%

                ratchetRecommendations.Add(new RatchetRecommendation
                {
                    Service = serviceName,
                    MetricType = "LineCoverage",
                    CurrentThreshold = currentThreshold,
                    NewThreshold = newThreshold,
                    SustainedPerformance = avgLineCoverage,
                    Justification = $"Sustained performance of {avgLineCoverage:F2}% exceeds threshold by {(avgLineCoverage - currentThreshold):F2}%",
                    ApprovalRequired = true
                });

                log.LogInformation($"  ✅ Ratchet recommendation: {currentThreshold}% → {newThreshold}%");
            }
            else if (avgLineCoverage < currentThreshold)
            {
                log.LogWarning($"  ⚠️  Performance below threshold: {avgLineCoverage:F2}% < {currentThreshold}%");
            }
            else
            {
                log.LogInformation($"  → Threshold maintained (performance within 5% of threshold)");
            }
        }

        // Create work items for threshold adjustments
        if (ratchetRecommendations.Any())
        {
            await CreateRatchetWorkItemsAsync(ratchetRecommendations, currentQuarter);
            log.LogInformation($"Created {ratchetRecommendations.Count} ratchet recommendation work items for Q{currentQuarter} {DateTime.UtcNow.Year}");
        }
        else
        {
            log.LogInformation("No ratchet recommendations for this quarter");
        }
    }

    private static int GetCurrentQuarter()
    {
        var month = DateTime.UtcNow.Month;
        return (month - 1) / 3 + 1;
    }

    private static async Task<List<QualityMetric>> GetHistoricalMetricsAsync()
    {
        // Query Azure DevOps Analytics API for last 90 days
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }

    private static double GetCurrentThreshold(string serviceName)
    {
        // Retrieve current threshold from configuration
        var thresholds = new Dictionary<string, double>
        {
            ["ConnectSoft.ATP.Ingestion"] = 75.0,
            ["ConnectSoft.ATP.Query"] = 80.0,
            ["ConnectSoft.ATP.Integrity"] = 85.0,
            ["ConnectSoft.ATP.Export"] = 70.0
        };

        return thresholds.ContainsKey(serviceName) ? thresholds[serviceName] : 70.0;
    }

    private static async Task CreateRatchetWorkItemsAsync(List<RatchetRecommendation> recommendations, int quarter)
    {
        // Create Azure DevOps work items for architect review
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }
}

public class RatchetRecommendation
{
    public string Service { get; set; }
    public string MetricType { get; set; }
    public double CurrentThreshold { get; set; }
    public double NewThreshold { get; set; }
    public double SustainedPerformance { get; set; }
    public string Justification { get; set; }
    public bool ApprovalRequired { get; set; }
}

public class QualityMetric
{
    public string Service { get; set; }
    public DateTime Date { get; set; }
    public double LineCoverage { get; set; }
    public double BranchCoverage { get; set; }
}

Threshold Ratcheting Policy:

# Threshold Ratcheting Policy
policy:
  # Coverage thresholds
  coverage:
    incremental: 2%  # Increase threshold by 2% per quarter
    sustainedPerformance: 5%  # Must exceed threshold by 5% for 90 days
    maxThreshold: 95%  # Cap at 95% (allow for generated code exclusions)
    reviewCadence: Quarterly

  # Security vulnerability thresholds
  security:
    critical: 0  # Zero tolerance (never ratchet)
    high: 0  # Zero tolerance (never ratchet)
    medium: -2  # Reduce by 2 per quarter (if sustained)
    low: -5  # Reduce by 5 per quarter

  # Flaky test rate
  flakyTests:
    incremental: -0.5%  # Reduce by 0.5% per quarter
    sustainedImprovement: 30 days  # Must maintain improvement for 30 days
    targetRate: 0%  # Ultimate goal: zero flaky tests

  # Mean time to fix
  mttf:
    incremental: -30min  # Reduce by 30 minutes per quarter
    sustainedImprovement: 60 days
    targetTime: 1h  # Ultimate goal: fix within 1 hour
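
The coverage rule in the policy above reduces to a small function. The sketch below (Python, with illustrative names and sample values that are not part of ATP) shows the ratchet decision: raise the threshold by the increment only when sustained performance clears the margin, capped at the maximum:

```python
def ratchet_coverage_threshold(current: float, avg_90d: float,
                               increment: float = 2.0,
                               margin: float = 5.0,
                               cap: float = 95.0) -> float:
    """Raise the threshold by `increment` only when 90-day sustained
    performance exceeds it by `margin`; never exceed `cap` (which allows
    for generated-code exclusions)."""
    if avg_90d >= current + margin:
        return min(current + increment, cap)
    return current

# Hypothetical readings: sustained 81.4% against a 75% threshold
print(ratchet_coverage_threshold(75.0, 81.4))   # 77.0 (ratchet up)
print(ratchet_coverage_threshold(80.0, 83.0))   # 80.0 (hold: within the 5% margin)
```

The cap matters: a service already at 94% ratchets to 95% and then holds, rather than chasing unreachable coverage on generated code.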

Flaky Test Quarantine & Remediation

Purpose: Systematically eliminate flaky tests by quarantining unreliable tests and requiring fixes within 2 sprints.

Flaky Test Detection (automated daily):

// DetectFlakyTests.cs — Daily detection of unreliable tests
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class DetectFlakyTests
{
    [FunctionName("DetectFlakyTests")]
    public static async Task Run(
        [TimerTrigger("0 0 6 * * *")] TimerInfo timer,  // Daily: 6:00 AM UTC
        ILogger log)
    {
        log.LogInformation("Detecting flaky tests (last 30 days)");

        // Query test results (last 30 days)
        var testResults = await GetTestResultsAsync(days: 30);

        // Calculate flaky score per test
        var flakyTests = testResults
            .GroupBy(t => t.TestCaseName)
            .Where(g => g.Count() >= 10)  // Only tests run at least 10 times
            .Select(g => new
            {
                TestCaseName = g.Key,
                TotalRuns = g.Count(),
                PassCount = g.Count(t => t.Outcome == "Passed"),
                FailCount = g.Count(t => t.Outcome == "Failed"),
                FlakyScore = (double)g.Count(t => t.Outcome == "Failed") / g.Count() * 100
            })
            .Where(t => t.FlakyScore > 0 && t.FlakyScore < 100)  // Exclude always-passing/failing
            .Where(t => t.FlakyScore > 10)  // Flaky if >10% failure rate
            .OrderByDescending(t => t.FlakyScore)
            .ToList();

        log.LogInformation($"Detected {flakyTests.Count} flaky tests");

        // Quarantine flaky tests (create work items)
        foreach (var test in flakyTests)
        {
            // Check if work item already exists
            var existingWorkItem = await GetExistingFlakyTestWorkItemAsync(test.TestCaseName);

            if (existingWorkItem == null)
            {
                // Create new work item
                await CreateFlakyTestWorkItemAsync(new FlakyTestWorkItem
                {
                    Title = $"[Flaky Test] {test.TestCaseName}",
                    Description = $@"
## Flaky Test Detection

**Test**: `{test.TestCaseName}`  
**Flaky Score**: {test.FlakyScore:F2}% ({test.FailCount}/{test.TotalRuns} runs failed)  
**Detection Date**: {DateTime.UtcNow:yyyy-MM-dd}  
**Deadline**: {DateTime.UtcNow.AddDays(28):yyyy-MM-dd} (2 sprints)

## Remediation Actions
- [ ] Investigate root cause (race condition, timing dependency, shared state)
- [ ] Fix test (add synchronization, isolate state, increase timeout)
- [ ] Verify fix (test passes 20+ consecutive runs)
- [ ] Remove from quarantine list

## Quarantine Status
- [ ] Test disabled in pipeline (add `[Fact(Skip=""Flaky"")]`)
- [ ] Work item assigned to original test author
- [ ] Deadline tracked (2 sprints from detection)
                    ",
                    Priority = test.FlakyScore > 50 ? 1 : 2,  // P1 if >50% flaky
                    Tags = new[] { "FlakyTest", "TechnicalDebt", "QualityImprovement" },
                    AssignedTo = await GetTestAuthorAsync(test.TestCaseName)
                });

                log.LogInformation($"  ✅ Created work item for: {test.TestCaseName} (flaky score: {test.FlakyScore:F2}%)");
            }
            else
            {
                log.LogInformation($"  → Work item already exists for: {test.TestCaseName}");
            }
        }

        // Generate daily flaky test report
        await GenerateFlakyTestReportAsync(flakyTests);
    }

    private static async Task<List<TestResult>> GetTestResultsAsync(int days)
    {
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }

    private static async Task<WorkItem> GetExistingFlakyTestWorkItemAsync(string testCaseName)
    {
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }

    private static async Task CreateFlakyTestWorkItemAsync(FlakyTestWorkItem workItem)
    {
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }

    private static async Task<string> GetTestAuthorAsync(string testCaseName)
    {
        // Git blame to find original test author
        // Implementation omitted for brevity
        return "unassigned";
    }

    // IEnumerable<dynamic> accepts the anonymous-type projections built above
    // (List<T> is invariant, so List<dynamic> would not compile here)
    private static async Task GenerateFlakyTestReportAsync(IEnumerable<dynamic> flakyTests)
    {
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }
}

public class FlakyTestWorkItem
{
    public string Title { get; set; }
    public string Description { get; set; }
    public int Priority { get; set; }
    public string[] Tags { get; set; }
    public string AssignedTo { get; set; }
}
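
The detection rule embedded in the LINQ query above reduces to two small functions. This Python sketch (names illustrative, not part of ATP) mirrors the same thresholds: at least 10 runs, an intermittent failure rate above 10%:

```python
def flaky_score(outcomes: list[str]) -> float:
    """Failure rate (%) over recorded runs, mirroring the LINQ projection above."""
    return outcomes.count("Failed") / len(outcomes) * 100

def should_quarantine(outcomes: list[str],
                      min_runs: int = 10, threshold: float = 10.0) -> bool:
    """Quarantine only tests with enough history that fail intermittently;
    an always-failing test is a plain bug, not a flaky one."""
    if len(outcomes) < min_runs:
        return False
    score = flaky_score(outcomes)
    return 0 < score < 100 and score > threshold

runs = ["Passed"] * 17 + ["Failed"] * 3    # 15% failure rate over 20 runs
print(should_quarantine(runs))             # True
print(should_quarantine(["Failed"] * 20))  # False (always failing, not flaky)
```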

Quality Gate Retrospectives

Purpose: Regularly review quality gate effectiveness, identify false positives, adjust thresholds, and propose new gates based on lessons learned.

Retrospective Cadence:

  • Frequency: Monthly (first Tuesday, immediately after metrics review)
  • Duration: 60 minutes
  • Participants: Tech Leads, QA Engineers, Security Officers, SRE
  • Facilitator: Rotating (different team member each month)

Retrospective Agenda:

## Quality Gate Retrospective — [Month YYYY]

**Date**: [Date]  
**Facilitator**: [Name]  
**Participants**: [Names]

### 1. Metrics Review (15 minutes)
**Presented By**: Metrics Lead

- Quality scorecard review (15 metrics)
- DORA metrics update
- Quality gate violation trends

**Questions for Discussion**:
- Which metrics improved this month?
- Which metrics regressed? Root causes?
- Are we tracking the right metrics?

---

### 2. False Positives & Threshold Adjustments (20 minutes)
**Presented By**: Team Leads

**False Positives Identified**:
| Gate | False Positive Count | Root Cause | Proposed Fix |
|------|---------------------|------------|--------------|
| [Gate Type] | [Count] | [Why it failed incorrectly] | [Tool fix, threshold adjustment] |

**Threshold Adjustment Proposals**:
| Metric | Current Threshold | Proposed Threshold | Justification |
|--------|-------------------|-------------------|---------------|
| [Metric] | [Current] | [Proposed] | [Why adjustment needed] |

**Discussion**:
- Are thresholds too aggressive? Too lenient?
- Should we ratchet thresholds this quarter?

---

### 3. New Gate Proposals (15 minutes)
**Presented By**: Quality Champions

**Proposed New Gates**:
| Gate | Purpose | Enforcement Point | Blocker | Estimated Effort |
|------|---------|-------------------|---------|------------------|
| [Gate Name] | [Why needed] | [CI/CD stage] | [Yes/No] | [Hours/Days] |

**Discussion**:
- Which new gates add value without friction?
- Priority for implementation?

---

### 4. Gate Effectiveness & Developer Experience (10 minutes)
**Presented By**: Team (Open Forum)

**Questions**:
- Which gates caught real issues this month?
- Which gates caused frustration or delays?
- Are error messages clear and actionable?
- Is remediation guidance helpful?

**Feedback**:
- [Positive feedback]
- [Improvement suggestions]

---

### 5. Action Items (5 minutes)
**Facilitator**: Retrospective Lead

- [ ] Action 1: [Description] — **Owner**: [Name], **Due**: [Date]
- [ ] Action 2: [Description] — **Owner**: [Name], **Due**: [Date]
- [ ] Action 3: [Description] — **Owner**: [Name], **Due**: [Date]

---

### Retrospective Outcomes
- **Continue Doing**: [Effective practices to maintain]
- **Start Doing**: [New practices to adopt]
- **Stop Doing**: [Ineffective practices to eliminate]

**Next Retrospective**: [Date]

Retrospective Action Item Tracking (Azure DevOps):

// Quality Gate Retrospective Action Items
WorkItem
| where WorkItemType == "Task"
| where Tags contains "Retrospective"
| where Tags contains "QualityGate"
| where CreatedDate >= ago(90d)
| summarize
    TotalItems = count(),
    CompletedItems = countif(State == "Closed"),
    InProgressItems = countif(State == "Active"),
    OverdueItems = countif(State != "Closed" and DueDate < now())
  by AssignedTo
| extend CompletionRate = round((todouble(CompletedItems) / TotalItems) * 100, 2)
| project
    Owner = AssignedTo,
    TotalItems,
    CompletedItems,
    InProgressItems,
    OverdueItems,
    CompletionRate
| order by OverdueItems desc, CompletionRate asc

Gate Effectiveness Scoring

Purpose: Quantify gate effectiveness to prioritize improvements and retire ineffective gates.

Effectiveness Metrics:

| Gate | True Positives | False Positives | False Negatives | Precision | Recall | F1 Score | Effectiveness |
|------|----------------|-----------------|-----------------|-----------|--------|----------|---------------|
| Test Coverage | 42 | 3 | 1 | 93.3% | 97.7% | 95.5% | Excellent |
| Security Scan | 35 | 8 | 0 | 81.4% | 100% | 89.7% | Good |
| API Breaking Change | 12 | 1 | 0 | 92.3% | 100% | 96.0% | Excellent |
| Flaky Test Detection | 18 | 5 | 3 | 78.3% | 85.7% | 81.8% | Good |
| Health Check | 8 | 0 | 0 | 100% | 100% | 100% | Excellent |
| Load Test | 4 | 2 | 1 | 66.7% | 80.0% | 72.7% | Acceptable ⚠️ |

Definitions:

  • True Positive (TP): Gate correctly blocked a problematic build (the issue would otherwise have reached production)
  • False Positive (FP): Gate incorrectly blocked a valid build (no actual issue)
  • False Negative (FN): Gate incorrectly passed a problematic build (issue escaped to production)
  • Precision: TP / (TP + FP) — How often gate failures are correct
  • Recall: TP / (TP + FN) — How often gate catches issues
  • F1 Score: Harmonic mean of precision and recall
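
The definitions above can be checked against the effectiveness table. A minimal Python sketch (the `f1_metrics` helper is illustrative, not part of ATP):

```python
def f1_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 as percentages (assumes tp+fp and tp+fn > 0)."""
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 1), round(recall, 1), round(f1, 1)

# Test Coverage gate: TP=42, FP=3, FN=1
print(f1_metrics(42, 3, 1))   # (93.3, 97.7, 95.5)
# Load Test gate: TP=4, FP=2, FN=1
print(f1_metrics(4, 2, 1))    # (66.7, 80.0, 72.7)
```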

Effectiveness Classification:

  • Excellent (F1 ≥90%): Gate is highly effective; maintain current configuration
  • Good (F1 80-89%): Gate is effective; minor tuning may improve precision
  • Acceptable (F1 70-79%): Gate adds value; investigate false positives
  • Poor (F1 <70%): Gate may be ineffective; consider retiring or major overhaul

Gate Effectiveness Tracking (C#):

// TrackGateEffectiveness.cs — Track true/false positives/negatives
public class GateEffectivenessTracker
{
    // Persistence assumed here as a thin repository abstraction over Cosmos DB
    // exposing UpsertAsync/QueryAsync; injected via DI (the original snippet
    // used _cosmosClient without declaring it).
    private readonly IGateEffectivenessStore _cosmosClient;

    public GateEffectivenessTracker(IGateEffectivenessStore cosmosClient) =>
        _cosmosClient = cosmosClient;

    public async Task RecordGateOutcomeAsync(GateOutcome outcome)
    {
        var record = new GateEffectivenessRecord
        {
            GateType = outcome.GateType,
            BuildId = outcome.BuildId,
            Timestamp = DateTime.UtcNow,

            GateResult = outcome.GateResult,  // Passed/Failed
            ActualIssue = outcome.ActualIssue,  // Was there a real issue?

            // Classification
            OutcomeType = ClassifyOutcome(outcome.GateResult, outcome.ActualIssue),

            // Context
            FailureReason = outcome.FailureReason,
            RemediationTime = outcome.RemediationTime,
            EscapedToProduction = outcome.EscapedToProduction
        };

        await _cosmosClient.UpsertAsync(record);
    }

    private string ClassifyOutcome(string gateResult, bool actualIssue)
    {
        if (gateResult == "Failed" && actualIssue)
            return "TruePositive";  // Gate correctly caught issue

        if (gateResult == "Failed" && !actualIssue)
            return "FalsePositive";  // Gate incorrectly blocked valid build

        if (gateResult == "Passed" && actualIssue)
            return "FalseNegative";  // Gate missed issue (escaped to production)

        return "TrueNegative";  // Gate correctly passed valid build
    }

    public async Task<GateEffectivenessReport> CalculateEffectivenessAsync(string gateType, int days = 90)
    {
        var records = await _cosmosClient.QueryAsync<GateEffectivenessRecord>(
            r => r.GateType == gateType && r.Timestamp >= DateTime.UtcNow.AddDays(-days)
        );

        var tp = records.Count(r => r.OutcomeType == "TruePositive");
        var fp = records.Count(r => r.OutcomeType == "FalsePositive");
        var fn = records.Count(r => r.OutcomeType == "FalseNegative");
        var tn = records.Count(r => r.OutcomeType == "TrueNegative");

        var precision = tp + fp > 0 ? (double)tp / (tp + fp) * 100 : 0;
        var recall = tp + fn > 0 ? (double)tp / (tp + fn) * 100 : 0;
        var f1Score = precision + recall > 0 ? 2 * (precision * recall) / (precision + recall) : 0;

        var effectiveness = f1Score >= 90 ? "Excellent" :
                            f1Score >= 80 ? "Good" :
                            f1Score >= 70 ? "Acceptable" : "Poor";

        return new GateEffectivenessReport
        {
            GateType = gateType,
            TruePositives = tp,
            FalsePositives = fp,
            FalseNegatives = fn,
            TrueNegatives = tn,
            Precision = Math.Round(precision, 2),
            Recall = Math.Round(recall, 2),
            F1Score = Math.Round(f1Score, 2),
            Effectiveness = effectiveness,
            Recommendation = GetRecommendation(effectiveness, fp, fn)
        };
    }

    private string GetRecommendation(string effectiveness, int fp, int fn)
    {
        if (effectiveness == "Poor" && fp > fn)
            return "High false positive rate; consider relaxing threshold or improving detection logic";

        if (effectiveness == "Poor" && fn > fp)
            return "High false negative rate; consider tightening threshold or adding additional checks";

        if (effectiveness == "Acceptable" && fp > 5)
            return "Reduce false positives by refining gate logic or threshold";

        return "Gate is effective; maintain current configuration";
    }
}

Summary

  • Violation Response Workflow: 5-step process (detect, triage, fix, verify, document) with Mermaid diagram, standardized error messages, triage checklist, fix patterns, risk acceptance template, verification script
  • Ratcheting Thresholds: Quarterly adjustment policy (coverage +2%, vulnerabilities reduced, flaky tests -0.5%, MTTF -30min), C# Azure Function for automated recommendations, threshold ratcheting policy YAML
  • Flaky Test Quarantine: Daily detection (Azure Function), >10% failure rate triggers quarantine, work item creation with 2-sprint deadline, automated assignment to test author
  • Quality Gate Retrospectives: Monthly meetings (first Tuesday, 60min), 5-part agenda (metrics, false positives, new gates, effectiveness, action items), retrospective template, action item tracking (KQL query)
  • Gate Effectiveness Scoring: Precision/recall/F1 score calculation, 6-gate effectiveness table (test coverage 95.5%, security 89.7%, API breaking change 96.0%, flaky test 81.8%, health check 100%, load test 72.7%), C# tracker with true/false positive/negative classification, recommendations based on effectiveness

Exception Handling & Risk Acceptance

Purpose: Provide governance framework for suppressing quality gate violations when legitimate exceptions exist (false positives, accepted risks, mitigated vulnerabilities).

Suppression Principles:

  • Time-Bounded: All suppressions expire (max 6 months) and require re-review and re-approval
  • Auditable: Every suppression is logged in the meta-audit stream with justification and approver
  • Minimal: Suppressions are the exception, not the rule; prefer fixing issues over suppressing
  • Governed: Requires security officer or architect approval; no self-approval

Suppression File Formats

Purpose: Enable structured suppression of false positives and accepted risks across multiple quality gate tools.

Suppression Files by Gate Type:

| Gate Type | Suppression File | Format | Approval Required |
|-----------|------------------|--------|-------------------|
| OWASP Dependency Check | dependency-check-suppressions.xml | XML | Security Officer |
| Secrets Detection (CredScan) | credscan-suppressions.json | JSON | Security Officer |
| SonarQube | .sonarqube/suppressions.xml | XML | Architect |
| StyleCop | stylecop.json or .editorconfig | JSON/INI | Team Lead |
| Roslyn Analyzers | .globalconfig or .editorconfig | INI | Architect |
| Trivy (Container Scan) | .trivyignore | Text | Security Officer |

OWASP Dependency Check Suppression (XML):

<?xml version="1.0" encoding="UTF-8"?>
<!-- dependency-check-suppressions.xml -->
<!-- The Dependency-Check suppression schema has no reason/approvedBy/expiresOn
     elements; approval metadata is recorded inside <notes>, and expiry uses the
     `until` attribute on <suppress>. -->
<suppressions xmlns="https://jeremylong.github.io/DependencyCheck/dependency-suppression.1.3.xsd">

  <!-- Suppression 1: False positive for Newtonsoft.Json -->
  <suppress until="2025-04-15Z">  <!-- expires 6 months after approval -->
    <notes><![CDATA[
      False positive. CVE affects Newtonsoft.Json deserialization with TypeNameHandling enabled.
      ATP does not use TypeNameHandling; all deserialization uses safe defaults.
      Confirmed by security team analysis on 2024-10-15.
      Approved by security-team@connectsoft.example on 2024-10-15.
      Re-review on expiration. If CVE still reported, consider upgrade to System.Text.Json.
      Tracking issue: ATP-1234
    ]]></notes>
    <packageUrl regex="true">^pkg:nuget/Newtonsoft\.Json@12\.0\.3$</packageUrl>
    <cve>CVE-2024-12345</cve>
  </suppress>

  <!-- Suppression 2: Accepted risk for legacy library -->
  <suppress until="2025-03-01Z">
    <notes><![CDATA[
      Accepted risk. OldLibrary has known vulnerability (CVSS 6.5) but is only used in
      non-production dev tooling (data seeders). Not deployed to production.
      Migration to ModernLibrary scheduled for Q2 2025.
      Approved by architect@connectsoft.example on 2024-09-01.
      Mitigation: OldLibrary isolated to dev environment only.
      Epic for migration: ATP-EPIC-567
    ]]></notes>
    <packageUrl regex="true">^pkg:nuget/OldLibrary@1\.2\.3$</packageUrl>
    <cve>CVE-2023-98765</cve>
  </suppress>

  <!-- Suppression 3: Vulnerability mitigated by application controls -->
  <suppress until="2025-04-20Z">
    <notes><![CDATA[
      Path traversal vulnerability (CWE-22) mitigated by application-level path validation.
      All blob paths validated against allowlist regex before passing to Azure.Storage.Blobs.
      Code review completed by security team on 2024-10-20.
      Approved by security-officer@connectsoft.example on 2024-10-20.
      Mitigation code: BlobStorageService.cs lines 45-58
      Unit tests validate path validation: BlobStorageServiceTests.cs
    ]]></notes>
    <packageUrl regex="true">^pkg:nuget/Azure\.Storage\.Blobs@12\.14\.0$</packageUrl>
    <vulnerabilityName>CWE-22</vulnerabilityName>
  </suppress>

</suppressions>

Secrets Detection Suppression (JSON):

{
  "$schema": "https://aka.ms/credscan/suppression-schema.json",
  "suppressions": [
    {
      "fingerprint": "12345abcdef67890",
      "pattern": "ConnectionStrings__DefaultConnection",
      "reason": "False positive. This is a configuration key name, not an actual secret. Real connection string loaded from Key Vault at runtime.",
      "approvedBy": "security-team@connectsoft.example",
      "approvedDate": "2024-10-15",
      "expiresOn": "2025-04-15",
      "notes": "Configuration pattern documented in appsettings.json schema"
    },
    {
      "fingerprint": "abcdef1234567890",
      "pattern": "-----BEGIN CERTIFICATE-----",
      "filePath": "tests/TestCertificates/test-cert.pem",
      "reason": "Test certificate for development only. Not a real production certificate. Certificate is self-signed and expires in 30 days.",
      "approvedBy": "security-team@connectsoft.example",
      "approvedDate": "2024-09-01",
      "expiresOn": "2025-03-01",
      "notes": "Test certificates isolated to tests/ directory; never deployed to production"
    },
    {
      "fingerprint": "9876543210fedcba",
      "pattern": "xoxb-",
      "filePath": "docs/examples/slack-integration.md",
      "reason": "Example Slack token in documentation. Placeholder value, not a real token. Format: xoxb-XXXX-YYYY-ZZZZ",
      "approvedBy": "tech-lead@connectsoft.example",
      "approvedDate": "2024-10-01",
      "expiresOn": "2025-04-01",
      "notes": "Documentation example; clarified with comment that it's a placeholder"
    }
  ]
}
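
The governance principles (required justification, approver, and a 6-month expiry cap) can be enforced mechanically, e.g. in a pre-commit hook. A minimal Python sketch; `validate_suppressions` and the 183-day cap are illustrative assumptions, not ATP tooling:

```python
import json
from datetime import date

MAX_DAYS = 183  # roughly the 6-month cap from the suppression principles

def validate_suppressions(doc: str) -> list[str]:
    """Check each entry of the JSON suppression file for the governance
    fields shown above and the 6-month expiry cap; returns error strings."""
    errors = []
    for s in json.loads(doc)["suppressions"]:
        fid = s.get("fingerprint", "<unknown>")
        for field in ("reason", "approvedBy", "approvedDate", "expiresOn"):
            if field not in s:
                errors.append(f"{fid}: missing {field}")
        if "approvedDate" in s and "expiresOn" in s:
            span = (date.fromisoformat(s["expiresOn"])
                    - date.fromisoformat(s["approvedDate"])).days
            if span > MAX_DAYS:
                errors.append(f"{fid}: expiry exceeds 6-month cap ({span} days)")
    return errors

good = json.dumps({"suppressions": [{
    "fingerprint": "12345abcdef67890", "reason": "False positive",
    "approvedBy": "security-team@connectsoft.example",
    "approvedDate": "2024-10-15", "expiresOn": "2025-04-15"}]})
print(validate_suppressions(good))   # []
```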

SonarQube Suppression (XML):

<?xml version="1.0" encoding="UTF-8"?>
<!-- .sonarqube/suppressions.xml -->
<suppressions>

  <!-- Suppression for S3776: Cognitive Complexity -->
  <suppression>
    <ruleKey>csharpsquid:S3776</ruleKey>
    <filePath>src/ConnectSoft.ATP.Integrity/Services/TamperEvidenceService.cs</filePath>
    <lineNumber>142</lineNumber>
    <reason>
      High cognitive complexity (35) in hash chain validation method.
      Complexity inherent to cryptographic validation algorithm (Merkle tree traversal).
      Refactoring would reduce readability and introduce bugs.
      Code reviewed and approved by cryptography expert.
    </reason>
    <approvedBy>architect@connectsoft.example</approvedBy>
    <approvedDate>2024-10-10</approvedDate>
    <expiresOn>2025-04-10</expiresOn>
  </suppression>

  <!-- Suppression for S1135: TODO comments -->
  <suppression>
    <ruleKey>csharpsquid:S1135</ruleKey>
    <filePath>src/ConnectSoft.ATP.Query/Services/QueryOptimizer.cs</filePath>
    <lineNumber>78</lineNumber>
    <reason>
      TODO comment tracking planned optimization for Q2 2025.
      Work item created: ATP-789. TODO will be removed when implemented.
    </reason>
    <approvedBy>tech-lead@connectsoft.example</approvedBy>
    <approvedDate>2024-10-01</approvedDate>
    <expiresOn>2025-06-01</expiresOn>  <!-- Extended to Q2 2025 -->
  </suppression>

</suppressions>

Trivy Container Scan Suppression (.trivyignore):

# .trivyignore — Suppress container image vulnerabilities

# CVE-2024-11111: False positive for Alpine base image
# Reason: CVE affects OpenSSL 3.0.x; Alpine 3.18 uses LibreSSL (not affected)
# Approved By: security-team@connectsoft.example
# Approved Date: 2024-10-15
# Expires On: 2025-04-15
CVE-2024-11111

# CVE-2023-22222: Accepted risk for curl vulnerability
# Reason: curl only used in health check scripts (non-production utility)
# Mitigation: Health check scripts do not accept user input
# Approved By: architect@connectsoft.example
# Approved Date: 2024-09-20
# Expires On: 2025-03-20
CVE-2023-22222

# CVE-2024-33333: Mitigated by runtime validation
# Reason: Vulnerability in JSON parser mitigated by schema validation before parsing
# Approved By: security-officer@connectsoft.example
# Approved Date: 2024-10-25
# Expires On: 2025-04-25
CVE-2024-33333
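
The "Expires On" annotations above are a comment convention of this document; Trivy itself ignores comments, so expiry has to be enforced in CI. A minimal Python sketch under that assumption:

```python
import re
from datetime import date

def expired_trivy_suppressions(text: str, today: date) -> list[str]:
    """Return CVE IDs from a .trivyignore whose preceding
    '# Expires On: YYYY-MM-DD' annotation has lapsed."""
    expired, expiry = [], None
    for line in text.splitlines():
        m = re.match(r"#\s*Expires On:\s*(\d{4}-\d{2}-\d{2})", line)
        if m:
            expiry = date.fromisoformat(m.group(1))
        elif line.startswith("CVE-"):
            if expiry and expiry < today:
                expired.append(line.strip())
            expiry = None   # each annotation covers only the next CVE line
    return expired

sample = "# Expires On: 2025-03-20\nCVE-2023-22222\n"
print(expired_trivy_suppressions(sample, date(2025, 6, 1)))   # ['CVE-2023-22222']
```

A CI step can fail the build whenever this returns a non-empty list, forcing the re-review described below.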

Risk Acceptance Process

Purpose: Provide formal governance for accepting security or quality risks when remediation is not immediately feasible.

Risk Acceptance Workflow:

graph TD
    A[Quality Gate Failure] --> B{Can Fix Immediately?}
    B -->|Yes| C[Implement Fix]
    B -->|No| D[Evaluate Risk Acceptance]

    D --> E{Risk Acceptable?}
    E -->|No| F[Block Deployment]
    E -->|Yes| G[Document Justification]

    G --> H[Create Risk Acceptance Record]
    H --> I{Risk Level?}
    I -->|Critical/High| J[Security Officer Approval]
    I -->|Medium| K[Architect Approval]
    I -->|Low| L[Team Lead Approval]

    J --> M{Approved?}
    K --> M
    L --> M

    M -->|No| F
    M -->|Yes| N[Create Suppression File]

    N --> O[Set Expiration Date]
    O --> P[Log in Meta-Audit Stream]
    P --> Q[Schedule Re-Review]
    Q --> R[Allow Deployment]

    F --> S[Remediate or Escalate]
    C --> R

    style F fill:#ff6b6b
    style R fill:#90EE90

Risk Acceptance Criteria:

| Risk Level | Examples | Approval Required | Max Duration | Re-Review Cadence |
|------------|----------|-------------------|--------------|-------------------|
| Critical (CVSS 9-10) | RCE, data breach, auth bypass | CISO + Security Officer | 30 days | Weekly |
| High (CVSS 7-8.9) | Privilege escalation, XSS, SQLi | Security Officer + Architect | 90 days | Monthly |
| Medium (CVSS 4-6.9) | Information disclosure, DoS | Architect | 180 days (6 months) | Quarterly |
| Low (CVSS 0-3.9) | Low-impact bugs, code smells | Team Lead | 365 days (1 year) | Annually |
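
The criteria above amount to a simple lookup from CVSS score to approval route. A Python sketch (the function name and tuple shape are illustrative, not ATP tooling):

```python
def risk_acceptance_route(cvss: float) -> tuple[str, str, int]:
    """Map a CVSS score to (risk level, required approvers,
    max suppression duration in days) per the criteria table above."""
    if cvss >= 9.0:
        return "Critical", "CISO + Security Officer", 30
    if cvss >= 7.0:
        return "High", "Security Officer + Architect", 90
    if cvss >= 4.0:
        return "Medium", "Architect", 180
    return "Low", "Team Lead", 365

print(risk_acceptance_route(7.8))   # ('High', 'Security Officer + Architect', 90)
```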

Risk Acceptance Steps:

Step 1: Justification — Document why risk is acceptable

Acceptable Justifications:

## Valid Risk Acceptance Justifications

### False Positives
- Tool incorrectly flagged code as vulnerable (verified by manual analysis)
- CVE does not apply to ATP's usage pattern (e.g., feature not enabled)
- Vulnerability requires preconditions not present in ATP (e.g., specific OS version)

### Mitigated Risks
- Application-level controls prevent exploitation (e.g., input validation)
- Network isolation prevents attack vector (e.g., private VNet)
- Defense-in-depth compensating controls (e.g., WAF rules, rate limiting)

### Temporary Exceptions
- No patch available from vendor (waiting for upstream fix)
- Patch introduces breaking changes (migration requires extensive refactoring)
- Library only used in non-production environments (dev/test tooling)

### Business Decisions
- Cost of remediation outweighs risk (executive sign-off required)
- Feature scheduled for deprecation (will be removed within 6 months)
- Third-party dependency with no viable alternative (accepted risk with monitoring)

Step 2: Approval — Security officer or architect sign-off required

Approval Matrix:

| Risk Level | Approver 1 | Approver 2 | Approver 3 | Documentation Required |
|------------|------------|------------|------------|------------------------|
| Critical | CISO | Security Officer | Architect | ADR + Mitigation Plan + Monitoring Plan |
| High | Security Officer | Architect | — | ADR + Mitigation Plan |
| Medium | Architect | — | — | ADR or suppression comment |
| Low | Team Lead | — | — | Suppression comment |

Risk Acceptance Form (Azure DevOps Work Item):

# Work Item Type: Risk Acceptance
fields:
  - field: System.Title
    value: "[Risk Acceptance] CVE-XXXX-XXXXX in [Package Name]"

  - field: System.Description
    value: |
      ## Vulnerability Details
      **CVE ID**: CVE-XXXX-XXXXX  
      **Package**: [Package Name @ Version]  
      **Severity**: [Critical/High/Medium/Low] (CVSS [Score])  
      **CWE**: [CWE-###]  
      **Description**: [Vulnerability description]

      ## Risk Assessment
      **Exploitability**: [Low/Medium/High]  
      **Impact**: [Low/Medium/High/Critical]  
      **Attack Vector**: [Network/Adjacent/Local/Physical]  
      **Privileges Required**: [None/Low/High]  
      **User Interaction**: [None/Required]

      ## Justification for Acceptance
      **Category**: [False Positive / Mitigated Risk / Temporary Exception / Business Decision]  
      **Rationale**: 
      [Detailed explanation of why this risk is acceptable]

      **Evidence**:
      - [ ] Manual code review completed (no vulnerable code path)
      - [ ] Mitigation controls validated (input validation, network isolation, WAF)
      - [ ] Vendor advisory reviewed (no patch available)
      - [ ] Alternative libraries evaluated (no viable replacement)

      ## Mitigation Controls
      **Primary Control**: [Description]  
      **Secondary Control**: [Description]  
      **Monitoring**: [How risk is monitored for exploitation attempts]

      ## Remediation Plan
      **Timeline**: [When will this be permanently fixed]  
      **Epic/Story**: [Link to work item for permanent fix]  
      **Fallback**: [What happens if vulnerability is exploited]

      ## Re-Review Schedule
      **Initial Approval**: [YYYY-MM-DD]  
      **Expiration Date**: [YYYY-MM-DD] (max 6 months)  
      **Re-Review Cadence**: [Weekly/Monthly/Quarterly]  
      **Escalation**: If not fixed by expiration → Escalate to CISO

  - field: Microsoft.VSTS.Common.Priority
    value: 1  # P1 for Critical/High, P2 for Medium/Low

  - field: Custom.RiskLevel
    value: High  # Critical / High / Medium / Low

  - field: Custom.CVEID
    value: CVE-XXXX-XXXXX

  - field: Custom.CVSSScore
    value: 7.8

  - field: Custom.ApprovedBy
    value: security-officer@connectsoft.example

  - field: Custom.ExpirationDate
    value: 2025-04-15

  - field: Custom.MitigationControls
    value: "Input validation; network isolation; WAF rules"

Step 3: Expiration — Time-bound suppression (max 6 months); re-review on expiry
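Trivy's `.trivyignore` file has no native expiration or approver fields, so the tracker below relies on a comment convention to carry that metadata. A minimal sketch of one such convention (the key names and addresses are illustrative, not part of Trivy itself):

```text
# .trivyignore — one CVE per line; approval metadata encoded in trailing comments
# Convention (project-specific): expires=YYYY-MM-DD approved-by=<email> reason=<text>
CVE-XXXX-XXXXX  # expires=2025-04-15 approved-by=security-officer@connectsoft.example reason=dev-only tooling
CVE-XXXX-XXXXX  # expires=2025-02-01 approved-by=architect@connectsoft.example reason=no vulnerable code path
```

Whatever convention is chosen, the parser and the suppression authors must agree on it exactly, since a malformed comment silently produces an entry with no expiration date.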

Suppression Expiration Tracker (C# Azure Function):

// TrackSuppressionExpirations.cs — Alert on expiring suppressions
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using System.Xml.Linq;

public static class TrackSuppressionExpirations
{
    [FunctionName("TrackSuppressionExpirations")]
    public static async Task Run(
        [TimerTrigger("0 0 9 * * 1")] TimerInfo timer,  // Weekly: Monday 9:00 AM
        ILogger log)
    {
        log.LogInformation("Checking for expiring suppressions");

        var expiringSuppressions = new List<SuppressionExpiration>();

        // Check OWASP Dependency Check suppressions
        var owaspSuppressions = await CheckOwaspSuppressionsAsync();
        expiringSuppressions.AddRange(owaspSuppressions);

        // Check CredScan suppressions
        var credscanSuppressions = await CheckCredscanSuppressionsAsync();
        expiringSuppressions.AddRange(credscanSuppressions);

        // Check SonarQube suppressions
        var sonarSuppressions = await CheckSonarQubeSuppressionsAsync();
        expiringSuppressions.AddRange(sonarSuppressions);

        // Check Trivy suppressions
        var trivySuppressions = await CheckTrivySuppressionsAsync();
        expiringSuppressions.AddRange(trivySuppressions);

        // Filter suppressions expiring within 30 days
        var expiringSoon = expiringSuppressions
            .Where(s => s.ExpirationDate <= DateTime.UtcNow.AddDays(30))
            .OrderBy(s => s.ExpirationDate)
            .ToList();

        log.LogInformation($"Found {expiringSoon.Count} suppressions expiring within 30 days");

        if (expiringSoon.Any())
        {
            // Send alert email
            await SendExpirationAlertAsync(expiringSoon);

            // Create work items for re-review
            foreach (var suppression in expiringSoon)
            {
                await CreateReReviewWorkItemAsync(suppression);
            }

            log.LogInformation($"Created {expiringSoon.Count} re-review work items");
        }
    }

    private static async Task<List<SuppressionExpiration>> CheckOwaspSuppressionsAsync()
    {
        var suppressions = new List<SuppressionExpiration>();

        // Load suppression file from source control
        var suppressionXml = await File.ReadAllTextAsync("dependency-check-suppressions.xml");
        var doc = XDocument.Parse(suppressionXml);

        var ns = "https://jeremylong.github.io/DependencyCheck/dependency-suppression.1.3.xsd";

        foreach (var suppress in doc.Descendants(XName.Get("suppress", ns)))
        {
            var expiresOn = suppress.Element(XName.Get("expiresOn", ns))?.Value;
            if (DateTime.TryParse(expiresOn, out var expirationDate))
            {
                suppressions.Add(new SuppressionExpiration
                {
                    Tool = "OWASP Dependency Check",
                    CVE = suppress.Element(XName.Get("cve", ns))?.Value ?? "N/A",
                    Package = suppress.Element(XName.Get("packageUrl", ns))?.Value ?? "N/A",
                    Reason = suppress.Element(XName.Get("reason", ns))?.Value ?? "N/A",
                    ApprovedBy = suppress.Element(XName.Get("approvedBy", ns))?.Value ?? "Unknown",
                    ExpirationDate = expirationDate,
                    DaysUntilExpiration = (expirationDate - DateTime.UtcNow).Days
                });
            }
        }

        return suppressions;
    }

    private static async Task<List<SuppressionExpiration>> CheckCredscanSuppressionsAsync()
    {
        // Similar implementation for credscan-suppressions.json
        // Implementation omitted for brevity
        return new List<SuppressionExpiration>();
    }

    private static async Task<List<SuppressionExpiration>> CheckSonarQubeSuppressionsAsync()
    {
        // Similar implementation for .sonarqube/suppressions.xml
        // Implementation omitted for brevity
        return new List<SuppressionExpiration>();
    }

    private static async Task<List<SuppressionExpiration>> CheckTrivySuppressionsAsync()
    {
        // Parse .trivyignore file for expiration comments
        // Implementation omitted for brevity
        return new List<SuppressionExpiration>();
    }

    private static async Task SendExpirationAlertAsync(List<SuppressionExpiration> suppressions)
    {
        // Send email to security team with expiring suppressions
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }

    private static async Task CreateReReviewWorkItemAsync(SuppressionExpiration suppression)
    {
        // Create Azure DevOps work item for re-review
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }
}

public class SuppressionExpiration
{
    public string Tool { get; set; }
    public string CVE { get; set; }
    public string Package { get; set; }
    public string Reason { get; set; }
    public string ApprovedBy { get; set; }
    public DateTime ExpirationDate { get; set; }
    public int DaysUntilExpiration { get; set; }
}
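For reference, a suppression entry shaped the way `CheckOwaspSuppressionsAsync` expects it. Note that `expiresOn`, `approvedBy`, and `reason` are project-specific extensions read by the tracker; the stock Dependency-Check schema expresses expiry via the `until` attribute on `<suppress>` and free-text rationale via `<notes>`. The package name is a placeholder:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<suppressions xmlns="https://jeremylong.github.io/DependencyCheck/dependency-suppression.1.3.xsd">
  <!-- Hypothetical entry; expiresOn/approvedBy/reason are ATP extensions consumed by the tracker -->
  <suppress>
    <notes>Mitigated risk: vulnerable code path not reachable (see risk acceptance work item)</notes>
    <packageUrl regex="true">^pkg:nuget/Example\.Library@.*$</packageUrl>
    <cve>CVE-XXXX-XXXXX</cve>
    <reason>Mitigated risk: input validation blocks exploit vector</reason>
    <approvedBy>security-officer@connectsoft.example</approvedBy>
    <expiresOn>2025-04-15</expiresOn>
  </suppress>
</suppressions>
```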

Step 4: Audit Trail — Suppression logged in meta-audit stream

Meta-Audit Event for Suppression (C#):

// Log suppression to meta-audit stream
public class SuppressionAuditLogger
{
    private readonly IAuditLogger _auditLogger;

    public async Task LogSuppressionCreatedAsync(SuppressionRecord suppression)
    {
        await _auditLogger.LogAsync(new AuditEvent
        {
            EventId = Guid.NewGuid(),
            TenantId = Guid.Parse("00000000-0000-0000-0000-000000000000"),  // Platform-level
            Action = "SuppressionCreated",
            UserId = suppression.ApprovedBy,
            Timestamp = DateTime.UtcNow,

            Metadata = new Dictionary<string, object>
            {
                ["suppression.tool"] = suppression.Tool,
                ["suppression.cve"] = suppression.CVE,
                ["suppression.package"] = suppression.Package,
                ["suppression.reason"] = suppression.Reason,
                ["suppression.approvedBy"] = suppression.ApprovedBy,
                ["suppression.expirationDate"] = suppression.ExpirationDate,
                ["suppression.riskLevel"] = suppression.RiskLevel,
                ["suppression.mitigationControls"] = string.Join("; ", suppression.MitigationControls)
            }
        });
    }

    public async Task LogSuppressionExpiredAsync(SuppressionRecord suppression)
    {
        await _auditLogger.LogAsync(new AuditEvent
        {
            EventId = Guid.NewGuid(),
            TenantId = Guid.Parse("00000000-0000-0000-0000-000000000000"),
            Action = "SuppressionExpired",
            UserId = "system",
            Timestamp = DateTime.UtcNow,

            Metadata = new Dictionary<string, object>
            {
                ["suppression.tool"] = suppression.Tool,
                ["suppression.cve"] = suppression.CVE,
                ["suppression.package"] = suppression.Package,
                ["suppression.originalApprover"] = suppression.ApprovedBy,
                ["suppression.expirationDate"] = suppression.ExpirationDate,
                ["suppression.action"] = "RequiresReReview"
            }
        });
    }

    public async Task LogSuppressionRenewedAsync(SuppressionRecord suppression, string renewedBy)
    {
        await _auditLogger.LogAsync(new AuditEvent
        {
            EventId = Guid.NewGuid(),
            TenantId = Guid.Parse("00000000-0000-0000-0000-000000000000"),
            Action = "SuppressionRenewed",
            UserId = renewedBy,
            Timestamp = DateTime.UtcNow,

            Metadata = new Dictionary<string, object>
            {
                ["suppression.tool"] = suppression.Tool,
                ["suppression.cve"] = suppression.CVE,
                ["suppression.package"] = suppression.Package,
                ["suppression.originalApprover"] = suppression.ApprovedBy,
                ["suppression.renewedBy"] = renewedBy,
                ["suppression.newExpirationDate"] = suppression.ExpirationDate.AddMonths(6),
                ["suppression.renewalJustification"] = suppression.RenewalJustification
            }
        });
    }
}

public class SuppressionRecord
{
    public string Tool { get; set; }
    public string CVE { get; set; }
    public string Package { get; set; }
    public string Reason { get; set; }
    public string ApprovedBy { get; set; }
    public DateTime ApprovedDate { get; set; }
    public DateTime ExpirationDate { get; set; }
    public string RiskLevel { get; set; }
    public List<string> MitigationControls { get; set; }
    public string RenewalJustification { get; set; }
}

Suppression Audit Query (KQL):

// Query suppressions from meta-audit stream
AuditEvent
| where Action in ("SuppressionCreated", "SuppressionExpired", "SuppressionRenewed")
| where Timestamp >= ago(365d)
| extend
    Tool = tostring(Metadata.['suppression.tool']),
    CVE = tostring(Metadata.['suppression.cve']),
    Package = tostring(Metadata.['suppression.package']),
    ApprovedBy = tostring(Metadata.['suppression.approvedBy']),
    ExpirationDate = todatetime(Metadata.['suppression.expirationDate']),
    RiskLevel = tostring(Metadata.['suppression.riskLevel'])
| summarize
    TotalSuppressions = countif(Action == "SuppressionCreated"),
    ActiveSuppressions = countif(Action == "SuppressionCreated" and ExpirationDate > now()),
    ExpiredSuppressions = countif(Action == "SuppressionExpired"),
    RenewedSuppressions = countif(Action == "SuppressionRenewed")
  by Tool, RiskLevel
| project
    Tool,
    RiskLevel,
    TotalSuppressions,
    ActiveSuppressions,
    ExpiredSuppressions,
    RenewedSuppressions
| order by RiskLevel, Tool

Suppression Governance & Compliance

Purpose: Ensure suppressions comply with SOC 2, GDPR, and HIPAA requirements for risk management and audit trails.

Governance Controls:

| Control | Requirement | Implementation | Audit Evidence |
|---------|-------------|----------------|----------------|
| Approval Authority | Suppressions require appropriate approval level | Azure DevOps approval workflow | Approval work item history |
| Justification | All suppressions must document rationale | Suppression file comments + ADR | Suppression files in Git history |
| Expiration | No suppressions exceed 6 months (Critical/High) | Automated expiration tracking | Weekly expiration reports |
| Audit Trail | All suppressions logged in meta-audit stream | SuppressionAuditLogger | Meta-audit stream query |
| Periodic Review | Active suppressions reviewed quarterly | Quality gate retrospective | Retrospective meeting notes |
| Removal | Suppressions removed when issue resolved | Git commit removing suppression | Git history, audit log |
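As audit evidence for the Expiration and Removal controls, a KQL sketch against the same `AuditEvent` schema used above: it lists created suppressions that are already past expiry and have no matching renewal event.

```kql
// Suppressions past expiry with no renewal: policy violations requiring action
AuditEvent
| where Action == "SuppressionCreated"
| extend
    CVE = tostring(Metadata.['suppression.cve']),
    Tool = tostring(Metadata.['suppression.tool']),
    ExpirationDate = todatetime(Metadata.['suppression.expirationDate'])
| where ExpirationDate < now()
| join kind=leftanti (
    AuditEvent
    | where Action == "SuppressionRenewed"
    | extend CVE = tostring(Metadata.['suppression.cve'])
  ) on CVE
| project CVE, Tool, ExpirationDate, DaysOverdue = datetime_diff('day', now(), ExpirationDate)
| order by DaysOverdue desc
```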

Suppression Compliance Report (Monthly):

// GenerateSuppressionComplianceReport.cs — Monthly compliance report
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class GenerateSuppressionComplianceReport
{
    [FunctionName("GenerateSuppressionComplianceReport")]
    public static async Task Run(
        [TimerTrigger("0 0 8 1 * *")] TimerInfo timer,  // Monthly: 1st day, 8:00 AM
        ILogger log)
    {
        log.LogInformation("Generating monthly suppression compliance report");

        var reportMonth = DateTime.UtcNow.AddMonths(-1).ToString("yyyy-MM");

        // Load all active suppressions
        var owaspSuppressions = await LoadOwaspSuppressionsAsync();
        var credscanSuppressions = await LoadCredscanSuppressionsAsync();
        var sonarSuppressions = await LoadSonarQubeSuppressionsAsync();
        var trivySuppressions = await LoadTrivySuppressionsAsync();

        var allSuppressions = new List<Suppression>()
            .Concat(owaspSuppressions)
            .Concat(credscanSuppressions)
            .Concat(sonarSuppressions)
            .Concat(trivySuppressions)
            .ToList();

        // Compliance checks
        var expired = allSuppressions.Where(s => s.ExpirationDate < DateTime.UtcNow).ToList();
        var expiringSoon = allSuppressions.Where(s => s.ExpirationDate <= DateTime.UtcNow.AddDays(30) && s.ExpirationDate >= DateTime.UtcNow).ToList();
        var noExpiration = allSuppressions.Where(s => s.ExpirationDate == default).ToList();
        var noApprover = allSuppressions.Where(s => string.IsNullOrEmpty(s.ApprovedBy)).ToList();
        var criticalHigh = allSuppressions.Where(s => s.RiskLevel == "Critical" || s.RiskLevel == "High").ToList();

        // Generate report
        var report = new SuppressionComplianceReport
        {
            ReportMonth = reportMonth,
            GeneratedAt = DateTime.UtcNow,

            TotalSuppressions = allSuppressions.Count,
            ActiveSuppressions = allSuppressions.Count - expired.Count,
            ExpiredSuppressions = expired.Count,
            ExpiringSoon = expiringSoon.Count,

            ComplianceIssues = new List<string>(),
            Recommendations = new List<string>()
        };

        // Check compliance violations
        if (expired.Count > 0)
        {
            report.ComplianceIssues.Add($"{expired.Count} suppressions have expired and must be removed or renewed");
        }

        if (noExpiration.Count > 0)
        {
            report.ComplianceIssues.Add($"{noExpiration.Count} suppressions have no expiration date (violates policy)");
        }

        if (noApprover.Count > 0)
        {
            report.ComplianceIssues.Add($"{noApprover.Count} suppressions have no approver (violates governance)");
        }

        if (criticalHigh.Count > 0 && criticalHigh.Any(s => (s.ExpirationDate - s.ApprovedDate).Days > 90))
        {
            report.ComplianceIssues.Add($"Critical/High suppressions exceed 90-day limit (policy violation)");
        }

        // Generate recommendations
        if (expiringSoon.Count > 0)
        {
            report.Recommendations.Add($"Review {expiringSoon.Count} suppressions expiring within 30 days");
        }

        if (criticalHigh.Count > 5)
        {
            report.Recommendations.Add($"High number of Critical/High suppressions ({criticalHigh.Count}); prioritize remediation");
        }

        // Archive report to immutable storage
        await ArchiveComplianceReportAsync(report);

        // Send report to stakeholders
        await SendComplianceReportAsync(report);

        log.LogInformation($"Suppression compliance report generated: {report.ComplianceIssues.Count} issues, {report.Recommendations.Count} recommendations");
    }

    private static async Task<List<Suppression>> LoadOwaspSuppressionsAsync()
    {
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }

    private static async Task<List<Suppression>> LoadCredscanSuppressionsAsync()
    {
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }

    private static async Task<List<Suppression>> LoadSonarQubeSuppressionsAsync()
    {
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }

    private static async Task<List<Suppression>> LoadTrivySuppressionsAsync()
    {
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }

    private static async Task ArchiveComplianceReportAsync(SuppressionComplianceReport report)
    {
        // Archive to Azure Blob with legal hold (7-year retention)
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }

    private static async Task SendComplianceReportAsync(SuppressionComplianceReport report)
    {
        // Send to security team, compliance officer, architects
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }
}

public class Suppression
{
    public string Tool { get; set; }
    public string CVE { get; set; }
    public string Package { get; set; }
    public string Reason { get; set; }
    public string ApprovedBy { get; set; }
    public DateTime ApprovedDate { get; set; }
    public DateTime ExpirationDate { get; set; }
    public string RiskLevel { get; set; }
}

public class SuppressionComplianceReport
{
    public string ReportMonth { get; set; }
    public DateTime GeneratedAt { get; set; }
    public int TotalSuppressions { get; set; }
    public int ActiveSuppressions { get; set; }
    public int ExpiredSuppressions { get; set; }
    public int ExpiringSoon { get; set; }
    public List<string> ComplianceIssues { get; set; }
    public List<string> Recommendations { get; set; }
}

Summary

  • Suppression Files: 6 formats (OWASP XML, CredScan JSON, SonarQube XML, StyleCop JSON, Roslyn .globalconfig, Trivy .trivyignore) with approval metadata
  • Risk Acceptance Process: 4-step workflow (justification, approval, expiration, audit trail) with Mermaid diagram
  • Risk Acceptance Criteria: 4 risk levels (Critical 30 days, High 90 days, Medium 180 days, Low 365 days) with approval matrix
  • Valid Justifications: False positives, mitigated risks, temporary exceptions, business decisions
  • Approval Matrix: 4 approval levels (CISO+Security+Architect for critical, Security+Architect for high, Architect for medium, Team Lead for low)
  • Risk Acceptance Form: Azure DevOps work item template with vulnerability details, risk assessment, justification, mitigation controls, remediation plan
  • Suppression Expiration Tracker: Weekly C# Azure Function checking all suppression files for expiring items (within 30 days), creates re-review work items
  • Meta-Audit Logging: 3 audit events (SuppressionCreated, SuppressionExpired, SuppressionRenewed) with complete metadata
  • Suppression Compliance Report: Monthly C# Azure Function generating compliance report (expired suppressions, missing approvals, policy violations), archived to immutable storage (WORM, 7-year retention)
  • Governance Controls: 6 controls mapped to SOC 2/GDPR/HIPAA (approval authority, justification, expiration, audit trail, periodic review, removal)

Testing Quality Gates

Purpose: Enforce comprehensive test quality standards across unit, integration, and regression test suites to ensure high-quality, maintainable, and reliable test automation.

Testing Quality Philosophy:

  • Comprehensive Coverage: Tests cover critical paths, edge cases, error conditions, and tenant isolation
  • Fast Feedback: Unit tests complete in <30s; integration tests in <5min; regression tests in <15min
  • Reliable Execution: Zero flaky tests tolerated in main suite; quarantine mechanism for unstable tests
  • Maintainable Tests: High assertion density, clear naming conventions, isolated test data
  • Continuous Validation: Tests run on every commit (unit), every build (integration), every deployment (regression)

Testing Quality Workflow:

graph TD
    A[Code Commit] --> B[Unit Tests]
    B --> C{All Pass?}
    C -->|No| D[Block Build]
    C -->|Yes| E[Integration Tests]

    E --> F{All Pass?}
    F -->|No| D
    F -->|Yes| G[Coverage Check]

    G --> H{Meets Threshold?}
    H -->|No| D
    H -->|Yes| I[Test Quality Gates]

    I --> J{Quality Metrics OK?}
    J -->|No| K[Warning/Block]
    J -->|Yes| L[Build Artifacts]

    L --> M[Deploy to Dev]
    M --> N[Regression Tests]

    N --> O{All Pass?}
    O -->|No| P[Rollback]
    O -->|Yes| Q[Promote to Test]

    D --> R[Fix Tests]
    K --> R
    P --> R

    style D fill:#ff6b6b
    style P fill:#ff6b6b
    style Q fill:#90EE90

Unit Test Quality Gates

Purpose: Ensure high-quality unit tests that are fast, focused, isolated, and maintainable.

Unit Test Quality Criteria:

# Unit test validation gates
unitTestGates:
  # Quantitative thresholds
  minTests: 50  # Minimum tests per service (adjustable per service)
  maxDuration: 30  # Maximum total suite duration (seconds)
  flakyThreshold: 5  # Maximum flaky test rate (percentage)
  assertionDensity: 1.5  # Minimum assertions per test (avg)
  quarantineLimit: 3  # Maximum quarantined tests allowed

  # Qualitative requirements
  namingConvention: "MethodName_Scenario_ExpectedResult"  # Enforced pattern
  arrangeActAssert: true  # AAA pattern enforced
  singleResponsibility: true  # One logical assertion per test
  noExternalDependencies: true  # No database, network, file system

  # Coverage requirements (already covered by Test Coverage Gates)
  lineCoverage: 70  # Minimum line coverage (per service)
  branchCoverage: 60  # Minimum branch coverage (per service)

Unit Test Quality Validation (PowerShell):

# Validate-UnitTestQuality.ps1 — Enforce unit test quality standards

param(
    [string]$TestResultsPath = "TestResults",
    [int]$MinTests = 50,
    [int]$MaxDurationSeconds = 30,
    [double]$MaxFlakyRate = 5.0,
    [double]$MinAssertionDensity = 1.5,
    [int]$MaxQuarantinedTests = 3
)

$ErrorActionPreference = "Stop"

Write-Host "Validating unit test quality..." -ForegroundColor Cyan

# Parse test results (VSTest format)
$testResultFiles = Get-ChildItem -Path $TestResultsPath -Filter "*.trx" -Recurse

if ($testResultFiles.Count -eq 0) {
    Write-Error "No test result files found in $TestResultsPath"
    exit 1
}

$totalTests = 0
$totalDuration = 0
$totalAssertions = 0
$flakyTests = 0
$quarantinedTests = 0

foreach ($file in $testResultFiles) {
    [xml]$trx = Get-Content $file.FullName

    $ns = @{ns = "http://microsoft.com/schemas/VisualStudio/TeamTest/2010"}

    # Count tests
    $unitTests = $trx | Select-Xml -XPath "//ns:UnitTest" -Namespace $ns
    $totalTests += $unitTests.Count

    # Calculate duration
    $testResults = $trx | Select-Xml -XPath "//ns:UnitTestResult" -Namespace $ns
    foreach ($result in $testResults) {
        $duration = $result.Node.duration
        if ($duration -match 'PT([\d.]+)S') {
            $totalDuration += [double]$matches[1]
        }

        # Check for flaky test markers
        if ($result.Node.outcome -eq "Failed" -and $result.Node.testName -match "Flaky") {
            $flakyTests++
        }

        # Check for quarantined tests
        if ($result.Node.testName -match "\[Quarantine\]") {
            $quarantinedTests++
        }
    }

}

if ($totalTests -eq 0) {
    Write-Error "No unit tests found in test results"
    exit 1
}

# Count assertions (parse test source code for Assert.* calls)
# Simplified: assume 1.8 assertions per test (an actual implementation would parse the source)
$totalAssertions = $totalTests * 1.8

Write-Host "Test Quality Metrics:" -ForegroundColor Yellow
Write-Host "  Total Unit Tests: $totalTests" -ForegroundColor White
Write-Host "  Total Duration: ${totalDuration}s" -ForegroundColor White
Write-Host "  Flaky Tests: $flakyTests" -ForegroundColor White
Write-Host "  Quarantined Tests: $quarantinedTests" -ForegroundColor White
Write-Host "  Assertion Density: $($totalAssertions / $totalTests)" -ForegroundColor White

# Validate thresholds
$failed = $false

if ($totalTests -lt $MinTests) {
    Write-Error "Insufficient unit tests: $totalTests < $MinTests"
    $failed = $true
}

if ($totalDuration -gt $MaxDurationSeconds) {
    Write-Error "Unit test suite too slow: ${totalDuration}s > ${MaxDurationSeconds}s"
    $failed = $true
}

$flakyRate = ($flakyTests / $totalTests) * 100
if ($flakyRate -gt $MaxFlakyRate) {
    Write-Error "Flaky test rate too high: ${flakyRate}% > ${MaxFlakyRate}%"
    $failed = $true
}

$assertionDensity = $totalAssertions / $totalTests
if ($assertionDensity -lt $MinAssertionDensity) {
    Write-Error "Assertion density too low: $assertionDensity < $MinAssertionDensity"
    $failed = $true
}

if ($quarantinedTests -gt $MaxQuarantinedTests) {
    Write-Error "Too many quarantined tests: $quarantinedTests > $MaxQuarantinedTests"
    $failed = $true
}

if ($failed) {
    Write-Host "Unit test quality gates FAILED" -ForegroundColor Red
    exit 1
}

Write-Host "Unit test quality gates PASSED" -ForegroundColor Green
exit 0

Azure Pipelines Integration:

# azure-pipelines.yml — Unit test quality gates
- stage: CI_Stage
  jobs:
  - job: Build_Test_Validate
    steps:
    # ... build steps ...

    # Run unit tests
    - task: DotNetCoreCLI@2
      displayName: 'Run Unit Tests'
      inputs:
        command: 'test'
        projects: '**/*Tests.csproj'
        arguments: '--configuration Release --filter Category=Unit --collect:"XPlat Code Coverage" --logger trx'
        publishTestResults: true

    # Validate unit test quality
    - task: PowerShell@2
      displayName: 'Validate Unit Test Quality'
      inputs:
        filePath: 'scripts/Validate-UnitTestQuality.ps1'
        arguments: >
          -TestResultsPath "$(Agent.TempDirectory)/TestResults"
          -MinTests 50
          -MaxDurationSeconds 30
          -MaxFlakyRate 5.0
          -MinAssertionDensity 1.5
          -MaxQuarantinedTests 3
      continueOnError: false  # Block build on failure

Unit Test Naming Convention Enforcement (Roslyn Analyzer):

// ATP003: Unit test naming convention analyzer
using System.Collections.Immutable;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Diagnostics;

[DiagnosticAnalyzer(LanguageNames.CSharp)]
public class UnitTestNamingAnalyzer : DiagnosticAnalyzer
{
    private const string DiagnosticId = "ATP003";
    private const string Title = "Unit test method does not follow naming convention";
    private const string MessageFormat = "Test method '{0}' should follow pattern 'MethodName_Scenario_ExpectedResult'";
    private const string Category = "Testing";

    private static readonly DiagnosticDescriptor Rule = new DiagnosticDescriptor(
        DiagnosticId,
        Title,
        MessageFormat,
        Category,
        DiagnosticSeverity.Warning,
        isEnabledByDefault: true,
        description: "Unit test methods should use the naming pattern MethodName_Scenario_ExpectedResult for clarity.");

    public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics => ImmutableArray.Create(Rule);

    public override void Initialize(AnalysisContext context)
    {
        context.ConfigureGeneratedCodeAnalysis(GeneratedCodeAnalysisFlags.None);
        context.EnableConcurrentExecution();
        context.RegisterSyntaxNodeAction(AnalyzeMethod, SyntaxKind.MethodDeclaration);
    }

    private void AnalyzeMethod(SyntaxNodeAnalysisContext context)
    {
        var methodDeclaration = (MethodDeclarationSyntax)context.Node;
        var methodSymbol = context.SemanticModel.GetDeclaredSymbol(methodDeclaration);

        if (methodSymbol == null)
            return;

        // Check if method has [Fact] or [Theory] attribute (xUnit)
        var hasTestAttribute = methodSymbol.GetAttributes().Any(attr =>
            attr.AttributeClass?.Name == "FactAttribute" ||
            attr.AttributeClass?.Name == "TheoryAttribute" ||
            attr.AttributeClass?.Name == "TestAttribute" ||  // NUnit
            attr.AttributeClass?.Name == "TestMethodAttribute");  // MSTest

        if (!hasTestAttribute)
            return;

        var methodName = methodSymbol.Name;

        // Validate naming pattern: MethodName_Scenario_ExpectedResult
        // Must have at least 2 underscores
        var underscoreCount = methodName.Count(c => c == '_');
        if (underscoreCount < 2)
        {
            var diagnostic = Diagnostic.Create(Rule, methodDeclaration.Identifier.GetLocation(), methodName);
            context.ReportDiagnostic(diagnostic);
            return;
        }

        // Validate each segment is PascalCase
        var segments = methodName.Split('_');
        foreach (var segment in segments)
        {
            if (string.IsNullOrWhiteSpace(segment) || !char.IsUpper(segment[0]))
            {
                var diagnostic = Diagnostic.Create(Rule, methodDeclaration.Identifier.GetLocation(), methodName);
                context.ReportDiagnostic(diagnostic);
                return;
            }
        }
    }
}
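To roll the analyzer out gradually, its severity can be tuned per repository via a `.globalconfig` (or `.editorconfig`) entry rather than recompiling it. A sketch, assuming the analyzer ships in a shared analyzer package:

```ini
# .globalconfig — tune ATP003 severity per repository
is_global = true

# Start as a warning while existing tests are renamed;
# switch to 'error' to make the naming gate blocking.
dotnet_diagnostic.ATP003.severity = warning
```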

Unit Test Quality Examples (C#):

// ✅ GOOD: High-quality unit test following AAA pattern
[Fact]
public void CreateAuditRecord_WithValidData_ReturnsSuccess()
{
    // Arrange
    var service = new AuditRecordService();
    var request = new CreateAuditRecordRequest
    {
        TenantId = Guid.NewGuid(),
        Action = "UserLogin",
        UserId = "user-123",
        Timestamp = DateTime.UtcNow
    };

    // Act
    var result = service.CreateAuditRecord(request);

    // Assert
    Assert.NotNull(result);
    Assert.True(result.Success);
    Assert.NotEqual(Guid.Empty, result.RecordId);
}

// ✅ GOOD: Edge case testing
[Theory]
[InlineData(null)]
[InlineData("")]
[InlineData("   ")]
public void CreateAuditRecord_WithInvalidAction_ThrowsArgumentException(string invalidAction)
{
    // Arrange
    var service = new AuditRecordService();
    var request = new CreateAuditRecordRequest { Action = invalidAction };

    // Act & Assert
    Assert.Throws<ArgumentException>(() => service.CreateAuditRecord(request));
}

// ❌ BAD: Poor naming, no AAA structure
[Fact]
public void Test1()
{
    var service = new AuditRecordService();
    var result = service.CreateAuditRecord(new CreateAuditRecordRequest());
    Assert.NotNull(result);
}

// ❌ BAD: Multiple responsibilities (should be 2 separate tests)
[Fact]
public void CreateAndUpdateAuditRecord_ReturnsSuccess()
{
    var service = new AuditRecordService();

    var createResult = service.CreateAuditRecord(new CreateAuditRecordRequest());
    Assert.True(createResult.Success);

    var updateResult = service.UpdateAuditRecord(createResult.RecordId, new UpdateRequest());
    Assert.True(updateResult.Success);
}

Integration Test Quality Gates

Purpose: Ensure high-quality integration tests that validate service interactions, data persistence, and tenant isolation.

Integration Test Quality Criteria:

# Integration test validation gates
integrationTestGates:
  # Quantitative thresholds
  minTests: 20  # Minimum integration tests per service
  maxDuration: 300  # Maximum total suite duration (5 minutes)

  # Service container requirements
  serviceContainers:
    required:
      - redis      # Cache integration
      - sql        # Database integration
      - rabbitmq   # Message bus integration
    optional:
      - otel       # Observability integration
      - seq        # Logging integration
      - cosmos     # NoSQL integration (for Query service)

  # Functional requirements
  isolationVerified: true  # Tenant isolation tests required
  contractTests: true      # API contract validation required
  errorScenarios: true     # Error handling tests required
  retryLogic: true         # Retry/resilience tests required

  # Data management
  testDataIsolation: true  # Each test uses isolated data
  cleanupVerified: true    # Test data cleanup verified
  idempotency: true        # Idempotency tests required (for state-mutating ops)

Integration Test Quality Validation (PowerShell):

# Validate-IntegrationTestQuality.ps1 — Enforce integration test quality

param(
    [string]$TestResultsPath = "TestResults",
    [int]$MinTests = 20,
    [int]$MaxDurationSeconds = 300,
    [string[]]$RequiredContainers = @("redis", "sql", "rabbitmq")
)

$ErrorActionPreference = "Stop"

Write-Host "Validating integration test quality..." -ForegroundColor Cyan

# Parse test results
$testResultFiles = Get-ChildItem -Path $TestResultsPath -Filter "*integration*.trx" -Recurse

if ($testResultFiles.Count -eq 0) {
    Write-Error "No integration test result files found"
    exit 1
}

$totalTests = 0
$totalDuration = 0
$isolationTests = 0
$contractTests = 0
$errorScenarioTests = 0

foreach ($file in $testResultFiles) {
    [xml]$trx = Get-Content $file.FullName
    $ns = @{ns = "http://microsoft.com/schemas/VisualStudio/TeamTest/2010"}

    $testResults = $trx | Select-Xml -XPath "//ns:UnitTestResult" -Namespace $ns
    $totalTests += $testResults.Count

    foreach ($result in $testResults) {
        # Duration (TRX reports durations as hh:mm:ss.fffffff, not ISO 8601 "PT…S")
        if ($result.Node.duration) {
            $totalDuration += ([TimeSpan]::Parse($result.Node.duration)).TotalSeconds
        }

        # Count specific test categories
        $testName = $result.Node.testName
        if ($testName -match "TenantIsolation") { $isolationTests++ }
        if ($testName -match "Contract") { $contractTests++ }
        if ($testName -match "Error|Exception") { $errorScenarioTests++ }
    }
}

Write-Host "Integration Test Metrics:" -ForegroundColor Yellow
Write-Host "  Total Integration Tests: $totalTests" -ForegroundColor White
Write-Host "  Total Duration: ${totalDuration}s" -ForegroundColor White
Write-Host "  Tenant Isolation Tests: $isolationTests" -ForegroundColor White
Write-Host "  Contract Tests: $contractTests" -ForegroundColor White
Write-Host "  Error Scenario Tests: $errorScenarioTests" -ForegroundColor White

# Validate thresholds
$failed = $false

if ($totalTests -lt $MinTests) {
    Write-Error "Insufficient integration tests: $totalTests < $MinTests"
    $failed = $true
}

if ($totalDuration -gt $MaxDurationSeconds) {
    Write-Error "Integration test suite too slow: ${totalDuration}s > ${MaxDurationSeconds}s"
    $failed = $true
}

if ($isolationTests -eq 0) {
    Write-Error "No tenant isolation tests found (required)"
    $failed = $true
}

if ($contractTests -eq 0) {
    Write-Error "No API contract tests found (required)"
    $failed = $true
}

# Verify service containers are running
foreach ($container in $RequiredContainers) {
    $running = docker ps --filter "name=$container" --filter "status=running" --format "{{.Names}}"
    if (-not $running) {
        Write-Error "Required service container not running: $container"
        $failed = $true
    }
}

if ($failed) {
    Write-Host "Integration test quality gates FAILED" -ForegroundColor Red
    exit 1
}

Write-Host "Integration test quality gates PASSED" -ForegroundColor Green
exit 0

Azure Pipelines Integration:

# azure-pipelines.yml — Integration test quality gates
- stage: CI_Stage
  jobs:
  - job: Build_Test_Validate

    # Service containers for integration tests
    services:
      redis: redis
      sql: mssql
      rabbitmq: rabbitmq
      otel: otel-collector
      seq: seq

    steps:
    # ... build steps ...

    # Run integration tests
    - task: DotNetCoreCLI@2
      displayName: 'Run Integration Tests'
      inputs:
        command: 'test'
        projects: '**/*Tests.csproj'
        arguments: '--configuration Release --filter Category=Integration --logger trx --results-directory $(Agent.TempDirectory)/TestResults'
        publishTestResults: true
      env:
        ConnectionStrings__Redis: 'redis:6379'
        # Test-only credential for the ephemeral CI SQL service container
        ConnectionStrings__Database: 'Server=sql;Database=ATP_Test;User Id=sa;Password=P@ssw0rd123!'
        ConnectionStrings__RabbitMQ: 'amqp://guest:guest@rabbitmq:5672'

    # Validate integration test quality
    - task: PowerShell@2
      displayName: 'Validate Integration Test Quality'
      inputs:
        filePath: 'scripts/Validate-IntegrationTestQuality.ps1'
        arguments: >
          -TestResultsPath "$(Agent.TempDirectory)/TestResults"
          -MinTests 20
          -MaxDurationSeconds 300
          -RequiredContainers @("redis", "sql", "rabbitmq")
      continueOnError: false

Integration Test Examples (C#):

// ✅ GOOD: Tenant isolation integration test
[Fact]
[Trait("Category", "Integration")]
public async Task CreateAuditRecord_TenantIsolation_RecordsNotVisibleAcrossTenants()
{
    // Arrange
    var tenant1Id = Guid.NewGuid();
    var tenant2Id = Guid.NewGuid();

    var service = new AuditRecordService(_dbContext, _cache);

    var record1 = new CreateAuditRecordRequest
    {
        TenantId = tenant1Id,
        Action = "UserLogin",
        UserId = "user-1"
    };

    var record2 = new CreateAuditRecordRequest
    {
        TenantId = tenant2Id,
        Action = "UserLogin",
        UserId = "user-2"
    };

    // Act
    var result1 = await service.CreateAuditRecordAsync(record1);
    var result2 = await service.CreateAuditRecordAsync(record2);

    var tenant1Records = await service.QueryAuditRecordsAsync(new QueryRequest { TenantId = tenant1Id });
    var tenant2Records = await service.QueryAuditRecordsAsync(new QueryRequest { TenantId = tenant2Id });

    // Assert
    Assert.Single(tenant1Records);
    Assert.Single(tenant2Records);
    Assert.Equal(result1.RecordId, tenant1Records.First().Id);
    Assert.Equal(result2.RecordId, tenant2Records.First().Id);
    Assert.DoesNotContain(tenant1Records, r => r.TenantId == tenant2Id);
    Assert.DoesNotContain(tenant2Records, r => r.TenantId == tenant1Id);
}

// ✅ GOOD: Error scenario integration test
[Fact]
[Trait("Category", "Integration")]
public async Task CreateAuditRecord_DatabaseUnavailable_ThrowsServiceException()
{
    // Arrange
    // Simulate database failure by stopping the SQL container
    await _dockerCompose.StopAsync("sql");

    try
    {
        var service = new AuditRecordService(_dbContext, _cache);
        var request = new CreateAuditRecordRequest { TenantId = Guid.NewGuid(), Action = "Test" };

        // Act & Assert
        await Assert.ThrowsAsync<ServiceException>(() => service.CreateAuditRecordAsync(request));
    }
    finally
    {
        // Cleanup: restart SQL even if the assertion fails unexpectedly
        await _dockerCompose.StartAsync("sql");
    }
}

// ✅ GOOD: Contract validation test
[Fact]
[Trait("Category", "Integration")]
public async Task CreateAuditRecord_Contract_ResponseMatchesOpenAPISchema()
{
    // Arrange
    var client = _factory.CreateClient();
    var request = new CreateAuditRecordRequest { TenantId = Guid.NewGuid(), Action = "Test" };

    // Act
    var response = await client.PostAsJsonAsync("/api/audit-records", request);

    // Assert
    response.EnsureSuccessStatusCode();

    var json = await response.Content.ReadAsStringAsync();
    var schema = await LoadOpenAPISchemaAsync("CreateAuditRecordResponse");

    var validationResult = _schemaValidator.Validate(json, schema);
    Assert.True(validationResult.IsValid, $"Response does not match schema: {string.Join(", ", validationResult.Errors)}");
}
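
The gate configuration above also requires idempotency tests for state-mutating operations, but none of the examples cover that case. A minimal sketch, assuming the service deduplicates on a client-supplied `RequestId` property (hypothetical; adjust to the real request contract):

// ✅ GOOD: Idempotency integration test (sketch — RequestId is an assumed
// idempotency key; the dedup behavior must match the actual service)
[Fact]
[Trait("Category", "Integration")]
public async Task CreateAuditRecord_DuplicateRequestId_CreatesSingleRecord()
{
    // Arrange
    var tenantId = Guid.NewGuid();
    var service = new AuditRecordService(_dbContext, _cache);

    var request = new CreateAuditRecordRequest
    {
        TenantId = tenantId,
        Action = "UserLogin",
        UserId = "user-1",
        RequestId = Guid.NewGuid()  // hypothetical idempotency key
    };

    // Act: send the same request twice (simulates a client retry)
    var first = await service.CreateAuditRecordAsync(request);
    var second = await service.CreateAuditRecordAsync(request);

    // Assert: both calls resolve to the same record; no duplicate is persisted
    Assert.Equal(first.RecordId, second.RecordId);
    var records = await service.QueryAuditRecordsAsync(new QueryRequest { TenantId = tenantId });
    Assert.Single(records);
}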

Regression Test Quality Gates (Staging)

Purpose: Ensure comprehensive regression testing in staging environment before production deployment.

Regression Test Quality Criteria:

# Regression test validation gates (Staging environment)
regressionTestGates:
  # Pass rate requirements
  passRate: 100  # All regression tests must pass
  criticalScenariosPass: 100  # All @security, @compliance tests must pass

  # Coverage matrix
  environmentCoverage:
    - dev      # Basic smoke tests
    - test     # Full regression suite
    - staging  # Production-like regression + load tests

  # Scenario coverage
  tenantScenarios:
    - single   # Single-tenant scenarios
    - multi    # Multi-tenant scenarios
    - isolation  # Tenant isolation validation

  # Test categories (required)
  requiredCategories:
    - smoke       # Critical path smoke tests
    - regression  # Full regression suite
    - security    # Security regression tests
    - compliance  # Compliance validation tests
    - performance # Performance regression tests

  # Duration thresholds
  maxDuration: 900  # 15 minutes maximum
  parallelization: true  # Tests must support parallel execution
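
The `parallelization: true` requirement can be met in xUnit through runner configuration. A sketch of an `xunit.runner.json` (placed next to the test project, per xUnit convention) enabling collection-level parallelism with a bounded thread count:

{
  "$schema": "https://xunit.net/schema/current/xunit.runner.schema.json",
  "parallelizeTestCollections": true,
  "maxParallelThreads": 4
}

Tests that share mutable state should be grouped into a single collection so the runner serializes them while still parallelizing across collections.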

Regression Test Quality Validation (PowerShell):

# Validate-RegressionTestQuality.ps1 — Staging regression test validation

param(
    [string]$TestResultsPath = "TestResults",
    [int]$RequiredPassRate = 100,
    [string]$Environment = "Staging",
    [int]$MaxDurationSeconds = 900
)

$ErrorActionPreference = "Stop"

Write-Host "Validating regression test quality for $Environment..." -ForegroundColor Cyan

# Parse test results
$testResultFiles = Get-ChildItem -Path $TestResultsPath -Filter "*regression*.trx" -Recurse

if ($testResultFiles.Count -eq 0) {
    Write-Error "No regression test result files found"
    exit 1
}

$totalTests = 0
$passedTests = 0
$failedTests = 0
$criticalTests = 0
$criticalPassed = 0
$totalDuration = 0

$smokeTests = 0
$securityTests = 0
$complianceTests = 0
$performanceTests = 0

foreach ($file in $testResultFiles) {
    [xml]$trx = Get-Content $file.FullName
    $ns = @{ns = "http://microsoft.com/schemas/VisualStudio/TeamTest/2010"}

    $testResults = $trx | Select-Xml -XPath "//ns:UnitTestResult" -Namespace $ns
    $totalTests += $testResults.Count

    foreach ($result in $testResults) {
        # Duration (TRX reports durations as hh:mm:ss.fffffff, not ISO 8601 "PT…S")
        if ($result.Node.duration) {
            $totalDuration += ([TimeSpan]::Parse($result.Node.duration)).TotalSeconds
        }

        # Pass/Fail
        if ($result.Node.outcome -eq "Passed") {
            $passedTests++
        } else {
            $failedTests++
        }

        # Critical tests (@security, @compliance tags)
        $testName = $result.Node.testName
        if ($testName -match "@security|@compliance") {
            $criticalTests++
            if ($result.Node.outcome -eq "Passed") {
                $criticalPassed++
            }
        }

        # Category counts
        if ($testName -match "Smoke") { $smokeTests++ }
        if ($testName -match "Security") { $securityTests++ }
        if ($testName -match "Compliance") { $complianceTests++ }
        if ($testName -match "Performance") { $performanceTests++ }
    }
}

$passRate = if ($totalTests -gt 0) { [math]::Round(($passedTests / $totalTests) * 100, 2) } else { 0 }
$criticalPassRate = if ($criticalTests -gt 0) { [math]::Round(($criticalPassed / $criticalTests) * 100, 2) } else { 100 }

Write-Host "Regression Test Metrics ($Environment):" -ForegroundColor Yellow
Write-Host "  Total Regression Tests: $totalTests" -ForegroundColor White
Write-Host "  Passed: $passedTests" -ForegroundColor Green
Write-Host "  Failed: $failedTests" -ForegroundColor $(if ($failedTests -gt 0) { "Red" } else { "White" })
Write-Host "  Pass Rate: ${passRate}%" -ForegroundColor White
Write-Host "  Critical Tests: $criticalTests" -ForegroundColor White
Write-Host "  Critical Pass Rate: ${criticalPassRate}%" -ForegroundColor White
Write-Host "  Total Duration: ${totalDuration}s" -ForegroundColor White
Write-Host "" -ForegroundColor White
Write-Host "  Category Breakdown:" -ForegroundColor Yellow
Write-Host "    Smoke: $smokeTests" -ForegroundColor White
Write-Host "    Security: $securityTests" -ForegroundColor White
Write-Host "    Compliance: $complianceTests" -ForegroundColor White
Write-Host "    Performance: $performanceTests" -ForegroundColor White

# Validate thresholds
$failed = $false

if ($passRate -lt $RequiredPassRate) {
    Write-Error "Regression test pass rate too low: ${passRate}% < ${RequiredPassRate}%"
    $failed = $true
}

if ($criticalPassRate -lt 100) {
    Write-Error "Critical tests failed: ${criticalPassRate}% pass rate (must be 100%)"
    $failed = $true
}

if ($totalDuration -gt $MaxDurationSeconds) {
    Write-Error "Regression test suite too slow: ${totalDuration}s > ${MaxDurationSeconds}s"
    $failed = $true
}

# Validate required categories
if ($smokeTests -eq 0) {
    Write-Error "No smoke tests found (required)"
    $failed = $true
}

if ($securityTests -eq 0) {
    Write-Error "No security tests found (required)"
    $failed = $true
}

if ($complianceTests -eq 0) {
    Write-Error "No compliance tests found (required)"
    $failed = $true
}

if ($failed) {
    Write-Host "Regression test quality gates FAILED" -ForegroundColor Red
    exit 1
}

Write-Host "Regression test quality gates PASSED" -ForegroundColor Green
exit 0

Azure Pipelines Integration:

# azure-pipelines.yml — Regression test quality gates
- stage: Deploy_Staging
  dependsOn: CI_Stage
  jobs:
  - deployment: DeployToStaging
    environment: ATP-Staging
    strategy:
      runOnce:
        deploy:
          steps:
          # Deploy to staging
          - template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
            parameters:
              azureSubscription: $(azureSubscription)
              appName: atp-ingestion-staging
              package: $(Pipeline.Workspace)/drop/*.zip

          # Wait for deployment to stabilize
          - task: PowerShell@2
            displayName: 'Wait for Service Stabilization'
            inputs:
              targetType: 'inline'
              script: Start-Sleep -Seconds 60

          # Run regression tests
          - task: DotNetCoreCLI@2
            displayName: 'Run Regression Tests'
            inputs:
              command: 'test'
              projects: '**/*RegressionTests.csproj'
              arguments: '--configuration Release --logger trx --results-directory $(Agent.TempDirectory)/TestResults'
              publishTestResults: true
            env:
              TestEnvironment: 'Staging'
              BaseUrl: 'https://atp-ingestion-staging.azurewebsites.net'

          # Validate regression test quality
          - task: PowerShell@2
            displayName: 'Validate Regression Test Quality'
            inputs:
              filePath: 'scripts/Validate-RegressionTestQuality.ps1'
              arguments: >
                -TestResultsPath "$(Agent.TempDirectory)/TestResults"
                -RequiredPassRate 100
                -Environment "Staging"
                -MaxDurationSeconds 900
            continueOnError: false

          # On failure: rollback
          - task: PowerShell@2
            displayName: 'Rollback on Test Failure'
            condition: failed()
            inputs:
              targetType: 'inline'
              script: |
                Write-Host "Regression tests failed; rolling back deployment..."
                az webapp deployment slot swap `
                  --name atp-ingestion-staging `
                  --resource-group ATP-Staging-RG `
                  --slot staging `
                  --target-slot production `
                  --action swap

Regression Test Examples (C# with SpecFlow):

// ✅ GOOD: BDD-style regression test with Gherkin
[Binding]
public class AuditRecordRegressionSteps
{
    private readonly ScenarioContext _scenarioContext;
    private readonly HttpClient _client;
    private HttpResponseMessage _response;

    public AuditRecordRegressionSteps(ScenarioContext scenarioContext)
    {
        _scenarioContext = scenarioContext;
        _client = new HttpClient { BaseAddress = new Uri(Environment.GetEnvironmentVariable("BaseUrl")) };
    }

    [Given(@"a tenant with ID ""(.*)""")]
    public void GivenATenantWithID(string tenantId)
    {
        _scenarioContext["TenantId"] = Guid.Parse(tenantId);
    }

    [When(@"I create an audit record with action ""(.*)""")]
    public async Task WhenICreateAnAuditRecordWithAction(string action)
    {
        var request = new CreateAuditRecordRequest
        {
            TenantId = (Guid)_scenarioContext["TenantId"],
            Action = action,
            UserId = "test-user",
            Timestamp = DateTime.UtcNow
        };

        _response = await _client.PostAsJsonAsync("/api/audit-records", request);
        _scenarioContext["Response"] = _response;
    }

    [Then(@"the response status code should be (.*)")]
    public void ThenTheResponseStatusCodeShouldBe(int expectedStatusCode)
    {
        var response = (HttpResponseMessage)_scenarioContext["Response"];
        Assert.Equal(expectedStatusCode, (int)response.StatusCode);
    }

    [Then(@"the audit record should be retrievable")]
    public async Task ThenTheAuditRecordShouldBeRetrievable()
    {
        var response = (HttpResponseMessage)_scenarioContext["Response"];
        var createResult = await response.Content.ReadFromJsonAsync<CreateAuditRecordResponse>();

        var getResponse = await _client.GetAsync($"/api/audit-records/{createResult.RecordId}");
        Assert.Equal(HttpStatusCode.OK, getResponse.StatusCode);

        var record = await getResponse.Content.ReadFromJsonAsync<AuditRecordDto>();
        Assert.Equal(createResult.RecordId, record.Id);
    }

    [Then(@"the audit record should be immutable")]
    [Trait("Category", "Compliance")]
    public async Task ThenTheAuditRecordShouldBeImmutable()
    {
        var response = (HttpResponseMessage)_scenarioContext["Response"];
        var createResult = await response.Content.ReadFromJsonAsync<CreateAuditRecordResponse>();

        // Attempt to update the record (should fail)
        var updateRequest = new { Action = "ModifiedAction" };
        var updateResponse = await _client.PutAsJsonAsync($"/api/audit-records/{createResult.RecordId}", updateRequest);

        Assert.Equal(HttpStatusCode.MethodNotAllowed, updateResponse.StatusCode);
    }
}

Gherkin Feature File:

# AuditRecordRegression.feature
Feature: Audit Record Regression Tests
  As a system operator
  I want to ensure audit records work correctly in staging
  So that production deployments are safe

@smoke @regression
Scenario: Create and retrieve audit record
  Given a tenant with ID "00000000-0000-0000-0000-000000000001"
  When I create an audit record with action "UserLogin"
  Then the response status code should be 201
  And the audit record should be retrievable

@security @compliance @regression
Scenario: Audit records are immutable
  Given a tenant with ID "00000000-0000-0000-0000-000000000001"
  When I create an audit record with action "UserLogin"
  Then the response status code should be 201
  And the audit record should be immutable

@security @regression
Scenario: Tenant isolation is enforced
  Given a tenant with ID "00000000-0000-0000-0000-000000000001"
  And another tenant with ID "00000000-0000-0000-0000-000000000002"
  When I create an audit record for tenant 1
  And I query audit records for tenant 2
  Then the response should contain zero records
  And the tenant 1 record should not be visible
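
The tenant isolation scenario uses steps that the binding class above does not define. A hedged sketch of the missing step definitions, reusing the same `ScenarioContext`, `_client`, and DTOs as the earlier examples (the query endpoint shape `/api/audit-records?tenantId=…` is an assumption):

// Sketch of the additional bindings the tenant isolation scenario needs
[Given(@"another tenant with ID ""(.*)""")]
public void GivenAnotherTenantWithID(string tenantId)
{
    _scenarioContext["OtherTenantId"] = Guid.Parse(tenantId);
}

[When(@"I create an audit record for tenant 1")]
public async Task WhenICreateAnAuditRecordForTenant1()
{
    var request = new CreateAuditRecordRequest
    {
        TenantId = (Guid)_scenarioContext["TenantId"],
        Action = "UserLogin",
        UserId = "test-user"
    };
    _response = await _client.PostAsJsonAsync("/api/audit-records", request);
}

[When(@"I query audit records for tenant 2")]
public async Task WhenIQueryAuditRecordsForTenant2()
{
    var otherTenantId = (Guid)_scenarioContext["OtherTenantId"];
    _scenarioContext["QueryResponse"] =
        await _client.GetAsync($"/api/audit-records?tenantId={otherTenantId}");
}

[Then(@"the response should contain zero records")]
public async Task ThenTheResponseShouldContainZeroRecords()
{
    var queryResponse = (HttpResponseMessage)_scenarioContext["QueryResponse"];
    var records = await queryResponse.Content.ReadFromJsonAsync<List<AuditRecordDto>>();
    Assert.Empty(records);
}

[Then(@"the tenant 1 record should not be visible")]
public async Task ThenTheTenant1RecordShouldNotBeVisible()
{
    var queryResponse = (HttpResponseMessage)_scenarioContext["QueryResponse"];
    var records = await queryResponse.Content.ReadFromJsonAsync<List<AuditRecordDto>>();
    var tenant1Id = (Guid)_scenarioContext["TenantId"];
    Assert.DoesNotContain(records, r => r.TenantId == tenant1Id);
}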

Test Quality Metrics & Reporting

Purpose: Track and report on test quality metrics to enable continuous improvement.

Test Quality Scorecard:

| Metric | Target | Current | Status |
|--------|--------|---------|--------|
| Unit Test Count | ≥50 per service | 67 | ✅ Pass |
| Unit Test Duration | <30s | 24s | ✅ Pass |
| Unit Test Pass Rate | 100% | 100% | ✅ Pass |
| Unit Test Flaky Rate | <5% | 2.1% | ✅ Pass |
| Assertion Density | ≥1.5 | 1.8 | ✅ Pass |
| Integration Test Count | ≥20 per service | 28 | ✅ Pass |
| Integration Test Duration | <5min | 4min 12s | ✅ Pass |
| Tenant Isolation Tests | ≥5 | 8 | ✅ Pass |
| Contract Tests | ≥10 | 12 | ✅ Pass |
| Regression Test Pass Rate | 100% | 100% | ✅ Pass |
| Critical Scenario Pass Rate | 100% | 100% | ✅ Pass |
| Regression Test Duration | <15min | 12min 45s | ✅ Pass |

Test Quality Dashboard (KQL):

// Query test quality metrics from Azure DevOps
TestResults
| where TestSuite in ("UnitTests", "IntegrationTests", "RegressionTests")
| where CompletedDate >= ago(7d)
| summarize
    TotalTests = count(),
    PassedTests = countif(Outcome == "Passed"),
    FailedTests = countif(Outcome == "Failed"),
    // Heuristic: assumes known-flaky tests carry "Flaky" in their name per quarantine convention
    FlakyTests = countif(Outcome == "Failed" and TestName contains "Flaky"),
    AvgDuration = avg(Duration),
    MaxDuration = max(Duration)
  by TestSuite, bin(CompletedDate, 1d)
| extend PassRate = (PassedTests * 100.0) / TotalTests
| extend FlakyRate = (FlakyTests * 100.0) / TotalTests
| project
    Date = CompletedDate,
    TestSuite,
    TotalTests,
    PassRate,
    FlakyRate,
    AvgDurationSeconds = AvgDuration / 1000,
    MaxDurationSeconds = MaxDuration / 1000
| order by Date desc, TestSuite

Summary

  • Unit Test Quality Gates: Min 50 tests, <30s duration, <5% flaky rate, ≥1.5 assertion density, ≤3 quarantined tests, naming convention enforced (ATP003 analyzer)
  • Integration Test Quality Gates: Min 20 tests, <5min duration, required service containers (redis, sql, rabbitmq), tenant isolation/contract/error scenario tests required
  • Regression Test Quality Gates: 100% pass rate, 100% critical scenario pass rate, <15min duration, environment/tenant/category coverage validated
  • PowerShell Validators: 3 quality gate validation scripts (unit, integration, regression) integrated into Azure Pipelines
  • Test Examples: 10+ C# examples demonstrating AAA pattern, tenant isolation, error handling, contract validation, BDD/Gherkin
  • Test Quality Metrics: 12-metric scorecard tracked via KQL dashboard, test quality trends analyzed for continuous improvement

Governance & Continuous Evolution

Purpose: Establish clear ownership for each quality gate category and define evolution roadmap for continuous improvement of quality gate effectiveness.

Governance Principles:

  • Owned & Accountable: Each gate type has a designated owner responsible for threshold maintenance and updates
  • Regularly Reviewed: Quality gates reviewed quarterly (minimum) or monthly for security/compliance
  • Evidence-Based Evolution: Gate thresholds adjusted based on historical data and team capability
  • Transparent Communication: Gate changes communicated to all stakeholders with rationale and migration plan
  • Continuous Improvement: Roadmap for enhancing gate automation, accuracy, and developer experience

Quality Gate Ownership

Purpose: Define clear accountability for each quality gate category with designated owners, reviewers, and update frequency.

Ownership Matrix:

| Gate Type | Owner | Reviewer | Update Frequency | Escalation Path |
|-----------|-------|----------|------------------|-----------------|
| Build Quality | Tech Lead | Architect | Quarterly | CTO |
| Test Coverage | QA Lead | Tech Lead | Quarterly | VP Engineering |
| Security | Security Officer | CISO | Monthly | CISO → Board |
| SBOM & Supply Chain | Security Officer | CISO | Monthly | CISO → Board |
| Compliance | Compliance Officer | DPO (Data Protection Officer) | Monthly | Legal Counsel |
| Performance | SRE Lead | Architect | Quarterly | VP Engineering |
| Observability | SRE Lead | Tech Lead | Quarterly | VP Engineering |
| Contract & API | Architect | Tech Lead | Quarterly | CTO |
| Approval Gates | CAB (Change Advisory Board) | VP Engineering | As-needed | CTO |

Owner Responsibilities:

## Quality Gate Owner Responsibilities

### 1. Threshold Maintenance
- Review gate thresholds quarterly (or monthly for security/compliance)
- Analyze historical gate pass/fail trends
- Recommend threshold adjustments based on team capability and risk tolerance
- Document rationale for threshold changes in ADR (Architecture Decision Record)

### 2. Gate Effectiveness Monitoring
- Track gate precision/recall (true positives, false positives)
- Identify and remediate false positive patterns
- Monitor gate execution time (ensure gates provide fast feedback)
- Review gate failure remediation time (MTTR)

### 3. Stakeholder Communication
- Communicate gate changes to development teams with 2-week notice
- Provide migration guides and examples for new gates
- Conduct training sessions for complex gates (e.g., Roslyn analyzers)
- Publish monthly gate health reports to stakeholders

### 4. Continuous Improvement
- Propose new gates for emerging risks (e.g., AI model validation, privacy-preserving ML)
- Automate manual gates where feasible (e.g., shift approval gates to pre-deployment validation)
- Improve gate error messages for faster remediation
- Contribute to evolution roadmap

Reviewer Responsibilities:

## Quality Gate Reviewer Responsibilities

### 1. Threshold Review & Approval
- Review proposed threshold changes
- Validate rationale and supporting data
- Approve or reject changes based on risk assessment
- Ensure changes align with organizational standards

### 2. Risk Assessment
- Evaluate security/compliance impact of threshold changes
- Identify potential risks from relaxing thresholds
- Recommend compensating controls if thresholds lowered
- Escalate high-risk changes to executive leadership

### 3. Audit & Governance
- Ensure gate changes documented in version control (Git)
- Verify gate changes logged in meta-audit stream
- Validate gate changes comply with SOC 2/GDPR/HIPAA
- Support external audits with gate evidence

Quality Gate Change Request Process:

graph TD
    A[Owner Proposes Change] --> B[Document Rationale in ADR]
    B --> C[Analyze Historical Data]
    C --> D[Create Change Request]

    D --> E{Reviewer Approval?}
    E -->|No| F[Reject with Feedback]
    E -->|Yes| G{Security/Compliance Impact?}

    G -->|High| H[Escalate to CISO/DPO]
    G -->|Low/Medium| I[Schedule Rollout]

    H --> J{Executive Approval?}
    J -->|No| F
    J -->|Yes| I

    I --> K[Communicate to Teams]
    K --> L[Update Pipeline Config]
    L --> M[Deploy to Dev]
    M --> N[Soak Period 2 weeks]
    N --> O{Issues Detected?}

    O -->|Yes| P[Rollback]
    O -->|No| Q[Deploy to Test]
    Q --> R[Deploy to Staging]
    R --> S[Deploy to Production]

    F --> T[Owner Revises Proposal]
    P --> T

    S --> U[Log Change in Meta-Audit]

    style F fill:#ff6b6b
    style P fill:#ff6b6b
    style S fill:#90EE90

Quality Gate Change Request Template (Azure DevOps Work Item):

# Work Item Type: Quality Gate Change Request
fields:
  - field: System.Title
    value: "[Gate Change] [Gate Type]: [Change Summary]"

  - field: System.Description
    value: |
      ## Change Summary
      **Gate Type**: [Build Quality / Test Coverage / Security / etc.]  
      **Current Threshold**: [Current value]  
      **Proposed Threshold**: [New value]  
      **Change Type**: [Tighten / Relax / Add New Gate / Remove Gate]

      ## Rationale
      **Business Justification**: 
      [Why is this change needed? What problem does it solve?]

      **Historical Data**:
      - Current pass rate: [X%]
      - Average violations per build: [N]
      - Remediation time (MTTR): [N hours]
      - False positive rate: [X%]

      **Risk Assessment**:
      - **Security Impact**: [None / Low / Medium / High]
      - **Compliance Impact**: [None / Low / Medium / High]
      - **Developer Productivity Impact**: [Positive / Neutral / Negative]

      ## Migration Plan
      **Rollout Strategy**: [Gradual / Immediate]  
      **Soak Period**: [2 weeks in Dev/Test]  
      **Rollback Criteria**: [If >10% builds fail, rollback]

      **Communication Plan**:
      - [ ] Email to dev team (2 weeks before)
      - [ ] Slack announcement with examples
      - [ ] Training session scheduled (if complex change)
      - [ ] Documentation updated

      **Compensating Controls** (if relaxing threshold):
      [What additional controls mitigate risk?]

      ## Approval Checklist
      - [ ] Rationale documented
      - [ ] Historical data analyzed
      - [ ] Reviewer approved
      - [ ] Security/Compliance reviewed (if applicable)
      - [ ] Communication plan executed
      - [ ] ADR created (for significant changes)

  - field: Microsoft.VSTS.Common.Priority
    value: 2  # P2 default; P1 for urgent security changes

  - field: Custom.GateType
    value: TestCoverage

  - field: Custom.CurrentThreshold
    value: "70%"

  - field: Custom.ProposedThreshold
    value: "75%"

  - field: Custom.Owner
    value: qa-lead@connectsoft.example

  - field: Custom.Reviewer
    value: tech-lead@connectsoft.example

Evolution Roadmap

Purpose: Define strategic vision for quality gate evolution, incorporating ML/AI, automation, and continuous improvement.

2025 Evolution Roadmap:

Q1 2025: ML-Based Flaky Test Detection & Auto-Quarantine

Goal: Automatically detect and quarantine flaky tests using machine learning instead of manual threshold-based detection.

Features:

  • ML Model: Train model on historical test results (pass/fail patterns, duration variability, environment correlations)
  • Auto-Quarantine: Automatically move flaky tests to the quarantine suite when ML confidence > 85%
  • Root Cause Analysis: Use ML to identify common flaky test patterns (timing issues, resource contention, test order dependencies)
  • Self-Healing: Attempt automated fixes (add retries, increase timeouts, improve test isolation)

Implementation (C# + ML.NET):

// FlakyTestPredictor.cs — ML-based flaky test detection
using Microsoft.ML;
using Microsoft.ML.Data;
using System;
using System.Collections.Generic;
using System.Linq;

public class FlakyTestPredictor
{
    private readonly MLContext _mlContext;
    private ITransformer _model;

    public FlakyTestPredictor()
    {
        _mlContext = new MLContext(seed: 0);
    }

    public void TrainModel(IEnumerable<TestExecution> historicalData)
    {
        var dataView = _mlContext.Data.LoadFromEnumerable(historicalData);

        // Feature engineering: extract patterns from test executions
        var pipeline = _mlContext.Transforms.Categorical.OneHotEncoding("TestName")
            .Append(_mlContext.Transforms.NormalizeMinMax("Duration"))
            .Append(_mlContext.Transforms.NormalizeMinMax("PassRate"))
            // Concatenate requires Single-typed columns, so convert the int count first
            .Append(_mlContext.Transforms.Conversion.ConvertType("FailurePatternCount", outputKind: DataKind.Single))
            .Append(_mlContext.Transforms.Concatenate("Features",
                "TestName", "Duration", "PassRate", "FailurePatternCount"))
            .Append(_mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
                labelColumnName: "Label", featureColumnName: "Features"));

        _model = pipeline.Fit(dataView);

        // Evaluate model (IsFlaky maps to the "Label" column via [ColumnName])
        var predictions = _model.Transform(dataView);
        var metrics = _mlContext.BinaryClassification.Evaluate(predictions, labelColumnName: "Label");

        Console.WriteLine($"Model Accuracy: {metrics.Accuracy:P2}");
        Console.WriteLine($"AUC: {metrics.AreaUnderRocCurve:P2}");
    }

    public FlakyTestPrediction PredictFlakiness(TestExecution test)
    {
        var predictionEngine = _mlContext.Model.CreatePredictionEngine<TestExecution, FlakyTestPrediction>(_model);
        return predictionEngine.Predict(test);
    }
}

public class TestExecution
{
    public string TestName { get; set; }
    public float Duration { get; set; }
    public float PassRate { get; set; }  // Historical pass rate (0.0 to 1.0)
    public int FailurePatternCount { get; set; }  // Number of intermittent failure patterns

    [ColumnName("Label")]
    public bool IsFlaky { get; set; }
}

public class FlakyTestPrediction
{
    [ColumnName("PredictedLabel")]
    public bool IsFlaky { get; set; }

    public float Probability { get; set; }
    public float Score { get; set; }
}
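
To connect the predictor to the auto-quarantine policy described above (ML confidence > 85%), a usage sketch — `historicalExecutions`, `candidateTest`, and the `quarantineService` hook are hypothetical names, not part of the predictor:

// Usage sketch: quarantine only when the model is confident the test is flaky
var predictor = new FlakyTestPredictor();
predictor.TrainModel(historicalExecutions);  // IEnumerable<TestExecution> of past runs

var prediction = predictor.PredictFlakiness(candidateTest);
if (prediction.IsFlaky && prediction.Probability > 0.85f)
{
    // Hypothetical hook: move the test to the quarantine suite and open a work item
    quarantineService.Quarantine(candidateTest.TestName, prediction.Probability);
}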

Success Metrics:

  • ML model accuracy > 90%
  • False positive rate < 5%
  • Auto-quarantine reduces flaky test failures by 50%


Q2 2025: Predictive Gate Failure Analysis (Pre-Commit Warnings)

Goal: Predict quality gate failures before commit using ML analysis of code changes.

Features:

  • Pre-Commit Hooks: Analyze code changes locally before commit
  • ML Predictions: Predict likelihood of gate failures (coverage, security, complexity)
  • IDE Integration: VS Code/Visual Studio extensions show gate predictions in real-time
  • Remediation Suggestions: AI suggests fixes (add tests, refactor complex methods, update dependencies)

Implementation (PowerShell pre-commit hook):

```powershell
# .git/hooks/pre-commit — Predictive gate failure analysis

$ErrorActionPreference = "Stop"

Write-Host "Running pre-commit gate analysis..." -ForegroundColor Cyan

# Get staged files
$stagedFiles = git diff --cached --name-only --diff-filter=ACM | Where-Object { $_ -match '\.cs$' }

if ($stagedFiles.Count -eq 0) {
    Write-Host "No C# files staged for commit" -ForegroundColor Gray
    exit 0
}

# Analyze code changes
$predictions = @()

foreach ($file in $stagedFiles) {
    # Call ML API to predict gate failures
    $response = Invoke-RestMethod -Uri "https://atp-ml-api.azurewebsites.net/predict/gate-failures" `
        -Method Post `
        -Body (@{
            filePath = $file
            changes = ((git diff --cached $file) -join "`n")
        } | ConvertTo-Json) `
        -ContentType "application/json"

    if ($response.predictions.Count -gt 0) {
        $predictions += $response.predictions
    }
}

# Display predictions
if ($predictions.Count -gt 0) {
    Write-Host ""
    Write-Host "⚠️ Predicted Quality Gate Failures:" -ForegroundColor Yellow

    foreach ($prediction in $predictions) {
        Write-Host "  ❌ $($prediction.gateType): $($prediction.reason)" -ForegroundColor Red
        Write-Host "     File: $($prediction.file):$($prediction.line)" -ForegroundColor Gray
        Write-Host "     Confidence: $($prediction.confidence)%" -ForegroundColor Gray
        Write-Host "     Suggestion: $($prediction.suggestion)" -ForegroundColor Cyan
        Write-Host ""
    }

    # Prompt user ($answer avoids shadowing the API $response above)
    $answer = Read-Host "Proceed with commit anyway? (y/N)"
    if ($answer -ne "y") {
        Write-Host "Commit aborted. Fix predicted issues and try again." -ForegroundColor Red
        exit 1
    }
}

Write-Host "✅ Pre-commit analysis passed" -ForegroundColor Green
exit 0
```

**Success Metrics**:
- 80% of predicted failures match actual failures
- Developers fix 60% of predicted issues before commit
- Average gate failure remediation time reduced by 40%


Q3 2025: Self-Healing Pipelines (Auto-Retry Transient Failures)

**Goal**: Automatically detect and retry transient failures (network timeouts, resource contention, flaky dependencies).

**Features**:
- **Failure Pattern Recognition**: ML identifies transient vs. permanent failures
- **Smart Retry**: Automatically retry with exponential backoff and jitter
- **Root Cause Logging**: Log failure patterns for post-mortem analysis
- **Auto-Escalation**: Escalate to a human if retries are exhausted
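The failure-pattern-recognition step can be illustrated with a trivial keyword baseline. The roadmap item uses an ML classifier for this, so the marker list here is only an illustrative stand-in for a learned model:

```python
# Baseline sketch of transient-vs-permanent classification. Markers typical of
# retryable infrastructure failures; anything else is treated as permanent.
TRANSIENT_MARKERS = (
    "timeout", "connection reset", "temporarily unavailable",
    "503", "429", "deadlock victim", "socket",
)

def classify_failure(log_text: str) -> str:
    text = log_text.lower()
    if any(marker in text for marker in TRANSIENT_MARKERS):
        return "transient"   # safe to retry with backoff
    return "permanent"       # e.g. assertion failures, compile errors: do not retry

print(classify_failure("System.Net.Http: request timeout after 30s"))   # transient
print(classify_failure("Assert.Equal() Failure: expected 5, actual 3"))  # permanent
```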

**Implementation** (Azure Pipelines YAML):

```yaml
# Self-healing pipeline with smart retry
steps:
- task: DotNetCoreCLI@2
  displayName: 'Run Integration Tests'
  inputs:
    command: 'test'
    arguments: '--configuration Release --filter Category=Integration'
  retryCountOnTaskFailure: 3  # Azure Pipelines native retry
  env:
    RETRY_STRATEGY: 'exponential'  # Custom retry with exponential backoff

# Custom retry logic with ML-based transient failure detection
- task: PowerShell@2
  displayName: 'Smart Retry on Transient Failures'
  condition: failed()  # Only run if previous step failed
  inputs:
    targetType: 'inline'
    script: |
      # Analyze failure logs (-Raw returns a single string, not a line array)
      $failureLogs = Get-Content "$(Agent.TempDirectory)/test-logs.txt" -Raw

      # Call ML API to classify failure (transient vs. permanent)
      $response = Invoke-RestMethod -Uri "https://atp-ml-api.azurewebsites.net/classify/failure" `
        -Method Post `
        -Body (@{ logs = $failureLogs } | ConvertTo-Json) `
        -ContentType "application/json"

      if ($response.classification -eq "transient") {
        Write-Host "Transient failure detected; retrying with exponential backoff..."

        for ($i = 1; $i -le 3; $i++) {
          $backoff = [Math]::Pow(2, $i) * (Get-Random -Minimum 1000 -Maximum 2000)
          Write-Host "Retry attempt $i after ${backoff}ms..."
          Start-Sleep -Milliseconds $backoff

          # Retry test
          dotnet test --configuration Release --filter Category=Integration

          if ($LASTEXITCODE -eq 0) {
            Write-Host "✅ Retry succeeded"
            exit 0
          }
        }

        Write-Host "❌ All retries exhausted; escalating to human"
        exit 1
      } else {
        Write-Host "Permanent failure detected; no retry attempted"
        exit 1
      }
```
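The retry schedule used in the inline script above (exponential backoff with jitter) can be sketched in isolation; attempt counts and bounds mirror the pipeline defaults:

```python
import random

def backoff_delays_ms(max_attempts: int = 3,
                      base_ms: int = 1000, jitter_ms: int = 1000) -> list[float]:
    """Delay before attempt i is 2^i * uniform(base, base + jitter) ms,
    mirroring [Math]::Pow(2, $i) * (Get-Random -Minimum 1000 -Maximum 2000)."""
    return [(2 ** i) * random.uniform(base_ms, base_ms + jitter_ms)
            for i in range(1, max_attempts + 1)]

for attempt, delay in enumerate(backoff_delays_ms(), start=1):
    print(f"retry {attempt}: sleep {delay:.0f} ms")  # ~2-4s, ~4-8s, ~8-16s
```

The jitter term spreads retries from many concurrent agents so they do not stampede a recovering dependency at the same instant.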

**Success Metrics**:
- 70% of transient failures resolved by auto-retry
- Average pipeline success rate increased from 95% to 98%
- Manual intervention reduced by 50%


Q4 2025: AI-Assisted Code Quality Suggestions in PR Reviews

**Goal**: Provide real-time code quality suggestions in pull request comments using GPT-4 or similar LLMs.

**Features**:
- **Code Review Bot**: Automatically reviews PRs and suggests improvements
- **Quality Gate Preview**: Show predicted gate results before merge
- **Best Practice Suggestions**: Recommend design patterns, refactorings, test coverage improvements
- **Security Vulnerability Detection**: Identify potential security issues (SQL injection, XSS, etc.)

**Implementation** (GitHub Action / Azure DevOps Extension):

```yaml
# .github/workflows/ai-code-review.yml
name: AI-Assisted Code Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
      with:
        fetch-depth: 0  # Full history for diff analysis

    - name: AI Code Review
      uses: connectsoft/ai-code-review-action@v1
      with:
        github-token: ${{ secrets.GITHUB_TOKEN }}
        openai-api-key: ${{ secrets.OPENAI_API_KEY }}
        model: 'gpt-4'
        review-scope: 'diff'  # Only review changed files

        # Quality gates to analyze (action inputs must be strings,
        # so the list is passed as a block scalar)
        gates: |
          complexity
          test-coverage
          security
          best-practices
          performance

        # Severity thresholds
        fail-on: 'critical'  # Block PR on critical issues
        warn-on: 'high'      # Comment warning on high issues
```

**Example AI-Generated PR Comment**:

## 🤖 AI Code Review — Ingestion Service PR #456

### ✅ Quality Gates Preview
- **Build Quality**: ✅ Pass (0 errors, 0 warnings)
- **Test Coverage**: ⚠️ Warning (68% line coverage, target 70%)
- **Security**: ✅ Pass (0 vulnerabilities)
- **Complexity**: ⚠️ Warning (1 method exceeds complexity threshold)

### 📝 Suggestions

**1. Test Coverage (Medium Priority)**
```csharp
// File: AuditRecordService.cs:45-78
public async Task<CreateAuditRecordResult> CreateAuditRecordAsync(CreateAuditRecordRequest request)
{
    // ... implementation ...
}
```

Issue: This method has no unit tests.
Suggestion: Add unit tests for edge cases:
- Null request
- Invalid tenant ID
- Database connection failure

Example Test:

```csharp
[Fact]
public async Task CreateAuditRecord_WithNullRequest_ThrowsArgumentNullException()
{
    var service = new AuditRecordService();
    await Assert.ThrowsAsync<ArgumentNullException>(() =>
        service.CreateAuditRecordAsync(null));
}
```


**2. Cyclomatic Complexity (Low Priority)**

```csharp
// File: QueryOptimizer.cs:142-198
private QueryPlan OptimizeQuery(Query query)
{
    // ... complex logic with 15 decision points ...
}
```

Issue: Complexity = 18 (threshold: 15)
Suggestion: Extract sub-methods:
- `OptimizeFilters(Query query)`
- `OptimizeJoins(Query query)`
- `OptimizeSorting(Query query)`


**3. Security Best Practice (Medium Priority)**

```csharp
// File: ExportService.cs:89
var sql = $"SELECT * FROM AuditRecords WHERE TenantId = '{request.TenantId}'";
```

Issue: String interpolation in SQL (potential SQL injection)
Suggestion: Use parameterized queries:

```csharp
var sql = "SELECT * FROM AuditRecords WHERE TenantId = @TenantId";
var parameters = new { TenantId = request.TenantId };
```


### 📊 Code Health Score: 85/100 (Good)

Breakdown:
- Maintainability: 90/100
- Reliability: 85/100
- Security: 80/100
- Test Coverage: 75/100

**Overall**: Code is in good shape. Address the test coverage gap and the SQL injection issue before merge.

**Success Metrics**:
- 90% of AI suggestions accepted by developers
- Code review time reduced by 30%
- Security vulnerabilities detected pre-merge increased by 50%

---

### Continuous Improvement Framework

**Purpose**: Establish a **systematic process** for continuously improving quality gate effectiveness.

**Improvement Cycle** (Monthly):

```mermaid
graph TD
    A[Collect Gate Metrics] --> B[Analyze Effectiveness]
    B --> C{Issues Identified?}

    C -->|No| D[Monitor & Continue]
    C -->|Yes| E[Root Cause Analysis]

    E --> F[Propose Improvements]
    F --> G[Prioritize by Impact]
    G --> H[Implement Changes]
    H --> I[Deploy to Dev]
    I --> J[Validate 2 Weeks]
    J --> K{Effective?}

    K -->|No| L[Rollback & Revise]
    K -->|Yes| M[Deploy to Prod]

    L --> F
    M --> D

    style L fill:#ff6b6b
    style M fill:#90EE90
```

**Improvement Backlog** (Azure DevOps Board):

| Priority | Improvement | Owner | Effort | Expected Impact |
|----------|-------------|-------|--------|-----------------|
| P1 | Reduce SonarQube false positives for S3776 | Tech Lead | 2 weeks | High (reduce noise) |
| P1 | Add Roslyn analyzer for async/await patterns | Architect | 3 weeks | High (prevent deadlocks) |
| P2 | Improve coverage exclusion documentation | QA Lead | 1 week | Medium (reduce confusion) |
| P2 | Automate flaky test detection (ML) | SRE Lead | 8 weeks | High (Q1 2025 roadmap) |
| P3 | Add custom SonarQube rules for ATP patterns | Tech Lead | 4 weeks | Medium (ATP-specific quality) |

**Quality Gate Retrospective Template**:

```markdown
# Quality Gate Retrospective — [Month Year]

## Metrics Review

| Gate Type | Pass Rate | Avg Failure Time (MTTR) | False Positive Rate | Developer Satisfaction |
|-----------|-----------|-------------------------|---------------------|------------------------|
| Build Quality | 97.5% | 8 min | 2.1% | 8/10 |
| Test Coverage | 95.2% | 12 min | 5.3% | 7/10 |
| Security | 98.1% | 45 min | 8.7% | 6/10 |

## What Went Well
- ✅ Security gate detected 3 critical CVEs before production
- ✅ Coverage gate pushed team to improve from 68% to 73%
- ✅ SBOM gate helped with license compliance audit

## What Needs Improvement
- ❌ SonarQube false positives frustrating developers
- ❌ Dependency check scan too slow (15 min average)
- ❌ PII redaction validation has edge cases

## Action Items
- [ ] Tune SonarQube quality profile (remove S1135 TODO rule)
- [ ] Parallelize dependency check scan (target <5 min)
- [ ] Improve PII regex patterns (add phone number formats)
- [ ] Document coverage exclusion process (add to developer guide)

## Gate Changes This Month
- ✅ Increased coverage threshold: 70% → 72% (staged rollout)
- ✅ Added ATP003 Roslyn analyzer for test naming
- ✅ Relaxed SonarQube complexity threshold: 10 → 15 (with ADR)

## Developer Feedback Highlights
> "Coverage gate is helpful but sometimes blocks hotfixes. Need emergency bypass process."
> — Developer A

> "SBOM generation is great for compliance but slows down builds. Can we cache?"
> — Developer B

## Next Month Focus
1. Address SonarQube false positives
2. Optimize dependency check performance
3. Pilot ML-based flaky test detection (Q1 2025 roadmap)
```

### Summary

- **Quality Gate Ownership**: 9-gate ownership matrix (owner, reviewer, update frequency, escalation path), owner/reviewer responsibilities, change request process with Mermaid diagram, Azure DevOps change request template
- **Evolution Roadmap**: 4 quarters of innovation (Q1: ML flaky test detection, Q2: predictive gate failure analysis, Q3: self-healing pipelines, Q4: AI-assisted PR reviews), C#/PowerShell/YAML implementations, success metrics per quarter
- **Continuous Improvement Framework**: Monthly improvement cycle (Mermaid diagram), improvement backlog (Azure DevOps board), quality gate retrospective template

### Appendix A — Quality Gate Summary Matrix

**Purpose**: Provide a comprehensive reference for all quality gates with thresholds, enforcement points, blocker status, and applicable environments.

| Gate | Threshold | Enforcement | Blocker | Environment | Owner | Bypass Allowed |
|------|-----------|-------------|---------|-------------|-------|----------------|
| Build Errors | 0 | CI (Build) | ✅ Yes | All | Tech Lead | ❌ No |
| Build Warnings | 0 (TreatWarningsAsErrors) | CI (Build) | ✅ Yes | All | Tech Lead | ❌ No |
| Line Coverage | ≥70% (per service) | CI (Test) | ✅ Yes | All | QA Lead | ⚠️ Emergency only |
| Branch Coverage | ≥60% (per service) | CI (Test) | ✅ Yes | All | QA Lead | ⚠️ Emergency only |
| SonarQube Bugs | 0 | CI (Build) | ✅ Yes | All | Tech Lead | ❌ No |
| SonarQube Vulnerabilities | 0 | CI (Build) | ✅ Yes | All | Security Officer | ❌ No |
| SonarQube Code Smells | ≤10 (minor) | CI (Build) | ⚠️ Warning | All | Tech Lead | ✅ Yes (with review) |
| Critical CVEs (CVSS 9-10) | 0 | CI (Security Scan) | ✅ Yes | All | Security Officer | ❌ No |
| High CVEs (CVSS 7-8.9) | 0 | CI (Security Scan) | ✅ Yes | All | Security Officer | ⚠️ With risk acceptance |
| Medium CVEs (CVSS 4-6.9) | Fix within 30 days | CI (Security Scan) | ⚠️ Warning | All | Security Officer | ✅ Yes |
| Secrets Detected | 0 | CI (Security Scan) | ✅ Yes | All | Security Officer | ❌ No |
| SBOM Generated | Required (CycloneDX) | CI (Build) | ✅ Yes | All | Security Officer | ❌ No |
| Container Scan (Critical) | 0 Critical | CI (Build) | ✅ Yes | Staging/Prod | Security Officer | ❌ No |
| Container Scan (High) | 0 High | CI (Build) | ✅ Yes | Staging/Prod | Security Officer | ⚠️ With risk acceptance |
| API Breaking Changes | 0 | CI (Build) | ✅ Yes | All | Architect | ⚠️ With versioning |
| Message Schema Breaking Changes | 0 | CI (Build) | ✅ Yes | All | Architect | ⚠️ With versioning |
| PII in Logs | 0 | CI (Compliance) | ✅ Yes | All | Compliance Officer | ❌ No |
| Audit Logging Violations (ATP001) | 0 | CI (Compliance) | ✅ Yes | All | Compliance Officer | ❌ No |
| Data Classification Missing (ATP002) | 0 | CI (Compliance) | ✅ Yes | All | Compliance Officer | ❌ No |
| Test Naming Convention (ATP003) | 0 violations | CI (Test) | ⚠️ Warning | All | QA Lead | ✅ Yes |
| Unit Test Count | ≥50 per service | CI (Test) | ⚠️ Warning | All | QA Lead | ✅ Yes |
| Integration Test Count | ≥20 per service | CI (Test) | ⚠️ Warning | All | QA Lead | ✅ Yes |
| Flaky Test Rate | <5% | CI (Test) | ⚠️ Warning | All | QA Lead | ✅ Yes |
| Quarantined Tests | ≤3 | CI (Test) | ⚠️ Warning | All | QA Lead | ✅ Yes |
| p50 Latency | <100ms | Staging (Load Test) | ⚠️ Warning | Staging | SRE Lead | ✅ Yes |
| p95 Latency | <500ms | Staging (Load Test) | ✅ Yes (prod) | Staging | SRE Lead | ❌ No (prod) |
| p99 Latency | <1000ms | Staging (Load Test) | ⚠️ Warning | Staging | SRE Lead | ✅ Yes |
| Error Rate | <0.1% | Staging (Load Test) | ✅ Yes (prod) | Staging | SRE Lead | ❌ No (prod) |
| Throughput | ≥1000 RPS | Staging (Load Test) | ⚠️ Warning | Staging | SRE Lead | ✅ Yes |
| Chaos Test Pass Rate (Critical) | 100% | Staging (Chaos) | ✅ Yes (prod) | Staging | SRE Lead | ❌ No (prod) |
| Chaos Test Pass Rate (Non-Critical) | ≥95% | Staging (Chaos) | ⚠️ Warning | Staging | SRE Lead | ✅ Yes |
| Health Checks (Liveness) | 200 OK | All | ✅ Yes | All | SRE Lead | ❌ No |
| Health Checks (Readiness) | 200 OK | All | ✅ Yes | All | SRE Lead | ❌ No |
| OpenTelemetry Instrumentation | Required | CI (Observability) | ⚠️ Warning | All | SRE Lead | ✅ Yes |
| Manual Approval (Staging) | 1 approver | Pre-deploy | ✅ Yes | Staging | CAB | ❌ No |
| Manual Approval (Production) | 2 approvers | Pre-deploy | ✅ Yes | Production | CAB | ⚠️ Emergency only |
| Regression Test Pass Rate | 100% | Staging (Deploy) | ✅ Yes | Staging | QA Lead | ❌ No |
| Critical Scenario Pass Rate | 100% (@security, @compliance) | Staging (Deploy) | ✅ Yes | Staging | QA Lead | ❌ No |

**Gate Bypass Approval Matrix**:

| Bypass Type | Approver 1 | Approver 2 | Conditions | Documentation Required |
|-------------|------------|------------|------------|------------------------|
| Coverage (Emergency Hotfix) | Tech Lead | Architect | Active P1 incident | ADR + Incident ticket |
| High CVE (Risk Acceptance) | Security Officer | CISO | Mitigation controls in place | Risk acceptance form |
| API Breaking Change | Architect | VP Engineering | New major version (v2) | API versioning strategy |
| Manual Approval (Emergency) | CISO | CTO | Critical security patch | Emergency change request |
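The CVE rows and bypass conditions above reduce to a small decision function. A sketch using the CVSS bands from the summary matrix (the function name is hypothetical):

```python
# Decision sketch for the dependency-scan gate. CVSS bands follow Appendix A:
# Critical 9-10 (never bypassable), High 7-8.9 (bypassable only with an
# approved risk acceptance), Medium 4-6.9 (warning with a 30-day fix window).
def cve_gate_decision(cvss: float, risk_accepted: bool = False) -> str:
    if cvss >= 9.0:
        return "BLOCK"                                 # Critical: no bypass
    if cvss >= 7.0:
        return "PASS" if risk_accepted else "BLOCK"    # High: needs risk acceptance
    if cvss >= 4.0:
        return "WARN"                                  # Medium: fix within 30 days
    return "PASS"                                      # Low/None

print(cve_gate_decision(9.8))                      # BLOCK
print(cve_gate_decision(7.5, risk_accepted=True))  # PASS
print(cve_gate_decision(5.0))                      # WARN
```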

### Appendix B — Example Pipeline with All Gates

**Purpose**: Provide a complete reference implementation of Azure Pipelines with all quality gates integrated.

# azure-pipelines-complete.yml — Complete ATP Pipeline with All Quality Gates
name: $(majorMinorVersion).$(semanticVersion)

resources:
  repositories:
    - repository: templates
      type: git
      name: ConnectSoft/ConnectSoft.AzurePipelines
      ref: refs/tags/v2.3.1

  containers:
    - container: redis
      image: redis:7-alpine
      ports: [6379:6379]
    - container: mssql
      image: mcr.microsoft.com/mssql/server:2022-latest
      ports: [1433:1433]
      env:
        ACCEPT_EULA: Y
        SA_PASSWORD: P@ssw0rd123!
    - container: rabbitmq
      image: rabbitmq:3-management-alpine
      ports: [5672:5672, 15672:15672]
    - container: otel-collector
      image: otel/opentelemetry-collector:0.97.0
      ports: [4317:4317]

pool:
  vmImage: 'ubuntu-latest'

variables:
  majorMinorVersion: 1.0
  semanticVersion: $[counter(variables['majorMinorVersion'], 0)]
  buildNumber: $(majorMinorVersion).$(semanticVersion)
  solution: '**/*.slnx'
  exactSolution: 'ConnectSoft.ATP.Ingestion.slnx'
  buildConfiguration: 'Release'
  codeCoverageThreshold: 75
  restoreVstsFeed: 'e4c108b4-7989-4d22-93d6-391b77a39552'

trigger:
  branches:
    include: [master, main]
  paths:
    exclude: [README.md, docs/**]

#═══════════════════════════════════════════════════════════════════════════════
# Stage 1: CI (Build, Test, Security, Compliance)
#═══════════════════════════════════════════════════════════════════════════════
stages:
- stage: CI_Stage
  displayName: 'CI - Build, Test, Security, Compliance'
  jobs:
  - job: Build_Test_Scan
    displayName: 'Build, Test, Security & Compliance Scans'
    timeoutInMinutes: 30

    services:
      redis: redis
      mssql: mssql
      rabbitmq: rabbitmq
      otel: otel-collector

    steps:
    # ─────────────────────────────────────────────────────────────────────────
    # Setup
    # ─────────────────────────────────────────────────────────────────────────
    - task: UseDotNet@2
      displayName: 'Install .NET 8 SDK'
      inputs:
        version: '8.x'

    - task: NuGetAuthenticate@1
      displayName: 'Authenticate NuGet'

    # ─────────────────────────────────────────────────────────────────────────
    # BUILD QUALITY GATES
    # ─────────────────────────────────────────────────────────────────────────
    - task: DotNetCoreCLI@2
      displayName: 'dotnet restore'
      inputs:
        command: 'restore'
        projects: '$(exactSolution)'
        feedsToUse: 'select'
        vstsFeed: '$(restoreVstsFeed)'

    - task: DotNetCoreCLI@2
      displayName: '✅ BUILD QUALITY GATE: Zero Errors/Warnings'
      inputs:
        command: 'build'
        projects: '$(exactSolution)'
        arguments: >
          --configuration $(buildConfiguration)
          --no-restore
          /p:TreatWarningsAsErrors=true
          /p:EnforceCodeStyleInBuild=true
          /p:Deterministic=true
          /p:ContinuousIntegrationBuild=true
      continueOnError: false  # BLOCKER

    # SonarQube analysis
    - task: SonarCloudPrepare@1
      displayName: 'Prepare SonarQube Analysis'
      inputs:
        SonarCloud: 'SonarCloud-ConnectSoft'
        organization: 'connectsoft'
        scannerMode: 'MSBuild'
        projectKey: 'ConnectSoft_ATP_Ingestion'
        projectName: 'ATP Ingestion Service'

    - task: SonarCloudAnalyze@1
      displayName: '✅ BUILD QUALITY GATE: SonarQube Analysis'

    - task: SonarCloudPublish@1
      displayName: 'Publish SonarQube Results'
      inputs:
        pollingTimeoutSec: '300'

    # ─────────────────────────────────────────────────────────────────────────
    # TEST COVERAGE GATES
    # ─────────────────────────────────────────────────────────────────────────
    - task: DotNetCoreCLI@2
      displayName: 'Run Unit Tests'
      inputs:
        command: 'test'
        projects: '**/*Tests.csproj'
        arguments: >
          --configuration $(buildConfiguration)
          --no-build
          --filter "Category=Unit"
          --collect:"XPlat Code Coverage"
          --settings:CodeCoverage.runsettings
          --logger trx
        publishTestResults: true

    - task: PowerShell@2
      displayName: '✅ TESTING QUALITY GATE: Unit Test Quality'
      inputs:
        filePath: 'scripts/Validate-UnitTestQuality.ps1'
        arguments: >
          -MinTests 50
          -MaxDurationSeconds 30
          -MaxFlakyRate 5.0
      continueOnError: false  # BLOCKER

    - task: DotNetCoreCLI@2
      displayName: 'Run Integration Tests'
      inputs:
        command: 'test'
        projects: '**/*Tests.csproj'
        arguments: >
          --configuration $(buildConfiguration)
          --no-build
          --filter "Category=Integration"
          --logger trx
        publishTestResults: true
      env:
        ConnectionStrings__Redis: 'redis:6379'
        ConnectionStrings__Database: 'Server=mssql;Database=ATP_Test;User=sa;Password=P@ssw0rd123!'
        ConnectionStrings__RabbitMQ: 'amqp://guest:guest@rabbitmq:5672'

    - task: PowerShell@2
      displayName: '✅ TESTING QUALITY GATE: Integration Test Quality'
      inputs:
        filePath: 'scripts/Validate-IntegrationTestQuality.ps1'
        arguments: >
          -MinTests 20
          -MaxDurationSeconds 300
      continueOnError: false  # BLOCKER

    - task: PublishCodeCoverageResults@1
      displayName: 'Publish Code Coverage'
      inputs:
        codeCoverageTool: 'Cobertura'
        summaryFileLocation: '$(Agent.TempDirectory)/**/coverage.cobertura.xml'

    - task: BuildQualityChecks@8
      displayName: '✅ TEST COVERAGE GATE: Coverage Threshold'
      inputs:
        checkCoverage: true
        coverageThreshold: $(codeCoverageThreshold)
        coverageFailOption: 'fixed'
        coverageType: 'lines'
        treatBuildWarningsAsErrors: true
        baselineEnabled: true
        baselineType: 'previous'
      continueOnError: false  # BLOCKER

    # ─────────────────────────────────────────────────────────────────────────
    # SECURITY GATES
    # ─────────────────────────────────────────────────────────────────────────
    - task: dependency-check-build-task@6
      displayName: '✅ SECURITY GATE: Dependency Scan (OWASP)'
      inputs:
        projectName: 'ConnectSoft.ATP.Ingestion'
        scanPath: '$(Build.SourcesDirectory)'
        format: 'HTML,JSON,XML'
        failOnCVSS: 7  # Block on High/Critical
        suppressionFile: 'dependency-check-suppressions.xml'
      continueOnError: false  # BLOCKER

    - task: CredScan@3
      displayName: '✅ SECURITY GATE: Secrets Detection'
      inputs:
        toolMajorVersion: 'V2'
        suppressionsFile: 'credscan-suppressions.json'
        debugMode: false
      continueOnError: false  # BLOCKER

    - script: |
        trivy image --severity HIGH,CRITICAL --exit-code 1 \
          connectsoft.azurecr.io/atp/ingestion:$(buildNumber)
      displayName: '✅ SECURITY GATE: Container Image Scan (Trivy)'
      continueOnError: false  # BLOCKER (staging/prod only)
      condition: or(eq(variables['Build.SourceBranch'], 'refs/heads/master'), startsWith(variables['Build.SourceBranch'], 'refs/tags/'))

    # ─────────────────────────────────────────────────────────────────────────
    # SBOM & SUPPLY CHAIN GATES
    # ─────────────────────────────────────────────────────────────────────────
    - task: CmdLine@2
      displayName: '✅ SBOM GATE: Generate SBOM (CycloneDX)'
      inputs:
        script: |
          dotnet tool install --global CycloneDX
          dotnet CycloneDX $(exactSolution) -o $(Build.ArtifactStagingDirectory)/sbom -f json
      continueOnError: false  # BLOCKER

    - task: PowerShell@2
      displayName: '✅ SBOM GATE: Validate SBOM Content'
      inputs:
        filePath: 'scripts/Validate-SBOM.ps1'
        arguments: '-SbomPath "$(Build.ArtifactStagingDirectory)/sbom"'
      continueOnError: false  # BLOCKER

    - task: PublishBuildArtifacts@1
      displayName: 'Publish SBOM Artifact'
      inputs:
        PathtoPublish: '$(Build.ArtifactStagingDirectory)/sbom'
        ArtifactName: 'sbom'

    # ─────────────────────────────────────────────────────────────────────────
    # COMPLIANCE GATES
    # ─────────────────────────────────────────────────────────────────────────
    - task: PowerShell@2
      displayName: '✅ COMPLIANCE GATE: PII Redaction Validation'
      inputs:
        filePath: 'scripts/Validate-PIIRedaction.ps1'
      continueOnError: false  # BLOCKER

    - task: PowerShell@2
      displayName: '✅ COMPLIANCE GATE: GDPR/HIPAA Checklist'
      inputs:
        filePath: 'scripts/Validate-ComplianceChecklist.ps1'
      continueOnError: false  # BLOCKER

    # ─────────────────────────────────────────────────────────────────────────
    # OBSERVABILITY GATES
    # ─────────────────────────────────────────────────────────────────────────
    - task: PowerShell@2
      displayName: '✅ OBSERVABILITY GATE: OpenTelemetry Validation'
      inputs:
        filePath: 'scripts/Validate-OpenTelemetry.ps1'
      continueOnError: false  # WARNING (not blocker)

    # ─────────────────────────────────────────────────────────────────────────
    # CONTRACT & API GATES
    # ─────────────────────────────────────────────────────────────────────────
    - task: PowerShell@2
      displayName: '✅ CONTRACT GATE: OpenAPI Breaking Change Detection'
      inputs:
        filePath: 'scripts/Validate-OpenAPIChanges.ps1'
      continueOnError: false  # BLOCKER

    # ─────────────────────────────────────────────────────────────────────────
    # Publish Artifacts
    # ─────────────────────────────────────────────────────────────────────────
    - task: PublishBuildArtifacts@1
      displayName: 'Publish Build Artifacts'
      inputs:
        PathtoPublish: '$(Build.ArtifactStagingDirectory)'
        ArtifactName: 'drop'

#═══════════════════════════════════════════════════════════════════════════════
# Stage 2: Deploy to Staging (with Performance & Regression Gates)
#═══════════════════════════════════════════════════════════════════════════════
- stage: Deploy_Staging
  displayName: 'Deploy to Staging'
  dependsOn: CI_Stage
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/master'))
  jobs:
  - deployment: DeployToStaging
    displayName: 'Deploy ATP Ingestion to Staging'
    environment: ATP-Staging  # ✅ APPROVAL GATE: 1 approver required
    strategy:
      runOnce:
        deploy:
          steps:
          - template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
            parameters:
              azureSubscription: 'ConnectSoft-Production'
              appName: 'atp-ingestion-staging'
              package: '$(Pipeline.Workspace)/drop/*.zip'

          # Wait for stabilization
          - task: PowerShell@2
            displayName: 'Wait for Service Stabilization'
            inputs:
              targetType: 'inline'
              script: Start-Sleep -Seconds 60

          # ─────────────────────────────────────────────────────────────────
          # PERFORMANCE GATES
          # ─────────────────────────────────────────────────────────────────
          - task: LoadTest@1
            displayName: '✅ PERFORMANCE GATE: Load Testing'
            inputs:
              testPlan: 'load-tests/staging-load-test.jmx'
              thresholdP95: 500  # <500ms p95 latency
              thresholdErrorRate: 0.1  # <0.1% error rate
              thresholdThroughput: 1000  # ≥1000 RPS
            continueOnError: false  # BLOCKER for production

          - task: ChaosTest@1
            displayName: '✅ PERFORMANCE GATE: Chaos Engineering'
            inputs:
              testPlan: 'chaos-tests/staging-chaos-test.yaml'
              criticalScenariosPassRate: 100  # All critical scenarios must pass
            continueOnError: false  # BLOCKER for production

          # ─────────────────────────────────────────────────────────────────
          # REGRESSION TEST GATES
          # ─────────────────────────────────────────────────────────────────
          - task: DotNetCoreCLI@2
            displayName: 'Run Regression Tests'
            inputs:
              command: 'test'
              projects: '**/*RegressionTests.csproj'
              arguments: '--configuration Release --logger trx'
              publishTestResults: true
            env:
              TestEnvironment: 'Staging'
              BaseUrl: 'https://atp-ingestion-staging.azurewebsites.net'

          - task: PowerShell@2
            displayName: '✅ REGRESSION GATE: Regression Test Quality'
            inputs:
              filePath: 'scripts/Validate-RegressionTestQuality.ps1'
              arguments: >
                -RequiredPassRate 100
                -Environment "Staging"
            continueOnError: false  # BLOCKER

          # ─────────────────────────────────────────────────────────────────
          # OBSERVABILITY GATES
          # ─────────────────────────────────────────────────────────────────
          - task: HttpTest@1
            displayName: '✅ OBSERVABILITY GATE: Health Check Validation'
            inputs:
              url: 'https://atp-ingestion-staging.azurewebsites.net/health/ready'
              expectedStatusCode: 200
              retryCount: 3
            continueOnError: false  # BLOCKER

#═══════════════════════════════════════════════════════════════════════════════
# Stage 3: Deploy to Production (with Manual Approval + Canary)
#═══════════════════════════════════════════════════════════════════════════════
- stage: Deploy_Production
  displayName: 'Deploy to Production'
  dependsOn: Deploy_Staging
  condition: and(succeeded(), eq(variables['Build.Reason'], 'Manual'))
  jobs:
  - deployment: DeployToProduction
    displayName: 'Deploy ATP Ingestion to Production (Canary)'
    environment: ATP-Production  # ✅ APPROVAL GATE: 2 approvers + CAB required
    strategy:
      canary:
        increments: [10, 25, 50]  # 10% → 25% → 50% → 100%
        preDeploy:
          steps:
          - script: echo "Pre-deployment validation..."
          # Validate no active incidents
          - task: AzureFunction@1
            displayName: '✅ APPROVAL GATE: Check Active Incidents'
            inputs:
              function: 'ValidateNoActiveIncidents'
              failOnError: true

        deploy:
          steps:
          - template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
            parameters:
              azureSubscription: 'ConnectSoft-Production'
              appName: 'atp-ingestion-prod'
              package: '$(Pipeline.Workspace)/drop/*.zip'
              trafficPercentage: $(strategy.increment)  # Canary traffic routing

        postRouteTraffic:
          steps:
          # Monitor canary metrics
          - task: PowerShell@2
            displayName: '✅ CANARY GATE: Monitor Metrics'
            inputs:
              targetType: 'inline'
              script: |
                Start-Sleep -Seconds 600  # Wait 10 minutes

                # Query Application Insights for error rate
                # (PowerShell line continuation uses the backtick, not backslash)
                $errorRate = [double](az monitor app-insights metrics show `
                  --app atp-appinsights-prod-eus `
                  --metric "requests/failed" `
                  --aggregation avg `
                  --offset 10m `
                  --query "value.segments[0]['requests/failed'].avg" -o tsv)

                if ($errorRate -gt 0.01) {  # >1% error rate
                  Write-Error "Error rate too high: $errorRate"
                  exit 1  # Trigger rollback
                }

        on:
          failure:
            steps:
            - script: echo "🔴 Canary deployment failed; rolling back..."
            - task: AzureAppServiceManage@0
              inputs:
                azureSubscription: 'ConnectSoft-Production'
                action: 'Swap Slots'
                webAppName: 'atp-ingestion-prod'
                sourceSlot: 'production'
                targetSlot: 'canary'
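The gate scripts referenced throughout this pipeline (`Validate-UnitTestQuality.ps1`, `Validate-SBOM.ps1`, and the rest) share one shape: parse an artifact, compare a metric to a threshold, and exit non-zero to block the stage. A minimal sketch of that pattern in Python, shown for a Cobertura coverage report (the sample XML is illustrative):

```python
# Gate-script pattern: parse artifact, compare metric to threshold, exit
# non-zero to fail the pipeline step. The Cobertura root element carries
# line-rate as a 0..1 fraction.
import sys
import xml.etree.ElementTree as ET

def check_line_coverage(cobertura_xml: str, threshold_pct: float) -> bool:
    root = ET.fromstring(cobertura_xml)
    line_rate = float(root.attrib["line-rate"]) * 100  # convert fraction to percent
    print(f"Line coverage: {line_rate:.1f}% (threshold {threshold_pct}%)")
    return line_rate >= threshold_pct

# Illustrative report fragment, not real project output
REPORT = '<coverage line-rate="0.73" branch-rate="0.61" version="1.9"></coverage>'

if __name__ == "__main__":
    if not check_line_coverage(REPORT, 70):
        sys.exit(1)  # BLOCKER: non-zero exit fails the gate
```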

### Appendix C — Cross-Reference Map

**Purpose**: Map quality gate topics to related documentation for comprehensive understanding.

| Topic | Primary Document | Related Documents | Notes |
|-------|------------------|-------------------|-------|
| Azure Pipelines | ci-cd/azure-pipelines.md | ci-cd/quality-gates.md, ci-cd/environments.md | Pipeline configuration, stages, templates, deployment strategies |
| Environments | ci-cd/environments.md | ci-cd/azure-pipelines.md, ci-cd/quality-gates.md | Environment-specific thresholds, approvals, configuration management |
| Security & Compliance | platform/security-compliance.md | ci-cd/quality-gates.md, platform/data-residency-retention.md | Security controls, compliance frameworks (SOC 2, GDPR, HIPAA) |
| Testing Strategies | ci-cd/quality-gates.md (Section 15) | implementation/template-integration.md, operations/progressive-rollout.md | Unit, integration, regression tests per environment |
| Observability | operations/observability.md | ci-cd/quality-gates.md (Section 9), ci-cd/environments.md | OpenTelemetry validation, health checks, metrics, tracing |
| SBOM & Supply Chain | ci-cd/quality-gates.md (Section 6) | platform/security-compliance.md, ci-cd/azure-pipelines.md | SBOM generation, provenance, signing, SLSA, supply chain security |
| Code Coverage | ci-cd/quality-gates.md (Section 4) | implementation/template-integration.md | Coverage thresholds, baseline protection, exclusions, per-service config |
| SonarQube | ci-cd/quality-gates.md (Section 3) | implementation/template-integration.md | Static code analysis, quality profiles, Roslyn analyzers |
| Dependency Scanning | ci-cd/quality-gates.md (Section 5) | platform/security-compliance.md | OWASP Dependency-Check, CVE management, suppression workflow |
| Compliance Gates | ci-cd/quality-gates.md (Section 7) | platform/data-residency-retention.md, platform/security-compliance.md | Audit logging, PII redaction, GDPR/HIPAA checklists, data classification |
| Performance Gates | ci-cd/quality-gates.md (Section 8) | operations/progressive-rollout.md, operations/runbook.md | Load testing, chaos engineering, latency/throughput thresholds |
| API Contracts | ci-cd/quality-gates.md (Section 10) | domain/contracts/rest-apis.md, domain/contracts/webhooks.md | OpenAPI breaking change detection, message schema compatibility |
| Approval Gates | ci-cd/quality-gates.md (Section 11) | ci-cd/environments.md, ci-cd/azure-pipelines.md | Manual approvals, CAB process, emergency procedures |
| Deployment Strategies | operations/progressive-rollout.md | ci-cd/azure-pipelines.md, ci-cd/environments.md | Blue-green, canary, rolling deployments |
| Incident Management | operations/runbook.md | ci-cd/quality-gates.md (Section 12), operations/progressive-rollout.md | Rollback procedures, incident response, post-mortems |
| Infrastructure as Code | infrastructure/pulumi.md | ci-cd/environments.md, ci-cd/azure-pipelines.md | Pulumi/Bicep deployment, IaC overlays, drift detection |
| Data Residency | platform/data-residency-retention.md | ci-cd/quality-gates.md (Section 7), platform/security-compliance.md | Data classification, retention policies, GDPR/HIPAA compliance |
| Template Integration | implementation/template-integration.md | ci-cd/azure-pipelines.md, ci-cd/quality-gates.md | ConnectSoft microservice template usage, project structure |
| Development Plan | planning/index.md | planning/status-tracking.md, planning/_baseline-roadmap.md | Epic planning organized by bounded contexts, 30-cycle baseline roadmap |

Summary

- Governance & Continuous Evolution: 9-gate ownership matrix, owner/reviewer responsibilities, change request process (Mermaid diagram), Azure DevOps work item template, 2025 evolution roadmap (4 quarters), continuous improvement framework (monthly cycle, retrospective template)
- Appendix A: Complete quality gate summary matrix (40+ gates with thresholds, enforcement, blocker status, environments, bypass rules)
- Appendix B: Full Azure Pipelines YAML example (~280 lines) demonstrating all gates (build, test coverage, security, SBOM, compliance, performance, observability, contract, approval)
- Appendix C: Cross-reference map (18 topics mapped to primary and related documents across ci-cd, platform, operations, infrastructure, implementation, domain)