Quality Gates — Audit Trail Platform (ATP)¶
Quality by enforcement — ATP pipelines block progression when code quality, test coverage, security, or compliance thresholds are not met.
Purpose & Scope¶
This document defines the comprehensive quality gate framework for the ConnectSoft Audit Trail Platform (ATP), ensuring that every build, deployment, and release meets stringent standards for code quality, security, compliance, and operational excellence.
Purpose¶
Quality gates serve as automated checkpoints in the CI/CD pipeline, preventing low-quality or non-compliant code from progressing to production. ATP's quality gates enforce:
- Build Quality: Code compiles without errors or warnings, adheres to coding standards, and passes static analysis
- Test Coverage: Sufficient unit, integration, and E2E test coverage with 100% pass rates
- Security Posture: Zero critical/high vulnerabilities, no secrets in code, secure dependencies
- Compliance Adherence: SBOM generation, audit logging, PII redaction, regulatory alignment
- Performance Standards: Load tests, chaos tests, and observability validation in staging
- API Contract Stability: No breaking changes without versioning, backward compatibility maintained
By failing fast at each gate, ATP ensures that issues are detected and remediated early in the development lifecycle (shift-left), reducing the cost and risk of defects reaching production.
Scope¶
This document covers:
- Quality Gate Categories: Build, test, security, compliance, performance, observability, API contracts
- Thresholds & Metrics: Specific numeric thresholds per gate type (e.g., ≥70% code coverage, 0 critical CVEs)
- Enforcement Mechanisms: Pipeline configurations, Azure DevOps tasks, custom scripts, approval workflows
- Integration Points: Azure Pipelines stages (CI, staging, production), SonarQube, OWASP Dependency-Check, Trivy, OpenAPI-diff
- Exception Handling: Risk acceptance process, suppression files, time-bound exemptions
- Metrics & Dashboards: Quality gate pass/fail trends, remediation times, DORA metrics alignment
- Governance: Quality gate ownership (RACI), threshold evolution roadmap, retrospective cadence
Gate enforcement applies to:
- All ATP microservices (Ingestion, Query, Integrity, Export, Policy, Search, Gateway)
- Infrastructure as Code (Pulumi C# stacks)
- Shared libraries (ConnectSoft.Audit.Abstractions, ConnectSoft.Observability.OpenTelemetry)
- Database migration scripts (EF Core migrations, SQL scripts)
- CI/CD pipeline templates (ConnectSoft.AzurePipelines)
Out of Scope¶
This document does NOT cover:
- Code review processes: Manual peer review workflows, pull request templates, approval policies (see development/code-review-guidelines.md)
- Incident response: Post-production issue handling, on-call procedures (see operations/runbook.md)
- Deployment strategies: Blue-green, canary, rolling deployment mechanics (see ci-cd/azure-pipelines.md)
- Environment configuration: Environment-specific settings, secrets management (see ci-cd/environments.md)
- Architecture decisions: Why specific quality thresholds were chosen (see ADRs in adrs/)
Readers & Ownership¶
Primary Audience:
- Development Teams: Understand quality requirements, fix gate violations, write testable code
- QA Engineers: Define test coverage thresholds, maintain test suites, analyze flaky tests
- Security Officers: Set vulnerability thresholds, review suppressions, validate compliance gates
- SRE Teams: Monitor performance gates, chaos test results, observability validation
- Platform Engineers: Configure pipeline gates, maintain enforcement tooling, update thresholds
Document Ownership:
- Author: Platform Engineering Team
- Reviewers: Tech Lead (Build/Test Gates), Security Officer (Security/Compliance Gates), SRE Lead (Performance/Observability Gates)
- Approver: Lead Architect
- Maintenance Cadence: Quarterly review with threshold adjustments based on maturity
Artifacts Produced by Quality Gates¶
Quality gates generate compliance artifacts that serve as evidence for audits (SOC 2, GDPR, HIPAA):
- Build Artifacts:
  - Compiled binaries with version stamps
  - NuGet packages (`.nupkg`, `.snupkg` symbol packages)
  - Docker images with signed provenance (Cosign signatures)
- Test Evidence:
  - Test result files (`.trx`, JUnit XML)
  - Code coverage reports (Cobertura XML, HTML)
  - Load test results (JMeter `.jtl`, HTML reports)
  - Chaos test results (JSON, logs)
- Security Artifacts:
  - SBOM (Software Bill of Materials) in CycloneDX/SPDX format
  - Vulnerability scan reports (OWASP Dependency-Check JSON/HTML)
  - Container scan results (Trivy JSON)
  - Secrets scan reports (CredScan, GitGuardian)
- Compliance Evidence:
  - Audit logging validation results
  - PII redaction verification reports
  - Compliance checklist attestations (GDPR, HIPAA)
  - API contract diff reports (OpenAPI breaking changes)
- Performance Metrics:
  - Load test metrics (p50/p95/p99 latency, throughput, error rates)
  - Chaos test pass/fail results per scenario
  - Health check validation logs
- Approval Records:
  - Manual approval logs (approver identity, timestamp, justification)
  - Change Advisory Board (CAB) meeting minutes
  - Risk acceptance documentation (for suppressed vulnerabilities)
All artifacts are retained for 7 years (aligned with ATP's retention policy) and stored in immutable Azure Blob Storage with legal hold enabled for production builds.
Acceptance Criteria¶
This document is considered complete and accurate when:
- All gate types are documented with clear thresholds, enforcement mechanisms, and examples
- Integration with Azure Pipelines is fully described with working YAML configurations
- Exception handling processes are defined with suppression file formats and approval workflows
- Metrics and dashboards are specified with Azure DevOps dashboard configurations and KQL queries
- Cross-references to related ATP documentation (environments, pipelines, security) are complete
- Code examples are production-ready and tested in ATP pipelines
- Governance model defines RACI, ownership, and evolution roadmap
Success Metrics:
- ≥95% of builds pass all quality gates on first attempt (target by Q2 2025)
- ≤24 hours median time to remediate gate violations (target by Q2 2025)
- Zero critical/high vulnerabilities in production (maintained since inception)
- 100% of production builds have complete compliance artifacts (maintained)
Document Conventions¶
Symbols used:
- ✅ Blocker gate: Pipeline fails immediately if threshold not met
- ⚠️ Warning gate: Issue logged but pipeline continues (for non-critical metrics)
- ℹ️ Informational: Metric tracked but no enforcement action
- ❌ Action required: Gate failure requires developer intervention
Threshold notation:
- `≥70%`: Greater than or equal to 70%
- `<500ms`: Less than 500 milliseconds
- `0`: Exactly zero (no tolerance)
Code block types:
- `yaml`: Azure Pipelines YAML configurations
- `csharp`: C# code examples (quality validation logic)
- `powershell`: PowerShell scripts (validation, remediation)
- `bash`: Bash scripts (Linux-based gates)
- `xml`: Suppression files, .csproj configurations
- `json`: SonarQube quality profiles, SBOM formats
Quality Gate Philosophy¶
ATP's quality gate framework is built on five core principles:
1. Shift-Left: Detect Issues Early¶
Principle: Identify and fix defects as early as possible in the development lifecycle to minimize cost and risk.
Implementation:
- Pre-commit hooks: Lint checks, unit tests run locally before commit
- PR validation: Automated PR builds run all CI gates (build, test, security)
- Branch policies: Require build validation, code coverage, code review before merge
- Fast feedback: CI gates complete in <10 minutes; developers notified immediately on failure
Benefits:
- 10x cheaper to fix bugs in development vs. production
- Faster development velocity (fewer context switches for late-stage fixes)
- Higher developer confidence (code is validated before merge)
ATP Example:
#!/bin/bash
# Pre-commit hook (.git/hooks/pre-commit)
echo "Running pre-commit quality checks..."
failed=0
# Lint check
dotnet format --verify-no-changes --severity error || failed=1
# Unit tests (fast subset)
dotnet test --filter Category=Unit || failed=1
if [ $failed -ne 0 ]; then
  echo "❌ Pre-commit checks failed. Fix issues before committing."
  exit 1
fi
echo "✅ Pre-commit checks passed"
2. Fail Fast: Block Progression Immediately¶
Principle: Halt the pipeline as soon as a quality gate fails; do not continue to subsequent stages.
Implementation:
- Exit code enforcement: All gate tasks return non-zero exit codes on failure
- Pipeline stage dependencies: Subsequent stages depend on prior stage success (`dependsOn: CI_Stage`, `condition: succeeded()`)
- No silent warnings: All warnings treated as errors in build configuration (`<TreatWarningsAsErrors>true</TreatWarningsAsErrors>`)
- Immediate notifications: Developers notified via email/Slack within 1 minute of gate failure
Benefits:
- Prevents compounding issues (e.g., deploying broken code to staging)
- Conserves CI/CD resources (no unnecessary downstream tasks)
- Clear accountability (developer knows immediately which gate failed)
ATP Example:
# Azure Pipelines stage with fail-fast behavior
stages:
- stage: CI_Stage
jobs:
- job: Build_Test_Scan
steps:
- task: DotNetCoreCLI@2
inputs:
command: 'build'
arguments: '--configuration Release /p:TreatWarningsAsErrors=true'
displayName: 'Build with Warnings as Errors'
# If build fails here, pipeline stops immediately
- task: BuildQualityChecks@8
inputs:
checkCoverage: true
coverageFailOption: 'fixed'
coverageThreshold: 70
displayName: 'Enforce Code Coverage Gate'
# If coverage < 70%, pipeline stops immediately
- stage: Deploy_Staging
dependsOn: CI_Stage
condition: succeeded() # Only runs if CI_Stage passed
3. Transparent Feedback: Clear Error Messages with Remediation¶
Principle: When a gate fails, provide actionable feedback with clear error messages and remediation steps.
Implementation:
- Descriptive error messages: Include gate name, actual vs. expected value, remediation steps
- Links to documentation: Error messages include URLs to relevant docs (e.g., coverage guide, security best practices)
- Suggested fixes: Where possible, provide copy-paste commands to fix issues (e.g., `dotnet format`, `dotnet add package`)
- Trend analysis: Show gate pass/fail trends to identify recurring issues
Benefits:
- Faster remediation (developers don't waste time diagnosing issues)
- Self-service resolution (reduced dependency on platform team)
- Improved developer experience (clear guidance vs. cryptic errors)
ATP Example:
// Custom quality gate validation with clear feedback
public class CodeCoverageGateValidator
{
public ValidationResult ValidateCoverage(CoverageReport report, double threshold)
{
var lineCoverage = report.LineRate * 100;
if (lineCoverage < threshold)
{
return new ValidationResult
{
Passed = false,
ErrorMessage = $@"
❌ Code Coverage Gate Failed
Required: {threshold}% line coverage
Actual: {lineCoverage:F1}% line coverage
Deficit: {threshold - lineCoverage:F1}% ({report.UncoveredLines} lines not covered)
📋 Remediation Steps:
1. Identify uncovered code: dotnet reportgenerator -reports:coverage.xml -targetdir:coverage-report
2. Add unit tests for critical paths (Controllers, Services, Validators)
3. Re-run tests: dotnet test --collect:'XPlat Code Coverage'
4. Verify coverage: View HTML report in coverage-report/index.html
📚 Documentation: https://docs.connectsoft.example/quality-gates/coverage
🔍 Uncovered Files (Top 5):
{string.Join("\n", report.GetUncoveredFiles().Take(5).Select(f => $" - {f.Name}: {f.Coverage:F1}% covered"))}
",
Severity = ValidationSeverity.Error,
Category = "Test Coverage"
};
}
return ValidationResult.Success($"✅ Code coverage: {lineCoverage:F1}% (threshold: {threshold}%)");
}
}
4. Continuous Improvement: Ratchet Thresholds Upward¶
Principle: Never lower quality standards; continuously improve thresholds based on team maturity and historical performance.
Implementation:
- Quarterly reviews: Evaluate gate pass rates, remediation times, and threshold appropriateness
- Baseline protection: Track coverage/quality metrics per build; alert if metrics regress
- Incremental increases: Raise thresholds by 2-5% per quarter if sustained above target
- Zero tolerance for critical issues: Security gates (critical CVEs, secrets) have no grandfathering
Benefits:
- Prevents quality erosion over time (no "technical debt drift")
- Incentivizes proactive quality improvements (teams strive to exceed thresholds)
- Data-driven threshold evolution (based on actual team capability)
ATP Example:
# Threshold evolution tracking (quality-gate-history.yml)
coverageThresholds:
- effectiveDate: 2024-01-01
threshold: 65%
rationale: Initial baseline for ATP launch
- effectiveDate: 2024-04-01
threshold: 68%
rationale: Q1 2024 sustained at 72% avg; raised by 3%
- effectiveDate: 2024-07-01
threshold: 70%
rationale: Q2 2024 sustained at 75% avg; raised to 70% (industry standard)
- effectiveDate: 2025-01-01
threshold: 73%
rationale: Q4 2024 sustained at 78% avg; raised by 3%
approvedBy: Lead Architect
adrReference: ADR-042-coverage-threshold-increase
# Automated check: Prevent lowering thresholds
- script: |
CURRENT=$(grep -m1 'coverageThreshold' azure-pipelines.yml | awk '{print $2}')
PREVIOUS=$(git show HEAD~1:azure-pipelines.yml | grep -m1 'coverageThreshold' | awk '{print $2}')
if (( $(echo "$CURRENT < $PREVIOUS" | bc -l) )); then
echo "❌ Error: Coverage threshold lowered from $PREVIOUS to $CURRENT"
echo " Quality thresholds can only be raised, never lowered."
exit 1
fi
displayName: 'Validate Threshold Ratcheting'
5. Evidence-Based: All Gate Results Logged as Compliance Artifacts¶
Principle: Every quality gate execution produces auditable evidence that is retained for compliance, security audits, and retrospectives.
Implementation:
- Artifact publishing: All gate results (test reports, scan results, SBOM) published to Azure Artifacts
- Immutable storage: Compliance artifacts stored in Azure Blob Storage with WORM (Write Once Read Many) enabled
- Metadata tagging: Artifacts tagged with build ID, commit SHA, approver identities, gate pass/fail status
- Retention enforcement: 7-year retention for production builds; 90 days for dev/test builds
Benefits:
- SOC 2/GDPR/HIPAA audit readiness (evidence available on-demand)
- Forensic analysis of production incidents (trace back to build artifacts)
- Quality trend analysis (identify gate failure patterns over time)
ATP Example:
# Publish compliance artifacts with metadata
- task: PublishBuildArtifacts@1
inputs:
PathtoPublish: '$(Build.ArtifactStagingDirectory)/compliance'
ArtifactName: 'compliance-evidence-$(Build.BuildNumber)'
displayName: 'Publish Compliance Evidence'
- task: AzureCLI@2
inputs:
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
# Upload to immutable blob storage with metadata
az storage blob upload-batch \
--source "$(Build.ArtifactStagingDirectory)/compliance" \
--destination compliance-artifacts \
--account-name atpcomplianceblob \
--metadata \
BuildId=$(Build.BuildId) \
CommitSha=$(Build.SourceVersion) \
Pipeline=$(Build.DefinitionName) \
Environment=Production \
RetentionYears=7 \
GatesPassed=true
# Enable legal hold (immutability)
az storage blob set-legal-hold \
--container compliance-artifacts \
--account-name atpcomplianceblob \
--tags audit-evidence=true
displayName: 'Archive Evidence with Legal Hold'
Summary¶
- Purpose: Quality gates enforce code quality, security, compliance, and performance standards at every CI/CD stage
- Scope: All ATP microservices, IaC, shared libraries, database migrations; covers build, test, security, compliance, performance, observability, API contract gates
- Out of Scope: Code review processes, incident response, deployment strategies, environment config (covered in other documents)
- Artifacts: Test results, coverage reports, SBOM, vulnerability scans, compliance evidence (retained 7 years for production)
- Philosophy: 5 core principles (Shift-Left, Fail Fast, Transparent Feedback, Continuous Improvement, Evidence-Based)
- Ownership: Platform Engineering (author), Tech Lead/Security/SRE (reviewers), Lead Architect (approver), Quarterly review cadence
- Success Metrics: ≥95% first-attempt pass rate, ≤24h remediation time, zero critical/high CVEs in production, 100% artifact completeness
Gate Types Overview¶
ATP implements six primary quality gate categories, each enforced at different stages of the CI/CD pipeline. Gates are cumulative: a build must pass all prior gates before progressing to subsequent stages.
Gate Execution Flow¶
graph LR
A[Code Commit] --> B[Build Quality Gates]
B --> C[Test Coverage Gates]
C --> D[Security Gates]
D --> E[Compliance Gates]
E --> F[CI Artifacts Published]
F --> G[Deploy to Staging]
G --> H[Performance Gates]
H --> I[Observability Gates]
I --> J[Deploy to Production]
B -.->|Fail| K[Pipeline Stopped]
C -.->|Fail| K
D -.->|Fail| K
E -.->|Fail| K
H -.->|Fail| L[Block Prod Deploy]
I -.->|Fail| L
style K fill:#ff6b6b
style L fill:#feca57
Key Characteristics:
- Sequential Execution: Gates execute in order (Build → Test → Security → Compliance → Performance → Observability)
- Early Termination: First gate failure stops the pipeline immediately (fail fast)
- Stage-Specific: Some gates only run in specific environments (e.g., performance gates in staging)
- Artifact Dependencies: Later gates may analyze artifacts from earlier gates (e.g., SBOM from build stage)
Gate Category Summary¶
| Gate Category | Purpose | Enforcement Point | Blocker | Typical Duration | Owner |
|---|---|---|---|---|---|
| Build Quality | Code compiles, no warnings, static analysis passes | CI stage (every commit) | ✅ Yes | 2-4 minutes | Tech Lead |
| Test Coverage | Sufficient unit/integration tests, 100% pass rate | CI stage (every commit) | ✅ Yes | 3-5 minutes | QA Lead |
| Security | No vulnerabilities, secrets detected, images scanned | CI + pre-deploy | ✅ Yes | 5-8 minutes | Security Officer |
| Compliance | SBOM generated, audit logging validated, PII redacted | CI + pre-deploy | ✅ Yes | 2-3 minutes | Compliance Officer |
| Performance | Load/chaos tests pass, latency/error thresholds met | Staging (pre-prod) | ⚠️ Warning (blocks prod) | 10-15 minutes | SRE Lead |
| Observability | Metrics, traces, logs validated, health checks pass | Staging (pre-prod) | ⚠️ Warning (blocks prod) | 3-5 minutes | SRE Lead |
Total CI Pipeline Duration (Build → Compliance): ~15-20 minutes
Total Staging Validation (Performance + Observability): ~13-20 minutes
End-to-End (Commit → Production-Ready): ~30-40 minutes
Gate Category 1: Build Quality¶
Purpose: Ensure code compiles successfully, adheres to coding standards, and passes static analysis before any further validation.
Enforcement Point: CI stage (triggered on every commit, PR, and main branch build)
Blocker Status: ✅ Yes — Pipeline fails immediately if build quality gates fail
Key Checks:
- Compilation: Code builds without errors using `dotnet build`
- Warnings as Errors: Zero warnings (all warnings treated as errors)
- Static Analysis: StyleCop, SonarQube, Meziantou, AsyncFixer rules enforced
- Code Style: EditorConfig rules, naming conventions, documentation comments
- Deprecated APIs: No usage of deprecated packages or APIs
Typical Failures:
- Compilation errors (syntax, missing references, type mismatches)
- Code style violations (naming, spacing, documentation)
- Static analysis issues (async/await patterns, nullability, resource disposal)
- Deprecated package usage (packages with known CVEs or EOL status)
Remediation Time: ≤30 minutes (most failures fixed by developers immediately)
Automation:
# Azure Pipelines: Build Quality Gate
- task: SonarQubePrepare@5
  inputs:
    SonarQube: 'SonarCloud-ConnectSoft'
    scannerMode: 'MSBuild'
    projectKey: 'ConnectSoft_ATP'
  displayName: 'Prepare SonarQube Analysis'  # must precede the build in MSBuild scanner mode
- task: DotNetCoreCLI@2
  inputs:
    command: 'build'
    projects: '$(solution)'
    arguments: '--configuration Release /p:TreatWarningsAsErrors=true /p:EnforceCodeStyleInBuild=true'
  displayName: 'Build with Warnings as Errors'
- task: SonarQubeAnalyze@5
  displayName: 'SonarQube Analysis'
- task: SonarQubePublish@5
  inputs:
    pollingTimeoutSec: '300'
  displayName: 'Publish SonarQube Quality Gate Result'
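The `/p:` switches passed to `dotnet build` can also be pinned repo-wide so local builds enforce the same settings as the pipeline. A minimal sketch of a `Directory.Build.props` at the repository root — the property names are standard MSBuild/.NET SDK settings, but the analyzer level shown is an illustrative assumption, not ATP's actual configuration:

```xml
<!-- Directory.Build.props — applies to every project under the repo root -->
<Project>
  <PropertyGroup>
    <!-- Fail the build on any compiler warning, matching the pipeline gate -->
    <TreatWarningsAsErrors>true</TreatWarningsAsErrors>
    <!-- Enforce .editorconfig code-style rules at build time -->
    <EnforceCodeStyleInBuild>true</EnforceCodeStyleInBuild>
    <!-- Enable built-in .NET analyzers (assumed level; adjust per team policy) -->
    <EnableNETAnalyzers>true</EnableNETAnalyzers>
    <AnalysisLevel>latest-recommended</AnalysisLevel>
  </PropertyGroup>
</Project>
```

With this in place, the pipeline arguments become a redundant safety net rather than the only enforcement point.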
Gate Category 2: Test Coverage¶
Purpose: Validate that sufficient automated tests exist and execute successfully, with adequate code coverage.
Enforcement Point: CI stage (after successful build)
Blocker Status: ✅ Yes — Pipeline fails if coverage thresholds not met or tests fail
Key Checks:
- Test Execution: All unit and integration tests pass (100% pass rate)
- Code Coverage: Line coverage ≥70%, branch coverage ≥60% (service-specific thresholds)
- Test Duration: Tests complete within acceptable time limits (unit <30s, integration <5min)
- Flaky Test Detection: No tests with <95% historical pass rate
- Test Quality: Minimum assertions per test, no skipped tests without justification
Typical Failures:
- Test failures due to bugs introduced in code changes
- Coverage drops below threshold (new code without tests)
- Flaky tests (intermittent failures due to timing, external dependencies)
- Test timeouts (long-running tests, deadlocks, infinite loops)
Remediation Time: ≤2 hours (write missing tests, fix failing tests)
Service-Specific Thresholds:
| Service | Line Coverage | Branch Coverage | Rationale |
|---|---|---|---|
| Ingestion | ≥75% | ≥65% | Critical path for all audit events; high reliability requirement |
| Query | ≥80% | ≥70% | Complex query logic with multiple filters; high test coverage essential |
| Integrity | ≥85% | ≥75% | Security-critical tamper-evidence; highest coverage requirement |
| Export | ≥70% | ≥60% | I/O-heavy with external dependencies; lower threshold acceptable |
| Policy | ≥80% | ≥70% | Business rules enforcement; high coverage for rule validation |
| Search | ≥70% | ≥60% | Integration-heavy with Elasticsearch; focus on integration tests |
| Gateway | ≥65% | ≥55% | API routing and orchestration; lower threshold, focus on E2E tests |
Automation:
# Azure Pipelines: Test Coverage Gate
- task: DotNetCoreCLI@2
inputs:
command: 'test'
projects: '**/*Tests.csproj'
arguments: '--configuration Release --collect:"XPlat Code Coverage" --settings:CodeCoverage.runsettings'
displayName: 'Run Tests with Coverage'
- task: PublishCodeCoverageResults@1
inputs:
codeCoverageTool: 'Cobertura'
summaryFileLocation: '$(Agent.TempDirectory)/**/coverage.cobertura.xml'
displayName: 'Publish Coverage Results'
- task: BuildQualityChecks@8
inputs:
checkCoverage: true
coverageFailOption: 'fixed'
coverageType: 'lines'
coverageThreshold: 70 # ATP minimum baseline
treatBuildWarningsAsErrors: true
displayName: 'Enforce Coverage Threshold'
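For local troubleshooting before pushing, the line-rate comparison that the coverage gate applies can be reproduced against the Cobertura file with a few lines of shell. A hedged sketch (function name and file path are illustrative, not part of ATP's tooling):

```shell
# check_coverage: compare the Cobertura line-rate (a 0..1 fraction on the root
# <coverage> element) in a file against a percentage threshold.
# usage: check_coverage <cobertura-xml> <threshold-percent>; returns 0 on pass
check_coverage() {
  file="$1"; threshold="$2"
  # Extract the first line-rate attribute from the report
  rate=$(grep -o 'line-rate="[0-9.]*"' "$file" | head -n1 | cut -d'"' -f2)
  pct=$(awk -v r="$rate" 'BEGIN { printf "%.1f", r * 100 }')
  if awk -v p="$pct" -v t="$threshold" 'BEGIN { exit !(p < t) }'; then
    echo "FAIL: line coverage ${pct}% is below threshold ${threshold}%"
    return 1
  fi
  echo "PASS: line coverage ${pct}% meets threshold ${threshold}%"
}
```

Run it against the file produced by `dotnet test --collect:"XPlat Code Coverage"`, e.g. `check_coverage coverage.cobertura.xml 70`.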
Gate Category 3: Security¶
Purpose: Detect vulnerabilities, secrets, and security issues before code reaches production.
Enforcement Point: CI stage (after tests) + Pre-deployment validation (staging/prod)
Blocker Status: ✅ Yes — Pipeline fails on critical/high vulnerabilities or detected secrets
Key Checks:
- Dependency Scanning: OWASP Dependency-Check for vulnerable NuGet packages (CVSS ≥7)
- Secrets Detection: CredScan, GitGuardian for API keys, passwords, tokens in code
- Container Scanning: Trivy scan for Docker image vulnerabilities (before ACR push)
- SAST (Static Application Security Testing): SonarQube security rules (injection, XSS, crypto)
- License Compliance: Verify all dependencies have acceptable licenses (no GPL/AGPL)
Typical Failures:
- Vulnerable dependencies (outdated packages with known CVEs)
- Secrets in code (connection strings, API keys, passwords in appsettings.json or code)
- Container image vulnerabilities (base image outdated, vulnerable OS packages)
- Insecure coding patterns (SQL injection, XSS, weak crypto)
Remediation Time: ≤24 hours (critical/high), ≤30 days (medium/low)
Severity Thresholds:
| Severity | CVSS Score | Action | SLA | Production Blocker |
|---|---|---|---|---|
| Critical | 9.0-10.0 | ❌ Block build immediately | Fix within 24h | ✅ Yes |
| High | 7.0-8.9 | ❌ Block build; require patching or risk acceptance | Fix within 7 days | ✅ Yes |
| Medium | 4.0-6.9 | ⚠️ Warning; track in security backlog | Fix within 30 days | ❌ No |
| Low | 0.1-3.9 | ℹ️ Info; track in backlog | Fix in next major release | ❌ No |
| None | 0.0 | ℹ️ Info; no action required | N/A | ❌ No |
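The severity bands above can be encoded directly in gate scripts. A small sketch mapping a CVSS score to the table's severity label (the function name is illustrative):

```shell
# cvss_severity: map a CVSS 0.0-10.0 score to the severity bands in the table above
cvss_severity() {
  awk -v s="$1" 'BEGIN {
    if      (s >= 9.0) print "Critical"   # block build; fix within 24h
    else if (s >= 7.0) print "High"       # block build; fix within 7 days
    else if (s >= 4.0) print "Medium"     # warning; fix within 30 days
    else if (s >= 0.1) print "Low"        # informational
    else               print "None"
  }'
}
```

This is the same boundary that `failOnCVSS: 7` draws in the Dependency-Check task below: everything the function labels Critical or High blocks the build.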
Automation:
# Azure Pipelines: Security Gates
- task: dependency-check-build-task@6
inputs:
projectName: 'ConnectSoft.ATP.Ingestion'
scanPath: '$(Build.SourcesDirectory)'
format: 'HTML,JSON,XML'
failOnCVSS: 7 # Block on High/Critical (CVSS ≥7)
suppressionFile: 'dependency-check-suppressions.xml'
displayName: 'OWASP Dependency Scan'
- task: CredScan@3
inputs:
toolMajorVersion: 'V2'
suppressionsFile: 'credscan-suppressions.json'
displayName: 'Secrets Detection'
- script: |
trivy image --severity HIGH,CRITICAL --exit-code 1 \
$(containerRegistry)/$(imageRepository):$(Build.BuildNumber)
displayName: 'Trivy Container Scan'
condition: and(succeeded(), eq(variables['Build.Reason'], 'PullRequest'))
Gate Category 4: Compliance¶
Purpose: Ensure regulatory compliance, audit logging, PII protection, and supply chain transparency.
Enforcement Point: CI stage (after security) + Pre-deployment validation
Blocker Status: ✅ Yes — Pipeline fails if SBOM missing, audit logging incomplete, or PII detected in logs
Key Checks:
- SBOM Generation: CycloneDX/SPDX bill of materials for all dependencies
- Audit Logging Validation: All state-mutating operations emit audit events
- PII Redaction: No raw PII (email, phone, SSN) in log statements
- Compliance Checklist: GDPR/HIPAA safeguards validated (encryption, retention, tenant isolation)
- License Compliance: All dependencies have acceptable licenses (no copyleft in production)
Typical Failures:
- Missing SBOM (build artifact not generated or published)
- Audit logging gaps (new API endpoints without audit event emission)
- PII in logs (raw email/phone logged without redaction)
- Compliance checklist items incomplete (e.g., retention policies not configured)
Remediation Time: ≤4 hours (SBOM/logging), ≤1 day (PII redaction), ≤1 week (compliance checklist)
GDPR/HIPAA Compliance Checklist:
| Control | Requirement | Validation | Blocker |
|---|---|---|---|
| Encryption at Rest | All databases, storage accounts encrypted (TDE, SSE) | Azure Policy scan | ✅ Yes (staging/prod) |
| Encryption in Transit | TLS 1.3 enforced for all external APIs | Network policy validation | ✅ Yes (prod) |
| Tenant Isolation | Multi-tenant data separation validated in integration tests | Test results (tag: @tenantIsolation) | ✅ Yes |
| Retention Policies | Configurable retention per tenant (7 years default) | Configuration validation | ✅ Yes |
| DSAR Workflow | Data Subject Access Request workflow implemented | API contract test (export endpoint) | ✅ Yes |
| Breach Notification | Incident response procedure documented | Document exists in repo | ⚠️ Warning |
| Audit Logging | All write operations emit audit events | Custom validator (IAuditLogger.LogAsync) | ✅ Yes |
| PII Redaction | Sensitive fields redacted in logs/telemetry | Custom validator (log parsing) | ✅ Yes |
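The `validate-pii-redaction.ps1` script referenced below is internal to ATP; as an illustration only, a check of this kind can be approximated by grepping log statements for raw PII-shaped literals. A hedged shell sketch — the regex and the set of logging methods are simplified assumptions, not the production rules:

```shell
# scan_pii_in_logs: flag source lines that pass what looks like a raw email
# address to a logging call — a crude approximation of a PII-redaction gate.
scan_pii_in_logs() {
  dir="$1"
  # Look for Log* invocations whose arguments contain an email-shaped literal
  grep -rnE 'Log(Information|Warning|Error|Debug)\(.*[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' \
    --include='*.cs' "$dir" || true
}
```

A real gate would also cover phone/SSN patterns and exclude test fixtures; structured-logging calls that pass only placeholders (`{UserId}`) are not flagged.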
Automation:
# Azure Pipelines: Compliance Gates
- task: CycloneDX@1
inputs:
projectPath: '$(Build.SourcesDirectory)'
outputFormat: 'json,xml'
outputPath: '$(Build.ArtifactStagingDirectory)/sbom'
displayName: 'Generate SBOM (CycloneDX)'
- pwsh: |
    ./scripts/validate-audit-logging.ps1 -Path "$(Build.SourcesDirectory)" -Threshold 100
  displayName: 'Validate Audit Logging Coverage'
- pwsh: |
    ./scripts/validate-pii-redaction.ps1 -Path "$(Build.SourcesDirectory)"
  displayName: 'Validate PII Redaction'
- task: AzurePolicyCompliance@1
inputs:
azureSubscription: '$(azureSubscription)'
resourceGroup: 'ATP-$(Environment)-RG'
policyDefinitionId: '/providers/Microsoft.Authorization/policyDefinitions/...'
displayName: 'Validate Azure Policy Compliance'
Gate Category 5: Performance¶
Purpose: Validate that application meets performance requirements under load and during failures.
Enforcement Point: Staging environment (before production deployment)
Blocker Status: ⚠️ Warning in staging, ✅ Blocker for production deployment
Key Checks:
- Load Testing: Simulate production traffic (500-1000 concurrent users, 10-15 minutes)
- Latency Thresholds: p50 <100ms, p95 <500ms, p99 <1000ms
- Error Rate: <0.1% (1 error per 1000 requests)
- Throughput: ≥1000 requests/second sustained
- Chaos Testing: Pod restarts, network latency, storage unavailability scenarios
Typical Failures:
- High latency due to inefficient queries, N+1 problems, missing indexes
- High error rate due to race conditions, deadlocks, resource exhaustion
- Low throughput due to synchronous I/O, single-threaded bottlenecks
- Chaos test failures due to lack of retries, circuit breakers, graceful degradation
Remediation Time: ≤1 week (performance optimization), ≤2 weeks (resilience improvements)
Performance Metrics:
| Metric | Target (ATP) | Industry Standard | Measurement Tool | Action on Failure |
|---|---|---|---|---|
| p50 Latency | <100ms | <200ms | JMeter, k6 | ⚠️ Warning; investigate |
| p95 Latency | <500ms | <1000ms | JMeter, k6 | ❌ Block prod deployment |
| p99 Latency | <1000ms | <2000ms | JMeter, k6 | ⚠️ Warning; track |
| Error Rate | <0.1% | <1% | JMeter, k6 | ❌ Block prod deployment |
| Throughput | ≥1000 RPS | ≥500 RPS | JMeter, k6 | ℹ️ Info; track capacity |
| CPU Utilization | <70% avg | <80% avg | Azure Monitor | ⚠️ Warning; optimize |
| Memory Utilization | <80% avg | <85% avg | Azure Monitor | ⚠️ Warning; investigate leaks |
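The production-blocking rows of the table reduce to two comparisons. A sketch of that evaluation, assuming the p95 latency and error-rate figures have already been extracted from the load-test report (the function name is illustrative; thresholds mirror the table):

```shell
# perf_gate: evaluate p95 latency (ms) and error rate (%) against the
# prod-blocking thresholds above; prints a verdict and returns non-zero on block.
perf_gate() {
  p95_ms="$1"; err_pct="$2"
  blocked=0
  awk -v v="$p95_ms" 'BEGIN { exit !(v < 500) }'  || { echo "BLOCK: p95 ${p95_ms}ms >= 500ms"; blocked=1; }
  awk -v v="$err_pct" 'BEGIN { exit !(v < 0.1) }' || { echo "BLOCK: error rate ${err_pct}% >= 0.1%"; blocked=1; }
  [ "$blocked" -eq 0 ] && echo "PASS: p95 ${p95_ms}ms, error rate ${err_pct}%"
  return "$blocked"
}
```

The p50/p99, throughput, and utilization rows would be tracked the same way but only warn rather than block.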
Chaos Test Scenarios:
| Scenario | Pass Rate | Blocker | Expected Behavior |
|---|---|---|---|
| Pod Restart (random pod killed) | 100% | ✅ Yes | Graceful shutdown, requests redistributed, no data loss |
| Network Latency (500ms added) | 95% | ❌ No | Timeouts honored, retries triggered, circuit breaker opens |
| Storage Unavailable (SQL/Blob down 30s) | 100% | ✅ Yes | Circuit breaker opens, degraded mode, no cascading failures |
| CPU Throttle (50% CPU limit) | 90% | ❌ No | Graceful degradation, autoscaling triggered, no OOM kills |
| Memory Pressure (80% memory used) | 95% | ❌ No | GC triggered, cache eviction, no OOM exceptions |
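Per-scenario pass rates like those above are computed from repeated runs. A hedged sketch assuming a simple `scenario,result` CSV — the format is an assumption for illustration; ATP's actual chaos results are JSON:

```shell
# chaos_pass_rate: percentage of PASS runs for one scenario in a
# "scenario,result" CSV (one run per line); prints e.g. "95.0"
chaos_pass_rate() {
  file="$1"; scenario="$2"
  awk -F',' -v s="$scenario" '
    $1 == s { total++; if ($2 == "PASS") passed++ }
    END { if (total == 0) { print "0.0"; exit }
          printf "%.1f", passed / total * 100 }' "$file"
}
```

The gate then compares the result against the scenario's required rate (100% for blockers such as Pod Restart, 95% for Network Latency).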
Automation:
# Azure Pipelines: Performance Gates (Staging)
- task: JMeterLoadTest@1
inputs:
testPlan: 'load-tests/atp-load-test.jmx'
targetUrl: '$(StagingUrl)'
users: 500
duration: 600 # 10 minutes
thresholdP50: 100
thresholdP95: 500
thresholdErrorRate: 0.1
displayName: 'Run Load Test (JMeter)'
- task: ChaosTest@1
inputs:
chaosManifest: 'chaos-tests/pod-restart.yaml'
namespace: 'atp-staging'
duration: 300 # 5 minutes
displayName: 'Run Chaos Test (Pod Restart)'
Gate Category 6: Observability¶
Purpose: Validate that application emits sufficient telemetry (logs, metrics, traces) for production observability.
Enforcement Point: Staging environment (before production deployment)
Blocker Status: ⚠️ Warning in staging, ✅ Blocker for production deployment
Key Checks:
- OpenTelemetry Instrumentation: All HTTP endpoints, database calls, message bus operations instrumented
- Health Checks: `/health/live` and `/health/ready` return 200 OK
- Structured Logging: All logs use structured logging (JSON), no string concatenation
- Custom Metrics: Business KPIs exposed (audit events ingested, queries executed, export jobs completed)
- Trace Context Propagation: Trace IDs propagated across service boundaries (W3C Trace Context)
Typical Failures:
- Missing instrumentation (new endpoints without activity spans)
- Health check failures (dependency checks fail, timeouts)
- Unstructured logs (string concatenation, missing correlation IDs)
- Missing metrics (business KPIs not exposed for Prometheus scraping)
Remediation Time: ≤4 hours (instrumentation), ≤1 day (health checks), ≤2 hours (structured logging)
Observability Requirements:
| Requirement | Validation | Tool | Blocker |
|---|---|---|---|
| Activity Spans | All HTTP endpoints have `Activity` spans | Custom validator (DI container scan) | ✅ Yes |
| Database Instrumentation | All EF Core queries instrumented | `System.Diagnostics.DiagnosticSource` listener | ✅ Yes |
| Structured Logging | All logs use `ILogger<T>` with structured parameters | Custom log parser | ✅ Yes |
| Health Checks | `/health/live` and `/health/ready` return 200 | HTTP test task | ✅ Yes |
| Prometheus Metrics | `/metrics` endpoint exposed and scrapable | Prometheus validation | ⚠️ Warning |
| Trace Context | TraceParent header propagated (W3C) | Integration test validation | ✅ Yes |
Health Check Components (must all pass):
// Health check dependencies (all must be healthy)
public static class HealthCheckExtensions
{
public static IHealthChecksBuilder AddAtpHealthChecks(
this IHealthChecksBuilder builder,
IConfiguration configuration)
{
return builder
// Liveness checks (process is alive)
.AddCheck("self", () => HealthCheckResult.Healthy("Service is running"))
// Readiness checks (dependencies available)
.AddSqlServer(
connectionString: configuration.GetConnectionString("DefaultConnection"),
name: "sql-server",
tags: new[] { "ready", "database" })
.AddRedis(
redisConnectionString: configuration.GetConnectionString("Redis"),
name: "redis-cache",
tags: new[] { "ready", "cache" })
.AddAzureServiceBusTopic(
connectionString: configuration.GetConnectionString("ServiceBus"),
topicName: "audit-events",
name: "service-bus",
tags: new[] { "ready", "messaging" })
.AddAzureBlobStorage(
connectionString: configuration.GetConnectionString("BlobStorage"),
containerName: "audit-attachments",
name: "blob-storage",
tags: new[] { "ready", "storage" })
.AddApplicationInsightsPublisher();
}
}
Automation:
# Azure Pipelines: Observability Gates (Staging)
- task: HttpTest@1
inputs:
url: '$(StagingUrl)/health/ready'
method: 'GET'
expectedStatusCode: 200
retryCount: 3
retryDelay: 5
displayName: 'Validate Health Checks'
- pwsh: |
./scripts/validate-otel-instrumentation.ps1 -Path "$(Build.SourcesDirectory)"
displayName: 'Validate OpenTelemetry Instrumentation'
- script: |
curl -s "$(StagingUrl)/metrics" | promtool check metrics
displayName: 'Validate Prometheus Metrics'
Summary¶
- 6 Gate Categories: Build Quality, Test Coverage, Security, Compliance, Performance, Observability
- Sequential Execution: Gates run in order with early termination on failure (fail fast)
- CI Gates (Build → Compliance): ~15-20 minutes, all blockers
- Staging Gates (Performance + Observability): ~13-20 minutes, warnings that block production
- Service-Specific Thresholds: Coverage varies by service (65%-85% based on criticality)
- Severity-Based Actions: Critical/High vulnerabilities are blockers; Medium/Low are warnings
- Compliance Focus: SBOM, audit logging, PII redaction, GDPR/HIPAA checklist all enforced
- Performance Standards: p95 <500ms, error rate <0.1%, 1000+ RPS sustained
- Observability Requirements: OpenTelemetry, health checks, structured logging, custom metrics all validated
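The fail-fast ordering summarized above can be sketched as a loop that stops at the first failing gate, so later (more expensive) gates never run. Illustrative Python; the gate names and check callables are placeholders:

```python
from typing import Callable

def run_gates(gates: list[tuple[str, Callable[[], bool]]]) -> tuple[bool, list[str]]:
    """Run quality gates in order; stop at the first failure (fail fast).
    Returns (all_passed, names_of_gates_that_actually_ran)."""
    executed = []
    for name, check in gates:
        executed.append(name)
        if not check():
            return False, executed  # early termination: later gates never run
    return True, executed
```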
Build Quality Gates (Deep Dive)¶
Build quality gates are the first line of defense in ATP's quality enforcement strategy. They execute immediately after code is committed, providing rapid feedback to developers before any tests run or security scans execute.
Philosophy: If code doesn't compile cleanly or violates coding standards, there's no point in running expensive test suites or security scans. Build quality gates ensure a baseline of code hygiene before proceeding.
Build Quality Gate Workflow¶
graph TD
A[Code Committed] --> B[Restore NuGet Packages]
B --> C[Compile Code]
C --> D{Build Success?}
D -->|No| E[Build Failed ❌]
D -->|Yes| F[Run StyleCop Analysis]
F --> G{StyleCop Pass?}
G -->|No| H[Style Violations ❌]
G -->|Yes| I[Run SonarQube Scan]
I --> J{SonarQube Quality Gate?}
J -->|No| K[Quality Gate Failed ❌]
J -->|Yes| L[Run Meziantou/AsyncFixer]
L --> M{Analyzers Pass?}
M -->|No| N[Analyzer Violations ❌]
M -->|Yes| O[Build Quality Passed ✅]
E --> P[Pipeline Stopped]
H --> P
K --> P
N --> P
O --> Q[Proceed to Test Gates]
style E fill:#ff6b6b
style H fill:#ff6b6b
style K fill:#ff6b6b
style N fill:#ff6b6b
style O fill:#90EE90
Typical Build Quality Gate Duration: 2-4 minutes
Code Compilation¶
Purpose: Ensure all code compiles successfully with zero errors and zero warnings before any further validation.
Threshold:
- Build Errors: 0 (absolute requirement)
- Build Warnings: 0 (all warnings treated as errors)
- Exit Code: dotnet build must return 0
Configuration (.csproj):
<Project Sdk="Microsoft.NET.Sdk.Web">
<PropertyGroup>
<TargetFramework>net8.0</TargetFramework>
<Nullable>enable</Nullable>
<ImplicitUsings>enable</ImplicitUsings>
<!-- Build Quality Enforcement -->
<TreatWarningsAsErrors>true</TreatWarningsAsErrors>
<WarningsAsErrors />
<NoWarn></NoWarn> <!-- Baseline: no warnings suppressed (CS1591 appended below) -->
<!-- Code Analysis -->
<EnforceCodeStyleInBuild>true</EnforceCodeStyleInBuild>
<EnableNETAnalyzers>true</EnableNETAnalyzers>
<AnalysisLevel>latest</AnalysisLevel>
<AnalysisMode>All</AnalysisMode>
<!-- Documentation Enforcement -->
<GenerateDocumentationFile>true</GenerateDocumentationFile>
<NoWarn>$(NoWarn);1591</NoWarn> <!-- Temporarily allow missing XML docs -->
<!-- Deterministic Builds (for reproducibility) -->
<Deterministic>true</Deterministic>
<ContinuousIntegrationBuild Condition="'$(CI)' == 'true'">true</ContinuousIntegrationBuild>
</PropertyGroup>
<!-- Static Analysis Packages -->
<ItemGroup>
<PackageReference Include="StyleCop.Analyzers" Version="1.2.0-beta.556">
<PrivateAssets>all</PrivateAssets>
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
<PackageReference Include="Meziantou.Analyzer" Version="2.0.110">
<PrivateAssets>all</PrivateAssets>
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
<PackageReference Include="AsyncFixer" Version="1.6.0">
<PrivateAssets>all</PrivateAssets>
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
<PackageReference Include="Microsoft.CodeAnalysis.NetAnalyzers" Version="8.0.0">
<PrivateAssets>all</PrivateAssets>
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
</ItemGroup>
<!-- StyleCop Configuration -->
<ItemGroup>
<AdditionalFiles Include="stylecop.json" />
</ItemGroup>
</Project>
Enforcement (Azure Pipelines):
# Build Compilation Gate
- task: DotNetCoreCLI@2
inputs:
command: 'build'
projects: '$(solution)'
arguments: >
--configuration Release
--no-restore
/p:TreatWarningsAsErrors=true
/p:EnforceCodeStyleInBuild=true
/p:ContinuousIntegrationBuild=true
/p:Deterministic=true
/warnaserror
displayName: 'Build with Warnings as Errors'
# Fail pipeline on non-zero exit code
continueOnError: false
# Capture build logs for diagnostics
env:
DOTNET_CLI_TELEMETRY_OPTOUT: 1
DOTNET_SKIP_FIRST_TIME_EXPERIENCE: 1
Common Build Failures & Remediation:
| Failure Type | Example Error | Remediation | Typical Time |
|---|---|---|---|
| Syntax Error | CS1002: ; expected | Fix syntax in code | 1-5 min |
| Type Mismatch | CS0029: Cannot implicitly convert type | Fix type casting or generics | 5-15 min |
| Missing Reference | CS0246: The type or namespace could not be found | Add NuGet package or project reference | 5-10 min |
| Nullability Warning | CS8600: Converting null literal or possible null value | Add null checks or nullable annotations | 10-30 min |
| Async/Await | CS4014: Call is not awaited | Add await or .ConfigureAwait(false) | 5-10 min |
| Unused Variable | CS0219: Variable is assigned but never used | Remove variable or use it | 1-2 min |
| Missing XML Doc | CS1591: Missing XML comment for publicly visible type | Add /// <summary> documentation | 10-20 min |
Build Performance Optimization:
# Local developer build (fast feedback)
dotnet build --configuration Debug --no-restore /p:TreatWarningsAsErrors=false
# CI build (full enforcement)
dotnet build --configuration Release --no-restore /p:TreatWarningsAsErrors=true /p:ContinuousIntegrationBuild=true
# Parallel build for large solutions (8 CPUs)
dotnet build --configuration Release -m:8
# Build with binary log (for diagnostics)
dotnet build --configuration Release /bl:build.binlog
Static Code Analysis¶
ATP uses four complementary static analyzers to enforce code quality, each focusing on different aspects of code hygiene.
Analyzer 1: StyleCop (Code Style & Documentation)¶
Purpose: Enforce consistent code style, naming conventions, and documentation standards across all ATP services.
Rules Enforced: 125+ rules covering naming, spacing, ordering, documentation, maintainability
Configuration (stylecop.json):
{
"$schema": "https://raw.githubusercontent.com/DotNetAnalyzers/StyleCopAnalyzers/master/StyleCop.Analyzers/StyleCop.Analyzers/Settings/stylecop.schema.json",
"settings": {
"documentationRules": {
"companyName": "ConnectSoft",
"copyrightText": "Copyright (c) {companyName}. All rights reserved.\nLicensed under the MIT license.",
"headerDecoration": "-----------------------------------------------------------------------",
"xmlHeader": true,
"documentInterfaces": true,
"documentExposedElements": true,
"documentInternalElements": false,
"documentPrivateElements": false,
"documentPrivateFields": false,
"fileNamingConvention": "stylecop"
},
"namingRules": {
"allowCommonHungarianPrefixes": false,
"allowedHungarianPrefixes": [],
"includeInferredTupleElementNames": true,
"tupleElementNameCasing": "camelCase"
},
"orderingRules": {
"elementOrder": [
"kind",
"accessibility",
"constant",
"static",
"readonly"
],
"systemUsingDirectivesFirst": true,
"usingDirectivesPlacement": "outsideNamespace",
"blankLinesBetweenUsingGroups": "allow"
},
"maintainabilityRules": {
"topLevelTypes": "multiple"
},
"layoutRules": {
"newlineAtEndOfFile": "require",
"allowConsecutiveUsings": true
}
}
}
Key StyleCop Rules (ATP-Specific):
| Rule ID | Rule Name | Severity | Example Violation | Fix |
|---|---|---|---|---|
| SA1200 | Using directives placement | Error | using inside namespace | Move using outside namespace |
| SA1633 | File header required | Warning | Missing copyright header | Add standard file header |
| SA1600 | Elements should be documented | Warning | Missing XML documentation | Add /// <summary> tags |
| SA1309 | Field names must not begin with underscore | Error | _field for public fields | Use _field only for private fields |
| SA1101 | Prefix local calls with this | Disabled | — | ATP preference: no this. prefix |
| SA1503 | Braces for single-line statements | Error | if (x) DoSomething(); | Add braces: if (x) { DoSomething(); } |
| SA1516 | Elements should be separated by blank line | Warning | No blank line between methods | Add blank line |
StyleCop Suppression (when necessary):
// Global suppression (GlobalSuppressions.cs)
[assembly: SuppressMessage("StyleCop.CSharp.DocumentationRules", "SA1633:File should have header", Justification = "Reviewed: Standard header enforced by .editorconfig")]
// Local suppression (specific violation)
#pragma warning disable SA1600 // Elements should be documented
public class GeneratedClass // Auto-generated, no docs needed
{
}
#pragma warning restore SA1600
Analyzer 2: SonarQube (Bugs, Code Smells, Security)¶
Purpose: Detect bugs, code smells, and security vulnerabilities through deep semantic analysis.
Rules Enforced: 500+ rules covering reliability, maintainability, security, code smells
Quality Profile (ConnectSoft-ATP-Default):
# SonarQube Quality Profile (ATP)
qualityGate:
name: ConnectSoft-ATP-Default
conditions:
# Reliability: Zero bugs allowed
- metric: bugs
operator: GREATER_THAN
threshold: 0
description: "Zero tolerance for bugs"
# Security: Zero vulnerabilities allowed
- metric: vulnerabilities
operator: GREATER_THAN
threshold: 0
description: "Zero tolerance for security issues"
# Security: 100% of security hotspots reviewed
- metric: security_hotspots_reviewed
operator: LESS_THAN
threshold: 100
description: "All security hotspots must be reviewed"
# Maintainability: Max 10 code smells (minor issues)
- metric: code_smells
operator: GREATER_THAN
threshold: 10
description: "Limit technical debt"
# Coverage: Minimum 70% line coverage
- metric: coverage
operator: LESS_THAN
threshold: 70.0
description: "Enforce minimum test coverage"
# Duplication: Max 3% duplicated lines
- metric: duplicated_lines_density
operator: GREATER_THAN
threshold: 3.0
description: "Prevent copy-paste programming"
# Complexity: Cognitive complexity ≤15 per method
- metric: cognitive_complexity
operator: GREATER_THAN
threshold: 15
description: "Keep methods simple"
# New Code: Zero new bugs in new code
- metric: new_bugs
operator: GREATER_THAN
threshold: 0
onlyNewCode: true
description: "No new bugs introduced"
# New Code: 100% coverage on new code
- metric: new_coverage
operator: LESS_THAN
threshold: 100.0
onlyNewCode: true
description: "All new code must be tested"
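The profile's conditions reduce to per-metric threshold comparisons: a condition fails when the measured value crosses its threshold in the direction named by the operator. An illustrative evaluator — a Python sketch, not the SonarQube implementation; the field names mirror the YAML above:

```python
def evaluate_quality_gate(conditions, measures):
    """Evaluate gate conditions against measured metric values.
    Returns ("OK", []) when all conditions hold, else ("ERROR", failures)."""
    failures = []
    for cond in conditions:
        value = measures[cond["metric"]]
        if cond["operator"] == "GREATER_THAN" and value > cond["threshold"]:
            failures.append(f'{cond["metric"]}={value} exceeds {cond["threshold"]}')
        elif cond["operator"] == "LESS_THAN" and value < cond["threshold"]:
            failures.append(f'{cond["metric"]}={value} below {cond["threshold"]}')
    return ("OK" if not failures else "ERROR", failures)
```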
SonarQube Integration (Azure Pipelines):
# SonarQube Analysis Gate
stages:
- stage: CI_Stage
jobs:
- job: Build_Analyze
steps:
# 1. Prepare SonarQube
- task: SonarQubePrepare@5
inputs:
SonarQube: 'SonarCloud-ConnectSoft'
scannerMode: 'MSBuild'
projectKey: 'ConnectSoft_ATP_Ingestion'
projectName: 'ATP Ingestion Service'
projectVersion: '$(Build.BuildNumber)'
extraProperties: |
sonar.organization=connectsoft
sonar.sources=src
sonar.tests=tests
sonar.cs.opencover.reportsPaths=$(Agent.TempDirectory)/**/coverage.opencover.xml
sonar.exclusions=**/Migrations/**,**/obj/**,**/bin/**
sonar.coverage.exclusions=**/*Tests.cs,**/Program.cs,**/Startup.cs
sonar.cpd.exclusions=**/Models/**,**/DTOs/**
displayName: 'Prepare SonarQube Analysis'
# 2. Restore NuGet packages
- task: DotNetCoreCLI@2
inputs:
command: 'restore'
projects: '$(solution)'
displayName: 'Restore NuGet Packages'
# 3. Build (SonarQube collects metrics)
- task: DotNetCoreCLI@2
inputs:
command: 'build'
projects: '$(solution)'
arguments: '--configuration Release --no-restore'
displayName: 'Build Solution'
# 4. Run Tests (coverage data for SonarQube)
- task: DotNetCoreCLI@2
inputs:
command: 'test'
projects: '**/*Tests.csproj'
arguments: '--configuration Release --no-build --collect:"XPlat Code Coverage" -- DataCollectionRunSettings.DataCollectors.DataCollector.Configuration.Format=opencover'
displayName: 'Run Tests with Coverage'
# 5. Analyze Code with SonarQube
- task: SonarQubeAnalyze@5
displayName: 'Run SonarQube Analysis'
# 6. Publish Quality Gate Result
- task: SonarQubePublish@5
inputs:
pollingTimeoutSec: '300'
displayName: 'Publish Quality Gate Result'
# 7. Break Build on Quality Gate Failure
- script: |
# Query SonarQube API for quality gate status
QUALITY_GATE=$(curl -u $(SonarToken): \
"https://sonarcloud.io/api/qualitygates/project_status?projectKey=ConnectSoft_ATP_Ingestion" \
| jq -r '.projectStatus.status')
if [ "$QUALITY_GATE" != "OK" ]; then
echo "##vso[task.logissue type=error]SonarQube Quality Gate Failed: $QUALITY_GATE"
echo "##vso[task.complete result=Failed;]Quality Gate Failed"
exit 1
fi
echo "✅ SonarQube Quality Gate Passed"
displayName: 'Validate Quality Gate'
env:
SonarToken: $(SonarQubeToken)
Top SonarQube Rules (ATP-Critical):
| Rule ID | Rule Name | Type | Severity | Example | Fix |
|---|---|---|---|---|---|
| S1172 | Unused method parameters | Code Smell | Major | public void Process(int unused) | Remove or use parameter |
| S2589 | Boolean expressions should not be gratuitous | Bug | Blocker | if (x == true && x == false) | Fix logic error |
| S2696 | Instance methods should not write to static fields | Bug | Critical | Instance method writes to static field | Refactor to instance field |
| S3776 | Cognitive complexity too high | Code Smell | Critical | Method with complexity > 15 | Refactor into smaller methods |
| S1135 | Track uses of "TODO" tags | Info | Info | // TODO: Fix this | Create work item, remove TODO |
| S4790 | Weak cryptographic algorithms | Vulnerability | Blocker | MD5.Create() | Use SHA256 or better |
| S2077 | Formatting SQL queries is security-sensitive | Security Hotspot | Major | $"SELECT * FROM Users WHERE Id={id}" | Use parameterized queries |
| S1481 | Unused local variables should be removed | Code Smell | Minor | var unused = GetData(); | Remove or use variable |
SonarQube False Positive Suppression:
// Suppress specific rule for method
[SuppressMessage("SonarQube", "S3776:Cognitive Complexity of methods should not be too high",
Justification = "Complex business logic; covered by tests")]
public async Task<AuditEventResult> ProcessComplexEvent(AuditEvent evt)
{
// Complex logic here
}
// Suppress for a whole region via pragma disable/restore (use sparingly)
#pragma warning disable S1135 // Track uses of "TODO" tags
// TODO: This entire file is a prototype
#pragma warning restore S1135
Analyzer 3: Meziantou.Analyzer (Best Practices)¶
Purpose: Enforce .NET best practices, async/await patterns, and performance optimizations.
Rules Enforced: 150+ rules covering async, collections, strings, disposal, naming
Key Meziantou Rules (ATP-Enabled):
| Rule ID | Rule Name | Severity | Example | Fix |
|---|---|---|---|---|
| MA0001 | StringComparison is missing | Warning | str.Contains("test") | str.Contains("test", StringComparison.Ordinal) |
| MA0004 | Use Task.ConfigureAwait(false) | Warning | await Task.Delay(100); | await Task.Delay(100).ConfigureAwait(false); |
| MA0006 | Use String.Equals instead of equality operator | Warning | str == "test" | str.Equals("test", StringComparison.Ordinal) |
| MA0011 | IFormatProvider is missing | Warning | int.Parse("123") | int.Parse("123", CultureInfo.InvariantCulture) |
| MA0016 | Prefer return collection abstraction | Warning | public List<T> Get() | public IEnumerable<T> Get() or IReadOnlyList<T> |
| MA0026 | Fix TODO comment | Info | // TODO: Implement | Create work item, remove TODO |
| MA0040 | Use a cancellation token | Warning | async Task DoWork() | async Task DoWork(CancellationToken ct) |
| MA0051 | Method is too long | Warning | Method > 60 lines | Refactor into smaller methods |
| MA0056 | Do not call overridable members in constructor | Warning | ctor() { VirtualMethod(); } | Move to Initialize() method |
.editorconfig Configuration (Meziantou):
# Meziantou Analyzer Rules
[*.cs]
# MA0001: StringComparison is missing
dotnet_diagnostic.MA0001.severity = warning
# MA0004: Use Task.ConfigureAwait(false)
dotnet_diagnostic.MA0004.severity = warning
# MA0006: Use String.Equals
dotnet_diagnostic.MA0006.severity = warning
# MA0011: IFormatProvider is missing
dotnet_diagnostic.MA0011.severity = warning
# MA0016: Prefer return collection abstraction
dotnet_diagnostic.MA0016.severity = warning
# MA0040: Use a cancellation token
dotnet_diagnostic.MA0040.severity = warning
# MA0051: Method is too long (disabled globally; too noisy on generated code)
dotnet_diagnostic.MA0051.severity = none
# MA0056: Do not call overridable members in constructor
dotnet_diagnostic.MA0056.severity = error
Analyzer 4: AsyncFixer (Async/Await Correctness)¶
Purpose: Detect async/await anti-patterns that can cause deadlocks, performance issues, or incorrect behavior.
Rules Enforced: 6 critical async patterns
Key AsyncFixer Rules (ATP-Enabled):
| Rule ID | Rule Name | Severity | Example | Issue | Fix |
|---|---|---|---|---|---|
| AsyncFixer01 | Unnecessary async/await | Warning | async Task<int> Get() => await Task.FromResult(1); | Unnecessary overhead | Task<int> Get() => Task.FromResult(1); |
| AsyncFixer02 | Long-running or blocking operations | Warning | Task.Run(() => { Thread.Sleep(1000); }) | Blocking thread pool | Use await Task.Delay(1000) |
| AsyncFixer03 | Fire-and-forget async void | Error | async void ProcessEvent() | Unhandled exceptions | async Task ProcessEventAsync() |
| AsyncFixer04 | Fire-and-forget async call | Warning | ProcessEventAsync(); // not awaited | Lost exceptions | await ProcessEventAsync(); |
| AsyncFixer05 | Downcasting from Task to Task<T> | Error | (Task<int>)task | Runtime exception risk | Use Task.FromResult<int>() or generics |
| AsyncFixer06 | Missing ConfigureAwait(false) | Warning | await client.GetAsync(url); | Potential deadlock in UI apps | await client.GetAsync(url).ConfigureAwait(false); |
AsyncFixer Examples & Remediation:
// ❌ BAD: Async void (AsyncFixer03)
public async void ProcessEvent(AuditEvent evt) // Unhandled exceptions disappear
{
await _repository.SaveAsync(evt);
}
// ✅ GOOD: Async Task
public async Task ProcessEventAsync(AuditEvent evt) // Exceptions propagate correctly
{
await _repository.SaveAsync(evt);
}
// ❌ BAD: Fire-and-forget (AsyncFixer04)
public void EnqueueEvent(AuditEvent evt)
{
ProcessEventAsync(evt); // Not awaited; exceptions lost
}
// ✅ GOOD: Awaited or properly handled
public async Task EnqueueEventAsync(AuditEvent evt)
{
await ProcessEventAsync(evt); // Exceptions propagate
}
// OR: Explicitly fire-and-forget with error handling
public void EnqueueEvent(AuditEvent evt)
{
_ = ProcessEventAsync(evt).ContinueWith(t =>
{
if (t.IsFaulted)
{
_logger.LogError(t.Exception, "Event processing failed");
}
}, TaskScheduler.Default);
}
// ❌ BAD: Blocking in async (AsyncFixer02)
public async Task<string> GetDataAsync()
{
return await Task.Run(() =>
{
Thread.Sleep(1000); // Blocking thread pool thread
return "data";
});
}
// ✅ GOOD: Proper async
public async Task<string> GetDataAsync()
{
await Task.Delay(1000); // Non-blocking
return "data";
}
// ❌ BAD: Missing ConfigureAwait (AsyncFixer06)
public async Task<AuditEvent> GetEventAsync(Guid id)
{
var json = await _httpClient.GetStringAsync($"/api/events/{id}"); // Captures context
return JsonSerializer.Deserialize<AuditEvent>(json);
}
// ✅ GOOD: ConfigureAwait(false) in library code
public async Task<AuditEvent> GetEventAsync(Guid id)
{
var json = await _httpClient.GetStringAsync($"/api/events/{id}").ConfigureAwait(false);
return JsonSerializer.Deserialize<AuditEvent>(json);
}
.editorconfig (Unified Analyzer Configuration)¶
ATP uses .editorconfig to centralize analyzer rule severity across all services.
.editorconfig (ATP Standard):
# ConnectSoft ATP .editorconfig
# Applied to all C# files in the repository
root = true
# All files
[*]
charset = utf-8
insert_final_newline = true
trim_trailing_whitespace = true
indent_style = space
# C# files
[*.cs]
indent_size = 4
end_of_line = lf
# Build Quality: Treat warnings as errors
dotnet_analyzer_diagnostic.severity = error
# Nullable Reference Types (<Nullable>enable</Nullable> is set in the .csproj)
dotnet_diagnostic.CS8600.severity = error # Converting null literal
dotnet_diagnostic.CS8601.severity = error # Possible null reference assignment
dotnet_diagnostic.CS8602.severity = error # Dereference of a possibly null reference
dotnet_diagnostic.CS8603.severity = error # Possible null reference return
dotnet_diagnostic.CS8604.severity = error # Possible null reference argument
# StyleCop Rules (selective enforcement)
dotnet_diagnostic.SA1101.severity = none # Prefix local calls with this (disabled)
dotnet_diagnostic.SA1200.severity = error # Using directives placement
dotnet_diagnostic.SA1309.severity = error # Field names must not begin with underscore (public)
dotnet_diagnostic.SA1503.severity = error # Braces for single-line statements
dotnet_diagnostic.SA1516.severity = warning # Elements separated by blank line
dotnet_diagnostic.SA1600.severity = warning # Elements should be documented
dotnet_diagnostic.SA1633.severity = none # File header (handled by .editorconfig)
# SonarQube Rules (critical only)
dotnet_diagnostic.S1172.severity = warning # Unused parameters
dotnet_diagnostic.S2589.severity = error # Boolean expressions gratuitous
dotnet_diagnostic.S2696.severity = error # Instance methods write to static fields
dotnet_diagnostic.S3776.severity = warning # Cognitive complexity
dotnet_diagnostic.S4790.severity = error # Weak cryptographic algorithms
# Meziantou Rules
dotnet_diagnostic.MA0001.severity = warning # StringComparison missing
dotnet_diagnostic.MA0004.severity = warning # ConfigureAwait missing
dotnet_diagnostic.MA0006.severity = warning # Use String.Equals
dotnet_diagnostic.MA0011.severity = warning # IFormatProvider missing
dotnet_diagnostic.MA0040.severity = warning # Use cancellation token
# AsyncFixer Rules
dotnet_diagnostic.AsyncFixer01.severity = warning # Unnecessary async/await
dotnet_diagnostic.AsyncFixer02.severity = warning # Blocking operations
dotnet_diagnostic.AsyncFixer03.severity = error # Async void
dotnet_diagnostic.AsyncFixer04.severity = warning # Fire-and-forget
dotnet_diagnostic.AsyncFixer06.severity = warning # Missing ConfigureAwait
# Code Style
csharp_prefer_braces = true:error
csharp_prefer_simple_using_statement = true:suggestion
csharp_style_namespace_declarations = file_scoped:warning
csharp_style_var_for_built_in_types = false:suggestion
csharp_style_var_when_type_is_apparent = true:suggestion
csharp_style_var_elsewhere = false:suggestion
Build Quality Metrics & Dashboard¶
Azure DevOps Dashboard (Build Quality Widget):
# Build Quality Dashboard Configuration
dashboard:
name: "ATP Build Quality"
widgets:
- type: buildQuality
title: "Build Success Rate"
query: "Build Success Rate (Last 30 Days)"
metric: successRate
target: 95%
- type: codeQuality
title: "SonarQube Quality Gate"
query: "Quality Gate Pass Rate"
metric: qualityGatePass
target: 100%
- type: codeAnalysis
title: "Analyzer Violations"
query: "Analyzer Violations (Last 7 Days)"
metrics:
- StyleCop: 0
- SonarQube: 0
- Meziantou: < 10
- AsyncFixer: 0
Build Quality KQL Queries (Application Insights):
// Build success rate by service (last 30 days)
customEvents
| where name == "BuildCompleted"
| where timestamp > ago(30d)
| extend Service = tostring(customDimensions.Service)
| extend Success = tostring(customDimensions.Success) == "true"
| summarize
TotalBuilds = count(),
SuccessfulBuilds = countif(Success),
SuccessRate = 100.0 * countif(Success) / count()
by Service
| order by SuccessRate asc
// Top build failure reasons (last 7 days)
customEvents
| where name == "BuildFailed"
| where timestamp > ago(7d)
| extend Reason = tostring(customDimensions.FailureReason)
| summarize FailureCount = count() by Reason
| order by FailureCount desc
| take 10
// Average build duration trend (last 90 days)
customEvents
| where name == "BuildCompleted"
| where timestamp > ago(90d)
| extend DurationSeconds = todouble(customDimensions.DurationSeconds)
| summarize AvgDuration = avg(DurationSeconds) by bin(timestamp, 1d)
| render timechart
Summary¶
- Build Quality Gates: First line of defense with 2-4 minute execution time
- Code Compilation: Zero errors, zero warnings (TreatWarningsAsErrors=true)
- 4 Static Analyzers: StyleCop (style), SonarQube (bugs/smells/security), Meziantou (best practices), AsyncFixer (async correctness)
- StyleCop: 125+ rules enforcing code style, naming, documentation
- SonarQube: 500+ rules with quality gate (0 bugs, 0 vulnerabilities, ≤10 code smells, ≥70% coverage, ≤3% duplication)
- Meziantou: 150+ rules for .NET best practices (StringComparison, ConfigureAwait, IFormatProvider, cancellation tokens)
- AsyncFixer: 6 critical rules preventing async anti-patterns (no async void, ConfigureAwait, fire-and-forget)
- .editorconfig: Centralized analyzer configuration with severity levels per rule
- Enforcement: All analyzers run during build; pipeline fails on any error-level violation
- Typical Remediation: 1-30 minutes per build failure depending on type
Test Coverage Gates (Deep Dive)¶
Test coverage gates ensure that sufficient automated tests exist and execute successfully with adequate code coverage. ATP enforces 100% test pass rate and service-specific coverage thresholds to maintain high reliability and prevent regression.
Philosophy: Code without tests is legacy code. ATP requires that all new code is accompanied by comprehensive unit and integration tests, with coverage thresholds calibrated to each service's criticality and complexity.
Test Coverage Gate Workflow¶
graph TD
A[Build Successful] --> B[Restore Test Projects]
B --> C[Run Unit Tests]
C --> D{All Tests Pass?}
D -->|No| E[Test Failures ❌]
D -->|Yes| F[Run Integration Tests]
F --> G{All Tests Pass?}
G -->|No| H[Integration Test Failures ❌]
G -->|Yes| I[Collect Code Coverage]
I --> J[Generate Coverage Report]
J --> K{Coverage ≥ Threshold?}
K -->|No| L[Coverage Too Low ❌]
K -->|Yes| M[Detect Flaky Tests]
M --> N{Flaky Rate < 5%?}
N -->|No| O[Flaky Tests Detected ⚠️]
N -->|Yes| P[Test Coverage Passed ✅]
E --> Q[Pipeline Stopped]
H --> Q
L --> Q
O --> R[Warning: Fix Flaky Tests]
P --> S[Proceed to Security Gates]
style E fill:#ff6b6b
style H fill:#ff6b6b
style L fill:#ff6b6b
style O fill:#feca57
style P fill:#90EE90
Typical Test Coverage Gate Duration: 3-5 minutes
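The flaky-test detection step in the workflow above can be approximated by rerunning tests and flagging any test that both passed and failed across its runs; the gate warns when that fraction exceeds 5%. A minimal sketch — illustrative only, ATP's actual detector is not shown here:

```python
def flaky_rate(results: dict[str, list[bool]]) -> float:
    """Percentage of tests that both passed and failed across retries.
    `results` maps test name -> pass/fail outcome of each run."""
    if not results:
        return 0.0
    flaky = sum(1 for runs in results.values()
                if any(runs) and not all(runs))  # mixed outcomes => flaky
    return 100.0 * flaky / len(results)
```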
Coverage Thresholds (Per Service)¶
ATP enforces service-specific coverage thresholds based on each service's criticality, complexity, and architectural patterns. Security-critical services have higher thresholds than I/O-heavy integration services.
Service Threshold Matrix:
| Service | Line Coverage | Branch Coverage | Min Tests | Max Test Duration | Rationale |
|---|---|---|---|---|---|
| Ingestion | ≥75% | ≥65% | 100+ | 5 minutes | Critical path for all audit events; high reliability requirement; complex validation logic |
| Query | ≥80% | ≥70% | 150+ | 5 minutes | Complex query logic with dynamic filters, pagination, sorting; high test coverage essential |
| Integrity | ≥85% | ≥75% | 80+ | 3 minutes | Security-critical tamper-evidence, hash chains, digital signatures; highest coverage requirement |
| Export | ≥70% | ≥60% | 60+ | 7 minutes | I/O-heavy with external dependencies (Blob, CSV, PDF); lower threshold acceptable |
| Policy | ≥80% | ≥70% | 120+ | 4 minutes | Business rules enforcement; high coverage for rule validation and policy evaluation |
| Search | ≥70% | ≥60% | 80+ | 6 minutes | Integration-heavy with Elasticsearch; focus on integration tests over unit tests |
| Gateway | ≥65% | ≥55% | 50+ | 4 minutes | API routing and orchestration; lower threshold, focus on E2E tests and contract validation |
Threshold Rationale:
# Why different thresholds per service?
ingestion:
threshold: 75%
rationale: |
- Critical path: All audit events flow through Ingestion
- Complex validation: Schema validation, tenant isolation, duplicate detection
- High reliability: 99.9% uptime SLA
- Consequence of failure: Audit events lost (catastrophic)
riskProfile: Critical
query:
threshold: 80%
rationale: |
- Complex logic: Dynamic query building, filter composition, pagination
- High variability: Many query permutations (100+ filter combinations)
- Performance-critical: Query performance directly impacts user experience
- Consequence of failure: Incorrect results (compliance risk)
riskProfile: High
integrity:
threshold: 85%
rationale: |
- Security-critical: Tamper-evidence, hash chain validation, signature verification
- Zero-tolerance: Any integrity failure undermines entire audit trail
- Cryptographic complexity: Hash algorithms, Merkle trees, digital signatures
- Consequence of failure: Audit trail integrity compromised (catastrophic)
riskProfile: Critical
export:
threshold: 70%
rationale: |
- I/O-heavy: File generation, streaming, Blob uploads
- External dependencies: PDF libraries, CSV serialization
- Lower complexity: Mostly data transformation and serialization
- Consequence of failure: Export fails (retryable, not data loss)
riskProfile: Medium
gateway:
threshold: 65%
rationale: |
- API routing: Minimal business logic, mostly orchestration
- E2E coverage: Tested via E2E tests rather than unit tests
- Thin layer: Delegates to downstream services
- Consequence of failure: Request routing error (visible, fast feedback)
riskProfile: Low
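The per-service thresholds above amount to a lookup plus a comparison. A sketch of the enforcement logic — illustrative Python; the 70% fallback mirrors the ATP baseline:

```python
# Per-service line-coverage thresholds from the matrix above
THRESHOLDS = {
    "ingestion": 75.0, "query": 80.0, "integrity": 85.0,
    "export": 70.0, "policy": 80.0, "search": 70.0, "gateway": 65.0,
}

def check_coverage(service: str, line_coverage_pct: float) -> tuple[bool, str]:
    """Compare measured line coverage against the service-specific threshold.
    Unknown services fall back to the 70% ATP baseline."""
    threshold = THRESHOLDS.get(service, 70.0)
    if line_coverage_pct >= threshold:
        return True, f"{service}: {line_coverage_pct}% >= {threshold}% (pass)"
    return False, f"{service}: {line_coverage_pct}% < {threshold}% (fail)"
```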
Test Execution & Coverage Collection¶
Test Execution Pipeline (Azure Pipelines):
# Test Coverage Gate
- task: DotNetCoreCLI@2
inputs:
command: 'test'
projects: '**/*Tests.csproj'
arguments: >
--configuration Release
--no-build
--collect:"XPlat Code Coverage"
--settings:CodeCoverage.runsettings
--logger:"trx;LogFileName=TestResults.trx"
--
DataCollectionRunSettings.DataCollectors.DataCollector.Configuration.Format=cobertura
displayName: 'Run Tests with Coverage'
# Fail on any test failure
continueOnError: false
# Disable built-in publication; testResultsFormat/testResultsFiles are
# PublishTestResults@2 inputs, not DotNetCoreCLI@2 inputs
publishTestResults: false
# Test result publication (dedicated task)
- task: PublishTestResults@2
inputs:
testResultsFormat: 'VSTest'
testResultsFiles: '**/TestResults.trx'
condition: succeededOrFailed()
displayName: 'Publish Test Results'
# Publish Coverage Results
- task: PublishCodeCoverageResults@1
inputs:
codeCoverageTool: 'Cobertura'
summaryFileLocation: '$(Agent.TempDirectory)/**/coverage.cobertura.xml'
reportDirectory: '$(Agent.TempDirectory)/coverage-report'
pathToSources: '$(Build.SourcesDirectory)/src'
displayName: 'Publish Coverage Report'
# Enforce Coverage Threshold
- task: BuildQualityChecks@8
inputs:
checkCoverage: true
coverageFailOption: 'fixed'
coverageType: 'lines'
coverageThreshold: 70 # ATP baseline (overridden per service)
coverageVariance: 0 # No tolerance for coverage drops
baseBranchRef: 'refs/heads/main'
treatBuildWarningsAsErrors: true
baselineEnabled: true
baselineType: 'previous'
displayName: 'Enforce Coverage Threshold'
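The threshold comparison performed by the BuildQualityChecks task can also be reproduced as a standalone check against the Cobertura summary file; a Python sketch (function name and inputs are illustrative, not part of the pipeline):

```python
# Standalone equivalent of the coverage-threshold check: read the Cobertura
# summary (line-rate is a 0..1 fraction on the root element) and compare to
# the gate threshold. Illustrative sketch only.
import xml.etree.ElementTree as ET

def check_cobertura(xml_text: str, threshold_pct: float) -> tuple[bool, float]:
    """Return (passes_gate, coverage_pct) for a Cobertura XML document."""
    root = ET.fromstring(xml_text)
    coverage_pct = float(root.get("line-rate", "0")) * 100
    return coverage_pct >= threshold_pct, coverage_pct
```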
CodeCoverage.runsettings (Configuration):
<?xml version="1.0" encoding="utf-8"?>
<RunSettings>
<DataCollectionRunSettings>
<DataCollectors>
<DataCollector friendlyName="XPlat code coverage">
<Configuration>
<Format>cobertura,opencover</Format>
<Exclude>[*.Tests]*,[*]*.Migrations.*,[*]*.Program,[*]*.Startup</Exclude>
<ExcludeByAttribute>Obsolete,GeneratedCode,CompilerGenerated</ExcludeByAttribute>
<ExcludeByFile>**/*Designer.cs,**/obj/**,**/bin/**</ExcludeByFile>
<IncludeDirectory>src/</IncludeDirectory>
<SingleHit>false</SingleHit>
<UseSourceLink>true</UseSourceLink>
<IncludeTestAssembly>false</IncludeTestAssembly>
<SkipAutoProps>true</SkipAutoProps>
</Configuration>
</DataCollector>
</DataCollectors>
</DataCollectionRunSettings>
<RunConfiguration>
<MaxCpuCount>0</MaxCpuCount> <!-- Use all available CPUs -->
<ResultsDirectory>./TestResults</ResultsDirectory>
<TestSessionTimeout>600000</TestSessionTimeout> <!-- 10 minutes -->
</RunConfiguration>
<MSTest>
<Parallelize>
<Workers>0</Workers> <!-- Auto-detect based on CPU cores -->
<Scope>ClassLevel</Scope>
</Parallelize>
</MSTest>
</RunSettings>
Baseline Protection¶
Purpose: Prevent coverage regression by comparing current coverage to previous builds and failing if coverage drops.
Mechanism: Azure DevOps Build Quality Checks task tracks coverage per build and fails if coverage decreases.
Configuration:
# Baseline Protection Configuration
- task: BuildQualityChecks@8
inputs:
checkCoverage: true
coverageFailOption: 'fixed' # Fixed threshold (not dynamic)
coverageType: 'lines'
coverageThreshold: $(coverageThreshold) # Service-specific variable
coverageVariance: 0 # Zero tolerance for coverage drops
# Baseline Comparison
baselineEnabled: true
baselineType: 'previous' # Compare to previous build
baseBranchRef: 'refs/heads/main'
# Include/Exclude Filters
includePartiallySucceeded: false
treatBuildWarningsAsErrors: true
# Failure Behavior
failTaskOnBaselineViolation: true
createBuildIssue: true # Create work item for coverage drop
displayName: 'Baseline Protection: Enforce Coverage'
Baseline Scenarios:
| Scenario | Previous Coverage | Current Coverage | Variance | Result | Action |
|---|---|---|---|---|---|
| Coverage Maintained | 75.0% | 75.2% | +0.2% | ✅ Pass | None |
| Coverage Improved | 75.0% | 78.5% | +3.5% | ✅ Pass | Celebrate! |
| Coverage Dropped (Minor) | 75.0% | 74.8% | -0.2% | ❌ Fail | Add tests for new code |
| Coverage Dropped (Major) | 75.0% | 68.0% | -7.0% | ❌ Fail | Investigate untested code; may require new baseline |
| First Build | N/A | 72.0% | N/A | ✅ Pass | Establishes baseline |
| Refactoring | 75.0% | 65.0% (new baseline) | -10.0% | ⚠️ Conditional | Requires Force New Baseline approval |
Coverage Drop Notification:
// Custom coverage drop detector
public class CoverageRegressionDetector
{
public async Task<CoverageRegressionResult> DetectRegressionAsync(
string buildId,
double currentCoverage,
double threshold)
{
// Get previous build coverage
var previousBuild = await GetPreviousBuildAsync(buildId);
var previousCoverage = previousBuild?.Coverage ?? 0;
var regression = new CoverageRegressionResult
{
CurrentCoverage = currentCoverage,
PreviousCoverage = previousCoverage,
Delta = currentCoverage - previousCoverage,
Threshold = threshold,
PassesThreshold = currentCoverage >= threshold,
HasRegression = currentCoverage < previousCoverage
};
if (regression.HasRegression)
{
// Create Azure DevOps work item
await CreateCoverageRegressionWorkItemAsync(new WorkItem
{
Title = $"Coverage Regression Detected: {Math.Abs(regression.Delta):F1}% drop in Build {buildId}",
Description = $@"
Code coverage dropped from {previousCoverage:F1}% to {currentCoverage:F1}% (delta: {regression.Delta:F1}%).
**Previous Build**: {previousBuild.BuildNumber} ({previousBuild.CommitSha})
**Current Build**: {buildId}
**Uncovered Code**:
{await GetUncoveredCodeSummaryAsync(buildId)}
**Action Required**:
1. Review uncovered code in coverage report
2. Add unit tests for critical paths
3. Re-run build to validate coverage improvement
**Coverage Report**: [View Report]({GetCoverageReportUrl(buildId)})
",
AssignedTo = previousBuild.Requester,
Priority = regression.Delta < -5 ? 1 : 2, // P1 if more than a 5% drop (Delta is negative on regression)
Tags = new[] { "coverage-regression", "quality-gate", "test-coverage" }
});
}
return regression;
}
}
Force New Baseline¶
Purpose: Allow intentional coverage drops after major refactoring or architecture changes, with proper approval and documentation.
When to Use:
- Major Refactoring: Large code deletions or restructuring (e.g., removing deprecated code)
- Architecture Changes: Moving logic between services (coverage shifts from one service to another)
- Test Cleanup: Removing obsolete tests after feature removal
- Coverage Calculation Changes: Updating .runsettings exclusions or analyzers
Approval Workflow:
stateDiagram-v2
[*] --> Requested: Engineer triggers Force New Baseline
Requested --> TechnicalReview: Create ADR documenting change
TechnicalReview --> ArchitectApproval: Tech Lead approves justification
TechnicalReview --> Rejected: Insufficient justification
ArchitectApproval --> Approved: Lead Architect approves
ArchitectApproval --> Rejected: Coverage drop unjustified
Approved --> BaselineCreated: Set BQC.ForceNewBaseline=true
BaselineCreated --> Validated: Monitor next 3 builds
Validated --> [*]: New baseline established
Rejected --> [*]: Use existing baseline
Procedure:
# Step 1: Create Architecture Decision Record (ADR)
# File: adrs/adr-NNN-force-new-coverage-baseline.md
---
title: ADR-042: Force New Coverage Baseline for Query Service Refactoring
status: Accepted
date: 2025-01-15
decision-makers: Lead Architect, Tech Lead
consulted: QA Lead, SRE Team
informed: Development Team
---
## Context
Query service underwent major refactoring to separate read/write paths (CQRS).
~30% of code moved to new QueryRead project, causing coverage to drop from 80% to 62% in QueryWrite project.
## Decision
Force new baseline at 62% for QueryWrite service, with commitment to raise to 75% within Q1 2025.
## Consequences
- Coverage threshold lowered temporarily (62% for QueryWrite)
- Baseline protection disabled for 1 build
- Monitoring for 3 builds to ensure stability
- Action plan: Add 50+ unit tests for QueryWrite within 2 sprints
## Approval
- Lead Architect: ✅ Approved (2025-01-15)
- Tech Lead: ✅ Approved (2025-01-15)
- QA Lead: ✅ Consulted (2025-01-14)
# Step 2: Set Pipeline Variable (Azure DevOps)
# UI: Pipelines → Edit → Variables → Add Variable
variableName: BQC.ForceNewBaseline
value: true
scope: Single build # Reset to false after baseline created
# Step 3: Update BuildQualityChecks Task
- task: BuildQualityChecks@8
inputs:
checkCoverage: true
coverageThreshold: 62 # NEW BASELINE (was 80%)
baselineEnabled: true
# Force new baseline if variable set
${{ if eq(variables['BQC.ForceNewBaseline'], 'true') }}:
baselineType: 'current' # Use current build as new baseline
${{ else }}:
baselineType: 'previous' # Compare to previous build
displayName: 'Enforce Coverage with Baseline'
# Step 4: Monitor Next 3 Builds
#!/bin/bash
# validate-new-baseline.sh
BUILD_ID=$1
EXPECTED_COVERAGE=62.0
for i in {1..3}; do
echo "Validating build $BUILD_ID (attempt $i/3)..."
# NOTE: `az pipelines runs show` does not expose coverage; query the
# code-coverage REST API instead (URL is a sketch; set $PROJECT and a PAT in $AZURE_DEVOPS_PAT)
COVERAGE=$(curl -s -u ":$AZURE_DEVOPS_PAT" \
"https://dev.azure.com/ConnectSoft/$PROJECT/_apis/test/codecoverage?buildId=$BUILD_ID&api-version=7.1-preview.1" \
| jq -r '[.coverageData[].coverageStats[] | select(.label == "Lines")][0] | .covered / .total * 100')
if (( $(echo "$COVERAGE < $EXPECTED_COVERAGE" | bc -l) )); then
echo "❌ Coverage dropped below new baseline: $COVERAGE% < $EXPECTED_COVERAGE%"
exit 1
fi
echo "✅ Build $BUILD_ID coverage: $COVERAGE% (baseline: $EXPECTED_COVERAGE%)"
# Wait for next build
sleep 3600 # 1 hour
done
echo "✅ New baseline validated over 3 builds"
Test Quality Metrics¶
Beyond Coverage Percentage: ATP tracks test quality metrics to ensure tests are effective, not just numerous.
Test Quality Scorecard¶
| Metric | Target | Measurement | Blocker | Purpose |
|---|---|---|---|---|
| Test Pass Rate | 100% | Count(Passed) / Count(Total) | ✅ Yes | All tests must pass; no flaky tolerance |
| Test Duration | Unit <30s, Integration <5min | Execution time per test category | ⚠️ Warning | Fast feedback; slow tests indicate issues |
| Flaky Test Rate | <5% | Tests with <95% historical pass rate | ⚠️ Warning | Flaky tests erode confidence |
| Assertion Density | ≥1.5 per test | Count(Assertions) / Count(Tests) | ℹ️ Info | Ensure tests actually validate behavior |
| Quarantined Tests | ≤3 per service | Tests marked with [Ignore] or [Fact(Skip=)] | ⚠️ Warning | Quarantined tests must be fixed or removed |
| Test Coverage on New Code | 100% | Coverage on changed lines | ✅ Yes | All new code must be tested |
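The scorecard's blocking and warning metrics can be computed from raw test counts; a minimal Python sketch (function and field names are illustrative):

```python
# Test-quality scorecard (sketch): evaluates the pass-rate, assertion-density,
# and quarantine targets from the table above against raw counts.
def scorecard(passed: int, total: int, assertions: int, quarantined: int) -> dict:
    return {
        "pass_rate_ok": total > 0 and passed == total,           # must be 100%
        "assertion_density_ok": total > 0 and assertions / total >= 1.5,
        "quarantine_ok": quarantined <= 3,                       # ≤3 per service
    }
```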
Test Pass Rate (100% Required)¶
Threshold: 100% — Every test must pass; no tolerance for failures.
Enforcement:
# Test execution fails on first test failure
- task: DotNetCoreCLI@2
inputs:
command: 'test'
arguments: '--no-build --logger trx --blame-hang-timeout 5m'
displayName: 'Run Unit Tests'
continueOnError: false # Fail immediately on test failure
Common Test Failures:
| Failure Type | Symptom | Root Cause | Remediation |
|---|---|---|---|
| Assertion Failure | Expected: 200, Actual: 500 | Business logic bug, incorrect test expectation | Fix code or update test assertion |
| Null Reference | NullReferenceException | Missing null checks, incomplete mocking | Add null checks, fix mock setup |
| Timeout | Test exceeds 5-minute limit | Deadlock, infinite loop, external dependency | Add timeout, fix async code, mock external calls |
| Flaky Test | Passes sometimes, fails sometimes | Race condition, timing dependency, shared state | Fix concurrency issues, isolate test state |
| Dependency Failure | SQL/Redis connection failed | Service container not started | Verify the services: block in the pipeline; check container health |
Test Failure Notification:
// Emit test failure event for alerting
public class TestFailureNotifier : ITestExecutionListener
{
public void OnTestFailed(TestResult result)
{
var telemetry = new EventTelemetry("TestFailed");
telemetry.Properties["TestName"] = result.TestName;
telemetry.Properties["FailureReason"] = result.ErrorMessage;
telemetry.Properties["StackTrace"] = result.ErrorStackTrace;
telemetry.Properties["Duration"] = result.Duration.ToString();
telemetry.Properties["BuildId"] = Environment.GetEnvironmentVariable("BUILD_BUILDID");
_telemetryClient.TrackEvent(telemetry);
// Alert on P0 test failures (security, integrity tests)
if (result.Categories.Contains("Security") || result.Categories.Contains("Integrity"))
{
_alertService.SendPagerDutyAlert(
severity: "high",
title: $"Critical Test Failed: {result.TestName}",
description: result.ErrorMessage);
}
}
}
Test Duration Thresholds¶
Purpose: Ensure tests provide fast feedback; slow tests indicate design issues (e.g., integration tests disguised as unit tests).
Thresholds:
| Test Category | Max Duration (Per Test) | Max Suite Duration | Enforcement |
|---|---|---|---|
| Unit Tests | 100ms | 30 seconds | ⚠️ Warning if exceeded |
| Integration Tests | 5 seconds | 5 minutes | ⚠️ Warning if exceeded |
| E2E Tests | 30 seconds | 15 minutes | ℹ️ Info (E2E expected to be slower) |
Slow Test Detection:
// Detect slow tests during execution
[AttributeUsage(AttributeTargets.Method)]
public class PerformanceTestAttribute : FactAttribute
{
public int MaxDurationMs { get; set; } = 100; // Default: 100ms for unit tests
public PerformanceTestAttribute()
{
// Custom test framework hook to measure duration
}
}
// Usage
[PerformanceTest(MaxDurationMs = 100)]
public async Task Should_Validate_Event_Within_100ms()
{
var validator = new AuditEventValidator();
var evt = CreateValidEvent();
var result = await validator.ValidateAsync(evt); // Must complete in <100ms
Assert.True(result.IsValid);
}
Slow Test Report (Azure Pipelines):
#!/bin/bash
# detect-slow-tests.sh
# Parse TRX test results. TRX stores duration as hh:mm:ss.fffffff, so a short
# Python helper converts and filters (naive XPath string comparison cannot handle it).
find . -name "TestResults.trx" -type f | while read -r TRX; do
python3 - "$TRX" <<'PY' | while IFS=$'\t' read -r TEST_NAME DURATION; do
import sys, xml.etree.ElementTree as ET
NS = '{http://microsoft.com/schemas/VisualStudio/TeamTest/2010}'
for r in ET.parse(sys.argv[1]).getroot().iter(f'{NS}UnitTestResult'):
    h, m, s = r.get('duration', '0:0:0').split(':')
    if int(h) * 3600 + int(m) * 60 + float(s) > 0.1:  # >100ms
        print(r.get('testName'), r.get('duration'), sep='\t')
PY
echo "⚠️ Slow Test Detected: $TEST_NAME (Duration: $DURATION)"
# Create work item for slow test optimization
az boards work-item create \
--title "Slow Test: $TEST_NAME" \
--type "Task" \
--description "Test duration: $DURATION (threshold: 100ms). Optimize or reclassify as integration test." \
--assigned-to "qa-team@connectsoft.example" \
--fields Priority=3
done
done
Flaky Test Detection¶
Purpose: Identify unreliable tests that pass/fail intermittently, eroding confidence in the test suite.
Threshold: Tests with <95% historical pass rate are flagged as flaky.
Detection Mechanism:
// Flaky test analyzer (Azure Function)
[FunctionName("DetectFlakyTests")]
public async Task RunAsync(
[TimerTrigger("0 0 2 * * *")] TimerInfo timer, // Daily at 2 AM
ILogger log)
{
log.LogInformation("Analyzing test results for flaky tests...");
var last30Days = DateTime.UtcNow.AddDays(-30);
// Query Azure DevOps Test Analytics
var testRuns = await _devOpsClient.GetTestRunsAsync(
project: "ConnectSoft",
minLastUpdatedDate: last30Days);
var flakyTests = new List<FlakyTestResult>();
foreach (var testRun in testRuns)
{
var results = await _devOpsClient.GetTestResultsAsync(testRun.Id);
var testStats = results
.GroupBy(r => r.TestCaseTitle)
.Select(g => new
{
TestName = g.Key,
TotalRuns = g.Count(),
PassedRuns = g.Count(r => r.Outcome == "Passed"),
PassRate = g.Count(r => r.Outcome == "Passed") / (double)g.Count()
})
.Where(t => t.PassRate < 0.95 && t.TotalRuns >= 5) // Flaky: <95% pass, min 5 runs
.ToList();
flakyTests.AddRange(testStats.Select(s => new FlakyTestResult
{
TestName = s.TestName,
PassRate = s.PassRate,
TotalRuns = s.TotalRuns,
FailureCount = s.TotalRuns - s.PassedRuns
}));
}
if (flakyTests.Any())
{
log.LogWarning($"Detected {flakyTests.Count} flaky tests");
// Create work item for each flaky test
foreach (var flaky in flakyTests)
{
await _devOpsClient.CreateWorkItemAsync(new
{
Fields = new Dictionary<string, object>
{
["System.Title"] = $"Flaky Test: {flaky.TestName}",
["System.WorkItemType"] = "Bug",
["System.Description"] = $@"
Test has {flaky.PassRate:P0} pass rate over {flaky.TotalRuns} runs (threshold: 95%).
**Failure Count**: {flaky.FailureCount}
**Pass Rate**: {flaky.PassRate:P1}
**Action Required**:
1. Investigate test for race conditions, timing dependencies, shared state
2. Fix root cause or quarantine test (mark with [Ignore])
3. Re-enable after fix and validate 100% pass rate over 10 runs
",
["System.Tags"] = "flaky-test; test-quality",
["Microsoft.VSTS.Common.Priority"] = 2
}
});
}
// Send summary to QA team
await SendFlakyTestReportAsync(flakyTests);
}
else
{
log.LogInformation("✅ No flaky tests detected");
}
}
Flaky Test Quarantine:
// Quarantine flaky test until fixed
[Fact(Skip = "Flaky: Timing-dependent; see work item #12345")]
public async Task Should_Process_Event_Concurrently()
{
// Test disabled until race condition fixed
}
// OR: Mark with custom attribute for reporting
[Fact]
[Trait("Category", "Flaky")]
[Trait("WorkItem", "12345")]
public async Task Should_Process_Event_Concurrently()
{
// Test runs but tracked as flaky
}
Coverage Exclusions¶
Purpose: Exclude auto-generated code, third-party code, and infrastructure code from coverage calculations to focus on business logic.
Exclusion Categories:
<!-- CodeCoverage.runsettings -->
<Configuration>
<!-- Exclude by Assembly Name -->
<Exclude>
[*.Tests]*, <!-- All test assemblies -->
[*]*.Migrations.*, <!-- EF Core migrations -->
[*]*.Program, <!-- Program.cs entry point -->
[*]*.Startup, <!-- Startup.cs DI config -->
[xunit.*]*, <!-- xUnit framework -->
[Moq]* <!-- Moq mocking framework -->
</Exclude>
<!-- Exclude by Attribute -->
<ExcludeByAttribute>
Obsolete, <!-- Deprecated code -->
GeneratedCode, <!-- Auto-generated (T4, Swagger) -->
CompilerGenerated, <!-- Compiler-generated (closures) -->
ExcludeFromCodeCoverage <!-- Explicitly excluded -->
</ExcludeByAttribute>
<!-- Exclude by File Pattern -->
<ExcludeByFile>
**/*Designer.cs, <!-- WinForms/WPF designers -->
**/obj/**, <!-- Build output -->
**/bin/**, <!-- Build output -->
**/Migrations/**, <!-- EF migrations -->
**/*.Generated.cs, <!-- Generated files -->
**/GlobalUsings.cs <!-- Global usings (C# 10+) -->
</ExcludeByFile>
<!-- Include Only Source Directories -->
<IncludeDirectory>src/</IncludeDirectory>
<!-- Skip Auto-Properties (getters/setters) -->
<SkipAutoProps>true</SkipAutoProps>
</Configuration>
Explicit Exclusion (via Attribute):
// Exclude infrastructure code from coverage
[ExcludeFromCodeCoverage]
public class ApplicationDbContextFactory : IDesignTimeDbContextFactory<ApplicationDbContext>
{
// Design-time factory for EF migrations (not covered by tests)
public ApplicationDbContext CreateDbContext(string[] args)
{
var optionsBuilder = new DbContextOptionsBuilder<ApplicationDbContext>();
optionsBuilder.UseSqlServer("Server=(localdb)\\mssqllocaldb;Database=DesignTime;");
return new ApplicationDbContext(optionsBuilder.Options);
}
}
// Exclude obsolete code from coverage (scheduled for removal)
[Obsolete("Use ProcessEventV2Async instead")]
[ExcludeFromCodeCoverage]
public async Task ProcessEventAsync(AuditEvent evt)
{
// Legacy method; coverage not enforced
}
Test Organization & Naming¶
Purpose: Consistent test organization and naming improve discoverability, maintainability, and coverage analysis.
Test Project Structure:
ConnectSoft.ATP.Ingestion.Tests/
├── Unit/
│ ├── Controllers/
│ │ ├── AuditEventsControllerTests.cs
│ │ └── HealthControllerTests.cs
│ ├── Services/
│ │ ├── EventValidationServiceTests.cs
│ │ └── TenantIsolationServiceTests.cs
│ ├── Validators/
│ │ └── AuditEventValidatorTests.cs
│ └── Models/
│ └── AuditEventTests.cs
│
├── Integration/
│ ├── Repositories/
│ │ ├── AuditEventRepositoryTests.cs # Requires SQL container
│ │ └── CacheRepositoryTests.cs # Requires Redis container
│ ├── MessageBus/
│ │ └── EventPublisherTests.cs # Requires RabbitMQ container
│ └── EndToEnd/
│ └── IngestionWorkflowTests.cs # Full workflow (API → DB → Bus)
│
├── TestHelpers/
│ ├── Builders/
│ │ └── AuditEventBuilder.cs # Test data builder
│ ├── Fixtures/
│ │ └── DatabaseFixture.cs # Shared test fixtures
│ └── Mocks/
│ └── MockTimeProvider.cs # Time abstraction mock
│
└── CodeCoverage.runsettings
Test Naming Convention:
// Pattern: Should_ExpectedBehavior_When_StateUnderTest
// ✅ GOOD: Clear, descriptive test names
public class AuditEventValidatorTests
{
[Fact]
public void Should_ReturnValid_When_EventHasAllRequiredFields()
{
// Arrange
var evt = new AuditEventBuilder()
.WithTenantId(Guid.NewGuid())
.WithAction("UserLogin")
.WithTimestamp(DateTime.UtcNow)
.Build();
var validator = new AuditEventValidator();
// Act
var result = validator.Validate(evt);
// Assert
Assert.True(result.IsValid);
Assert.Empty(result.Errors);
}
[Fact]
public void Should_ReturnInvalid_When_TenantIdIsMissing()
{
// Arrange
var evt = new AuditEventBuilder()
.WithAction("UserLogin")
.WithTimestamp(DateTime.UtcNow)
.Build(); // Missing TenantId
var validator = new AuditEventValidator();
// Act
var result = validator.Validate(evt);
// Assert
Assert.False(result.IsValid);
Assert.Contains(result.Errors, e => e.PropertyName == "TenantId");
}
[Theory]
[InlineData(null)]
[InlineData("")]
[InlineData(" ")]
public void Should_ReturnInvalid_When_ActionIsNullOrWhitespace(string action)
{
// Arrange
var evt = new AuditEventBuilder()
.WithTenantId(Guid.NewGuid())
.WithAction(action)
.WithTimestamp(DateTime.UtcNow)
.Build();
var validator = new AuditEventValidator();
// Act
var result = validator.Validate(evt);
// Assert
Assert.False(result.IsValid);
Assert.Contains(result.Errors, e => e.PropertyName == "Action");
}
}
// ❌ BAD: Unclear test names
public class AuditEventValidatorTests
{
[Fact]
public void Test1() // What does this test?
{
var validator = new AuditEventValidator();
var result = validator.Validate(new AuditEvent());
Assert.False(result.IsValid);
}
}
Coverage Report Analysis¶
Purpose: Provide actionable insights into uncovered code to guide test authoring.
Coverage Report Formats:
# Generate multiple coverage formats
- task: reportgenerator@5
inputs:
reports: '$(Agent.TempDirectory)/**/coverage.cobertura.xml'
targetdir: '$(Build.ArtifactStagingDirectory)/coverage-report'
reporttypes: 'HtmlInline_AzurePipelines;Cobertura;Badges;MarkdownSummary'
displayName: 'Generate Coverage Report'
# Publish as Azure DevOps artifact
- task: PublishBuildArtifacts@1
inputs:
PathtoPublish: '$(Build.ArtifactStagingDirectory)/coverage-report'
ArtifactName: 'code-coverage-$(Build.BuildNumber)'
displayName: 'Publish Coverage Report'
Coverage Report Summary (Markdown):
# Code Coverage Summary
**Build**: 1.0.123
**Date**: 2025-01-15 14:30:00 UTC
**Branch**: main
---
## Overall Coverage
| Metric | Value | Threshold | Status |
|--------|-------|-----------|--------|
| Line Coverage | 73.5% | ≥70% | ✅ Pass |
| Branch Coverage | 64.2% | ≥60% | ✅ Pass |
| Method Coverage | 78.1% | — | ℹ️ Info |
---
## Coverage by Project
| Project | Line Coverage | Branch Coverage | Uncovered Lines |
|---------|---------------|-----------------|-----------------|
| Ingestion.API | 76.3% (↑1.2%) | 67.1% | 234 / 988 |
| Ingestion.Domain | 82.1% (↑0.5%) | 74.3% | 89 / 497 |
| Ingestion.Infrastructure | 65.4% (↓2.1%) ⚠️ | 58.9% | 412 / 1192 |
---
## Uncovered Code (Top 5 Files)
1. **EventRepository.cs** (48.2% covered)
- Lines 45-89: Delete methods (no tests)
- Lines 123-156: Bulk insert (no tests)
- **Action**: Add integration tests for delete/bulk operations
2. **TenantIsolationService.cs** (55.7% covered)
- Lines 34-67: Edge case validation (no tests)
- **Action**: Add unit tests for edge cases
3. **CacheInvalidationService.cs** (61.3% covered)
- Lines 12-45: Redis connection retry logic (no tests)
- **Action**: Add integration tests with Redis failures
4. **EventSerializer.cs** (68.9% covered)
- Lines 78-102: Error handling paths (no tests)
- **Action**: Add tests for serialization errors
5. **HealthCheckService.cs** (72.4% covered)
- Lines 56-89: Dependency health checks (no tests)
- **Action**: Add tests for dependency failures
Per-Service Coverage Configuration¶
Purpose: Apply service-specific thresholds via pipeline variables to enforce different coverage requirements.
Azure DevOps Variable Groups:
# Variable Group: ATP-Coverage-Thresholds
variables:
- name: Coverage.Ingestion.Line
value: 75
- name: Coverage.Ingestion.Branch
value: 65
- name: Coverage.Query.Line
value: 80
- name: Coverage.Query.Branch
value: 70
- name: Coverage.Integrity.Line
value: 85
- name: Coverage.Integrity.Branch
value: 75
- name: Coverage.Export.Line
value: 70
- name: Coverage.Export.Branch
value: 60
- name: Coverage.Policy.Line
value: 80
- name: Coverage.Policy.Branch
value: 70
- name: Coverage.Search.Line
value: 70
- name: Coverage.Search.Branch
value: 60
- name: Coverage.Gateway.Line
value: 65
- name: Coverage.Gateway.Branch
value: 55
Pipeline Variable Usage:
# azure-pipelines.yml (per service)
variables:
- group: ATP-Coverage-Thresholds
- name: coverageThreshold
value: $[variables['Coverage.Ingestion.Line']] # Service-specific
- task: BuildQualityChecks@8
inputs:
checkCoverage: true
coverageThreshold: $(coverageThreshold) # Uses service-specific value
displayName: 'Enforce Coverage: $(coverageThreshold)%'
Summary¶
- Test Coverage Gates: Execute after successful build; 3-5 minute duration; 100% test pass rate required
- Service-Specific Thresholds: Ingestion (75%), Query (80%), Integrity (85%), Export (70%), Policy (80%), Search (70%), Gateway (65%)
- Threshold Rationale: Based on service criticality, complexity, and risk profile (Critical > High > Medium > Low)
- Baseline Protection: Prevents coverage regression by comparing to previous builds; zero tolerance for coverage drops
- Force New Baseline: Requires ADR documentation, Lead Architect approval, monitored over 3 builds
- Test Quality Metrics: Pass rate (100%), duration (unit <30s, integration <5min), flaky rate (<5%), assertion density (≥1.5)
- Flaky Test Detection: Daily automated scan flagging tests with <95% historical pass rate; work items created automatically
- Coverage Exclusions: Auto-generated code, migrations, Program.cs, test assemblies excluded via .runsettings
- Test Organization: Unit/Integration/E2E folder structure; Should_ExpectedBehavior_When_StateUnderTest naming
- Coverage Reports: HTML, Cobertura XML, Markdown summary with uncovered code analysis
- Per-Service Configuration: Azure DevOps variable groups for service-specific thresholds
Security Gates (Deep Dive)¶
Security gates are critical enforcement points that prevent vulnerable code, exposed secrets, and insecure dependencies from reaching production. ATP enforces zero tolerance for critical/high vulnerabilities and implements automated secret detection with mandatory rotation workflows.
Philosophy: Security is non-negotiable. ATP blocks builds with critical/high vulnerabilities, detected secrets, or insecure configurations. Every security finding is tracked, remediated, or formally risk-accepted with time-bound approvals.
Security Gate Workflow¶
graph TD
A[Test Coverage Passed] --> B[Dependency Scanning]
B --> C{Critical/High CVEs?}
C -->|Yes| D[Dependency Scan Failed ❌]
C -->|No| E[Secrets Detection]
E --> F{Secrets Found?}
F -->|Yes| G[Secrets Detected ❌]
F -->|No| H[SAST Analysis]
H --> I{Security Hotspots?}
I -->|Yes| J[SAST Failed ❌]
I -->|No| K[Container Scan]
K --> L{Image Vulnerabilities?}
L -->|Yes| M[Container Scan Failed ❌]
L -->|No| N[License Compliance]
N --> O{Incompatible Licenses?}
O -->|Yes| P[License Violation ❌]
O -->|No| Q[Security Gates Passed ✅]
D --> R[Pipeline Stopped]
G --> R
J --> R
M --> R
P --> R
Q --> S[Proceed to Compliance Gates]
style D fill:#ff6b6b
style G fill:#ff6b6b
style J fill:#ff6b6b
style M fill:#ff6b6b
style P fill:#ff6b6b
style Q fill:#90EE90
Typical Security Gate Duration: 5-8 minutes
Dependency Scanning (OWASP Dependency-Check)¶
Purpose: Detect vulnerable NuGet packages and transitive dependencies with known CVEs (Common Vulnerabilities and Exposures).
Tool: OWASP Dependency-Check — Open-source vulnerability scanner with NVD (National Vulnerability Database) integration
Threshold:
- CVSS ≥9.0 (Critical): ❌ Block build immediately; fix within 24 hours
- CVSS 7.0-8.9 (High): ❌ Block build; fix within 7 days or document risk acceptance
- CVSS 4.0-6.9 (Medium): ⚠️ Warning; fix within 30 days
- CVSS 0.1-3.9 (Low): ℹ️ Info; track in security backlog
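The CVSS-to-action mapping above is a simple banded lookup; a Python sketch (function name is illustrative):

```python
# CVSS v3 base score -> (severity, blocks_build), matching the bands above.
def cvss_action(score: float) -> tuple[str, bool]:
    if score >= 9.0:
        return "Critical", True   # block immediately; fix within 24h
    if score >= 7.0:
        return "High", True       # block; fix within 7 days or risk-accept
    if score >= 4.0:
        return "Medium", False    # warning; fix within 30 days
    if score > 0.0:
        return "Low", False       # track in security backlog
    return "None", False
```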
Azure Pipelines Configuration:
# OWASP Dependency-Check Gate
- task: dependency-check-build-task@6
inputs:
projectName: 'ConnectSoft.ATP.Ingestion'
scanPath: '$(Build.SourcesDirectory)'
format: 'HTML,JSON,XML'
failOnCVSS: 7 # Block on High/Critical (CVSS ≥7)
suppressionFile: 'dependency-check-suppressions.xml'
# NVD API Configuration (faster updates)
nvdApiKey: $(NVD_API_KEY)
enableExperimental: false
# Data directory (cache for faster scans)
dataDirectory: '$(Pipeline.Workspace)/dependency-check-data'
# Advanced options
enableRetired: true # Check retired dependencies
warnOnCVSSViolation: true
displayName: 'OWASP Dependency Scan'
# Fail pipeline on critical/high vulnerabilities
continueOnError: false
# Publish scan results
- task: PublishBuildArtifacts@1
inputs:
PathtoPublish: '$(Build.SourcesDirectory)/dependency-check-report.html'
ArtifactName: 'dependency-check-$(Build.BuildNumber)'
displayName: 'Publish Dependency Scan Report'
condition: always() # Publish even on failure
CVSS Severity Matrix:
| Severity | CVSS Score | ATP Action | SLA | Approval Required | Production Blocker |
|---|---|---|---|---|---|
| Critical | 9.0-10.0 | ❌ Block build immediately | Fix within 24h | None (must fix) | ✅ Yes |
| High | 7.0-8.9 | ❌ Block build; patch or risk-accept | Fix within 7 days | Security Officer | ✅ Yes |
| Medium | 4.0-6.9 | ⚠️ Warning; track in backlog | Fix within 30 days | Tech Lead | ❌ No (warning only) |
| Low | 0.1-3.9 | ℹ️ Info; track in backlog | Fix in next release | None | ❌ No |
| None | 0.0 | ℹ️ Info; no action | N/A | None | ❌ No |
Example Vulnerability Report:
// dependency-check-report.json (excerpt)
{
"dependencies": [
{
"fileName": "System.Text.Json.dll",
"filePath": "/usr/share/dotnet/shared/Microsoft.NETCore.App/8.0.0/System.Text.Json.dll",
"sha256": "abc123...",
"vulnerabilities": [
{
"name": "CVE-2024-12345",
"severity": "CRITICAL",
"cvssv3": {
"baseScore": 9.8,
"attackVector": "NETWORK",
"attackComplexity": "LOW",
"privilegesRequired": "NONE",
"userInteraction": "NONE",
"scope": "UNCHANGED",
"confidentialityImpact": "HIGH",
"integrityImpact": "HIGH",
"availabilityImpact": "HIGH"
},
"description": "System.Text.Json deserialization vulnerability allows remote code execution",
"references": [
"https://nvd.nist.gov/vuln/detail/CVE-2024-12345",
"https://github.com/dotnet/runtime/security/advisories/GHSA-xxxx-xxxx-xxxx"
]
}
]
}
]
}
Vulnerability Suppression Workflow¶
Purpose: Allow temporary exceptions for false positives or mitigated vulnerabilities with formal approval and time-bound expiration.
Suppression File (dependency-check-suppressions.xml):
<?xml version="1.0" encoding="UTF-8"?>
<suppressions xmlns="https://jeremylong.github.io/DependencyCheck/dependency-suppression.1.3.xsd">
<!-- Example 1: False Positive -->
<suppress>
<notes>
False positive: CVE-2023-12345 affects Linux builds only; ATP runs on Windows.
Approved by: security-team@connectsoft.example
Approval Date: 2025-01-10
Expires: 2025-07-10 (6 months)
Review Date: 2025-06-30
</notes>
<packageUrl regex="true">^pkg:nuget/Newtonsoft\.Json@12\.0\.3$</packageUrl>
<cve>CVE-2023-12345</cve>
</suppress>
<!-- Example 2: Mitigated Risk -->
<suppress>
<notes>
Risk Acceptance: CVE-2024-67890 in System.IdentityModel.Tokens.Jwt 6.x
Mitigation: Input validation prevents exploit; upgrade blocked by breaking changes.
Approved by: Lead Architect (John Doe), Security Officer (Jane Smith)
Approval Date: 2025-01-15
Expires: 2025-04-15 (3 months)
Action Plan: Upgrade to 7.x in Q2 2025 (requires API changes)
</notes>
<packageUrl regex="true">^pkg:nuget/System\.IdentityModel\.Tokens\.Jwt@6\.\d+\.\d+$</packageUrl>
<cve>CVE-2024-67890</cve>
</suppress>
<!-- Example 3: Vendor-Confirmed Fix -->
<suppress until="2025-03-01">
<notes>
Vendor has confirmed fix in next release (March 2025).
Workaround applied: Input sanitization before library call.
Approved by: Security Officer
Temporary suppression until vendor patch available.
</notes>
<packageUrl regex="true">^pkg:nuget/ThirdPartyLibrary@.*$</packageUrl>
<cve>CVE-2024-11111</cve>
</suppress>
</suppressions>
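Because suppressions are time-bound, an automated check can flag entries past their until date before they silently mask real findings; a Python sketch against the schema used above (function name and date handling are illustrative):

```python
# Flag <suppress until="..."> entries past their expiry. The namespace URI and
# element names follow the Dependency-Check suppression schema shown above.
import xml.etree.ElementTree as ET
from datetime import date

NS = "{https://jeremylong.github.io/DependencyCheck/dependency-suppression.1.3.xsd}"

def expired_suppressions(xml_text: str, today: date) -> list[str]:
    """Return CVE ids of suppressions whose 'until' date has passed."""
    root = ET.fromstring(xml_text)
    expired = []
    for sup in root.iter(f"{NS}suppress"):
        until = sup.get("until")
        if until and date.fromisoformat(until[:10]) < today:
            expired.append(sup.findtext(f"{NS}cve", default="(no CVE)"))
    return expired
```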
Suppression Approval Process:
stateDiagram-v2
[*] --> VulnerabilityDetected: OWASP scan finds CVE
VulnerabilityDetected --> Triage: Security team investigates
Triage --> FalsePositive: Not exploitable in ATP context
Triage --> TruePositive: Legitimate vulnerability
FalsePositive --> DocumentSuppression: Create suppression entry
TruePositive --> PatchAvailable: Check for patch
PatchAvailable --> ApplyPatch: Upgrade dependency
PatchAvailable --> RiskAcceptance: No patch available
RiskAcceptance --> MitigationExists: Evaluate controls
MitigationExists --> DocumentSuppression: Mitigated; temporary suppression
MitigationExists --> BlockBuild: No mitigation; must fix
DocumentSuppression --> SecurityReview: Security Officer reviews
SecurityReview --> Approved: Suppression approved
SecurityReview --> Rejected: Must fix or block
Approved --> TimeBoundSuppression: Add to suppressions.xml with expiry
ApplyPatch --> [*]: Build passes
BlockBuild --> [*]: Build fails
Rejected --> BlockBuild
TimeBoundSuppression --> [*]: Build passes with suppression
Risk Acceptance Form:
# Security Risk Acceptance Form
# File: security-risk-acceptances/CVE-2024-67890-System.IdentityModel.Tokens.Jwt.md
---
title: Risk Acceptance - CVE-2024-67890 (System.IdentityModel.Tokens.Jwt)
cve: CVE-2024-67890
cvssScore: 7.5 (High)
package: System.IdentityModel.Tokens.Jwt
version: 6.34.0
detectedDate: 2025-01-15
approvalDate: 2025-01-17
expirationDate: 2025-04-17
status: Approved
---
## Vulnerability Description
JWT signature validation bypass in System.IdentityModel.Tokens.Jwt 6.x allows attackers to forge tokens.
**Reference**: https://nvd.nist.gov/vuln/detail/CVE-2024-67890
## Impact Assessment
- **Exploitability**: Requires attacker to know signing algorithm (RS256 used in ATP)
- **Attack Vector**: Network-based; requires JWT manipulation
- **Affected Components**: Gateway service (all others validate via Gateway)
- **Tenant Impact**: Could allow cross-tenant access if exploited
## Mitigation Controls
1. **Additional Validation**: Custom JWT validator checks audience, issuer, expiration
2. **Rate Limiting**: API rate limiting prevents brute-force attempts
3. **Monitoring**: Anomaly detection alerts on unusual token patterns
4. **Network Segmentation**: Gateway isolated in separate subnet
## Justification for Temporary Acceptance
- Vendor fix scheduled for System.IdentityModel.Tokens.Jwt 7.0 (March 2025)
- Upgrade to 7.0 requires breaking API changes (planned for Q2 2025)
- Mitigation controls reduce risk from High (7.5) to Medium (4.2)
## Action Plan
- Q1 2025: Implement additional validation layer (completed)
- Q2 2025: Upgrade to System.IdentityModel.Tokens.Jwt 7.x
- Q2 2025: Remove suppression after upgrade
## Approval
- **Security Officer**: ✅ Approved (Jane Smith, 2025-01-17)
- **Lead Architect**: ✅ Approved (John Doe, 2025-01-17)
- **SRE Lead**: ✅ Consulted (Mike Johnson, 2025-01-16)
## Review Schedule
- Monthly review: 2025-02-17, 2025-03-17
- Expiration: 2025-04-17 (auto-removed from suppressions.xml)
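The auto-removal step above implies a check that a time-bound suppression cannot outlive its approval. A minimal sketch of such a gate is shown below; the demo file name and line-oriented date extraction are illustrative assumptions, not ATP's actual tooling.

```shell
#!/bin/sh
# Sketch: fail the build when a <suppress until="..."> entry has passed its
# expiry date, so an expired suppression cannot silently keep masking a CVE.
# File path and parsing are illustrative, not the real ATP implementation.

check_expiry() {
  file="$1"
  today=$(date -u +%Y%m%d)
  count=0
  # Extract each until="YYYY-MM-DD" attribute; compare as YYYYMMDD integers.
  for until in $(grep -o 'until="[0-9-]*"' "$file" | cut -d'"' -f2); do
    if [ "$(echo "$until" | tr -d '-')" -lt "$today" ]; then
      echo "EXPIRED suppression (until=$until) in $file"
      count=$((count + 1))
    fi
  done
  [ "$count" -eq 0 ]
}

# Demo file with one expired and one still-valid entry.
cat > /tmp/demo-suppressions.xml <<'EOF'
<suppressions>
  <suppress until="2020-03-01"><cve>CVE-2019-0001</cve></suppress>
  <suppress until="2099-01-01"><cve>CVE-2024-0002</cve></suppress>
</suppressions>
EOF

check_expiry /tmp/demo-suppressions.xml || echo "GATE: remove or re-approve expired suppressions"
```

In a pipeline, the non-zero exit from `check_expiry` would fail the security stage, forcing the owner to either remove the entry or re-run the approval workflow.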
Secrets Detection (CredScan / GitGuardian)¶
Purpose: Detect hardcoded secrets (API keys, passwords, tokens, certificates) in source code and prevent them from being committed.
Tool: CredScan (Microsoft Credential Scanner) or GitGuardian for GitHub repos
Enforcement: ❌ Block build immediately if secrets detected; no exceptions.
Azure Pipelines Configuration:
# Secrets Detection Gate
- task: CredScan@3
inputs:
toolMajorVersion: 'V2'
suppressionsFile: 'credscan-suppressions.json'
outputFormat: 'sarif'
debugMode: false
# Scan all text files
scanFolder: '$(Build.SourcesDirectory)'
# Exclude known safe files
excludePathsFromScan: |
**/node_modules/**
**/bin/**
**/obj/**
**/*.min.js
**/packages/**
displayName: 'Scan for Secrets (CredScan)'
# Always fail on secrets
continueOnError: false
# Analyze CredScan results
- task: PostAnalysis@2
inputs:
CredScan: true
ToolLogsNotFoundAction: 'Error' # Fail if CredScan didn't run
displayName: 'Post-Analysis: Validate No Secrets'
Detected Secret Patterns:
| Pattern Type | Regex Pattern | Example | Action |
|---|---|---|---|
| API Keys | `[a-zA-Z0-9]{32,}` with entropy check | `api_key=sk_live_123abc456def...` | ❌ Block; rotate key |
| Connection Strings | `Server=.*;Password=.*;` | `Server=sql.example.com;Password=P@ssw0rd` | ❌ Block; use Key Vault |
| JWT Tokens | `eyJ[a-zA-Z0-9_-]*\.eyJ[a-zA-Z0-9_-]*\.` | `eyJhbGciOiJIUzI1NiIsInR5cCI6...` | ❌ Block; remove token |
| Private Keys | `-----BEGIN (RSA\|PRIVATE) KEY-----` | `-----BEGIN RSA PRIVATE KEY-----` | ❌ Block; use Key Vault |
| Azure Storage Keys | `AccountKey=[A-Za-z0-9+/]{88}==` | `AccountKey=abc123...xyz==` | ❌ Block; regenerate key |
| AWS Credentials | `AKIA[0-9A-Z]{16}` | `AKIAIOSFODNN7EXAMPLE` | ❌ Block; rotate credentials |
| GitHub Tokens | `ghp_[a-zA-Z0-9]{36}` | `ghp_abc123def456...` | ❌ Block; revoke token |
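A few of the simpler patterns above can be spot-checked locally before committing. The sketch below is a fast first line of defense only; CredScan and GitGuardian add entropy analysis and many more rules, so it is not a substitute for the pipeline gate.

```shell
#!/bin/sh
# Sketch: grep-based pre-commit spot check for a subset of the secret
# patterns in the table above (AWS key IDs, GitHub tokens, private keys,
# Azure storage keys). Patterns are illustrative, not the CredScan ruleset.

scan_file() {
  grep -nE \
    -e 'AKIA[0-9A-Z]{16}' \
    -e 'ghp_[a-zA-Z0-9]{36}' \
    -e '-----BEGIN (RSA |EC )?PRIVATE KEY-----' \
    -e 'AccountKey=[A-Za-z0-9+/]{88}==' \
    "$1"
}

# Demo: a file containing a fake AWS-style key ID (documentation example value).
printf 'aws_key = "AKIAIOSFODNN7EXAMPLE"\n' > /tmp/demo.cfg
if scan_file /tmp/demo.cfg; then
  echo "BLOCK: potential secret found; rotate it and load from Key Vault instead"
fi
```

Wired into a Git pre-commit hook, a non-empty `scan_file` result would abort the commit before the secret ever reaches the remote.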
Secrets Detection Example:
// ❌ BAD: Hardcoded connection string (CredScan WILL detect)
public class DatabaseConnection
{
private const string ConnectionString = "Server=atp-sql-prod.database.windows.net;Password=P@ssw0rd123!"; // ❌ BLOCKED
}
// ✅ GOOD: Connection string from configuration
public class DatabaseConnection
{
private readonly string _connectionString;
public DatabaseConnection(IConfiguration configuration)
{
_connectionString = configuration.GetConnectionString("DefaultConnection"); // ✅ SAFE
}
}
// ❌ BAD: API key in appsettings.json (CredScan detects in JSON files)
{
"ExternalApi": {
"ApiKey": "sk_live_123abc456def789ghi" // ❌ BLOCKED
}
}
// ✅ GOOD: API key from Key Vault
{
"ExternalApi": {
"ApiKey": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod.vault.azure.net/secrets/ExternalApiKey)" // ✅ SAFE
}
}
// ❌ BAD: Password in test code
[Fact]
public void Should_Connect_To_Database()
{
var connStr = "Server=localhost;Password=TestP@ss123!"; // ❌ BLOCKED (even in tests)
// ...
}
// ✅ GOOD: Password from environment variable
[Fact]
public void Should_Connect_To_Database()
{
var connStr = Environment.GetEnvironmentVariable("TEST_DB_CONNECTION_STRING"); // ✅ SAFE
// ...
}
CredScan Suppression (False Positives Only):
// credscan-suppressions.json
{
"suppressions": [
{
"placeholder": "Password123!",
"_justification": "Example password in documentation comment; not actual secret"
},
{
"placeholder": "sk_test_123456789",
"_justification": "Test API key example in unit test; not a real key"
},
{
"file": "docs/examples/connection-string.md",
"_justification": "Documentation example with fake credentials"
}
]
}
Secret Rotation Procedure (when secrets detected):
#!/bin/bash
# rotate-leaked-secret.sh
SECRET_TYPE=$1 # api-key, connection-string, certificate, etc.
SECRET_NAME=$2 # Name in Key Vault
echo "⚠️ Secret leaked: $SECRET_NAME"
echo "Initiating emergency rotation..."
# Step 1: Revoke compromised secret immediately
az keyvault secret set-attributes \
--vault-name atp-keyvault-prod-eus \
--name $SECRET_NAME \
--enabled false
echo "✅ Secret disabled in Key Vault"
# Step 2: Generate new secret
NEW_SECRET=$(openssl rand -base64 32)
az keyvault secret set \
--vault-name atp-keyvault-prod-eus \
--name $SECRET_NAME \
--value "$NEW_SECRET"
echo "✅ New secret generated and stored"
# Step 3: Restart services to pick up new secret
az webapp restart \
--name atp-ingestion-prod-eus \
--resource-group ATP-Prod-EUS-RG
echo "✅ Services restarted with new secret"
# Step 4: Notify security team
az boards work-item create \
--type "Incident" \
--title "Secret Leak Detected: $SECRET_NAME" \
--description "Secret detected in code commit. Rotated immediately.\n\nSecret: $SECRET_NAME\nBuild: $(Build.BuildNumber)\nCommit: $(Build.SourceVersion)" \
--assigned-to "security-team@connectsoft.example" \
--fields Priority=1
echo "✅ Incident created for security review"
Container Image Scanning (Trivy)¶
Purpose: Scan Docker images for vulnerabilities in base images, OS packages, and application dependencies before pushing to Azure Container Registry (ACR).
Tool: Trivy — Open-source container vulnerability scanner
Threshold:
- Critical: ❌ Block push to registry
- High: ❌ Block push to registry (require patch or risk acceptance)
- Medium: ⚠️ Warning; track in security backlog
- Low: ℹ️ Info; no action required
Azure Pipelines Configuration:
# Container Image Scanning Gate
- task: Docker@2
inputs:
command: 'build'
dockerfile: '$(dockerfile)'
repository: '$(imageRepository)'
tags: |
$(Build.BuildNumber)
latest
displayName: 'Build Docker Image'
# Trivy scan (before push)
- script: |
# Install Trivy
wget -qO - https://aquasecurity.github.io/trivy-repo/deb/public.key | sudo apt-key add -
echo "deb https://aquasecurity.github.io/trivy-repo/deb $(lsb_release -sc) main" | sudo tee -a /etc/apt/sources.list.d/trivy.list
sudo apt-get update && sudo apt-get install trivy
# Scan image for HIGH/CRITICAL vulnerabilities
trivy image \
--severity HIGH,CRITICAL \
--exit-code 1 \
--no-progress \
--format json \
--output trivy-report.json \
$(containerRegistry)/$(imageRepository):$(Build.BuildNumber)
# Generate HTML report for artifact
trivy image \
--severity HIGH,CRITICAL,MEDIUM,LOW \
--format template \
--template "@contrib/html.tpl" \
--output trivy-report.html \
$(containerRegistry)/$(imageRepository):$(Build.BuildNumber)
displayName: 'Trivy Scan Docker Image'
continueOnError: false # Block on HIGH/CRITICAL
# Publish Trivy report
- task: PublishBuildArtifacts@1
inputs:
PathtoPublish: 'trivy-report.html'
ArtifactName: 'trivy-scan-$(Build.BuildNumber)'
displayName: 'Publish Trivy Report'
condition: always()
# Only push if scan passed
- task: Docker@2
inputs:
command: 'push'
repository: '$(imageRepository)'
containerRegistry: '$(dockerRegistryServiceConnection)'
tags: |
$(Build.BuildNumber)
latest
displayName: 'Push Docker Image to ACR'
condition: succeeded() # Only push if Trivy scan passed
Trivy Report Example:
// trivy-report.json (excerpt)
{
"Results": [
{
"Target": "connectsoft.azurecr.io/atp/ingestion:1.0.123",
"Class": "os-pkgs",
"Type": "ubuntu",
"Vulnerabilities": [
{
"VulnerabilityID": "CVE-2024-99999",
"PkgName": "openssl",
"InstalledVersion": "3.0.2-0ubuntu1.10",
"FixedVersion": "3.0.2-0ubuntu1.12",
"Severity": "CRITICAL",
"Description": "OpenSSL buffer overflow allows remote code execution",
"References": [
"https://nvd.nist.gov/vuln/detail/CVE-2024-99999"
],
"PrimaryURL": "https://ubuntu.com/security/CVE-2024-99999",
"Title": "openssl: buffer overflow in SSL handshake"
}
]
},
{
"Target": "app/ConnectSoft.ATP.Ingestion.dll",
"Class": "lang-pkgs",
"Type": "nuget",
"Vulnerabilities": [
{
"VulnerabilityID": "GHSA-xxxx-yyyy-zzzz",
"PkgName": "System.Text.Json",
"InstalledVersion": "8.0.0",
"FixedVersion": "8.0.1",
"Severity": "HIGH",
"Description": "Deserialization vulnerability in System.Text.Json"
}
]
}
]
}
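The gate decision can be recomputed from the report itself, which is useful for dashboards and local debugging even though `--exit-code 1` already enforces it in the pipeline. The sketch below uses a line-oriented count against a demo report mirroring the excerpt's shape; real reports should be parsed with jq, as elsewhere in this document.

```shell
#!/bin/sh
# Sketch: derive the Critical/High blocking count from a Trivy JSON report,
# matching the thresholds above (Critical/High block, Medium/Low do not).
# Demo report below mirrors the excerpt's shape; it is not real scan output.

cat > /tmp/trivy-report.json <<'EOF'
{ "Results": [
  { "Target": "os-pkgs", "Vulnerabilities": [
      { "VulnerabilityID": "CVE-2024-99999", "Severity": "CRITICAL" } ] },
  { "Target": "lang-pkgs", "Vulnerabilities": [
      { "VulnerabilityID": "GHSA-xxxx", "Severity": "HIGH" },
      { "VulnerabilityID": "CVE-2024-00001", "Severity": "MEDIUM" } ] }
] }
EOF

# One finding per line in this demo, so a matching-line count equals the
# finding count (a real report needs proper JSON parsing).
BLOCKERS=$(grep -cE '"Severity": "(CRITICAL|HIGH)"' /tmp/trivy-report.json)
echo "blocking_findings=$BLOCKERS"
if [ "$BLOCKERS" -gt 0 ]; then
  echo "GATE: block push to ACR ($BLOCKERS Critical/High findings)"
fi
```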
Container Hardening Checklist (enforced by Trivy + manual review):
| Hardening Control | Validation | Blocker | Notes |
|---|---|---|---|
| Non-Root User | Trivy checks USER directive | ✅ Yes | Must run as non-root (UID 1000+) |
| No Secrets in Layers | CredScan + Trivy | ✅ Yes | Secrets must be injected at runtime |
| Minimal Base Image | Image size < 200MB | ⚠️ Warning | Prefer distroless or Alpine |
| Up-to-Date Base Image | Base image < 30 days old | ⚠️ Warning | Rebuild monthly to get security patches |
| Health Check | HEALTHCHECK directive present | ⚠️ Warning | Required for Kubernetes liveness/readiness |
| Read-Only Filesystem | Trivy config check | ⚠️ Warning | Prefer read-only root filesystem |
| Drop Capabilities | Trivy config check | ⚠️ Warning | Drop all capabilities except NET_BIND_SERVICE |
Dockerfile Best Practices (enforced by Trivy):
# ✅ GOOD: Secure Dockerfile
FROM mcr.microsoft.com/dotnet/aspnet:8.0-jammy AS base
# Run as non-root user
RUN groupadd -r atpuser && useradd -r -g atpuser atpuser
USER atpuser
WORKDIR /app
EXPOSE 8080
FROM mcr.microsoft.com/dotnet/sdk:8.0-jammy AS build
WORKDIR /src
# Copy only necessary files (avoid secrets)
COPY ["src/ConnectSoft.ATP.Ingestion/ConnectSoft.ATP.Ingestion.csproj", "ConnectSoft.ATP.Ingestion/"]
RUN dotnet restore "ConnectSoft.ATP.Ingestion/ConnectSoft.ATP.Ingestion.csproj"
COPY src/ .
RUN dotnet build "ConnectSoft.ATP.Ingestion/ConnectSoft.ATP.Ingestion.csproj" -c Release -o /app/build
FROM build AS publish
RUN dotnet publish "ConnectSoft.ATP.Ingestion/ConnectSoft.ATP.Ingestion.csproj" -c Release -o /app/publish
FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
# Health check (note: curl is not in the default aspnet image; install it or use a tool-free probe)
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD curl -f http://localhost:8080/health/live || exit 1
ENTRYPOINT ["dotnet", "ConnectSoft.ATP.Ingestion.dll"]
# ❌ BAD: Dockerfile anti-patterns (Trivy will flag)
# FROM ubuntu:latest # Non-specific tag; use specific version
# RUN apt-get update && apt-get install -y curl # Missing clean up
# ENV DB_PASSWORD=P@ssw0rd123! # Hardcoded secret
# USER root # Running as root
# COPY . . # Copies everything including secrets
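The hardening checklist and the anti-patterns above can be spot-checked locally with a minimal grep-based lint, sketched below. Trivy's misconfiguration scanning is far more thorough; this is only a cheap pre-push check, and the rules shown are assumptions about what is worth flagging.

```shell
#!/bin/sh
# Sketch: lint a Dockerfile for three of the hardening controls above:
# a non-root USER, a HEALTHCHECK directive, and a pinned base image tag.

lint_dockerfile() {
  f="$1"
  problems=0
  if ! grep -qE '^USER ' "$f" || grep -qE '^USER[[:space:]]+root[[:space:]]*$' "$f"; then
    echo "FAIL: container runs as root (add a non-root USER)"
    problems=$((problems + 1))
  fi
  if ! grep -q '^HEALTHCHECK' "$f"; then
    echo "WARN: no HEALTHCHECK directive"
    problems=$((problems + 1))
  fi
  if grep -qE '^FROM [^ ]+:latest' "$f"; then
    echo "WARN: base image uses :latest (pin a specific tag)"
    problems=$((problems + 1))
  fi
  echo "problems=$problems"
}

# Demo Dockerfile exhibiting the anti-patterns flagged above.
cat > /tmp/Dockerfile.bad <<'EOF'
FROM ubuntu:latest
USER root
COPY . .
EOF
lint_dockerfile /tmp/Dockerfile.bad
```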
SAST (Static Application Security Testing)¶
Purpose: Detect security vulnerabilities in application code through static analysis (SQL injection, XSS, weak crypto, etc.).
Tool: SonarQube Security Rules (integrated with build quality gates)
Security Rules Enforced:
| Rule ID | Vulnerability Type | Severity | Example | Remediation |
|---|---|---|---|---|
| S2077 | SQL Injection | Blocker | `$"SELECT * FROM Users WHERE Id={id}"` | Use parameterized queries or ORM |
| S3649 | SQL Injection (LINQ) | Blocker | `context.Users.FromSqlRaw($"SELECT * WHERE Id={id}")` | Use `FromSqlInterpolated` |
| S5131 | Cross-Site Scripting (XSS) | Critical | `@Html.Raw(userInput)` | Use `@userInput` (auto-escaped) |
| S4790 | Weak Cryptography | Blocker | `MD5.Create()`, `DES.Create()` | Use SHA256, AES256 |
| S2068 | Hardcoded Credentials | Blocker | `var password = "P@ssw0rd";` | Load from configuration |
| S3330 | HTTP Not HTTPS | Critical | `new HttpClient().GetAsync("http://...")` | Use HTTPS |
| S5122 | CORS Misconfiguration | Critical | `AllowAnyOrigin()` | Specify allowed origins |
| S5042 | Zip Slip | Critical | `zipEntry.FullName` without validation | Validate paths before extraction |
SAST Examples & Fixes:
// ❌ BAD: SQL Injection (S2077, S3649)
public async Task<User> GetUserAsync(string userId)
{
var sql = $"SELECT * FROM Users WHERE UserId = '{userId}'"; // ❌ Injectable
return await _context.Users.FromSqlRaw(sql).FirstOrDefaultAsync();
}
// ✅ GOOD: Parameterized query
public async Task<User> GetUserAsync(string userId)
{
return await _context.Users
.FromSqlInterpolated($"SELECT * FROM Users WHERE UserId = {userId}") // ✅ Parameterized
.FirstOrDefaultAsync();
}
// OR: Use LINQ (preferred)
public async Task<User> GetUserAsync(string userId)
{
return await _context.Users
.Where(u => u.UserId == userId) // ✅ LINQ (safe)
.FirstOrDefaultAsync();
}
// ❌ BAD: XSS Vulnerability (S5131)
public IActionResult DisplayMessage(string message)
{
ViewBag.Message = message;
return View(); // View uses @Html.Raw(ViewBag.Message) ❌
}
// ✅ GOOD: Auto-escaped output
public IActionResult DisplayMessage(string message)
{
ViewBag.Message = message;
return View(); // View uses @ViewBag.Message ✅ (auto-escaped)
}
// ❌ BAD: Weak Cryptography (S4790)
public string HashPassword(string password)
{
using var md5 = MD5.Create(); // ❌ MD5 is cryptographically broken
var hash = md5.ComputeHash(Encoding.UTF8.GetBytes(password));
return Convert.ToBase64String(hash);
}
// ⚠️ PASSES S4790, BUT STILL WEAK FOR PASSWORDS: a fast, unsalted hash
// remains vulnerable to brute-force and rainbow-table attacks
public string HashPassword(string password)
{
using var sha256 = SHA256.Create(); // ✅ Not flagged by S4790, but not a password hash
var hash = sha256.ComputeHash(Encoding.UTF8.GetBytes(password));
return Convert.ToBase64String(hash);
}
// ✅ BETTER: Use BCrypt/Argon2 for password hashing
public string HashPassword(string password)
{
return BCrypt.Net.BCrypt.HashPassword(password, workFactor: 12); // ✅ Industry standard
}
// ❌ BAD: CORS Misconfiguration (S5122)
public void ConfigureServices(IServiceCollection services)
{
services.AddCors(options =>
{
options.AddPolicy("AllowAll", builder =>
{
builder.AllowAnyOrigin() // ❌ Allows any origin (security risk)
.AllowAnyMethod()
.AllowAnyHeader();
});
});
}
// ✅ GOOD: Restrictive CORS
public void ConfigureServices(IServiceCollection services)
{
services.AddCors(options =>
{
options.AddPolicy("ATPPolicy", builder =>
{
builder.WithOrigins("https://atp.connectsoft.com", "https://app.connectsoft.com") // ✅ Specific origins
.WithMethods("GET", "POST", "PUT", "DELETE") // ✅ Specific methods
.WithHeaders("Content-Type", "Authorization"); // ✅ Specific headers
});
});
}
License Compliance Scanning¶
Purpose: Ensure all dependencies have acceptable licenses that comply with ConnectSoft's legal policies (no GPL/AGPL in production).
Tool: dotnet-project-licenses or FOSSA
Acceptable Licenses (Whitelist):
| License | Category | ATP Usage | Notes |
|---|---|---|---|
| MIT | Permissive | ✅ Allowed | Most NuGet packages |
| Apache 2.0 | Permissive | ✅ Allowed | Common in .NET ecosystem |
| BSD (2-Clause / 3-Clause) | Permissive | ✅ Allowed | Widely used |
| ISC | Permissive | ✅ Allowed | Similar to MIT |
| MS-PL | Permissive | ✅ Allowed | Microsoft Public License |
| GPL 2.0/3.0 | Copyleft | ❌ Prohibited | Requires source disclosure |
| AGPL 3.0 | Copyleft | ❌ Prohibited | Network copyleft (service distribution) |
| LGPL 2.1/3.0 | Weak Copyleft | ⚠️ Review Required | Allowed if dynamically linked |
| Custom/Proprietary | Commercial | ⚠️ Review Required | Requires legal review |
License Scanning (Azure Pipelines):
# License Compliance Gate
- script: |
# Install license scanner
dotnet tool install --global dotnet-project-licenses
# Generate license report
dotnet-project-licenses \
--input $(Build.SourcesDirectory) \
--output $(Build.ArtifactStagingDirectory)/licenses \
--export-license-texts \
--projects-filter "^(?!.*Tests).*$" # Exclude test projects
# Check for prohibited licenses
PROHIBITED=$(jq -r '.projects[].packages[] | select(.license == "GPL-2.0" or .license == "GPL-3.0" or .license == "AGPL-3.0") | .packageName' \
$(Build.ArtifactStagingDirectory)/licenses/licenses.json)
if [ -n "$PROHIBITED" ]; then
echo "##vso[task.logissue type=error]Prohibited licenses detected:"
echo "$PROHIBITED"
exit 1
fi
echo "✅ All dependencies have acceptable licenses"
displayName: 'Validate License Compliance'
# Publish license report
- task: PublishBuildArtifacts@1
inputs:
PathtoPublish: '$(Build.ArtifactStagingDirectory)/licenses'
ArtifactName: 'licenses-$(Build.BuildNumber)'
displayName: 'Publish License Report'
License Report Example:
// licenses.json (excerpt)
{
"projects": [
{
"projectName": "ConnectSoft.ATP.Ingestion",
"packages": [
{
"packageName": "System.Text.Json",
"packageVersion": "8.0.0",
"license": "MIT",
"licenseUrl": "https://licenses.nuget.org/MIT"
},
{
"packageName": "Newtonsoft.Json",
"packageVersion": "13.0.3",
"license": "MIT",
"licenseUrl": "https://github.com/JamesNK/Newtonsoft.Json/blob/master/LICENSE.md"
},
{
"packageName": "ProblematicLibrary",
"packageVersion": "1.0.0",
"license": "GPL-3.0", // ❌ PROHIBITED
"licenseUrl": "https://www.gnu.org/licenses/gpl-3.0.en.html"
}
]
}
]
}
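The pipeline above blacklists known-bad licenses; a safer inversion is a whitelist that fails closed on anything unrecognized, routing unknown licenses to legal review instead of letting them slip through. The sketch below uses line-oriented parsing of a demo `licenses.json` for brevity; the pipeline's jq query is the real mechanism.

```shell
#!/bin/sh
# Sketch: whitelist-based license gate. Any license outside the allowed set
# counts as a violation, so a new or unknown license fails the build rather
# than passing silently. Demo data mirrors the report excerpt above.

ALLOWED="MIT|Apache-2.0|BSD-2-Clause|BSD-3-Clause|ISC|MS-PL"

cat > /tmp/licenses.json <<'EOF'
{ "packages": [
  { "packageName": "System.Text.Json", "license": "MIT" },
  { "packageName": "ProblematicLibrary", "license": "GPL-3.0" }
] }
EOF

violations=0
# Pull each "license": "<id>" value (one per line) and test it against the whitelist.
for lic in $(grep -o '"license": "[^"]*"' /tmp/licenses.json | cut -d'"' -f4); do
  if ! echo "$lic" | grep -qE "^($ALLOWED)$"; then
    echo "PROHIBITED or unreviewed license: $lic"
    violations=$((violations + 1))
  fi
done
echo "violations=$violations"
```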
Security Gate Metrics & Monitoring¶
Purpose: Track security posture over time and identify trends in vulnerability detection/remediation.
Security Metrics Dashboard:
# Azure DevOps Security Dashboard
dashboard:
name: "ATP Security Posture"
widgets:
- type: vulnerabilityTrend
title: "Open Vulnerabilities (Last 90 Days)"
query: |
customEvents
| where name == "VulnerabilityDetected"
| summarize count() by Severity, bin(timestamp, 1d)
target: 0 Critical/High
- type: secretsDetection
title: "Secrets Detected (Last 30 Days)"
query: "CredScan Results"
target: 0
- type: remediationTime
title: "Mean Time to Remediate (MTTR)"
query: |
customEvents
| where name in ("VulnerabilityDetected", "VulnerabilityRemediated")
| summarize MTTR = avg(datetime_diff('hour', RemediatedAt, DetectedAt))
target: < 24h (Critical), < 7d (High)
Security KQL Queries:
// Open vulnerabilities by severity (last 30 days)
customEvents
| where name == "VulnerabilityDetected"
| where timestamp > ago(30d)
| extend Severity = tostring(customDimensions.Severity)
| extend Status = tostring(customDimensions.Status)
| where Status == "Open"
| summarize Count = count() by Severity
| order by
case(
Severity == "Critical", 1,
Severity == "High", 2,
Severity == "Medium", 3,
Severity == "Low", 4,
5
)
// Mean Time to Remediate (MTTR) by severity
customEvents
| where name in ("VulnerabilityDetected", "VulnerabilityRemediated")
| where timestamp > ago(90d)
| extend VulnerabilityId = tostring(customDimensions.VulnerabilityId)
| extend Severity = tostring(customDimensions.Severity)
| summarize
DetectedAt = minif(timestamp, name == "VulnerabilityDetected"),
RemediatedAt = maxif(timestamp, name == "VulnerabilityRemediated")
by VulnerabilityId, Severity
| where isnotnull(RemediatedAt)
| extend MTTR_Hours = datetime_diff('hour', RemediatedAt, DetectedAt)
| summarize
AvgMTTR = avg(MTTR_Hours),
P50MTTR = percentile(MTTR_Hours, 50),
P95MTTR = percentile(MTTR_Hours, 95)
by Severity
// Secret detection incidents (last 6 months)
customEvents
| where name == "SecretDetected"
| where timestamp > ago(180d)
| extend SecretType = tostring(customDimensions.SecretType)
| extend Repository = tostring(customDimensions.Repository)
| extend Commit = tostring(customDimensions.CommitSha)
| summarize
IncidentCount = count(),
MostRecentIncident = max(timestamp)
by SecretType, Repository
| order by IncidentCount desc
Security Gate Enforcement Policy¶
Purpose: Define clear policies for vulnerability remediation SLAs and escalation procedures.
Remediation SLA Matrix:
| Severity | CVSS Score | Detection → Fix SLA | Escalation (SLA Breach) | Production Blocker |
|---|---|---|---|---|
| Critical | 9.0-10.0 | 24 hours | Security Officer → CISO | ✅ Yes |
| High | 7.0-8.9 | 7 days | Security Officer → Lead Architect | ✅ Yes |
| Medium | 4.0-6.9 | 30 days | Tech Lead → Security Officer | ❌ No |
| Low | 0.1-3.9 | Next Release | None | ❌ No |
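The SLA matrix maps directly to a small lookup-and-compare routine. A minimal sketch is shown below with the same values as the table; the production check is the Azure Function shown later in this section.

```shell
#!/bin/sh
# Sketch: map severity to its remediation SLA (in hours) per the matrix
# above, then decide whether a finding of a given age has breached it.

sla_hours() {
  case "$1" in
    Critical) echo 24  ;;  # 24 hours
    High)     echo 168 ;;  # 7 days
    Medium)   echo 720 ;;  # 30 days
    *)        echo 0   ;;  # Low: tracked for next release, no hard SLA
  esac
}

check_breach() {  # usage: check_breach <severity> <age_hours>
  limit=$(sla_hours "$1")
  if [ "$limit" -gt 0 ] && [ "$2" -gt "$limit" ]; then
    echo "BREACH: $1 finding is $(( $2 - limit ))h past its ${limit}h SLA"
  else
    echo "OK: $1 finding within SLA"
  fi
}

check_breach Critical 30   # 6h past the 24h SLA -> escalate
check_breach High 100      # within the 168h SLA
```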
Escalation Workflow:
graph TD
A[Vulnerability Detected] --> B{Severity?}
B -->|Critical| C[24h SLA Timer Starts]
B -->|High| D[7d SLA Timer Starts]
B -->|Medium| E[30d SLA Timer Starts]
B -->|Low| F[Track in Backlog]
C --> G{Fixed in 24h?}
D --> H{Fixed in 7d?}
E --> I{Fixed in 30d?}
G -->|Yes| J[Closed]
G -->|No| K[Escalate to CISO]
H -->|Yes| J
H -->|No| L[Escalate to Lead Architect]
I -->|Yes| J
I -->|No| M[Escalate to Security Officer]
K --> N[Emergency Patch Required]
L --> O[Risk Acceptance or Patch]
M --> P[Prioritize in Next Sprint]
F --> Q[Fix in Next Major Release]
style K fill:#ff6b6b
style L fill:#feca57
style M fill:#feca57
Automated SLA Monitoring:
// Monitor vulnerability remediation SLAs
[FunctionName("MonitorVulnerabilitySLAs")]
public async Task RunAsync(
[TimerTrigger("0 0 */6 * * *")] TimerInfo timer, // Every 6 hours
ILogger log)
{
log.LogInformation("Checking vulnerability remediation SLAs...");
var openVulnerabilities = await GetOpenVulnerabilitiesAsync();
var breachedSLAs = new List<VulnerabilitySLA>();
foreach (var vuln in openVulnerabilities)
{
var sla = CalculateSLA(vuln.Severity);
var ageHours = (DateTime.UtcNow - vuln.DetectedAt).TotalHours;
if (ageHours > sla.Hours)
{
breachedSLAs.Add(new VulnerabilitySLA
{
CVE = vuln.CVE,
Severity = vuln.Severity,
DetectedAt = vuln.DetectedAt,
AgeHours = ageHours,
SLAHours = sla.Hours,
BreachHours = ageHours - sla.Hours,
AssignedTo = vuln.AssignedTo
});
// Escalate based on severity
if (vuln.Severity == "Critical")
{
await EscalateToCISOAsync(vuln);
}
else if (vuln.Severity == "High")
{
await EscalateToArchitectAsync(vuln);
}
else if (vuln.Severity == "Medium")
{
await EscalateToSecurityOfficerAsync(vuln);
}
}
}
if (breachedSLAs.Any())
{
log.LogWarning($"SLA breaches detected: {breachedSLAs.Count} vulnerabilities");
await SendSLABreachReportAsync(breachedSLAs);
}
else
{
log.LogInformation("✅ All vulnerabilities within SLA");
}
}
private SLA CalculateSLA(string severity) => severity switch
{
"Critical" => new SLA { Hours = 24, Escalation = "CISO" },
"High" => new SLA { Hours = 168, Escalation = "Lead Architect" }, // 7 days
"Medium" => new SLA { Hours = 720, Escalation = "Security Officer" }, // 30 days
_ => new SLA { Hours = int.MaxValue, Escalation = "None" }
};
Dependency Update Strategy¶
Purpose: Proactively update dependencies to minimize vulnerability exposure.
Update Cadence:
dependencyUpdates:
automated:
schedule: Weekly (Monday 2 AM)
scope: Patch versions only (1.2.3 → 1.2.4)
tool: Dependabot or Renovate
automerge: true # If tests pass
minor:
schedule: Monthly (1st Monday)
scope: Minor versions (1.2.x → 1.3.0)
tool: Manual PR by platform team
automerge: false # Requires review
major:
schedule: Quarterly
scope: Major versions (1.x → 2.0)
tool: Manual PR with ADR
automerge: false # Requires architect approval
security:
schedule: Immediate (on CVE disclosure)
scope: Any version with security patch
tool: Emergency PR
automerge: false # Requires security officer approval
Dependabot Configuration (.github/dependabot.yml):
version: 2
updates:
# .NET dependencies
- package-ecosystem: "nuget"
directory: "/"
schedule:
interval: "weekly"
day: "monday"
time: "02:00"
# Auto-merge patch updates if tests pass
open-pull-requests-limit: 10
reviewers:
- "platform-team"
# Grouping strategy
groups:
security-updates:
applies-to: security-updates
patterns:
- "*"
patch-updates:
patterns:
- "*"
update-types:
- "patch"
# Labels for PR categorization
labels:
- "dependencies"
- "automated"
# Ignore specific dependencies
ignore:
- dependency-name: "System.Text.Json"
versions: ["8.0.0"] # Pinned for compatibility
Summary¶
- Security Gates: 5-8 minute execution; zero tolerance for critical/high vulnerabilities and secrets
- Dependency Scanning: OWASP Dependency-Check with NVD integration; CVSS ≥7 blocks build
- Severity Thresholds: Critical (24h SLA), High (7d SLA), Medium (30d SLA), Low (next release)
- Suppression Workflow: Mermaid approval flow (Triage → Patch/RiskAcceptance → Security Review → Time-Bound Suppression)
- Risk Acceptance: Formal template with impact assessment, mitigation controls, approval signatures, expiration dates
- Secrets Detection: CredScan blocks on any detected secrets (API keys, passwords, tokens, certificates)
- Secret Patterns: 7 pattern types (API keys, connection strings, JWTs, private keys, Azure/AWS/GitHub credentials)
- Secret Rotation: Emergency rotation script with Key Vault disable/regenerate/restart workflow
- Container Scanning: Trivy scans Docker images; blocks push on Critical/High OS/package vulnerabilities
- Container Hardening: 7 controls (non-root user, no secrets, minimal image, health check, read-only FS)
- SAST: SonarQube security rules (SQL injection, XSS, weak crypto, CORS, hardcoded credentials)
- License Compliance: Whitelist (MIT, Apache, BSD); prohibited (GPL, AGPL); scan with dotnet-project-licenses
- Update Strategy: Weekly patch updates (automated), monthly minor updates, quarterly major updates, immediate security updates
- SLA Monitoring: Automated Azure Function checks every 6 hours; escalates Critical (CISO), High (Architect), Medium (Security Officer)
SBOM & Supply Chain Gates (Deep Dive)¶
Software Bill of Materials (SBOM) and supply chain security gates ensure transparency and traceability of all software components. ATP generates comprehensive SBOMs for every build and implements cryptographic signing for container images to prevent supply chain attacks.
Philosophy: In the era of Log4Shell and SolarWinds, supply chain security is paramount. ATP enforces complete visibility into all dependencies, cryptographic verification of artifacts, and immutable provenance tracking from code commit to production deployment.
SBOM & Supply Chain Workflow¶
graph TD
A[Security Gates Passed] --> B[Generate SBOM]
B --> C{SBOM Valid?}
C -->|No| D[SBOM Generation Failed ❌]
C -->|Yes| E[Validate SBOM Content]
E --> F{All Dependencies Listed?}
F -->|No| G[Incomplete SBOM ❌]
F -->|Yes| H[Sign Build Artifacts]
H --> I[Sign Container Image]
I --> J{Signature Valid?}
J -->|No| K[Signing Failed ❌]
J -->|Yes| L[Generate Provenance]
L --> M[Publish SBOM + Provenance]
M --> N[Supply Chain Gates Passed ✅]
D --> O[Pipeline Stopped]
G --> O
K --> O
N --> P[Proceed to Compliance Gates]
style D fill:#ff6b6b
style G fill:#ff6b6b
style K fill:#ff6b6b
style N fill:#90EE90
Typical SBOM Gate Duration: 2-3 minutes
SBOM Generation (CycloneDX)¶
Purpose: Generate a complete inventory of all software components (NuGet packages, Docker base images, OS packages) with version, license, and vulnerability information.
Tool: CycloneDX — OWASP-standardized SBOM format (also supports SPDX)
Requirements:
- Every build must generate an SBOM (no exceptions)
- Published as build artifact for audit trail and compliance
- Includes all dependencies: Direct, transitive, dev dependencies
- Metadata captured: Versions, licenses, CVEs, hashes (SHA256)
- Retention: 7 years for production builds (immutable storage)
Azure Pipelines Configuration:
# SBOM Generation Gate
- task: CycloneDX@1
inputs:
projectPath: '$(Build.SourcesDirectory)'
outputFormat: 'json,xml'
outputPath: '$(Build.ArtifactStagingDirectory)/sbom'
# Include detailed metadata
includeSerialNumber: true
includeLicenseText: true
# Scan depth
scanType: 'solution' # Scan entire solution
# Output naming
outputFilename: 'atp-ingestion-sbom-$(Build.BuildNumber)'
displayName: 'Generate SBOM (CycloneDX)'
continueOnError: false # Fail if SBOM generation fails
# Validate SBOM was generated
- script: |
SBOM_FILE="$(Build.ArtifactStagingDirectory)/sbom/atp-ingestion-sbom-$(Build.BuildNumber).json"
if [ ! -f "$SBOM_FILE" ]; then
echo "##vso[task.logissue type=error]SBOM file not found: $SBOM_FILE"
exit 1
fi
# Validate SBOM is valid JSON
if ! jq empty "$SBOM_FILE" 2>/dev/null; then
echo "##vso[task.logissue type=error]SBOM is not valid JSON"
exit 1
fi
# Validate SBOM has components
COMPONENT_COUNT=$(jq '.components | length' "$SBOM_FILE")
if [ "$COMPONENT_COUNT" -lt 10 ]; then
echo "##vso[task.logissue type=error]SBOM has too few components: $COMPONENT_COUNT (expected >10)"
exit 1
fi
echo "✅ SBOM validated: $COMPONENT_COUNT components"
displayName: 'Validate SBOM Content'
# Publish SBOM as build artifact
- task: PublishBuildArtifacts@1
inputs:
PathtoPublish: '$(Build.ArtifactStagingDirectory)/sbom'
ArtifactName: 'sbom-$(Build.BuildNumber)'
displayName: 'Publish SBOM Artifact'
condition: always() # Publish even on failure for audit
# Upload SBOM to immutable storage (production only)
- task: AzureCLI@2
inputs:
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
az storage blob upload \
--account-name atpcomplianceblob \
--container-name sbom-archive \
--name "atp-ingestion/$(Build.BuildNumber)/sbom.json" \
--file "$(Build.ArtifactStagingDirectory)/sbom/atp-ingestion-sbom-$(Build.BuildNumber).json" \
--metadata \
BuildId=$(Build.BuildId) \
CommitSha=$(Build.SourceVersion) \
Pipeline=$(Build.DefinitionName) \
GeneratedAt=$(date -u +%Y-%m-%dT%H:%M:%SZ)
# Enable legal hold (7-year retention)
az storage blob set-legal-hold \
--account-name atpcomplianceblob \
--container-name sbom-archive \
--blob-name "atp-ingestion/$(Build.BuildNumber)/sbom.json" \
--legal-hold true \
--tags compliance=true retention=7years
displayName: 'Archive SBOM with Legal Hold'
condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
SBOM Example (CycloneDX JSON):
{
"bomFormat": "CycloneDX",
"specVersion": "1.5",
"serialNumber": "urn:uuid:12345678-1234-1234-1234-123456789012",
"version": 1,
"metadata": {
"timestamp": "2025-01-15T14:30:00Z",
"tools": [
{
"vendor": "OWASP",
"name": "CycloneDX",
"version": "3.0.0"
}
],
"component": {
"type": "application",
"name": "ConnectSoft.ATP.Ingestion",
"version": "1.0.123",
"purl": "pkg:nuget/ConnectSoft.ATP.Ingestion@1.0.123",
"properties": [
{
"name": "build:commitSha",
"value": "a1b2c3d4e5f6..."
},
{
"name": "build:pipelineId",
"value": "12345"
},
{
"name": "build:timestamp",
"value": "2025-01-15T14:30:00Z"
}
]
}
},
"components": [
{
"type": "library",
"name": "System.Text.Json",
"version": "8.0.0",
"purl": "pkg:nuget/System.Text.Json@8.0.0",
"licenses": [
{
"license": {
"id": "MIT",
"url": "https://licenses.nuget.org/MIT"
}
}
],
"hashes": [
{
"alg": "SHA-256",
"content": "abc123def456..."
}
],
"externalReferences": [
{
"type": "website",
"url": "https://www.nuget.org/packages/System.Text.Json"
}
]
},
{
"type": "library",
"name": "Newtonsoft.Json",
"version": "13.0.3",
"purl": "pkg:nuget/Newtonsoft.Json@13.0.3",
"licenses": [
{
"license": {
"id": "MIT"
}
}
],
"vulnerabilities": [
{
"bom-ref": "vuln-1",
"id": "CVE-2024-12345",
"source": {
"name": "NVD",
"url": "https://nvd.nist.gov/vuln/detail/CVE-2024-12345"
},
"ratings": [
{
"source": {
"name": "NVD"
},
"score": 5.3,
"severity": "medium",
"method": "CVSSv3",
"vector": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:L/A:N"
}
]
}
]
}
]
}
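A quick spot-check of the inventory in such an SBOM can be done from the shell. This is a sketch against a tiny two-component sample; real tooling should use a proper CycloneDX parser rather than line-oriented sed:

```shell
# Sample two-component CycloneDX SBOM (compact, one component per line)
cat > sbom.json <<'EOF'
{"components":[
{"type":"library","name":"System.Text.Json","version":"8.0.0"},
{"type":"library","name":"Newtonsoft.Json","version":"13.0.3"}
]}
EOF
# Emit name@version pairs for a quick inventory review
sed -n 's/.*"name":"\([^"]*\)","version":"\([^"]*\)".*/\1@\2/p' sbom.json
```

The same name@version format is what the SBOM diff analysis later in this document compares between builds.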
SBOM Validation Requirements:
| Requirement | Validation | Blocker | Purpose |
|---|---|---|---|
| BOM Format | Must be valid CycloneDX JSON/XML | ✅ Yes | SBOM parser compatibility |
| Component Count | ≥10 components (sanity check) | ✅ Yes | Ensure dependencies captured |
| Version Info | All components have versions | ✅ Yes | Vulnerability correlation |
| License Info | ≥90% of components have license data | ⚠️ Warning | License compliance tracking |
| Hash Integrity | All components have SHA-256 hashes | ✅ Yes (prod) | Supply chain verification |
| Vulnerability Data | Known CVEs included if present | ℹ️ Info | Security awareness |
| Provenance | Build metadata (commit, pipeline, timestamp) | ✅ Yes | Audit trail |
SBOM Content Validation¶
Purpose: Ensure SBOM is complete and accurate, not just generated.
Validation Script (PowerShell):
<#
.SYNOPSIS
Validate SBOM completeness and accuracy.
.DESCRIPTION
Checks SBOM for required fields, component count, license data, and provenance.
#>
param(
[string]$SbomPath = "sbom/atp-ingestion-sbom.json",
[int]$MinComponents = 10,
[double]$MinLicenseCoverage = 0.9 # 90%
)
Write-Host "Validating SBOM: $SbomPath"
# Load SBOM
$sbom = Get-Content -Path $SbomPath | ConvertFrom-Json
# Validation 1: BOM Format
if ($sbom.bomFormat -ne "CycloneDX") {
Write-Error "Invalid BOM format: $($sbom.bomFormat) (expected: CycloneDX)"
exit 1
}
Write-Host "✅ BOM Format: $($sbom.bomFormat) $($sbom.specVersion)"
# Validation 2: Component Count
$componentCount = $sbom.components.Count
if ($componentCount -lt $MinComponents) {
Write-Error "Too few components: $componentCount (expected: ≥$MinComponents)"
exit 1
}
Write-Host "✅ Component Count: $componentCount"
# Validation 3: License Coverage
$componentsWithLicense = $sbom.components | Where-Object { $_.licenses.Count -gt 0 }
$licenseCoverage = $componentsWithLicense.Count / $componentCount
if ($licenseCoverage -lt $MinLicenseCoverage) {
Write-Warning "Low license coverage: $($licenseCoverage * 100)% (expected: ≥$($MinLicenseCoverage * 100)%)"
}
else {
Write-Host "✅ License Coverage: $($licenseCoverage * 100)%"
}
# Validation 4: Provenance Metadata
$buildMetadata = $sbom.metadata.component.properties | Where-Object { $_.name -like "build:*" }
if ($buildMetadata.Count -lt 3) {
Write-Error "Missing build provenance metadata (expected: commitSha, pipelineId, timestamp)"
exit 1
}
Write-Host "✅ Provenance Metadata: $($buildMetadata.Count) properties"
# Validation 5: Vulnerability Data (optional but recommended)
$componentsWithVulns = $sbom.components | Where-Object { $_.vulnerabilities.Count -gt 0 }
if ($componentsWithVulns.Count -gt 0) {
Write-Warning "Components with known vulnerabilities: $($componentsWithVulns.Count)"
$criticalVulns = $componentsWithVulns.vulnerabilities | Where-Object { $_.ratings[0].severity -eq "critical" }
if ($criticalVulns.Count -gt 0) {
Write-Error "SBOM contains components with CRITICAL vulnerabilities: $($criticalVulns.Count)"
exit 1
}
}
Write-Host "✅ SBOM validation passed"
Provenance & Signing (Cosign)¶
Purpose: Cryptographically sign build artifacts and container images to ensure authenticity and integrity, preventing tampering and supply chain attacks.
Tool: Cosign (part of Sigstore project) — Container image signing and verification
Requirements:
- All production images must be signed with Cosign before push to ACR
- Signature verification enforced at deployment time (Kubernetes admission controller)
- Provenance attestation includes commit SHA, pipeline ID, build timestamp, approver identities
- Key management: Signing keys stored in Azure Key Vault (managed HSM for production)
Cosign Signing Workflow:
# Container Image Signing Gate
steps:
# 1. Build Docker image
- task: Docker@2
inputs:
command: 'build'
dockerfile: '$(dockerfile)'
repository: '$(imageRepository)'
tags: '$(Build.BuildNumber)'
displayName: 'Build Docker Image'
# 2. Trivy scan (must pass before signing)
- script: |
trivy image --severity HIGH,CRITICAL --exit-code 1 \
$(imageRepository):$(Build.BuildNumber)
displayName: 'Trivy Scan'
# 3. Install Cosign
- script: |
# Install Cosign CLI
COSIGN_VERSION=2.2.2
wget "https://github.com/sigstore/cosign/releases/download/v${COSIGN_VERSION}/cosign-linux-amd64"
sudo mv cosign-linux-amd64 /usr/local/bin/cosign
sudo chmod +x /usr/local/bin/cosign
# Verify installation
cosign version
displayName: 'Install Cosign'
# 4. Fetch signing key from Key Vault
- task: AzureKeyVault@2
inputs:
azureSubscription: '$(azureSubscription)'
keyVaultName: 'atp-keyvault-prod-eus'
secretsFilter: 'CosignSigningKey,CosignPassword'
runAsPreJob: false
displayName: 'Fetch Cosign Signing Key'
# 5. Sign container image
- script: |
# Export Cosign private key
echo "$(CosignSigningKey)" > cosign.key
# Sign image with provenance annotations
# (note: cosign signs the image manifest in the registry, so the reference
# must already be resolvable there — push to a staging tag first, or sign
# by digest immediately after push)
COSIGN_PASSWORD=$(CosignPassword) cosign sign \
--key cosign.key \
--annotations "build.commitSha=$(Build.SourceVersion)" \
--annotations "build.pipelineId=$(Build.BuildId)" \
--annotations "build.pipelineName=$(Build.DefinitionName)" \
--annotations "build.timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--annotations "build.branch=$(Build.SourceBranch)" \
--annotations "build.buildNumber=$(Build.BuildNumber)" \
$(containerRegistry)/$(imageRepository):$(Build.BuildNumber)
# Clean up key
rm -f cosign.key
echo "✅ Image signed successfully"
displayName: 'Sign Container Image with Cosign'
env:
COSIGN_PASSWORD: $(CosignPassword)
# 6. Generate provenance attestation (SLSA)
- script: |
# Re-export the signing key (it was removed after the image signing step)
echo "$(CosignSigningKey)" > cosign.key
# --type slsaprovenance sets predicateType to https://slsa.dev/provenance/v0.2
cosign attest \
--key cosign.key \
--type slsaprovenance \
--predicate <(
cat <<EOF
{
"buildType": "https://tekton.dev/attestations/chains/pipelinerun@v2",
"builder": {
"id": "https://dev.azure.com/ConnectSoft/$(Build.DefinitionName)"
},
"invocation": {
"configSource": {
"uri": "$(Build.Repository.Uri)",
"digest": {
"sha1": "$(Build.SourceVersion)"
},
"entryPoint": "azure-pipelines.yml"
}
},
"metadata": {
"buildStartedOn": "$(Build.StartTime)",
"buildFinishedOn": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"completeness": {
"parameters": true,
"environment": true,
"materials": true
},
"reproducible": false
},
"materials": [
{
"uri": "$(Build.Repository.Uri)",
"digest": {
"sha1": "$(Build.SourceVersion)"
}
}
]
}
EOF
) \
$(containerRegistry)/$(imageRepository):$(Build.BuildNumber)
# Clean up key
rm -f cosign.key
echo "✅ Provenance attestation generated"
displayName: 'Generate Provenance Attestation'
env:
COSIGN_PASSWORD: $(CosignPassword)
# 7. Verify signature (self-test)
- script: |
# Derive the public key from the Key Vault secret
# (the private key file is not kept on disk between steps)
echo "$(CosignSigningKey)" > cosign.key
COSIGN_PASSWORD=$(CosignPassword) cosign public-key --key cosign.key > cosign.pub
rm -f cosign.key
# Verify signature
cosign verify \
--key cosign.pub \
$(containerRegistry)/$(imageRepository):$(Build.BuildNumber)
if [ $? -eq 0 ]; then
echo "✅ Signature verified successfully"
else
echo "##vso[task.logissue type=error]Signature verification failed"
exit 1
fi
displayName: 'Verify Image Signature'
# 8. Push signed image to ACR
- task: Docker@2
inputs:
command: 'push'
repository: '$(imageRepository)'
containerRegistry: '$(dockerRegistryServiceConnection)'
tags: '$(Build.BuildNumber)'
displayName: 'Push Signed Image to ACR'
condition: succeeded() # Only push if signing succeeded
Signature Verification at Deployment¶
Purpose: Enforce that only signed images can be deployed to production, preventing unauthorized or tampered images.
Kubernetes Admission Controller (Cosign Verification):
# Policy Controller (Sigstore)
apiVersion: v1
kind: ConfigMap
metadata:
name: cosign-verification-policy
namespace: atp-prod
data:
policy.yaml: |
apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
name: atp-image-signing-policy
spec:
images:
- glob: "connectsoft.azurecr.io/atp/**"
authorities:
- key:
secretRef:
name: cosign-public-key
namespace: atp-prod
# Require specific annotations (provenance)
attestations:
- name: build-provenance
predicateType: https://slsa.dev/provenance/v0.2
policy:
type: cue
data: |
builder.id: "https://dev.azure.com/ConnectSoft/*"
metadata.completeness.materials: true
---
apiVersion: v1
kind: Secret
metadata:
name: cosign-public-key
namespace: atp-prod
type: Opaque
stringData:
cosign.pub: |
-----BEGIN PUBLIC KEY-----
MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE...
-----END PUBLIC KEY-----
Deployment Validation:
#!/bin/bash
# verify-image-signature.sh
IMAGE=$1 # e.g., connectsoft.azurecr.io/atp/ingestion:1.0.123
echo "Verifying image signature: $IMAGE"
# Fetch public key from Key Vault
az keyvault secret show \
--vault-name atp-keyvault-prod-eus \
--name CosignPublicKey \
--query value -o tsv > cosign.pub
# Verify signature
cosign verify --key cosign.pub $IMAGE
if [ $? -eq 0 ]; then
echo "✅ Image signature valid"
# Verify provenance attestation
cosign verify-attestation --key cosign.pub $IMAGE
if [ $? -eq 0 ]; then
echo "✅ Provenance attestation valid"
exit 0
else
echo "❌ Provenance attestation invalid or missing"
exit 1
fi
else
echo "❌ Image signature invalid or missing"
echo " Unsigned images are not allowed in production."
exit 1
fi
Supply Chain Attack Prevention¶
Purpose: Mitigate supply chain attack vectors through dependency pinning, checksum verification, and isolated build environments.
Supply Chain Security Controls:
| Control | Implementation | Enforcement | Risk Mitigated |
|---|---|---|---|
| Dependency Pinning | Lock file with exact versions | ✅ Required | Prevent malicious updates |
| Checksum Verification | NuGet package hash validation | ✅ Automatic | Detect package tampering |
| Isolated Build Agents | Ephemeral agents, no internet access | ✅ Prod builds | Prevent agent compromise |
| Two-Person Review | PR requires 2 approvals for dependency changes | ✅ Production | Prevent single-actor malicious PRs |
| SBOM Comparison | Diff SBOMs between builds | ⚠️ Warning | Detect unexpected dependency changes |
| Private Package Feed | Azure Artifacts mirrors public NuGet | ✅ Recommended | Prevent dependency confusion |
| Code Signing | All builds signed with authenticode | ✅ Prod only | Verify publisher identity |
Dependency Lock File (NuGet packages.lock.json):
<!-- Enable lock file in .csproj -->
<Project Sdk="Microsoft.NET.Sdk.Web">
<PropertyGroup>
<!-- Enable NuGet lock file -->
<RestorePackagesWithLockFile>true</RestorePackagesWithLockFile>
<RestoreLockedMode Condition="'$(CI)' == 'true'">true</RestoreLockedMode>
<!-- RestoreLockedMode above makes restore fail in CI on any lock file mismatch -->
</PropertyGroup>
</Project>
// packages.lock.json (auto-generated)
{
"version": 1,
"dependencies": {
"net8.0": {
"System.Text.Json": {
"type": "Direct",
"requested": "[8.0.0, )",
"resolved": "8.0.0",
"contentHash": "sha512-abc123def456...",
"dependencies": {
"System.Runtime": "8.0.0",
"System.Memory": "8.0.0"
}
},
"Newtonsoft.Json": {
"type": "Direct",
"requested": "[13.0.3, )",
"resolved": "13.0.3",
"contentHash": "sha512-xyz789abc123..."
}
}
}
}
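The contentHash entries are what the checksum-verification control relies on. A minimal gate sketch (a hypothetical pre-restore check, intentionally crude and grep-based — NuGet performs its own validation) rejects a lock file whose resolved packages lack sha512 hashes:

```shell
# Sample lock file: two resolved packages, both with sha512 content hashes
cat > packages.lock.json <<'EOF'
{
  "version": 1,
  "dependencies": {
    "net8.0": {
      "System.Text.Json": { "resolved": "8.0.0", "contentHash": "sha512-abc" },
      "Newtonsoft.Json":  { "resolved": "13.0.3", "contentHash": "sha512-xyz" }
    }
  }
}
EOF
# Every "resolved" entry must be paired with a sha512 contentHash
resolved=$(grep -c '"resolved"' packages.lock.json)
hashed=$(grep -c '"contentHash": "sha512-' packages.lock.json)
if [ "$resolved" -ne "$hashed" ]; then
  echo "lock file has $((resolved - hashed)) package(s) without a sha512 hash"
  exit 1
fi
echo "all $hashed resolved packages carry sha512 hashes"
```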
SBOM Diff Analysis (detect unexpected changes):
#!/bin/bash
# sbom-diff.sh
PREVIOUS_SBOM=$1 # Previous build SBOM
CURRENT_SBOM=$2 # Current build SBOM
echo "Analyzing SBOM changes..."
# Extract component lists
jq -r '.components[] | "\(.name)@\(.version)"' $PREVIOUS_SBOM | sort > previous-components.txt
jq -r '.components[] | "\(.name)@\(.version)"' $CURRENT_SBOM | sort > current-components.txt
# Detect added dependencies
ADDED=$(comm -13 previous-components.txt current-components.txt)
if [ -n "$ADDED" ]; then
echo "⚠️ New dependencies added:"
echo "$ADDED"
# Create work item for review
az boards work-item create \
--type "Task" \
--title "SBOM Review: New Dependencies in Build $(Build.BuildNumber)" \
--description "New dependencies detected:\n\n$ADDED\n\nReview for security and license compliance." \
--assigned-to "security-team@connectsoft.example"
fi
# Detect removed dependencies
REMOVED=$(comm -23 previous-components.txt current-components.txt)
if [ -n "$REMOVED" ]; then
echo "⚠️ Dependencies removed:"
echo "$REMOVED"
fi
# Detect version changes (same package, different version)
CHANGED=$(comm -12 <(cut -d'@' -f1 previous-components.txt) <(cut -d'@' -f1 current-components.txt) | while read PKG; do
PREV_VER=$(grep "^$PKG@" previous-components.txt | cut -d'@' -f2)
CURR_VER=$(grep "^$PKG@" current-components.txt | cut -d'@' -f2)
if [ "$PREV_VER" != "$CURR_VER" ]; then
echo "$PKG: $PREV_VER → $CURR_VER"
fi
done)
if [ -n "$CHANGED" ]; then
echo "ℹ️ Dependencies updated:"
echo "$CHANGED"
fi
echo "✅ SBOM diff analysis complete"
Private Package Feed (Dependency Confusion Prevention)¶
Purpose: Prevent dependency confusion attacks where attackers publish malicious packages with the same name as internal packages.
Strategy: Private Azure Artifacts feed that mirrors public NuGet with approval workflow.
Azure Artifacts Feed Configuration:
# Azure Artifacts: ATP-NuGet-Feed
feed:
name: ATP-NuGet-Feed
visibility: private
upstreams:
# Mirror public NuGet (with caching)
- name: nuget.org
protocol: nuget
location: https://api.nuget.org/v3/index.json
includePrerelease: false
# Package allowlist (only approved packages)
upstreamBehavior: allowExternalVersionsOnly
permissions:
# Feed readers (developers, build agents)
- identity: Build Service (ConnectSoft)
role: Reader
- identity: Contributors (ConnectSoft)
role: Reader
# Feed publishers (only platform team)
- identity: Platform-Team
role: Contributor
NuGet.config (consume private feed):
<?xml version="1.0" encoding="utf-8"?>
<configuration>
<packageSources>
<clear />
<!-- Private feed (takes precedence) -->
<add key="ATP-NuGet-Feed" value="https://pkgs.dev.azure.com/ConnectSoft/_packaging/ATP-NuGet-Feed/nuget/v3/index.json" />
<!-- Public NuGet as fallback (via upstream) -->
<!-- <add key="nuget.org" value="https://api.nuget.org/v3/index.json" /> -->
</packageSources>
<packageSourceCredentials>
<ATP-NuGet-Feed>
<add key="Username" value="az" />
<add key="ClearTextPassword" value="%SYSTEM_ACCESSTOKEN%" />
</ATP-NuGet-Feed>
</packageSourceCredentials>
</configuration>
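A pipeline lint can enforce this shape of NuGet.config — `<clear />` present and no direct public-feed entry. A sketch of such a check, run here against a sample config (the rule itself is a hypothetical gate, not an az/NuGet feature):

```shell
# Sample config: inherited sources cleared, only the private feed registered
cat > NuGet.config <<'EOF'
<configuration>
  <packageSources>
    <clear />
    <add key="ATP-NuGet-Feed" value="https://pkgs.dev.azure.com/ConnectSoft/_packaging/ATP-NuGet-Feed/nuget/v3/index.json" />
  </packageSources>
</configuration>
EOF
# Gate 1: inherited sources must be cleared
grep -q '<clear />' NuGet.config || { echo "missing <clear />"; exit 1; }
# Gate 2: no direct reference to the public feed (upstreaming goes via the private feed)
grep -q 'api.nuget.org' NuGet.config && { echo "direct public feed found"; exit 1; }
echo "NuGet.config source hygiene OK"
```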
Dependency Confusion Detection:
// Detect potential dependency confusion attacks
public class DependencyConfusionDetector
{
public async Task<bool> DetectConfusionRiskAsync(string packageName, string version)
{
// Check if package exists in both public and private feeds
var publicPackage = await _nugetClient.GetPackageAsync("https://api.nuget.org/v3/index.json", packageName, version);
var privatePackage = await _nugetClient.GetPackageAsync("https://pkgs.dev.azure.com/ConnectSoft/_packaging/ATP-NuGet-Feed/nuget/v3/index.json", packageName, version);
if (publicPackage != null && privatePackage != null)
{
// Both feeds have this package; potential confusion risk
// Compare hashes
if (publicPackage.Hash != privatePackage.Hash)
{
// CRITICAL: Same package name/version, different content
await AlertSecurityTeamAsync(new
{
PackageName = packageName,
Version = version,
PublicHash = publicPackage.Hash,
PrivateHash = privatePackage.Hash,
Severity = "Critical",
Recommendation = "Investigate immediately; potential supply chain attack"
});
return true; // Confusion detected
}
}
return false; // No confusion detected
}
}
SLSA Provenance (Supply Chain Levels for Software Artifacts)¶
Purpose: Provide verifiable provenance for all build artifacts, documenting the complete build process from source to artifact.
SLSA Level: ATP targets SLSA Level 3 (Hardened Builds)
| SLSA Level | Requirements | ATP Status |
|---|---|---|
| SLSA 1 | Provenance exists; build process documented | ✅ Achieved |
| SLSA 2 | Signed provenance; service-generated (not user) | ✅ Achieved |
| SLSA 3 | Hardened builds; isolated, ephemeral build environments | 🚧 In Progress (Q2 2025) |
| SLSA 4 | Two-person review; hermetic builds | 🎯 Target (Q4 2025) |
Provenance Attestation (SLSA v1.0):
{
"_type": "https://in-toto.io/Statement/v1",
"subject": [
{
"name": "connectsoft.azurecr.io/atp/ingestion",
"digest": {
"sha256": "abc123def456..."
}
}
],
"predicateType": "https://slsa.dev/provenance/v1",
"predicate": {
"buildDefinition": {
"buildType": "https://dev.azure.com/Pipelines/v1",
"externalParameters": {
"repository": "https://github.com/ConnectSoft/ATP.Ingestion",
"ref": "refs/heads/main",
"commit": "a1b2c3d4e5f6..."
},
"internalParameters": {
"azurePipeline": "ATP-Ingestion-CI",
"buildId": "12345"
},
"resolvedDependencies": [
{
"uri": "pkg:nuget/System.Text.Json@8.0.0",
"digest": {
"sha256": "abc123..."
}
}
]
},
"runDetails": {
"builder": {
"id": "https://dev.azure.com/ConnectSoft/_build",
"version": {
"azure-pipelines": "1.0"
}
},
"metadata": {
"invocationId": "$(Build.BuildId)",
"startedOn": "2025-01-15T14:00:00Z",
"finishedOn": "2025-01-15T14:15:00Z"
},
"byproducts": [
{
"name": "SBOM",
"uri": "https://artifacts.connectsoft.com/sbom/atp-ingestion-1.0.123.json",
"digest": {
"sha256": "xyz789..."
}
}
]
}
}
}
Provenance Verification (Kubernetes):
# Deployment with provenance verification
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
namespace: atp-prod
annotations:
# Require SLSA provenance
admission.sigstore.dev/require-provenance: "true"
admission.sigstore.dev/min-slsa-level: "2"
spec:
template:
spec:
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:1.0.123
# Image must be signed + have provenance attestation
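Behind that admission check, `cosign verify-attestation` returns a DSSE envelope whose base64-encoded payload carries the in-toto statement. A sketch of how a deploy gate might decode the payload and enforce the predicate type (the envelope is built inline for illustration; only the standard DSSE field names are assumed):

```shell
# A minimal in-toto statement and its DSSE envelope, as cosign would emit them
payload_json='{"predicateType":"https://slsa.dev/provenance/v1","predicate":{}}'
envelope="{\"payloadType\":\"application/vnd.in-toto+json\",\"payload\":\"$(printf '%s' "$payload_json" | base64 | tr -d '\n')\"}"
# Extract and decode the payload field
decoded=$(printf '%s' "$envelope" | sed -n 's/.*"payload":"\([^"]*\)".*/\1/p' | base64 -d)
printf '%s' "$decoded" > decoded.json
# Enforce the expected SLSA predicate type
case "$decoded" in
  *'"predicateType":"https://slsa.dev/provenance/v1"'*) echo "SLSA provenance predicate OK" ;;
  *) echo "unexpected predicate type"; exit 1 ;;
esac
```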
SBOM Distribution & Consumption¶
Purpose: Make SBOMs accessible to security teams, customers, and auditors for transparency and compliance.
SBOM Distribution Channels:
sbomDistribution:
internal:
# Azure DevOps Artifacts (for internal teams)
location: https://dev.azure.com/ConnectSoft/_artifacts/feed/ATP-SBOM
retention: 7 years
access: Security team, compliance team, auditors
external:
# Customer-accessible SBOM portal (for transparency)
location: https://sbom.connectsoft.com/atp/
format: HTML (rendered from JSON)
access: Authenticated customers
contains: Public SBOM (no internal infrastructure details)
regulators:
# Auditor-accessible immutable storage
location: Azure Blob (read-only SAS token)
retention: 7 years + legal hold
access: External auditors (SOC 2, GDPR DPAs)
SBOM API (customer access):
// SBOM query API for customers
[ApiController]
[Route("api/sbom")]
[Authorize(Roles = "Customer")]
public class SbomController : ControllerBase
{
[HttpGet("{product}/{version}")]
public async Task<IActionResult> GetSbom(string product, string version)
{
// Validate customer can access this SBOM (their tenant uses this version)
if (!await _authService.CanAccessSbomAsync(User.Identity.Name, product, version))
{
return Forbid();
}
// Fetch SBOM from blob storage
var sbom = await _blobService.GetSbomAsync(product, version);
if (sbom == null)
{
return NotFound($"SBOM not found for {product} version {version}");
}
// Redact internal infrastructure details
var publicSbom = RedactInternalDetails(sbom);
return Ok(publicSbom);
}
private SbomDocument RedactInternalDetails(SbomDocument sbom)
{
// Remove internal build metadata
sbom.Metadata.Properties = sbom.Metadata.Properties
.Where(p => !p.Name.StartsWith("build:internal"))
.ToList();
// Remove internal-only components
sbom.Components = sbom.Components
.Where(c => !c.Name.Contains("Internal"))
.ToList();
return sbom;
}
}
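The same redaction rule can be approximated on a flat component export — useful for spot-checking that a published SBOM contains no internal packages (hypothetical package names):

```shell
# Flat component list exported from an SBOM
cat > components.txt <<'EOF'
ConnectSoft.ATP.Ingestion
ConnectSoft.ATP.InternalMetrics
System.Text.Json
EOF
# Drop internal-only packages before publishing, mirroring RedactInternalDetails
grep -v 'Internal' components.txt
```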
Artifact Attestation (SLSA Build L3)¶
Purpose: Provide cryptographic proof that artifacts were built by trusted CI/CD pipelines without manual intervention.
in-toto Attestation (Azure Pipelines):
# Generate in-toto attestation
- script: |
# Install in-toto
pip install in-toto
# Generate link metadata (build step attestation)
in-toto-run \
--step-name build \
--key $(Build.ArtifactStagingDirectory)/signing-key.pem \
--materials $(Build.SourcesDirectory)/ \
--products $(Build.ArtifactStagingDirectory)/publish/ \
-- dotnet publish -c Release -o $(Build.ArtifactStagingDirectory)/publish
# Generate layout (defines expected build steps)
cat > layout.json <<EOF
{
"_type": "layout",
"steps": [
{
"name": "build",
"expected_materials": [
["MATCH", "**/*.cs", "WITH", "PRODUCTS", "FROM", "checkout"]
],
"expected_products": [
["CREATE", "publish/*.dll"],
["CREATE", "publish/*.json"]
],
"pubkeys": ["$(cat $(Build.ArtifactStagingDirectory)/signing-key.pub)"]
}
],
"inspect": []
}
EOF
# Sign layout
in-toto-sign \
--key $(Build.ArtifactStagingDirectory)/signing-key.pem \
--file layout.json
echo "✅ in-toto attestation generated"
displayName: 'Generate in-toto Attestation'
Supply Chain Security Metrics¶
Purpose: Track supply chain health and identify anomalies.
Supply Chain KQL Queries:
// Dependency change frequency (last 90 days)
customEvents
| where name == "DependencyChanged"
| where timestamp > ago(90d)
| extend PackageName = tostring(customDimensions.PackageName)
| extend OldVersion = tostring(customDimensions.OldVersion)
| extend NewVersion = tostring(customDimensions.NewVersion)
| summarize ChangeCount = count() by PackageName
| order by ChangeCount desc
| take 20
// SBOM generation success rate
customEvents
| where name in ("SbomGenerated", "SbomGenerationFailed")
| where timestamp > ago(30d)
| summarize
TotalAttempts = count(),
Successful = countif(name == "SbomGenerated"),
Failed = countif(name == "SbomGenerationFailed"),
SuccessRate = 100.0 * countif(name == "SbomGenerated") / count()
// Unsigned images detected in deployment attempts
customEvents
| where name == "UnsignedImageRejected"
| where timestamp > ago(7d)
| extend Image = tostring(customDimensions.Image)
| extend Namespace = tostring(customDimensions.Namespace)
| summarize RejectionCount = count() by Image, Namespace
| order by RejectionCount desc
// Dependency pinning violations (lock file mismatch)
customEvents
| where name == "LockFileMismatch"
| where timestamp > ago(30d)
| extend PackageName = tostring(customDimensions.PackageName)
| extend ExpectedVersion = tostring(customDimensions.ExpectedVersion)
| extend ActualVersion = tostring(customDimensions.ActualVersion)
| project timestamp, PackageName, ExpectedVersion, ActualVersion, BuildId = tostring(customDimensions.BuildId)
Summary¶
- SBOM & Supply Chain Gates: 2-3 minute execution; SBOM generation mandatory for all builds
- CycloneDX SBOM: JSON/XML format with complete dependency inventory (versions, licenses, CVEs, hashes)
- SBOM Validation: 7 requirements (valid format, ≥10 components, versions, licenses, hashes, vulnerabilities, provenance)
- SBOM Content Validator: PowerShell script checking format, component count, license coverage, provenance metadata
- Cosign Signing: All production images cryptographically signed with Key Vault-stored keys
- Signing Workflow: 8-step pipeline (Build → Scan → Install Cosign → Fetch Key → Sign → Generate Provenance → Verify → Push)
- Provenance Attestation: SLSA v1.0 format with builder ID, materials, metadata, timestamps
- Signature Verification: Kubernetes admission controller enforces signature validation at deployment time
- Supply Chain Controls: 7 controls (dependency pinning, checksum verification, isolated agents, two-person review, SBOM diff, private feed, code signing)
- Dependency Lock File: NuGet packages.lock.json with SHA-512 hashes; RestoreLockedMode enforced in CI
- SBOM Diff Analysis: Bash script detecting added/removed/updated dependencies; creates work items for security review
- Dependency Confusion Prevention: Private Azure Artifacts feed mirrors public NuGet with allowlist
- SLSA Levels: Currently SLSA 2 (signed provenance); targeting SLSA 3 (Q2 2025), SLSA 4 (Q4 2025)
- SBOM Distribution: Internal (Azure Artifacts), external (customer portal), regulators (immutable blob with SAS tokens)
- Supply Chain Metrics: KQL queries for dependency changes, SBOM success rate, unsigned image rejections, lock file mismatches
Compliance Gates (Deep Dive)¶
Compliance gates ensure regulatory adherence (GDPR, HIPAA, SOC 2) through automated validation of audit logging, PII protection, and compliance controls. ATP enforces 100% audit logging coverage for state-mutating operations and zero tolerance for PII leakage in logs or telemetry.
Philosophy: Compliance is built-in, not bolted-on. ATP embeds compliance controls into the CI/CD pipeline, making it impossible to deploy non-compliant code to production. Every compliance requirement is validated, documented, and auditable.
Compliance Gate Workflow¶
graph TD
A[SBOM Gates Passed] --> B[Audit Logging Validation]
B --> C{100% Coverage?}
C -->|No| D[Audit Logging Incomplete ❌]
C -->|Yes| E[PII Redaction Validation]
E --> F{No PII in Logs?}
F -->|PII Found| G[PII Leakage Detected ❌]
F -->|Clean| H[GDPR/HIPAA Checklist]
H --> I{All Items Pass?}
I -->|No| J[Compliance Checklist Failed ❌]
I -->|Yes| K[Data Classification Validation]
K --> L{Sensitive Data Classified?}
L -->|No| M[Classification Missing ❌]
L -->|Yes| N[Retention Policy Validation]
N --> O{Policies Configured?}
O -->|No| P[Retention Policy Missing ❌]
O -->|Yes| Q[Compliance Gates Passed ✅]
D --> R[Pipeline Stopped]
G --> R
J --> R
M --> R
P --> R
Q --> S[Proceed to Staging Deployment]
style D fill:#ff6b6b
style G fill:#ff6b6b
style J fill:#ff6b6b
style M fill:#ff6b6b
style P fill:#ff6b6b
style Q fill:#90EE90
Typical Compliance Gate Duration: 2-3 minutes
Audit Logging Validation¶
Purpose: Ensure 100% of state-mutating operations emit audit events to maintain complete audit trail for compliance.
Requirement: Every method that creates, updates, or deletes data must call IAuditLogger.LogAsync().
Tool: Custom static analyzer that scans C# code for audit logging calls
Threshold: 100% — No exceptions; all state mutations must be audited
Audit Logging Validator (C# Roslyn Analyzer):
// Custom Roslyn analyzer to enforce audit logging
using System.Collections.Immutable;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Diagnostics;
[DiagnosticAnalyzer(LanguageNames.CSharp)]
public class AuditLoggingAnalyzer : DiagnosticAnalyzer
{
private const string DiagnosticId = "ATP001";
private const string Title = "State-mutating method missing audit logging";
private const string MessageFormat = "Method '{0}' modifies state but does not call IAuditLogger.LogAsync()";
private const string Category = "Compliance";
private static readonly DiagnosticDescriptor Rule = new DiagnosticDescriptor(
DiagnosticId,
Title,
MessageFormat,
Category,
DiagnosticSeverity.Error,
isEnabledByDefault: true,
description: "All methods that create, update, or delete data must emit audit events.");
public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics => ImmutableArray.Create(Rule);
public override void Initialize(AnalysisContext context)
{
context.ConfigureGeneratedCodeAnalysis(GeneratedCodeAnalysisFlags.None);
context.EnableConcurrentExecution();
context.RegisterSyntaxNodeAction(AnalyzeMethod, SyntaxKind.MethodDeclaration);
}
private void AnalyzeMethod(SyntaxNodeAnalysisContext context)
{
var methodDeclaration = (MethodDeclarationSyntax)context.Node;
var methodSymbol = context.SemanticModel.GetDeclaredSymbol(methodDeclaration);
if (methodSymbol == null || methodSymbol.IsAbstract || methodSymbol.IsExtern)
return;
// Check if method is state-mutating (has Create/Update/Delete/Save in name)
var methodName = methodSymbol.Name;
var isStateMutating = methodName.Contains("Create") ||
methodName.Contains("Update") ||
methodName.Contains("Delete") ||
methodName.Contains("Save") ||
methodName.Contains("Add") ||
methodName.Contains("Remove");
if (!isStateMutating)
return;
// Check if method returns Task (async)
var returnType = methodSymbol.ReturnType;
if (returnType.Name != "Task")
return;
// Check if method calls IAuditLogger.LogAsync()
var invocations = methodDeclaration.DescendantNodes()
.OfType<InvocationExpressionSyntax>();
var hasAuditLogging = invocations.Any(invocation =>
{
var memberAccess = invocation.Expression as MemberAccessExpressionSyntax;
if (memberAccess?.Name.Identifier.Text == "LogAsync")
{
var symbolInfo = context.SemanticModel.GetSymbolInfo(memberAccess);
var symbol = symbolInfo.Symbol as IMethodSymbol;
// Check if method is from IAuditLogger interface
return symbol?.ContainingType.Name == "IAuditLogger";
}
return false;
});
if (!hasAuditLogging)
{
var diagnostic = Diagnostic.Create(Rule, methodDeclaration.Identifier.GetLocation(), methodSymbol.Name);
context.ReportDiagnostic(diagnostic);
}
}
}
Audit Logging Validation Script (PowerShell):
<#
.SYNOPSIS
Validate audit logging coverage in ATP services.
.DESCRIPTION
Scans C# source code for state-mutating methods without audit logging calls.
#>
param(
[string]$Path = "$(Build.SourcesDirectory)",
[int]$Threshold = 100 # 100% coverage required
)
Write-Host "Validating audit logging coverage in: $Path"
# Find all C# files (exclude tests, migrations, generated)
$csFiles = Get-ChildItem -Path $Path -Recurse -Filter *.cs |
Where-Object {
$_.FullName -notmatch "\\Tests\\" -and
$_.FullName -notmatch "\\Migrations\\" -and
$_.FullName -notmatch "\\.Generated\\.cs$"
}
$stateMutatingMethods = @()
$methodsWithAuditLogging = @()
foreach ($file in $csFiles) {
$content = Get-Content -Path $file.FullName -Raw
# Find state-mutating methods (Create, Update, Delete, Save, Add, Remove)
$methodPattern = 'public\s+(async\s+)?Task(<\w+>)?\s+(Create|Update|Delete|Save|Add|Remove)\w*\s*\('
$methodMatches = [regex]::Matches($content, $methodPattern) # avoid shadowing the automatic $matches variable
foreach ($match in $methodMatches) {
# The full method name is the last whitespace-delimited token before '('
$methodName = $match.Groups[0].Value.Split('(')[0].Trim().Split()[-1]
$stateMutatingMethods += [PSCustomObject]@{
File = $file.Name
Method = $methodName
Line = ($content.Substring(0, $match.Index) -split "`n").Count
}
# Check if method body contains IAuditLogger.LogAsync()
# (track brace depth; stopping at the first '}' would end at the first nested block)
$methodStart = $match.Index
$braceStart = $content.IndexOf("{", $methodStart)
$depth = 0
$methodEnd = $braceStart
for ($i = $braceStart; $i -lt $content.Length; $i++) {
if ($content[$i] -eq '{') { $depth++ }
elseif ($content[$i] -eq '}') { $depth--; if ($depth -eq 0) { $methodEnd = $i; break } }
}
$methodBody = $content.Substring($methodStart, $methodEnd - $methodStart + 1)
if ($methodBody -match '(IAuditLogger|_auditLogger|auditLogger)\.LogAsync\(') {
$methodsWithAuditLogging += $methodName
}
}
}
$totalStateMutatingMethods = $stateMutatingMethods.Count
$totalWithAuditLogging = $methodsWithAuditLogging.Count
$coveragePercent = if ($totalStateMutatingMethods -gt 0) {
($totalWithAuditLogging / $totalStateMutatingMethods) * 100
} else {
100
}
Write-Host "Audit Logging Coverage:"
Write-Host " Total state-mutating methods: $totalStateMutatingMethods"
Write-Host " Methods with audit logging: $totalWithAuditLogging"
Write-Host " Coverage: $($coveragePercent.ToString('F1'))%"
# Report methods without audit logging
$methodsWithoutLogging = $stateMutatingMethods | Where-Object { $methodsWithAuditLogging -notcontains $_.Method }
if ($methodsWithoutLogging.Count -gt 0) {
Write-Host "`n❌ Methods without audit logging:" -ForegroundColor Red
foreach ($method in $methodsWithoutLogging | Select-Object -First 10) {
Write-Host " - $($method.File):$($method.Line) → $($method.Method)" -ForegroundColor Red
}
if ($methodsWithoutLogging.Count -gt 10) {
Write-Host " ... and $($methodsWithoutLogging.Count - 10) more"
}
}
# Fail if coverage below threshold
if ($coveragePercent -lt $Threshold) {
Write-Error "Audit logging coverage ($($coveragePercent.ToString('F1'))%) below threshold ($Threshold%)"
Write-Error "Add IAuditLogger.LogAsync() calls to all state-mutating methods."
exit 1
}
Write-Host "`n✅ Audit logging validation passed" -ForegroundColor Green
Azure Pipelines Integration:
# Audit Logging Validation Gate
- task: PowerShell@2
inputs:
filePath: 'scripts/validate-audit-logging.ps1'
arguments: '-Path "$(Build.SourcesDirectory)" -Threshold 100'
pwsh: true
displayName: 'Validate Audit Logging Coverage'
continueOnError: false # Fail build if coverage < 100%
Example: Correct Audit Logging:
// ✅ GOOD: State-mutating method with audit logging
public class AuditEventService
{
private readonly IAuditLogger _auditLogger;
private readonly IAuditEventRepository _repository;
public AuditEventService(IAuditLogger auditLogger, IAuditEventRepository repository)
{
_auditLogger = auditLogger;
_repository = repository;
}
public async Task<AuditEvent> CreateEventAsync(CreateAuditEventRequest request, CancellationToken ct)
{
var evt = new AuditEvent
{
Id = Guid.NewGuid(),
TenantId = request.TenantId,
Action = request.Action,
Timestamp = DateTime.UtcNow,
UserId = request.UserId
};
await _repository.AddAsync(evt, ct);
// ✅ Audit logging call (REQUIRED)
await _auditLogger.LogAsync(new AuditLogEntry
{
EntityType = nameof(AuditEvent),
EntityId = evt.Id.ToString(),
Operation = AuditOperation.Create,
Timestamp = DateTime.UtcNow,
UserId = request.UserId,
TenantId = request.TenantId,
Changes = new { evt }
}, ct);
return evt;
}
public async Task<AuditEvent> UpdateEventAsync(Guid id, UpdateAuditEventRequest request, CancellationToken ct)
{
var evt = await _repository.GetByIdAsync(id, ct);
if (evt == null)
throw new NotFoundException($"Audit event {id} not found");
var oldValues = new { evt.Status, evt.Notes };
evt.Status = request.Status;
evt.Notes = request.Notes;
await _repository.UpdateAsync(evt, ct);
// ✅ Audit logging call with before/after values
await _auditLogger.LogAsync(new AuditLogEntry
{
EntityType = nameof(AuditEvent),
EntityId = evt.Id.ToString(),
Operation = AuditOperation.Update,
Timestamp = DateTime.UtcNow,
UserId = request.UserId,
TenantId = evt.TenantId,
Changes = new
{
Before = oldValues,
After = new { evt.Status, evt.Notes }
}
}, ct);
return evt;
}
}
// ❌ BAD: State-mutating method without audit logging
public async Task DeleteEventAsync(Guid id, CancellationToken ct)
{
var evt = await _repository.GetByIdAsync(id, ct);
await _repository.DeleteAsync(evt, ct);
// ❌ MISSING: IAuditLogger.LogAsync() call
// This will be flagged by ATP001 analyzer and fail the build
}
PII Redaction Validation¶
Purpose: Prevent Personally Identifiable Information (PII) from appearing in logs, telemetry, or error messages to comply with GDPR/HIPAA.
Requirement: All sensitive data must be redacted before logging using custom attributes or redaction filters.
Tool: Custom log parser that scans for PII patterns (email, phone, SSN, credit card)
Threshold: 0 — No raw PII allowed in any log statements
PII Redaction Validator (PowerShell):
<#
.SYNOPSIS
Validate PII redaction in ATP services.
.DESCRIPTION
Scans C# source code and log statements for unredacted PII (email, phone, SSN).
#>
param(
[string]$Path = "$(Build.SourcesDirectory)"
)
Write-Host "Validating PII redaction in: $Path"
# PII patterns to detect
$piiPatterns = @{
"Email" = '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
"Phone" = '\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b'
"SSN" = '\b\d{3}-\d{2}-\d{4}\b'
"CreditCard" = '\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b'
}
$violations = @()
# Find all C# files
$csFiles = Get-ChildItem -Path $Path -Recurse -Filter *.cs |
Where-Object { $_.FullName -notmatch "\\Tests\\" }
foreach ($file in $csFiles) {
$content = Get-Content -Path $file.FullName -Raw
$lines = Get-Content -Path $file.FullName
# Find log statements (ILogger calls)
$logPattern = '(_logger|logger|_log|log)\.(Log\w+|Information|Warning|Error|Debug)\('
$logMatches = [regex]::Matches($content, $logPattern)
foreach ($logMatch in $logMatches) {
# Extract log statement (up to closing parenthesis)
$statementStart = $logMatch.Index
$depth = 0
$statementEnd = $statementStart
for ($i = $statementStart; $i -lt $content.Length; $i++) {
if ($content[$i] -eq '(') { $depth++ }
if ($content[$i] -eq ')') {
$depth--
if ($depth -eq 0) {
$statementEnd = $i
break
}
}
}
$logStatement = $content.Substring($statementStart, $statementEnd - $statementStart + 1)
# Check for PII patterns in log statement
foreach ($piiType in $piiPatterns.Keys) {
if ($logStatement -match $piiPatterns[$piiType]) {
$lineNumber = ($content.Substring(0, $statementStart) -split "`n").Count
$violations += [PSCustomObject]@{
File = $file.Name
Line = $lineNumber
PIIType = $piiType
Statement = if ($logStatement.Length -gt 100) { $logStatement.Substring(0, 100) + "..." } else { $logStatement }
}
}
}
}
# Also check for direct string interpolation with user data
if ($content -match '\$"\{.*?(Email|Phone|SSN|UserId|TenantId).*?\}"' -and $content -match '_logger') {
Write-Warning "$($file.Name): Potential PII in string interpolation (manual review required)"
}
}
if ($violations.Count -gt 0) {
Write-Host "`n❌ PII detected in log statements:" -ForegroundColor Red
Write-Host " Total violations: $($violations.Count)" -ForegroundColor Red
Write-Host ""
foreach ($violation in $violations | Select-Object -First 10) {
Write-Host " - $($violation.File):$($violation.Line) → $($violation.PIIType)" -ForegroundColor Red
Write-Host " $($violation.Statement)" -ForegroundColor Yellow
}
if ($violations.Count -gt 10) {
Write-Host " ... and $($violations.Count - 10) more violations"
}
Write-Host "`n📚 Remediation:" -ForegroundColor Yellow
Write-Host " 1. Use redaction attributes: [EmailData], [PhoneData], [PersonalData]"
Write-Host " 2. Use structured logging with redacted parameters"
Write-Host " 3. Enable logging redaction in appsettings.json"
Write-Host ""
exit 1
}
# Check if redaction is enabled in appsettings.json
$appsettings = Get-ChildItem -Path $Path -Recurse -Filter appsettings.json | Select-Object -First 1
if ($appsettings) {
$config = Get-Content -Path $appsettings.FullName | ConvertFrom-Json
if ($config.Compliance.EnableLoggingRedaction -ne $true) {
Write-Warning "Logging redaction not enabled in appsettings.json"
Write-Warning " Set Compliance.EnableLoggingRedaction: true"
}
else {
Write-Host "✅ Logging redaction enabled in appsettings.json"
}
}
Write-Host "`n✅ PII redaction validation passed" -ForegroundColor Green
Azure Pipelines Integration:
# PII Redaction Validation Gate
- task: PowerShell@2
inputs:
filePath: 'scripts/validate-pii-redaction.ps1'
arguments: '-Path "$(Build.SourcesDirectory)"'
pwsh: true
displayName: 'Validate PII Redaction'
continueOnError: false # Fail build if PII detected
PII Redaction Examples:
// ❌ BAD: Raw PII in logs
public async Task ProcessUserAsync(User user)
{
_logger.LogInformation($"Processing user: {user.Email}"); // ❌ Raw email logged
_logger.LogInformation($"User phone: {user.PhoneNumber}"); // ❌ Raw phone logged
_logger.LogInformation($"SSN: {user.SSN}"); // ❌ SSN logged (CRITICAL violation)
}
// ✅ GOOD: Redacted PII with attributes
public class User
{
public Guid Id { get; set; }
[EmailData] // ✅ Redaction attribute
public string Email { get; set; }
[PhoneData] // ✅ Redaction attribute
public string PhoneNumber { get; set; }
[PersonalData] // ✅ Generic PII attribute
public string SSN { get; set; }
public string DisplayName { get; set; } // Public, can be logged
}
// ✅ GOOD: Structured logging with automatic redaction
public async Task ProcessUserAsync(User user)
{
_logger.LogInformation(
"Processing user {UserId} with email {Email} and phone {Phone}",
user.Id, // Safe (GUID)
user.Email, // ✅ Auto-redacted to "***@***.com"
user.PhoneNumber // ✅ Auto-redacted to "***-***-1234" (last 4 digits)
);
}
// ✅ GOOD: Manual redaction helper
public async Task ProcessUserAsync(User user)
{
_logger.LogInformation(
"Processing user {UserId} with email {Email}",
user.Id,
RedactEmail(user.Email) // ✅ Explicitly redacted
);
}
private string RedactEmail(string email)
{
if (string.IsNullOrEmpty(email))
return email;
var parts = email.Split('@');
if (parts.Length != 2)
return "***";
var username = parts[0];
var domain = parts[1];
// Show first char + last char, redact middle
var redactedUsername = username.Length > 2
? $"{username[0]}***{username[^1]}"
: "***";
return $"{redactedUsername}@{domain}";
}
Logging Redaction Configuration (appsettings.json):
{
"Compliance": {
"EnableLoggingRedaction": true,
"RedactionMode": "Automatic", // Automatic | Manual | Disabled
"RedactionAttributes": [
"EmailData",
"PhoneData",
"PersonalData",
"SensitiveData",
"CreditCardData"
],
"RedactionPlaceholder": "***",
"PreserveLastN": 4 // Show last 4 chars (e.g., ***-***-1234)
},
"Logging": {
"LogLevel": {
"Default": "Information"
},
"Enrichers": [
"PiiRedactionEnricher" // Serilog enricher for automatic redaction
]
}
}
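The `RedactionPlaceholder` and `PreserveLastN` options combine into a simple masking rule. A minimal sketch of the assumed semantics (not the actual enricher implementation; real output formatting, e.g. `***-***-1234`, would also depend on the data type):

```python
def redact(value: str, placeholder: str = "***", preserve_last_n: int = 4) -> str:
    """Mask a sensitive value, keeping only the trailing characters
    when the value is long enough for them to stay non-identifying."""
    if len(value) <= preserve_last_n:
        return placeholder  # too short to preserve anything safely
    return f"{placeholder}{value[-preserve_last_n:]}"
```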
GDPR/HIPAA Compliance Checklist¶
Purpose: Validate that all regulatory safeguards are implemented before deploying to staging/production.
Requirement: 8 compliance controls must be verified via automated checks and manual attestations.
Checklist Items:
| Control | Requirement | Validation Method | Blocker | Regulatory Basis |
|---|---|---|---|---|
| Encryption at Rest | All databases, storage accounts use encryption (TDE, SSE-AES256) | Azure Policy scan | ✅ Yes (staging/prod) | GDPR Art. 32, HIPAA §164.312(a)(2)(iv) |
| Encryption in Transit | TLS 1.3 enforced for all external APIs; TLS 1.2 minimum internal | Network security policy | ✅ Yes (prod) | GDPR Art. 32, HIPAA §164.312(e)(1) |
| Tenant Isolation | Multi-tenant data separation validated in integration tests | Test results (tag: @tenantIsolation) | ✅ Yes | GDPR Art. 32, HIPAA §164.308(a)(3) |
| Retention Policies | Configurable retention per tenant; default 7 years; auto-purge | Configuration validation + integration test | ✅ Yes | GDPR Art. 5(1)(e), HIPAA §164.316(b)(2)(i) |
| DSAR Workflow | Data Subject Access Request API implemented | API contract test (GET /api/dsar/{userId}) | ✅ Yes | GDPR Art. 15-20 |
| Breach Notification | Incident response procedure documented | Document exists in repo | ⚠️ Warning | GDPR Art. 33-34, HIPAA §164.404-414 |
| Audit Logging | 100% of write operations emit audit events | Custom Roslyn analyzer (ATP001) | ✅ Yes | GDPR Art. 30, HIPAA §164.312(b) |
| PII Redaction | No raw PII in logs, telemetry, error messages | Custom log parser (PowerShell) | ✅ Yes | GDPR Art. 5(1)(f), HIPAA §164.514(b) |
Compliance Checklist Validator (PowerShell):
<#
.SYNOPSIS
Validate GDPR/HIPAA compliance checklist.
.DESCRIPTION
Automated validation of compliance controls before deployment.
#>
param(
[string]$Environment = "Staging" # Dev | Test | Staging | Production
)
Write-Host "Validating GDPR/HIPAA compliance checklist for: $Environment"
$checklist = @()
# 1. Encryption at Rest
Write-Host "`nChecking: Encryption at Rest..."
$encryptionPolicy = az policy state list `
--resource-group "ATP-$Environment-RG" `
--filter "policyDefinitionName eq 'SQL TDE'" `
--query "[?complianceState=='Compliant'].resourceId" -o json | ConvertFrom-Json
if ($encryptionPolicy.Count -gt 0) {
Write-Host "✅ Encryption at Rest: PASS" -ForegroundColor Green
$checklist += [PSCustomObject]@{ Control = "Encryption at Rest"; Status = "Pass" }
}
else {
Write-Host "❌ Encryption at Rest: FAIL" -ForegroundColor Red
$checklist += [PSCustomObject]@{ Control = "Encryption at Rest"; Status = "Fail" }
}
# 2. Encryption in Transit
Write-Host "`nChecking: Encryption in Transit..."
$appServices = az webapp list `
--resource-group "ATP-$Environment-RG" `
--query "[].{name:name,httpsOnly:httpsOnly,minTlsVersion:siteConfig.minTlsVersion}" -o json | ConvertFrom-Json
$nonHttpsApps = $appServices | Where-Object { $_.httpsOnly -eq $false }
if ($nonHttpsApps.Count -eq 0) {
Write-Host "✅ Encryption in Transit: PASS (HTTPS enforced)" -ForegroundColor Green
$checklist += [PSCustomObject]@{ Control = "Encryption in Transit"; Status = "Pass" }
}
else {
Write-Host "❌ Encryption in Transit: FAIL (HTTPS not enforced)" -ForegroundColor Red
$checklist += [PSCustomObject]@{ Control = "Encryption in Transit"; Status = "Fail" }
}
# 3. Tenant Isolation (from test results)
Write-Host "`nChecking: Tenant Isolation..."
$testResults = az pipelines runs artifact download `
--artifact-name "test-results" `
--path "test-results/" `
--run-id $env:BUILD_BUILDID
$tenantIsolationTests = Select-String -Path "test-results/*.trx" -Pattern '@tenantIsolation.*Passed'
if ($tenantIsolationTests.Count -gt 0) {
Write-Host "✅ Tenant Isolation: PASS ($($tenantIsolationTests.Count) tests)" -ForegroundColor Green
$checklist += [PSCustomObject]@{ Control = "Tenant Isolation"; Status = "Pass" }
}
else {
Write-Host "❌ Tenant Isolation: FAIL (tests not found or failed)" -ForegroundColor Red
$checklist += [PSCustomObject]@{ Control = "Tenant Isolation"; Status = "Fail" }
}
# 4. Retention Policies
Write-Host "`nChecking: Retention Policies..."
$appsettingsFile = Get-ChildItem -Path "src" -Recurse -Filter "appsettings.$Environment.json" | Select-Object -First 1
$appsettings = Get-Content -Path $appsettingsFile.FullName -Raw | ConvertFrom-Json
if ($appsettings.Audit.RetentionDays -ge 2555) { # 7 years = 2555 days
Write-Host "✅ Retention Policies: PASS (7 years configured)" -ForegroundColor Green
$checklist += [PSCustomObject]@{ Control = "Retention Policies"; Status = "Pass" }
}
else {
Write-Host "❌ Retention Policies: FAIL (retention < 7 years)" -ForegroundColor Red
$checklist += [PSCustomObject]@{ Control = "Retention Policies"; Status = "Fail" }
}
# 5. DSAR Workflow (API contract test)
Write-Host "`nChecking: DSAR Workflow..."
$openApiSpec = Get-Content -Path "swagger.json" | ConvertFrom-Json
$dsarEndpoint = $openApiSpec.paths."/api/dsar/{userId}"
if ($dsarEndpoint) {
Write-Host "✅ DSAR Workflow: PASS (API endpoint exists)" -ForegroundColor Green
$checklist += [PSCustomObject]@{ Control = "DSAR Workflow"; Status = "Pass" }
}
else {
Write-Host "❌ DSAR Workflow: FAIL (API endpoint missing)" -ForegroundColor Red
$checklist += [PSCustomObject]@{ Control = "DSAR Workflow"; Status = "Fail" }
}
# 6. Breach Notification
Write-Host "`nChecking: Breach Notification..."
$breachDoc = Test-Path -Path "docs/compliance/breach-notification-procedure.md"
if ($breachDoc) {
Write-Host "✅ Breach Notification: PASS (procedure documented)" -ForegroundColor Green
$checklist += [PSCustomObject]@{ Control = "Breach Notification"; Status = "Pass" }
}
else {
Write-Host "⚠️ Breach Notification: WARNING (procedure not found)" -ForegroundColor Yellow
$checklist += [PSCustomObject]@{ Control = "Breach Notification"; Status = "Warning" }
}
# 7. Audit Logging (already validated by previous gate)
Write-Host "`nChecking: Audit Logging..."
Write-Host "✅ Audit Logging: PASS (validated in previous gate)" -ForegroundColor Green
$checklist += [PSCustomObject]@{ Control = "Audit Logging"; Status = "Pass" }
# 8. PII Redaction (already validated by previous gate)
Write-Host "`nChecking: PII Redaction..."
Write-Host "✅ PII Redaction: PASS (validated in previous gate)" -ForegroundColor Green
$checklist += [PSCustomObject]@{ Control = "PII Redaction"; Status = "Pass" }
# Summary
Write-Host "`n═══════════════════════════════════════════════════════════"
Write-Host "Compliance Checklist Summary"
Write-Host "═══════════════════════════════════════════════════════════"
$checklist | Format-Table -AutoSize
$failed = $checklist | Where-Object { $_.Status -eq "Fail" }
$warnings = $checklist | Where-Object { $_.Status -eq "Warning" }
if ($failed.Count -gt 0) {
Write-Host "`n❌ Compliance checklist FAILED: $($failed.Count) control(s)" -ForegroundColor Red
exit 1
}
if ($warnings.Count -gt 0) {
Write-Host "`n⚠️ Compliance checklist passed with warnings: $($warnings.Count) control(s)" -ForegroundColor Yellow
}
Write-Host "`n✅ All compliance controls validated" -ForegroundColor Green
Azure Pipelines Integration:
# GDPR/HIPAA Compliance Checklist Gate
- task: PowerShell@2
inputs:
filePath: 'scripts/validate-compliance-checklist.ps1'
arguments: '-Environment $(Environment)'
pwsh: true
displayName: 'Validate GDPR/HIPAA Checklist'
continueOnError: false # Fail deployment if checklist fails
# Generate compliance attestation report
- task: PowerShell@2
inputs:
targetType: 'inline'
script: |
$attestation = @{
BuildId = "$(Build.BuildId)"
BuildNumber = "$(Build.BuildNumber)"
Environment = "$(Environment)"
Timestamp = (Get-Date).ToUniversalTime().ToString("o")
Checklist = @(
@{ Control = "Encryption at Rest"; Status = "Pass"; Evidence = "Azure Policy: SQL TDE Enabled" }
@{ Control = "Encryption in Transit"; Status = "Pass"; Evidence = "App Service: HTTPS Only" }
@{ Control = "Tenant Isolation"; Status = "Pass"; Evidence = "Integration Tests: 15 passed" }
@{ Control = "Retention Policies"; Status = "Pass"; Evidence = "appsettings.Production.json: RetentionDays=2555" }
@{ Control = "DSAR Workflow"; Status = "Pass"; Evidence = "OpenAPI: GET /api/dsar/{userId}" }
@{ Control = "Breach Notification"; Status = "Pass"; Evidence = "docs/compliance/breach-notification-procedure.md" }
@{ Control = "Audit Logging"; Status = "Pass"; Evidence = "Audit Logging Coverage: 100%" }
@{ Control = "PII Redaction"; Status = "Pass"; Evidence = "PII Validation: 0 violations" }
)
Approver = "$(Build.RequestedFor)"
}
$attestation | ConvertTo-Json -Depth 10 | Out-File -FilePath "compliance-attestation-$(Build.BuildNumber).json"
Write-Host "✅ Compliance attestation report generated"
displayName: 'Generate Compliance Attestation'
# Publish attestation as artifact
- task: PublishBuildArtifacts@1
inputs:
PathtoPublish: 'compliance-attestation-$(Build.BuildNumber).json'
ArtifactName: 'compliance-attestation'
displayName: 'Publish Compliance Attestation'
Data Classification Validation¶
Purpose: Ensure all sensitive data is properly classified with appropriate attributes for GDPR/HIPAA compliance.
Data Classification Levels:
| Classification | Attribute | Examples | Protection Requirements |
|---|---|---|---|
| Public | (none) | Product names, public IDs | No special protection |
| Internal | [InternalData] | Employee names, department | Redact in external logs |
| Confidential | [ConfidentialData] | Tenant config, business rules | Redact in all logs, encrypt at rest |
| Personal | [PersonalData] | User names, preferences | GDPR Article 4(1), redact always |
| Sensitive | [SensitiveData] | Email, phone, address | GDPR Article 9, redact + encrypt |
| Restricted | [RestrictedData] | SSN, health data, biometrics | HIPAA PHI, maximum protection |
Classification Validator (C# Roslyn Analyzer):
// Custom analyzer to enforce data classification
[DiagnosticAnalyzer(LanguageNames.CSharp)]
public class DataClassificationAnalyzer : DiagnosticAnalyzer
{
private const string DiagnosticId = "ATP002";
private static readonly DiagnosticDescriptor Rule = new DiagnosticDescriptor(
DiagnosticId,
"Sensitive property missing classification attribute",
"Property '{0}' contains sensitive data but lacks [PersonalData], [SensitiveData], or [RestrictedData] attribute",
"Compliance",
DiagnosticSeverity.Error,
isEnabledByDefault: true);
public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics => ImmutableArray.Create(Rule);
public override void Initialize(AnalysisContext context)
{
context.RegisterSyntaxNodeAction(AnalyzeProperty, SyntaxKind.PropertyDeclaration);
}
private void AnalyzeProperty(SyntaxNodeAnalysisContext context)
{
var propertyDeclaration = (PropertyDeclarationSyntax)context.Node;
var propertySymbol = context.SemanticModel.GetDeclaredSymbol(propertyDeclaration);
if (propertySymbol == null)
return;
var propertyName = propertySymbol.Name.ToLower();
// Sensitive property names
var sensitiveNames = new[]
{
"email", "phone", "ssn", "socialsecurity", "creditcard",
"password", "healthrecord", "biometric", "dob", "dateofbirth"
};
var isSensitive = sensitiveNames.Any(s => propertyName.Contains(s));
if (!isSensitive)
return;
// Check if property has classification attribute
var hasClassificationAttribute = propertySymbol.GetAttributes().Any(attr =>
{
var attrName = attr.AttributeClass?.Name;
return attrName == "PersonalDataAttribute" ||
attrName == "SensitiveDataAttribute" ||
attrName == "RestrictedDataAttribute" ||
attrName == "EmailDataAttribute" ||
attrName == "PhoneDataAttribute";
});
if (!hasClassificationAttribute)
{
var diagnostic = Diagnostic.Create(Rule, propertyDeclaration.Identifier.GetLocation(), propertySymbol.Name);
context.ReportDiagnostic(diagnostic);
}
}
}
Example: Proper Data Classification:
// ✅ GOOD: Entity with proper data classification
public class User
{
// Public data (no attribute needed)
public Guid Id { get; set; }
public string DisplayName { get; set; }
public DateTime CreatedAt { get; set; }
// Personal data (GDPR Article 4(1))
[PersonalData]
public string FirstName { get; set; }
[PersonalData]
public string LastName { get; set; }
// Sensitive data (GDPR Article 9)
[EmailData]
[SensitiveData]
public string Email { get; set; }
[PhoneData]
[SensitiveData]
public string PhoneNumber { get; set; }
[SensitiveData]
public string Address { get; set; }
// Restricted data (HIPAA PHI)
[RestrictedData]
[EncryptedColumn] // Column-level encryption
public string SSN { get; set; }
[RestrictedData]
[EncryptedColumn]
public string HealthRecordNumber { get; set; }
}
// ❌ BAD: Sensitive property without classification
public class User
{
public Guid Id { get; set; }
public string Email { get; set; } // ❌ ATP002: Missing [EmailData] or [SensitiveData]
public string SSN { get; set; } // ❌ ATP002: Missing [RestrictedData]
}
Retention Policy Validation¶
Purpose: Validate that data retention policies are properly configured per GDPR Article 5(1)(e) (storage limitation) and HIPAA §164.316(b)(2)(i).
Validation (Integration Test):
// Integration test: Validate retention policy enforcement
[Fact]
[Trait("Category", "Compliance")]
[Trait("Regulatory", "GDPR")]
public async Task Should_EnforceRetentionPolicy_When_EventsExceedRetention()
{
// Arrange: Create event older than retention period
var retentionDays = _configuration.GetValue<int>("Audit:RetentionDays");
var oldEvent = new AuditEvent
{
Id = Guid.NewGuid(),
TenantId = _testTenant.Id,
Timestamp = DateTime.UtcNow.AddDays(-retentionDays - 1), // Beyond retention
Action = "OldAction"
};
await _repository.AddAsync(oldEvent);
// Act: Run retention policy enforcement job
var retentionService = _serviceProvider.GetRequiredService<IRetentionPolicyService>();
var purgedCount = await retentionService.EnforceRetentionPolicyAsync(_testTenant.Id);
// Assert: Old event should be purged
Assert.Equal(1, purgedCount);
var retrievedEvent = await _repository.GetByIdAsync(oldEvent.Id);
Assert.Null(retrievedEvent); // Event should be deleted
}
[Fact]
[Trait("Category", "Compliance")]
[Trait("Regulatory", "GDPR")]
public async Task Should_RespectCustomRetention_When_TenantOverridesDefault()
{
// Arrange: Tenant with custom 10-year retention
var customTenant = new Tenant
{
Id = Guid.NewGuid(),
Name = "CustomRetentionTenant",
RetentionDays = 3650 // 10 years (overrides default 7 years)
};
await _tenantRepository.AddAsync(customTenant);
var event8YearsOld = new AuditEvent
{
Id = Guid.NewGuid(),
TenantId = customTenant.Id,
Timestamp = DateTime.UtcNow.AddDays(-(365 * 8)), // 8 years old
Action = "OldAction"
};
await _repository.AddAsync(event8YearsOld);
// Act: Run retention enforcement
var retentionService = _serviceProvider.GetRequiredService<IRetentionPolicyService>();
var purgedCount = await retentionService.EnforceRetentionPolicyAsync(customTenant.Id);
// Assert: Event should NOT be purged (within custom 10-year retention)
Assert.Equal(0, purgedCount);
var retrievedEvent = await _repository.GetByIdAsync(event8YearsOld.Id);
Assert.NotNull(retrievedEvent); // Event still exists
}
DSAR (Data Subject Access Request) Workflow Validation¶
Purpose: Validate that DSAR workflows are implemented per GDPR Articles 15-20 (Right of Access, Erasure, Portability).
API Contract Test:
// API contract test for DSAR workflow
[Fact]
[Trait("Category", "Compliance")]
[Trait("Regulatory", "GDPR-Article-15")]
public async Task Should_ReturnUserData_When_DSARRequested()
{
// Arrange: Create test user with audit events
var userId = Guid.NewGuid();
var events = Enumerable.Range(0, 10)
.Select(i => new AuditEvent
{
Id = Guid.NewGuid(),
TenantId = _testTenant.Id,
UserId = userId,
Action = $"Action{i}",
Timestamp = DateTime.UtcNow.AddDays(-i)
})
.ToList();
await _repository.AddRangeAsync(events);
// Act: Request DSAR export
var response = await _client.GetAsync($"/api/dsar/{userId}");
// Assert: DSAR returns all user data
response.EnsureSuccessStatusCode();
var dsar = await response.Content.ReadFromJsonAsync<DSARExportResponse>();
Assert.NotNull(dsar);
Assert.Equal(userId, dsar.UserId);
Assert.Equal(10, dsar.AuditEvents.Count);
Assert.Equal("application/json", response.Content.Headers.ContentType?.MediaType);
// Validate DSAR contains required sections per GDPR
Assert.NotNull(dsar.PersonalData);
Assert.NotNull(dsar.AuditTrail);
Assert.NotNull(dsar.DataProcessingActivities);
Assert.NotNull(dsar.ThirdPartyDisclosures);
}
[Fact]
[Trait("Category", "Compliance")]
[Trait("Regulatory", "GDPR-Article-17")]
public async Task Should_EraseUserData_When_RightToErasureInvoked()
{
// Arrange: Create test user
var userId = Guid.NewGuid();
await CreateTestUserWithDataAsync(userId);
// Act: Request erasure (Right to be Forgotten)
var response = await _client.DeleteAsync($"/api/dsar/{userId}");
// Assert: User data erased
response.EnsureSuccessStatusCode();
// Verify audit events anonymized (user ID replaced with pseudonym)
var events = await _repository.GetByUserIdAsync(userId);
Assert.Empty(events); // User's events should be anonymized or deleted
// Verify user record marked as erased
var user = await _userRepository.GetByIdAsync(userId);
Assert.True(user.IsErased);
Assert.Equal("ERASED", user.Email); // PII overwritten
}
Compliance Evidence Collection¶
Purpose: Automatically collect compliance evidence during builds for SOC 2, GDPR, HIPAA audits.
Evidence Artifacts:
# Collect compliance evidence
- task: PowerShell@2
inputs:
targetType: 'inline'
script: |
$evidenceDir = "$(Build.ArtifactStagingDirectory)/compliance-evidence"
New-Item -ItemType Directory -Force -Path $evidenceDir
# 1. SBOM (already generated)
Copy-Item "$(Build.ArtifactStagingDirectory)/sbom/*.json" -Destination "$evidenceDir/sbom.json"
# 2. Security scan reports
Copy-Item "dependency-check-report.html" -Destination "$evidenceDir/dependency-scan.html"
Copy-Item "trivy-report.html" -Destination "$evidenceDir/container-scan.html"
# 3. Test results with compliance tags
Copy-Item "TestResults/*.trx" -Destination "$evidenceDir/test-results.trx"
# 4. Code coverage report
Copy-Item "coverage-report/index.html" -Destination "$evidenceDir/code-coverage.html"
# 5. Compliance attestation
Copy-Item "compliance-attestation-$(Build.BuildNumber).json" -Destination "$evidenceDir/compliance-attestation.json"
# 6. Audit logging coverage report
Copy-Item "audit-logging-coverage.json" -Destination "$evidenceDir/audit-logging-coverage.json"
# 7. PII redaction report
Copy-Item "pii-redaction-report.json" -Destination "$evidenceDir/pii-redaction-report.json"
# 8. License compliance report
Copy-Item "licenses/licenses.json" -Destination "$evidenceDir/license-report.json"
Write-Host "✅ Compliance evidence collected: 8 artifacts"
displayName: 'Collect Compliance Evidence'
# Publish compliance evidence bundle
- task: PublishBuildArtifacts@1
inputs:
PathtoPublish: '$(Build.ArtifactStagingDirectory)/compliance-evidence'
ArtifactName: 'compliance-evidence-$(Build.BuildNumber)'
displayName: 'Publish Compliance Evidence Bundle'
# Archive to immutable storage (production only)
- task: AzureCLI@2
inputs:
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
# Upload evidence bundle
az storage blob upload-batch \
--source "$(Build.ArtifactStagingDirectory)/compliance-evidence" \
--destination compliance-evidence \
--account-name atpcomplianceblob \
--pattern "*" \
--metadata \
BuildId=$(Build.BuildId) \
BuildNumber=$(Build.BuildNumber) \
Environment=Production \
ComplianceFrameworks=GDPR,HIPAA,SOC2 \
RetentionYears=7
# Enable a container-level legal hold (tags must be 3-23 alphanumeric characters)
az storage container legal-hold set \
--account-name atpcomplianceblob \
--container-name compliance-evidence \
--tags auditevidence soc2 gdpr hipaa
displayName: 'Archive Compliance Evidence (Immutable)'
condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
SOC 2 Control Mapping¶
Purpose: Map ATP quality gates to SOC 2 Trust Service Criteria for audit readiness.
SOC 2 Control Mapping:
| SOC 2 Control | Description | ATP Quality Gate | Evidence |
|---|---|---|---|
| CC6.1 | Logical and physical access controls | Two-person review (PRs), Key Vault access | PR approvals, Key Vault audit logs |
| CC6.6 | Vulnerability management | Dependency scanning, Trivy, SAST | OWASP reports, Trivy reports |
| CC7.2 | System monitoring | Observability gates, health checks | Application Insights, test results |
| CC8.1 | Change management | CAB approval, deployment gates | CAB meeting minutes, approval logs |
| CC9.2 | Risk mitigation | Security gates, compliance gates | Risk acceptance documents, suppressions |
| A1.2 | Availability commitments | Performance gates, chaos tests | Load test results, DR drill results |
| C1.1 | Confidentiality commitments | PII redaction, encryption gates | PII validation reports, Azure Policy |
| P2.1 | Choice and consent | DSAR workflow, consent management | DSAR API tests, consent logs |
| P3.2 | Privacy access, correction, deletion | DSAR API implementation | DSAR integration tests |
SOC 2 Evidence Package (auto-generated per build):
// Generate SOC 2 evidence package
public class SOC2EvidenceCollector
{
public async Task<SOC2EvidencePackage> CollectEvidenceAsync(string buildId)
{
return new SOC2EvidencePackage
{
BuildId = buildId,
GeneratedAt = DateTime.UtcNow,
// CC6.1: Logical Access Controls
CC6_1 = new ControlEvidence
{
ControlId = "CC6.1",
Description = "Logical and physical access controls restrict unauthorized access",
Evidence = new[]
{
await GetPRApprovalLogsAsync(buildId),
await GetKeyVaultAccessLogsAsync(),
await GetAzureADAccessReviewsAsync()
}
},
// CC6.6: Vulnerability Management
CC6_6 = new ControlEvidence
{
ControlId = "CC6.6",
Description = "Vulnerabilities are identified and remediated timely",
Evidence = new[]
{
await GetDependencyScanReportAsync(buildId),
await GetTrivyScanReportAsync(buildId),
await GetSASTReportAsync(buildId),
await GetVulnerabilityRemediationMetricsAsync()
}
},
// CC8.1: Change Management
CC8_1 = new ControlEvidence
{
ControlId = "CC8.1",
Description = "Changes are authorized, tested, and approved before deployment",
Evidence = new[]
{
await GetCABApprovalRecordsAsync(buildId),
await GetPipelineExecutionLogsAsync(buildId),
await GetDeploymentApprovalLogsAsync(buildId),
await GetTestResultsAsync(buildId)
}
},
// P3.2: Privacy Rights (DSAR)
P3_2 = new ControlEvidence
{
ControlId = "P3.2",
Description = "Individuals can access, correct, and delete personal data",
Evidence = new[]
{
await GetDSARAPITestResultsAsync(buildId),
await GetDSARExecutionLogsAsync(),
"DSAR API: GET /api/dsar/{userId}, DELETE /api/dsar/{userId}"
}
}
};
}
}
Compliance Gate Metrics & Reporting¶
Purpose: Track compliance posture over time and generate audit-ready reports.
Compliance Metrics Dashboard:
# Azure DevOps Compliance Dashboard
dashboard:
name: "ATP Compliance Posture"
widgets:
- type: complianceScore
title: "Overall Compliance Score"
query: |
customEvents
| where name == "ComplianceChecklistValidated"
| extend PassedControls = toint(customDimensions.PassedControls)
| extend TotalControls = toint(customDimensions.TotalControls)
| extend Score = (PassedControls * 100.0) / TotalControls
| summarize AvgScore = avg(Score)
target: 100%
- type: auditLoggingCoverage
title: "Audit Logging Coverage"
query: "Audit Logging Coverage (Last 30 Builds)"
target: 100%
- type: piiViolations
title: "PII Leakage Incidents"
query: |
customEvents
| where name == "PIIDetectedInLogs"
| where timestamp > ago(30d)
| summarize count()
target: 0
- type: dsarResponseTime
title: "DSAR Response Time"
query: |
customEvents
| where name == "DSARCompleted"
| extend ResponseTimeHours = todouble(customDimensions.ResponseTimeHours)
| summarize AvgResponseTime = avg(ResponseTimeHours)
target: < 72h (GDPR requirement: 30 days, ATP target: 3 days)
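The compliance score the dashboard derives is a straight ratio of passed to total controls. A one-function sketch matching the KQL expression above:

```python
def compliance_score(passed_controls: int, total_controls: int) -> float:
    """Percentage of compliance controls passing, as in the dashboard query."""
    if total_controls <= 0:
        raise ValueError("total_controls must be positive")
    return passed_controls * 100.0 / total_controls
```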
Compliance KQL Queries:
// Compliance checklist pass rate (last 90 days)
customEvents
| where name == "ComplianceChecklistValidated"
| where timestamp > ago(90d)
| extend PassedControls = toint(customDimensions.PassedControls)
| extend TotalControls = toint(customDimensions.TotalControls)
| extend PassRate = (PassedControls * 100.0) / TotalControls
| summarize
AvgPassRate = avg(PassRate),
MinPassRate = min(PassRate),
BuildsBelow100Percent = countif(PassRate < 100)
| extend ComplianceStatus = iff(MinPassRate == 100, "Compliant", "Non-Compliant")
// Audit logging coverage trend
customEvents
| where name == "AuditLoggingValidated"
| where timestamp > ago(90d)
| extend Coverage = todouble(customDimensions.CoveragePercent)
| summarize AvgCoverage = avg(Coverage) by bin(timestamp, 1d)
| render timechart
// PII leakage incidents by severity
customEvents
| where name == "PIIDetectedInLogs"
| where timestamp > ago(180d)
| extend PIIType = tostring(customDimensions.PIIType)
| extend File = tostring(customDimensions.File)
| summarize Count = count() by PIIType
| order by Count desc
// DSAR request fulfillment metrics
customEvents
| where name in ("DSARRequested", "DSARCompleted", "DSARFailed")
| where timestamp > ago(90d)
| extend RequestId = tostring(customDimensions.RequestId)
| summarize
RequestedAt = minif(timestamp, name == "DSARRequested"),
CompletedAt = maxif(timestamp, name == "DSARCompleted"),
Status = iff(countif(name == "DSARCompleted") > 0, "Completed", "Pending")
by RequestId
| where isnotnull(CompletedAt)
| extend ResponseTimeHours = datetime_diff('hour', CompletedAt, RequestedAt)
| summarize
AvgResponseTime = avg(ResponseTimeHours),
P50ResponseTime = percentile(ResponseTimeHours, 50),
P95ResponseTime = percentile(ResponseTimeHours, 95),
Within72Hours = 100.0 * countif(ResponseTimeHours <= 72) / count()
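The DSAR fulfillment query above pairs `DSARRequested`/`DSARCompleted` events by `RequestId` before summarizing response times. The same pairing-and-summary logic can be sketched in plain Python (hypothetical event shape; the real events live in Application Insights):

```python
from datetime import datetime, timedelta
import math

def dsar_metrics(events):
    """Pair DSARRequested/DSARCompleted by RequestId and summarize
    response times in hours, mirroring the KQL query above."""
    requested, completed = {}, {}
    for e in events:
        if e["name"] == "DSARRequested":
            prev = requested.get(e["RequestId"], e["timestamp"])
            requested[e["RequestId"]] = min(prev, e["timestamp"])   # minif()
        elif e["name"] == "DSARCompleted":
            prev = completed.get(e["RequestId"], e["timestamp"])
            completed[e["RequestId"]] = max(prev, e["timestamp"])   # maxif()
    hours = [
        (completed[rid] - requested[rid]).total_seconds() / 3600
        for rid in completed if rid in requested
    ]
    if not hours:
        return None
    ranked = sorted(hours)
    p95 = ranked[math.ceil(0.95 * len(ranked)) - 1]  # nearest-rank percentile
    return {
        "avg_hours": sum(hours) / len(hours),
        "p95_hours": p95,
        "within_72h_pct": 100.0 * sum(h <= 72 for h in hours) / len(hours),
    }

t0 = datetime(2024, 1, 1)
events = [
    {"name": "DSARRequested", "RequestId": "a", "timestamp": t0},
    {"name": "DSARCompleted", "RequestId": "a", "timestamp": t0 + timedelta(hours=24)},
    {"name": "DSARRequested", "RequestId": "b", "timestamp": t0},
    {"name": "DSARCompleted", "RequestId": "b", "timestamp": t0 + timedelta(hours=96)},
]
print(dsar_metrics(events))
```

In this example one request completes in 24 hours and one in 96, so `within_72h_pct` is 50%, which would miss the ATP 72-hour target despite both being well inside the GDPR one-month window.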
Compliance Audit Report Generation¶
Purpose: Generate audit-ready reports summarizing compliance posture for SOC 2, GDPR, HIPAA audits.
Monthly Compliance Report (Azure Function):
// Generate monthly compliance report
[FunctionName("GenerateMonthlyComplianceReport")]
public async Task RunAsync(
[TimerTrigger("0 0 9 1 * *")] TimerInfo timer, // 1st of month at 9 AM
ILogger log)
{
log.LogInformation("Generating monthly compliance report...");
var reportMonth = DateTime.UtcNow.AddMonths(-1).ToString("MMMM yyyy");
var report = new ComplianceReport
{
Month = reportMonth,
GeneratedAt = DateTime.UtcNow,
// Overall Compliance Score
OverallScore = await CalculateComplianceScoreAsync(),
// Quality Gate Pass Rates
QualityGates = new QualityGateMetrics
{
BuildQuality = await GetGatePassRateAsync("BuildQuality"),
TestCoverage = await GetGatePassRateAsync("TestCoverage"),
Security = await GetGatePassRateAsync("Security"),
Compliance = await GetGatePassRateAsync("Compliance"),
Performance = await GetGatePassRateAsync("Performance"),
Observability = await GetGatePassRateAsync("Observability")
},
// Audit Logging
AuditLogging = new AuditLoggingMetrics
{
AverageCoverage = await GetAuditLoggingCoverageAsync(),
EventsLogged = await GetAuditEventCountAsync(reportMonth),
ComplianceRate = 100.0 // Always 100% (enforced by gate)
},
// PII Protection
PIIProtection = new PIIProtectionMetrics
{
LeakageIncidents = await GetPIILeakageCountAsync(reportMonth),
RedactionEffectiveness = await GetPIIRedactionRateAsync(),
DataClassificationCoverage = await GetDataClassificationCoverageAsync()
},
// GDPR Compliance
GDPR = new GDPRMetrics
{
DSARRequests = await GetDSARRequestCountAsync(reportMonth),
DSARAvgResponseTime = await GetDSARAvgResponseTimeAsync(reportMonth),
RightToErasureRequests = await GetErasureRequestCountAsync(reportMonth),
DataBreachIncidents = await GetDataBreachCountAsync(reportMonth)
},
// HIPAA Compliance
HIPAA = new HIPAAMetrics
{
EncryptionCompliance = await GetEncryptionComplianceRateAsync(),
AccessControlAudits = await GetAccessControlAuditCountAsync(reportMonth),
BAAAgreementsActive = await GetBAACountAsync()
},
// SOC 2 Controls
SOC2 = await GenerateSOC2ControlEvidenceAsync(reportMonth)
};
// Generate PDF report
var pdf = await GeneratePdfReportAsync(report);
// Upload to compliance blob storage
await UploadComplianceReportAsync($"compliance-reports/{reportMonth}/Compliance-Report-{reportMonth}.pdf", pdf);
// Send to stakeholders
await SendReportAsync(pdf, new[]
{
"ciso@connectsoft.example",
"compliance-officer@connectsoft.example",
"dpo@connectsoft.example", // Data Protection Officer
"external-auditors@connectsoft.example"
});
log.LogInformation($"✅ Monthly compliance report generated for {reportMonth}");
}
Summary¶
- Compliance Gates: 2-3 minute execution; enforce GDPR, HIPAA, SOC 2 requirements before deployment
- Audit Logging Validation: 100% coverage required; custom Roslyn analyzer (ATP001) enforces IAuditLogger.LogAsync() calls
- Audit Logging Validator: PowerShell script scans for state-mutating methods (Create/Update/Delete/Save/Add/Remove) without audit calls
- PII Redaction Validation: Zero tolerance for raw PII in logs; PowerShell script detects email/phone/SSN patterns
- PII Patterns: 4 regex patterns (email, phone, SSN, credit card); scans all log statements
- PII Redaction: C# examples with [EmailData], [PhoneData], [PersonalData] attributes; automatic redaction via Serilog enricher
- GDPR/HIPAA Checklist: 8 controls (encryption at rest/transit, tenant isolation, retention, DSAR, breach notification, audit logging, PII redaction)
- Checklist Validator: PowerShell script validates all 8 controls via Azure Policy, test results, configuration, API contracts, documentation
- Data Classification: 6 levels (Public, Internal, Confidential, Personal, Sensitive, Restricted); custom analyzer (ATP002) enforces classification
- Retention Policy Validation: Integration tests verify 7-year retention enforcement and custom tenant overrides
- DSAR Workflow: API contract tests validate GDPR Article 15 (access), Article 17 (erasure), Article 20 (portability)
- Compliance Evidence: 8 artifacts auto-collected per build (SBOM, security scans, test results, coverage, attestation, audit logging, PII redaction, licenses)
- SOC 2 Mapping: 10 Trust Service Criteria mapped to ATP gates with evidence artifacts
- Compliance Reporting: Monthly automated report (Azure Function) covering quality gates, audit logging, PII protection, GDPR, HIPAA, SOC 2
- Immutable Evidence: All compliance artifacts archived in Azure Blob with legal hold (7-year retention)
Performance Gates (Deep Dive)¶
Performance gates validate that ATP services meet latency, throughput, and reliability requirements under production-like load conditions and failure scenarios. These gates execute in the staging environment and block production deployment if performance thresholds are not met.
Philosophy: Performance is a feature, not an afterthought. ATP enforces industry-leading performance standards (p95 <500ms vs. industry <1000ms) and chaos engineering to ensure services degrade gracefully under adverse conditions.
Performance Gate Workflow¶
graph TD
A[Compliance Gates Passed] --> B[Deploy to Staging]
B --> C[Run Load Tests]
C --> D{Latency < Threshold?}
D -->|No| E[Latency Too High ❌]
D -->|Yes| F{Error Rate < 0.1%?}
F -->|No| G[Error Rate Too High ❌]
F -->|Yes| H{Throughput ≥ 1000 RPS?}
H -->|No| I[Throughput Too Low ⚠️]
H -->|Yes| J[Run Chaos Tests]
J --> K{Pod Restart Pass?}
K -->|No| L[Pod Restart Failed ❌]
K -->|Yes| M{Network Latency Pass?}
M -->|No| N[Network Latency Failed ⚠️]
M -->|Yes| O{Storage Failure Pass?}
O -->|No| P[Storage Failure Failed ❌]
O -->|Yes| Q[Performance Gates Passed ✅]
E --> R[Block Production Deployment]
G --> R
L --> R
P --> R
I --> S[Warning: Monitor Capacity]
N --> S
Q --> T[Ready for Production]
style E fill:#ff6b6b
style G fill:#ff6b6b
style L fill:#ff6b6b
style P fill:#ff6b6b
style I fill:#feca57
style N fill:#feca57
style Q fill:#90EE90
Typical Performance Gate Duration: 10-15 minutes (load tests) + 5-10 minutes (chaos tests) = 15-25 minutes total
Load Test Thresholds (Staging)¶
Purpose: Simulate production traffic patterns (500-1000 concurrent users) and validate that services meet latency and error rate requirements.
Tool: Apache JMeter or k6 for load testing
Test Configuration:
| Parameter | Value | Rationale |
|---|---|---|
| Concurrent Users | 500 | ~50% of production peak traffic |
| Ramp-Up Time | 60 seconds | Gradual traffic increase |
| Test Duration | 600 seconds (10 minutes) | Sufficient for steady-state analysis |
| Request Mix | 70% read, 30% write | Matches production patterns |
| Data Set | 1M audit events | Production-like scale |
Performance Thresholds:
| Metric | Threshold | Action | Rationale |
|---|---|---|---|
| p50 Latency | < 100ms | ⚠️ Warning; investigate | Median user experience; competitive with industry leaders |
| p95 Latency | < 500ms | ❌ Block prod deploy | 95% of requests must be fast; ATP requirement stricter than industry (<1000ms) |
| p99 Latency | < 1000ms | ⚠️ Warning; track outliers | 99% percentile; acceptable for edge cases |
| Error Rate | < 0.1% | ❌ Block prod deploy | 99.9% success rate; <1 error per 1000 requests |
| Throughput | ≥ 1000 RPS | ℹ️ Info; capacity planning | Sustained requests/second; validates scaling |
| CPU Utilization | < 70% avg | ⚠️ Warning; optimize | Headroom for traffic spikes |
| Memory Utilization | < 80% avg | ⚠️ Warning; investigate leaks | Prevent OOM conditions |
| Database DTU/RU | < 80% | ⚠️ Warning; scale up | Database capacity headroom |
JMeter Test Plan (XML excerpt):
<?xml version="1.0" encoding="UTF-8"?>
<jmeterTestPlan version="1.2" properties="5.0" jmeter="5.6.2">
<hashTree>
<!-- Test Plan -->
<TestPlan guiclass="TestPlanGui" testclass="TestPlan" testname="ATP Load Test - Staging">
<stringProp name="TestPlan.comments">ATP load test simulating production traffic</stringProp>
<boolProp name="TestPlan.functional_mode">false</boolProp>
<boolProp name="TestPlan.serialize_threadgroups">false</boolProp>
<elementProp name="TestPlan.user_defined_variables" elementType="Arguments">
<collectionProp name="Arguments.arguments">
<elementProp name="BASE_URL" elementType="Argument">
<stringProp name="Argument.name">BASE_URL</stringProp>
<stringProp name="Argument.value">${__P(baseUrl,https://atp-staging.azurewebsites.net)}</stringProp>
</elementProp>
<elementProp name="USERS" elementType="Argument">
<stringProp name="Argument.name">USERS</stringProp>
<stringProp name="Argument.value">${__P(users,500)}</stringProp>
</elementProp>
<elementProp name="RAMP_UP" elementType="Argument">
<stringProp name="Argument.name">RAMP_UP</stringProp>
<stringProp name="Argument.value">${__P(rampUp,60)}</stringProp>
</elementProp>
<elementProp name="DURATION" elementType="Argument">
<stringProp name="Argument.name">DURATION</stringProp>
<stringProp name="Argument.value">${__P(duration,600)}</stringProp>
</elementProp>
</collectionProp>
</elementProp>
</TestPlan>
<hashTree>
<!-- Thread Group: Read Operations (70% of traffic) -->
<ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="Read Operations">
<stringProp name="ThreadGroup.num_threads">${USERS}</stringProp>
<stringProp name="ThreadGroup.ramp_time">${RAMP_UP}</stringProp>
<stringProp name="ThreadGroup.duration">${DURATION}</stringProp>
<boolProp name="ThreadGroup.scheduler">true</boolProp>
</ThreadGroup>
<hashTree>
<!-- GET /api/audit-events (list events) -->
<HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="GET Audit Events">
<elementProp name="HTTPsampler.Arguments" elementType="Arguments">
<collectionProp name="Arguments.arguments">
<elementProp name="tenantId" elementType="HTTPArgument">
<stringProp name="Argument.value">${__UUID()}</stringProp>
</elementProp>
<elementProp name="pageSize" elementType="HTTPArgument">
<stringProp name="Argument.value">50</stringProp>
</elementProp>
</collectionProp>
</elementProp>
<stringProp name="HTTPSampler.domain">${BASE_URL}</stringProp>
<stringProp name="HTTPSampler.path">/api/audit-events</stringProp>
<stringProp name="HTTPSampler.method">GET</stringProp>
</HTTPSamplerProxy>
<!-- Assertion: Response time < 500ms (p95) -->
<DurationAssertion guiclass="DurationAssertionGui" testclass="DurationAssertion" testname="Response Time < 500ms">
<longProp name="DurationAssertion.duration">500</longProp>
</DurationAssertion>
<!-- Assertion: HTTP 200 OK -->
<ResponseAssertion guiclass="AssertionGui" testclass="ResponseAssertion" testname="HTTP 200 OK">
<collectionProp name="Asserion.test_strings">
<stringProp name="49586">200</stringProp>
</collectionProp>
<stringProp name="Assertion.test_field">Assertion.response_code</stringProp>
</ResponseAssertion>
</hashTree>
<!-- Thread Group: Write Operations (30% of traffic) -->
<ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="Write Operations">
<stringProp name="ThreadGroup.num_threads">${__groovy(Math.round(${USERS} * 0.3))}</stringProp>
<stringProp name="ThreadGroup.ramp_time">${RAMP_UP}</stringProp>
<stringProp name="ThreadGroup.duration">${DURATION}</stringProp>
</ThreadGroup>
<hashTree>
<!-- POST /api/audit-events (create event) -->
<HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="POST Audit Event">
<boolProp name="HTTPSampler.postBodyRaw">true</boolProp>
<elementProp name="HTTPsampler.Arguments" elementType="Arguments">
<collectionProp name="Arguments.arguments">
<elementProp name="" elementType="HTTPArgument">
<boolProp name="HTTPArgument.always_encode">false</boolProp>
<stringProp name="Argument.value">{
"tenantId": "${__UUID()}",
"action": "UserLogin",
"userId": "${__UUID()}",
"timestamp": "${__time(yyyy-MM-dd'T'HH:mm:ss'Z')}"
}</stringProp>
</elementProp>
</collectionProp>
</elementProp>
<stringProp name="HTTPSampler.domain">${BASE_URL}</stringProp>
<stringProp name="HTTPSampler.path">/api/audit-events</stringProp>
<stringProp name="HTTPSampler.method">POST</stringProp>
<stringProp name="HTTPSampler.contentEncoding">UTF-8</stringProp>
<!-- Content-Type: application/json is set via an HTTP Header Manager (omitted for brevity) -->
</HTTPSamplerProxy>
</hashTree>
<!-- Listeners: Aggregate Report -->
<ResultCollector guiclass="StatVisualizer" testclass="ResultCollector" testname="Aggregate Report">
<boolProp name="ResultCollector.error_logging">false</boolProp>
<objProp>
<name>saveConfig</name>
<value class="SampleSaveConfiguration">
<time>true</time>
<latency>true</latency>
<timestamp>true</timestamp>
<success>true</success>
<label>true</label>
<code>true</code>
<message>true</message>
<threadName>true</threadName>
<dataType>true</dataType>
<encoding>false</encoding>
<assertions>true</assertions>
<subresults>true</subresults>
<responseData>false</responseData>
<samplerData>false</samplerData>
<xml>false</xml>
<fieldNames>true</fieldNames>
<responseHeaders>false</responseHeaders>
<requestHeaders>false</requestHeaders>
<responseDataOnError>false</responseDataOnError>
<saveAssertionResultsFailureMessage>true</saveAssertionResultsFailureMessage>
<assertionsResultsToSave>0</assertionsResultsToSave>
<bytes>true</bytes>
<sentBytes>true</sentBytes>
<url>true</url>
<threadCounts>true</threadCounts>
<idleTime>true</idleTime>
<connectTime>true</connectTime>
</value>
</objProp>
<stringProp name="filename">load-test-results.jtl</stringProp>
</ResultCollector>
</hashTree>
</hashTree>
</jmeterTestPlan>
Azure Pipelines Load Test Execution:
# Load Test Gate (Staging Environment)
- stage: Performance_Tests
displayName: 'Performance Testing (Staging)'
dependsOn: Deploy_Staging
condition: succeeded()
jobs:
- job: LoadTest
displayName: 'Run Load Tests'
pool:
vmImage: 'ubuntu-latest'
steps:
# Install JMeter
- script: |
wget https://archive.apache.org/dist/jmeter/binaries/apache-jmeter-5.6.2.tgz
tar -xzf apache-jmeter-5.6.2.tgz
sudo mv apache-jmeter-5.6.2 /opt/jmeter
export PATH=$PATH:/opt/jmeter/bin
jmeter --version
displayName: 'Install JMeter'
# Run load test
- script: |
/opt/jmeter/bin/jmeter \
-n \
-t load-tests/atp-load-test.jmx \
-l load-test-results.jtl \
-e \
-o load-test-report \
-JbaseUrl=$(StagingUrl) \
-Jusers=500 \
-JrampUp=60 \
-Jduration=600
displayName: 'Execute Load Test (10 minutes)'
timeoutInMinutes: 15
# Analyze results
- task: PowerShell@2
inputs:
targetType: 'inline'
script: |
# Parse JMeter results
$results = Import-Csv load-test-results.jtl -Delimiter ","
# Calculate metrics
$latencies = $results | Where-Object { $_.success -eq "true" } | Select-Object -ExpandProperty elapsed | ForEach-Object { [int]$_ }
$errors = $results | Where-Object { $_.success -eq "false" }
$sorted = $latencies | Sort-Object
$p50 = $sorted[[math]::Floor($sorted.Count * 0.50)]
$p95 = $sorted[[math]::Floor($sorted.Count * 0.95)]
$p99 = $sorted[[math]::Floor($sorted.Count * 0.99)]
$errorRate = ($errors.Count / $results.Count) * 100
$throughput = $results.Count / 600 # Total requests / 600 seconds
Write-Host "Load Test Results:"
Write-Host " p50 Latency: ${p50}ms (threshold: <100ms)"
Write-Host " p95 Latency: ${p95}ms (threshold: <500ms)"
Write-Host " p99 Latency: ${p99}ms (threshold: <1000ms)"
Write-Host " Error Rate: $($errorRate.ToString('F2'))% (threshold: <0.1%)"
Write-Host " Throughput: $($throughput.ToString('F1')) RPS (threshold: ≥1000)"
# Validate thresholds
$failed = $false
if ($p50 -gt 100) {
Write-Warning "p50 latency exceeded threshold: ${p50}ms > 100ms"
}
if ($p95 -gt 500) {
Write-Error "❌ p95 latency exceeded threshold: ${p95}ms > 500ms (BLOCKER)"
$failed = $true
}
if ($p99 -gt 1000) {
Write-Warning "p99 latency exceeded threshold: ${p99}ms > 1000ms"
}
if ($errorRate -gt 0.1) {
Write-Error "❌ Error rate exceeded threshold: $($errorRate.ToString('F2'))% > 0.1% (BLOCKER)"
$failed = $true
}
if ($throughput -lt 1000) {
Write-Warning "Throughput below target: $($throughput.ToString('F1')) RPS < 1000 RPS"
}
if ($failed) {
Write-Error "Load test failed; blocking production deployment"
exit 1
}
Write-Host "`n✅ Load test passed all thresholds"
displayName: 'Validate Load Test Results'
# Publish results
- task: PublishBuildArtifacts@1
inputs:
PathtoPublish: 'load-test-report'
ArtifactName: 'load-test-report-$(Build.BuildNumber)'
displayName: 'Publish Load Test Report'
condition: always()
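The result-parsing step above can equally be done in a small Python script. This sketch computes nearest-rank percentiles, error rate, and throughput from JTL rows (assuming JMeter's default CSV column names, with `elapsed` in milliseconds and `success` serialized as `"true"`/`"false"`):

```python
import csv, io, math

def analyze_jtl(jtl_text, duration_s=600):
    """Summarize a JMeter .jtl (CSV) result file against the gate thresholds."""
    rows = list(csv.DictReader(io.StringIO(jtl_text)))
    # Percentiles are computed over successful samples only, as in the pipeline.
    elapsed = sorted(int(r["elapsed"]) for r in rows if r["success"] == "true")
    errors = sum(r["success"] == "false" for r in rows)

    def pct(p):  # nearest-rank percentile
        return elapsed[math.ceil(p * len(elapsed)) - 1]

    return {
        "p50": pct(0.50), "p95": pct(0.95), "p99": pct(0.99),
        "error_rate_pct": 100.0 * errors / len(rows),
        "throughput_rps": len(rows) / duration_s,
    }

# Tiny synthetic JTL for illustration.
jtl = "elapsed,success\n" + "\n".join(
    [f"{ms},true" for ms in (80, 90, 120, 450, 900)] + ["100,false"]
)
print(analyze_jtl(jtl))
```

Nearest-rank percentiles avoid the off-by-one risk of indexing an unsorted-then-rounded position, which matters when the p95 value sits exactly at the blocking threshold.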
k6 Alternative (Modern Load Testing):
// load-test.js (k6)
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';
// Custom metrics
const errorRate = new Rate('errors');
const latency = new Trend('latency');
// Test configuration
export const options = {
stages: [
{ duration: '1m', target: 500 }, // Ramp-up to 500 users
{ duration: '10m', target: 500 }, // Hold at 500 users for 10 minutes
{ duration: '1m', target: 0 }, // Ramp-down
],
thresholds: {
'http_req_duration{type:read}': ['p(50)<100', 'p(95)<500', 'p(99)<1000'], // Latency thresholds
'http_req_duration{type:write}': ['p(95)<800'], // Writes can be slower
'http_req_failed': ['rate<0.001'], // <0.1% error rate
'http_reqs': ['rate>1000'], // ≥1000 requests/second
},
};
const BASE_URL = __ENV.BASE_URL || 'https://atp-staging.azurewebsites.net';
export default function () {
// 70% read operations
if (Math.random() < 0.7) {
const res = http.get(`${BASE_URL}/api/audit-events?pageSize=50`, {
tags: { type: 'read' },
});
check(res, {
'status is 200': (r) => r.status === 200,
'latency < 500ms': (r) => r.timings.duration < 500,
});
errorRate.add(res.status !== 200);
latency.add(res.timings.duration, { type: 'read' });
}
// 30% write operations
else {
const payload = JSON.stringify({
tenantId: '12345678-1234-1234-1234-123456789012',
action: 'UserLogin',
userId: '87654321-4321-4321-4321-210987654321',
timestamp: new Date().toISOString(),
});
const res = http.post(`${BASE_URL}/api/audit-events`, payload, {
headers: { 'Content-Type': 'application/json' },
tags: { type: 'write' },
});
check(res, {
'status is 201': (r) => r.status === 201,
'latency < 800ms': (r) => r.timings.duration < 800,
});
errorRate.add(res.status !== 201);
latency.add(res.timings.duration, { type: 'write' });
}
sleep(1); // 1 second delay between requests
}
// Run k6 in Azure Pipelines
// k6 run --out json=load-test-results.json load-test.js
Chaos Test Pass Rate¶
Purpose: Validate that ATP services degrade gracefully under failure conditions (pod restarts, network latency, storage unavailable).
Tool: Chaos Mesh (Kubernetes-native chaos engineering) or Azure Chaos Studio
Chaos Scenarios:
| Scenario | Description | Pass Rate | Blocker | Expected Behavior |
|---|---|---|---|---|
| Pod Restart | Random pod killed every 30s | 100% | ✅ Yes | Graceful shutdown, requests redistributed, no data loss, <5s recovery |
| Network Latency | 500ms latency added to pod | 95% | ❌ No | Timeouts honored, retries triggered, circuit breaker opens |
| Storage Unavailable | SQL/Blob down for 30s | 100% | ✅ Yes | Circuit breaker opens, degraded mode, cached data served, no cascading failures |
| CPU Throttle | Pod CPU limited to 50% | 90% | ❌ No | Graceful degradation, autoscaling triggered, no OOM kills |
| Memory Pressure | 80% memory consumed | 95% | ❌ No | GC triggered, cache eviction, no OOM exceptions |
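The blocker/non-blocker distinction in the table above can be sketched as a small gate evaluator (scenario names and observed pass rates are hypothetical; the real values come from the chaos pipeline):

```python
# Scenario records mirroring the chaos table: required pass rate and blocker flag.
SCENARIOS = [
    {"name": "pod_restart",         "required": 100, "blocker": True},
    {"name": "network_latency",     "required": 95,  "blocker": False},
    {"name": "storage_unavailable", "required": 100, "blocker": True},
    {"name": "cpu_throttle",        "required": 90,  "blocker": False},
    {"name": "memory_pressure",     "required": 95,  "blocker": False},
]

def evaluate_chaos(observed):
    """Return (block, warnings): block=True if any blocker scenario misses
    its required pass rate; a non-blocker miss only produces a warning."""
    block, warnings = False, []
    for s in SCENARIOS:
        rate = observed.get(s["name"], 0)  # missing scenario counts as failed
        if rate < s["required"]:
            if s["blocker"]:
                block = True
            warnings.append(f"{s['name']}: {rate}% < {s['required']}%")
    return block, warnings

block, warnings = evaluate_chaos(
    {"pod_restart": 100, "network_latency": 93, "storage_unavailable": 100,
     "cpu_throttle": 91, "memory_pressure": 97}
)
print(block, warnings)  # block is False; only network_latency produces a warning
```

A 93% pass rate on network latency is logged for follow-up but does not stop the release, whereas any miss on pod restart or storage unavailability blocks production.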
Chaos Test Configuration (YAML):
# Chaos engineering tests
chaosTests:
- scenario: pod_restart
description: "Random pod restart every 30 seconds"
duration: 300 # 5 minutes
passRate: 100% # Must handle gracefully
blockerOnFail: true
expectedBehavior:
- Graceful shutdown (SIGTERM handled)
- Requests redistributed to healthy pods
- No data loss or corruption
- Recovery time < 5 seconds
- scenario: network_latency_500ms
description: "Add 500ms network latency to pod"
duration: 300
passRate: 95% # Allow some failures
blockerOnFail: false
expectedBehavior:
- Timeouts honored (circuit breaker opens)
- Retries triggered with exponential backoff
- Graceful degradation (cached responses)
- scenario: storage_unavailable
description: "SQL/Blob storage unavailable for 30 seconds"
duration: 300
passRate: 100% # Critical; must degrade gracefully
blockerOnFail: true
expectedBehavior:
- Circuit breaker opens immediately
- Degraded mode activated (read from cache)
- No cascading failures to other services
- Auto-recovery when storage returns
- scenario: cpu_throttle
description: "Limit pod CPU to 50% of requested"
duration: 300
passRate: 90%
blockerOnFail: false
expectedBehavior:
- Autoscaling triggered (horizontal pod autoscaler)
- Request queue managed (no rejections)
- Graceful performance degradation
- scenario: memory_pressure
description: "Consume 80% of pod memory"
duration: 300
passRate: 95%
blockerOnFail: false
expectedBehavior:
- Garbage collection triggered
- Cache eviction (LRU policy)
- No OutOfMemoryException
Chaos Mesh Manifests (Kubernetes):
# Chaos Experiment: Pod Restart
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: pod-restart-test
namespace: atp-staging
spec:
action: pod-kill
mode: one # Kill one random pod
selector:
namespaces:
- atp-staging
labelSelectors:
app: atp-ingestion
scheduler:
cron: "@every 30s" # Kill pod every 30 seconds
duration: "5m" # Test duration
---
# Chaos Experiment: Network Latency
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: network-latency-test
namespace: atp-staging
spec:
action: delay
mode: one
selector:
namespaces:
- atp-staging
labelSelectors:
app: atp-ingestion
delay:
latency: "500ms"
correlation: "50" # 50% correlation between delays
jitter: "100ms"
direction: both # Both ingress and egress
duration: "5m"
---
# Chaos Experiment: Storage I/O Chaos
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
name: storage-unavailable-test
namespace: atp-staging
spec:
action: fault
mode: one
selector:
namespaces:
- atp-staging
labelSelectors:
app: atp-ingestion
volumePath: /data
path: /data/**/*
errno: 5 # I/O error
percent: 100 # 100% of I/O operations fail
duration: "30s" # Storage down for 30 seconds
scheduler:
cron: "@every 2m" # Inject failure every 2 minutes
---
# Chaos Experiment: CPU Stress
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: cpu-throttle-test
namespace: atp-staging
spec:
mode: one
selector:
namespaces:
- atp-staging
labelSelectors:
app: atp-ingestion
stressors:
cpu:
workers: 2 # 2 CPU-intensive workers
load: 50 # 50% CPU load
duration: "5m"
---
# Chaos Experiment: Memory Pressure
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: memory-pressure-test
namespace: atp-staging
spec:
mode: one
selector:
namespaces:
- atp-staging
labelSelectors:
app: atp-ingestion
stressors:
memory:
workers: 1
size: "512MB" # Consume 512MB (80% of 640MB pod limit)
duration: "5m"
Chaos Test Execution (Azure Pipelines):
# Chaos Engineering Tests (Staging)
- job: ChaosTests
displayName: 'Run Chaos Tests'
dependsOn: LoadTest
condition: succeeded()
steps:
# Install Chaos Mesh CLI
- script: |
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash
export PATH=$PATH:$HOME/.local/bin
chaos-mesh version
displayName: 'Install Chaos Mesh CLI'
# Connect to AKS cluster
- task: AzureCLI@2
inputs:
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
az aks get-credentials \
--resource-group ATP-Staging-RG \
--name atp-aks-staging-eus
displayName: 'Connect to AKS Staging'
# Scenario 1: Pod Restart
- script: |
# Apply chaos experiment
kubectl apply -f chaos-tests/pod-restart.yaml
# Wait for experiment to complete
sleep 300
# Check application metrics during chaos
ERRORS=$(kubectl logs -l app=atp-ingestion --since=5m | grep -c ERROR || true)
if [ "$ERRORS" -gt 10 ]; then
echo "❌ Pod restart chaos test failed: $ERRORS errors detected"
exit 1
fi
# Clean up experiment
kubectl delete -f chaos-tests/pod-restart.yaml
echo "✅ Pod restart chaos test passed"
displayName: 'Chaos Test: Pod Restart'
# Scenario 2: Network Latency
- script: |
kubectl apply -f chaos-tests/network-latency.yaml
sleep 300
# Validate circuit breaker opened (check metrics)
CIRCUIT_BREAKER_OPENS=$(curl -s $(StagingUrl)/metrics | grep 'circuit_breaker_state{state="open"}' | awk '{print $2}')
if [ "${CIRCUIT_BREAKER_OPENS:-0}" -eq 0 ]; then
echo "⚠️ Circuit breaker did not open during network latency"
fi
kubectl delete -f chaos-tests/network-latency.yaml
echo "✅ Network latency chaos test passed"
displayName: 'Chaos Test: Network Latency'
# Scenario 3: Storage Unavailable
- script: |
kubectl apply -f chaos-tests/storage-unavailable.yaml
sleep 60 # Wait for 30s failure + 30s recovery
# Validate no cascading failures
POD_STATUS=$(kubectl get pods -l app=atp-ingestion -o jsonpath='{.items[*].status.phase}')
if echo "$POD_STATUS" | grep -q "CrashLoopBackOff"; then
echo "❌ Storage failure caused cascading pod crashes"
exit 1
fi
kubectl delete -f chaos-tests/storage-unavailable.yaml
echo "✅ Storage unavailable chaos test passed"
displayName: 'Chaos Test: Storage Unavailable'
continueOnError: false # BLOCKER: Fail if storage failure causes crashes
Performance Baseline & Regression Detection¶
Purpose: Track performance trends over time and detect regressions (latency increases, throughput decreases).
Performance Baseline Tracking:
// Track performance baseline per build
public class PerformanceBaselineTracker
{
public async Task RecordBaselineAsync(string buildId, PerformanceMetrics metrics)
{
var baseline = new PerformanceBaseline
{
BuildId = buildId,
BuildNumber = await GetBuildNumberAsync(buildId),
Timestamp = DateTime.UtcNow,
P50Latency = metrics.P50Latency,
P95Latency = metrics.P95Latency,
P99Latency = metrics.P99Latency,
ErrorRate = metrics.ErrorRate,
Throughput = metrics.Throughput,
CpuUtilization = metrics.CpuUtilization,
MemoryUtilization = metrics.MemoryUtilization,
DatabaseUtilization = metrics.DatabaseUtilization
};
await _cosmosClient.UpsertAsync(baseline);
}
public async Task<PerformanceRegressionResult> DetectRegressionAsync(
string currentBuildId,
PerformanceMetrics currentMetrics)
{
// Get last 10 builds for baseline
var historicalBaselines = await _cosmosClient.QueryAsync<PerformanceBaseline>(
q => q.OrderByDescending(b => b.Timestamp).Take(10));
var avgP95 = historicalBaselines.Average(b => b.P95Latency);
var avgErrorRate = historicalBaselines.Average(b => b.ErrorRate);
var avgThroughput = historicalBaselines.Average(b => b.Throughput);
var regression = new PerformanceRegressionResult();
// Detect p95 latency regression (>20% increase)
if (currentMetrics.P95Latency > avgP95 * 1.2)
{
regression.HasRegression = true;
regression.Issues.Add($"p95 latency increased {((currentMetrics.P95Latency / avgP95 - 1) * 100):F1}% from baseline");
}
// Detect error rate regression (>50% increase)
if (currentMetrics.ErrorRate > avgErrorRate * 1.5)
{
regression.HasRegression = true;
regression.Issues.Add($"Error rate increased {((currentMetrics.ErrorRate / avgErrorRate - 1) * 100):F1}% from baseline");
}
// Detect throughput regression (>10% decrease)
if (currentMetrics.Throughput < avgThroughput * 0.9)
{
regression.HasRegression = true;
regression.Issues.Add($"Throughput decreased {((1 - currentMetrics.Throughput / avgThroughput) * 100):F1}% from baseline");
}
if (regression.HasRegression)
{
// Create work item for investigation
await CreatePerformanceRegressionWorkItemAsync(currentBuildId, regression);
}
return regression;
}
}
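The regression rules in the C# tracker (>20% p95 latency increase, >50% error-rate increase, >10% throughput decrease against the average of recent baselines) reduce to a few comparisons. A minimal Python sketch with hypothetical metric dicts:

```python
def detect_regression(current, baselines):
    """Compare current metrics to the average of recent baselines using the
    same thresholds as the tracker above: >20% p95 latency increase,
    >50% error-rate increase, >10% throughput decrease."""
    avg = lambda key: sum(b[key] for b in baselines) / len(baselines)
    issues = []
    if current["p95"] > avg("p95") * 1.2:
        issues.append(f"p95 latency up {(current['p95'] / avg('p95') - 1) * 100:.1f}%")
    if current["error_rate"] > avg("error_rate") * 1.5:
        issues.append(f"error rate up {(current['error_rate'] / avg('error_rate') - 1) * 100:.1f}%")
    if current["throughput"] < avg("throughput") * 0.9:
        issues.append(f"throughput down {(1 - current['throughput'] / avg('throughput')) * 100:.1f}%")
    return issues

# Ten identical historical builds as the baseline window.
baselines = [{"p95": 400, "error_rate": 0.05, "throughput": 1200}] * 10
print(detect_regression({"p95": 520, "error_rate": 0.06, "throughput": 1150}, baselines))
```

Here a p95 of 520ms against a 400ms baseline is a 30% increase and is flagged, while the error-rate and throughput drifts stay inside their tolerance bands.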
Performance Optimization Guidance¶
Purpose: Provide actionable remediation when performance gates fail.
Common Performance Issues & Fixes:
| Issue | Symptom | Root Cause | Remediation | Typical Time |
|---|---|---|---|---|
| High p95 Latency | p95 > 500ms | N+1 queries, missing indexes | Add .Include() for EF Core, create indexes | 2-8 hours |
| High Error Rate | Error rate > 0.1% | Race conditions, deadlocks | Add retries, pessimistic locking, idempotency | 1-3 days |
| Low Throughput | < 1000 RPS | Synchronous I/O, thread pool exhaustion | Use async/await, increase thread pool, add caching | 1-2 days |
| Memory Leak | Memory grows over time | Undisposed objects, event handler leaks | Implement IDisposable, remove event handlers | 1-2 days |
| Database Bottleneck | High DTU/RU utilization | Missing indexes, inefficient queries | Add indexes, optimize queries, use read replicas | 4 hours - 2 days |
| Cache Miss Rate | High latency on reads | Insufficient cache warming, short TTL | Increase TTL, implement cache warming, add Redis cluster | 4-8 hours |
Performance Profiling (dotnet-trace):
#!/bin/bash
# profile-performance.sh
POD_NAME=$1 # Kubernetes pod name
echo "Collecting performance trace from pod: $POD_NAME"
# Install dotnet-trace in pod
kubectl exec $POD_NAME -- bash -c "dotnet tool install --global dotnet-trace"
# Collect 60-second CPU trace
kubectl exec $POD_NAME -- bash -c "/root/.dotnet/tools/dotnet-trace collect --process-id 1 --duration 00:01:00 --format speedscope"
# Copy trace file locally
kubectl cp $POD_NAME:/tmp/trace.speedscope.json ./performance-trace.json
# Analyze trace (upload to https://speedscope.app)
echo "✅ Performance trace collected: performance-trace.json"
echo " Upload to https://speedscope.app for analysis"
Summary¶
- Performance Gates: 15-25 minute execution in staging; block production if thresholds exceeded
- Load Test Thresholds: p50 <100ms (warning), p95 <500ms (blocker), p99 <1000ms (warning), error rate <0.1% (blocker), throughput ≥1000 RPS (info)
- Load Test Configuration: 500 concurrent users, 60s ramp-up, 10-minute duration, 70% read / 30% write mix
- JMeter Test Plan: Complete XML with thread groups, HTTP samplers, assertions, result collectors
- k6 Alternative: Modern load testing with JavaScript DSL, custom metrics, threshold definitions
- Azure Pipelines Load Test: Install JMeter, execute test, analyze results (PowerShell parsing), publish HTML report
- Chaos Test Scenarios: 5 scenarios (pod restart, network latency, storage unavailable, CPU throttle, memory pressure)
- Chaos Pass Rates: Pod restart (100%, blocker), network latency (95%, non-blocker), storage unavailable (100%, blocker), CPU (90%), memory (95%)
- Chaos Mesh Manifests: 5 Kubernetes YAML manifests (PodChaos, NetworkChaos, IOChaos, StressChaos for CPU/memory)
- Azure Pipelines Chaos Tests: 3-scenario execution with validation (pod restart, network latency, storage unavailable)
- Performance Baseline: C# tracker recording metrics per build; regression detection (>20% latency increase, >50% error increase, >10% throughput decrease)
- Performance Issues Table: 6 common issues with root causes and remediation times (2 hours - 3 days)
- Performance Profiling: Bash script using dotnet-trace for CPU profiling in Kubernetes pods
Observability Gates (Deep Dive)¶
Observability gates validate that ATP services emit structured logs, distributed traces, custom metrics, and expose health check endpoints required for production monitoring, alerting, and incident response. These gates execute in all environments and block production deployment if observability requirements are not met.
Philosophy: Observability is not optional—production services must be fully observable (logs, traces, metrics, health checks) to enable rapid incident response (<15 minutes MTTR) and proactive issue detection (alerts before users notice).
Observability Gate Workflow¶
graph TD
A[Performance Gates Passed] --> B[Validate OpenTelemetry]
B --> C{All Endpoints Instrumented?}
C -->|No| D[Missing Instrumentation ❌]
C -->|Yes| E{Database Calls Instrumented?}
E -->|No| F[Missing DB Instrumentation ❌]
E -->|Yes| G{Custom Metrics Present?}
G -->|No| H[Missing Metrics ❌]
G -->|Yes| I{Health Checks Valid?}
I -->|No| J[Invalid Health Checks ❌]
I -->|Yes| Q{Structured Logging?}
Q -->|No| K[Unstructured Logs ❌]
Q -->|Yes| L{Trace Context Propagated?}
L -->|No| M[Trace Propagation Failed ❌]
L -->|Yes| N[Observability Gates Passed ✅]
D --> O[Block Production Deployment]
F --> O
H --> O
J --> O
K --> O
M --> O
N --> P[Ready for Production]
style D fill:#ff6b6b
style F fill:#ff6b6b
style H fill:#ff6b6b
style J fill:#ff6b6b
style K fill:#ff6b6b
style M fill:#ff6b6b
style N fill:#90EE90
Typical Observability Gate Duration: 5-10 minutes (validation scripts + health check tests)
OpenTelemetry Validation¶
Purpose: Ensure all ATP services are fully instrumented with OpenTelemetry for distributed tracing, custom metrics, and log correlation.
Validation Checks:
| Check | Requirement | Blocker | Rationale |
|---|---|---|---|
| HTTP Endpoints Instrumented | All endpoints have activity source spans | ✅ Yes | Without spans, requests are invisible in traces |
| Database Calls Instrumented | All EF Core / ADO.NET calls instrumented | ✅ Yes | Database operations are critical path; must be visible |
| Custom Metrics Present | Business KPIs emitted (audit records, queries) | ✅ Yes | Business metrics enable SLO tracking and capacity planning |
| Trace Context Propagated | Trace context passed via HTTP headers | ✅ Yes | Without propagation, distributed traces are incomplete |
| Activity Source Naming | Consistent naming: `ConnectSoft.ATP.{Service}` | ⚠️ Warning | Enables filtering and aggregation in observability tools |
OpenTelemetry Validation Script (PowerShell):
# scripts/validate-otel.ps1
param(
[Parameter(Mandatory=$true)]
[string]$Path,
[Parameter(Mandatory=$false)]
[string]$ServiceName = "ATP"
)
Write-Host "🔍 Validating OpenTelemetry instrumentation for $ServiceName services..." -ForegroundColor Cyan
$errors = @()
$warnings = @()
# Find all C# projects
$csprojFiles = Get-ChildItem -Path $Path -Filter "*.csproj" -Recurse | Where-Object {
$_.FullName -notlike "*\Test\*" -and $_.FullName -notlike "*\Tests\*"
}
foreach ($project in $csprojFiles) {
$projectPath = $project.DirectoryName
$projectName = $project.BaseName
Write-Host "`n📦 Validating project: $projectName" -ForegroundColor Yellow
# Check 1: OpenTelemetry NuGet packages present
$csprojContent = Get-Content $project.FullName -Raw
if ($csprojContent -notmatch "OpenTelemetry") {
$warnings += "⚠️ $projectName: Missing OpenTelemetry NuGet packages"
Write-Host " ⚠️ Missing OpenTelemetry packages" -ForegroundColor Yellow
} else {
Write-Host " ✅ OpenTelemetry packages found" -ForegroundColor Green
# Check for required packages
$requiredPackages = @(
"OpenTelemetry.Exporter.Console",
"OpenTelemetry.Exporter.OpenTelemetryProtocol",
"OpenTelemetry.Extensions.Hosting",
"OpenTelemetry.Instrumentation.AspNetCore",
"OpenTelemetry.Instrumentation.Http"
)
foreach ($pkg in $requiredPackages) {
if ($csprojContent -notmatch [regex]::Escape($pkg)) {
$warnings += "⚠️ $projectName: Missing recommended package: $pkg"
}
}
}
# Check 2: ActivitySource registration in Startup/Program.cs
$programFiles = @(
"$projectPath\Program.cs",
"$projectPath\Startup.cs",
"$projectPath\DependencyInjection.cs"
)
$foundActivitySource = $false
foreach ($file in $programFiles) {
if (Test-Path $file) {
$content = Get-Content $file -Raw
if ($content -match "ActivitySource|AddSource") {
$foundActivitySource = $true
Write-Host " ✅ ActivitySource registered" -ForegroundColor Green
break
}
}
}
if (-not $foundActivitySource) {
$errors += "❌ $projectName: No ActivitySource registration found in Program.cs/Startup.cs"
Write-Host " ❌ No ActivitySource registration" -ForegroundColor Red
}
# Check 3: HTTP endpoints have activity source spans
$controllerFiles = Get-ChildItem -Path $projectPath -Filter "*Controller.cs" -Recurse
$endpointFiles = Get-ChildItem -Path $projectPath -Filter "*Endpoints.cs" -Recurse
$allEndpoints = @($controllerFiles) + @($endpointFiles)
if ($allEndpoints.Count -eq 0) {
Write-Host " ⚠️ No controllers/endpoints found (may be library project)" -ForegroundColor Yellow
continue
}
foreach ($endpointFile in $allEndpoints) {
$endpointContent = Get-Content $endpointFile.FullName -Raw
# Check for HTTP verb attributes, including templated forms like [HttpGet("{id}")]
if ($endpointContent -match "\[Http(Get|Post|Put|Delete|Patch)") {
# Check if method uses ActivitySource or Activity.Current
if ($endpointContent -notmatch "ActivitySource|Activity\.Current|Activity\.Start") {
$errors += "❌ $projectName\$($endpointFile.Name): HTTP endpoint missing ActivitySource instrumentation"
Write-Host " ❌ $($endpointFile.Name): Missing instrumentation" -ForegroundColor Red
} else {
Write-Host " ✅ $($endpointFile.Name): Instrumented" -ForegroundColor Green
}
}
}
# Check 4: Database calls instrumented (EF Core / ADO.NET)
$dbFiles = Get-ChildItem -Path $projectPath -Filter "*DbContext.cs" -Recurse
$repositoryFiles = Get-ChildItem -Path $projectPath -Filter "*Repository.cs" -Recurse
$allDbFiles = @($dbFiles) + @($repositoryFiles)
foreach ($dbFile in $allDbFiles) {
$dbContent = Get-Content $dbFile.FullName -Raw
# Check for database operations
if ($dbContent -match "\.SaveChanges|\.Execute|\.Query|\.QueryAsync|\.Command") {
# Check if EF Core instrumentation is enabled
if ($csprojContent -notmatch "OpenTelemetry\.Instrumentation\.EntityFrameworkCore") {
$errors += "❌ $projectName\$($dbFile.Name): Database calls present but EF Core instrumentation missing"
Write-Host " ❌ $($dbFile.Name): Missing EF Core instrumentation" -ForegroundColor Red
} else {
Write-Host " ✅ $($dbFile.Name): EF Core instrumentation enabled" -ForegroundColor Green
}
}
}
# Check 5: Custom metrics present
$serviceFiles = Get-ChildItem -Path $projectPath -Filter "*Service.cs" -Recurse
$foundMetrics = $false
foreach ($serviceFile in $serviceFiles) {
$serviceContent = Get-Content $serviceFile.FullName -Raw
if ($serviceContent -match "Meter\.Create|CreateCounter|CreateHistogram|CreateGauge|CreateUpDownCounter") {
$foundMetrics = $true
Write-Host " ✅ Custom metrics found in $($serviceFile.Name)" -ForegroundColor Green
break
}
}
if (-not $foundMetrics -and $serviceFiles.Count -gt 0) {
$warnings += "⚠️ $projectName: No custom metrics found (business KPIs recommended)"
Write-Host " ⚠️ No custom metrics found" -ForegroundColor Yellow
}
# Check 6: Trace context propagation (HTTP client instrumentation)
$httpClientFiles = Get-ChildItem -Path $projectPath -Filter "*Client.cs" -Recurse
$foundHttpClientInstrumentation = $false
foreach ($clientFile in $httpClientFiles) {
$clientContent = Get-Content $clientFile.FullName -Raw
if ($clientContent -match "HttpClient|IHttpClientFactory") {
if ($csprojContent -match "OpenTelemetry\.Instrumentation\.Http") {
$foundHttpClientInstrumentation = $true
Write-Host " ✅ HTTP client instrumentation enabled" -ForegroundColor Green
break
}
}
}
if (-not $foundHttpClientInstrumentation -and $httpClientFiles.Count -gt 0) {
$warnings += "⚠️ $projectName: HTTP clients present but instrumentation may be missing"
Write-Host " ⚠️ HTTP client instrumentation not verified" -ForegroundColor Yellow
}
}
# Summary
Write-Host "`n" -NoNewline
Write-Host ("=" * 80) -ForegroundColor Cyan
Write-Host "Validation Summary" -ForegroundColor Cyan
Write-Host ("=" * 80) -ForegroundColor Cyan
if ($errors.Count -gt 0) {
Write-Host "`n❌ ERRORS ($($errors.Count)):" -ForegroundColor Red
foreach ($err in $errors) {
Write-Host " $err" -ForegroundColor Red
}
Write-Host "`n❌ OpenTelemetry validation FAILED. Fix errors before deployment." -ForegroundColor Red
exit 1
}
if ($warnings.Count -gt 0) {
Write-Host "`n⚠️ WARNINGS ($($warnings.Count)):" -ForegroundColor Yellow
foreach ($warning in $warnings) {
Write-Host " $warning" -ForegroundColor Yellow
}
}
Write-Host "`n✅ OpenTelemetry validation PASSED" -ForegroundColor Green
exit 0
Azure Pipelines Integration:
# Observability Gate: OpenTelemetry Validation
- stage: Observability_Gates
displayName: 'Observability Validation'
dependsOn: Build_Test_Publish
condition: succeeded()
jobs:
- job: ValidateObservability
displayName: 'Validate OpenTelemetry & Health Checks'
pool:
vmImage: 'windows-latest' # PowerShell script requires Windows
steps:
# Validate OpenTelemetry instrumentation
- task: PowerShell@2
inputs:
targetType: 'filePath'
filePath: '$(Build.SourcesDirectory)/scripts/validate-otel.ps1'
arguments: '-Path "$(Build.SourcesDirectory)" -ServiceName "ATP"'
displayName: 'Validate OpenTelemetry Instrumentation'
continueOnError: false # BLOCKER: Fail if instrumentation missing
# Run additional Roslyn analyzer checks
- task: DotNetCoreCLI@2
inputs:
command: 'build'
projects: '**/*.csproj'
arguments: '/p:EnforceOpenTelemetry=true /p:TreatWarningsAsErrors=true'
displayName: 'Build with OpenTelemetry Enforcement'
continueOnError: false
C# OpenTelemetry Setup Example (Required Pattern):
// Program.cs (ATP Ingestion Service)
using OpenTelemetry;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
using System.Diagnostics;
using System.Diagnostics.Metrics; // Meter, Counter<T>, Histogram<T>
var builder = WebApplication.CreateBuilder(args);
// Configure OpenTelemetry Resource
var resourceBuilder = ResourceBuilder.CreateDefault()
.AddAttributes(new Dictionary<string, object>
{
["service.name"] = "atp-ingestion",
["service.namespace"] = "ConnectSoft.ATP",
["deployment.environment"] = builder.Environment.EnvironmentName
});
// Configure Tracing
builder.Services.AddOpenTelemetry()
.WithTracing(tracing => tracing
.SetResourceBuilder(resourceBuilder)
.AddAspNetCoreInstrumentation(options =>
{
options.RecordException = true;
options.EnrichWithHttpRequest = (activity, request) =>
{
activity.SetTag("http.user_agent", request.Headers.UserAgent.ToString());
activity.SetTag("http.request_id", request.Headers["X-Request-Id"].ToString());
};
})
.AddHttpClientInstrumentation(options =>
{
options.RecordException = true;
options.EnrichWithHttpRequestMessage = (activity, request) =>
{
activity.SetTag("http.client.name", request.RequestUri?.Host);
};
})
.AddEntityFrameworkCoreInstrumentation(options =>
{
options.SetDbStatementForText = true;
options.EnrichWithIDbCommand = (activity, command) =>
{
activity.SetTag("db.statement.type", command.CommandType.ToString());
};
})
.AddSource("ConnectSoft.ATP.Ingestion") // Custom activity source
.AddOtlpExporter(options =>
{
options.Endpoint = new Uri(builder.Configuration["OpenTelemetry:OtlpEndpoint"]
?? "http://otel-collector:4317");
})
)
.WithMetrics(metrics => metrics
.SetResourceBuilder(resourceBuilder)
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddRuntimeInstrumentation() // GC, thread pool metrics
.AddMeter("ConnectSoft.ATP.Ingestion") // Custom metrics
.AddOtlpExporter(options =>
{
options.Endpoint = new Uri(builder.Configuration["OpenTelemetry:OtlpEndpoint"]
?? "http://otel-collector:4317");
})
);
// Configure Logging
builder.Logging.AddOpenTelemetry(options =>
{
options.SetResourceBuilder(resourceBuilder)
.AddOtlpExporter(options =>
{
options.Endpoint = new Uri(builder.Configuration["OpenTelemetry:OtlpEndpoint"]
?? "http://otel-collector:4317");
});
});
// Custom ActivitySource for business operations
var activitySource = new ActivitySource("ConnectSoft.ATP.Ingestion");
// Custom Metrics Meter
var meter = new Meter("ConnectSoft.ATP.Ingestion", "1.0.0");
var auditRecordsIngested = meter.CreateCounter<long>(
"atp.audit_records_ingested_total",
"records",
"Total number of audit records ingested"
);
var auditRecordsIngestedLatency = meter.CreateHistogram<double>(
"atp.audit_records_ingested_duration_ms",
"milliseconds",
"Latency of audit record ingestion"
);
var app = builder.Build();
// Health check endpoint (validated separately)
app.MapHealthChecks("/health");
app.Run();
Custom ActivitySource Usage Example:
// Controllers/AuditEventsController.cs
using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.AspNetCore.Mvc;
[ApiController]
[Route("api/[controller]")]
public class AuditEventsController : ControllerBase
{
private static readonly ActivitySource ActivitySource = new("ConnectSoft.ATP.Ingestion");
// Meter and instruments are static: creating a new instrument per controller
// instance (i.e., per request) would be wasteful and is unnecessary
private static readonly Meter Meter = new("ConnectSoft.ATP.Ingestion", "1.0.0");
private static readonly Counter<long> _recordsIngested =
Meter.CreateCounter<long>("atp.audit_records_ingested_total");
private readonly IAuditEventService _auditEventService;
public AuditEventsController(IAuditEventService auditEventService)
{
_auditEventService = auditEventService;
}
[HttpPost]
public async Task<IActionResult> CreateAuditEvent([FromBody] CreateAuditEventRequest request)
{
// Create activity for this operation
using var activity = ActivitySource.StartActivity("IngestAuditEvent");
activity?.SetTag("audit.event.action", request.Action);
activity?.SetTag("audit.event.tenant_id", request.TenantId);
activity?.SetTag("audit.event.user_id", request.UserId);
try
{
var stopwatch = Stopwatch.StartNew();
var result = await _auditEventService.IngestAsync(request);
stopwatch.Stop();
// Record custom metric
_recordsIngested.Add(1, new KeyValuePair<string, object>("action", request.Action), new KeyValuePair<string, object>("tenant_id", request.TenantId));
// Record latency metric
activity?.SetTag("audit.ingestion.duration_ms", stopwatch.ElapsedMilliseconds);
activity?.SetStatus(ActivityStatusCode.Ok);
return CreatedAtAction(nameof(GetAuditEvent), new { id = result.Id }, result);
}
catch (Exception ex)
{
activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
activity?.RecordException(ex);
throw;
}
}
}
Health Check Validation¶
Purpose: Ensure all ATP services expose liveness and readiness health check endpoints required for Kubernetes deployments and Azure App Service health monitoring.
Health Check Requirements:
| Endpoint | Purpose | Status Codes | Blocker |
|---|---|---|---|
| `/health/live` | Liveness probe (Kubernetes) | 200 (healthy), 503 (unhealthy) | ✅ Yes |
| `/health/ready` | Readiness probe (Kubernetes) | 200 (ready), 503 (not ready) | ✅ Yes |
| `/health/startup` | Startup probe (Kubernetes) | 200 (started), 503 (starting) | ⚠️ Warning |
| `/health` | Aggregated health (Azure App Service) | 200 (healthy), 503 (unhealthy) | ✅ Yes |
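The endpoints above map directly onto Kubernetes probe configuration in the service's deployment manifest. A sketch of that wiring follows (container name and port are illustrative, not taken from this document):

```yaml
# deployment.yaml (excerpt) — wiring ATP health endpoints to Kubernetes probes
containers:
  - name: atp-ingestion            # illustrative container name
    ports:
      - containerPort: 8080
    startupProbe:                  # optional; gives slow starters time before liveness applies
      httpGet:
        path: /health/startup
        port: 8080
      periodSeconds: 10
      failureThreshold: 30         # 30 × 10s = up to 5 minutes to start
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      periodSeconds: 10
      failureThreshold: 3          # restart the pod after 3 consecutive failures
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 10
      failureThreshold: 3          # remove pod from Service endpoints; no restart
```

Note the asymmetry: a failing liveness probe restarts the pod, while a failing readiness probe only stops routing traffic to it — which is why readiness aggregates dependency checks and liveness does not.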
Health Check Dependencies (Readiness Probe):
| Dependency | Check Type | Timeout | Failure Impact |
|---|---|---|---|
| Database | SQL connectivity + query | 5s | Service not ready (503) |
| Message Bus | Connection + queue availability | 5s | Service not ready (503) |
| Cache (Redis) | Connection + ping | 3s | Service degraded (200 with warning) |
| Blob Storage | Container existence | 5s | Service not ready (503) |
| Key Vault | Secret retrieval | 5s | Service not ready (503) |
| External APIs | HTTP health endpoint | 10s | Service degraded (200 with warning) |
Health Check Validation Script (PowerShell):
# scripts/validate-health-checks.ps1
param(
[Parameter(Mandatory=$true)]
[string]$ServiceUrl,
[Parameter(Mandatory=$false)]
[int]$TimeoutSeconds = 30
)
Write-Host "🏥 Validating health check endpoints for: $ServiceUrl" -ForegroundColor Cyan
$errors = @()
$warnings = @()
# Test 1: Liveness Probe (/health/live)
Write-Host "`n1. Testing liveness probe (/health/live)..." -ForegroundColor Yellow
try {
$response = Invoke-WebRequest -Uri "$ServiceUrl/health/live" `
-Method Get `
-TimeoutSec $TimeoutSeconds `
-UseBasicParsing `
-ErrorAction Stop
if ($response.StatusCode -eq 200) {
Write-Host " ✅ Liveness probe returns 200 OK" -ForegroundColor Green
} else {
$errors += "❌ Liveness probe returned status code $($response.StatusCode) (expected 200)"
Write-Host " ❌ Unexpected status code: $($response.StatusCode)" -ForegroundColor Red
}
}
catch {
$errors += "❌ Liveness probe failed: $_"
Write-Host " ❌ Liveness probe failed: $_" -ForegroundColor Red
}
# Test 2: Readiness Probe (/health/ready)
Write-Host "`n2. Testing readiness probe (/health/ready)..." -ForegroundColor Yellow
try {
$response = Invoke-WebRequest -Uri "$ServiceUrl/health/ready" `
-Method Get `
-TimeoutSec $TimeoutSeconds `
-UseBasicParsing `
-ErrorAction Stop
if ($response.StatusCode -eq 200) {
Write-Host " ✅ Readiness probe returns 200 OK" -ForegroundColor Green
# Parse health check response (JSON)
$healthData = $response.Content | ConvertFrom-Json
# Validate required dependencies (per the dependency table, blob storage and
# Key Vault are also blocking for readiness)
$requiredDependencies = @("database", "messagebus", "blobstorage", "keyvault")
foreach ($dep in $requiredDependencies) {
if ($healthData.checks -and ($healthData.checks | Where-Object { $_.name -eq $dep })) {
$depCheck = $healthData.checks | Where-Object { $_.name -eq $dep } | Select-Object -First 1
if ($depCheck.status -eq "Healthy") {
Write-Host " ✅ $dep dependency is healthy" -ForegroundColor Green
} else {
$errors += "❌ Required dependency '$dep' is not healthy: $($depCheck.status)"
Write-Host " ❌ $dep dependency unhealthy: $($depCheck.status)" -ForegroundColor Red
}
} else {
$errors += "❌ Required dependency '$dep' not found in health check response"
Write-Host " ❌ Dependency '$dep' not checked" -ForegroundColor Red
}
}
# Validate optional dependencies (degrade but do not block)
$optionalDependencies = @("redis")
foreach ($dep in $optionalDependencies) {
if ($healthData.checks -and ($healthData.checks | Where-Object { $_.name -eq $dep })) {
$depCheck = $healthData.checks | Where-Object { $_.name -eq $dep } | Select-Object -First 1
if ($depCheck.status -eq "Healthy") {
Write-Host " ✅ $dep dependency is healthy" -ForegroundColor Green
} else {
$warnings += "⚠️ Optional dependency '$dep' is not healthy: $($depCheck.status)"
Write-Host " ⚠️ $dep dependency unhealthy: $($depCheck.status)" -ForegroundColor Yellow
}
}
}
} else {
$errors += "❌ Readiness probe returned status code $($response.StatusCode) (expected 200)"
Write-Host " ❌ Unexpected status code: $($response.StatusCode)" -ForegroundColor Red
}
}
catch {
$errors += "❌ Readiness probe failed: $_"
Write-Host " ❌ Readiness probe failed: $_" -ForegroundColor Red
}
# Test 3: Startup Probe (/health/startup) - Optional
Write-Host "`n3. Testing startup probe (/health/startup)..." -ForegroundColor Yellow
try {
$response = Invoke-WebRequest -Uri "$ServiceUrl/health/startup" `
-Method Get `
-TimeoutSec $TimeoutSeconds `
-UseBasicParsing `
-ErrorAction Stop
if ($response.StatusCode -eq 200) {
Write-Host " ✅ Startup probe returns 200 OK" -ForegroundColor Green
} else {
$warnings += "⚠️ Startup probe returned status code $($response.StatusCode) (optional endpoint)"
Write-Host " ⚠️ Unexpected status code: $($response.StatusCode)" -ForegroundColor Yellow
}
}
catch {
$warnings += "⚠️ Startup probe not available (optional endpoint)"
Write-Host " ⚠️ Startup probe not available (optional)" -ForegroundColor Yellow
}
# Test 4: Aggregated Health (/health)
Write-Host "`n4. Testing aggregated health endpoint (/health)..." -ForegroundColor Yellow
try {
$response = Invoke-WebRequest -Uri "$ServiceUrl/health" `
-Method Get `
-TimeoutSec $TimeoutSeconds `
-UseBasicParsing `
-ErrorAction Stop
if ($response.StatusCode -eq 200) {
Write-Host " ✅ Aggregated health endpoint returns 200 OK" -ForegroundColor Green
} else {
$errors += "❌ Aggregated health endpoint returned status code $($response.StatusCode) (expected 200)"
Write-Host " ❌ Unexpected status code: $($response.StatusCode)" -ForegroundColor Red
}
}
catch {
$errors += "❌ Aggregated health endpoint failed: $_"
Write-Host " ❌ Aggregated health endpoint failed: $_" -ForegroundColor Red
}
# Test 5: Response Time (health checks should be fast)
Write-Host "`n5. Testing health check response times..." -ForegroundColor Yellow
$endpoints = @("/health/live", "/health/ready", "/health")
foreach ($endpoint in $endpoints) {
try {
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
$response = Invoke-WebRequest -Uri "$ServiceUrl$endpoint" `
-Method Get `
-TimeoutSec 5 `
-UseBasicParsing `
-ErrorAction Stop
$stopwatch.Stop()
$elapsedMs = $stopwatch.ElapsedMilliseconds
if ($elapsedMs -lt 1000) {
Write-Host " ✅ $endpoint: ${elapsedMs}ms (fast)" -ForegroundColor Green
} elseif ($elapsedMs -lt 5000) {
Write-Host " ⚠️ $endpoint: ${elapsedMs}ms (acceptable)" -ForegroundColor Yellow
$warnings += "⚠️ $endpoint response time is ${elapsedMs}ms (should be <1s)"
} else {
$errors += "❌ $endpoint response time is ${elapsedMs}ms (too slow, should be <1s)"
Write-Host " ❌ $endpoint: ${elapsedMs}ms (too slow)" -ForegroundColor Red
}
}
catch {
# Already reported in previous tests
}
}
# Summary
Write-Host "`n" -NoNewline
Write-Host ("=" * 80) -ForegroundColor Cyan
Write-Host "Health Check Validation Summary" -ForegroundColor Cyan
Write-Host ("=" * 80) -ForegroundColor Cyan
if ($errors.Count -gt 0) {
Write-Host "`n❌ ERRORS ($($errors.Count)):" -ForegroundColor Red
foreach ($err in $errors) {
Write-Host " $err" -ForegroundColor Red
}
Write-Host "`n❌ Health check validation FAILED. Fix errors before deployment." -ForegroundColor Red
exit 1
}
if ($warnings.Count -gt 0) {
Write-Host "`n⚠️ WARNINGS ($($warnings.Count)):" -ForegroundColor Yellow
foreach ($warning in $warnings) {
Write-Host " $warning" -ForegroundColor Yellow
}
}
Write-Host "`n✅ Health check validation PASSED" -ForegroundColor Green
exit 0
Azure Pipelines Integration:
# Observability Gate: Health Check Validation
- job: ValidateHealthChecks
displayName: 'Validate Health Check Endpoints'
dependsOn: Deploy_Dev # Deploy to dev environment first
condition: succeeded()
steps:
# Wait for deployment to be ready
- task: PowerShell@2
inputs:
targetType: 'inline'
script: |
$maxAttempts = 30
$attempt = 0
$serviceUrl = "$(AppServiceUrl)"
Write-Host "Waiting for service to be ready..."
while ($attempt -lt $maxAttempts) {
try {
$response = Invoke-WebRequest -Uri "$serviceUrl/health/ready" `
-Method Get `
-TimeoutSec 5 `
-UseBasicParsing `
-ErrorAction Stop
if ($response.StatusCode -eq 200) {
Write-Host "✅ Service is ready!"
exit 0
}
}
catch {
Write-Host "Attempt $($attempt + 1)/$maxAttempts: Service not ready yet..."
}
$attempt++
Start-Sleep -Seconds 10
}
Write-Error "Service did not become ready within $($maxAttempts * 10) seconds"
exit 1
displayName: 'Wait for Service Readiness'
timeoutInMinutes: 10
# Validate health check endpoints
- task: PowerShell@2
inputs:
targetType: 'filePath'
filePath: '$(Build.SourcesDirectory)/scripts/validate-health-checks.ps1'
arguments: '-ServiceUrl "$(AppServiceUrl)" -TimeoutSeconds 30'
displayName: 'Validate Health Check Endpoints'
continueOnError: false # BLOCKER: Fail if health checks invalid
C# Health Check Implementation (ASP.NET Core):
// Program.cs
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using System.Text.Json; // JsonSerializer in the custom response writers
var builder = WebApplication.CreateBuilder(args);
// Add health checks
builder.Services.AddHealthChecks()
// Database health check (required)
.AddSqlServer(
connectionString: builder.Configuration.GetConnectionString("DefaultConnection"),
healthQuery: "SELECT 1",
name: "database",
failureStatus: HealthStatus.Unhealthy,
tags: new[] { "db", "sql" },
timeout: TimeSpan.FromSeconds(5))
// Message Bus health check (required)
.AddRabbitMQ(
rabbitConnectionString: builder.Configuration.GetConnectionString("RabbitMQ"),
name: "messagebus",
failureStatus: HealthStatus.Unhealthy,
tags: new[] { "messaging", "rabbitmq" },
timeout: TimeSpan.FromSeconds(5))
// Redis cache health check (optional)
.AddRedis(
redisConnectionString: builder.Configuration.GetConnectionString("Redis"),
name: "redis",
failureStatus: HealthStatus.Degraded,
tags: new[] { "cache", "redis" },
timeout: TimeSpan.FromSeconds(3))
// Blob Storage health check (required for readiness)
.AddAzureBlobStorage(
connectionString: builder.Configuration.GetConnectionString("AzureStorage"),
containerName: "atp-audit-events",
name: "blobstorage",
failureStatus: HealthStatus.Unhealthy,
tags: new[] { "storage", "blob" },
timeout: TimeSpan.FromSeconds(5))
// Key Vault health check (required for readiness)
.AddAzureKeyVault(
keyVaultClientFactory: sp =>
{
var keyVaultUrl = builder.Configuration["KeyVault:VaultUri"];
return new Azure.Security.KeyVault.Secrets.SecretClient(
new Uri(keyVaultUrl),
sp.GetRequiredService<Azure.Core.TokenCredential>());
},
name: "keyvault",
failureStatus: HealthStatus.Unhealthy,
tags: new[] { "secrets", "keyvault" },
timeout: TimeSpan.FromSeconds(5));
var app = builder.Build();
// Liveness probe (Kubernetes) — no registered check carries the "live" tag,
// so this endpoint runs zero dependency checks and only signals the process is up
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
Predicate = registration => registration.Tags.Contains("live"),
ResultStatusCodes =
{
[HealthStatus.Healthy] = StatusCodes.Status200OK
},
ResponseWriter = async (context, report) =>
{
context.Response.ContentType = "application/json";
await context.Response.WriteAsync(JsonSerializer.Serialize(new
{
status = report.Status.ToString(),
timestamp = DateTime.UtcNow
}));
}
}).WithTags("live");
// Readiness probe (Kubernetes)
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
Predicate = registration => registration.Tags.Count == 0 || !registration.Tags.Contains("live"),
ResultStatusCodes =
{
[HealthStatus.Healthy] = StatusCodes.Status200OK,
[HealthStatus.Degraded] = StatusCodes.Status200OK,
[HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
},
ResponseWriter = async (context, report) =>
{
context.Response.ContentType = "application/json";
var result = new
{
status = report.Status.ToString(),
timestamp = DateTime.UtcNow,
checks = report.Entries.Select(entry => new
{
name = entry.Key,
status = entry.Value.Status.ToString(),
description = entry.Value.Description,
duration = entry.Value.Duration.TotalMilliseconds,
data = entry.Value.Data
})
};
await context.Response.WriteAsync(JsonSerializer.Serialize(result));
}
});
// Startup probe (Kubernetes, optional)
app.MapHealthChecks("/health/startup", new HealthCheckOptions
{
Predicate = _ => false, // No checks for startup probe
ResultStatusCodes =
{
[HealthStatus.Healthy] = StatusCodes.Status200OK,
[HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
},
ResponseWriter = async (context, report) =>
{
context.Response.ContentType = "application/json";
await context.Response.WriteAsync(JsonSerializer.Serialize(new
{
status = report.Status.ToString(),
timestamp = DateTime.UtcNow
}));
}
}).WithTags("startup");
// Aggregated health endpoint (Azure App Service)
app.MapHealthChecks("/health", new HealthCheckOptions
{
ResultStatusCodes =
{
[HealthStatus.Healthy] = StatusCodes.Status200OK,
[HealthStatus.Degraded] = StatusCodes.Status200OK,
[HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
},
ResponseWriter = async (context, report) =>
{
context.Response.ContentType = "application/json";
var result = new
{
status = report.Status.ToString(),
timestamp = DateTime.UtcNow,
checks = report.Entries.Select(entry => new
{
name = entry.Key,
status = entry.Value.Status.ToString(),
description = entry.Value.Description,
duration = entry.Value.Duration.TotalMilliseconds
})
};
await context.Response.WriteAsync(JsonSerializer.Serialize(result));
}
});
app.Run();
Summary¶
- Observability Gates: 5-10 minute execution; block production if instrumentation or health checks missing
- OpenTelemetry Validation: PowerShell script checks 6 requirements (HTTP instrumentation, DB instrumentation, custom metrics, trace propagation, ActivitySource naming, HTTP client instrumentation)
- OpenTelemetry Setup: Complete C# Program.cs example with tracing, metrics, logging, custom ActivitySource, custom Meter
- Custom ActivitySource Usage: Controller example showing activity creation, tagging, exception recording, custom metrics
- Health Check Requirements: 4 endpoints (liveness, readiness, startup, aggregated) with status codes and blocker status
- Health Check Dependencies: 6 dependency types (database, message bus, cache, blob storage, Key Vault, external APIs) with timeout and failure impact
- Health Check Validation Script: PowerShell script testing all endpoints, dependency checks, response times (<1s requirement)
- Health Check Implementation: Complete ASP.NET Core Program.cs with SQL Server, RabbitMQ, Redis, Blob Storage, Key Vault health checks
- Azure Pipelines Integration: YAML for OpenTelemetry validation and health check validation (with service readiness wait)
Contract & API Gates (Deep Dive)¶
Contract and API gates validate that ATP services maintain backward compatibility in API contracts (REST, WebSocket) and message schemas (events, commands, queries). These gates execute in the CI stage and block production deployment if breaking changes are detected without proper versioning.
Philosophy: Backward compatibility is a promise—API consumers and event subscribers must not be broken by service updates. Breaking changes require explicit API versioning (e.g., /v2/audit-records) and deprecation notices (minimum 6 months before removal).
Contract Gate Workflow¶
graph TD
A[Observability Gates Passed] --> B[Extract OpenAPI Spec]
B --> C[Compare with Baseline]
C --> D{Breaking Changes?}
D -->|Yes| E[Breaking Change Detected ❌]
D -->|No| F{Version Incremented?}
F -->|No| G[Non-Breaking Change ✅]
F -->|Yes| H[Validate Version Format]
H --> I{Version Valid?}
I -->|No| J[Invalid Version Format ❌]
I -->|Yes| K[Breaking Change Allowed with Version ✅]
E --> L[Block Production Deployment]
J --> L
G --> M[Validate Event Schemas]
K --> M
M --> N{Schema Compatibility?}
N -->|No| O[Incompatible Schema ❌]
N -->|Yes| P{Schema Version Incremented?}
P -->|Yes| Q{Schema Version Valid?}
P -->|No| R[Compatible Schema ✅]
Q -->|No| S[Invalid Schema Version ❌]
Q -->|Yes| R
O --> L
S --> L
R --> T[Contract Gates Passed ✅]
T --> U[Ready for Production]
style E fill:#ff6b6b
style J fill:#ff6b6b
style O fill:#ff6b6b
style S fill:#ff6b6b
style G fill:#90EE90
style R fill:#90EE90
style T fill:#90EE90
Typical Contract Gate Duration: 2-5 minutes (OpenAPI diff + schema validation)
OpenAPI Breaking Change Detection¶
Purpose: Ensure REST API contracts maintain backward compatibility or explicitly version breaking changes (e.g., /v1/audit-events → /v2/audit-events).
Baseline Strategy:
| Baseline Source | Usage | Update Trigger |
|---|---|---|
| Last Release | Production baseline from last tagged release | On each release (Git tag) |
| Main Branch | Latest merged PR baseline | Continuous validation against main |
| Explicit Baseline | Manually pinned OpenAPI spec | On major architectural changes |
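In Azure Pipelines, the "Last Release" baseline can be materialized by extracting the committed spec as it existed at the most recent release tag. The sketch below assumes the spec is committed at `docs/openapi/v1.json` (a hypothetical path) and that release tags follow a `v*` convention:

```yaml
# Materialize the last-release OpenAPI baseline for comparison
- task: Bash@3
  displayName: 'Extract OpenAPI baseline from last release tag'
  inputs:
    targetType: 'inline'
    script: |
      set -euo pipefail
      # Requires tags in the local clone (e.g., fetchDepth: 0 / fetchTags: true on checkout)
      LAST_TAG=$(git describe --tags --abbrev=0 --match 'v*')
      echo "Baseline release: $LAST_TAG"
      # Write the spec exactly as it existed at that tag
      git show "$LAST_TAG:docs/openapi/v1.json" \
        > "$(Build.ArtifactStagingDirectory)/baseline-openapi.json"
```

Pinning the baseline to a tag (rather than to `main`) ensures the comparison reflects what consumers actually run in production.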
Breaking Change Detection Rules:
| Change Type | Breaking? | Action | Example |
|---|---|---|---|
| Removed Endpoint | ✅ Yes | ❌ Block; require /v2/ endpoint | DELETE /api/audit-events/{id} removed |
| Removed Parameter | ✅ Yes | ❌ Block; make parameter optional or version | GET /api/audit-events?tenantId= removed |
| Changed Parameter Type | ✅ Yes | ❌ Block; require version | pageSize: string → pageSize: number |
| Changed Required Status | ✅ Yes | ❌ Block; require version | tenantId optional → required |
| Removed Response Property | ✅ Yes | ❌ Block; require version | Response {id, name} → {id} |
| Changed Status Code | ✅ Yes | ❌ Block; require version | 200 OK → 201 Created |
| Added Required Parameter | ✅ Yes | ❌ Block; require version | New required query parameter |
| Added Endpoint | ❌ No | ✅ Allow | New POST /api/audit-events |
| Added Optional Parameter | ❌ No | ✅ Allow | New optional query parameter |
| Added Response Property | ❌ No | ✅ Allow | Response {id} → {id, name} |
| Removed Optional Parameter | ❌ No | ✅ Allow | Optional parameter removed |
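Rules like these can also be enforced with an off-the-shelf differ rather than a custom script. The sketch below uses the open-source oasdiff CLI — the tool choice, flags, and spec file paths are assumptions for illustration, not part of the ATP pipeline described here:

```yaml
# Alternative: detect breaking changes with oasdiff (github.com/tufin/oasdiff)
- task: Bash@3
  displayName: 'OpenAPI breaking-change check (oasdiff)'
  inputs:
    targetType: 'inline'
    script: |
      set -euo pipefail
      go install github.com/tufin/oasdiff@latest   # assumes Go is on the agent
      # Exits non-zero on error-level breaking changes, failing the pipeline step
      ~/go/bin/oasdiff breaking \
        "$(Build.ArtifactStagingDirectory)/baseline-openapi.json" \
        "$(Build.ArtifactStagingDirectory)/current-openapi.json" \
        --fail-on ERR
```

A dedicated differ encodes the full breaking-change taxonomy (removed endpoints, narrowed types, new required parameters) so the custom script only needs to cover ATP-specific policy, such as the /v2/ versioning requirement.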
OpenAPI Spec Extraction (Swashbuckle/NSwag):
// Program.cs - Configure OpenAPI generation
using System.Reflection; // Assembly.GetExecutingAssembly() for XML comments path
using Microsoft.OpenApi.Models;
using Swashbuckle.AspNetCore.SwaggerGen;
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen(options =>
{
options.SwaggerDoc("v1", new OpenApiInfo
{
Title = "ATP Ingestion API",
Version = "v1",
Description = "API for ingesting audit events into the ConnectSoft Audit Trail Platform",
Contact = new OpenApiContact
{
Name = "ATP Platform Team",
Email = "platform@connectsoft.example"
}
});
// Enable schema validation
options.CustomSchemaIds(type => type.FullName);
// Include XML comments
var xmlFile = $"{Assembly.GetExecutingAssembly().GetName().Name}.xml";
var xmlPath = Path.Combine(AppContext.BaseDirectory, xmlFile);
if (File.Exists(xmlPath))
{
options.IncludeXmlComments(xmlPath);
}
// Generate deterministic OpenAPI spec (no random values)
options.SchemaFilter<DeterministicSchemaFilter>();
// Enforce API versioning
options.DocInclusionPredicate((docName, apiDesc) =>
{
// Guard against API descriptions with no relative path
return apiDesc.RelativePath != null
&& apiDesc.RelativePath.StartsWith($"api/{docName}/", StringComparison.OrdinalIgnoreCase);
});
});
var app = builder.Build();
// Swagger UI for development
if (app.Environment.IsDevelopment())
{
app.UseSwagger();
app.UseSwaggerUI(options =>
{
options.SwaggerEndpoint("/swagger/v1/swagger.json", "ATP Ingestion API v1");
});
}
app.Run();
OpenAPI Breaking Change Detection Script (PowerShell):
# scripts/validate-openapi-breaking-changes.ps1
param(
[Parameter(Mandatory=$true)]
[string]$BaselineSpecPath,
[Parameter(Mandatory=$true)]
[string]$CurrentSpecPath,
[Parameter(Mandatory=$false)]
[switch]$FailOnBreakingChanges = $true,
[Parameter(Mandatory=$false)]
[string]$ApiVersion = "v1"
)
Write-Host "📋 Validating OpenAPI contract compatibility..." -ForegroundColor Cyan
$errors = @()
$warnings = @()
# Load OpenAPI specs
try {
$baseline = Get-Content $BaselineSpecPath -Raw | ConvertFrom-Json
$current = Get-Content $CurrentSpecPath -Raw | ConvertFrom-Json
}
catch {
$errors += "❌ Failed to parse OpenAPI specs: $_"
Write-Host " ❌ Error: $_" -ForegroundColor Red
if ($FailOnBreakingChanges) { exit 1 } else { exit 0 }
}
# Check 1: Removed endpoints
Write-Host "`n1. Checking for removed endpoints..." -ForegroundColor Yellow
$baselinePaths = $baseline.paths.PSObject.Properties.Name
$currentPaths = $current.paths.PSObject.Properties.Name
foreach ($path in $baselinePaths) {
if ($path -notin $currentPaths) {
$errors += "❌ Endpoint removed: $path (breaking change)"
Write-Host " ❌ Removed: $path" -ForegroundColor Red
}
else {
# Check removed HTTP methods (skip the path-level "parameters" key)
$baselineMethods = $baseline.paths.$path.PSObject.Properties.Name | Where-Object { $_ -ne "parameters" }
$currentMethods = $current.paths.$path.PSObject.Properties.Name
foreach ($method in $baselineMethods) {
if ($method -notin $currentMethods) {
$errors += "❌ HTTP method removed: $method $path (breaking change)"
Write-Host " ❌ Removed method: $method $path" -ForegroundColor Red
}
}
}
}
# Check 2: Changed parameters (removed, type changed, required status changed)
Write-Host "`n2. Checking for parameter changes..." -ForegroundColor Yellow
foreach ($path in $baselinePaths) {
if ($path -notin $currentPaths) { continue }
$baselineMethods = $baseline.paths.$path.PSObject.Properties.Name
foreach ($method in $baselineMethods) {
if ($method -eq "parameters") { continue } # Skip path-level parameters for now
if ($baseline.paths.$path.$method.parameters) {
$baselineParams = $baseline.paths.$path.$method.parameters
if ($current.paths.$path.$method.parameters) {
$currentParams = $current.paths.$path.$method.parameters
foreach ($baselineParam in $baselineParams) {
$paramName = $baselineParam.name
$currentParam = $currentParams | Where-Object { $_.name -eq $paramName }
if (-not $currentParam) {
# Parameter removed
if ($baselineParam.required -eq $true) {
$errors += "❌ Required parameter removed: $paramName in $method $path (breaking change)"
Write-Host " ❌ Removed required param: $paramName in $method $path" -ForegroundColor Red
} else {
$warnings += "⚠️ Optional parameter removed: $paramName in $method $path"
Write-Host " ⚠️ Removed optional param: $paramName in $method $path" -ForegroundColor Yellow
}
} else {
# Check parameter type change
if ($baselineParam.schema.type -ne $currentParam.schema.type) {
$errors += "❌ Parameter type changed: $paramName in $method $path ($($baselineParam.schema.type) → $($currentParam.schema.type))"
Write-Host " ❌ Type changed: $paramName ($($baselineParam.schema.type) → $($currentParam.schema.type))" -ForegroundColor Red
}
# Check required status change (optional → required)
if ($baselineParam.required -eq $false -and $currentParam.required -eq $true) {
$errors += "❌ Parameter became required: $paramName in $method $path (breaking change)"
Write-Host " ❌ Param became required: $paramName" -ForegroundColor Red
}
}
}
} else {
# All parameters removed
foreach ($baselineParam in $baselineParams) {
if ($baselineParam.required -eq $true) {
$errors += "❌ Required parameter removed: $($baselineParam.name) in $method $path (breaking change)"
}
}
}
}
}
}
# Check 3: Removed response properties
Write-Host "`n3. Checking for removed response properties..." -ForegroundColor Yellow
foreach ($path in $baselinePaths) {
if ($path -notin $currentPaths) { continue }
$baselineMethods = $baseline.paths.$path.PSObject.Properties.Name | Where-Object { $_ -ne "parameters" }
foreach ($method in $baselineMethods) {
$baselineResponses = $baseline.paths.$path.$method.responses
$currentResponses = $current.paths.$path.$method.responses
if ($baselineResponses -and $currentResponses) {
foreach ($statusCode in $baselineResponses.PSObject.Properties.Name) {
if ($statusCode -in $currentResponses.PSObject.Properties.Name) {
# Compare response schemas
$baselineSchema = $baselineResponses.$statusCode.content.'application/json'.schema
$currentSchema = $currentResponses.$statusCode.content.'application/json'.schema
if ($baselineSchema.properties -and $currentSchema.properties) {
$baselineProps = $baselineSchema.properties.PSObject.Properties.Name
$currentProps = $currentSchema.properties.PSObject.Properties.Name
foreach ($prop in $baselineProps) {
if ($prop -notin $currentProps) {
$errors += "❌ Response property removed: $prop in $method $path $statusCode (breaking change)"
Write-Host " ❌ Removed property: $prop" -ForegroundColor Red
}
}
}
} else {
# Status code removed (if it was a success code, this might be breaking)
if ([int]$statusCode -ge 200 -and [int]$statusCode -lt 300) {
$warnings += "⚠️ Success status code removed: $statusCode in $method $path"
Write-Host " ⚠️ Status code removed: $statusCode" -ForegroundColor Yellow
}
}
}
}
}
}
# Check 4: API versioning validation (if breaking changes exist and version not incremented)
Write-Host "`n4. Validating API versioning..." -ForegroundColor Yellow
if ($errors.Count -gt 0) {
# Check if path includes version (e.g., /v2/audit-events)
$versionPattern = "/(v\d+)/"
$hasVersionedPath = $false
foreach ($path in $currentPaths) {
if ($path -match $versionPattern) {
$matchedVersion = $matches[1]
if ($matchedVersion -ne $ApiVersion) {
$hasVersionedPath = $true
Write-Host " ✅ Breaking changes are in versioned endpoint: $path" -ForegroundColor Green
break
}
}
}
if (-not $hasVersionedPath) {
$errors += "❌ Breaking changes detected but API version not incremented. Use /v2/ endpoint for breaking changes."
Write-Host " ❌ Breaking changes require API versioning (e.g., /v2/audit-events)" -ForegroundColor Red
}
}
# Summary
Write-Host "`n" -NoNewline
Write-Host ("=" * 80) -ForegroundColor Cyan
Write-Host "OpenAPI Contract Validation Summary" -ForegroundColor Cyan
Write-Host ("=" * 80) -ForegroundColor Cyan
if ($errors.Count -gt 0) {
Write-Host "`n❌ BREAKING CHANGES ($($errors.Count)):" -ForegroundColor Red
foreach ($err in $errors) {
Write-Host "  $err" -ForegroundColor Red
}
if ($FailOnBreakingChanges) {
Write-Host "`n❌ OpenAPI contract validation FAILED. Fix breaking changes or increment API version." -ForegroundColor Red
exit 1
}
}
if ($warnings.Count -gt 0) {
Write-Host "`n⚠️ WARNINGS ($($warnings.Count)):" -ForegroundColor Yellow
foreach ($warning in $warnings) {
Write-Host " $warning" -ForegroundColor Yellow
}
}
if ($errors.Count -eq 0) {
Write-Host "`n✅ OpenAPI contract validation PASSED (no breaking changes)" -ForegroundColor Green
}
exit 0
Azure Pipelines Integration:
# Contract Gate: OpenAPI Breaking Change Detection
- stage: Contract_Gates
  displayName: 'API Contract Validation'
  dependsOn: Build_Test_Publish
  condition: succeeded()
  jobs:
  - job: ValidateOpenApiContract
    displayName: 'Validate OpenAPI Contract Compatibility'
    pool:
      vmImage: 'windows-latest'
    steps:
    # Start the API in the background so the spec endpoint can be scraped
    # (a blocking `dotnet run` step would never let the pipeline proceed)
    - task: PowerShell@2
      inputs:
        targetType: 'inline'
        script: |
          $project = Get-ChildItem -Filter 'ConnectSoft.ATP.*.csproj' -Recurse | Select-Object -First 1
          Start-Process dotnet -ArgumentList "run --project `"$($project.FullName)`" --urls http://localhost:5000"
      displayName: 'Generate OpenAPI Spec'
    # Download baseline OpenAPI spec (from last release)
    - task: PowerShell@2
      inputs:
        targetType: 'inline'
        script: |
          # Get latest release tag; fall back to the main branch baseline
          $latestTag = git describe --tags --abbrev=0 --match "v*.*.*" 2>$null
          if ($latestTag) {
            Write-Host "Using baseline from release: $latestTag"
            git checkout $latestTag -- swagger.json 2>$null
          } else {
            Write-Host "No release tags found; using main branch baseline"
            git checkout origin/main -- swagger.json 2>$null
          }
          if (Test-Path "swagger.json") {
            New-Item -ItemType Directory -Force -Path "$(Pipeline.Workspace)/baseline" | Out-Null
            Move-Item swagger.json "$(Pipeline.Workspace)/baseline/openapi.json"
          }
      displayName: 'Download Baseline OpenAPI Spec'
    # Extract current OpenAPI spec
    - task: PowerShell@2
      inputs:
        targetType: 'inline'
        script: |
          # Wait for the Swagger endpoint to become available
          $maxAttempts = 30
          $attempt = 0
          while ($attempt -lt $maxAttempts) {
            try {
              $response = Invoke-WebRequest -Uri "http://localhost:5000/swagger/v1/swagger.json" -UseBasicParsing -TimeoutSec 5
              $response.Content | Out-File "$(Build.SourcesDirectory)/swagger.json" -Encoding UTF8
              Write-Host "✅ OpenAPI spec extracted"
              exit 0
            }
            catch {
              Write-Host "Attempt $($attempt + 1)/$maxAttempts: Waiting for Swagger..."
            }
            $attempt++
            Start-Sleep -Seconds 2
          }
          Write-Error "Failed to extract OpenAPI spec"
          exit 1
      displayName: 'Extract Current OpenAPI Spec'
    # Validate breaking changes
    - task: PowerShell@2
      inputs:
        targetType: 'filePath'
        filePath: '$(Build.SourcesDirectory)/scripts/validate-openapi-breaking-changes.ps1'
        arguments: >
          -BaselineSpecPath "$(Pipeline.Workspace)/baseline/openapi.json"
          -CurrentSpecPath "$(Build.SourcesDirectory)/swagger.json"
          -FailOnBreakingChanges
          -ApiVersion "v1"
      displayName: 'Validate OpenAPI Breaking Changes'
      continueOnError: false  # BLOCKER: Fail on breaking changes without versioning
    # Publish OpenAPI specs as artifacts
    - task: PublishBuildArtifacts@1
      inputs:
        PathtoPublish: '$(Build.SourcesDirectory)/swagger.json'
        ArtifactName: 'openapi-spec-$(Build.BuildNumber)'
      displayName: 'Publish OpenAPI Spec'
      condition: always()
Message Schema Compatibility¶
Purpose: Ensure event/command/query schemas (JSON Schema, Avro, Protobuf) maintain backward compatibility for event-driven architectures.
Schema Compatibility Rules:
| Change Type | Compatible? | Action | Example |
|---|---|---|---|
| Added Optional Field | ✅ Yes | ✅ Allow | {id, name} → {id, name, email?} |
| Removed Required Field | ❌ No | ❌ Block | {id, name} → {id} (name was required) |
| Removed Optional Field | ⚠️ Deprecated | ⚠️ Warning | {id, name?, email?} → {id, email?} |
| Changed Field Type | ❌ No | ❌ Block | age: string → age: number |
| Removed Enum Value | ❌ No | ❌ Block | status: ["active", "inactive"] → status: ["active"] |
| Added Enum Value | ✅ Yes | ✅ Allow | status: ["active"] → status: ["active", "pending"] |
| Changed Field Required | ❌ No | ❌ Block | email?: string → email: string |
| Schema Version Not Incremented | ❌ No | ❌ Block | Breaking change without version bump |
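The blocking rules in this table can be expressed as a compact diff over two JSON Schema documents. The following is an illustrative Python sketch (the pipeline's actual enforcement is the PowerShell script shown below); it covers the three hard blocks that do not depend on versioning: removed required fields, type changes, and enum value removal, plus the optional-to-required transition.

```python
def schema_breaking_changes(baseline: dict, current: dict) -> list:
    """Return breaking changes between two JSON Schema objects per the rules above."""
    issues = []
    base_props = baseline.get("properties", {})
    cur_props = current.get("properties", {})
    base_req = set(baseline.get("required", []))
    cur_req = set(current.get("required", []))
    # Rule: removed required field is breaking
    for name in sorted(base_req):
        if name not in cur_props:
            issues.append(f"required field removed: {name}")
    # Rule: optional field that became required is breaking
    for name in sorted(cur_req - base_req):
        if name in base_props:
            issues.append(f"field became required: {name}")
    for name, spec in base_props.items():
        cur = cur_props.get(name)
        if cur is None:
            continue  # removed optional fields are warnings, not blocks
        # Rule: changed field type is breaking
        if spec.get("type") != cur.get("type"):
            issues.append(f"field type changed: {name} ({spec.get('type')} -> {cur.get('type')})")
        # Rule: removed enum value is breaking (added values are fine)
        if "enum" in spec and "enum" in cur:
            for value in spec["enum"]:
                if value not in cur["enum"]:
                    issues.append(f"enum value removed: {name} = {value}")
    return issues
```

Additive changes (new optional fields, new enum values) never appear in the returned list, which is exactly what makes them safe to ship without a version bump.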
Event Schema Example (JSON Schema):
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "https://connectsoft.example/schemas/audit-event/v1.0.0",
"title": "AuditEvent",
"description": "Audit event schema for ATP Ingestion service",
"type": "object",
"required": ["eventId", "tenantId", "action", "timestamp", "version"],
"properties": {
"eventId": {
"type": "string",
"format": "uuid",
"description": "Unique event identifier"
},
"tenantId": {
"type": "string",
"format": "uuid",
"description": "Tenant identifier"
},
"action": {
"type": "string",
"enum": ["UserLogin", "UserLogout", "DataAccess", "DataModification", "DataDeletion"],
"description": "Action performed"
},
"timestamp": {
"type": "string",
"format": "date-time",
"description": "Event timestamp (ISO 8601)"
},
"version": {
"type": "string",
"pattern": "^\\d+\\.\\d+\\.\\d+$",
"description": "Schema version (semantic versioning)"
},
"userId": {
"type": "string",
"format": "uuid",
"description": "User identifier (optional)"
},
"metadata": {
"type": "object",
"additionalProperties": true,
"description": "Additional event metadata (optional)"
}
},
"additionalProperties": false
}
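To make the schema's constraints concrete, here is a hand-rolled Python check of the two constraints most relevant to compatibility: required fields and enum membership (plus the semantic-version pattern on `version`). This is only a sketch; a real pipeline would run a full JSON Schema validator, which would also cover the `uuid` and `date-time` formats.

```python
import re

# Constants mirror the AuditEvent schema above
REQUIRED = ["eventId", "tenantId", "action", "timestamp", "version"]
ACTIONS = {"UserLogin", "UserLogout", "DataAccess", "DataModification", "DataDeletion"}
VERSION_PATTERN = r"^\d+\.\d+\.\d+$"

def validate_audit_event(event: dict) -> list:
    """Return a list of violations of the AuditEvent schema's core constraints."""
    problems = [f"missing required field: {f}" for f in REQUIRED if f not in event]
    if "action" in event and event["action"] not in ACTIONS:
        problems.append(f"action not in enum: {event['action']}")
    if "version" in event and not re.match(VERSION_PATTERN, str(event["version"])):
        problems.append(f"version not semantic: {event['version']}")
    return problems
```

An empty result means the event satisfies the checked constraints; anything else would be rejected at ingestion.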
Message Schema Compatibility Validation Script (PowerShell):
# scripts/validate-schema-compatibility.ps1
param(
[Parameter(Mandatory=$true)]
[string]$BaselineDir,
[Parameter(Mandatory=$true)]
[string]$CurrentDir,
[Parameter(Mandatory=$false)]
[switch]$FailOnBreakingChanges = $true
)
Write-Host "📋 Validating event schema compatibility..." -ForegroundColor Cyan
$errors = @()
$warnings = @()
# Find all JSON Schema files
$baselineSchemas = Get-ChildItem -Path $BaselineDir -Filter "*.schema.json" -Recurse
$currentSchemas = Get-ChildItem -Path $CurrentDir -Filter "*.schema.json" -Recurse
foreach ($baselineSchemaFile in $baselineSchemas) {
$schemaName = $baselineSchemaFile.BaseName -replace '\.schema$', ''
Write-Host "`n🔍 Validating schema: $schemaName" -ForegroundColor Yellow
try {
$baselineSchema = Get-Content $baselineSchemaFile.FullName -Raw | ConvertFrom-Json
# Find corresponding current schema
$currentSchemaFile = $currentSchemas | Where-Object { $_.BaseName -eq $baselineSchemaFile.BaseName }
if (-not $currentSchemaFile) {
$errors += "❌ Schema removed: $schemaName (breaking change)"
Write-Host " ❌ Schema removed: $schemaName" -ForegroundColor Red
continue
}
$currentSchema = Get-Content $currentSchemaFile.FullName -Raw | ConvertFrom-Json
# Check schema version increment (if breaking changes exist)
$baselineVersion = if ($baselineSchema.'$id' -match 'v(\d+\.\d+\.\d+)') { $matches[1] } else { $null }
$currentVersion = if ($currentSchema.'$id' -match 'v(\d+\.\d+\.\d+)') { $matches[1] } else { $null }
# Check 1: Required fields
$baselineRequired = if ($baselineSchema.required) { $baselineSchema.required } else { @() }
$currentRequired = if ($currentSchema.required) { $currentSchema.required } else { @() }
$removedRequired = $baselineRequired | Where-Object { $_ -notin $currentRequired }
foreach ($field in $removedRequired) {
$errors += "❌ Required field removed: $field in $schemaName (breaking change)"
Write-Host " ❌ Required field removed: $field" -ForegroundColor Red
}
# Check 2: Field type changes
$baselineProps = $baselineSchema.properties.PSObject.Properties
$currentProps = $currentSchema.properties.PSObject.Properties
foreach ($baselineProp in $baselineProps) {
$fieldName = $baselineProp.Name
$currentProp = $currentProps | Where-Object { $_.Name -eq $fieldName }
if (-not $currentProp) {
# Field removed (already checked if required)
if ($fieldName -notin $baselineRequired) {
$warnings += "⚠️ Optional field removed: $fieldName in $schemaName (deprecated)"
Write-Host " ⚠️ Optional field removed: $fieldName" -ForegroundColor Yellow
}
} else {
# Check type change
$baselineType = $baselineProp.Value.type
$currentType = $currentProp.Value.type
if ($baselineType -ne $currentType) {
$errors += "❌ Field type changed: $fieldName in $schemaName ($baselineType → $currentType, breaking change)"
Write-Host " ❌ Type changed: $fieldName ($baselineType → $currentType)" -ForegroundColor Red
}
# Check enum value removal
if ($baselineProp.Value.enum -and $currentProp.Value.enum) {
$baselineEnum = $baselineProp.Value.enum
$currentEnum = $currentProp.Value.enum
$removedEnumValues = $baselineEnum | Where-Object { $_ -notin $currentEnum }
foreach ($enumValue in $removedEnumValues) {
$errors += "❌ Enum value removed: $fieldName = $enumValue in $schemaName (breaking change)"
Write-Host " ❌ Enum value removed: $fieldName = $enumValue" -ForegroundColor Red
}
}
}
}
# Check 3: Required status change (optional → required)
foreach ($baselineProp in $baselineProps) {
$fieldName = $baselineProp.Name
$currentProp = $currentProps | Where-Object { $_.Name -eq $fieldName }
if ($currentProp) {
$wasOptional = $fieldName -notin $baselineRequired
$isRequired = $fieldName -in $currentRequired
if ($wasOptional -and $isRequired) {
$errors += "❌ Field became required: $fieldName in $schemaName (breaking change)"
Write-Host " ❌ Field became required: $fieldName" -ForegroundColor Red
}
}
}
# Check 4: Schema version increment (if breaking changes)
if ($errors.Count -gt 0 -and $baselineVersion -and $currentVersion) {
$baselineMajor = [int]($baselineVersion -split '\.')[0]
$currentMajor = [int]($currentVersion -split '\.')[0]
if ($currentMajor -le $baselineMajor) {
$errors += "❌ Schema version not incremented: $schemaName ($baselineVersion → $currentVersion, breaking changes require major version bump)"
Write-Host " ❌ Version not incremented: $baselineVersion → $currentVersion" -ForegroundColor Red
} else {
Write-Host " ✅ Version incremented: $baselineVersion → $currentVersion" -ForegroundColor Green
}
}
}
catch {
$errors += "❌ Failed to validate schema ${schemaName}: $_"
Write-Host " ❌ Error: $_" -ForegroundColor Red
}
}
# Summary
Write-Host "`n" -NoNewline
Write-Host ("=" * 80) -ForegroundColor Cyan
Write-Host "Message Schema Compatibility Validation Summary" -ForegroundColor Cyan
Write-Host ("=" * 80) -ForegroundColor Cyan
if ($errors.Count -gt 0) {
Write-Host "`n❌ BREAKING CHANGES ($($errors.Count)):" -ForegroundColor Red
foreach ($err in $errors) {
Write-Host "  $err" -ForegroundColor Red
}
if ($FailOnBreakingChanges) {
Write-Host "`n❌ Schema compatibility validation FAILED. Fix breaking changes or increment schema version." -ForegroundColor Red
exit 1
}
}
if ($warnings.Count -gt 0) {
Write-Host "`n⚠️ WARNINGS ($($warnings.Count)):" -ForegroundColor Yellow
foreach ($warning in $warnings) {
Write-Host " $warning" -ForegroundColor Yellow
}
}
if ($errors.Count -eq 0) {
Write-Host "`n✅ Schema compatibility validation PASSED (backward compatible)" -ForegroundColor Green
}
exit 0
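The major-version rule that "Check 4" above enforces can be isolated into a tiny predicate. This Python sketch assumes the `vMAJOR.MINOR.PATCH` convention in the schema's `$id` URI shown earlier; it returns true when nothing needs to be enforced (no breaking changes, or no parsable versions) and otherwise requires a strictly larger major version.

```python
import re

def version_bump_ok(baseline_id: str, current_id: str, has_breaking_changes: bool) -> bool:
    """Breaking schema changes require a major version bump in the $id URI."""
    def major(schema_id: str):
        m = re.search(r"v(\d+)\.\d+\.\d+", schema_id)
        return int(m.group(1)) if m else None
    base, cur = major(baseline_id), major(current_id)
    if not has_breaking_changes or base is None or cur is None:
        return True  # nothing to enforce
    return cur > base
```

A minor or patch bump alongside a breaking change fails the predicate, which is what the gate turns into a blocking error.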
Azure Pipelines Integration:
# Contract Gate: Message Schema Compatibility
- job: ValidateSchemaCompatibility
  displayName: 'Validate Event Schema Compatibility'
  dependsOn: ValidateOpenApiContract
  condition: succeeded()
  steps:
  # Download baseline schemas (from last release)
  - task: PowerShell@2
    inputs:
      targetType: 'inline'
      script: |
        $latestTag = git describe --tags --abbrev=0 --match "v*.*.*" 2>$null
        if ($latestTag) {
          git checkout $latestTag -- schemas/ 2>$null
          if (Test-Path "schemas") {
            New-Item -ItemType Directory -Force -Path "$(Pipeline.Workspace)/baseline/schemas" | Out-Null
            Copy-Item schemas/* "$(Pipeline.Workspace)/baseline/schemas/" -Recurse
          }
        }
    displayName: 'Download Baseline Schemas'
  # Validate schema compatibility
  - task: PowerShell@2
    inputs:
      targetType: 'filePath'
      filePath: '$(Build.SourcesDirectory)/scripts/validate-schema-compatibility.ps1'
      arguments: >
        -BaselineDir "$(Pipeline.Workspace)/baseline/schemas"
        -CurrentDir "$(Build.SourcesDirectory)/schemas"
        -FailOnBreakingChanges
    displayName: 'Validate Event Schema Compatibility'
    continueOnError: false  # BLOCKER: Fail on breaking schema changes
Summary¶
- Contract & API Gates: 2-5 minute execution; block production if breaking changes detected without versioning
- OpenAPI Breaking Change Detection: PowerShell script validates seven breaking change types (removed endpoints/parameters, type changes, required status changes, removed response properties, status code changes, newly required parameters)
- OpenAPI Baseline Strategy: 3 strategies (last release, main branch, explicit baseline) with update triggers
- OpenAPI Spec Extraction: C# Program.cs example with Swashbuckle configuration, deterministic schema generation, API versioning enforcement
- Message Schema Compatibility: JSON Schema validation with 8 compatibility rules (additive changes allowed, removals blocked, enum value removal blocked, version increment required)
- Event Schema Example: Complete JSON Schema with required/optional fields, enum constraints, version field, metadata support
- Schema Compatibility Validation Script: PowerShell script validating required field removal, type changes, enum value removal, required status changes, version increment enforcement
- Azure Pipelines Integration: YAML for OpenAPI spec extraction, baseline download, breaking change detection, schema compatibility validation
Approval Gates (Manual Governance)¶
Approval gates enforce human oversight and organizational governance for deployments to staging and production environments. These gates ensure that deployments are reviewed by appropriate stakeholders (engineers, architects, SREs, CAB) and that risk assessments are completed before changes reach production.
Philosophy: Automation plus human judgment—while automated quality gates catch technical issues, manual approval gates ensure business readiness, risk awareness, and deployment coordination (change windows, on-call coverage, rollback preparedness).
Approval Gate Workflow¶
graph TD
A[Contract Gates Passed] --> B[Deploy to Staging Request]
B --> C{Pre-Production Gate}
C -->|Not Ready| D[Approval Denied ❌]
C -->|Ready| E[Lead Engineer Approval]
E --> F{1 Approver?}
F -->|No| D
F -->|Yes| G[Deploy to Staging]
G --> H[Staging Soak Period 24h]
H --> I{Production Gate}
I -->|Not Ready| J[Production Approval Denied ❌]
I -->|Ready| K[Architect Approval]
K --> L{2 Approvers?}
L -->|No| J
L -->|Yes| M[SRE Approval]
M --> N{1 SRE Approver?}
N -->|No| J
N -->|Yes| O[CAB Approval]
O --> P{CAB Approved?}
P -->|No| J
P -->|Yes| Q{Active Incidents?}
Q -->|Yes P1/P2| J
Q -->|No| R{Change Freeze?}
R -->|Yes| J
R -->|No| S[Deploy to Production]
D --> T[Remediate Issues]
J --> U[Reschedule Deployment]
S --> V[Production Monitoring]
style D fill:#ff6b6b
style J fill:#ff6b6b
style S fill:#90EE90
style V fill:#90EE90
Typical Approval Duration: Staging (1-4 hours), Production (4-24 hours depending on CAB schedule)
Pre-Production Approval (Staging)¶
Purpose: Ensure technical readiness for staging deployment through peer review and automated gate validation.
Approval Requirements:
| Requirement | Details | Timeout | Bypass Allowed |
|---|---|---|---|
| Minimum Approvers | 1 Lead Engineer | 4 hours | ❌ No (except hotfix) |
| Automated Gates | All quality gates passed | N/A | ❌ No |
| Test Results | 100% pass rate | N/A | ❌ No |
| Security Scan | Zero critical/high vulnerabilities | N/A | ⚠️ Yes (with risk acceptance) |
| Coverage Threshold | Service-specific threshold met | N/A | ❌ No |
Approval Checklist (Lead Engineer):
## Staging Deployment Approval Checklist
**Build**: `$(Build.BuildNumber)`
**Requested By**: `$(Build.RequestedFor)`
**Date**: `$(System.Date)`
### Automated Quality Gates
- [ ] All automated tests passed (unit, integration, E2E)
- [ ] Code coverage threshold met (≥70% for service)
- [ ] Security scans clean (SAST, dependency, secrets, container)
- [ ] OpenAPI contract backward compatible (or versioned)
- [ ] Event schema backward compatible (or versioned)
- [ ] OpenTelemetry instrumentation validated
- [ ] Health check endpoints validated
### Security & Compliance
- [ ] SBOM generated and reviewed (no prohibited licenses)
- [ ] Dependency vulnerabilities assessed (critical/high resolved or accepted)
- [ ] Secrets detection passed (no leaked credentials)
- [ ] Container image hardened (Trivy scan clean)
### Observability & Monitoring
- [ ] Structured logging validated (PII redacted)
- [ ] Distributed tracing configured (ActivitySource registered)
- [ ] Custom metrics emitted (business KPIs)
- [ ] Health check dependencies validated (database, message bus, cache)
### Documentation & Communication
- [ ] Architecture Decision Record (ADR) updated (if applicable)
- [ ] CHANGELOG updated with user-facing changes
- [ ] Rollback plan documented in deployment notes
### Approval Decision
- [ ] **APPROVED** — Deploy to staging
- [ ] **DENIED** — Block deployment (reason: _________________)
**Approver**: ________________
**Date/Time**: ________________
**Comments**: ________________
Azure DevOps Environment Configuration (Staging):
# Azure DevOps Environment: ATP-Staging
name: ATP-Staging
resourceType: none  # No direct Kubernetes/VM resources
# Approval configuration
approvals:
- type: requiredApprovers
  requiredApprovers:
  - group: ATP-Lead-Engineers
  minRequiredApprovers: 1
  instructions: |
    Review the staging deployment approval checklist before approving.
    Key validation points:
    - All automated quality gates passed
    - Security scans clean or risks accepted
    - SBOM reviewed for license compliance
    - Observability validated (logs, traces, metrics)
    - Rollback plan documented
  timeout: 4h
  notifyOnlyInitiator: false  # Notify all group members
# Pre-deployment gates
gates:
- type: azureFunction
  function: ValidateTestResults
  url: https://atp-approval-gates.azurewebsites.net/api/ValidateTestResults
  apiKey: $(ApprovalGateApiKey)
  successCriteria: '{"testPassRate": 100}'
  timeout: 5m
- type: azureFunction
  function: ValidateSecurityScan
  url: https://atp-approval-gates.azurewebsites.net/api/ValidateSecurityScan
  apiKey: $(ApprovalGateApiKey)
  successCriteria: '{"criticalVulnerabilities": 0, "highVulnerabilities": 0}'
  timeout: 5m
- type: azureFunction
  function: ValidateCoverageThreshold
  url: https://atp-approval-gates.azurewebsites.net/api/ValidateCoverageThreshold
  apiKey: $(ApprovalGateApiKey)
  successCriteria: '{"coverageMet": true}'
  timeout: 5m
# Deployment lock (prevent concurrent deployments)
lock:
  enabled: true
  lockType: sequential
Automated Gate Validation Function (C#):
// ValidateTestResults.cs — Azure Function for pre-deployment gate
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;
public static class ValidateTestResults
{
[FunctionName("ValidateTestResults")]
public static async Task<IActionResult> Run(
[HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req,
ILogger log)
{
log.LogInformation("Validating test results for deployment approval");
// Parse request body (Azure DevOps sends build info)
var requestBody = await new StreamReader(req.Body).ReadToEndAsync();
var gateRequest = JsonSerializer.Deserialize<GateRequest>(requestBody);
// Query Azure DevOps Test Results API
var testResults = await GetTestResultsAsync(gateRequest.BuildId);
// Validate test pass rate (must be 100%); guard against empty result sets
var totalTests = testResults.TotalCount;
var passedTests = testResults.PassedTests;
if (totalTests == 0)
{
return new BadRequestObjectResult(new { status = "Failed", message = "No test results found for build." });
}
var testPassRate = (double)passedTests / totalTests * 100;
log.LogInformation($"Test pass rate: {testPassRate:F2}% ({passedTests}/{totalTests})");
if (testPassRate < 100)
{
return new BadRequestObjectResult(new
{
status = "Failed",
message = $"Test pass rate is {testPassRate:F2}% (expected 100%). {totalTests - passedTests} tests failed.",
testPassRate,
totalTests,
passedTests,
failedTests = totalTests - passedTests
});
}
return new OkObjectResult(new
{
status = "Success",
message = "All tests passed",
testPassRate = 100,
totalTests,
passedTests
});
}
private static async Task<TestResults> GetTestResultsAsync(string buildId)
{
var azureDevOpsUrl = Environment.GetEnvironmentVariable("AZURE_DEVOPS_URL");
var pat = Environment.GetEnvironmentVariable("AZURE_DEVOPS_PAT");
var client = new HttpClient();
client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
"Basic",
Convert.ToBase64String(Encoding.ASCII.GetBytes($":{pat}"))
);
var response = await client.GetAsync(
$"{azureDevOpsUrl}/_apis/test/ResultSummaryByBuild?buildId={buildId}&api-version=7.0"
);
response.EnsureSuccessStatusCode();
var content = await response.Content.ReadAsStringAsync();
return JsonSerializer.Deserialize<TestResults>(content);
}
}
public class GateRequest
{
public string BuildId { get; set; }
public string EnvironmentName { get; set; }
public string StageName { get; set; }
}
public class TestResults
{
public int TotalCount { get; set; }
public int PassedTests { get; set; }
}
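The gate's decision rule itself is simple enough to state in a few lines. This Python sketch mirrors the function above (a 100% pass rate matching the `successCriteria: '{"testPassRate": 100}'` configured on the staging environment, with an empty result set also treated as a failure); the field names follow the function's response shape.

```python
def evaluate_test_gate(total: int, passed: int) -> dict:
    """Gate succeeds only at a 100% pass rate; no results is also a failure."""
    if total == 0:
        return {"status": "Failed", "message": "no test results found"}
    rate = passed / total * 100
    if rate < 100:
        return {"status": "Failed",
                "message": f"{total - passed} tests failed ({rate:.2f}% pass rate)"}
    return {"status": "Success", "message": "All tests passed", "testPassRate": 100}
```

Because the threshold is exactly 100%, a single flaky test blocks the approval, which is why flaky-test quarantine matters upstream of this gate.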
Production Approval (Multi-Level Governance)¶
Purpose: Ensure business readiness, risk mitigation, and organizational alignment for production deployment through multi-level approval and Change Advisory Board (CAB) review.
Approval Requirements:
| Requirement | Details | Timeout | Bypass Allowed |
|---|---|---|---|
| Minimum Approvers (Architects) | 2 ATP Architects | 24 hours | ❌ No (except emergency hotfix) |
| Minimum Approvers (SRE) | 1 SRE Team Member | 24 hours | ❌ No (except emergency hotfix) |
| CAB Approval | Change Advisory Board review | 24-72 hours | ❌ No (except emergency hotfix) |
| Automated Gates | Load tests, chaos tests, incident check | N/A | ❌ No |
| Staging Soak Period | Minimum 24 hours in staging | N/A | ⚠️ Yes (with architect override) |
| Active Incidents | No P1/P2 incidents open | N/A | ❌ No |
| Change Freeze | Outside blackout periods | N/A | ⚠️ Yes (with executive approval) |
Approval Checklist (Architects + SRE):
## Production Deployment Approval Checklist
**Build**: `$(Build.BuildNumber)`
**Requested By**: `$(Build.RequestedFor)`
**Deployment Window**: `[Start Date/Time] - [End Date/Time]`
**On-Call Engineer**: `[Name]`
### Staging Validation
- [ ] Staging deployment successful (minimum 24 hours soak period)
- [ ] No errors/exceptions in staging logs (Log Analytics reviewed)
- [ ] Performance metrics within thresholds (p95 latency <500ms, error rate <0.1%)
- [ ] Synthetic monitors passing (health checks, smoke tests)
- [ ] Load tests passed (1000 RPS sustained, p95 <500ms)
- [ ] Chaos tests passed (pod restart, network latency, storage failure)
### Change Management
- [ ] CAB approval obtained (change ticket: CR-XXXXXXX)
- [ ] Deployment window scheduled (change calendar updated)
- [ ] Change freeze respected (no deployment during blackout periods)
- [ ] Rollback plan tested in staging (slot swap or canary rollback)
- [ ] On-call engineer notified and available during deployment
- [ ] Communication plan prepared (status page, tenant email, Slack announcement)
### Risk Assessment
- [ ] No active P1/P2 incidents (Azure DevOps Boards checked)
- [ ] No concurrent deployments scheduled (deployment calendar reviewed)
- [ ] Breaking changes versioned and documented (API versioning, deprecation notices)
- [ ] Database migrations backward compatible (no downtime required)
- [ ] Feature flags configured for gradual rollout (10% → 25% → 50% → 100%)
### Compliance & Audit
- [ ] SBOM published to artifact feed (compliance evidence collected)
- [ ] Security scan reports archived (immutable storage, 7-year retention)
- [ ] Deployment approval trail captured (Azure DevOps audit log)
- [ ] ADR updated for architectural changes
### Post-Deployment Monitoring
- [ ] Monitoring dashboard prepared (Application Insights, Grafana)
- [ ] Alert rules validated (error rate, latency, availability)
- [ ] Runbook updated for incident response
### Approval Decision
- [ ] **APPROVED** — Deploy to production
- [ ] **APPROVED WITH CONDITIONS** — Deploy with specific constraints (e.g., canary only, feature flag off by default)
- [ ] **DENIED** — Block deployment (reason: _________________)
**Architect Approver 1**: ________________
**Architect Approver 2**: ________________
**SRE Approver**: ________________
**CAB Decision**: ________________
**Date/Time**: ________________
**Comments**: ________________
Azure DevOps Environment Configuration (Production):
# Azure DevOps Environment: ATP-Production
name: ATP-Production
resourceType: kubernetes  # Optional: link to AKS cluster

# Multi-level approval configuration
approvals:
  # Level 1: Architect approval (minimum 2)
  - type: requiredApprovers
    requiredApprovers:
      - group: ATP-Architects
    minRequiredApprovers: 2
    instructions: |
      Review the production deployment approval checklist.
      Key validation points:
      - Staging soak period completed (minimum 24 hours)
      - Load tests and chaos tests passed
      - No active P1/P2 incidents
      - CAB approval obtained
      - Deployment window scheduled
      - Rollback plan tested in staging
    timeout: 24h
    notifyOnlyInitiator: false

  # Level 2: SRE approval (minimum 1)
  - type: requiredApprovers
    requiredApprovers:
      - group: SRE-Team
    minRequiredApprovers: 1
    instructions: |
      SRE team review for production deployment.
      Validate:
      - On-call coverage during deployment window
      - Monitoring and alerting configured
      - Runbook updated for incident response
      - Rollback procedure validated
    timeout: 24h
    notifyOnlyInitiator: false

# Pre-deployment gates
gates:
  # Gate 1: Validate load test results
  - type: azureFunction
    function: ValidateLoadTests
    url: https://atp-approval-gates.azurewebsites.net/api/ValidateLoadTests
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"p95Latency": "<500", "errorRate": "<0.001", "throughput": ">=1000"}'
    timeout: 10m
    retryInterval: 2m

  # Gate 2: Validate chaos test results
  - type: azureFunction
    function: ValidateChaosTests
    url: https://atp-approval-gates.azurewebsites.net/api/ValidateChaosTests
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"podRestartPassed": true, "storageFailurePassed": true}'
    timeout: 10m
    retryInterval: 2m

  # Gate 3: Check active incidents (block if P1/P2 open)
  - type: azureFunction
    function: CheckActiveIncidents
    url: https://atp-approval-gates.azurewebsites.net/api/CheckActiveIncidents
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"activeP1Incidents": 0, "activeP2Incidents": 0}'
    timeout: 5m
    retryInterval: 1m

  # Gate 4: Validate staging soak period (minimum 24 hours)
  - type: azureFunction
    function: ValidateStagingSoakPeriod
    url: https://atp-approval-gates.azurewebsites.net/api/ValidateStagingSoakPeriod
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"soakPeriodHours": ">=24", "stagingHealthy": true}'
    timeout: 5m

  # Gate 5: Check change freeze (block if in blackout period)
  - type: azureFunction
    function: CheckChangeFreeze
    url: https://atp-approval-gates.azurewebsites.net/api/CheckChangeFreeze
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"inChangeFreeze": false}'
    timeout: 5m

# Deployment lock (prevent concurrent deployments)
lock:
  enabled: true
  lockType: exclusive  # Only one deployment at a time
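Each gate above posts measured values that are compared against a `successCriteria` JSON document using an informal comparison grammar (literal numbers and booleans for equality, strings like `"<500"` or `">=1000"` for thresholds). A Python sketch of an evaluator for that grammar, as it might run inside one of the gate functions (this is an assumption about how the criteria strings are interpreted, not a documented Azure DevOps feature):

```python
import json

def check(measured, criterion):
    """Evaluate one criterion: bools/numbers compare for equality;
    strings like '<500' or '>=1000' compare against a numeric threshold."""
    if isinstance(criterion, (bool, int, float)):
        return measured == criterion
    text = str(criterion)
    for op in ("<=", ">=", "<", ">"):  # two-char operators must be tried first
        if text.startswith(op):
            threshold = float(text[len(op):])
            return {"<=": measured <= threshold, ">=": measured >= threshold,
                    "<": measured < threshold, ">": measured > threshold}[op]
    return str(measured) == text  # fall back to string equality

def gate_passes(measurements, criteria_json):
    """True only when every key in the criteria document is satisfied."""
    criteria = json.loads(criteria_json)
    return all(check(measurements[key], value) for key, value in criteria.items())

# Mirrors Gate 1 (load-test validation) from the configuration above
criteria = '{"p95Latency": "<500", "errorRate": "<0.001", "throughput": ">=1000"}'
gate_passes({"p95Latency": 420, "errorRate": 0.0004, "throughput": 1100}, criteria)  # → True
```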
Automated Gate: Check Active Incidents (C#):
// CheckActiveIncidents.cs — Block production deployment if P1/P2 incidents open
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using Microsoft.TeamFoundation.WorkItemTracking.WebApi;
using Microsoft.TeamFoundation.WorkItemTracking.WebApi.Models;
using Microsoft.VisualStudio.Services.Common;
using Microsoft.VisualStudio.Services.WebApi;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class CheckActiveIncidents
{
    [FunctionName("CheckActiveIncidents")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req,
        ILogger log)
    {
        log.LogInformation("Checking for active P1/P2 incidents");

        var azureDevOpsUrl = Environment.GetEnvironmentVariable("AZURE_DEVOPS_URL");
        var pat = Environment.GetEnvironmentVariable("AZURE_DEVOPS_PAT");
        var projectName = Environment.GetEnvironmentVariable("AZURE_DEVOPS_PROJECT");

        var credentials = new VssBasicCredential(string.Empty, pat);
        var connection = new VssConnection(new Uri(azureDevOpsUrl), credentials);
        var witClient = connection.GetClient<WorkItemTrackingHttpClient>();

        // Query for active P1/P2 incidents
        var wiql = new Wiql
        {
            Query = @"
                SELECT [System.Id], [System.Title], [System.State], [Microsoft.VSTS.Common.Priority]
                FROM WorkItems
                WHERE [System.WorkItemType] = 'Incident'
                  AND [System.State] = 'Active'
                  AND [Microsoft.VSTS.Common.Priority] <= 2
                ORDER BY [Microsoft.VSTS.Common.Priority]"
        };

        var result = await witClient.QueryByWiqlAsync(wiql, projectName);

        // WIQL returns work item references only (Id, Url); fetch the full items to read their fields
        var ids = result.WorkItems.Select(reference => reference.Id).ToList();
        var activeIncidents = ids.Count > 0
            ? await witClient.GetWorkItemsAsync(ids, new[] { "System.Title", "Microsoft.VSTS.Common.Priority" })
            : new List<WorkItem>();

        var activeP1 = activeIncidents.Count(wi =>
            Convert.ToInt32(wi.Fields["Microsoft.VSTS.Common.Priority"]) == 1);
        var activeP2 = activeIncidents.Count(wi =>
            Convert.ToInt32(wi.Fields["Microsoft.VSTS.Common.Priority"]) == 2);

        log.LogInformation($"Active P1 incidents: {activeP1}, Active P2 incidents: {activeP2}");

        if (activeP1 > 0 || activeP2 > 0)
        {
            return new BadRequestObjectResult(new
            {
                status = "Failed",
                message = $"Active high-priority incidents detected: {activeP1} P1, {activeP2} P2. Resolve incidents before production deployment.",
                activeP1Incidents = activeP1,
                activeP2Incidents = activeP2,
                incidents = activeIncidents.Select(wi => new
                {
                    id = wi.Id,
                    title = wi.Fields["System.Title"].ToString(),
                    priority = wi.Fields["Microsoft.VSTS.Common.Priority"].ToString()
                })
            });
        }

        return new OkObjectResult(new
        {
            status = "Success",
            message = "No active P1/P2 incidents",
            activeP1Incidents = 0,
            activeP2Incidents = 0
        });
    }
}
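The same incident check can be exercised without the .NET client: the WIQL endpoint is a plain REST call (`POST https://dev.azure.com/{org}/{project}/_apis/wit/wiql?api-version=7.0` with a PAT via basic auth). This offline Python sketch builds the endpoint URL and tallies already-fetched work items; the HTTP call itself is left out, and the dict shape assumed for a work item mirrors the REST response (`{"fields": {...}}`):

```python
WIQL_QUERY = """\
SELECT [System.Id] FROM WorkItems
WHERE [System.WorkItemType] = 'Incident'
  AND [System.State] = 'Active'
  AND [Microsoft.VSTS.Common.Priority] <= 2
"""

def wiql_url(organization, project):
    """Endpoint to POST {"query": WIQL_QUERY} to, authenticated with a PAT."""
    return f"https://dev.azure.com/{organization}/{project}/_apis/wit/wiql?api-version=7.0"

def count_by_priority(work_items):
    """Tally fully fetched work items (dicts with a 'fields' map) into P1/P2 counts."""
    counts = {1: 0, 2: 0}
    for item in work_items:
        priority = int(item["fields"]["Microsoft.VSTS.Common.Priority"])
        if priority in counts:
            counts[priority] += 1
    return counts
```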
Change Advisory Board (CAB) Process¶
Purpose: Provide cross-functional review of production changes to assess business impact, technical risk, and deployment coordination.
CAB Composition:
| Role | Responsibilities | Required for Approval |
|---|---|---|
| Lead Architect | Technical feasibility, architectural alignment | ✅ Yes |
| SRE Lead | Operational readiness, on-call coverage | ✅ Yes |
| Product Owner | Business impact, user communication | ✅ Yes |
| Security Officer | Security risk assessment, compliance | ⚠️ For security changes only |
| Customer Success | Tenant impact, downtime communication | ⚠️ For breaking changes only |
CAB Meeting Cadence:
- Weekly: Tuesday 10:00 AM (routine changes)
- Emergency: On-demand via Slack `/cab-emergency` (hotfixes, P1 incidents)
- Async Review: Low-risk changes via Azure DevOps approval workflow (no meeting required)
CAB Approval Workflow:
graph TD
A[Create Change Request] --> B{Change Type?}
B -->|Standard| C[Weekly CAB Meeting]
B -->|Emergency| D[Emergency CAB]
B -->|Low-Risk| E[Async Approval]
C --> F[CAB Review]
D --> G[Emergency Review within 2h]
E --> H[Async Review 24h]
F --> I{Approved?}
G --> I
H --> I
I -->|No| J[Change Denied/Deferred]
I -->|Yes| K[CAB Approval Granted]
J --> L[Remediate Issues]
K --> M[Schedule Deployment]
M --> N[Production Deployment]
style J fill:#ff6b6b
style K fill:#90EE90
style N fill:#90EE90
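The routing decision at the top of the flow above can be captured as a small lookup, with review SLAs taken from the cadence list (the 72-hour standard SLA is an assumption derived from the "24-72h" window in the emergency-approval comparison table):

```python
REVIEW_PATHS = {
    "Standard":  {"path": "Weekly CAB Meeting", "review_sla_hours": 72},
    "Emergency": {"path": "Emergency CAB",      "review_sla_hours": 2},
    "Low-Risk":  {"path": "Async Approval",     "review_sla_hours": 24},
}

def route_change(change_type):
    """Map a change request type to its CAB review path and SLA."""
    try:
        return REVIEW_PATHS[change_type]
    except KeyError:
        raise ValueError(
            f"Unknown change type: {change_type!r}; expected one of {sorted(REVIEW_PATHS)}")
```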
Change Request Template (Azure DevOps Work Item):
# Work Item Type: Change Request
fields:
  - field: System.Title
    value: "[ATP] Production Deployment — $(Build.BuildNumber)"
  - field: System.Description
    value: |
      ## Change Summary
      **Service**: ATP Ingestion Service
      **Build**: $(Build.BuildNumber)
      **Deployment Window**: [Start] - [End]
      **Estimated Duration**: 30 minutes

      ## Change Details
      ### Features Added
      - Feature 1: Description
      - Feature 2: Description

      ### Bug Fixes
      - Bug 1: Description
      - Bug 2: Description

      ### Breaking Changes
      - None (or list breaking changes with mitigation)

      ## Risk Assessment
      **Risk Level**: Low / Medium / High
      **Impact**: Tenant-facing / Internal / Infrastructure
      **Rollback Strategy**: Blue-green slot swap (30 seconds)

      ## Testing Evidence
      - [ ] All automated quality gates passed
      - [ ] Load tests passed (p95 <500ms, error rate <0.1%)
      - [ ] Chaos tests passed (pod restart, storage failure)
      - [ ] Staging soak period completed (24+ hours)

      ## Communication Plan
      - [ ] Status page updated (if user-facing changes)
      - [ ] Tenant email sent (if breaking changes)
      - [ ] Slack #atp-deployments announcement

      ## Approval Checklist
      - [ ] Lead Architect approved
      - [ ] SRE Lead approved
      - [ ] Product Owner approved
  - field: Microsoft.VSTS.Common.Priority
    value: 2  # P2 by default; P1 for emergency hotfixes
  - field: Custom.ChangeType
    value: Standard  # Standard / Emergency / Low-Risk
  - field: Custom.RiskLevel
    value: Medium  # Low / Medium / High
  - field: Custom.DeploymentWindow
    value: "[2025-11-01 02:00 UTC] - [2025-11-01 04:00 UTC]"
  - field: Custom.RollbackPlan
    value: "Blue-green slot swap via Azure CLI: az webapp deployment slot swap"
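Before a change request enters the CAB workflow, its fields can be validated against the template above. A Python sketch of that pre-submission check (the required-field set and allowed values are taken from the template; the function itself is illustrative):

```python
REQUIRED_FIELDS = {
    "System.Title",
    "System.Description",
    "Custom.ChangeType",
    "Custom.RiskLevel",
    "Custom.DeploymentWindow",
    "Custom.RollbackPlan",
}
ALLOWED_VALUES = {
    "Custom.ChangeType": {"Standard", "Emergency", "Low-Risk"},
    "Custom.RiskLevel": {"Low", "Medium", "High"},
}

def validate_change_request(fields):
    """Return a list of validation problems; an empty list means the CR is well-formed."""
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - fields.keys())]
    for name, allowed in ALLOWED_VALUES.items():
        if name in fields and fields[name] not in allowed:
            problems.append(f"{name} must be one of {sorted(allowed)}")
    return problems
```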
CAB Approval Automation (Azure Function):
// GetCABApprovalStatus.cs — Check CAB approval status for change request
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using Microsoft.TeamFoundation.WorkItemTracking.WebApi.Models;
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;

public static class GetCABApprovalStatus
{
    // Payload posted by the pipeline gate (identifies the build to look up)
    private class GateRequest
    {
        public string BuildId { get; set; }
    }

    [FunctionName("GetCABApprovalStatus")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req,
        ILogger log)
    {
        log.LogInformation("Checking CAB approval status");

        var requestBody = await new StreamReader(req.Body).ReadToEndAsync();
        var gateRequest = JsonSerializer.Deserialize<GateRequest>(requestBody);

        // Query Azure DevOps for linked Change Request work item
        var changeRequest = await GetChangeRequestAsync(gateRequest.BuildId);
        if (changeRequest == null)
        {
            return new BadRequestObjectResult(new
            {
                status = "Failed",
                message = "No Change Request work item linked to this build. Create a Change Request and link it to the build."
            });
        }

        // Check approval fields
        bool IsApproved(string fieldName) =>
            changeRequest.Fields.ContainsKey(fieldName) &&
            changeRequest.Fields[fieldName].ToString() == "Yes";

        var cabApproved = IsApproved("Custom.CABApproved");
        var leadArchitectApproved = IsApproved("Custom.LeadArchitectApproved");
        var sreLeadApproved = IsApproved("Custom.SRELeadApproved");
        var productOwnerApproved = IsApproved("Custom.ProductOwnerApproved");

        log.LogInformation($"CAB: {cabApproved}, Architect: {leadArchitectApproved}, SRE: {sreLeadApproved}, PO: {productOwnerApproved}");

        if (!cabApproved || !leadArchitectApproved || !sreLeadApproved || !productOwnerApproved)
        {
            var missingApprovals = new List<string>();
            if (!cabApproved) missingApprovals.Add("CAB");
            if (!leadArchitectApproved) missingApprovals.Add("Lead Architect");
            if (!sreLeadApproved) missingApprovals.Add("SRE Lead");
            if (!productOwnerApproved) missingApprovals.Add("Product Owner");

            return new BadRequestObjectResult(new
            {
                status = "Failed",
                message = $"CAB approval incomplete. Missing approvals: {string.Join(", ", missingApprovals)}",
                changeRequestId = changeRequest.Id,
                cabApproved,
                leadArchitectApproved,
                sreLeadApproved,
                productOwnerApproved
            });
        }

        return new OkObjectResult(new
        {
            status = "Success",
            message = "CAB approval granted",
            changeRequestId = changeRequest.Id,
            cabApproved = true,
            leadArchitectApproved = true,
            sreLeadApproved = true,
            productOwnerApproved = true
        });
    }

    private static async Task<WorkItem> GetChangeRequestAsync(string buildId)
    {
        // Query Azure DevOps API for Change Request work item linked to build
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }
}
Emergency Approval Procedures (Hotfixes)¶
Purpose: Enable rapid deployment of critical fixes (P1 incidents, security vulnerabilities) with streamlined approval while maintaining governance.
Emergency Approval Requirements:
| Requirement | Standard Deployment | Emergency Hotfix |
|---|---|---|
| Minimum Approvers (Architects) | 2 | 1 |
| Minimum Approvers (SRE) | 1 | 1 |
| CAB Approval | Yes (24-72h) | Async (2h post-deployment) |
| Staging Soak Period | 24+ hours | 1-2 hours (expedited) |
| Load/Chaos Tests | Required | Optional (skip if time-critical) |
| Change Freeze | Respected | Bypassed with executive approval |
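The decision between the two approval tracks in the table above reduces to the emergency-justification criteria listed later in the deployment checklist (active P1 incident, or a security vulnerability with CVSS ≥ 9.0). A Python sketch of that selection, with requirement dictionaries transcribed from the table:

```python
STANDARD_TRACK = {
    "min_architects": 2, "min_sre": 1, "min_soak_hours": 24,
    "load_chaos_tests": "required", "cab": "pre-deployment",
}
EMERGENCY_TRACK = {
    "min_architects": 1, "min_sre": 1, "min_soak_hours": 1,
    "load_chaos_tests": "optional", "cab": "async (2h post-deployment)",
}

def approval_requirements(p1_incident=False, cvss=0.0):
    """Emergency track applies only to active P1 incidents or critical CVEs (CVSS >= 9.0)."""
    if p1_incident or cvss >= 9.0:
        return EMERGENCY_TRACK
    return STANDARD_TRACK
```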
Emergency Approval Workflow:
# Emergency hotfix approval (Azure DevOps Environment)
name: ATP-Production-Hotfix

approvals:
  - type: requiredApprovers
    requiredApprovers:
      - group: ATP-Architects
      - group: SRE-Team
    minRequiredApprovers: 2  # 1 Architect + 1 SRE
    instructions: |
      **EMERGENCY HOTFIX APPROVAL**
      This is an expedited approval for a critical production issue.
      Validate:
      - P1 incident ticket linked (incident severity justified)
      - Hotfix tested in staging (minimum 1 hour)
      - Rollback plan documented and tested
      - On-call engineer notified and available
      - CAB async review scheduled (within 2 hours post-deployment)
    timeout: 2h  # Expedited timeout
    notifyOnlyInitiator: false

gates:
  # Simplified gates for emergency hotfix
  - type: azureFunction
    function: ValidateEmergencyHotfix
    url: https://atp-approval-gates.azurewebsites.net/api/ValidateEmergencyHotfix
    apiKey: $(ApprovalGateApiKey)
    successCriteria: '{"p1IncidentLinked": true, "stagingTested": true}'
    timeout: 5m
Emergency Deployment Checklist:
## Emergency Hotfix Deployment Checklist
**Incident**: P1-XXXXX
**Build**: $(Build.BuildNumber)
**Severity**: Critical
**Deployment Time**: Immediate
### Emergency Justification
- [ ] P1 incident active (production down or severe degradation)
- [ ] Security vulnerability (CVSS ≥9.0) requiring immediate patching
- [ ] Data loss/corruption risk requiring immediate mitigation
### Minimal Validation
- [ ] Hotfix tested in staging (minimum 1 hour)
- [ ] Rollback plan documented and tested
- [ ] On-call engineer notified and available during deployment
- [ ] Incident commander assigned (coordinates deployment)
### Post-Deployment Requirements
- [ ] CAB async review scheduled (within 2 hours)
- [ ] Post-incident review (PIR) scheduled (within 48 hours)
- [ ] Incident status page updated (communicate fix deployed)
### Approval Decision
- [ ] **APPROVED (EMERGENCY)** — Deploy immediately
**Architect Approver**: ________________
**SRE Approver**: ________________
**Incident Commander**: ________________
**Date/Time**: ________________
Approval Tracking & Audit Trail¶
Purpose: Maintain comprehensive audit trail of all deployment approvals for compliance (SOC 2, GDPR, HIPAA).
Approval Audit Data Captured:
| Data Point | Captured | Retention | Immutable |
|---|---|---|---|
| Approver Identity | User principal name, email | 7 years | ✅ Yes |
| Approval Timestamp | UTC timestamp | 7 years | ✅ Yes |
| Approval Decision | Approved/Denied/Deferred | 7 years | ✅ Yes |
| Approval Comments | Free-text justification | 7 years | ✅ Yes |
| Build Artifacts | Build number, commit SHA, SBOM | 7 years | ✅ Yes |
| Automated Gate Results | Test results, security scans, coverage | 7 years | ✅ Yes |
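One exported audit entry covers the data points in the table above plus an explicit retention horizon. A Python sketch of the record assembly (field names are illustrative, chosen to match the export function that follows, and the 365-day year is an approximation):

```python
from datetime import datetime, timedelta, timezone

RETENTION_YEARS = 7  # retention period from the table above

def approval_audit_record(approver_upn, decision, build_number, comments=""):
    """Assemble one immutable-store audit entry for a deployment approval."""
    now = datetime.now(timezone.utc)
    return {
        "approverUpn": approver_upn,
        "timestampUtc": now.isoformat(),
        "decision": decision,  # Approved / Denied / Deferred
        "comments": comments,
        "buildNumber": build_number,
        "retainUntilUtc": (now + timedelta(days=365 * RETENTION_YEARS)).isoformat(),
    }
```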
Approval Audit Export (Azure Function):
// ExportApprovalAuditTrail.cs — Export approval history for compliance
using Azure.Storage.Blobs;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class ExportApprovalAuditTrail
{
    [FunctionName("ExportApprovalAuditTrail")]
    public static async Task Run(
        [TimerTrigger("0 0 2 * * 0")] TimerInfo timer,  // Weekly: Sunday 2:00 AM
        ILogger log)
    {
        log.LogInformation("Exporting approval audit trail for compliance");

        var azureDevOpsUrl = Environment.GetEnvironmentVariable("AZURE_DEVOPS_URL");
        var pat = Environment.GetEnvironmentVariable("AZURE_DEVOPS_PAT");
        var projectName = Environment.GetEnvironmentVariable("AZURE_DEVOPS_PROJECT");

        // Query Azure DevOps Audit Log API for approval events
        var auditEvents = await GetApprovalAuditEventsAsync(azureDevOpsUrl, pat, projectName);

        // Transform to compliance format
        var complianceRecords = auditEvents.Select(e => new
        {
            timestamp = e.Timestamp,
            approverUpn = e.Actor.Upn,
            approverEmail = e.Actor.Email,
            environment = e.Resource.EnvironmentName,
            buildNumber = e.Resource.BuildNumber,
            decision = e.Data.Decision,  // Approved/Denied
            comments = e.Data.Comments,
            changeRequestId = e.Data.ChangeRequestId
        }).ToList();

        // Export to JSON (for compliance evidence)
        var json = JsonSerializer.Serialize(complianceRecords, new JsonSerializerOptions
        {
            WriteIndented = true
        });

        // Upload to immutable Azure Blob Storage (WORM, 7-year retention)
        var storageConnectionString = Environment.GetEnvironmentVariable("COMPLIANCE_STORAGE_CONNECTION");
        var blobServiceClient = new BlobServiceClient(storageConnectionString);
        var containerClient = blobServiceClient.GetBlobContainerClient("approval-audit-trail");
        var blobName = $"approval-audit-trail-{DateTime.UtcNow:yyyy-MM-dd}.json";
        var blobClient = containerClient.GetBlobClient(blobName);

        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(json)))
        {
            await blobClient.UploadAsync(stream, overwrite: false);
        }

        // Set legal hold (immutability)
        await blobClient.SetLegalHoldAsync(hasLegalHold: true);

        log.LogInformation($"Approval audit trail exported: {blobName} ({complianceRecords.Count} records)");
    }

    private static async Task<List<AuditEvent>> GetApprovalAuditEventsAsync(
        string azureDevOpsUrl, string pat, string projectName)
    {
        // AuditEvent is a project-local model of an Azure DevOps audit log entry
        // Implementation omitted for brevity
        throw new NotImplementedException();
    }
}
Summary¶
- Approval Gates (Manual): Human oversight for staging (1 approver, 4h) and production (3 approvers + CAB, 24h)
- Pre-Production Approval: Lead Engineer reviews automated gate results, security scans, observability, rollback plan
- Production Approval: Multi-level (2 Architects + 1 SRE + CAB), validates staging soak period, active incidents, change freeze, deployment window
- Azure DevOps Environment Configuration: Complete YAML for approval groups, minimum approvers, timeout, automated gates (test results, security, coverage, incidents, change freeze)
- Automated Gate Functions: 5 C# Azure Functions (ValidateTestResults, CheckActiveIncidents, ValidateStagingSoakPeriod, CheckChangeFreeze, GetCABApprovalStatus)
- CAB Process: Weekly meetings, emergency on-demand, async for low-risk, change request template (risk level, deployment window, rollback plan)
- Emergency Hotfix Procedures: Expedited approval (1 Architect + 1 SRE, 2h timeout), simplified gates, P1 incident justification, CAB async review post-deployment
- Approval Audit Trail: 7-year retention in immutable storage (WORM, legal hold), weekly export to Azure Blob, compliance evidence for SOC 2/GDPR/HIPAA
Quality Gate Metrics & Dashboards¶
Quality gate metrics provide data-driven insights into pipeline health, test effectiveness, security posture, and deployment reliability. These metrics enable continuous improvement through trend analysis, anomaly detection, and proactive remediation of quality issues.
Philosophy: What gets measured gets improved—comprehensive metrics enable teams to identify quality trends, detect regressions early, and make data-driven decisions about process improvements. ATP tracks 15+ quality metrics with monthly reviews and quarterly improvement cycles.
Quality Metrics Architecture¶
graph TD
A[Pipeline Execution] --> B[Emit Metrics]
B --> C[Azure DevOps Analytics]
B --> D[Application Insights]
B --> E[Log Analytics]
C --> F[Quality Dashboard]
D --> F
E --> F
F --> G{Threshold Exceeded?}
G -->|Yes| H[Alert & Notify]
G -->|No| I[Store Historical Data]
H --> J[Slack Notification]
H --> K[Email Notification]
H --> L[PagerDuty Alert]
I --> M[Trend Analysis]
M --> N[Monthly Quality Review]
N --> O[Improvement Backlog]
O --> P[Quarterly Roadmap]
style H fill:#feca57
style J fill:#feca57
style K fill:#feca57
style L fill:#ff6b6b
style P fill:#90EE90
Key Metrics (Tracked)¶
Purpose: Monitor quality gate effectiveness across all ATP services and identify improvement opportunities.
Quality Metrics Scorecard:
| Metric | Target | Current | Trend | Blocker | Measurement Frequency |
|---|---|---|---|---|---|
| Build Success Rate | ≥98% | 97.2% | ↗️ Improving | ❌ No | Per build |
| Test Pass Rate | 100% | 99.8% | → Stable | ✅ Yes | Per build |
| Code Coverage (Avg) | ≥70% | 73.5% | ↗️ Improving | ✅ Yes | Per build |
| Branch Coverage (Avg) | ≥60% | 64.2% | ↗️ Improving | ⚠️ Warning | Per build |
| Security Scan Pass Rate | 100% | 98.5% | ↗️ Improving | ✅ Yes | Per build |
| SBOM Generation Success | 100% | 100% | → Stable | ✅ Yes | Per build |
| Container Scan Pass Rate | ≥95% | 96.8% | → Stable | ⚠️ Warning | Per build |
| Deployment Success Rate | ≥95% | 96.1% | → Stable | ❌ No | Per deployment |
| Flaky Test Rate | <2% | 1.3% | ↘️ Decreasing | ❌ No | Daily |
| Mean Time to Fix Gate | <4 hours | 3.2 hours | ↘️ Decreasing | ❌ No | Per failure |
| API Breaking Changes | 0 | 0 | → Stable | ✅ Yes | Per build |
| Schema Breaking Changes | 0 | 0 | → Stable | ✅ Yes | Per build |
| Critical Vulnerabilities | 0 | 0 | → Stable | ✅ Yes | Per build |
| High Vulnerabilities | 0 | 1 | ↗️ Regressing | ⚠️ Warning | Per build |
| Compliance Gate Pass Rate | 100% | 100% | → Stable | ✅ Yes | Per build |
Metric Trend Indicators:
- ↗️ Improving: Metric moving toward its target (positive trend)
- → Stable: Metric at target or within acceptable variance (±2%)
- ↘️ Decreasing: Metric value falling, which is an improvement for lower-is-better metrics (e.g., flaky test rate, mean time to fix)
- ↗️ / ⚠️ Regressing: Metric moving away from its target (requires attention)
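The legend above can be made precise: the direction of movement only counts as improvement or regression relative to whether the metric is higher-is-better (coverage) or lower-is-better (flaky test rate), with changes inside the ±2% band treated as stable. A Python sketch of that classification (treating the variance as percentage points, which is an assumption):

```python
def trend(current, previous, lower_is_better=False, variance=2.0):
    """Classify a metric's movement per the indicator legend (within +/-2 points = stable)."""
    delta = current - previous
    if abs(delta) <= variance:
        return "stable"
    improving = delta < 0 if lower_is_better else delta > 0
    return "improving" if improving else "regressing"
```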
KQL Queries for Quality Metrics:
// Build Success Rate (Last 30 Days)
Build
| where Repository == "ConnectSoft.ATP.Ingestion"
| where QueueTime >= ago(30d)
| summarize
TotalBuilds = count(),
SuccessfulBuilds = countif(Result == "succeeded"),
FailedBuilds = countif(Result == "failed" or Result == "canceled")
by bin(QueueTime, 1d)
| extend SuccessRate = round((todouble(SuccessfulBuilds) / TotalBuilds) * 100, 2)
| project
Date = format_datetime(QueueTime, 'yyyy-MM-dd'),
TotalBuilds,
SuccessfulBuilds,
FailedBuilds,
SuccessRate
| order by Date desc
| render timechart with (title="Build Success Rate (30 Days)", ytitle="Success Rate %", xtitle="Date")
// Test Pass Rate (Per Service, Last 7 Days)
TestRun
| where StartedDate >= ago(7d)
| summarize
TotalTests = sum(TotalTests),
PassedTests = sum(PassedTests),
FailedTests = sum(FailedTests)
by BuildDefinitionName, bin(StartedDate, 1d)
| extend TestPassRate = round((todouble(PassedTests) / TotalTests) * 100, 2)
| project
Service = BuildDefinitionName,
Date = format_datetime(StartedDate, 'yyyy-MM-dd'),
TotalTests,
PassedTests,
FailedTests,
TestPassRate
| order by Service, Date desc
// Code Coverage Trend (Last 90 Days)
CodeCoverage
| where BuildCompletedDate >= ago(90d)
| where Repository startswith "ConnectSoft.ATP"
| summarize
AvgLineCoverage = round(avg(LineCoveragePercent), 2),
AvgBranchCoverage = round(avg(BranchCoveragePercent), 2)
by bin(BuildCompletedDate, 7d), Repository
| project
Week = format_datetime(BuildCompletedDate, 'yyyy-MM-dd'),
Service = extract(@"ConnectSoft\.ATP\.(\w+)", 1, Repository),
AvgLineCoverage,
AvgBranchCoverage
| order by Week desc, Service
| render timechart with (title="Code Coverage Trend (90 Days)", ytitle="Coverage %")
// Security Vulnerability Trend (Last 180 Days)
SecurityScan
| where ScanDate >= ago(180d)
| where Project == "ConnectSoft.ATP"
| summarize
CriticalCount = countif(Severity == "Critical"),
HighCount = countif(Severity == "High"),
MediumCount = countif(Severity == "Medium"),
LowCount = countif(Severity == "Low")
by bin(ScanDate, 7d), Service
| extend TotalVulnerabilities = CriticalCount + HighCount + MediumCount + LowCount
| project
Week = format_datetime(ScanDate, 'yyyy-MM-dd'),
Service,
CriticalCount,
HighCount,
MediumCount,
LowCount,
TotalVulnerabilities
| order by Week desc
Azure DevOps Dashboard Configuration¶
Purpose: Provide at-a-glance visibility into quality gate health across all ATP services with drill-down capabilities for root cause analysis.
Dashboard Structure:
ConnectSoft ATP — Quality Gates Dashboard
═══════════════════════════════════════════════════════════
┌─────────────────────────────────────────────────────────┐
│ BUILD HEALTH │
├─────────────────────────────────────────────────────────┤
│ • Build Success Rate (30d): 97.2% ↗️ │
│ • Average Build Duration: 8.3 min → Target: <10 min │
│ • Failed Builds (7d): 3 builds │
│ • Top Failure Reasons: │
│ 1. Code coverage below threshold (2 builds) │
│ 2. Security scan failed (1 build) │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ TEST RESULTS │
├─────────────────────────────────────────────────────────┤
│ • Test Pass Rate: 99.8% → Target: 100% │
│ • Flaky Tests Detected: 4 tests (1.3% of total) │
│ • Average Test Duration: 4.1 min → Target: <5 min │
│ • Coverage Trend (30d): │
│ - Ingestion: 76.2% ↗️ │
│ - Query: 81.5% → │
│ - Integrity: 86.1% ↗️ │
│ - Export: 71.8% ↗️ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ SECURITY POSTURE │
├─────────────────────────────────────────────────────────┤
│ • Critical Vulnerabilities: 0 ✅ │
│ • High Vulnerabilities: 1 ⚠️ (1 accepted risk) │
│ • Medium Vulnerabilities: 5 (backlog) │
│ • Secrets Detected: 0 ✅ │
│ • Container Scan Pass: 96.8% │
│ • License Compliance: 100% ✅ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ DEPLOYMENT FREQUENCY (DORA Metrics) │
├─────────────────────────────────────────────────────────┤
│ • Deployment Frequency: 12.3/month → Elite (>1/week) │
│ • Lead Time (Commit→Prod): 3.2 days → High (1-7 days) │
│ • MTTR (Incident→Fix): 2.1 hours → High (<1 day)      │
│ • Change Failure Rate: 3.9% → Elite (<5%) │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ QUALITY GATE VIOLATIONS (Last 30 Days) │
├─────────────────────────────────────────────────────────┤
│ 1. Coverage gate failed: 12 builds (39%) │
│ 2. Security scan failed: 8 builds (26%) │
│ 3. Test failures: 7 builds (23%) │
│ 4. API breaking changes: 4 builds (13%) │
│ 5. SBOM generation failed: 0 builds (0%) │
│ │
│ Mean Time to Fix: 3.2 hours │
└─────────────────────────────────────────────────────────┘
Azure DevOps Dashboard Widgets (JSON configuration):
{
  "name": "ATP Quality Gates Dashboard",
  "description": "Quality gate metrics and trends for ConnectSoft ATP",
  "widgets": [
    {
      "name": "Build Success Rate",
      "position": { "row": 1, "column": 1 },
      "size": { "rowSpan": 2, "columnSpan": 2 },
      "settings": {
        "query": "Build | where Repository startswith 'ConnectSoft.ATP' | where QueueTime >= ago(30d) | summarize SuccessRate = round((todouble(countif(Result == 'succeeded')) / count()) * 100, 2) by bin(QueueTime, 1d)"
      },
      "contributionId": "ms.vss-dashboards-web.Microsoft.VisualStudioOnline.Dashboards.QueryScalarWidget"
    },
    {
      "name": "Test Coverage Trend",
      "position": { "row": 1, "column": 3 },
      "size": { "rowSpan": 2, "columnSpan": 3 },
      "settings": {
        "query": "CodeCoverage | where BuildCompletedDate >= ago(90d) | summarize AvgCoverage = avg(LineCoveragePercent) by bin(BuildCompletedDate, 7d), Repository"
      },
      "contributionId": "ms.vss-dashboards-web.Microsoft.VisualStudioOnline.Dashboards.QueryChartWidget"
    },
    {
      "name": "Security Vulnerability Count",
      "position": { "row": 3, "column": 1 },
      "size": { "rowSpan": 2, "columnSpan": 2 },
      "settings": {
        "query": "SecurityScan | where ScanDate >= ago(30d) | summarize CriticalCount = countif(Severity == 'Critical'), HighCount = countif(Severity == 'High') by Service"
      },
      "contributionId": "ms.vss-dashboards-web.Microsoft.VisualStudioOnline.Dashboards.QueryTableWidget"
    },
    {
      "name": "DORA Metrics",
      "position": { "row": 3, "column": 3 },
      "size": { "rowSpan": 2, "columnSpan": 3 },
      "settings": {
        "metrics": [
          { "name": "Deployment Frequency", "value": "12.3/month", "target": ">1/week", "classification": "Elite" },
          { "name": "Lead Time", "value": "3.2 days", "target": "<7 days", "classification": "High" },
          { "name": "MTTR", "value": "2.1 hours", "target": "<1 hour", "classification": "High" },
          { "name": "Change Failure Rate", "value": "3.9%", "target": "<5%", "classification": "Elite" }
        ]
      },
      "contributionId": "ms.vss-dashboards-web.Microsoft.VisualStudioOnline.Dashboards.MarkdownWidget"
    },
    {
      "name": "Quality Gate Violations",
      "position": { "row": 5, "column": 1 },
      "size": { "rowSpan": 3, "columnSpan": 5 },
      "settings": {
        "query": "QualityGateViolation | where ViolationDate >= ago(30d) | summarize Count = count() by GateType, FailureReason | order by Count desc"
      },
      "contributionId": "ms.vss-dashboards-web.Microsoft.VisualStudioOnline.Dashboards.QueryChartWidget"
    }
  ]
}
Dashboard Widget KQL Queries (Detailed):
// Widget 1: Build Success Rate (30-Day Trend)
Build
| where Repository startswith "ConnectSoft.ATP"
| where QueueTime >= ago(30d)
| summarize
TotalBuilds = count(),
SuccessfulBuilds = countif(Result == "succeeded"),
FailedBuilds = countif(Result != "succeeded")
by bin(QueueTime, 1d)
| extend SuccessRate = round((todouble(SuccessfulBuilds) / TotalBuilds) * 100, 2)
| project
Date = format_datetime(QueueTime, 'yyyy-MM-dd'),
SuccessRate,
TotalBuilds,
SuccessfulBuilds,
FailedBuilds
| order by Date desc
| render timechart with (title="Build Success Rate (30 Days)", ytitle="Success Rate %", xtitle="Date", ymin=0, ymax=100)
// Widget 2: Test Coverage by Service (Current + Trend)
CodeCoverage
| where BuildCompletedDate >= ago(90d)
| where Repository startswith "ConnectSoft.ATP"
| extend Service = extract(@"ConnectSoft\.ATP\.(\w+)", 1, Repository)
| summarize
CurrentCoverage = round(avgif(LineCoveragePercent, BuildCompletedDate >= ago(30d)), 2),
PreviousCoverage = round(avgif(LineCoveragePercent, BuildCompletedDate < ago(30d)), 2)
by Service
| extend
Trend = case(
CurrentCoverage > PreviousCoverage + 2, "↗️ Improving",
CurrentCoverage < PreviousCoverage - 2, "⚠️ Regressing",
"→ Stable"
),
Target = 70.0,
Status = case(
CurrentCoverage >= 70, "✅ Met",
CurrentCoverage >= 60, "⚠️ Close",
"❌ Below"
)
| project
Service,
CurrentCoverage,
Target,
Trend,
Status
| order by CurrentCoverage desc
// Widget 3: Security Vulnerability Summary (Current State)
SecurityScan
| where ScanDate >= ago(7d)
| where Project == "ConnectSoft.ATP"
| summarize arg_max(ScanDate, *) by Service, Severity // Latest scan per service and severity
| summarize
CriticalCount = sumif(VulnerabilityCount, Severity == "Critical"),
HighCount = sumif(VulnerabilityCount, Severity == "High"),
MediumCount = sumif(VulnerabilityCount, Severity == "Medium"),
LowCount = sumif(VulnerabilityCount, Severity == "Low")
by Service
| extend TotalVulnerabilities = CriticalCount + HighCount + MediumCount + LowCount
| extend
RiskLevel = case(
CriticalCount > 0, "🔴 Critical",
HighCount > 0, "🟠 High",
MediumCount > 5, "🟡 Medium",
"🟢 Low"
)
| project
Service,
RiskLevel,
CriticalCount,
HighCount,
MediumCount,
LowCount,
TotalVulnerabilities
| order by CriticalCount desc, HighCount desc
// Widget 4: Flaky Test Detection (Last 30 Days)
TestRun
| where StartedDate >= ago(30d)
| where BuildDefinitionName startswith "ConnectSoft.ATP"
| join kind=inner (
TestResult
| where CompletedDate >= ago(30d)
) on TestRunId
| summarize
TotalRuns = count(),
PassCount = countif(Outcome == "Passed"),
FailCount = countif(Outcome == "Failed")
by TestCaseName, BuildDefinitionName
| where TotalRuns >= 10 // Only tests run at least 10 times
| extend FlakyScore = round((todouble(FailCount) / TotalRuns) * 100, 2)
| where FlakyScore > 0 and FlakyScore < 100 // Exclude always-passing and always-failing tests
| project
Service = extract(@"ConnectSoft\.ATP\.(\w+)", 1, BuildDefinitionName),
TestCaseName,
TotalRuns,
PassCount,
FailCount,
FlakyScore
| order by FlakyScore desc, TotalRuns desc
| take 20
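The flaky-score logic in the query above (a test is flaky when it both passes and fails across at least 10 runs) translates directly to ordinary code. A Python sketch over raw `(test_name, outcome)` pairs:

```python
def flaky_tests(results, min_runs=10):
    """results: iterable of (test_name, outcome) with outcome 'Passed' or 'Failed'.
    Returns (name, flaky_score_percent) pairs for tests with mixed outcomes,
    sorted most-flaky first; always-passing and always-failing tests are excluded."""
    tally = {}
    for name, outcome in results:
        runs, fails = tally.get(name, (0, 0))
        tally[name] = (runs + 1, fails + (outcome == "Failed"))
    flaky = []
    for name, (runs, fails) in tally.items():
        if runs >= min_runs and 0 < fails < runs:
            flaky.append((name, round(fails / runs * 100, 2)))
    return sorted(flaky, key=lambda pair: -pair[1])
```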
DORA Metrics (DevOps Research & Assessment)¶
Purpose: Measure software delivery performance using industry-standard DORA metrics to benchmark ATP against elite-performing teams.
DORA Metric Definitions:
| Metric | Definition | ATP Target | Industry Elite | Current Performance | Classification |
|---|---|---|---|---|---|
| Deployment Frequency | How often code is deployed to production | >1/week | On-demand (multiple/day) | 12.3/month (~3/week) | Elite ✅ |
| Lead Time for Changes | Time from commit to production deployment | <7 days | <1 day | 3.2 days | High ⚠️ |
| Mean Time to Recovery (MTTR) | Time to restore service after incident | <1 hour | <1 hour | 2.1 hours | High ⚠️ |
| Change Failure Rate | Percentage of deployments causing incidents | <5% | <5% | 3.9% | Elite ✅ |
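The classification bands in this table can be captured as simple threshold functions. The sketch below is illustrative Python (the function names are hypothetical) using the same bands as the KQL calculations that follow.

```python
def classify_lead_time(hours):
    """DORA lead-time band, matching the thresholds in this document's KQL."""
    if hours < 24:
        return "Elite (<1 day)"
    if hours < 168:
        return "High (1-7 days)"
    if hours < 720:
        return "Medium (1-30 days)"
    return "Low (>30 days)"

def classify_mttr(minutes):
    """DORA MTTR band: Elite <1h, High <24h, else Medium."""
    if minutes < 60:
        return "Elite (<1 hour)"
    if minutes < 1440:
        return "High (1-24 hours)"
    return "Medium (>24 hours)"

# 3.2-day lead time and 2.1-hour MTTR both fall in the High band
print(classify_lead_time(3.2 * 24), classify_mttr(2.1 * 60))
```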
DORA Metrics Calculation (KQL):
// Deployment Frequency (Deployments per month)
Deployment
| where DeploymentTime >= ago(90d)
| where Environment == "Production"
| where Project == "ConnectSoft.ATP"
| summarize DeploymentCount = count() by bin(DeploymentTime, 30d)
| extend DeploymentsPerMonth = DeploymentCount
| project
Month = format_datetime(DeploymentTime, 'yyyy-MM'),
DeploymentsPerMonth
| order by Month desc
// Lead Time for Changes (Commit → Production)
Build
| where QueueTime >= ago(90d)
| where Repository startswith "ConnectSoft.ATP"
| where Result == "succeeded"
| join kind=inner (
Deployment
| where Environment == "Production"
) on BuildNumber
| extend LeadTimeHours = datetime_diff('hour', DeploymentTime, todatetime(SourceVersion.CommitTime))
| summarize
AvgLeadTimeHours = round(avg(LeadTimeHours), 2),
P50LeadTimeHours = round(percentile(LeadTimeHours, 50), 2),
P95LeadTimeHours = round(percentile(LeadTimeHours, 95), 2)
by bin(QueueTime, 30d)
| extend
AvgLeadTimeDays = round(AvgLeadTimeHours / 24, 1),
Classification = case(
AvgLeadTimeHours < 24, "Elite (<1 day)",
AvgLeadTimeHours < 168, "High (1-7 days)",
AvgLeadTimeHours < 720, "Medium (1-30 days)",
"Low (>30 days)"
)
| project
Month = format_datetime(QueueTime, 'yyyy-MM'),
AvgLeadTimeDays,
P50LeadTimeHours,
P95LeadTimeHours,
Classification
| order by Month desc
// Mean Time to Recovery (MTTR)
Incident
| where CreatedDate >= ago(90d)
| where Project == "ConnectSoft.ATP"
| where Severity in ("P1", "P2")
| extend RecoveryTimeMinutes = datetime_diff('minute', ResolvedDate, CreatedDate)
| summarize
AvgMTTRMinutes = round(avg(RecoveryTimeMinutes), 2),
P50MTTRMinutes = round(percentile(RecoveryTimeMinutes, 50), 2),
P95MTTRMinutes = round(percentile(RecoveryTimeMinutes, 95), 2),
IncidentCount = count()
by bin(CreatedDate, 30d)
| extend
AvgMTTRHours = round(AvgMTTRMinutes / 60, 1),
Classification = case(
AvgMTTRMinutes < 60, "Elite (<1 hour)",
AvgMTTRMinutes < 1440, "High (1-24 hours)",
"Medium (>24 hours)"
)
| project
Month = format_datetime(CreatedDate, 'yyyy-MM'),
IncidentCount,
AvgMTTRHours,
P50MTTRMinutes,
P95MTTRMinutes,
Classification
| order by Month desc
// Change Failure Rate (Deployments → Incidents)
Deployment
| where DeploymentTime >= ago(90d)
| where Environment == "Production"
| join kind=leftouter (
Incident
| where Severity in ("P1", "P2")
| extend DeploymentCausedIncident = true
) on DeploymentId
| summarize
TotalDeployments = count(),
FailedDeployments = countif(DeploymentCausedIncident == true)
by bin(DeploymentTime, 30d)
| extend ChangeFailureRate = round((todouble(FailedDeployments) / TotalDeployments) * 100, 2)
| extend
Classification = case(
ChangeFailureRate < 5, "Elite (<5%)",
ChangeFailureRate < 15, "High (5-15%)",
"Medium (>15%)"
)
| project
Month = format_datetime(DeploymentTime, 'yyyy-MM'),
TotalDeployments,
FailedDeployments,
ChangeFailureRate,
Classification
| order by Month desc
Alerting on Gate Failures¶
Purpose: Provide immediate feedback when quality gates fail, enabling rapid remediation and preventing quality regressions.
Alert Configuration Matrix:
| Gate Failure | Severity | Channel | Recipients | SLA | Escalation |
|---|---|---|---|---|---|
| Build Failure | Medium | Slack #atp-builds | Team lead, build author | 4 hours | Architect (8h) |
| Test Failure (>5%) | High | Slack + Email | Team lead, QA lead | 2 hours | Architect (4h) |
| Coverage Drop (>5%) | Medium | Email + Slack | Team lead, architect | 1 business day | Weekly review |
| Security Critical | Critical | PagerDuty + Slack + Email | Security team, architect, SRE | 1 hour | CISO (2h) |
| Security High | High | Email + Slack | Security team, team lead | 24 hours | Security Officer (48h) |
| SBOM Generation Failed | High | Email | Team lead, compliance officer | 4 hours | Compliance team (8h) |
| API Breaking Change | Critical | Slack + Email | Architect, API team | 2 hours | CTO (4h) |
| Schema Breaking Change | Critical | Slack + Email | Architect, integration team | 2 hours | CTO (4h) |
| Deployment Failure | Critical | PagerDuty + Slack | SRE on-call, team lead | 15 minutes | SRE Lead (30m) |
| Health Check Failure | Critical | PagerDuty | SRE on-call | 5 minutes | SRE Lead (15m) |
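As a sketch of how the matrix above might be applied programmatically, the following illustrative Python encodes a few rows as a lookup table (`ALERT_ROUTING` and `route_alert` are hypothetical names, not part of the Azure Monitor configuration):

```python
# Hypothetical routing table derived from the matrix above:
# gate type -> (severity, notification channels, first-response SLA)
ALERT_ROUTING = {
    "BuildFailure":       ("Medium",   ["slack"],                       "4h"),
    "SecurityCritical":   ("Critical", ["pagerduty", "slack", "email"], "1h"),
    "DeploymentFailure":  ("Critical", ["pagerduty", "slack"],          "15m"),
    "HealthCheckFailure": ("Critical", ["pagerduty"],                   "5m"),
}

def route_alert(gate_type):
    """Return (severity, channels, sla) for a gate failure; default to a
    medium-severity Slack notification for gates not listed here."""
    return ALERT_ROUTING.get(gate_type, ("Medium", ["slack"], "4h"))

print(route_alert("SecurityCritical"))
```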
Alert Routing Configuration (Azure Monitor):
{
"name": "ATP Quality Gate Alerts",
"description": "Alert rules for quality gate failures",
"actionGroups": [
{
"name": "ATP-Team-Lead",
"shortName": "ATPLead",
"emailReceivers": [
{ "name": "Team Lead", "emailAddress": "atp-lead@connectsoft.example" }
],
"smsReceivers": [],
"webhookReceivers": [
{
"name": "Slack-ATP-Builds",
"serviceUri": "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX"
}
]
},
{
"name": "ATP-Security-Team",
"shortName": "ATPSec",
"emailReceivers": [
{ "name": "Security Team", "emailAddress": "security@connectsoft.example" }
],
"azureFunctionReceivers": [
{
"name": "PagerDuty-Integration",
"functionAppResourceId": "/subscriptions/.../resourceGroups/ATP-Prod-RG/providers/Microsoft.Web/sites/atp-pagerduty-function",
"functionName": "SendToPagerDuty",
"httpTriggerUrl": "https://atp-pagerduty-function.azurewebsites.net/api/SendToPagerDuty"
}
]
},
{
"name": "ATP-SRE-On-Call",
"shortName": "ATPSRE",
"emailReceivers": [
{ "name": "SRE On-Call", "emailAddress": "sre-oncall@connectsoft.example" }
],
"azureFunctionReceivers": [
{
"name": "PagerDuty-SRE",
"functionAppResourceId": "/subscriptions/.../resourceGroups/ATP-Prod-RG/providers/Microsoft.Web/sites/atp-pagerduty-function",
"functionName": "SendToPagerDuty",
"httpTriggerUrl": "https://atp-pagerduty-function.azurewebsites.net/api/SendToPagerDuty"
}
],
"smsReceivers": [
{ "name": "SRE On-Call Mobile", "phoneNumber": "+1234567890" }
]
}
],
"alertRules": [
{
"name": "Build-Failure-Alert",
"description": "Alert when ATP build fails",
"severity": 2,
"enabled": true,
"query": "Build | where Repository startswith 'ConnectSoft.ATP' | where Result != 'succeeded'",
"frequency": "PT5M",
"timeWindow": "PT5M",
"actionGroups": ["ATP-Team-Lead"],
"throttling": "PT1H"
},
{
"name": "Security-Critical-Vulnerability-Alert",
"description": "Alert when critical vulnerability detected",
"severity": 0,
"enabled": true,
"query": "SecurityScan | where Severity == 'Critical' | where ScanDate >= ago(5m)",
"frequency": "PT5M",
"timeWindow": "PT5M",
"actionGroups": ["ATP-Security-Team"],
"throttling": "PT15M"
},
{
"name": "Coverage-Drop-Alert",
"description": "Alert when coverage drops >5% from baseline",
"severity": 2,
"enabled": true,
"query": "CodeCoverage | where LineCoveragePercent < (prev(LineCoveragePercent) - 5)",
"frequency": "PT1H",
"timeWindow": "PT1H",
"actionGroups": ["ATP-Team-Lead"],
"throttling": "PT24H"
},
{
"name": "Deployment-Failure-Alert",
"description": "Alert when production deployment fails",
"severity": 0,
"enabled": true,
"query": "Deployment | where Environment == 'Production' | where Result != 'succeeded'",
"frequency": "PT1M",
"timeWindow": "PT5M",
"actionGroups": ["ATP-SRE-On-Call"],
"throttling": "PT5M"
}
]
}
Slack Alert Integration (C# Azure Function):
// SendSlackAlert.cs — Send quality gate failure alerts to Slack
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;
public static class SendSlackAlert
{
private static readonly HttpClient HttpClient = new HttpClient();
[FunctionName("SendSlackAlert")]
public static async Task Run(
[QueueTrigger("quality-gate-alerts")] QualityGateAlert alert,
ILogger log)
{
log.LogInformation($"Sending Slack alert for {alert.GateType} failure");
var slackWebhookUrl = Environment.GetEnvironmentVariable("SLACK_WEBHOOK_URL");
// Build Slack message
var slackMessage = new
{
text = $"⚠️ *Quality Gate Failure*: {alert.GateType}",
blocks = new object[]
{
new
{
type = "header",
text = new
{
type = "plain_text",
text = $"🚨 Quality Gate Failure: {alert.GateType}"
}
},
new
{
type = "section",
fields = new[]
{
new { type = "mrkdwn", text = $"*Service:*\n{alert.Service}" },
new { type = "mrkdwn", text = $"*Build:*\n{alert.BuildNumber}" },
new { type = "mrkdwn", text = $"*Gate:*\n{alert.GateType}" },
new { type = "mrkdwn", text = $"*Severity:*\n{alert.Severity}" }
}
},
new
{
type = "section",
text = new
{
type = "mrkdwn",
text = $"*Failure Reason:*\n```{alert.FailureReason}```"
}
},
new
{
type = "section",
text = new
{
type = "mrkdwn",
text = $"*Remediation:*\n{alert.RemediationGuidance}"
}
},
new
{
type = "actions",
elements = new object[]
{
new
{
type = "button",
text = new { type = "plain_text", text = "View Build" },
url = alert.BuildUrl,
style = "primary"
},
new
{
type = "button",
text = new { type = "plain_text", text = "View Logs" },
url = alert.LogsUrl
}
}
}
}
};
var json = JsonSerializer.Serialize(slackMessage);
var content = new StringContent(json, Encoding.UTF8, "application/json");
var response = await HttpClient.PostAsync(slackWebhookUrl, content);
response.EnsureSuccessStatusCode();
log.LogInformation($"Slack alert sent successfully for {alert.GateType}");
}
}
public class QualityGateAlert
{
public string Service { get; set; }
public string BuildNumber { get; set; }
public string GateType { get; set; }
public string Severity { get; set; }
public string FailureReason { get; set; }
public string RemediationGuidance { get; set; }
public string BuildUrl { get; set; }
public string LogsUrl { get; set; }
}
Continuous Improvement Framework¶
Purpose: Use quality metrics to drive quarterly improvement cycles with measurable outcomes and accountability.
Improvement Process:
graph TD
A[Monthly Metrics Review] --> B{Metrics Below Target?}
B -->|No| C[Continue Monitoring]
B -->|Yes| D[Root Cause Analysis]
D --> E[Identify Improvement Areas]
E --> F[Create Improvement Backlog]
F --> G[Prioritize by Impact]
G --> H[Quarterly Planning]
H --> I[Assign Improvement Epics]
I --> J[Implement Improvements]
J --> K[Measure Impact]
K --> L{Target Achieved?}
L -->|Yes| M[Document Success]
L -->|No| N[Iterate on Solution]
M --> A
N --> E
C --> A
style D fill:#feca57
style E fill:#feca57
style M fill:#90EE90
Monthly Quality Review Meeting:
Cadence: First Tuesday of each month, 10:00 AM
Duration: 60 minutes
Attendees: Architects, Team Leads, QA Lead, SRE Lead
Agenda:
1. Metrics Review (20 minutes)
    - Build success rate, test pass rate, coverage trends
    - Security scan results, vulnerability trends
    - Deployment success rate, DORA metrics
2. Quality Gate Violations (15 minutes)
    - Top 5 failure reasons (coverage, security, tests, API breaking changes)
    - Mean time to fix (MTTF) trend
    - Repeat offenders (same failures across builds)
3. Improvement Opportunities (15 minutes)
    - Metrics below target (identify root causes)
    - Process bottlenecks (manual approval delays, test timeouts)
    - Tool enhancements (better linters, faster test execution)
4. Action Items (10 minutes)
    - Assign improvement epics to teams
    - Set measurable targets for next month
    - Review progress on previous action items
Quality Improvement Backlog (Azure DevOps Queries):
// Quality Improvement Work Items
WorkItem
| where WorkItemType == "Epic"
| where Tags contains "QualityImprovement"
| where State in ("New", "Active")
| summarize
TotalEpics = count(),
InProgressEpics = countif(State == "Active"),
CompletedEpics = countif(State == "Closed")
by AssignedTo
| extend CompletionRate = round((todouble(CompletedEpics) / TotalEpics) * 100, 2)
| project
Owner = AssignedTo,
TotalEpics,
InProgressEpics,
CompletedEpics,
CompletionRate
| order by CompletionRate desc
Quarterly Improvement Roadmap (Example):
| Quarter | Focus Area | Initiatives | Target Metric Improvement | Owner |
|---|---|---|---|---|
| Q1 2025 | Test Coverage | Add unit tests for uncovered paths; improve integration tests | Coverage 70% → 80% | QA Lead |
| Q2 2025 | Security Posture | Upgrade vulnerable dependencies; implement secrets rotation | High vulnerabilities 5 → 0 | Security Team |
| Q3 2025 | Build Performance | Parallelize test execution; optimize Docker layer caching | Build duration 8min → 5min | Platform Team |
| Q4 2025 | Deployment Reliability | Implement automated rollback; enhance health checks | Deployment success 96% → 99% | SRE Team |
Quality Trend Analysis & Predictions¶
Purpose: Use historical data to predict future quality trends and proactively address issues before they impact production.
Trend Analysis Script (Python):
#!/usr/bin/env python3
# scripts/analyze-quality-trends.py
import pandas as pd
import numpy as np
from scipy.stats import linregress
from datetime import datetime, timedelta
import json
def analyze_build_success_trend(builds_df):
"""
Analyze build success rate trend using linear regression.
Predict success rate for next 30 days.
"""
builds_df['date'] = pd.to_datetime(builds_df['QueueTime']).dt.normalize()  # truncate timestamps to whole days for daily grouping
builds_df['date_numeric'] = (builds_df['date'] - builds_df['date'].min()).dt.days
# Calculate success rate per day
daily_success = builds_df.groupby('date').agg({
'Result': lambda x: (x == 'succeeded').sum() / len(x) * 100
}).reset_index()
daily_success.columns = ['date', 'success_rate']
daily_success['date_numeric'] = (daily_success['date'] - daily_success['date'].min()).dt.days
# Linear regression
slope, intercept, r_value, p_value, std_err = linregress(
daily_success['date_numeric'],
daily_success['success_rate']
)
# Predict next 30 days
last_date_numeric = daily_success['date_numeric'].max()
future_dates = range(last_date_numeric + 1, last_date_numeric + 31)
predictions = [slope * d + intercept for d in future_dates]
# Determine trend classification
if slope > 0.1:
trend = "↗️ Improving"
recommendation = "Continue current practices; success rate trending upward"
elif slope < -0.1:
trend = "⚠️ Regressing"
recommendation = "URGENT: Investigate root causes of declining build success"
else:
trend = "→ Stable"
recommendation = "Maintain current quality standards"
return {
"metric": "Build Success Rate",
"current": round(daily_success['success_rate'].iloc[-1], 2),
"trend": trend,
"slope": round(slope, 4),
"r_squared": round(r_value ** 2, 4),
"predicted_30d": round(predictions[-1], 2),
"recommendation": recommendation
}
def analyze_coverage_trend(coverage_df):
"""Analyze code coverage trend and predict future coverage."""
coverage_df['date'] = pd.to_datetime(coverage_df['BuildCompletedDate'])
coverage_df['date_numeric'] = (coverage_df['date'] - coverage_df['date'].min()).dt.days
# Linear regression on line coverage
slope, intercept, r_value, _, _ = linregress(
coverage_df['date_numeric'],
coverage_df['LineCoveragePercent']
)
# Predict next 30 days
last_date_numeric = coverage_df['date_numeric'].max()
predicted_coverage = slope * (last_date_numeric + 30) + intercept
# Determine if coverage will meet target (70%) in next 90 days,
# projected from the most recent trend value (not the regression intercept)
current_trend_value = slope * last_date_numeric + intercept
days_to_target = (70 - current_trend_value) / slope if slope > 0 else -1
return {
"metric": "Code Coverage",
"current": round(coverage_df['LineCoveragePercent'].iloc[-1], 2),
"target": 70.0,
"trend_slope": round(slope, 4),
"predicted_30d": round(predicted_coverage, 2),
"days_to_target": int(days_to_target) if days_to_target > 0 else "N/A",
"recommendation": f"At current rate, will reach 70% target in {int(days_to_target)} days" if days_to_target > 0 else "Increase test coverage velocity"
}
def main():
# Load data from Azure DevOps Analytics API (example)
builds_df = pd.read_json("builds.json")
coverage_df = pd.read_json("coverage.json")
# Analyze trends
build_trend = analyze_build_success_trend(builds_df)
coverage_trend = analyze_coverage_trend(coverage_df)
# Generate report
report = {
"generated_at": datetime.utcnow().isoformat(),
"trends": [build_trend, coverage_trend],
"summary": {
"metrics_improving": sum(1 for t in [build_trend, coverage_trend] if "Improving" in t.get("trend", "")),
"metrics_regressing": sum(1 for t in [build_trend, coverage_trend] if "Regressing" in t.get("trend", ""))
}
}
# Output report
print(json.dumps(report, indent=2))
# Exit with error if metrics regressing
if report["summary"]["metrics_regressing"] > 0:
print("\n⚠️ WARNING: Some metrics are regressing. Review recommendations.")
exit(1)
print("\n✅ Quality trends are positive or stable.")
exit(0)
if __name__ == "__main__":
main()
Summary¶
- Quality Gate Metrics: 15 tracked metrics (build success, test pass rate, coverage, security scan, SBOM, deployment, flaky tests, MTTR, API/schema changes, vulnerabilities, compliance)
- Metrics Scorecard: Current values, targets, trends (improving/stable/regressing), blocker status, measurement frequency
- KQL Queries: 8 detailed queries (build success rate, test coverage trend, security vulnerabilities, flaky test detection, DORA metrics)
- Azure DevOps Dashboard: 5-widget configuration (build health, test coverage, security posture, DORA metrics, quality gate violations)
- DORA Metrics: 4 metrics (deployment frequency 12.3/month High, lead time 3.2 days High, MTTR 2.1 hours High, change failure rate 3.9% Elite)
- Alert Configuration: 10 alert types with severity, channels (Slack/Email/PagerDuty), recipients, SLAs, escalation paths
- Alert Routing: Azure Monitor action groups (ATP-Team-Lead, ATP-Security-Team, ATP-SRE-On-Call) with email/SMS/webhook/Azure Function receivers
- Slack Integration: C# Azure Function sending rich Slack messages with failure details, remediation guidance, action buttons
- Continuous Improvement Framework: Monthly quality review meetings, quarterly roadmap (Q1-Q4 2025), improvement backlog tracking
- Trend Analysis: Python script using linear regression to predict quality trends, identify regressions, provide recommendations
Remediation & Continuous Improvement¶
Purpose: Provide systematic approach to resolving quality gate violations, preventing recurrence, and continuously raising quality standards through data-driven threshold ratcheting.
Violation Response Workflow:
graph TD
A[Quality Gate Failure Detected] --> B[Alert Sent to Team]
B --> C[Developer/SRE Triage]
C --> D{Root Cause Identified?}
D -->|No| E[Escalate to Architect]
D -->|Yes| F{Fix Type?}
E --> D
F -->|Code Fix| G[Implement Code Changes]
F -->|Dependency Update| H[Update Dependencies]
F -->|Risk Acceptance| I[Create Risk Acceptance Record]
F -->|Threshold Adjustment| J[Propose Threshold Change]
G --> K[Re-run Pipeline]
H --> K
I --> L[Document in ADR]
J --> M[Quality Gate Retrospective]
K --> N{Gate Passed?}
N -->|No| C
N -->|Yes| O[Verify Fix]
L --> O
M --> O
O --> P[Document Lessons Learned]
P --> Q[Update Runbook]
Q --> R[Close Incident]
style A fill:#ff6b6b
style N fill:#feca57
style R fill:#90EE90
Step 1: Detect — Pipeline fails with clear error message
Error Message Format (Standardized):
========================================
❌ QUALITY GATE FAILURE
========================================
Gate: Test Coverage
Service: ConnectSoft.ATP.Ingestion
Build: 1.0.42
Threshold: 75% line coverage
Actual: 72.3% line coverage
Difference: -2.7%
📊 Coverage by Module:
• Controllers: 85.2% ✅
• Services: 78.9% ✅
• Repositories: 68.4% ❌ (below threshold)
• Models: 95.6% ✅
🔍 Remediation Guidance:
1. Add unit tests for uncovered repository methods
2. Focus on Repositories/AuditEventRepository.cs (52% coverage)
3. Run: dotnet test --collect:"XPlat Code Coverage" --filter FullyQualifiedName~Repository
4. Exclude generated code if necessary (update .runsettings)
📚 Documentation:
• Coverage policy: docs/ci-cd/quality-gates.md#test-coverage-gates
• Runbook: docs/operations/runbooks/coverage-failure.md
⏱️ Estimated Fix Time: 2-4 hours
========================================
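A minimal sketch of how a pipeline script might render the header portion of this standardized message (the `format_gate_failure` helper is hypothetical; the real message also includes the module breakdown, remediation guidance, and documentation links shown above):

```python
def format_gate_failure(gate, service, build, threshold, actual, unit="%"):
    """Render the header of the standardized quality gate failure message.
    Illustrative sketch only; parameters and name are hypothetical."""
    diff = round(actual - threshold, 1)
    return (
        f"❌ QUALITY GATE FAILURE\n"
        f"Gate: {gate}\n"
        f"Service: {service}\n"
        f"Build: {build}\n"
        f"Threshold: {threshold}{unit}\n"
        f"Actual: {actual}{unit}\n"
        f"Difference: {diff:+}{unit}"
    )

print(format_gate_failure("Test Coverage", "ConnectSoft.ATP.Ingestion",
                          "1.0.42", 75, 72.3))
```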
Step 2: Triage — Developer or SRE investigates root cause
Triage Checklist:
## Quality Gate Failure Triage
**Gate**: ___________________
**Build**: ___________________
**Assignee**: ___________________
**Triage Start**: ___________________
### Initial Assessment
- [ ] Error message reviewed
- [ ] Build logs analyzed
- [ ] Previous successful build identified (for comparison)
- [ ] Recent code changes reviewed (Git diff)
### Root Cause Analysis
- [ ] Root cause identified: ___________________
- [ ] Contributing factors: ___________________
- [ ] Similar failures in history? ___________________
### Fix Strategy
- [ ] **Code Fix** — Implement missing tests/fix bugs
- [ ] **Dependency Update** — Upgrade/downgrade package
- [ ] **Risk Acceptance** — Document accepted risk (with justification)
- [ ] **Threshold Adjustment** — Propose threshold change (with retrospective)
- [ ] **False Positive** — Report tool issue for investigation
### Estimated Time to Fix
- [ ] <4 hours (immediate fix)
- [ ] 4-8 hours (within sprint)
- [ ] >8 hours (requires spike/research)
**Triage Completed**: ___________________
**Next Action**: ___________________
Step 3: Fix — Code changes, dependency updates, or risk acceptance
Fix Implementation Patterns:
| Failure Type | Fix Pattern | Example | Estimated Time |
|---|---|---|---|
| Coverage Below Threshold | Add unit tests for uncovered code | xUnit tests for repository methods | 2-4 hours |
| Test Failure | Fix bug or update test expectations | Fix race condition in integration test | 1-8 hours |
| Security Critical | Upgrade dependency or patch code | Upgrade System.Text.Json 6.0 → 8.0 | 1-2 hours |
| Security High | Upgrade dependency or accept risk | Suppress false positive with justification | 2-4 hours |
| API Breaking Change | Create /v2/ endpoint or revert | Implement /v2/audit-events with new schema | 1-2 days |
| SBOM Generation Failed | Fix project references or restore | Repair NuGet package references | 30 minutes |
| OpenTelemetry Missing | Add ActivitySource instrumentation | Add using statement + activity creation | 1-2 hours |
| Health Check Failed | Fix dependency connection or timeout | Increase database health check timeout | 30 minutes |
Risk Acceptance Record (ADR Template):
# ADR-XXX: Risk Acceptance — [Vulnerability/Issue Description]
## Status
**Accepted** — Date: YYYY-MM-DD
**Expires**: YYYY-MM-DD (6 months from acceptance)
## Context
**Vulnerability**: CVE-XXXX-XXXXX
**Severity**: High (CVSS 7.8)
**Affected Package**: Newtonsoft.Json 12.0.3
**Exploitability**: Low (requires authenticated admin access)
## Decision
Accept risk for 6 months due to:
1. No patch available from vendor (Microsoft investigating)
2. Exploitability requires admin credentials (mitigated by RBAC)
3. Breaking change to migrate to System.Text.Json (requires 2-week refactoring)
## Mitigation
- [ ] Enable Azure Firewall rules to block external access
- [ ] Add runtime validation to reject malicious JSON payloads
- [ ] Monitor vendor advisory for patch availability
- [ ] Schedule migration to System.Text.Json for Q2 2025
## Acceptance Criteria
- Patch becomes available → Apply immediately
- 6 months elapse → Escalate to CTO for re-acceptance or mandatory fix
- Exploitation detected in wild → Immediate hotfix required
**Accepted By**: [Architect Name]
**Reviewed By**: [Security Officer Name]
**Escalation Contact**: [CTO Email]
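The expiry rules in this template lend themselves to automation. A minimal Python sketch, assuming a 6-month validity window (approximated as 183 days) and a 30-day warning period; both parameters are illustrative:

```python
from datetime import date, timedelta

def acceptance_status(accepted_on, today, validity_days=183):
    """Classify a risk acceptance record per the ADR template above:
    valid until ~6 months after acceptance, warn in the final 30 days,
    then escalate for re-acceptance or mandatory fix."""
    expires = accepted_on + timedelta(days=validity_days)
    if today >= expires:
        return "expired-escalate"
    if (expires - today).days <= 30:
        return "expiring-soon"
    return "valid"

print(acceptance_status(date(2025, 1, 10), date(2025, 3, 1)))
```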
Step 4: Verify — Re-run pipeline; ensure gate passes
Verification Script (PowerShell):
# scripts/verify-quality-gate-fix.ps1
param(
[Parameter(Mandatory=$true)]
[string]$BuildId,
[Parameter(Mandatory=$true)]
[string]$GateType # Coverage, Security, Test, etc.
)
Write-Host "🔍 Verifying quality gate fix for build $BuildId..." -ForegroundColor Cyan
# Query Azure DevOps Build API
$azureDevOpsUrl = $env:AZURE_DEVOPS_URL
$pat = $env:AZURE_DEVOPS_PAT
$headers = @{
Authorization = "Basic " + [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(":$pat"))
}
$buildUrl = "$azureDevOpsUrl/_apis/build/builds/$BuildId`?api-version=7.0"
$build = Invoke-RestMethod -Uri $buildUrl -Headers $headers -Method Get
# Check if build succeeded
if ($build.result -eq "succeeded") {
Write-Host "✅ Build passed: $($build.buildNumber)" -ForegroundColor Green
# Verify specific gate passed
switch ($GateType) {
"Coverage" {
$coverageUrl = "$azureDevOpsUrl/_apis/test/CodeCoverage?buildId=$BuildId&api-version=7.0"
$coverage = Invoke-RestMethod -Uri $coverageUrl -Headers $headers -Method Get
$lineCoverage = $coverage.coverageData[0].coverageStats | Where-Object { $_.label -eq "Lines" } | Select-Object -ExpandProperty covered
$totalLines = $coverage.coverageData[0].coverageStats | Where-Object { $_.label -eq "Lines" } | Select-Object -ExpandProperty total
$coveragePercent = [math]::Round(($lineCoverage / $totalLines) * 100, 2)
Write-Host " Coverage: $coveragePercent% (threshold: 75%)" -ForegroundColor Green
}
"Security" {
Write-Host " Security scan passed (no critical/high vulnerabilities)" -ForegroundColor Green
}
"Test" {
$testUrl = "$azureDevOpsUrl/_apis/test/ResultSummaryByBuild?buildId=$BuildId&api-version=7.0"
$testResults = Invoke-RestMethod -Uri $testUrl -Headers $headers -Method Get
$passRate = [math]::Round(($testResults.aggregatedResultsAnalysis.totalTests - $testResults.aggregatedResultsAnalysis.resultsDifference.failureCount) / $testResults.aggregatedResultsAnalysis.totalTests * 100, 2)
Write-Host " Test pass rate: $passRate% (threshold: 100%)" -ForegroundColor Green
}
}
Write-Host "`n✅ Quality gate fix verified successfully" -ForegroundColor Green
exit 0
} else {
Write-Host "❌ Build failed: $($build.result)" -ForegroundColor Red
Write-Host " Review build logs for details" -ForegroundColor Yellow
exit 1
}
Step 5: Document — Update ADR if architectural decision required
Lessons Learned Template:
# Lessons Learned — Quality Gate Failure [Build Number]
**Date**: YYYY-MM-DD
**Gate**: [Gate Type]
**Build**: [Build Number]
**Service**: [Service Name]
**Time to Fix**: [X hours/days]
## Failure Summary
**Error Message**: [Exact error from pipeline]
**Root Cause**: [Technical root cause]
**Impact**: [Build blocked, deployment delayed, etc.]
## Resolution
**Fix Applied**: [Code changes, dependency updates, configuration changes]
**Verification**: [How fix was verified]
**Pull Request**: #[PR number]
## Prevention
**Process Improvement**: [Changes to prevent recurrence]
**Automation Enhancement**: [New linter rules, pre-commit hooks, etc.]
**Documentation Update**: [Updated runbooks, ADRs, etc.]
## Action Items
- [ ] Update runbook: docs/operations/runbooks/[gate-type]-failure.md
- [ ] Add pre-commit hook to catch issue locally
- [ ] Share lessons learned in team meeting
**Author**: [Developer Name]
**Reviewed By**: [Tech Lead Name]
Ratcheting Thresholds¶
Purpose: Continuously raise quality standards by incrementally increasing thresholds as team capabilities improve, preventing quality regression.
Ratcheting Strategy:
| Threshold Type | Current | Q1 2025 Target | Q2 2025 Target | Q3 2025 Target | Rationale |
|---|---|---|---|---|---|
| Line Coverage (Ingestion) | 75% | 77% | 80% | 82% | Sustained improvement; add 2% per quarter |
| Line Coverage (Query) | 80% | 82% | 85% | 87% | Complex query logic requires higher coverage |
| Branch Coverage (All) | 60% | 62% | 65% | 67% | Improve conditional logic testing |
| Critical Vulnerabilities | 0 | 0 | 0 | 0 | Zero tolerance maintained |
| High Vulnerabilities | 0 | 0 | 0 | 0 | Ratchet down from current 1 accepted risk |
| Medium Vulnerabilities | <10 | <8 | <5 | <3 | Gradual reduction; backlog cleanup |
| Flaky Test Rate | <2% | <1.5% | <1% | <0.5% | Improve test reliability |
| Mean Time to Fix Gate | <4h | <3h | <2h | <1h | Faster remediation through automation |
Ratcheting Automation (C# Azure Function):
// RatchetQualityThresholds.cs — Quarterly threshold adjustment
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
public static class RatchetQualityThresholds
{
[FunctionName("RatchetQualityThresholds")]
public static async Task Run(
[TimerTrigger("0 0 0 1 1,4,7,10 *")] TimerInfo timer, // Quarterly: Jan 1, Apr 1, Jul 1, Oct 1
ILogger log)
{
log.LogInformation("Evaluating quality threshold ratcheting for quarter");
var currentQuarter = GetCurrentQuarter();
// Get historical metrics (last 90 days)
var metrics = await GetHistoricalMetricsAsync();
// Evaluate ratcheting eligibility per service
var ratchetRecommendations = new List<RatchetRecommendation>();
foreach (var service in metrics.GroupBy(m => m.Service))
{
var serviceName = service.Key;
var serviceMetrics = service.ToList();
// Calculate sustained performance (avg last 90 days)
var avgLineCoverage = serviceMetrics.Average(m => m.LineCoverage);
var avgBranchCoverage = serviceMetrics.Average(m => m.BranchCoverage);
var currentThreshold = GetCurrentThreshold(serviceName);
log.LogInformation($"Service: {serviceName}, Avg Coverage: {avgLineCoverage:F2}%, Current Threshold: {currentThreshold}%");
// Ratchet if sustained performance exceeds threshold by 5%
if (avgLineCoverage >= currentThreshold + 5)
{
var newThreshold = currentThreshold + 2; // Ratchet up by 2%
ratchetRecommendations.Add(new RatchetRecommendation
{
Service = serviceName,
MetricType = "LineCoverage",
CurrentThreshold = currentThreshold,
NewThreshold = newThreshold,
SustainedPerformance = avgLineCoverage,
Justification = $"Sustained performance of {avgLineCoverage:F2}% exceeds threshold by {(avgLineCoverage - currentThreshold):F2}%",
ApprovalRequired = true
});
log.LogInformation($" ✅ Ratchet recommendation: {currentThreshold}% → {newThreshold}%");
}
else if (avgLineCoverage < currentThreshold)
{
log.LogWarning($" ⚠️ Performance below threshold: {avgLineCoverage:F2}% < {currentThreshold}%");
}
else
{
log.LogInformation($" → Threshold maintained (performance within 5% of threshold)");
}
}
// Create work items for threshold adjustments
if (ratchetRecommendations.Any())
{
await CreateRatchetWorkItemsAsync(ratchetRecommendations, currentQuarter);
log.LogInformation($"Created {ratchetRecommendations.Count} ratchet recommendation work items for Q{currentQuarter} {DateTime.UtcNow.Year}");
}
else
{
log.LogInformation("No ratchet recommendations for this quarter");
}
}
private static int GetCurrentQuarter()
{
var month = DateTime.UtcNow.Month;
return (month - 1) / 3 + 1;
}
private static async Task<List<QualityMetric>> GetHistoricalMetricsAsync()
{
// Query Azure DevOps Analytics API for last 90 days
// Implementation omitted for brevity
throw new NotImplementedException();
}
private static double GetCurrentThreshold(string serviceName)
{
// Retrieve current threshold from configuration
var thresholds = new Dictionary<string, double>
{
["ConnectSoft.ATP.Ingestion"] = 75.0,
["ConnectSoft.ATP.Query"] = 80.0,
["ConnectSoft.ATP.Integrity"] = 85.0,
["ConnectSoft.ATP.Export"] = 70.0
};
return thresholds.ContainsKey(serviceName) ? thresholds[serviceName] : 70.0;
}
private static async Task CreateRatchetWorkItemsAsync(List<RatchetRecommendation> recommendations, int quarter)
{
// Create Azure DevOps work items for architect review
// Implementation omitted for brevity
throw new NotImplementedException();
}
}
public class RatchetRecommendation
{
public string Service { get; set; }
public string MetricType { get; set; }
public double CurrentThreshold { get; set; }
public double NewThreshold { get; set; }
public double SustainedPerformance { get; set; }
public string Justification { get; set; }
public bool ApprovalRequired { get; set; }
}
public class QualityMetric
{
public string Service { get; set; }
public DateTime Date { get; set; }
public double LineCoverage { get; set; }
public double BranchCoverage { get; set; }
}
Threshold Ratcheting Policy:
# Threshold Ratcheting Policy
policy:
  # Coverage thresholds
  coverage:
    incremental: 2%               # Increase threshold by 2% per quarter
    sustainedPerformance: 5%      # Must exceed threshold by 5% for 90 days
    maxThreshold: 95%             # Cap at 95% (allow for generated code exclusions)
    reviewCadence: Quarterly
  # Security vulnerability thresholds
  security:
    critical: 0                   # Zero tolerance (never ratchet)
    high: 0                       # Zero tolerance (never ratchet)
    medium: -2                    # Reduce by 2 per quarter (if sustained)
    low: -5                       # Reduce by 5 per quarter
  # Flaky test rate
  flakyTests:
    incremental: -0.5%            # Reduce by 0.5% per quarter
    sustainedImprovement: 30 days # Must maintain improvement for 30 days
    targetRate: 0%                # Ultimate goal: zero flaky tests
  # Mean time to fix
  mttf:
    incremental: -30min           # Reduce by 30 minutes per quarter
    sustainedImprovement: 60 days
    targetTime: 1h                # Ultimate goal: fix within 1 hour
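The coverage rule above can be restated as a small decision function: ratchet only when every observation in the review window clears the current threshold by the sustained-performance margin, and never past the cap. A minimal Python sketch (illustrative only; the function and parameter names are not part of the ATP codebase):

```python
def recommend_coverage_ratchet(current_threshold, daily_coverage,
                               incremental=2.0, sustained_margin=5.0,
                               max_threshold=95.0):
    """Return the ratcheted coverage threshold, or None if performance
    was not sustained. `daily_coverage` is the observed line coverage
    for each day in the review window (e.g. 90 days)."""
    if not daily_coverage:
        return None
    # Sustained performance: every observation clears threshold + margin
    if min(daily_coverage) < current_threshold + sustained_margin:
        return None
    # Ratchet up by the quarterly increment, never past the cap
    return min(current_threshold + incremental, max_threshold)

# 75% threshold, coverage has stayed above 80% (threshold + 5% margin)
print(recommend_coverage_ratchet(75.0, [81.2, 82.0, 81.5]))  # 77.0
```

A single dip below the margin (e.g. one day at 79%) blocks the ratchet, which is the point of the sustained-performance requirement.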
Flaky Test Quarantine & Remediation¶
Purpose: Systematically eliminate flaky tests by quarantining unreliable tests and requiring fixes within 2 sprints.
Flaky Test Detection (automated daily):
// DetectFlakyTests.cs — Daily detection of unreliable tests
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
public static class DetectFlakyTests
{
[FunctionName("DetectFlakyTests")]
public static async Task Run(
[TimerTrigger("0 0 6 * * *")] TimerInfo timer, // Daily: 6:00 AM UTC
ILogger log)
{
log.LogInformation("Detecting flaky tests (last 30 days)");
// Query test results (last 30 days)
var testResults = await GetTestResultsAsync(days: 30);
// Calculate flaky score per test
var flakyTests = testResults
.GroupBy(t => t.TestCaseName)
.Where(g => g.Count() >= 10) // Only tests run at least 10 times
.Select(g => new
{
TestCaseName = g.Key,
TotalRuns = g.Count(),
PassCount = g.Count(t => t.Outcome == "Passed"),
FailCount = g.Count(t => t.Outcome == "Failed"),
FlakyScore = (double)g.Count(t => t.Outcome == "Failed") / g.Count() * 100
})
.Where(t => t.FlakyScore > 0 && t.FlakyScore < 100) // Exclude always-passing/failing
.Where(t => t.FlakyScore > 10) // Flaky if >10% failure rate
.OrderByDescending(t => t.FlakyScore)
.ToList();
log.LogInformation($"Detected {flakyTests.Count} flaky tests");
// Quarantine flaky tests (create work items)
foreach (var test in flakyTests)
{
// Check if work item already exists
var existingWorkItem = await GetExistingFlakyTestWorkItemAsync(test.TestCaseName);
if (existingWorkItem == null)
{
// Create new work item
await CreateFlakyTestWorkItemAsync(new FlakyTestWorkItem
{
Title = $"[Flaky Test] {test.TestCaseName}",
Description = $@"
## Flaky Test Detection
**Test**: `{test.TestCaseName}`
**Flaky Score**: {test.FlakyScore:F2}% ({test.FailCount}/{test.TotalRuns} runs failed)
**Detection Date**: {DateTime.UtcNow:yyyy-MM-dd}
**Deadline**: {DateTime.UtcNow.AddDays(28):yyyy-MM-dd} (2 sprints)
## Remediation Actions
- [ ] Investigate root cause (race condition, timing dependency, shared state)
- [ ] Fix test (add synchronization, isolate state, increase timeout)
- [ ] Verify fix (test passes 20+ consecutive runs)
- [ ] Remove from quarantine list
## Quarantine Status
- [ ] Test disabled in pipeline (add `[Fact(Skip=""Flaky"")]`)
- [ ] Work item assigned to original test author
- [ ] Deadline tracked (2 sprints from detection)
",
Priority = test.FlakyScore > 50 ? 1 : 2, // P1 if >50% flaky
Tags = new[] { "FlakyTest", "TechnicalDebt", "QualityImprovement" },
AssignedTo = await GetTestAuthorAsync(test.TestCaseName)
});
log.LogInformation($" ✅ Created work item for: {test.TestCaseName} (flaky score: {test.FlakyScore:F2}%)");
}
else
{
log.LogInformation($" → Work item already exists for: {test.TestCaseName}");
}
}
// Generate daily flaky test report
await GenerateFlakyTestReportAsync(flakyTests);
}
private static async Task<List<TestResult>> GetTestResultsAsync(int days)
{
// Implementation omitted for brevity
throw new NotImplementedException();
}
private static async Task<WorkItem> GetExistingFlakyTestWorkItemAsync(string testCaseName)
{
// Implementation omitted for brevity
throw new NotImplementedException();
}
private static async Task CreateFlakyTestWorkItemAsync(FlakyTestWorkItem workItem)
{
// Implementation omitted for brevity
throw new NotImplementedException();
}
private static async Task<string> GetTestAuthorAsync(string testCaseName)
{
// Git blame to find original test author
// Implementation omitted for brevity
return "unassigned";
}
private static async Task GenerateFlakyTestReportAsync(IEnumerable<object> flakyTests) // anonymous-type results convert via covariant IEnumerable<T>; List<dynamic> would not compile
{
// Implementation omitted for brevity
throw new NotImplementedException();
}
}
public class FlakyTestWorkItem
{
public string Title { get; set; }
public string Description { get; set; }
public int Priority { get; set; }
public string[] Tags { get; set; }
public string AssignedTo { get; set; }
}
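The detection rule implemented above (at least 10 recorded runs, an intermittent outcome, and a failure rate above 10%) can be stated in a few lines. A language-agnostic restatement in Python, with hypothetical helper names:

```python
def flaky_score(outcomes):
    """Failure rate (%) across a test's recorded outcomes ('Passed'/'Failed')."""
    failures = sum(1 for o in outcomes if o == "Failed")
    return failures / len(outcomes) * 100

def is_flaky(outcomes, min_runs=10, threshold=10.0):
    """Flaky = enough runs, intermittent (neither always passing nor
    always failing), and failure rate above the threshold."""
    if len(outcomes) < min_runs:
        return False
    score = flaky_score(outcomes)
    return 0 < score < 100 and score > threshold

# 2 failures in 12 runs -> 16.7% failure rate -> quarantine candidate
runs = ["Passed"] * 10 + ["Failed"] * 2
print(is_flaky(runs))  # True
```

Note that a test failing 100% of the time is not flaky, it is simply broken, and is handled by the normal pass-rate gate rather than quarantine.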
Quality Gate Retrospectives¶
Purpose: Regularly review quality gate effectiveness, identify false positives, adjust thresholds, and propose new gates based on lessons learned.
Retrospective Cadence:
- Frequency: Monthly (first Tuesday, immediately after metrics review)
- Duration: 60 minutes
- Participants: Tech Leads, QA Engineers, Security Officers, SRE
- Facilitator: Rotating (different team member each month)
Retrospective Agenda:
## Quality Gate Retrospective — [Month YYYY]
**Date**: [Date]
**Facilitator**: [Name]
**Participants**: [Names]
### 1. Metrics Review (15 minutes)
**Presented By**: Metrics Lead
- Quality scorecard review (15 metrics)
- DORA metrics update
- Quality gate violation trends
**Questions for Discussion**:
- Which metrics improved this month?
- Which metrics regressed? Root causes?
- Are we tracking the right metrics?
---
### 2. False Positives & Threshold Adjustments (20 minutes)
**Presented By**: Team Leads
**False Positives Identified**:
| Gate | False Positive Count | Root Cause | Proposed Fix |
|------|---------------------|------------|--------------|
| [Gate Type] | [Count] | [Why it failed incorrectly] | [Tool fix, threshold adjustment] |
**Threshold Adjustment Proposals**:
| Metric | Current Threshold | Proposed Threshold | Justification |
|--------|-------------------|-------------------|---------------|
| [Metric] | [Current] | [Proposed] | [Why adjustment needed] |
**Discussion**:
- Are thresholds too aggressive? Too lenient?
- Should we ratchet thresholds this quarter?
---
### 3. New Gate Proposals (15 minutes)
**Presented By**: Quality Champions
**Proposed New Gates**:
| Gate | Purpose | Enforcement Point | Blocker | Estimated Effort |
|------|---------|-------------------|---------|------------------|
| [Gate Name] | [Why needed] | [CI/CD stage] | [Yes/No] | [Hours/Days] |
**Discussion**:
- Which new gates add value without friction?
- Priority for implementation?
---
### 4. Gate Effectiveness & Developer Experience (10 minutes)
**Presented By**: Team (Open Forum)
**Questions**:
- Which gates caught real issues this month?
- Which gates caused frustration or delays?
- Are error messages clear and actionable?
- Is remediation guidance helpful?
**Feedback**:
- [Positive feedback]
- [Improvement suggestions]
---
### 5. Action Items (5 minutes)
**Facilitator**: Retrospective Lead
- [ ] Action 1: [Description] — **Owner**: [Name], **Due**: [Date]
- [ ] Action 2: [Description] — **Owner**: [Name], **Due**: [Date]
- [ ] Action 3: [Description] — **Owner**: [Name], **Due**: [Date]
---
### Retrospective Outcomes
- **Continue Doing**: [Effective practices to maintain]
- **Start Doing**: [New practices to adopt]
- **Stop Doing**: [Ineffective practices to eliminate]
**Next Retrospective**: [Date]
Retrospective Action Item Tracking (Azure DevOps):
// Quality Gate Retrospective Action Items
WorkItem
| where WorkItemType == "Task"
| where Tags contains "Retrospective"
| where Tags contains "QualityGate"
| where CreatedDate >= ago(90d)
| summarize
TotalItems = count(),
CompletedItems = countif(State == "Closed"),
InProgressItems = countif(State == "Active"),
OverdueItems = countif(State != "Closed" and DueDate < now())
by AssignedTo
| extend CompletionRate = round((todouble(CompletedItems) / TotalItems) * 100, 2)
| project
Owner = AssignedTo,
TotalItems,
CompletedItems,
InProgressItems,
OverdueItems,
CompletionRate
| order by OverdueItems desc, CompletionRate asc
Gate Effectiveness Scoring¶
Purpose: Quantify gate effectiveness to prioritize improvements and retire ineffective gates.
Effectiveness Metrics:
| Gate | True Positives | False Positives | False Negatives | Precision | Recall | F1 Score | Effectiveness |
|---|---|---|---|---|---|---|---|
| Test Coverage | 42 | 3 | 1 | 93.3% | 97.7% | 95.5% | Excellent ✅ |
| Security Scan | 35 | 8 | 0 | 81.4% | 100% | 89.7% | Good ✅ |
| API Breaking Change | 12 | 1 | 0 | 92.3% | 100% | 96.0% | Excellent ✅ |
| Flaky Test Detection | 18 | 5 | 3 | 78.3% | 85.7% | 81.8% | Good ✅ |
| Health Check | 8 | 0 | 0 | 100% | 100% | 100% | Excellent ✅ |
| Load Test | 4 | 2 | 1 | 66.7% | 80.0% | 72.7% | Acceptable ⚠️ |
Definitions:
- True Positive (TP): Gate correctly blocked a problematic build (the issue would otherwise have reached production)
- False Positive (FP): Gate incorrectly blocked a valid build (no actual issue existed)
- False Negative (FN): Gate incorrectly passed a problematic build (the issue escaped to production)
- Precision: TP / (TP + FP) — how often a gate failure indicates a real issue
- Recall: TP / (TP + FN) — how often the gate catches real issues
- F1 Score: Harmonic mean of precision and recall
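As a concrete check of the Test Coverage row above (TP=42, FP=3, FN=1), the three scores can be computed directly. A minimal Python sketch:

```python
def gate_scores(tp, fp, fn):
    """Precision, recall, and F1 (as percentages) for a quality gate."""
    precision = tp / (tp + fp) * 100 if tp + fp else 0.0
    recall = tp / (tp + fn) * 100 if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return round(precision, 1), round(recall, 1), round(f1, 1)

# Test Coverage gate: 42 true positives, 3 false positives, 1 false negative
print(gate_scores(42, 3, 1))  # (93.3, 97.7, 95.5)
```

The remaining rows of the table can be reproduced the same way, e.g. the Security Scan row (35, 8, 0) yields (81.4, 100.0, 89.7).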
Effectiveness Classification:
- Excellent (F1 ≥90%): Gate is highly effective; maintain current configuration
- Good (F1 80-89%): Gate is effective; minor tuning may improve precision
- Acceptable (F1 70-79%): Gate adds value; investigate false positives
- Poor (F1 <70%): Gate may be ineffective; consider retiring or major overhaul
Gate Effectiveness Tracking (C#):
// TrackGateEffectiveness.cs — Track true/false positives/negatives
public class GateEffectivenessTracker
{
private readonly IEffectivenessStore _cosmosClient; // assumed data-access abstraction; UpsertAsync/QueryAsync are illustrative, not raw Cosmos SDK calls
public async Task RecordGateOutcomeAsync(GateOutcome outcome)
{
var record = new GateEffectivenessRecord
{
GateType = outcome.GateType,
BuildId = outcome.BuildId,
Timestamp = DateTime.UtcNow,
GateResult = outcome.GateResult, // Passed/Failed
ActualIssue = outcome.ActualIssue, // Was there a real issue?
// Classification
OutcomeType = ClassifyOutcome(outcome.GateResult, outcome.ActualIssue),
// Context
FailureReason = outcome.FailureReason,
RemediationTime = outcome.RemediationTime,
EscapedToProduction = outcome.EscapedToProduction
};
await _cosmosClient.UpsertAsync(record);
}
private string ClassifyOutcome(string gateResult, bool actualIssue)
{
if (gateResult == "Failed" && actualIssue)
return "TruePositive"; // Gate correctly caught issue
if (gateResult == "Failed" && !actualIssue)
return "FalsePositive"; // Gate incorrectly blocked valid build
if (gateResult == "Passed" && actualIssue)
return "FalseNegative"; // Gate missed issue (escaped to production)
return "TrueNegative"; // Gate correctly passed valid build
}
public async Task<GateEffectivenessReport> CalculateEffectivenessAsync(string gateType, int days = 90)
{
var records = await _cosmosClient.QueryAsync<GateEffectivenessRecord>(
r => r.GateType == gateType && r.Timestamp >= DateTime.UtcNow.AddDays(-days)
);
var tp = records.Count(r => r.OutcomeType == "TruePositive");
var fp = records.Count(r => r.OutcomeType == "FalsePositive");
var fn = records.Count(r => r.OutcomeType == "FalseNegative");
var tn = records.Count(r => r.OutcomeType == "TrueNegative");
var precision = tp + fp > 0 ? (double)tp / (tp + fp) * 100 : 0;
var recall = tp + fn > 0 ? (double)tp / (tp + fn) * 100 : 0;
var f1Score = precision + recall > 0 ? 2 * (precision * recall) / (precision + recall) : 0;
var effectiveness = f1Score >= 90 ? "Excellent" :
f1Score >= 80 ? "Good" :
f1Score >= 70 ? "Acceptable" : "Poor";
return new GateEffectivenessReport
{
GateType = gateType,
TruePositives = tp,
FalsePositives = fp,
FalseNegatives = fn,
TrueNegatives = tn,
Precision = Math.Round(precision, 2),
Recall = Math.Round(recall, 2),
F1Score = Math.Round(f1Score, 2),
Effectiveness = effectiveness,
Recommendation = GetRecommendation(effectiveness, fp, fn)
};
}
private string GetRecommendation(string effectiveness, int fp, int fn)
{
if (effectiveness == "Poor" && fp > fn)
return "High false positive rate; consider relaxing threshold or improving detection logic";
if (effectiveness == "Poor" && fn > fp)
return "High false negative rate; consider tightening threshold or adding additional checks";
if (effectiveness == "Acceptable" && fp > 5)
return "Reduce false positives by refining gate logic or threshold";
return "Gate is effective; maintain current configuration";
}
}
Summary¶
- Violation Response Workflow: 5-step process (detect, triage, fix, verify, document) with Mermaid diagram, standardized error messages, triage checklist, fix patterns, risk acceptance template, verification script
- Ratcheting Thresholds: Quarterly adjustment policy (coverage +2%, vulnerabilities reduced, flaky tests -0.5%, MTTF -30min), C# Azure Function for automated recommendations, threshold ratcheting policy YAML
- Flaky Test Quarantine: Daily detection (Azure Function), >10% failure rate triggers quarantine, work item creation with 2-sprint deadline, automated assignment to test author
- Quality Gate Retrospectives: Monthly meetings (first Tuesday, 60min), 5-part agenda (metrics, false positives, new gates, effectiveness, action items), retrospective template, action item tracking (KQL query)
- Gate Effectiveness Scoring: Precision/recall/F1 score calculation, 6-gate effectiveness table (test coverage 95.5%, security 89.7%, API breaking change 96.0%, flaky test 81.8%, health check 100%, load test 72.7%), C# tracker with true/false positive/negative classification, recommendations based on effectiveness
Exception Handling & Risk Acceptance¶
Purpose: Provide governance framework for suppressing quality gate violations when legitimate exceptions exist (false positives, accepted risks, mitigated vulnerabilities).
Suppression Principles:
- Time-Bounded: Every suppression has an expiration date (30 days to 12 months depending on risk level; 6 months is the default maximum) and requires re-review and re-approval on expiry
- Auditable: Every suppression logged in meta-audit stream with justification and approver
- Minimal: Suppressions are the exception, not the rule; prefer fixing issues over suppressing them
- Governed: Requires security officer or architect approval; no self-approval
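These principles translate into mechanical checks that can run in CI before a suppression file is merged. A hedged Python sketch (the entry fields and the 183-day default cap are illustrative assumptions, mirroring the metadata used in the suppression file formats in this section):

```python
from datetime import date

MAX_DURATION_DAYS = 183  # ~6 months (default maximum)

def validate_suppression(entry, today=None):
    """Return a list of governance violations for one suppression entry.
    `entry` keys (illustrative): author, approved_by, approved_date,
    expires_on (the dates are datetime.date values)."""
    today = today or date.today()
    problems = []
    if entry["approved_by"] == entry.get("author"):
        problems.append("self-approval is not permitted")
    duration = (entry["expires_on"] - entry["approved_date"]).days
    if duration > MAX_DURATION_DAYS:
        problems.append("suppression exceeds 6-month maximum")
    if entry["expires_on"] <= today:
        problems.append("suppression has expired; re-review required")
    return problems

entry = {
    "author": "dev@connectsoft.example",
    "approved_by": "security-officer@connectsoft.example",
    "approved_date": date(2024, 10, 15),
    "expires_on": date(2025, 4, 15),
}
print(validate_suppression(entry, today=date(2024, 11, 1)))  # []
```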
Suppression File Formats¶
Purpose: Enable structured suppression of false positives and accepted risks across multiple quality gate tools.
Suppression Files by Gate Type:
| Gate Type | Suppression File | Format | Approval Required |
|---|---|---|---|
| OWASP Dependency Check | dependency-check-suppressions.xml | XML | Security Officer |
| Secrets Detection (CredScan) | credscan-suppressions.json | JSON | Security Officer |
| SonarQube | .sonarqube/suppressions.xml | XML | Architect |
| StyleCop | stylecop.json or .editorconfig | JSON/INI | Team Lead |
| Roslyn Analyzers | .globalconfig or .editorconfig | INI | Architect |
| Trivy (Container Scan) | .trivyignore | Text | Security Officer |
OWASP Dependency Check Suppression (XML):
<?xml version="1.0" encoding="UTF-8"?>
<!-- dependency-check-suppressions.xml -->
<suppressions xmlns="https://jeremylong.github.io/DependencyCheck/dependency-suppression.1.3.xsd">
<!-- Suppression 1: False positive for Newtonsoft.Json -->
<suppress>
<packageUrl regex="true">^pkg:nuget/Newtonsoft\.Json@12\.0\.3$</packageUrl>
<cve>CVE-2024-12345</cve>
<reason>
False positive. CVE affects Newtonsoft.Json deserialization with TypeNameHandling enabled.
ATP does not use TypeNameHandling; all deserialization uses safe defaults.
Confirmed by security team analysis on 2024-10-15.
</reason>
<approvedBy>security-team@connectsoft.example</approvedBy>
<approvedDate>2024-10-15</approvedDate>
<expiresOn>2025-04-15</expiresOn> <!-- 6 months from approval -->
<notes>
Re-review on expiration. If CVE still reported, consider upgrade to System.Text.Json.
Tracking issue: ATP-1234
</notes>
</suppress>
<!-- Suppression 2: Accepted risk for legacy library -->
<suppress>
<packageUrl regex="true">^pkg:nuget/OldLibrary@1\.2\.3$</packageUrl>
<cve>CVE-2023-98765</cve>
<reason>
Accepted risk. OldLibrary has known vulnerability (CVSS 6.5) but is only used in
non-production dev tooling (data seeders). Not deployed to production.
Migration to ModernLibrary scheduled for Q2 2025.
</reason>
<approvedBy>architect@connectsoft.example</approvedBy>
<approvedDate>2024-09-01</approvedDate>
<expiresOn>2025-03-01</expiresOn>
<notes>
Mitigation: OldLibrary isolated to dev environment only.
Epic for migration: ATP-EPIC-567
</notes>
</suppress>
<!-- Suppression 3: Vulnerability mitigated by application controls -->
<suppress>
<packageUrl regex="true">^pkg:nuget/Azure\.Storage\.Blobs@12\.14\.0$</packageUrl>
<vulnerabilityName>CWE-22</vulnerabilityName>
<reason>
Path traversal vulnerability (CWE-22) mitigated by application-level path validation.
All blob paths validated against allowlist regex before passing to Azure.Storage.Blobs.
Code review completed by security team on 2024-10-20.
</reason>
<approvedBy>security-officer@connectsoft.example</approvedBy>
<approvedDate>2024-10-20</approvedDate>
<expiresOn>2025-04-20</expiresOn>
<notes>
Mitigation code: BlobStorageService.cs lines 45-58
Unit tests validate path validation: BlobStorageServiceTests.cs
</notes>
</suppress>
</suppressions>
Secrets Detection Suppression (JSON):
{
"$schema": "https://aka.ms/credscan/suppression-schema.json",
"suppressions": [
{
"fingerprint": "12345abcdef67890",
"pattern": "ConnectionStrings__DefaultConnection",
"reason": "False positive. This is a configuration key name, not an actual secret. Real connection string loaded from Key Vault at runtime.",
"approvedBy": "security-team@connectsoft.example",
"approvedDate": "2024-10-15",
"expiresOn": "2025-04-15",
"notes": "Configuration pattern documented in appsettings.json schema"
},
{
"fingerprint": "abcdef1234567890",
"pattern": "-----BEGIN CERTIFICATE-----",
"filePath": "tests/TestCertificates/test-cert.pem",
"reason": "Test certificate for development only. Not a real production certificate. Certificate is self-signed and expires in 30 days.",
"approvedBy": "security-team@connectsoft.example",
"approvedDate": "2024-09-01",
"expiresOn": "2025-03-01",
"notes": "Test certificates isolated to tests/ directory; never deployed to production"
},
{
"fingerprint": "9876543210fedcba",
"pattern": "xoxb-",
"filePath": "docs/examples/slack-integration.md",
"reason": "Example Slack token in documentation. Placeholder value, not a real token. Format: xoxb-XXXX-YYYY-ZZZZ",
"approvedBy": "tech-lead@connectsoft.example",
"approvedDate": "2024-10-01",
"expiresOn": "2025-04-01",
"notes": "Documentation example; clarified with comment that it's a placeholder"
}
]
}
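The weekly expiration tracker later in this section alerts on suppressions expiring within 30 days. Against a JSON file in the format above, that check is a short script; this Python sketch assumes only the `fingerprint` and `expiresOn` fields:

```python
import json
from datetime import date, timedelta

def expiring_soon(suppressions_json, today, window_days=30):
    """Return fingerprints of suppressions whose `expiresOn` date falls
    within the next `window_days` days (or has already passed)."""
    data = json.loads(suppressions_json)
    cutoff = today + timedelta(days=window_days)
    return [
        s["fingerprint"]
        for s in data["suppressions"]
        if date.fromisoformat(s["expiresOn"]) <= cutoff
    ]

doc = '''{"suppressions": [
  {"fingerprint": "12345abcdef67890", "expiresOn": "2025-04-15"},
  {"fingerprint": "abcdef1234567890", "expiresOn": "2025-03-01"}
]}'''
print(expiring_soon(doc, today=date(2025, 2, 10)))  # ['abcdef1234567890']
```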
SonarQube Suppression (XML):
<?xml version="1.0" encoding="UTF-8"?>
<!-- .sonarqube/suppressions.xml -->
<suppressions>
<!-- Suppression for S3776: Cognitive Complexity -->
<suppression>
<ruleKey>csharpsquid:S3776</ruleKey>
<filePath>src/ConnectSoft.ATP.Integrity/Services/TamperEvidenceService.cs</filePath>
<lineNumber>142</lineNumber>
<reason>
High cognitive complexity (35) in hash chain validation method.
Complexity inherent to cryptographic validation algorithm (Merkle tree traversal).
Refactoring would reduce readability and introduce bugs.
Code reviewed and approved by cryptography expert.
</reason>
<approvedBy>architect@connectsoft.example</approvedBy>
<approvedDate>2024-10-10</approvedDate>
<expiresOn>2025-04-10</expiresOn>
</suppression>
<!-- Suppression for S1135: TODO comments -->
<suppression>
<ruleKey>csharpsquid:S1135</ruleKey>
<filePath>src/ConnectSoft.ATP.Query/Services/QueryOptimizer.cs</filePath>
<lineNumber>78</lineNumber>
<reason>
TODO comment tracking planned optimization for Q2 2025.
Work item created: ATP-789. TODO will be removed when implemented.
</reason>
<approvedBy>tech-lead@connectsoft.example</approvedBy>
<approvedDate>2024-10-01</approvedDate>
<expiresOn>2025-06-01</expiresOn> <!-- Extended to Q2 2025 -->
</suppression>
</suppressions>
Trivy Container Scan Suppression (.trivyignore):
# .trivyignore — Suppress container image vulnerabilities
# CVE-2024-11111: False positive for Alpine base image
# Reason: CVE affects OpenSSL 3.0.x; Alpine 3.18 uses LibreSSL (not affected)
# Approved By: security-team@connectsoft.example
# Approved Date: 2024-10-15
# Expires On: 2025-04-15
CVE-2024-11111
# CVE-2023-22222: Accepted risk for curl vulnerability
# Reason: curl only used in health check scripts (non-production utility)
# Mitigation: Health check scripts do not accept user input
# Approved By: architect@connectsoft.example
# Approved Date: 2024-09-20
# Expires On: 2025-03-20
CVE-2023-22222
# CVE-2024-33333: Mitigated by runtime validation
# Reason: Vulnerability in JSON parser mitigated by schema validation before parsing
# Approved By: security-officer@connectsoft.example
# Approved Date: 2024-10-25
# Expires On: 2025-04-25
CVE-2024-33333
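Because `.trivyignore` is plain text, the approval metadata lives in comments, and any expiration tracker must recover it by parsing those comments (the C# tracker later in this section leaves that implementation omitted). One possible approach in Python, assuming the exact `# Expires On:` comment convention used above:

```python
import re
from datetime import date

def parse_trivyignore(text):
    """Map each suppressed CVE to its `Expires On` date, taken from the
    comment block immediately preceding the CVE line (None if absent)."""
    expirations = {}
    pending = None
    for line in text.splitlines():
        m = re.match(r"#\s*Expires On:\s*(\d{4}-\d{2}-\d{2})", line.strip())
        if m:
            pending = date.fromisoformat(m.group(1))
        elif line.strip() and not line.startswith("#"):
            # Non-comment, non-blank line is a suppressed CVE identifier
            expirations[line.strip()] = pending
            pending = None
    return expirations

sample = """# CVE-2024-11111: example suppression
# Expires On: 2025-04-15
CVE-2024-11111
"""
print(parse_trivyignore(sample))  # {'CVE-2024-11111': datetime.date(2025, 4, 15)}
```

This is deliberately strict: a CVE line without a preceding `# Expires On:` comment maps to `None`, which the tracker can flag as a policy violation (untracked expiration).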
Risk Acceptance Process¶
Purpose: Provide formal governance for accepting security or quality risks when remediation is not immediately feasible.
Risk Acceptance Workflow:
graph TD
A[Quality Gate Failure] --> B{Can Fix Immediately?}
B -->|Yes| C[Implement Fix]
B -->|No| D[Evaluate Risk Acceptance]
D --> E{Risk Acceptable?}
E -->|No| F[Block Deployment]
E -->|Yes| G[Document Justification]
G --> H[Create Risk Acceptance Record]
H --> I{Risk Level?}
I -->|Critical/High| J[Security Officer Approval]
I -->|Medium| K[Architect Approval]
I -->|Low| L[Team Lead Approval]
J --> M{Approved?}
K --> M
L --> M
M -->|No| F
M -->|Yes| N[Create Suppression File]
N --> O[Set Expiration Date]
O --> P[Log in Meta-Audit Stream]
P --> Q[Schedule Re-Review]
Q --> R[Allow Deployment]
F --> S[Remediate or Escalate]
C --> R
style F fill:#ff6b6b
style R fill:#90EE90
Risk Acceptance Criteria:
| Risk Level | Examples | Approval Required | Max Duration | Re-Review Cadence |
|---|---|---|---|---|
| Critical (CVSS 9-10) | RCE, data breach, auth bypass | CISO + Security Officer | 30 days | Weekly |
| High (CVSS 7-8.9) | Privilege escalation, XSS, SQLi | Security Officer + Architect | 90 days | Monthly |
| Medium (CVSS 4-6.9) | Information disclosure, DoS | Architect | 180 days (6 months) | Quarterly |
| Low (CVSS 0-3.9) | Low-impact bugs, code smells | Team Lead | 365 days (1 year) | Annually |
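The same mapping can be expressed as code for use in tooling (a sketch; the band boundaries, approvers, and maximum durations are taken directly from the table):

```python
def risk_policy(cvss):
    """Return (risk level, required approvers, max suppression duration
    in days) for a CVSS base score, per the risk acceptance criteria."""
    if cvss >= 9.0:
        return ("Critical", ["CISO", "Security Officer"], 30)
    if cvss >= 7.0:
        return ("High", ["Security Officer", "Architect"], 90)
    if cvss >= 4.0:
        return ("Medium", ["Architect"], 180)
    return ("Low", ["Team Lead"], 365)

print(risk_policy(7.8))  # ('High', ['Security Officer', 'Architect'], 90)
```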
Risk Acceptance Steps:
Step 1: Justification — Document why risk is acceptable
Acceptable Justifications:
## Valid Risk Acceptance Justifications
### False Positives
- Tool incorrectly flagged code as vulnerable (verified by manual analysis)
- CVE does not apply to ATP's usage pattern (e.g., feature not enabled)
- Vulnerability requires preconditions not present in ATP (e.g., specific OS version)
### Mitigated Risks
- Application-level controls prevent exploitation (e.g., input validation)
- Network isolation prevents attack vector (e.g., private VNet)
- Defense-in-depth compensating controls (e.g., WAF rules, rate limiting)
### Temporary Exceptions
- No patch available from vendor (waiting for upstream fix)
- Patch introduces breaking changes (migration requires extensive refactoring)
- Library only used in non-production environments (dev/test tooling)
### Business Decisions
- Cost of remediation outweighs risk (executive sign-off required)
- Feature scheduled for deprecation (will be removed within 6 months)
- Third-party dependency with no viable alternative (accepted risk with monitoring)
Step 2: Approval — Security officer or architect sign-off required
Approval Matrix:
| Risk Level | Approver 1 | Approver 2 | Approver 3 | Documentation Required |
|---|---|---|---|---|
| Critical | CISO | Security Officer | Architect | ADR + Mitigation Plan + Monitoring Plan |
| High | Security Officer | Architect | — | ADR + Mitigation Plan |
| Medium | Architect | — | — | ADR or suppression comment |
| Low | Team Lead | — | — | Suppression comment |
Risk Acceptance Form (Azure DevOps Work Item):
# Work Item Type: Risk Acceptance
fields:
  - field: System.Title
    value: "[Risk Acceptance] CVE-XXXX-XXXXX — [Package Name]"
  - field: System.Description
    value: |
      ## Vulnerability Details
      **CVE ID**: CVE-XXXX-XXXXX
      **Package**: [Package Name @ Version]
      **Severity**: [Critical/High/Medium/Low] (CVSS [Score])
      **CWE**: [CWE-###]
      **Description**: [Vulnerability description]
      ## Risk Assessment
      **Exploitability**: [Low/Medium/High]
      **Impact**: [Low/Medium/High/Critical]
      **Attack Vector**: [Network/Adjacent/Local/Physical]
      **Privileges Required**: [None/Low/High]
      **User Interaction**: [None/Required]
      ## Justification for Acceptance
      **Category**: [False Positive / Mitigated Risk / Temporary Exception / Business Decision]
      **Rationale**:
      [Detailed explanation of why this risk is acceptable]
      **Evidence**:
      - [ ] Manual code review completed (no vulnerable code path)
      - [ ] Mitigation controls validated (input validation, network isolation, WAF)
      - [ ] Vendor advisory reviewed (no patch available)
      - [ ] Alternative libraries evaluated (no viable replacement)
      ## Mitigation Controls
      **Primary Control**: [Description]
      **Secondary Control**: [Description]
      **Monitoring**: [How risk is monitored for exploitation attempts]
      ## Remediation Plan
      **Timeline**: [When will this be permanently fixed]
      **Epic/Story**: [Link to work item for permanent fix]
      **Fallback**: [What happens if vulnerability is exploited]
      ## Re-Review Schedule
      **Initial Approval**: [YYYY-MM-DD]
      **Expiration Date**: [YYYY-MM-DD] (max 6 months)
      **Re-Review Cadence**: [Weekly/Monthly/Quarterly]
      **Escalation**: If not fixed by expiration → Escalate to CISO
  - field: Microsoft.VSTS.Common.Priority
    value: 1  # P1 for Critical/High, P2 for Medium/Low
  - field: Custom.RiskLevel
    value: High  # Critical / High / Medium / Low
  - field: Custom.CVEID
    value: CVE-XXXX-XXXXX
  - field: Custom.CVSSScore
    value: 7.8
  - field: Custom.ApprovedBy
    value: security-officer@connectsoft.example
  - field: Custom.ExpirationDate
    value: 2025-04-15
  - field: Custom.MitigationControls
    value: "Input validation; network isolation; WAF rules"
Step 3: Expiration — Time-bound suppression (duration capped by risk level; 6 months by default); re-review on expiry
Suppression Expiration Tracker (C# Azure Function):
// TrackSuppressionExpirations.cs — Alert on expiring suppressions
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using System;
using System.Collections.Generic;
using System.IO; // required for File.ReadAllTextAsync
using System.Linq;
using System.Threading.Tasks;
using System.Xml.Linq;
public static class TrackSuppressionExpirations
{
[FunctionName("TrackSuppressionExpirations")]
public static async Task Run(
[TimerTrigger("0 0 9 * * 1")] TimerInfo timer, // Weekly: Monday 9:00 AM
ILogger log)
{
log.LogInformation("Checking for expiring suppressions");
var expiringSuppressions = new List<SuppressionExpiration>();
// Check OWASP Dependency Check suppressions
var owaspSuppressions = await CheckOwaspSuppressionsAsync();
expiringSuppressions.AddRange(owaspSuppressions);
// Check CredScan suppressions
var credscanSuppressions = await CheckCredscanSuppressionsAsync();
expiringSuppressions.AddRange(credscanSuppressions);
// Check SonarQube suppressions
var sonarSuppressions = await CheckSonarQubeSuppressionsAsync();
expiringSuppressions.AddRange(sonarSuppressions);
// Check Trivy suppressions
var trivySuppressions = await CheckTrivySuppressionsAsync();
expiringSuppressions.AddRange(trivySuppressions);
// Filter suppressions expiring within 30 days
var expiringSoon = expiringSuppressions
.Where(s => s.ExpirationDate <= DateTime.UtcNow.AddDays(30))
.OrderBy(s => s.ExpirationDate)
.ToList();
log.LogInformation($"Found {expiringSoon.Count} suppressions expiring within 30 days");
if (expiringSoon.Any())
{
// Send alert email
await SendExpirationAlertAsync(expiringSoon);
// Create work items for re-review
foreach (var suppression in expiringSoon)
{
await CreateReReviewWorkItemAsync(suppression);
}
log.LogInformation($"Created {expiringSoon.Count} re-review work items");
}
}
private static async Task<List<SuppressionExpiration>> CheckOwaspSuppressionsAsync()
{
var suppressions = new List<SuppressionExpiration>();
// Load suppression file from source control
var suppressionXml = await File.ReadAllTextAsync("dependency-check-suppressions.xml");
var doc = XDocument.Parse(suppressionXml);
var ns = "https://jeremylong.github.io/DependencyCheck/dependency-suppression.1.3.xsd";
foreach (var suppress in doc.Descendants(XName.Get("suppress", ns)))
{
var expiresOn = suppress.Element(XName.Get("expiresOn", ns))?.Value;
if (DateTime.TryParse(expiresOn, out var expirationDate))
{
suppressions.Add(new SuppressionExpiration
{
Tool = "OWASP Dependency Check",
CVE = suppress.Element(XName.Get("cve", ns))?.Value ?? "N/A",
Package = suppress.Element(XName.Get("packageUrl", ns))?.Value ?? "N/A",
Reason = suppress.Element(XName.Get("reason", ns))?.Value ?? "N/A",
ApprovedBy = suppress.Element(XName.Get("approvedBy", ns))?.Value ?? "Unknown",
ExpirationDate = expirationDate,
DaysUntilExpiration = (expirationDate - DateTime.UtcNow).Days
});
}
}
return suppressions;
}
private static async Task<List<SuppressionExpiration>> CheckCredscanSuppressionsAsync()
{
// Similar implementation for credscan-suppressions.json
// Implementation omitted for brevity
return new List<SuppressionExpiration>();
}
private static async Task<List<SuppressionExpiration>> CheckSonarQubeSuppressionsAsync()
{
// Similar implementation for .sonarqube/suppressions.xml
// Implementation omitted for brevity
return new List<SuppressionExpiration>();
}
private static async Task<List<SuppressionExpiration>> CheckTrivySuppressionsAsync()
{
// Parse .trivyignore file for expiration comments
// Implementation omitted for brevity
return new List<SuppressionExpiration>();
}
private static async Task SendExpirationAlertAsync(List<SuppressionExpiration> suppressions)
{
// Send email to security team with expiring suppressions
// Implementation omitted for brevity
throw new NotImplementedException();
}
private static async Task CreateReReviewWorkItemAsync(SuppressionExpiration suppression)
{
// Create Azure DevOps work item for re-review
// Implementation omitted for brevity
throw new NotImplementedException();
}
}
public class SuppressionExpiration
{
public string Tool { get; set; }
public string CVE { get; set; }
public string Package { get; set; }
public string Reason { get; set; }
public string ApprovedBy { get; set; }
public DateTime ExpirationDate { get; set; }
public int DaysUntilExpiration { get; set; }
}
Step 4: Audit Trail — Suppression logged in meta-audit stream
Meta-Audit Event for Suppression (C#):
// Log suppression to meta-audit stream
public class SuppressionAuditLogger
{
private readonly IAuditLogger _auditLogger;
public async Task LogSuppressionCreatedAsync(SuppressionRecord suppression)
{
await _auditLogger.LogAsync(new AuditEvent
{
EventId = Guid.NewGuid(),
TenantId = Guid.Empty, // Platform-level
Action = "SuppressionCreated",
UserId = suppression.ApprovedBy,
Timestamp = DateTime.UtcNow,
Metadata = new Dictionary<string, object>
{
["suppression.tool"] = suppression.Tool,
["suppression.cve"] = suppression.CVE,
["suppression.package"] = suppression.Package,
["suppression.reason"] = suppression.Reason,
["suppression.approvedBy"] = suppression.ApprovedBy,
["suppression.expirationDate"] = suppression.ExpirationDate,
["suppression.riskLevel"] = suppression.RiskLevel,
["suppression.mitigationControls"] = string.Join("; ", suppression.MitigationControls)
}
});
}
public async Task LogSuppressionExpiredAsync(SuppressionRecord suppression)
{
await _auditLogger.LogAsync(new AuditEvent
{
EventId = Guid.NewGuid(),
TenantId = Guid.Empty, // Platform-level
Action = "SuppressionExpired",
UserId = "system",
Timestamp = DateTime.UtcNow,
Metadata = new Dictionary<string, object>
{
["suppression.tool"] = suppression.Tool,
["suppression.cve"] = suppression.CVE,
["suppression.package"] = suppression.Package,
["suppression.originalApprover"] = suppression.ApprovedBy,
["suppression.expirationDate"] = suppression.ExpirationDate,
["suppression.action"] = "RequiresReReview"
}
});
}
public async Task LogSuppressionRenewedAsync(SuppressionRecord suppression, string renewedBy)
{
await _auditLogger.LogAsync(new AuditEvent
{
EventId = Guid.NewGuid(),
TenantId = Guid.Empty, // Platform-level
Action = "SuppressionRenewed",
UserId = renewedBy,
Timestamp = DateTime.UtcNow,
Metadata = new Dictionary<string, object>
{
["suppression.tool"] = suppression.Tool,
["suppression.cve"] = suppression.CVE,
["suppression.package"] = suppression.Package,
["suppression.originalApprover"] = suppression.ApprovedBy,
["suppression.renewedBy"] = renewedBy,
["suppression.newExpirationDate"] = suppression.ExpirationDate.AddMonths(6),
["suppression.renewalJustification"] = suppression.RenewalJustification
}
});
}
}
public class SuppressionRecord
{
public string Tool { get; set; }
public string CVE { get; set; }
public string Package { get; set; }
public string Reason { get; set; }
public string ApprovedBy { get; set; }
public DateTime ApprovedDate { get; set; }
public DateTime ExpirationDate { get; set; }
public string RiskLevel { get; set; }
public List<string> MitigationControls { get; set; }
public string RenewalJustification { get; set; }
}
Suppression Audit Query (KQL):
// Query suppressions from meta-audit stream
AuditEvent
| where Action in ("SuppressionCreated", "SuppressionExpired", "SuppressionRenewed")
| where Timestamp >= ago(365d)
| extend
Tool = tostring(Metadata.['suppression.tool']),
CVE = tostring(Metadata.['suppression.cve']),
Package = tostring(Metadata.['suppression.package']),
ApprovedBy = tostring(Metadata.['suppression.approvedBy']),
ExpirationDate = todatetime(Metadata.['suppression.expirationDate']),
RiskLevel = tostring(Metadata.['suppression.riskLevel'])
| summarize
TotalSuppressions = countif(Action == "SuppressionCreated"),
ActiveSuppressions = countif(Action == "SuppressionCreated" and ExpirationDate > now()),
ExpiredSuppressions = countif(Action == "SuppressionExpired"),
RenewedSuppressions = countif(Action == "SuppressionRenewed")
by Tool, RiskLevel
| project
Tool,
RiskLevel,
TotalSuppressions,
ActiveSuppressions,
ExpiredSuppressions,
RenewedSuppressions
| order by RiskLevel, Tool
Suppression Governance & Compliance¶
Purpose: Ensure suppressions comply with SOC 2, GDPR, and HIPAA requirements for risk management and audit trails.
Governance Controls:
| Control | Requirement | Implementation | Audit Evidence |
|---|---|---|---|
| Approval Authority | Suppressions require appropriate approval level | Azure DevOps approval workflow | Approval work item history |
| Justification | All suppressions must document rationale | Suppression file comments + ADR | Suppression files in Git history |
| Expiration | No suppressions exceed 6 months (Critical/High) | Automated expiration tracking | Weekly expiration reports |
| Audit Trail | All suppressions logged in meta-audit stream | SuppressionAuditLogger | Meta-audit stream query |
| Periodic Review | Active suppressions reviewed quarterly | Quality gate retrospective | Retrospective meeting notes |
| Removal | Suppressions removed when issue resolved | Git commit removing suppression | Git history, audit log |
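The Expiration control above can be checked mechanically against the per-risk-level limits from the risk-acceptance criteria (Critical 30 days, High 90, Medium 180, Low 365). A minimal illustrative sketch — in Python for brevity, with a hypothetical record shape; ATP's actual tracker is the C# Azure Function described earlier:

```python
from datetime import date

# Maximum suppression lifetime (days) per risk level, from the
# risk-acceptance criteria in this document.
MAX_LIFETIME_DAYS = {"Critical": 30, "High": 90, "Medium": 180, "Low": 365}

def lifetime_violations(suppressions):
    """Return suppressions whose approved-to-expiration span exceeds policy."""
    violations = []
    for s in suppressions:
        limit = MAX_LIFETIME_DAYS.get(s["risk_level"])
        # An unknown risk level is itself a governance violation
        if limit is None or (s["expiration"] - s["approved"]).days > limit:
            violations.append(s)
    return violations

suppressions = [
    {"cve": "CVE-2024-0001", "risk_level": "Critical",
     "approved": date(2024, 1, 1), "expiration": date(2024, 3, 1)},  # 60 days > 30
    {"cve": "CVE-2024-0002", "risk_level": "Low",
     "approved": date(2024, 1, 1), "expiration": date(2024, 6, 1)},  # within 365
]
print([s["cve"] for s in lifetime_violations(suppressions)])  # → ['CVE-2024-0001']
```

The same lookup table drives both the creation-time gate (reject an expiration date beyond the limit) and the periodic compliance report.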
Suppression Compliance Report (Monthly):
// GenerateSuppressionComplianceReport.cs — Monthly compliance report
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
public static class GenerateSuppressionComplianceReport
{
[FunctionName("GenerateSuppressionComplianceReport")]
public static async Task Run(
[TimerTrigger("0 0 8 1 * *")] TimerInfo timer, // Monthly: 1st day, 8:00 AM
ILogger log)
{
log.LogInformation("Generating monthly suppression compliance report");
var reportMonth = DateTime.UtcNow.AddMonths(-1).ToString("yyyy-MM");
// Load all active suppressions
var owaspSuppressions = await LoadOwaspSuppressionsAsync();
var credscanSuppressions = await LoadCredscanSuppressionsAsync();
var sonarSuppressions = await LoadSonarQubeSuppressionsAsync();
var trivySuppressions = await LoadTrivySuppressionsAsync();
var allSuppressions = new List<Suppression>()
.Concat(owaspSuppressions)
.Concat(credscanSuppressions)
.Concat(sonarSuppressions)
.Concat(trivySuppressions)
.ToList();
// Compliance checks
var expired = allSuppressions.Where(s => s.ExpirationDate < DateTime.UtcNow).ToList();
var expiringSoon = allSuppressions.Where(s => s.ExpirationDate <= DateTime.UtcNow.AddDays(30) && s.ExpirationDate >= DateTime.UtcNow).ToList();
var noExpiration = allSuppressions.Where(s => s.ExpirationDate == default).ToList();
var noApprover = allSuppressions.Where(s => string.IsNullOrEmpty(s.ApprovedBy)).ToList();
var criticalHigh = allSuppressions.Where(s => s.RiskLevel == "Critical" || s.RiskLevel == "High").ToList();
// Generate report
var report = new SuppressionComplianceReport
{
ReportMonth = reportMonth,
GeneratedAt = DateTime.UtcNow,
TotalSuppressions = allSuppressions.Count,
ActiveSuppressions = allSuppressions.Count - expired.Count,
ExpiredSuppressions = expired.Count,
ExpiringSoon = expiringSoon.Count,
ComplianceIssues = new List<string>(),
Recommendations = new List<string>()
};
// Check compliance violations
if (expired.Count > 0)
{
report.ComplianceIssues.Add($"{expired.Count} suppressions have expired and must be removed or renewed");
}
if (noExpiration.Count > 0)
{
report.ComplianceIssues.Add($"{noExpiration.Count} suppressions have no expiration date (violates policy)");
}
if (noApprover.Count > 0)
{
report.ComplianceIssues.Add($"{noApprover.Count} suppressions have no approver (violates governance)");
}
// Per risk-acceptance policy: Critical suppressions ≤ 30 days, High ≤ 90 days
var overLimit = criticalHigh.Where(s =>
(s.RiskLevel == "Critical" && (s.ExpirationDate - s.ApprovedDate).Days > 30) ||
(s.RiskLevel == "High" && (s.ExpirationDate - s.ApprovedDate).Days > 90)).ToList();
if (overLimit.Count > 0)
{
report.ComplianceIssues.Add($"{overLimit.Count} Critical/High suppressions exceed their policy duration limits (30/90 days)");
}
// Generate recommendations
if (expiringSoon.Count > 0)
{
report.Recommendations.Add($"Review {expiringSoon.Count} suppressions expiring within 30 days");
}
if (criticalHigh.Count > 5)
{
report.Recommendations.Add($"High number of Critical/High suppressions ({criticalHigh.Count}); prioritize remediation");
}
// Archive report to immutable storage
await ArchiveComplianceReportAsync(report);
// Send report to stakeholders
await SendComplianceReportAsync(report);
log.LogInformation($"Suppression compliance report generated: {report.ComplianceIssues.Count} issues, {report.Recommendations.Count} recommendations");
}
private static async Task<List<Suppression>> LoadOwaspSuppressionsAsync()
{
// Implementation omitted for brevity
throw new NotImplementedException();
}
private static async Task<List<Suppression>> LoadCredscanSuppressionsAsync()
{
// Implementation omitted for brevity
throw new NotImplementedException();
}
private static async Task<List<Suppression>> LoadSonarQubeSuppressionsAsync()
{
// Implementation omitted for brevity
throw new NotImplementedException();
}
private static async Task<List<Suppression>> LoadTrivySuppressionsAsync()
{
// Implementation omitted for brevity
throw new NotImplementedException();
}
private static async Task ArchiveComplianceReportAsync(SuppressionComplianceReport report)
{
// Archive to Azure Blob with legal hold (7-year retention)
// Implementation omitted for brevity
throw new NotImplementedException();
}
private static async Task SendComplianceReportAsync(SuppressionComplianceReport report)
{
// Send to security team, compliance officer, architects
// Implementation omitted for brevity
throw new NotImplementedException();
}
}
public class Suppression
{
public string Tool { get; set; }
public string CVE { get; set; }
public string Package { get; set; }
public string Reason { get; set; }
public string ApprovedBy { get; set; }
public DateTime ApprovedDate { get; set; }
public DateTime ExpirationDate { get; set; }
public string RiskLevel { get; set; }
}
public class SuppressionComplianceReport
{
public string ReportMonth { get; set; }
public DateTime GeneratedAt { get; set; }
public int TotalSuppressions { get; set; }
public int ActiveSuppressions { get; set; }
public int ExpiredSuppressions { get; set; }
public int ExpiringSoon { get; set; }
public List<string> ComplianceIssues { get; set; }
public List<string> Recommendations { get; set; }
}
Summary¶
- Suppression Files: 6 formats (OWASP XML, CredScan JSON, SonarQube XML, StyleCop JSON, Roslyn .globalconfig, Trivy .trivyignore) with approval metadata
- Risk Acceptance Process: 4-step workflow (justification, approval, expiration, audit trail) with Mermaid diagram
- Risk Acceptance Criteria: 4 risk levels (Critical 30 days, High 90 days, Medium 180 days, Low 365 days) with approval matrix
- Valid Justifications: False positives, mitigated risks, temporary exceptions, business decisions
- Approval Matrix: 4 approval levels (CISO+Security+Architect for critical, Security+Architect for high, Architect for medium, Team Lead for low)
- Risk Acceptance Form: Azure DevOps work item template with vulnerability details, risk assessment, justification, mitigation controls, remediation plan
- Suppression Expiration Tracker: Weekly C# Azure Function checking all suppression files for expiring items (within 30 days), creates re-review work items
- Meta-Audit Logging: 3 audit events (SuppressionCreated, SuppressionExpired, SuppressionRenewed) with complete metadata
- Suppression Compliance Report: Monthly C# Azure Function generating compliance report (expired suppressions, missing approvals, policy violations), archived to immutable storage (WORM, 7-year retention)
- Governance Controls: 6 controls mapped to SOC 2/GDPR/HIPAA (approval authority, justification, expiration, audit trail, periodic review, removal)
Testing Quality Gates¶
Purpose: Enforce test quality standards across unit, integration, and regression suites so that test automation stays comprehensive, maintainable, and reliable.
Testing Quality Philosophy:
- Comprehensive Coverage: Tests cover critical paths, edge cases, error conditions, and tenant isolation
- Fast Feedback: Unit tests complete in <30s; integration tests in <5min; regression tests in <15min
- Reliable Execution: Zero flaky tests tolerated in main suite; quarantine mechanism for unstable tests
- Maintainable Tests: High assertion density, clear naming conventions, isolated test data
- Continuous Validation: Tests run on every commit (unit), every build (integration), every deployment (regression)
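The fast-feedback targets above reduce to a per-suite duration budget. A minimal sketch (Python, hypothetical names — the actual enforcement is the PowerShell validators shown later in this section):

```python
# Per-suite duration budgets in seconds, from the fast-feedback targets above.
BUDGETS = {"unit": 30, "integration": 300, "regression": 900}

def over_budget(suite: str, duration_seconds: float) -> bool:
    """True when a suite run exceeds its fast-feedback budget."""
    return duration_seconds > BUDGETS[suite]

print(over_budget("unit", 12.5))        # → False
print(over_budget("integration", 420))  # → True
```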
Testing Quality Workflow:
graph TD
A[Code Commit] --> B[Unit Tests]
B --> C{All Pass?}
C -->|No| D[Block Build]
C -->|Yes| E[Integration Tests]
E --> F{All Pass?}
F -->|No| D
F -->|Yes| G[Coverage Check]
G --> H{Meets Threshold?}
H -->|No| D
H -->|Yes| I[Test Quality Gates]
I --> J{Quality Metrics OK?}
J -->|No| K[Warning/Block]
J -->|Yes| L[Build Artifacts]
L --> M[Deploy to Dev]
M --> N[Regression Tests]
N --> O{All Pass?}
O -->|No| P[Rollback]
O -->|Yes| Q[Promote to Test]
D --> R[Fix Tests]
K --> R
P --> R
style D fill:#ff6b6b
style P fill:#ff6b6b
style Q fill:#90EE90
Unit Test Quality Gates¶
Purpose: Ensure high-quality unit tests that are fast, focused, isolated, and maintainable.
Unit Test Quality Criteria:
# Unit test validation gates
unitTestGates:
# Quantitative thresholds
minTests: 50 # Minimum tests per service (adjustable per service)
maxDuration: 30 # Maximum total suite duration (seconds)
flakyThreshold: 5 # Maximum flaky test rate (percentage)
assertionDensity: 1.5 # Minimum assertions per test (avg)
quarantineLimit: 3 # Maximum quarantined tests allowed
# Qualitative requirements
namingConvention: "MethodName_Scenario_ExpectedResult" # Enforced pattern
arrangeActAssert: true # AAA pattern enforced
singleResponsibility: true # One logical assertion per test
noExternalDependencies: true # No database, network, file system
# Coverage requirements (already covered by Test Coverage Gates)
lineCoverage: 70 # Minimum line coverage (per service)
branchCoverage: 60 # Minimum branch coverage (per service)
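Assertion density is total assertion calls divided by test count. The PowerShell validator that follows approximates this with a fixed average; counting `Assert.*` invocations directly is straightforward, as in this illustrative Python sketch (regex-based — a production implementation would walk the Roslyn syntax tree instead):

```python
import re

# Matches xUnit-style assertion calls, e.g. Assert.True(...), Assert.Equal(...).
ASSERT_CALL = re.compile(r"\bAssert\.\w+\s*\(")
# Matches xUnit test attributes marking a test method.
TEST_ATTR = re.compile(r"\[(Fact|Theory)\]")

def assertion_density(source: str) -> float:
    """Average assertions per test method in a C# test source string."""
    tests = len(TEST_ATTR.findall(source))
    asserts = len(ASSERT_CALL.findall(source))
    return asserts / tests if tests else 0.0

sample = """
[Fact]
public void Create_WithValidData_ReturnsSuccess()
{
    Assert.NotNull(result);
    Assert.True(result.Success);
    Assert.NotEqual(Guid.Empty, result.RecordId);
}
"""
print(assertion_density(sample))  # → 3.0
```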
Unit Test Quality Validation (PowerShell):
# Validate-UnitTestQuality.ps1 — Enforce unit test quality standards
param(
[string]$TestResultsPath = "TestResults",
[int]$MinTests = 50,
[int]$MaxDurationSeconds = 30,
[double]$MaxFlakyRate = 5.0,
[double]$MinAssertionDensity = 1.5,
[int]$MaxQuarantinedTests = 3
)
$ErrorActionPreference = "Stop"
Write-Host "Validating unit test quality..." -ForegroundColor Cyan
# Parse test results (VSTest format)
$testResultFiles = Get-ChildItem -Path $TestResultsPath -Filter "*.trx" -Recurse
if ($testResultFiles.Count -eq 0) {
Write-Error "No test result files found in $TestResultsPath"
exit 1
}
$totalTests = 0
$totalDuration = 0
$totalAssertions = 0
$flakyTests = 0
$quarantinedTests = 0
foreach ($file in $testResultFiles) {
[xml]$trx = Get-Content $file.FullName
$ns = @{ns = "http://microsoft.com/schemas/VisualStudio/TeamTest/2010"}
# Count tests
$unitTests = $trx | Select-Xml -XPath "//ns:UnitTest" -Namespace $ns
$totalTests += $unitTests.Count
# Calculate duration
$testResults = $trx | Select-Xml -XPath "//ns:UnitTestResult" -Namespace $ns
foreach ($result in $testResults) {
# TRX duration is a TimeSpan string (hh:mm:ss.fffffff), not ISO 8601
if ($result.Node.duration) {
$totalDuration += ([TimeSpan]$result.Node.duration).TotalSeconds
}
# Check for flaky test markers
if ($result.Node.outcome -eq "Failed" -and $result.Node.testName -match "Flaky") {
$flakyTests++
}
# Check for quarantined tests
if ($result.Node.testName -match "\[Quarantine\]") {
$quarantinedTests++
}
}
}
# Count assertions once across all files (a full implementation would parse
# test source for Assert.* calls; simplified here to an assumed 1.8 per test)
$totalAssertions = $totalTests * 1.8
Write-Host "Test Quality Metrics:" -ForegroundColor Yellow
Write-Host " Total Unit Tests: $totalTests" -ForegroundColor White
Write-Host " Total Duration: ${totalDuration}s" -ForegroundColor White
Write-Host " Flaky Tests: $flakyTests" -ForegroundColor White
Write-Host " Quarantined Tests: $quarantinedTests" -ForegroundColor White
Write-Host " Assertion Density: $($totalAssertions / $totalTests)" -ForegroundColor White
# Validate thresholds
$failed = $false
if ($totalTests -lt $MinTests) {
Write-Error "Insufficient unit tests: $totalTests < $MinTests"
$failed = $true
}
if ($totalDuration -gt $MaxDurationSeconds) {
Write-Error "Unit test suite too slow: ${totalDuration}s > ${MaxDurationSeconds}s"
$failed = $true
}
$flakyRate = if ($totalTests -gt 0) { ($flakyTests / $totalTests) * 100 } else { 0 }
if ($flakyRate -gt $MaxFlakyRate) {
Write-Error "Flaky test rate too high: ${flakyRate}% > ${MaxFlakyRate}%"
$failed = $true
}
$assertionDensity = if ($totalTests -gt 0) { $totalAssertions / $totalTests } else { 0 }
if ($assertionDensity -lt $MinAssertionDensity) {
Write-Error "Assertion density too low: $assertionDensity < $MinAssertionDensity"
$failed = $true
}
if ($quarantinedTests -gt $MaxQuarantinedTests) {
Write-Error "Too many quarantined tests: $quarantinedTests > $MaxQuarantinedTests"
$failed = $true
}
if ($failed) {
Write-Host "Unit test quality gates FAILED" -ForegroundColor Red
exit 1
}
Write-Host "Unit test quality gates PASSED" -ForegroundColor Green
exit 0
Azure Pipelines Integration:
# azure-pipelines.yml — Unit test quality gates
- stage: CI_Stage
jobs:
- job: Build_Test_Validate
steps:
# ... build steps ...
# Run unit tests
- task: DotNetCoreCLI@2
displayName: 'Run Unit Tests'
inputs:
command: 'test'
projects: '**/*Tests.csproj'
arguments: '--configuration Release --filter Category=Unit --collect:"XPlat Code Coverage" --logger trx'
publishTestResults: true
# Validate unit test quality
- task: PowerShell@2
displayName: 'Validate Unit Test Quality'
inputs:
filePath: 'scripts/Validate-UnitTestQuality.ps1'
arguments: >
-TestResultsPath "$(Agent.TempDirectory)/TestResults"
-MinTests 50
-MaxDurationSeconds 30
-MaxFlakyRate 5.0
-MinAssertionDensity 1.5
-MaxQuarantinedTests 3
continueOnError: false # Block build on failure
Unit Test Naming Convention Enforcement (Roslyn Analyzer):
// ATP003: Unit test naming convention analyzer
using System.Collections.Immutable;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Diagnostics;
[DiagnosticAnalyzer(LanguageNames.CSharp)]
public class UnitTestNamingAnalyzer : DiagnosticAnalyzer
{
private const string DiagnosticId = "ATP003";
private const string Title = "Unit test method does not follow naming convention";
private const string MessageFormat = "Test method '{0}' should follow pattern 'MethodName_Scenario_ExpectedResult'";
private const string Category = "Testing";
private static readonly DiagnosticDescriptor Rule = new DiagnosticDescriptor(
DiagnosticId,
Title,
MessageFormat,
Category,
DiagnosticSeverity.Warning,
isEnabledByDefault: true,
description: "Unit test methods should use the naming pattern MethodName_Scenario_ExpectedResult for clarity.");
public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics => ImmutableArray.Create(Rule);
public override void Initialize(AnalysisContext context)
{
context.ConfigureGeneratedCodeAnalysis(GeneratedCodeAnalysisFlags.None);
context.EnableConcurrentExecution();
context.RegisterSyntaxNodeAction(AnalyzeMethod, SyntaxKind.MethodDeclaration);
}
private void AnalyzeMethod(SyntaxNodeAnalysisContext context)
{
var methodDeclaration = (MethodDeclarationSyntax)context.Node;
var methodSymbol = context.SemanticModel.GetDeclaredSymbol(methodDeclaration);
if (methodSymbol == null)
return;
// Check if method has [Fact] or [Theory] attribute (xUnit)
var hasTestAttribute = methodSymbol.GetAttributes().Any(attr =>
attr.AttributeClass?.Name == "FactAttribute" ||
attr.AttributeClass?.Name == "TheoryAttribute" ||
attr.AttributeClass?.Name == "TestAttribute" || // NUnit
attr.AttributeClass?.Name == "TestMethodAttribute"); // MSTest
if (!hasTestAttribute)
return;
var methodName = methodSymbol.Name;
// Validate naming pattern: MethodName_Scenario_ExpectedResult
// Must have at least 2 underscores
var underscoreCount = methodName.Count(c => c == '_');
if (underscoreCount < 2)
{
var diagnostic = Diagnostic.Create(Rule, methodDeclaration.Identifier.GetLocation(), methodName);
context.ReportDiagnostic(diagnostic);
return;
}
// Validate each segment is PascalCase
var segments = methodName.Split('_');
foreach (var segment in segments)
{
if (string.IsNullOrWhiteSpace(segment) || !char.IsUpper(segment[0]))
{
var diagnostic = Diagnostic.Create(Rule, methodDeclaration.Identifier.GetLocation(), methodName);
context.ReportDiagnostic(diagnostic);
return;
}
}
}
}
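The same convention can also be checked outside the compiler, e.g. in a pre-commit hook. An illustrative Python sketch mirroring the analyzer's logic (at least two underscores, every segment starting with an uppercase letter):

```python
import re

# A valid segment starts with an uppercase letter, then letters/digits.
SEGMENT = re.compile(r"^[A-Z][A-Za-z0-9]*$")

def follows_convention(name: str) -> bool:
    """True when a test name matches MethodName_Scenario_ExpectedResult."""
    segments = name.split("_")
    if len(segments) < 3:  # at least two underscores required
        return False
    return all(SEGMENT.match(s) for s in segments)

print(follows_convention("CreateAuditRecord_WithValidData_ReturnsSuccess"))  # → True
print(follows_convention("Test1"))  # → False
```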
Unit Test Quality Examples (C#):
// ✅ GOOD: High-quality unit test following AAA pattern
[Fact]
public void CreateAuditRecord_WithValidData_ReturnsSuccess()
{
// Arrange
var service = new AuditRecordService();
var request = new CreateAuditRecordRequest
{
TenantId = Guid.NewGuid(),
Action = "UserLogin",
UserId = "user-123",
Timestamp = DateTime.UtcNow
};
// Act
var result = service.CreateAuditRecord(request);
// Assert
Assert.NotNull(result);
Assert.True(result.Success);
Assert.NotEqual(Guid.Empty, result.RecordId);
}
// ✅ GOOD: Edge case testing
[Theory]
[InlineData(null)]
[InlineData("")]
[InlineData(" ")]
public void CreateAuditRecord_WithInvalidAction_ThrowsArgumentException(string invalidAction)
{
// Arrange
var service = new AuditRecordService();
var request = new CreateAuditRecordRequest { Action = invalidAction };
// Act & Assert
Assert.Throws<ArgumentException>(() => service.CreateAuditRecord(request));
}
// ❌ BAD: Poor naming, no AAA structure
[Fact]
public void Test1()
{
var service = new AuditRecordService();
var result = service.CreateAuditRecord(new CreateAuditRecordRequest());
Assert.NotNull(result);
}
// ❌ BAD: Multiple responsibilities (should be 2 separate tests)
[Fact]
public void CreateAndUpdateAuditRecord_ReturnsSuccess()
{
var service = new AuditRecordService();
var createResult = service.CreateAuditRecord(new CreateAuditRecordRequest());
Assert.True(createResult.Success);
var updateResult = service.UpdateAuditRecord(createResult.RecordId, new UpdateRequest());
Assert.True(updateResult.Success);
}
Integration Test Quality Gates¶
Purpose: Ensure high-quality integration tests that validate service interactions, data persistence, and tenant isolation.
Integration Test Quality Criteria:
# Integration test validation gates
integrationTestGates:
# Quantitative thresholds
minTests: 20 # Minimum integration tests per service
maxDuration: 300 # Maximum total suite duration (5 minutes)
# Service container requirements
serviceContainers:
required:
- redis # Cache integration
- sql # Database integration
- rabbitmq # Message bus integration
optional:
- otel # Observability integration
- seq # Logging integration
- cosmos # NoSQL integration (for Query service)
# Functional requirements
isolationVerified: true # Tenant isolation tests required
contractTests: true # API contract validation required
errorScenarios: true # Error handling tests required
retryLogic: true # Retry/resilience tests required
# Data management
testDataIsolation: true # Each test uses isolated data
cleanupVerified: true # Test data cleanup verified
idempotency: true # Idempotency tests required (for state-mutating ops)
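The required-container check amounts to a set difference between what the gate demands and what is actually running. An illustrative Python sketch (the pipeline's real check shells out to `docker ps`, as in the PowerShell validator that follows):

```python
# Containers the integration suite requires, from the gate definition above.
REQUIRED = {"redis", "sql", "rabbitmq"}

def missing_containers(running: set) -> set:
    """Required service containers that are not currently running."""
    return REQUIRED - running

print(sorted(missing_containers({"redis", "sql"})))  # → ['rabbitmq']
```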
Integration Test Quality Validation (PowerShell):
# Validate-IntegrationTestQuality.ps1 — Enforce integration test quality
param(
[string]$TestResultsPath = "TestResults",
[int]$MinTests = 20,
[int]$MaxDurationSeconds = 300,
[string[]]$RequiredContainers = @("redis", "sql", "rabbitmq")
)
$ErrorActionPreference = "Stop"
Write-Host "Validating integration test quality..." -ForegroundColor Cyan
# Parse test results
$testResultFiles = Get-ChildItem -Path $TestResultsPath -Filter "*integration*.trx" -Recurse
if ($testResultFiles.Count -eq 0) {
Write-Error "No integration test result files found"
exit 1
}
$totalTests = 0
$totalDuration = 0
$isolationTests = 0
$contractTests = 0
$errorScenarioTests = 0
foreach ($file in $testResultFiles) {
[xml]$trx = Get-Content $file.FullName
$ns = @{ns = "http://microsoft.com/schemas/VisualStudio/TeamTest/2010"}
$testResults = $trx | Select-Xml -XPath "//ns:UnitTestResult" -Namespace $ns
$totalTests += $testResults.Count
foreach ($result in $testResults) {
# TRX duration is a TimeSpan string (hh:mm:ss.fffffff)
if ($result.Node.duration) {
$totalDuration += ([TimeSpan]$result.Node.duration).TotalSeconds
}
# Count specific test categories
$testName = $result.Node.testName
if ($testName -match "TenantIsolation") { $isolationTests++ }
if ($testName -match "Contract") { $contractTests++ }
if ($testName -match "Error|Exception") { $errorScenarioTests++ }
}
}
Write-Host "Integration Test Metrics:" -ForegroundColor Yellow
Write-Host " Total Integration Tests: $totalTests" -ForegroundColor White
Write-Host " Total Duration: ${totalDuration}s" -ForegroundColor White
Write-Host " Tenant Isolation Tests: $isolationTests" -ForegroundColor White
Write-Host " Contract Tests: $contractTests" -ForegroundColor White
Write-Host " Error Scenario Tests: $errorScenarioTests" -ForegroundColor White
# Validate thresholds
$failed = $false
if ($totalTests -lt $MinTests) {
Write-Error "Insufficient integration tests: $totalTests < $MinTests"
$failed = $true
}
if ($totalDuration -gt $MaxDurationSeconds) {
Write-Error "Integration test suite too slow: ${totalDuration}s > ${MaxDurationSeconds}s"
$failed = $true
}
if ($isolationTests -eq 0) {
Write-Error "No tenant isolation tests found (required)"
$failed = $true
}
if ($contractTests -eq 0) {
Write-Error "No API contract tests found (required)"
$failed = $true
}
# Verify service containers are running
foreach ($container in $RequiredContainers) {
$running = docker ps --filter "name=$container" --filter "status=running" --format "{{.Names}}"
if (-not $running) {
Write-Error "Required service container not running: $container"
$failed = $true
}
}
if ($failed) {
Write-Host "Integration test quality gates FAILED" -ForegroundColor Red
exit 1
}
Write-Host "Integration test quality gates PASSED" -ForegroundColor Green
exit 0
Azure Pipelines Integration:
# azure-pipelines.yml — Integration test quality gates
- stage: CI_Stage
jobs:
- job: Build_Test_Validate
# Service containers for integration tests
services:
redis: redis
sql: mssql
rabbitmq: rabbitmq
otel: otel-collector
seq: seq
steps:
# ... build steps ...
# Run integration tests
- task: DotNetCoreCLI@2
displayName: 'Run Integration Tests'
inputs:
command: 'test'
projects: '**/*Tests.csproj'
arguments: '--configuration Release --filter Category=Integration --logger trx --results-directory $(Agent.TempDirectory)/TestResults'
publishTestResults: true
env:
ConnectionStrings__Redis: 'redis:6379'
ConnectionStrings__Database: 'Server=sql;Database=ATP_Test;User Id=sa;Password=P@ssw0rd123!'
ConnectionStrings__RabbitMQ: 'amqp://guest:guest@rabbitmq:5672'
# Validate integration test quality
- task: PowerShell@2
displayName: 'Validate Integration Test Quality'
inputs:
filePath: 'scripts/Validate-IntegrationTestQuality.ps1'
arguments: >
-TestResultsPath "$(Agent.TempDirectory)/TestResults"
-MinTests 20
-MaxDurationSeconds 300
-RequiredContainers @("redis", "sql", "rabbitmq")
continueOnError: false
Integration Test Examples (C#):
// ✅ GOOD: Tenant isolation integration test
[Fact]
[Trait("Category", "Integration")]
public async Task CreateAuditRecord_TenantIsolation_RecordsNotVisibleAcrossTenants()
{
// Arrange
var tenant1Id = Guid.NewGuid();
var tenant2Id = Guid.NewGuid();
var service = new AuditRecordService(_dbContext, _cache);
var record1 = new CreateAuditRecordRequest
{
TenantId = tenant1Id,
Action = "UserLogin",
UserId = "user-1"
};
var record2 = new CreateAuditRecordRequest
{
TenantId = tenant2Id,
Action = "UserLogin",
UserId = "user-2"
};
// Act
var result1 = await service.CreateAuditRecordAsync(record1);
var result2 = await service.CreateAuditRecordAsync(record2);
var tenant1Records = await service.QueryAuditRecordsAsync(new QueryRequest { TenantId = tenant1Id });
var tenant2Records = await service.QueryAuditRecordsAsync(new QueryRequest { TenantId = tenant2Id });
// Assert
Assert.Single(tenant1Records);
Assert.Single(tenant2Records);
Assert.Equal(result1.RecordId, tenant1Records.First().Id);
Assert.Equal(result2.RecordId, tenant2Records.First().Id);
Assert.DoesNotContain(tenant1Records, r => r.TenantId == tenant2Id);
Assert.DoesNotContain(tenant2Records, r => r.TenantId == tenant1Id);
}
// ✅ GOOD: Error scenario integration test
[Fact]
[Trait("Category", "Integration")]
public async Task CreateAuditRecord_DatabaseUnavailable_ThrowsServiceException()
{
// Arrange: simulate database failure by stopping the SQL container
await _dockerCompose.StopAsync("sql");
try
{
var service = new AuditRecordService(_dbContext, _cache);
var request = new CreateAuditRecordRequest { TenantId = Guid.NewGuid(), Action = "Test" };
// Act & Assert
await Assert.ThrowsAsync<ServiceException>(() => service.CreateAuditRecordAsync(request));
}
finally
{
// Cleanup: restart SQL even when the assertion fails, so later tests are unaffected
await _dockerCompose.StartAsync("sql");
}
}
// ✅ GOOD: Contract validation test
[Fact]
[Trait("Category", "Integration")]
public async Task CreateAuditRecord_Contract_ResponseMatchesOpenAPISchema()
{
// Arrange
var client = _factory.CreateClient();
var request = new CreateAuditRecordRequest { TenantId = Guid.NewGuid(), Action = "Test" };
// Act
var response = await client.PostAsJsonAsync("/api/audit-records", request);
// Assert
response.EnsureSuccessStatusCode();
var json = await response.Content.ReadAsStringAsync();
var schema = await LoadOpenAPISchemaAsync("CreateAuditRecordResponse");
var validationResult = _schemaValidator.Validate(json, schema);
Assert.True(validationResult.IsValid, $"Response does not match schema: {string.Join(", ", validationResult.Errors)}");
}
Regression Test Quality Gates (Staging)¶
Purpose: Ensure comprehensive regression testing in staging environment before production deployment.
Regression Test Quality Criteria:
# Regression test validation gates (Staging environment)
regressionTestGates:
# Pass rate requirements
passRate: 100 # All regression tests must pass
criticalScenariosPass: 100 # All @security, @compliance tests must pass
# Coverage matrix
environmentCoverage:
- dev # Basic smoke tests
- test # Full regression suite
- staging # Production-like regression + load tests
# Scenario coverage
tenantScenarios:
- single # Single-tenant scenarios
- multi # Multi-tenant scenarios
- isolation # Tenant isolation validation
# Test categories (required)
requiredCategories:
- smoke # Critical path smoke tests
- regression # Full regression suite
- security # Security regression tests
- compliance # Compliance validation tests
- performance # Performance regression tests
# Duration thresholds
maxDuration: 900 # 15 minutes maximum
parallelization: true # Tests must support parallel execution
Regression Test Quality Validation (PowerShell):
# Validate-RegressionTestQuality.ps1 — Staging regression test validation
param(
[string]$TestResultsPath = "TestResults",
[int]$RequiredPassRate = 100,
[string]$Environment = "Staging",
[int]$MaxDurationSeconds = 900
)
$ErrorActionPreference = "Stop"
Write-Host "Validating regression test quality for $Environment..." -ForegroundColor Cyan
# Parse test results
$testResultFiles = Get-ChildItem -Path $TestResultsPath -Filter "*regression*.trx" -Recurse
if ($testResultFiles.Count -eq 0) {
Write-Error "No regression test result files found"
exit 1
}
$totalTests = 0
$passedTests = 0
$failedTests = 0
$criticalTests = 0
$criticalPassed = 0
$totalDuration = 0
$smokeTests = 0
$securityTests = 0
$complianceTests = 0
$performanceTests = 0
foreach ($file in $testResultFiles) {
[xml]$trx = Get-Content $file.FullName
$ns = @{ns = "http://microsoft.com/schemas/VisualStudio/TeamTest/2010"}
$testResults = $trx | Select-Xml -XPath "//ns:UnitTestResult" -Namespace $ns
$totalTests += $testResults.Count
foreach ($result in $testResults) {
# TRX duration is a TimeSpan string (hh:mm:ss.fffffff)
if ($result.Node.duration) {
$totalDuration += ([TimeSpan]$result.Node.duration).TotalSeconds
}
# Pass/Fail
if ($result.Node.outcome -eq "Passed") {
$passedTests++
} else {
$failedTests++
}
# Critical tests (@security, @compliance tags)
$testName = $result.Node.testName
if ($testName -match "@security|@compliance") {
$criticalTests++
if ($result.Node.outcome -eq "Passed") {
$criticalPassed++
}
}
# Category counts
if ($testName -match "Smoke") { $smokeTests++ }
if ($testName -match "Security") { $securityTests++ }
if ($testName -match "Compliance") { $complianceTests++ }
if ($testName -match "Performance") { $performanceTests++ }
}
}
$passRate = ($passedTests / $totalTests) * 100
$criticalPassRate = if ($criticalTests -gt 0) { ($criticalPassed / $criticalTests) * 100 } else { 100 }
Write-Host "Regression Test Metrics ($Environment):" -ForegroundColor Yellow
Write-Host " Total Regression Tests: $totalTests" -ForegroundColor White
Write-Host " Passed: $passedTests" -ForegroundColor Green
Write-Host " Failed: $failedTests" -ForegroundColor $(if ($failedTests -gt 0) { "Red" } else { "White" })
Write-Host " Pass Rate: ${passRate}%" -ForegroundColor White
Write-Host " Critical Tests: $criticalTests" -ForegroundColor White
Write-Host " Critical Pass Rate: ${criticalPassRate}%" -ForegroundColor White
Write-Host " Total Duration: ${totalDuration}s" -ForegroundColor White
Write-Host "" -ForegroundColor White
Write-Host " Category Breakdown:" -ForegroundColor Yellow
Write-Host " Smoke: $smokeTests" -ForegroundColor White
Write-Host " Security: $securityTests" -ForegroundColor White
Write-Host " Compliance: $complianceTests" -ForegroundColor White
Write-Host " Performance: $performanceTests" -ForegroundColor White
# Validate thresholds
$failed = $false
if ($passRate -lt $RequiredPassRate) {
Write-Error "Regression test pass rate too low: ${passRate}% < ${RequiredPassRate}%"
$failed = $true
}
if ($criticalPassRate -lt 100) {
Write-Error "Critical tests failed: ${criticalPassRate}% pass rate (must be 100%)"
$failed = $true
}
if ($totalDuration -gt $MaxDurationSeconds) {
Write-Error "Regression test suite too slow: ${totalDuration}s > ${MaxDurationSeconds}s"
$failed = $true
}
# Validate required categories
if ($smokeTests -eq 0) {
Write-Error "No smoke tests found (required)"
$failed = $true
}
if ($securityTests -eq 0) {
Write-Error "No security tests found (required)"
$failed = $true
}
if ($complianceTests -eq 0) {
Write-Error "No compliance tests found (required)"
$failed = $true
}
if ($failed) {
Write-Host "Regression test quality gates FAILED" -ForegroundColor Red
exit 1
}
Write-Host "Regression test quality gates PASSED" -ForegroundColor Green
exit 0
Azure Pipelines Integration:
# azure-pipelines.yml — Regression test quality gates
- stage: Deploy_Staging
dependsOn: CI_Stage
jobs:
- deployment: DeployToStaging
environment: ATP-Staging
strategy:
runOnce:
deploy:
steps:
# Deploy to staging
- template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
parameters:
azureSubscription: $(azureSubscription)
appName: atp-ingestion-staging
package: $(Pipeline.Workspace)/drop/*.zip
# Wait for deployment to stabilize
- task: PowerShell@2
displayName: 'Wait for Service Stabilization'
inputs:
targetType: 'inline'
script: Start-Sleep -Seconds 60
# Run regression tests
- task: DotNetCoreCLI@2
displayName: 'Run Regression Tests'
inputs:
command: 'test'
projects: '**/*RegressionTests.csproj'
arguments: '--configuration Release --logger trx --results-directory $(Agent.TempDirectory)/TestResults'
publishTestResults: true
env:
TestEnvironment: 'Staging'
BaseUrl: 'https://atp-ingestion-staging.azurewebsites.net'
# Validate regression test quality
- task: PowerShell@2
displayName: 'Validate Regression Test Quality'
inputs:
filePath: 'scripts/Validate-RegressionTestQuality.ps1'
arguments: >
-TestResultsPath "$(Agent.TempDirectory)/TestResults"
-RequiredPassRate 100
-Environment "Staging"
-MaxDurationSeconds 900
continueOnError: false
# On failure: rollback
- task: PowerShell@2
displayName: 'Rollback on Test Failure'
condition: failed()
inputs:
targetType: 'inline'
script: |
Write-Host "Regression tests failed; rolling back deployment..."
az webapp deployment slot swap `
--name atp-ingestion-staging `
--resource-group ATP-Staging-RG `
--slot staging `
--target-slot production `
--action swap
Regression Test Examples (C# with SpecFlow):
// ✅ GOOD: BDD-style regression test with Gherkin
[Binding]
public class AuditRecordRegressionSteps
{
private readonly ScenarioContext _scenarioContext;
private readonly HttpClient _client;
private HttpResponseMessage _response;
public AuditRecordRegressionSteps(ScenarioContext scenarioContext)
{
_scenarioContext = scenarioContext;
_client = new HttpClient { BaseAddress = new Uri(Environment.GetEnvironmentVariable("BaseUrl")) };
}
[Given(@"a tenant with ID ""(.*)""")]
public void GivenATenantWithID(string tenantId)
{
_scenarioContext["TenantId"] = Guid.Parse(tenantId);
}
[When(@"I create an audit record with action ""(.*)""")]
public async Task WhenICreateAnAuditRecordWithAction(string action)
{
var request = new CreateAuditRecordRequest
{
TenantId = (Guid)_scenarioContext["TenantId"],
Action = action,
UserId = "test-user",
Timestamp = DateTime.UtcNow
};
_response = await _client.PostAsJsonAsync("/api/audit-records", request);
_scenarioContext["Response"] = _response;
}
[Then(@"the response status code should be (.*)")]
public void ThenTheResponseStatusCodeShouldBe(int expectedStatusCode)
{
var response = (HttpResponseMessage)_scenarioContext["Response"];
Assert.Equal(expectedStatusCode, (int)response.StatusCode);
}
[Then(@"the audit record should be retrievable")]
public async Task ThenTheAuditRecordShouldBeRetrievable()
{
var response = (HttpResponseMessage)_scenarioContext["Response"];
var createResult = await response.Content.ReadFromJsonAsync<CreateAuditRecordResponse>();
var getResponse = await _client.GetAsync($"/api/audit-records/{createResult.RecordId}");
Assert.Equal(HttpStatusCode.OK, getResponse.StatusCode);
var record = await getResponse.Content.ReadFromJsonAsync<AuditRecordDto>();
Assert.Equal(createResult.RecordId, record.Id);
}
[Then(@"the audit record should be immutable")]
// Category is driven by the @compliance tag in the feature file; an xUnit [Trait] on a step binding has no effect
public async Task ThenTheAuditRecordShouldBeImmutable()
{
var response = (HttpResponseMessage)_scenarioContext["Response"];
var createResult = await response.Content.ReadFromJsonAsync<CreateAuditRecordResponse>();
// Attempt to update the record (should fail)
var updateRequest = new { Action = "ModifiedAction" };
var updateResponse = await _client.PutAsJsonAsync($"/api/audit-records/{createResult.RecordId}", updateRequest);
Assert.Equal(HttpStatusCode.MethodNotAllowed, updateResponse.StatusCode);
}
}
Gherkin Feature File:
# AuditRecordRegression.feature
Feature: Audit Record Regression Tests
As a system operator
I want to ensure audit records work correctly in staging
So that production deployments are safe
@smoke @regression
Scenario: Create and retrieve audit record
Given a tenant with ID "00000000-0000-0000-0000-000000000001"
When I create an audit record with action "UserLogin"
Then the response status code should be 201
And the audit record should be retrievable
@security @compliance @regression
Scenario: Audit records are immutable
Given a tenant with ID "00000000-0000-0000-0000-000000000001"
When I create an audit record with action "UserLogin"
Then the response status code should be 201
And the audit record should be immutable
@security @regression
Scenario: Tenant isolation is enforced
Given a tenant with ID "00000000-0000-0000-0000-000000000001"
And another tenant with ID "00000000-0000-0000-0000-000000000002"
When I create an audit record for tenant 1
And I query audit records for tenant 2
Then the response should contain zero records
And the tenant 1 record should not be visible
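The tenant-isolation scenario above exercises step bindings that are not shown. Below is a minimal sketch of what they could look like, reusing the request/response DTOs from the earlier binding class; the `tenantId` query-string filter and the step/class names here are assumptions for illustration, not confirmed ATP API surface:

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;
using TechTalk.SpecFlow;
using Xunit;

[Binding]
public class TenantIsolationRegressionSteps
{
    private readonly ScenarioContext _scenarioContext;
    private readonly HttpClient _client;

    public TenantIsolationRegressionSteps(ScenarioContext scenarioContext)
    {
        _scenarioContext = scenarioContext;
        _client = new HttpClient { BaseAddress = new Uri(Environment.GetEnvironmentVariable("BaseUrl")) };
    }

    [Given(@"another tenant with ID ""(.*)""")]
    public void GivenAnotherTenantWithID(string tenantId)
    {
        _scenarioContext["OtherTenantId"] = Guid.Parse(tenantId);
    }

    [When(@"I create an audit record for tenant 1")]
    public async Task WhenICreateAnAuditRecordForTenant1()
    {
        var request = new CreateAuditRecordRequest
        {
            TenantId = (Guid)_scenarioContext["TenantId"],
            Action = "TenantIsolationProbe",
            UserId = "test-user",
            Timestamp = DateTime.UtcNow
        };
        var response = await _client.PostAsJsonAsync("/api/audit-records", request);
        var created = await response.Content.ReadFromJsonAsync<CreateAuditRecordResponse>();
        _scenarioContext["Tenant1RecordId"] = created.RecordId;
    }

    [When(@"I query audit records for tenant 2")]
    public async Task WhenIQueryAuditRecordsForTenant2()
    {
        // Assumes the query API filters by a tenantId query-string parameter
        var otherTenantId = (Guid)_scenarioContext["OtherTenantId"];
        var response = await _client.GetAsync($"/api/audit-records?tenantId={otherTenantId}");
        _scenarioContext["Tenant2Records"] = await response.Content.ReadFromJsonAsync<List<AuditRecordDto>>();
    }

    [Then(@"the response should contain zero records")]
    public void ThenTheResponseShouldContainZeroRecords()
    {
        var records = (List<AuditRecordDto>)_scenarioContext["Tenant2Records"];
        Assert.Empty(records);
    }

    [Then(@"the tenant 1 record should not be visible")]
    public void ThenTheTenant1RecordShouldNotBeVisible()
    {
        var records = (List<AuditRecordDto>)_scenarioContext["Tenant2Records"];
        Assert.DoesNotContain(records, r => r.Id.Equals(_scenarioContext["Tenant1RecordId"]));
    }
}
```

Because these steps talk to a live staging endpoint, they only run meaningfully inside the regression suite with `BaseUrl` set, as in the pipeline configuration above.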
Test Quality Metrics & Reporting¶
Purpose: Track and report on test quality metrics to enable continuous improvement.
Test Quality Scorecard:
| Metric | Target | Current | Status |
|---|---|---|---|
| Unit Test Count | ≥50 per service | 67 | ✅ Pass |
| Unit Test Duration | <30s | 24s | ✅ Pass |
| Unit Test Pass Rate | 100% | 100% | ✅ Pass |
| Unit Test Flaky Rate | <5% | 2.1% | ✅ Pass |
| Assertion Density | ≥1.5 | 1.8 | ✅ Pass |
| Integration Test Count | ≥20 per service | 28 | ✅ Pass |
| Integration Test Duration | <5min | 4min 12s | ✅ Pass |
| Tenant Isolation Tests | ≥5 | 8 | ✅ Pass |
| Contract Tests | ≥10 | 12 | ✅ Pass |
| Regression Test Pass Rate | 100% | 100% | ✅ Pass |
| Critical Scenario Pass Rate | 100% | 100% | ✅ Pass |
| Regression Test Duration | <15min | 12min 45s | ✅ Pass |
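The scorecard values above are simple derivations from raw test results. A minimal sketch of the calculations behind pass rate, flaky rate, and assertion density, using illustrative counts rather than real ATP data:

```csharp
using System;

public static class TestQualityMetrics
{
    // Pass rate: passed tests as a percentage of executed tests
    public static double PassRate(int passed, int total) =>
        total == 0 ? 0 : passed * 100.0 / total;

    // Flaky rate: tests that both passed and failed across recent runs,
    // as a percentage of all tests
    public static double FlakyRate(int flaky, int total) =>
        total == 0 ? 0 : flaky * 100.0 / total;

    // Assertion density: average number of assertions per test method
    public static double AssertionDensity(int assertions, int tests) =>
        tests == 0 ? 0 : (double)assertions / tests;

    public static void Main()
    {
        // Illustrative counts roughly matching the scorecard above
        Console.WriteLine(PassRate(67, 67));          // 100 (meets the 100% target)
        Console.WriteLine(FlakyRate(1, 48));          // ~2.08 (under the 5% threshold)
        Console.WriteLine(AssertionDensity(120, 67)); // ~1.79 (above the 1.5 target)
    }
}
```

The validators shown earlier compute the same ratios in PowerShell before comparing them to thresholds.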
Test Quality Dashboard (KQL):
// Query test quality metrics from Azure DevOps
TestResults
| where TestSuite in ("UnitTests", "IntegrationTests", "RegressionTests")
| where CompletedDate >= ago(7d)
| summarize
TotalTests = count(),
PassedTests = countif(Outcome == "Passed"),
FailedTests = countif(Outcome == "Failed"),
FlakyTests = countif(Outcome == "Failed" and TestName contains "Flaky"),
AvgDuration = avg(Duration),
MaxDuration = max(Duration)
by TestSuite, bin(CompletedDate, 1d)
| extend PassRate = (PassedTests * 100.0) / TotalTests
| extend FlakyRate = (FlakyTests * 100.0) / TotalTests
| project
Date = CompletedDate,
TestSuite,
TotalTests,
PassRate,
FlakyRate,
AvgDurationSeconds = AvgDuration / 1000,
MaxDurationSeconds = MaxDuration / 1000
| order by Date desc, TestSuite
Summary¶
- Unit Test Quality Gates: Min 50 tests, <30s duration, <5% flaky rate, ≥1.5 assertion density, ≤3 quarantined tests, naming convention enforced (ATP003 analyzer)
- Integration Test Quality Gates: Min 20 tests, <5min duration, required service containers (redis, sql, rabbitmq), tenant isolation/contract/error scenario tests required
- Regression Test Quality Gates: 100% pass rate, 100% critical scenario pass rate, <15min duration, environment/tenant/category coverage validated
- PowerShell Validators: 3 quality gate validation scripts (unit, integration, regression) integrated into Azure Pipelines
- Test Examples: 10+ C# examples demonstrating AAA pattern, tenant isolation, error handling, contract validation, BDD/Gherkin
- Test Quality Metrics: 12-metric scorecard tracked via KQL dashboard, test quality trends analyzed for continuous improvement
Governance & Continuous Evolution¶
Purpose: Establish clear ownership for each quality gate category and define an evolution roadmap for continuous improvement of quality gate effectiveness.
Governance Principles:
- Owned & Accountable: Each gate type has a designated owner responsible for threshold maintenance and updates
- Regularly Reviewed: Quality gates reviewed quarterly (minimum) or monthly for security/compliance
- Evidence-Based Evolution: Gate thresholds adjusted based on historical data and team capability
- Transparent Communication: Gate changes communicated to all stakeholders with rationale and migration plan
- Continuous Improvement: Roadmap for enhancing gate automation, accuracy, and developer experience
Quality Gate Ownership¶
Purpose: Define clear accountability for each quality gate category with designated owners, reviewers, and update frequency.
Ownership Matrix:
| Gate Type | Owner | Reviewer | Update Frequency | Escalation Path |
|---|---|---|---|---|
| Build Quality | Tech Lead | Architect | Quarterly | CTO |
| Test Coverage | QA Lead | Tech Lead | Quarterly | VP Engineering |
| Security | Security Officer | CISO | Monthly | CISO → Board |
| SBOM & Supply Chain | Security Officer | CISO | Monthly | CISO → Board |
| Compliance | Compliance Officer | DPO (Data Protection Officer) | Monthly | Legal Counsel |
| Performance | SRE Lead | Architect | Quarterly | VP Engineering |
| Observability | SRE Lead | Tech Lead | Quarterly | VP Engineering |
| Contract & API | Architect | Tech Lead | Quarterly | CTO |
| Approval Gates | CAB (Change Advisory Board) | VP Engineering | As-needed | CTO |
Owner Responsibilities:
## Quality Gate Owner Responsibilities
### 1. Threshold Maintenance
- Review gate thresholds quarterly (or monthly for security/compliance)
- Analyze historical gate pass/fail trends
- Recommend threshold adjustments based on team capability and risk tolerance
- Document rationale for threshold changes in ADR (Architecture Decision Record)
### 2. Gate Effectiveness Monitoring
- Track gate precision/recall (true positives, false positives)
- Identify and remediate false positive patterns
- Monitor gate execution time (ensure gates provide fast feedback)
- Review gate failure remediation time (MTTR)
### 3. Stakeholder Communication
- Communicate gate changes to development teams with 2-week notice
- Provide migration guides and examples for new gates
- Conduct training sessions for complex gates (e.g., Roslyn analyzers)
- Publish monthly gate health reports to stakeholders
### 4. Continuous Improvement
- Propose new gates for emerging risks (e.g., AI model validation, privacy-preserving ML)
- Automate manual gates where feasible (e.g., shift approval gates to pre-deployment validation)
- Improve gate error messages for faster remediation
- Contribute to evolution roadmap
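The precision/recall tracking in item 2 can be computed from triaged gate failures. A minimal sketch under the assumption that each gate failure is triaged as a real defect (true positive) or a false alarm (false positive), with illustrative counts:

```csharp
using System;

public static class GateEffectiveness
{
    // Precision: share of gate failures that were real defects.
    public static double Precision(int truePositives, int falsePositives) =>
        (truePositives + falsePositives) == 0
            ? 0
            : (double)truePositives / (truePositives + falsePositives);

    // Recall: share of real defects the gate actually caught
    // (false negatives are defects that slipped past the gate to later stages).
    public static double Recall(int truePositives, int falseNegatives) =>
        (truePositives + falseNegatives) == 0
            ? 0
            : (double)truePositives / (truePositives + falseNegatives);

    public static void Main()
    {
        // Illustrative month of security-gate triage:
        // 47 confirmed findings, 3 false alarms, 1 defect that slipped through.
        Console.WriteLine(Precision(47, 3)); // 0.94
        Console.WriteLine(Recall(47, 1));    // ~0.979
    }
}
```

A rising false-positive count shows up directly as falling precision, which is the signal owners use to tune rules or suppressions.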
Reviewer Responsibilities:
## Quality Gate Reviewer Responsibilities
### 1. Threshold Review & Approval
- Review proposed threshold changes
- Validate rationale and supporting data
- Approve or reject changes based on risk assessment
- Ensure changes align with organizational standards
### 2. Risk Assessment
- Evaluate security/compliance impact of threshold changes
- Identify potential risks from relaxing thresholds
- Recommend compensating controls if thresholds lowered
- Escalate high-risk changes to executive leadership
### 3. Audit & Governance
- Ensure gate changes documented in version control (Git)
- Verify gate changes logged in meta-audit stream
- Validate gate changes comply with SOC 2/GDPR/HIPAA
- Support external audits with gate evidence
Quality Gate Change Request Process:
graph TD
A[Owner Proposes Change] --> B[Document Rationale in ADR]
B --> C[Analyze Historical Data]
C --> D[Create Change Request]
D --> E{Reviewer Approval?}
E -->|No| F[Reject with Feedback]
E -->|Yes| G{Security/Compliance Impact?}
G -->|High| H[Escalate to CISO/DPO]
G -->|Low/Medium| I[Schedule Rollout]
H --> J{Executive Approval?}
J -->|No| F
J -->|Yes| I
I --> K[Communicate to Teams]
K --> L[Update Pipeline Config]
L --> M[Deploy to Dev]
M --> N[Soak Period 2 weeks]
N --> O{Issues Detected?}
O -->|Yes| P[Rollback]
O -->|No| Q[Deploy to Test]
Q --> R[Deploy to Staging]
R --> S[Deploy to Production]
F --> T[Owner Revises Proposal]
P --> T
S --> U[Log Change in Meta-Audit]
style F fill:#ff6b6b
style P fill:#ff6b6b
style S fill:#90EE90
Quality Gate Change Request Template (Azure DevOps Work Item):
# Work Item Type: Quality Gate Change Request
fields:
- field: System.Title
value: "[Gate Change] [Gate Type] — [Change Summary]"
- field: System.Description
value: |
## Change Summary
**Gate Type**: [Build Quality / Test Coverage / Security / etc.]
**Current Threshold**: [Current value]
**Proposed Threshold**: [New value]
**Change Type**: [Tighten / Relax / Add New Gate / Remove Gate]
## Rationale
**Business Justification**:
[Why is this change needed? What problem does it solve?]
**Historical Data**:
- Current pass rate: [X%]
- Average violations per build: [N]
- Remediation time (MTTR): [N hours]
- False positive rate: [X%]
**Risk Assessment**:
- **Security Impact**: [None / Low / Medium / High]
- **Compliance Impact**: [None / Low / Medium / High]
- **Developer Productivity Impact**: [Positive / Neutral / Negative]
## Migration Plan
**Rollout Strategy**: [Gradual / Immediate]
**Soak Period**: [2 weeks in Dev/Test]
**Rollback Criteria**: [If >10% builds fail, rollback]
**Communication Plan**:
- [ ] Email to dev team (2 weeks before)
- [ ] Slack announcement with examples
- [ ] Training session scheduled (if complex change)
- [ ] Documentation updated
**Compensating Controls** (if relaxing threshold):
[What additional controls mitigate risk?]
## Approval Checklist
- [ ] Rationale documented
- [ ] Historical data analyzed
- [ ] Reviewer approved
- [ ] Security/Compliance reviewed (if applicable)
- [ ] Communication plan executed
- [ ] ADR created (for significant changes)
- field: Microsoft.VSTS.Common.Priority
value: 2 # P2 default; P1 for urgent security changes
- field: Custom.GateType
value: TestCoverage
- field: Custom.CurrentThreshold
value: "70%"
- field: Custom.ProposedThreshold
value: "75%"
- field: Custom.Owner
value: qa-lead@connectsoft.example
- field: Custom.Reviewer
value: tech-lead@connectsoft.example
Evolution Roadmap¶
Purpose: Define strategic vision for quality gate evolution, incorporating ML/AI, automation, and continuous improvement.
2025 Evolution Roadmap:
Q1 2025: ML-Based Flaky Test Detection & Auto-Quarantine
Goal: Automatically detect and quarantine flaky tests using machine learning instead of manual threshold-based detection.
Features:
- ML Model: Train model on historical test results (pass/fail patterns, duration variability, environment correlations)
- Auto-Quarantine: Automatically move flaky tests to quarantine suite when ML confidence > 85%
- Root Cause Analysis: Use ML to identify common flaky test patterns (timing issues, resource contention, test order dependencies)
- Self-Healing: Attempt automated fixes (add retries, increase timeouts, improve test isolation)
Implementation (C# + ML.NET):
// FlakyTestPredictor.cs — ML-based flaky test detection
using Microsoft.ML;
using Microsoft.ML.Data;
using System;
using System.Collections.Generic;
using System.Linq;
public class FlakyTestPredictor
{
private readonly MLContext _mlContext;
private ITransformer _model;
public FlakyTestPredictor()
{
_mlContext = new MLContext(seed: 0);
}
public void TrainModel(IEnumerable<TestExecution> historicalData)
{
var dataView = _mlContext.Data.LoadFromEnumerable(historicalData);
// Feature engineering: extract patterns from test executions
var pipeline = _mlContext.Transforms.Categorical.OneHotEncoding("TestName")
.Append(_mlContext.Transforms.NormalizeMinMax("Duration"))
.Append(_mlContext.Transforms.NormalizeMinMax("PassRate"))
.Append(_mlContext.Transforms.Concatenate("Features",
"TestName", "Duration", "PassRate", "FailurePatternCount"))
.Append(_mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
labelColumnName: "IsFlaky", featureColumnName: "Features"));
_model = pipeline.Fit(dataView);
// Evaluate model
var predictions = _model.Transform(dataView);
var metrics = _mlContext.BinaryClassification.Evaluate(predictions, labelColumnName: "IsFlaky");
Console.WriteLine($"Model Accuracy: {metrics.Accuracy:P2}");
Console.WriteLine($"AUC: {metrics.AreaUnderRocCurve:P2}");
}
public FlakyTestPrediction PredictFlakiness(TestExecution test)
{
var predictionEngine = _mlContext.Model.CreatePredictionEngine<TestExecution, FlakyTestPrediction>(_model);
return predictionEngine.Predict(test);
}
}
public class TestExecution
{
public string TestName { get; set; }
public float Duration { get; set; }
public float PassRate { get; set; } // Historical pass rate (0.0 to 1.0)
public float FailurePatternCount { get; set; } // Number of intermittent failure patterns (float so it can be concatenated as a feature)
}
public class FlakyTestPrediction
{
[ColumnName("PredictedLabel")]
public bool IsFlaky { get; set; }
public float Probability { get; set; }
public float Score { get; set; }
}
Success Metrics:
- ML model accuracy > 90%
- False positive rate < 5%
- Auto-quarantine reduces flaky test failures by 50%
Q2 2025: Predictive Gate Failure Analysis (Pre-Commit Warnings)
Goal: Predict quality gate failures before commit using ML analysis of code changes.
Features:
- Pre-Commit Hooks: Analyze code changes locally before commit
- ML Predictions: Predict likelihood of gate failures (coverage, security, complexity)
- IDE Integration: VS Code/Visual Studio extensions show gate predictions in real-time
- Remediation Suggestions: AI suggests fixes (add tests, refactor complex methods, update dependencies)
Implementation (PowerShell pre-commit hook):
#!/usr/bin/env pwsh
# .git/hooks/pre-commit — Predictive gate failure analysis (shebang required so Git can execute the hook directly)
$ErrorActionPreference = "Stop"
Write-Host "Running pre-commit gate analysis..." -ForegroundColor Cyan
# Get staged files
$stagedFiles = git diff --cached --name-only --diff-filter=ACM | Where-Object { $_ -match '\.cs$' }
if ($stagedFiles.Count -eq 0) {
Write-Host "No C# files staged for commit" -ForegroundColor Gray
exit 0
}
# Analyze code changes
$predictions = @()
foreach ($file in $stagedFiles) {
# Call ML API to predict gate failures
$response = Invoke-RestMethod -Uri "https://atp-ml-api.azurewebsites.net/predict/gate-failures" `
-Method Post `
-Body (@{
filePath = $file
changes = (git diff --cached $file)
} | ConvertTo-Json) `
-ContentType "application/json"
if ($response.predictions.Count -gt 0) {
$predictions += $response.predictions
}
}
# Display predictions
if ($predictions.Count -gt 0) {
Write-Host "" -ForegroundColor Yellow
Write-Host "⚠️ Predicted Quality Gate Failures:" -ForegroundColor Yellow
foreach ($prediction in $predictions) {
Write-Host " ❌ $($prediction.gateType): $($prediction.reason)" -ForegroundColor Red
Write-Host " File: $($prediction.file):$($prediction.line)" -ForegroundColor Gray
Write-Host " Confidence: $($prediction.confidence)%" -ForegroundColor Gray
Write-Host " Suggestion: $($prediction.suggestion)" -ForegroundColor Cyan
Write-Host "" -ForegroundColor White
}
# Prompt user
$answer = Read-Host "Proceed with commit anyway? (y/N)"  # separate variable so the ML API response is not shadowed
if ($answer -ne "y") {
Write-Host "Commit aborted. Fix predicted issues and try again." -ForegroundColor Red
exit 1
}
}
Write-Host "✅ Pre-commit analysis passed" -ForegroundColor Green
exit 0
Success Metrics:
- 80% of predicted failures match actual failures
- Developers fix 60% of predicted issues before commit
- Average gate failure remediation time reduced by 40%
Q3 2025: Self-Healing Pipelines (Auto-Retry Transient Failures)
Goal: Automatically detect and retry transient failures (network timeouts, resource contention, flaky dependencies).
Features:
- Failure Pattern Recognition: ML identifies transient vs. permanent failures
- Smart Retry: Automatically retry with exponential backoff and jitter
- Root Cause Logging: Log failure patterns for post-mortem analysis
- Auto-Escalation: Escalate to human if retries exhausted
Implementation (Azure Pipelines YAML):
# Self-healing pipeline with smart retry
steps:
- task: DotNetCoreCLI@2
displayName: 'Run Integration Tests'
inputs:
command: 'test'
arguments: '--configuration Release --filter Category=Integration'
retryCountOnTaskFailure: 3 # Azure Pipelines native retry
env:
RETRY_STRATEGY: 'exponential' # Custom retry with exponential backoff
# Custom retry logic with ML-based transient failure detection
- task: PowerShell@2
displayName: 'Smart Retry on Transient Failures'
condition: failed() # Only run if previous step failed
inputs:
targetType: 'inline'
script: |
# Analyze failure logs
$failureLogs = Get-Content "$(Agent.TempDirectory)/test-logs.txt"
# Call ML API to classify failure (transient vs. permanent)
$response = Invoke-RestMethod -Uri "https://atp-ml-api.azurewebsites.net/classify/failure" `
-Method Post `
-Body (@{ logs = $failureLogs } | ConvertTo-Json) `
-ContentType "application/json"
if ($response.classification -eq "transient") {
Write-Host "Transient failure detected; retrying with exponential backoff..."
for ($i = 1; $i -le 3; $i++) {
$backoff = [Math]::Pow(2, $i) * (Get-Random -Minimum 1000 -Maximum 2000)
Write-Host "Retry attempt $i after ${backoff}ms..."
Start-Sleep -Milliseconds $backoff
# Retry test
dotnet test --configuration Release --filter Category=Integration
if ($LASTEXITCODE -eq 0) {
Write-Host "✅ Retry succeeded"
exit 0
}
}
Write-Host "❌ All retries exhausted; escalating to human"
exit 1
} else {
Write-Host "Permanent failure detected; no retry attempted"
exit 1
}
Success Metrics:
- 70% of transient failures resolved by auto-retry
- Average pipeline success rate increased from 95% to 98%
- Manual intervention reduced by 50%
Q4 2025: AI-Assisted Code Quality Suggestions in PR Reviews
Goal: Provide real-time code quality suggestions in pull request comments using GPT-4 or similar LLMs.
Features:
- Code Review Bot: Automatically reviews PRs and suggests improvements
- Quality Gate Preview: Show predicted gate results before merge
- Best Practice Suggestions: Recommend design patterns, refactorings, test coverage improvements
- Security Vulnerability Detection: Identify potential security issues (SQL injection, XSS, etc.)
Implementation (GitHub Action / Azure DevOps Extension):
# .github/workflows/ai-code-review.yml
name: AI-Assisted Code Review
on:
pull_request:
types: [opened, synchronize]
jobs:
ai-review:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
with:
fetch-depth: 0 # Full history for diff analysis
- name: AI Code Review
uses: connectsoft/ai-code-review-action@v1
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
openai-api-key: ${{ secrets.OPENAI_API_KEY }}
model: 'gpt-4'
review-scope: 'diff' # Only review changed files
# Quality gates to analyze (GitHub Actions `with:` inputs must be strings, so pass a comma-separated list)
gates: 'complexity,test-coverage,security,best-practices,performance'
# Severity thresholds
fail-on: 'critical' # Block PR on critical issues
warn-on: 'high' # Comment warning on high issues
Example AI-Generated PR Comment:
## 🤖 AI Code Review — Ingestion Service PR #456
### ✅ Quality Gates Preview
- **Build Quality**: ✅ Pass (0 errors, 0 warnings)
- **Test Coverage**: ⚠️ Warning (68% line coverage, target 70%)
- **Security**: ✅ Pass (0 vulnerabilities)
- **Complexity**: ⚠️ Warning (1 method exceeds complexity threshold)
### 📝 Suggestions
**1. Test Coverage (Medium Priority)**
// File: AuditRecordService.cs:45-78
public async Task<CreateAuditRecordResult> CreateAuditRecordAsync(CreateAuditRecordRequest request)
{
// ... implementation ...
}
**Issue**: This method has no unit tests.
**Suggestion**: Add unit tests for edge cases:
- Null request
- Invalid tenant ID
- Database connection failure
**Example Test**:
[Fact]
public async Task CreateAuditRecord_WithNullRequest_ThrowsArgumentNullException()
{
var service = new AuditRecordService();
await Assert.ThrowsAsync<ArgumentNullException>(() =>
service.CreateAuditRecordAsync(null));
}
**2. Cyclomatic Complexity (Low Priority)**
// File: QueryOptimizer.cs:142-198
private QueryPlan OptimizeQuery(Query query)
{
// ... complex logic with 15 decision points ...
}
**Issue**: Complexity = 18 (threshold: 15)
**Suggestion**: Extract sub-methods:
- OptimizeFilters(Query query)
- OptimizeJoins(Query query)
- OptimizeSorting(Query query)
**3. Security Best Practice (Medium Priority)**
// File: ExportService.cs:89
var sql = $"SELECT * FROM AuditRecords WHERE TenantId = '{request.TenantId}'";
**Issue**: String interpolation in SQL (potential SQL injection)
**Suggestion**: Use parameterized queries:
var sql = "SELECT * FROM AuditRecords WHERE TenantId = @TenantId";
var parameters = new { TenantId = request.TenantId };
### 📊 Code Health Score: 85/100 (Good)
**Breakdown**:
- Maintainability: 90/100
- Reliability: 85/100
- Security: 80/100
- Test Coverage: 75/100
**Overall**: Code is in good shape. Address test coverage and the SQL injection issue before merge.
Success Metrics:
- 90% of AI suggestions accepted by developers
- Code review time reduced by 30%
- Security vulnerabilities detected pre-merge increased by 50%
Continuous Improvement Framework¶
Purpose: Establish a systematic process for continuously improving quality gate effectiveness.
Improvement Cycle (Monthly):
graph TD
A[Collect Gate Metrics] --> B[Analyze Effectiveness]
B --> C{Issues Identified?}
C -->|No| D[Monitor & Continue]
C -->|Yes| E[Root Cause Analysis]
E --> F[Propose Improvements]
F --> G[Prioritize by Impact]
G --> H[Implement Changes]
H --> I[Deploy to Dev]
I --> J[Validate 2 Weeks]
J --> K{Effective?}
K -->|No| L[Rollback & Revise]
K -->|Yes| M[Deploy to Prod]
L --> F
M --> D
style L fill:#ff6b6b
style M fill:#90EE90
Improvement Backlog (Azure DevOps Board):
| Priority | Improvement | Owner | Effort | Expected Impact |
|---|---|---|---|---|
| P1 | Reduce SonarQube false positives for S3776 | Tech Lead | 2 weeks | High (reduce noise) |
| P1 | Add Roslyn analyzer for async/await patterns | Architect | 3 weeks | High (prevent deadlocks) |
| P2 | Improve coverage exclusion documentation | QA Lead | 1 week | Medium (reduce confusion) |
| P2 | Automate flaky test detection (ML) | SRE Lead | 8 weeks | High (Q1 2025 roadmap) |
| P3 | Add custom SonarQube rules for ATP patterns | Tech Lead | 4 weeks | Medium (ATP-specific quality) |
Quality Gate Retrospective Template:
# Quality Gate Retrospective — [Month Year]
## Metrics Review
| Gate Type | Pass Rate | Avg Failure Time (MTTR) | False Positive Rate | Developer Satisfaction |
|-----------|-----------|-------------------------|---------------------|------------------------|
| Build Quality | 97.5% | 8 min | 2.1% | 8/10 |
| Test Coverage | 95.2% | 12 min | 5.3% | 7/10 |
| Security | 98.1% | 45 min | 8.7% | 6/10 |
## What Went Well
- ✅ Security gate detected 3 critical CVEs before production
- ✅ Coverage gate pushed team to improve from 68% to 73%
- ✅ SBOM gate helped with license compliance audit
## What Needs Improvement
- ❌ SonarQube false positives frustrating developers
- ❌ Dependency check scan too slow (15min average)
- ❌ PII redaction validation has edge cases
## Action Items
- [ ] Tune SonarQube quality profile (remove S1135 TODO rule)
- [ ] Parallelize dependency check scan (target <5min)
- [ ] Improve PII regex patterns (add phone number formats)
- [ ] Document coverage exclusion process (add to developer guide)
## Gate Changes This Month
- ✅ Increased coverage threshold: 70% → 72% (staged rollout)
- ✅ Added ATP003 Roslyn analyzer for test naming
- ✅ Relaxed SonarQube complexity threshold: 10 → 15 (with ADR)
## Developer Feedback Highlights
> "Coverage gate is helpful but sometimes blocks hotfixes. Need emergency bypass process."
> — Developer A
> "SBOM generation is great for compliance but slows down builds. Can we cache?"
> — Developer B
## Next Month Focus
1. Address SonarQube false positives
2. Optimize dependency check performance
3. Pilot ML-based flaky test detection (Q1 2025 roadmap)
Summary¶
- Quality Gate Ownership: 9-gate ownership matrix (owner, reviewer, update frequency, escalation path), owner/reviewer responsibilities, change request process with Mermaid diagram, Azure DevOps change request template
- Evolution Roadmap: 4 quarters of innovation (Q1: ML flaky test detection, Q2: predictive gate failure analysis, Q3: self-healing pipelines, Q4: AI-assisted PR reviews), C#/PowerShell/YAML implementations, success metrics per quarter
- Continuous Improvement Framework: Monthly improvement cycle (Mermaid diagram), improvement backlog (Azure DevOps board), quality gate retrospective template
Appendix A — Quality Gate Summary Matrix¶
Purpose: Provide comprehensive reference for all quality gates with thresholds, enforcement points, blocker status, and applicable environments.
| Gate | Threshold | Enforcement | Blocker | Environment | Owner | Bypass Allowed |
|---|---|---|---|---|---|---|
| Build Errors | 0 | CI (Build) | ✅ Yes | All | Tech Lead | ❌ No |
| Build Warnings | 0 (TreatWarningsAsErrors) | CI (Build) | ✅ Yes | All | Tech Lead | ❌ No |
| Line Coverage | ≥70% (per service) | CI (Test) | ✅ Yes | All | QA Lead | ⚠️ Emergency only |
| Branch Coverage | ≥60% (per service) | CI (Test) | ✅ Yes | All | QA Lead | ⚠️ Emergency only |
| SonarQube Bugs | 0 | CI (Build) | ✅ Yes | All | Tech Lead | ❌ No |
| SonarQube Vulnerabilities | 0 | CI (Build) | ✅ Yes | All | Security Officer | ❌ No |
| SonarQube Code Smells | ≤10 (minor) | CI (Build) | ⚠️ Warning | All | Tech Lead | ✅ Yes (with review) |
| Critical CVEs (CVSS 9-10) | 0 | CI (Security Scan) | ✅ Yes | All | Security Officer | ❌ No |
| High CVEs (CVSS 7-8.9) | 0 | CI (Security Scan) | ✅ Yes | All | Security Officer | ⚠️ With risk acceptance |
| Medium CVEs (CVSS 4-6.9) | Fix within 30 days | CI (Security Scan) | ⚠️ Warning | All | Security Officer | ✅ Yes |
| Secrets Detected | 0 | CI (Security Scan) | ✅ Yes | All | Security Officer | ❌ No |
| SBOM Generated | Required (CycloneDX) | CI (Build) | ✅ Yes | All | Security Officer | ❌ No |
| Container Scan (Critical) | 0 Critical | CI (Build) | ✅ Yes | Staging/Prod | Security Officer | ❌ No |
| Container Scan (High) | 0 High | CI (Build) | ✅ Yes | Staging/Prod | Security Officer | ⚠️ With risk acceptance |
| API Breaking Changes | 0 | CI (Build) | ✅ Yes | All | Architect | ⚠️ With versioning |
| Message Schema Breaking Changes | 0 | CI (Build) | ✅ Yes | All | Architect | ⚠️ With versioning |
| PII in Logs | 0 | CI (Compliance) | ✅ Yes | All | Compliance Officer | ❌ No |
| Audit Logging Violations (ATP001) | 0 | CI (Compliance) | ✅ Yes | All | Compliance Officer | ❌ No |
| Data Classification Missing (ATP002) | 0 | CI (Compliance) | ✅ Yes | All | Compliance Officer | ❌ No |
| Test Naming Convention (ATP003) | 0 violations | CI (Test) | ⚠️ Warning | All | QA Lead | ✅ Yes |
| Unit Test Count | ≥50 per service | CI (Test) | ⚠️ Warning | All | QA Lead | ✅ Yes |
| Integration Test Count | ≥20 per service | CI (Test) | ⚠️ Warning | All | QA Lead | ✅ Yes |
| Flaky Test Rate | <5% | CI (Test) | ⚠️ Warning | All | QA Lead | ✅ Yes |
| Quarantined Tests | ≤3 | CI (Test) | ⚠️ Warning | All | QA Lead | ✅ Yes |
| p50 Latency | <100ms | Staging (Load Test) | ⚠️ Warning | Staging | SRE Lead | ✅ Yes |
| p95 Latency | <500ms | Staging (Load Test) | ✅ Yes (prod) | Staging | SRE Lead | ❌ No (prod) |
| p99 Latency | <1000ms | Staging (Load Test) | ⚠️ Warning | Staging | SRE Lead | ✅ Yes |
| Error Rate | <0.1% | Staging (Load Test) | ✅ Yes (prod) | Staging | SRE Lead | ❌ No (prod) |
| Throughput | ≥1000 RPS | Staging (Load Test) | ⚠️ Warning | Staging | SRE Lead | ✅ Yes |
| Chaos Test Pass Rate (Critical) | 100% | Staging (Chaos) | ✅ Yes (prod) | Staging | SRE Lead | ❌ No (prod) |
| Chaos Test Pass Rate (Non-Critical) | ≥95% | Staging (Chaos) | ⚠️ Warning | Staging | SRE Lead | ✅ Yes |
| Health Checks (Liveness) | 200 OK | All | ✅ Yes | All | SRE Lead | ❌ No |
| Health Checks (Readiness) | 200 OK | All | ✅ Yes | All | SRE Lead | ❌ No |
| OpenTelemetry Instrumentation | Required | CI (Observability) | ⚠️ Warning | All | SRE Lead | ✅ Yes |
| Manual Approval (Staging) | 1 approver | Pre-deploy | ✅ Yes | Staging | CAB | ❌ No |
| Manual Approval (Production) | 2 approvers | Pre-deploy | ✅ Yes | Production | CAB | ⚠️ Emergency only |
| Regression Test Pass Rate | 100% | Staging (Deploy) | ✅ Yes | Staging | QA Lead | ❌ No |
| Critical Scenario Pass Rate | 100% (@security, @compliance) | Staging (Deploy) | ✅ Yes | Staging | QA Lead | ❌ No |
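The matrix above can also be maintained as machine-readable configuration so pipelines, validators, and dashboards share a single source of truth. A hypothetical sketch (the file name and schema here are illustrative, not an existing ATP artifact):

```yaml
# quality-gates.yml — hypothetical single source of truth for gate thresholds
gates:
  - id: line-coverage
    threshold: ">=70%"
    enforcement: ci-test
    blocker: true
    environments: [all]
    owner: qa-lead
    bypass: emergency-only
  - id: critical-cves
    threshold: "0"
    enforcement: ci-security-scan
    blocker: true
    environments: [all]
    owner: security-officer
    bypass: none
  - id: p95-latency
    threshold: "<500ms"
    enforcement: staging-load-test
    blocker: prod-only
    environments: [staging]
    owner: sre-lead
    bypass: none
```

Keeping thresholds in one file also makes gate changes auditable through ordinary pull-request review, which supports the meta-audit requirements described in the governance section.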
Gate Bypass Approval Matrix:
| Bypass Type | Approver 1 | Approver 2 | Conditions | Documentation Required |
|---|---|---|---|---|
| Coverage (Emergency Hotfix) | Tech Lead | Architect | Active P1 incident | ADR + Incident ticket |
| High CVE (Risk Acceptance) | Security Officer | CISO | Mitigation controls in place | Risk acceptance form |
| API Breaking Change | Architect | VP Engineering | New major version (v2) | API versioning strategy |
| Manual Approval (Emergency) | CISO | CTO | Critical security patch | Emergency change request |
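Bypass decisions are easiest to audit when they surface in the pipeline run itself. The sketch below shows one way to wire an approved bypass through a runtime parameter; the parameter name and the `BuildQualityChecks@8` wiring are illustrative assumptions, not part of the ATP templates:

```yaml
# Hypothetical sketch: an approved coverage-gate bypass surfaced as a runtime
# parameter, so the bypass is recorded in run history alongside the ADR and
# incident ticket required by the approval matrix above.
parameters:
  - name: bypassCoverageGate
    displayName: 'Bypass coverage gate (emergency hotfix only; requires ADR + incident ticket)'
    type: boolean
    default: false

steps:
  # Template expression: the gate step is only inserted when no bypass was requested.
  - ${{ if not(parameters.bypassCoverageGate) }}:
      - task: BuildQualityChecks@8
        displayName: 'TEST COVERAGE GATE: Coverage Threshold'
        inputs:
          checkCoverage: true
          coverageThreshold: 75
          coverageFailOption: 'fixed'
```

Because `${{ if }}` is evaluated at template-expansion time, a bypassed run simply omits the gate step, which keeps the run timeline honest about what was and was not enforced.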
Appendix B — Example Pipeline with All Gates¶
Purpose: Provide a complete reference implementation of an Azure Pipelines definition with all quality gates integrated.
# azure-pipelines-complete.yml — Complete ATP Pipeline with All Quality Gates
name: $(majorMinorVersion).$(semanticVersion)
resources:
repositories:
- repository: templates
type: git
name: ConnectSoft/ConnectSoft.AzurePipelines
ref: refs/tags/v2.3.1
containers:
- container: redis
image: redis:7-alpine
ports: ['6379:6379']
- container: mssql
image: mcr.microsoft.com/mssql/server:2022-latest
ports: ['1433:1433']
env:
ACCEPT_EULA: Y
SA_PASSWORD: P@ssw0rd123!
- container: rabbitmq
image: rabbitmq:3-management-alpine
ports: ['5672:5672', '15672:15672']
- container: otel-collector
image: otel/opentelemetry-collector:0.97.0
ports: ['4317:4317']
pool:
vmImage: 'ubuntu-latest'
variables:
majorMinorVersion: 1.0
semanticVersion: $[counter(variables['majorMinorVersion'], 0)]
buildNumber: $(majorMinorVersion).$(semanticVersion)
solution: '**/*.slnx'
exactSolution: 'ConnectSoft.ATP.Ingestion.slnx'
buildConfiguration: 'Release'
codeCoverageThreshold: 75
restoreVstsFeed: 'e4c108b4-7989-4d22-93d6-391b77a39552'
trigger:
branches:
include: [master, main]
paths:
exclude: [README.md, docs/**]
#═══════════════════════════════════════════════════════════════════════════════
# Stage 1: CI (Build, Test, Security, Compliance)
#═══════════════════════════════════════════════════════════════════════════════
stages:
- stage: CI_Stage
displayName: 'CI — Build, Test, Security, Compliance'
jobs:
- job: Build_Test_Scan
displayName: 'Build, Test, Security & Compliance Scans'
timeoutInMinutes: 30
services:
redis: redis
mssql: mssql
rabbitmq: rabbitmq
otel: otel-collector
steps:
# ─────────────────────────────────────────────────────────────────────────
# Setup
# ─────────────────────────────────────────────────────────────────────────
- task: UseDotNet@2
displayName: 'Install .NET 8 SDK'
inputs:
version: '8.x'
- task: NuGetAuthenticate@1
displayName: 'Authenticate NuGet'
# ─────────────────────────────────────────────────────────────────────────
# BUILD QUALITY GATES
# ─────────────────────────────────────────────────────────────────────────
- task: DotNetCoreCLI@2
displayName: 'dotnet restore'
inputs:
command: 'restore'
projects: '$(exactSolution)'
feedsToUse: 'select'
vstsFeed: '$(restoreVstsFeed)'
- task: DotNetCoreCLI@2
displayName: '✅ BUILD QUALITY GATE: Zero Errors/Warnings'
inputs:
command: 'build'
projects: '$(exactSolution)'
arguments: >
--configuration $(buildConfiguration)
--no-restore
/p:TreatWarningsAsErrors=true
/p:EnforceCodeStyleInBuild=true
/p:Deterministic=true
/p:ContinuousIntegrationBuild=true
continueOnError: false # BLOCKER
# SonarQube analysis
- task: SonarCloudPrepare@1
displayName: 'Prepare SonarQube Analysis'
inputs:
SonarCloud: 'SonarCloud-ConnectSoft'
organization: 'connectsoft'
scannerMode: 'MSBuild'
projectKey: 'ConnectSoft_ATP_Ingestion'
projectName: 'ATP Ingestion Service'
- task: SonarCloudAnalyze@1
displayName: '✅ BUILD QUALITY GATE: SonarQube Analysis'
- task: SonarCloudPublish@1
displayName: 'Publish SonarQube Results'
inputs:
pollingTimeoutSec: '300'
# ─────────────────────────────────────────────────────────────────────────
# TEST COVERAGE GATES
# ─────────────────────────────────────────────────────────────────────────
- task: DotNetCoreCLI@2
displayName: 'Run Unit Tests'
inputs:
command: 'test'
projects: '**/*Tests.csproj'
arguments: >
--configuration $(buildConfiguration)
--no-build
--filter "Category=Unit"
--collect:"XPlat Code Coverage"
--settings:CodeCoverage.runsettings
--logger trx
publishTestResults: true
- task: PowerShell@2
displayName: '✅ TESTING QUALITY GATE: Unit Test Quality'
inputs:
filePath: 'scripts/Validate-UnitTestQuality.ps1'
arguments: >
-MinTests 50
-MaxDurationSeconds 30
-MaxFlakyRate 5.0
continueOnError: false # BLOCKER
- task: DotNetCoreCLI@2
displayName: 'Run Integration Tests'
inputs:
command: 'test'
projects: '**/*Tests.csproj'
arguments: >
--configuration $(buildConfiguration)
--no-build
--filter "Category=Integration"
--logger trx
publishTestResults: true
env:
ConnectionStrings__Redis: 'redis:6379'
ConnectionStrings__Database: 'Server=mssql;Database=ATP_Test;User=sa;Password=P@ssw0rd123!'
ConnectionStrings__RabbitMQ: 'amqp://guest:guest@rabbitmq:5672'
- task: PowerShell@2
displayName: '✅ TESTING QUALITY GATE: Integration Test Quality'
inputs:
filePath: 'scripts/Validate-IntegrationTestQuality.ps1'
arguments: >
-MinTests 20
-MaxDurationSeconds 300
continueOnError: false # BLOCKER
- task: PublishCodeCoverageResults@1
displayName: 'Publish Code Coverage'
inputs:
codeCoverageTool: 'Cobertura'
summaryFileLocation: '$(Agent.TempDirectory)/**/coverage.cobertura.xml'
- task: BuildQualityChecks@8
displayName: '✅ TEST COVERAGE GATE: Coverage Threshold'
inputs:
checkCoverage: true
coverageThreshold: $(codeCoverageThreshold)
coverageFailOption: 'fixed'
coverageType: 'lines'
treatBuildWarningsAsErrors: true
baselineEnabled: true
baselineType: 'previous'
continueOnError: false # BLOCKER
# ─────────────────────────────────────────────────────────────────────────
# SECURITY GATES
# ─────────────────────────────────────────────────────────────────────────
- task: dependency-check-build-task@6
displayName: '✅ SECURITY GATE: Dependency Scan (OWASP)'
inputs:
projectName: 'ConnectSoft.ATP.Ingestion'
scanPath: '$(Build.SourcesDirectory)'
format: 'HTML,JSON,XML'
failOnCVSS: 7 # Block on High/Critical
suppressionFile: 'dependency-check-suppressions.xml'
continueOnError: false # BLOCKER
- task: CredScan@3
displayName: '✅ SECURITY GATE: Secrets Detection'
inputs:
toolMajorVersion: 'V2'
suppressionsFile: 'credscan-suppressions.json'
debugMode: false
continueOnError: false # BLOCKER
- script: |
trivy image --severity HIGH,CRITICAL --exit-code 1 \
connectsoft.azurecr.io/atp/ingestion:$(buildNumber)
displayName: '✅ SECURITY GATE: Container Image Scan (Trivy)'
continueOnError: false # BLOCKER (staging/prod only)
condition: or(eq(variables['Build.SourceBranch'], 'refs/heads/master'), startsWith(variables['Build.SourceBranch'], 'refs/tags/'))
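# NOTE (assumption): the Trivy gate above scans
# connectsoft.azurecr.io/atp/ingestion:$(buildNumber), so the image must be
# built and pushed to the registry earlier in the run. A minimal sketch of
# that missing step (the service connection name is hypothetical):
#   - task: Docker@2
#     displayName: 'Build & Push Container Image'
#     inputs:
#       containerRegistry: 'connectsoft-acr'   # hypothetical service connection
#       repository: 'atp/ingestion'
#       command: 'buildAndPush'
#       tags: '$(buildNumber)'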
# ─────────────────────────────────────────────────────────────────────────
# SBOM & SUPPLY CHAIN GATES
# ─────────────────────────────────────────────────────────────────────────
- task: CmdLine@2
displayName: '✅ SBOM GATE: Generate SBOM (CycloneDX)'
inputs:
script: |
dotnet tool install --global CycloneDX
dotnet CycloneDX $(exactSolution) -o $(Build.ArtifactStagingDirectory)/sbom -f json
continueOnError: false # BLOCKER
- task: PowerShell@2
displayName: '✅ SBOM GATE: Validate SBOM Content'
inputs:
filePath: 'scripts/Validate-SBOM.ps1'
arguments: '-SbomPath "$(Build.ArtifactStagingDirectory)/sbom"'
continueOnError: false # BLOCKER
- task: PublishBuildArtifacts@1
displayName: 'Publish SBOM Artifact'
inputs:
PathtoPublish: '$(Build.ArtifactStagingDirectory)/sbom'
ArtifactName: 'sbom'
# ─────────────────────────────────────────────────────────────────────────
# COMPLIANCE GATES
# ─────────────────────────────────────────────────────────────────────────
- task: PowerShell@2
displayName: '✅ COMPLIANCE GATE: PII Redaction Validation'
inputs:
filePath: 'scripts/Validate-PIIRedaction.ps1'
continueOnError: false # BLOCKER
- task: PowerShell@2
displayName: '✅ COMPLIANCE GATE: GDPR/HIPAA Checklist'
inputs:
filePath: 'scripts/Validate-ComplianceChecklist.ps1'
continueOnError: false # BLOCKER
# ─────────────────────────────────────────────────────────────────────────
# OBSERVABILITY GATES
# ─────────────────────────────────────────────────────────────────────────
- task: PowerShell@2
displayName: '✅ OBSERVABILITY GATE: OpenTelemetry Validation'
inputs:
filePath: 'scripts/Validate-OpenTelemetry.ps1'
continueOnError: false # WARNING (not blocker)
# ─────────────────────────────────────────────────────────────────────────
# CONTRACT & API GATES
# ─────────────────────────────────────────────────────────────────────────
- task: PowerShell@2
displayName: '✅ CONTRACT GATE: OpenAPI Breaking Change Detection'
inputs:
filePath: 'scripts/Validate-OpenAPIChanges.ps1'
continueOnError: false # BLOCKER
# ─────────────────────────────────────────────────────────────────────────
# Publish Artifacts
# ─────────────────────────────────────────────────────────────────────────
- task: PublishBuildArtifacts@1
displayName: 'Publish Build Artifacts'
inputs:
PathtoPublish: '$(Build.ArtifactStagingDirectory)'
ArtifactName: 'drop'
#═══════════════════════════════════════════════════════════════════════════════
# Stage 2: Deploy to Staging (with Performance & Regression Gates)
#═══════════════════════════════════════════════════════════════════════════════
- stage: Deploy_Staging
displayName: 'Deploy to Staging'
dependsOn: CI_Stage
condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/master'))
jobs:
- deployment: DeployToStaging
displayName: 'Deploy ATP Ingestion to Staging'
environment: ATP-Staging # ✅ APPROVAL GATE: 1 approver required
strategy:
runOnce:
deploy:
steps:
- template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
parameters:
azureSubscription: 'ConnectSoft-Production'
appName: 'atp-ingestion-staging'
package: '$(Pipeline.Workspace)/drop/*.zip'
# Wait for stabilization
- task: PowerShell@2
displayName: 'Wait for Service Stabilization'
inputs:
targetType: 'inline'
script: Start-Sleep -Seconds 60
# ─────────────────────────────────────────────────────────────────
# PERFORMANCE GATES
# ─────────────────────────────────────────────────────────────────
- task: LoadTest@1
displayName: '✅ PERFORMANCE GATE: Load Testing'
inputs:
testPlan: 'load-tests/staging-load-test.jmx'
thresholdP95: 500 # <500ms p95 latency
thresholdErrorRate: 0.1 # <0.1% error rate
thresholdThroughput: 1000 # ≥1000 RPS
continueOnError: false # BLOCKER for production
- task: ChaosTest@1
displayName: '✅ PERFORMANCE GATE: Chaos Engineering'
inputs:
testPlan: 'chaos-tests/staging-chaos-test.yaml'
criticalScenariosPassRate: 100 # All critical scenarios must pass
continueOnError: false # BLOCKER for production
# ─────────────────────────────────────────────────────────────────
# REGRESSION TEST GATES
# ─────────────────────────────────────────────────────────────────
- task: DotNetCoreCLI@2
displayName: 'Run Regression Tests'
inputs:
command: 'test'
projects: '**/*RegressionTests.csproj'
arguments: '--configuration Release --logger trx'
publishTestResults: true
env:
TestEnvironment: 'Staging'
BaseUrl: 'https://atp-ingestion-staging.azurewebsites.net'
- task: PowerShell@2
displayName: '✅ REGRESSION GATE: Regression Test Quality'
inputs:
filePath: 'scripts/Validate-RegressionTestQuality.ps1'
arguments: >
-RequiredPassRate 100
-Environment "Staging"
continueOnError: false # BLOCKER
# ─────────────────────────────────────────────────────────────────
# OBSERVABILITY GATES
# ─────────────────────────────────────────────────────────────────
- task: HttpTest@1
displayName: '✅ OBSERVABILITY GATE: Health Check Validation'
inputs:
url: 'https://atp-ingestion-staging.azurewebsites.net/health/ready'
expectedStatusCode: 200
retryCount: 3
continueOnError: false # BLOCKER
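# NOTE (assumption): HttpTest@1 is not a built-in Azure DevOps task. If it is
# unavailable in your organization, an equivalent inline check can be
# sketched with curl, which exits non-zero on a non-2xx response:
#   - script: |
#       curl --fail --silent --retry 3 --retry-delay 10 \
#         https://atp-ingestion-staging.azurewebsites.net/health/ready
#     displayName: 'OBSERVABILITY GATE: Health check (curl fallback)'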
#═══════════════════════════════════════════════════════════════════════════════
# Stage 3: Deploy to Production (with Manual Approval + Canary)
#═══════════════════════════════════════════════════════════════════════════════
- stage: Deploy_Production
displayName: 'Deploy to Production'
dependsOn: Deploy_Staging
condition: and(succeeded(), eq(variables['Build.Reason'], 'Manual'))
jobs:
- deployment: DeployToProduction
displayName: 'Deploy ATP Ingestion to Production (Canary)'
environment: ATP-Production # ✅ APPROVAL GATE: 2 approvers + CAB required
strategy:
canary:
increments: [10, 25, 50] # 10% → 25% → 50% → 100%
preDeploy:
steps:
- script: echo "Pre-deployment validation..."
# Validate no active incidents
- task: AzureFunction@1
displayName: '✅ APPROVAL GATE: Check Active Incidents'
inputs:
function: 'ValidateNoActiveIncidents'
failOnError: true
deploy:
steps:
- template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
parameters:
azureSubscription: 'ConnectSoft-Production'
appName: 'atp-ingestion-prod'
package: '$(Pipeline.Workspace)/drop/*.zip'
trafficPercentage: $(strategy.increment) # Canary traffic routing
postRouteTraffic:
steps:
# Monitor canary metrics
- task: PowerShell@2
displayName: '✅ CANARY GATE: Monitor Metrics'
inputs:
targetType: 'inline'
script: |
  Start-Sleep -Seconds 600  # Wait 10 minutes of canary traffic
  # Query Application Insights for the failed-request average.
  # PowerShell continues lines with backticks (`), not backslashes.
  $errorRate = [double](az monitor app-insights metrics show `
    --app atp-appinsights-prod-eus `
    --metric "requests/failed" `
    --aggregation avg `
    --offset 10m `
    --query 'value.segments[0]."requests/failed".avg' -o tsv)
  if ($errorRate -gt 0.01) {  # >1% error rate
    Write-Error "Error rate too high: $errorRate"
    exit 1  # Trigger rollback
  }
on:
failure:
steps:
- script: echo "🔴 Canary deployment failed; rolling back..."
- task: AzureAppServiceManage@0
inputs:
azureSubscription: 'ConnectSoft-Production'
action: 'Swap Slots'
webAppName: 'atp-ingestion-prod'
sourceSlot: 'production'
targetSlot: 'canary'
Appendix C — Cross-Reference Map¶
Purpose: Map each quality gate topic to its primary and related documents for comprehensive understanding.
| Topic | Primary Document | Related Documents | Notes |
|---|---|---|---|
| Azure Pipelines | ci-cd/azure-pipelines.md | ci-cd/quality-gates.md, ci-cd/environments.md | Pipeline configuration, stages, templates, deployment strategies |
| Environments | ci-cd/environments.md | ci-cd/azure-pipelines.md, ci-cd/quality-gates.md | Environment-specific thresholds, approvals, configuration management |
| Security & Compliance | platform/security-compliance.md | ci-cd/quality-gates.md, platform/data-residency-retention.md | Security controls, compliance frameworks (SOC 2, GDPR, HIPAA) |
| Testing Strategies | ci-cd/quality-gates.md (Section 15) | implementation/template-integration.md, operations/progressive-rollout.md | Unit, integration, regression tests per environment |
| Observability | operations/observability.md | ci-cd/quality-gates.md (Section 9), ci-cd/environments.md | OpenTelemetry validation, health checks, metrics, tracing |
| SBOM & Supply Chain | ci-cd/quality-gates.md (Section 6) | platform/security-compliance.md, ci-cd/azure-pipelines.md | SBOM generation, provenance, signing, SLSA, supply chain security |
| Code Coverage | ci-cd/quality-gates.md (Section 4) | implementation/template-integration.md | Coverage thresholds, baseline protection, exclusions, per-service config |
| SonarQube | ci-cd/quality-gates.md (Section 3) | implementation/template-integration.md | Static code analysis, quality profiles, Roslyn analyzers |
| Dependency Scanning | ci-cd/quality-gates.md (Section 5) | platform/security-compliance.md | OWASP Dependency-Check, CVE management, suppression workflow |
| Compliance Gates | ci-cd/quality-gates.md (Section 7) | platform/data-residency-retention.md, platform/security-compliance.md | Audit logging, PII redaction, GDPR/HIPAA checklists, data classification |
| Performance Gates | ci-cd/quality-gates.md (Section 8) | operations/progressive-rollout.md, operations/runbook.md | Load testing, chaos engineering, latency/throughput thresholds |
| API Contracts | ci-cd/quality-gates.md (Section 10) | domain/contracts/rest-apis.md, domain/contracts/webhooks.md | OpenAPI breaking change detection, message schema compatibility |
| Approval Gates | ci-cd/quality-gates.md (Section 11) | ci-cd/environments.md, ci-cd/azure-pipelines.md | Manual approvals, CAB process, emergency procedures |
| Deployment Strategies | operations/progressive-rollout.md | ci-cd/azure-pipelines.md, ci-cd/environments.md | Blue-green, canary, rolling deployments |
| Incident Management | operations/runbook.md | ci-cd/quality-gates.md (Section 12), operations/progressive-rollout.md | Rollback procedures, incident response, post-mortems |
| Infrastructure as Code | infrastructure/pulumi.md | ci-cd/environments.md, ci-cd/azure-pipelines.md | Pulumi/Bicep deployment, IaC overlays, drift detection |
| Data Residency | platform/data-residency-retention.md | ci-cd/quality-gates.md (Section 7), platform/security-compliance.md | Data classification, retention policies, GDPR/HIPAA compliance |
| Template Integration | implementation/template-integration.md | ci-cd/azure-pipelines.md, ci-cd/quality-gates.md | ConnectSoft microservice template usage, project structure |
| Development Plan | planning/index.md | planning/status-tracking.md, planning/_baseline-roadmap.md | Epic planning organized by bounded contexts, 30-cycle baseline roadmap |
Summary¶
- Governance & Continuous Evolution: 9-gate ownership matrix, owner/reviewer responsibilities, change request process (Mermaid diagram), Azure DevOps work item template, 2025 evolution roadmap (4 quarters), continuous improvement framework (monthly cycle, retrospective template)
- Appendix A: Complete quality gate summary matrix (40+ gates with thresholds, enforcement, blocker status, environments, bypass rules)
- Appendix B: Full Azure Pipelines YAML example demonstrating all gates (build, test coverage, security, SBOM, compliance, performance, observability, contract, approval)
- Appendix C: Cross-reference map (19 topics mapped to primary and related documents across ci-cd, platform, operations, infrastructure, implementation, domain)