
Backup, Restore, Disaster Recovery & eDiscovery

Purpose & Scope

Purpose: Comprehensive operational guide for ATP's backup, restore, disaster recovery (DR), and eDiscovery strategies, ensuring data durability, integrity, recoverability, and legal compliance across all environments.

Scope: This document covers:

  • Backup Strategies: Automated and on-demand backups for Azure SQL Database, Azure Blob Storage (WORM), Azure Cosmos DB, Redis, and Service Bus, with integrity verification, encryption, and region-coherent storage
  • Restore Procedures: Tenant-scoped and full-system restores, point-in-time recovery (PITR), staging/quarantine restores, integrity verification, and recovery drills
  • Disaster Recovery: Multi-region failover, RPO/RTO objectives per environment/edition, automated failover procedures, regional DR drills, and failback strategies
  • eDiscovery: Legal hold management, data subject access requests (DSAR), export workflows, signed manifests, tamper-evidence, compliance exports (SEC 17a-4, HIPAA, GDPR)
  • Operational Excellence: Recovery drills, backup validation, monitoring, alerting, runbooks, compliance evidence collection

Audience: Platform operators, SRE teams, compliance officers, legal/regulatory teams, incident responders, backup administrators

Relationship to Other Documents:

  • Architecture: See ../architecture/data-architecture.md for data model, WORM storage, and integrity patterns
  • Operations: See runbook.md for day-to-day operations, incident response, and troubleshooting
  • Monitoring: See monitoring.md for observability, metrics, and alerting
  • Security: See ../hardening/tamper-evidence.md for integrity proofs, hash chains, and digital signatures
  • Compliance: See ../platform/privacy-gdpr-hipaa-soc2.md for regulatory requirements


Table of Contents

  1. Overview & Principles
  2. RPO/RTO Objectives
  3. Backup Strategy
  4. Restore Procedures
  5. Disaster Recovery
  6. eDiscovery & Legal Hold
  7. Integrity & Verification
  8. Operational Procedures
  9. Monitoring & Alerting
  10. Compliance & Evidence
  11. Troubleshooting
  12. Runbooks & Checklists

Overview & Principles

Core Principles

  1. Data Durability: Zero data loss within RPO window; backups are immutable and cryptographically protected
  2. Region Coherence: Each tenant's authoritative data is backed up in the same region; cross-region replication only when residency policy allows
  3. Integrity First: All backups include Merkle proofs, hash chains, and digital signatures; restores are verified before promotion
  4. Legal Compliance: WORM storage, legal holds, tamper-evidence, signed manifests for regulatory compliance (SEC 17a-4, HIPAA, GDPR)
  5. Quarantine Before Promote: All restores land in read-only quarantine namespace until integrity verification passes
  6. Evidence-Based: All backup/restore/eDiscovery operations produce auditable evidence suitable for external review
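
The quarantine-before-promote principle reduces to a simple gate; a minimal sketch, assuming a hypothetical `quarantine/` namespace prefix and `can_promote` helper (not the platform's actual API):

```shell
# Quarantine-before-promote gate (illustrative; namespace layout and
# function name are hypothetical). A restore is promotable only from a
# quarantine namespace, and only after integrity verification has passed.
can_promote() {
  local namespace="$1" verified="$2"
  case "${namespace}" in
    quarantine/*) [ "${verified}" = "true" ] && return 0 ;;
  esac
  return 1
}

can_promote "quarantine/acme-corp/rp-20251027" "true"  && echo "promotion allowed"
can_promote "active/acme-corp/current"         "true"  || echo "blocked: not in quarantine"
can_promote "quarantine/acme-corp/rp-20251027" "false" || echo "blocked: verification pending"
```

The same gate appears later as the two-person promotion approval in the restore workflow.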

Data Classifications

Data Class | Backup Cadence | Retention | Storage Tier | Legal Hold Support
---------- | -------------- | --------- | ------------ | ------------------
Hot (Append/WORM) | Hourly incremental, daily full | 7 years (configurable) | Premium Blob (WORM) | Yes
Warm (Read Models) | Weekly snapshots (optional; rebuild-first preferred) | 7 years | Standard Blob | Yes
Cold (Archives/Exports) | On-demand, lifecycle transitions | 7 years | Archive Blob | Yes
Projection DB | Continuous (Azure SQL automated), hourly incrementals | 35 days PITR, 7 years LTR | Azure SQL + Blob | Yes
Search Index | Rebuild from hot (preferred) or weekly snapshots | 7 years | Blob | Yes
Cosmos DB | Continuous (change feed) | 30 days continuous, 7 years periodic | Cosmos DB + Blob | Yes

Backup Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Backup Service (Orchestrator)                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            │
│  │  Scheduler  │  │   Workers   │  │  Catalog    │            │
│  └─────────────┘  └─────────────┘  └─────────────┘            │
└─────────────────────────────────────────────────────────────────┘
         │                  │                  │
         ├──────────────────┼──────────────────┤
         │                  │                  │
    ┌────▼────┐      ┌─────▼─────┐      ┌─────▼─────┐
    │  Append │      │ Projection│      │   Search  │
    │  Store  │      │    DB     │      │   Index   │
    │  (WORM) │      │           │      │           │
    └────┬────┘      └─────┬─────┘      └─────┬─────┘
         │                  │                  │
         └──────────────────┼──────────────────┘
                    ┌───────▼────────┐
                    │ Integrity Svc  │
                    │ (Merkle/Hash)  │
                    └───────┬────────┘
                    ┌───────▼────────┐
                    │  Object Store  │
                    │ (WORM/Immutable│
                    │   + KMS)       │
                    └────────────────┘

RPO/RTO Objectives

Definitions

  • RPO (Recovery Point Objective): Maximum acceptable data loss (time window from last successful backup to failure)
  • RTO (Recovery Time Objective): Maximum acceptable downtime (time from failure declaration to service restoration)
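
In monitoring terms, the RPO check is just the age of the newest recovery point compared against the window; a deterministic sketch with fixed timestamps (in practice `LAST_BACKUP` would come from the backup catalog; GNU `date` syntax):

```shell
# Hypothetical RPO compliance check: is the newest recovery point inside
# the RPO window? Timestamps are pinned so the example is deterministic.
RPO_SECONDS=$((5 * 60))                  # Enterprise production RPO: 5 minutes
LAST_BACKUP="2025-10-30T14:27:10Z"       # would come from the backup catalog
NOW="2025-10-30T14:30:00Z"

# Age of the last backup in seconds (GNU date)
age=$(( $(date -u -d "${NOW}" +%s) - $(date -u -d "${LAST_BACKUP}" +%s) ))

if [ "${age}" -le "${RPO_SECONDS}" ]; then
  echo "OK: last backup ${age}s old (RPO ${RPO_SECONDS}s)"
else
  echo "ALERT: RPO breached (${age}s > ${RPO_SECONDS}s)"
fi
```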

RPO/RTO by Environment & Edition

Environment | Edition | RPO Target | RTO Target | Notes
----------- | ------- | ---------- | ---------- | -----
Production | Enterprise | ≤ 5 minutes | ≤ 30 minutes | Multi-region active-active, continuous backups
Production | Pro | ≤ 15 minutes | ≤ 30 minutes | Single region, hourly incrementals
Production | Free | ≤ 24 hours | ≤ 2 hours | Daily backups; recreate from Git + IaC acceptable
Staging | All | ≤ 1 hour | ≤ 1 hour | Production-like validation
Test | All | ≤ 12 hours | ≤ 2 hours | Daily backups, restore within business hours
Dev | All | ≤ 24 hours | ≤ 4 hours | Recreate from Git + IaC preferred over restore

RPO/RTO by Scenario

Scenario | RPO Target | RTO Target | Procedure
-------- | ---------- | ---------- | ---------
Zonal failure (within region) | ~ 0 minutes | ≤ 15 minutes | ZRS + multi-zone node pools; AFD keeps region healthy
Regional outage (cross-region allowed) | ≤ 5 minutes | 15-60 minutes | AFD cutover + ASB alias + HOT replication lag
Regional outage (strict residency pair) | ≤ 15 minutes | 1-4 hours | Same-jurisdiction DR restore + replay WARM
AKS control-plane impairment | ~ 0 minutes | 30-60 minutes | Shift to sibling cluster or re-create node pools
Key Vault/HSM incident | ~ 0 minutes | 1-2 hours | Restore HSM backup; dual-sign window if needed
Data corruption | Variable | 30-60 minutes | PITR to last known good point; tenant-scoped restore

Backup Strategy

Azure SQL Database Backups

Automated Backups

Production:

  • Full Backup: Weekly (Sunday 2:00 AM UTC)
  • Differential Backup: Every 12-24 hours (automated)
  • Transaction Log Backup: Every 5-10 minutes (automated)
  • Retention: short-term (PITR) 35 days; long-term retention (LTR) weekly/monthly/yearly backups retained up to 10 years
  • Redundancy: Geo-redundant (replicated to paired region)

Configuration:

# Configure backup storage redundancy
az sql db update \
  --name ATP_Prod \
  --resource-group ATP-Prod-RG \
  --server atp-sql-prod-eus \
  --backup-storage-redundancy Geo

# Configure short-term (PITR) retention
az sql db str-policy set \
  --name ATP_Prod \
  --resource-group ATP-Prod-RG \
  --server atp-sql-prod-eus \
  --retention-days 35

# Configure long-term retention policy
az sql db ltr-policy set \
  --name ATP_Prod \
  --resource-group ATP-Prod-RG \
  --server atp-sql-prod-eus \
  --weekly-retention "P4W" \
  --monthly-retention "P12M" \
  --yearly-retention "P10Y" \
  --week-of-year 1

Point-in-Time Restore (PITR):

# Restore to specific point in time
az sql db restore \
  --dest-name ATP_Prod_Restore \
  --name ATP_Prod \
  --resource-group ATP-Prod-RG \
  --server atp-sql-prod-eus \
  --time "2025-10-30T14:30:00Z"

Manual Backups (BACPAC Export)

Weekly Full Backup:

#!/bin/bash
# weekly-backup-prod.sh

BACKUP_DATE=$(date +%Y%m%d)
BACKUP_FILE="prod-full-${BACKUP_DATE}.bacpac"
STORAGE_URI="https://atpstorageprodeus.blob.core.windows.net/backups/weekly/${BACKUP_FILE}"

echo "Creating weekly full backup: ${BACKUP_FILE}"

az sql db export \
  --name ATP_Prod \
  --resource-group ATP-Prod-RG \
  --server atp-sql-prod-eus \
  --admin-user $(az keyvault secret show --vault-name atp-keyvault-prod-eus --name SqlAdminUser --query value -o tsv) \
  --admin-password $(az keyvault secret show --vault-name atp-keyvault-prod-eus --name SqlAdminPassword --query value -o tsv) \
  --storage-key $(az storage account keys list --account-name atpstorageprodeus --query "[0].value" -o tsv) \
  --storage-key-type StorageAccessKey \
  --storage-uri "${STORAGE_URI}"

# Tag backup with metadata
az storage blob metadata update \
  --account-name atpstorageprodeus \
  --container-name backups \
  --name "weekly/${BACKUP_FILE}" \
  --metadata \
    type=full \
    createdAt=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
    retentionYears=7 \
    environment=Production

echo "✅ Weekly full backup completed: ${BACKUP_FILE}"

Azure Blob Storage Backups (WORM)

WORM Storage Configuration

Production Container Setup:

# Enable versioning and immutable storage
az storage account blob-service-properties update \
  --account-name atpstorageprodeus \
  --enable-versioning true \
  --enable-delete-retention true \
  --delete-retention-days 90

# Create container with immutable storage enabled
az storage container create \
  --account-name atpstorageprodeus \
  --name audit-events \
  --public-access off \
  --metadata \
    purpose="audit-trail-worm" \
    compliance="SEC-17a-4" \
    retentionYears="7"

# Enable immutable storage with versioning
az storage container immutability-policy create \
  --account-name atpstorageprodeus \
  --container-name audit-events \
  --period 2555 \
  --allow-protected-append-writes false

# Lock immutability policy (irreversible)
az storage container immutability-policy lock \
  --account-name atpstorageprodeus \
  --container-name audit-events
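
The `--period 2555` above is the seven-year retention expressed in days:

```shell
# The immutability period is specified in days: 7 years * 365 = 2555,
# matching the SEC 17a-4 retention used throughout this document.
RETENTION_YEARS=7
PERIOD_DAYS=$((RETENTION_YEARS * 365))
echo "immutability period: ${PERIOD_DAYS} days"
```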

Segment Backup Strategy

Segment Snapshot Workflow:

// ConnectSoft.ATP.Backup/Services/SegmentBackupService.cs
public class SegmentBackupService
{
    private readonly IBlobContainerClient _blobContainer;
    private readonly IIntegrityService _integrityService;
    private readonly IBackupCatalog _catalog;

    public async Task<BackupResult> CreateSegmentBackupAsync(
        string tenantId,
        string segmentId,
        Stream segmentData,
        CancellationToken cancellationToken)
    {
        // 1. Compute segment hash
        var segmentHash = await ComputeSha256HashAsync(segmentData);
        segmentData.Position = 0;

        // 2. Create segment blob with WORM policy
        var blobName = $"segments/{tenantId}/{segmentId}.jsonl.gz";
        var blobClient = _blobContainer.GetBlobClient(blobName);

        var uploadOptions = new BlobUploadOptions
        {
            Metadata = new Dictionary<string, string>
            {
                ["tenantId"] = tenantId,
                ["segmentId"] = segmentId,
                ["createdAt"] = DateTimeOffset.UtcNow.ToString("O"),
                ["segmentHash"] = segmentHash,
                ["format"] = "jsonl.gz"
            }
        };

        // Count records before the upload consumes the stream
        var recordCount = await CountRecordsAsync(segmentData);
        segmentData.Position = 0;

        await blobClient.UploadAsync(segmentData, uploadOptions, cancellationToken);

        // 3. Create manifest blob
        var manifest = new SegmentManifest
        {
            SegmentId = segmentId,
            TenantId = tenantId,
            SegmentHash = segmentHash,
            CreatedAt = DateTimeOffset.UtcNow,
            BlobUri = blobClient.Uri.ToString(),
            RecordCount = recordCount
        };

        var manifestBlobName = $"manifests/{tenantId}/{segmentId}.manifest.json";
        var manifestBlob = _blobContainer.GetBlobClient(manifestBlobName);
        await manifestBlob.UploadAsync(
            new BinaryData(JsonSerializer.Serialize(manifest)),
            cancellationToken: cancellationToken);

        // 4. Register in backup catalog
        await _catalog.RegisterRecoveryPointAsync(new RecoveryPoint
        {
            Id = $"RP-{DateTimeOffset.UtcNow:yyyyMMddTHHmmssZ}-{tenantId}-{segmentId}",
            TenantId = tenantId,
            SegmentId = segmentId,
            Type = BackupType.Incremental,
            CreatedAt = DateTimeOffset.UtcNow,
            SegmentHash = segmentHash,
            ManifestUri = manifestBlob.Uri.ToString()
        }, cancellationToken);

        return new BackupResult
        {
            Success = true,
            SegmentHash = segmentHash,
            BlobUri = blobClient.Uri.ToString(),
            ManifestUri = manifestBlob.Uri.ToString()
        };
    }
}

Azure Cosmos DB Backups

Continuous Backup Mode

# Enable continuous backup mode (configured at the account level)
az cosmosdb update \
  --name atp-cosmos-prod \
  --resource-group ATP-Shared-RG \
  --backup-policy-type Continuous \
  --continuous-tier Continuous30Days

# Restore a database from continuous backup
az cosmosdb sql database restore \
  --account-name atp-cosmos-prod \
  --resource-group ATP-Shared-RG \
  --name ATP_Database \
  --restore-timestamp "2025-10-30T14:30:00Z"

Redis Backups

Production (Premium tier):

# Configure RDB snapshots
az redis update \
  --name atp-redis-prod \
  --resource-group ATP-Shared-RG \
  --set \
    redisConfiguration.rdb-backup-enabled=true \
    redisConfiguration.rdb-backup-frequency=15 \
    redisConfiguration.rdb-backup-max-snapshot-count=7

Backup Catalog

Catalog Schema:

CREATE TABLE BackupCatalog (
    RecoveryPointId VARCHAR(128) PRIMARY KEY,
    TenantId VARCHAR(128) NOT NULL,
    Region VARCHAR(64) NOT NULL,
    BackupType VARCHAR(32) NOT NULL, -- 'full', 'incremental', 'point-in-time'
    CreatedAt TIMESTAMP NOT NULL,
    CompletedAt TIMESTAMP NULL,
    Status VARCHAR(32) NOT NULL, -- 'in_progress', 'completed', 'failed'
    DataClasses JSONB NOT NULL, -- ['hot', 'warm', 'cold']
    Packages JSONB NOT NULL, -- [{name, uri, hash, bytes}, ...]
    MerkleRoot VARCHAR(64) NULL,
    Signature TEXT NULL,
    KeyId VARCHAR(256) NULL,
    PolicyVersion VARCHAR(32) NOT NULL
);

-- JSONB implies PostgreSQL, where indexes are created separately
CREATE INDEX idx_tenant_created ON BackupCatalog (TenantId, CreatedAt);
CREATE INDEX idx_region_status ON BackupCatalog (Region, Status, CreatedAt);
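
The catalog's hot path is selecting the newest completed recovery point for a tenant; sketched here over a hypothetical pipe-delimited export of the table (ISO-8601 timestamps sort lexically, so plain `sort` suffices):

```shell
# Hypothetical catalog export, one recovery point per line:
# RecoveryPointId|TenantId|Status|CreatedAt
CATALOG='RP-A|acme-corp|completed|2025-10-29T08:00:00Z
RP-B|acme-corp|failed|2025-10-30T08:00:00Z
RP-C|acme-corp|completed|2025-10-30T06:00:00Z
RP-D|other-tenant|completed|2025-10-30T09:00:00Z'

# Newest completed recovery point for the tenant: filter, sort by the
# ISO-8601 timestamp (lexical order == chronological order), take the last
BEST=$(printf '%s\n' "${CATALOG}" \
  | awk -F'|' '$2=="acme-corp" && $3=="completed"' \
  | sort -t'|' -k4 \
  | tail -n1 \
  | cut -d'|' -f1)
echo "Best recovery point: ${BEST}"
```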

Restore Procedures

Tenant-Scoped Restore

Restore Request Workflow

sequenceDiagram
    autonumber
    participant OP as Operator
    participant CAT as Backup Catalog
    participant STO as Backup Store
    participant RST as Restore Controller
    participant INT as Integrity Verifier
    participant QUAR as Quarantine Namespace

    OP->>CAT: Request restore {tenant, timeRange, scope}
    CAT-->>OP: Best recovery point + manifests
    OP->>RST: Start restore(jobId, recoveryPointId)
    RST->>STO: Fetch packages (Private Link)
    RST->>INT: Verify checksums, Merkle roots, signatures
    INT-->>RST: Verification result + evidence
    alt Verification passed
        RST->>QUAR: Restore to quarantine (read-only)
        RST-->>OP: Restore completed, ready for validation
        OP->>RST: Validate restored data (sample queries)
        RST-->>OP: Validation results
        alt Validation passed
            OP->>RST: Approve promotion (two-person approval)
            RST->>QUAR: Promote to active namespace
            RST-->>OP: Promotion completed
        end
    else Verification failed
        RST-->>OP: Restore failed, integrity check failed
    end

Restore Command Example

#!/bin/bash
# restore-tenant.sh

TENANT_ID="acme-corp"
RESTORE_FROM="2025-10-01T00:00:00Z"
RESTORE_TO="2025-10-31T23:59:59Z"
RECOVERY_POINT_ID="RP-20251027T080000Z-eu-west-acme"

echo "Starting tenant restore: ${TENANT_ID}"

# 1. Request restore
RESTORE_JOB_ID=$(az rest --method POST \
  --uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/restores" \
  --headers "Authorization: Bearer ${ACCESS_TOKEN}" \
  --body '{
    "recoveryPointId": "'"${RECOVERY_POINT_ID}"'",
    "mode": "sandbox",
    "target": {
      "tenantId": "'"${TENANT_ID}"'",
      "region": "westeurope"
    },
    "scope": {
      "timeRange": {
        "from": "'"${RESTORE_FROM}"'",
        "to": "'"${RESTORE_TO}"'"
      },
      "dataClasses": ["hot", "warm"]
    },
    "verifyPolicy": {
      "rowCounts": true,
      "samplePercent": 5,
      "proofs": true,
      "checksums": true
    }
  }' \
  --query "restoreJobId" -o tsv)

echo "Restore job started: ${RESTORE_JOB_ID}"

# 2. Monitor restore progress
while true; do
  STATUS=$(az rest --method GET \
    --uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/restores/${RESTORE_JOB_ID}" \
    --headers "Authorization: Bearer ${ACCESS_TOKEN}" \
    --query "status" -o tsv)

  echo "Restore status: ${STATUS}"

  if [ "${STATUS}" == "completed" ]; then
    echo "✅ Restore completed successfully"
    break
  elif [ "${STATUS}" == "failed" ]; then
    echo "❌ Restore failed"
    exit 1
  fi

  sleep 10
done

# 3. Validate restored data
echo "Validating restored data..."

# Sample query to verify data integrity (sqlcmd with Microsoft Entra auth)
RECORD_COUNT=$(sqlcmd -S atp-sql-prod-eus.database.windows.net -d ATP_Prod_Restore_Temp -G -h -1 -W \
  -Q "SET NOCOUNT ON; SELECT COUNT(*) FROM AuditRecords WHERE TenantId = '${TENANT_ID}' AND CreatedAt >= '${RESTORE_FROM}' AND CreatedAt <= '${RESTORE_TO}'")

echo "Restored records: ${RECORD_COUNT}"

# 4. Approve promotion (if validation passed)
read -p "Approve promotion to production? (yes/no): " APPROVE
if [ "${APPROVE}" == "yes" ]; then
  az rest --method POST \
    --uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/restores/${RESTORE_JOB_ID}/promote" \
    --headers "Authorization: Bearer ${ACCESS_TOKEN}" \
    --body '{
      "approvedBy": ["operator1@connectsoft.dev", "operator2@connectsoft.dev"],
      "approvalTimestamp": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
    }'

  echo "✅ Promotion approved and completed"
fi

Point-in-Time Restore (PITR)

SQL Database PITR

#!/bin/bash
# pitr-restore.sh

RESTORE_TIME="2025-10-30T14:30:00Z"
TARGET_DB_NAME="ATP_Prod_Restore_$(date +%Y%m%d_%H%M%S)"

echo "Restoring to point in time: ${RESTORE_TIME}"

az sql db restore \
  --dest-name "${TARGET_DB_NAME}" \
  --name ATP_Prod \
  --resource-group ATP-Prod-RG \
  --server atp-sql-prod-eus \
  --time "${RESTORE_TIME}"

echo "✅ PITR restore completed: ${TARGET_DB_NAME}"

# Verify restore (sqlcmd with Microsoft Entra auth)
RECORD_COUNT=$(sqlcmd -S atp-sql-prod-eus.database.windows.net -d "${TARGET_DB_NAME}" -G -h -1 -W \
  -Q "SET NOCOUNT ON; SELECT COUNT(*) FROM AuditRecords")

echo "Restored records: ${RECORD_COUNT}"

Full System Restore (Disaster Recovery)

See Disaster Recovery section below.


Disaster Recovery

Multi-Region Failover

Failover Decision Matrix

Trigger | Condition | Action | RTO Target
------- | --------- | ------ | ----------
AFD Health Probe | Primary region unhealthy > 5 minutes | Automated failover to DR region | ≤ 15 minutes
SLO Burn Rate | Error budget exhausted > threshold | Manual failover decision | ≤ 30 minutes
Operator Declaration | Incident commander declares DR | Manual failover execution | ≤ 60 minutes
Regional Outage | Azure region status = Down | Automated failover | ≤ 15 minutes
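
The health-probe trigger is mechanical enough to automate; a deterministic sketch of the 5-minute debounce, with illustrative epoch timestamps:

```shell
# Sketch of the AFD health-probe trigger: failover is declared only after
# the primary has been unhealthy for 5+ minutes (timestamps are pinned so
# the example is deterministic).
THRESHOLD_SECONDS=$((5 * 60))
UNHEALTHY_SINCE=1730297400
NOW=1730297760                           # six minutes later
ELAPSED=$((NOW - UNHEALTHY_SINCE))

if [ "${ELAPSED}" -ge "${THRESHOLD_SECONDS}" ]; then
  DECISION="failover"
else
  DECISION="hold"
fi
echo "decision: ${DECISION} (unhealthy for ${ELAPSED}s)"
```

The debounce avoids flapping on transient probe failures; a real implementation would also require the DR region's own probes to be green before acting.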

Failover Orchestration

sequenceDiagram
    autonumber
    participant Mon as Monitor/SLO
    participant IC as Incident Commander
    participant AFD as Azure Front Door
    participant BUS as Service Bus (Geo-DR)
    participant REG as Registry
    participant PDP as Policy Engine
    participant CRDB as Database

    Mon-->>IC: Region A unhealthy / SLO burn
    IC->>AFD: Disable Region A origins; 100% to Region B
    IC->>BUS: Flip Geo-DR alias to Namespace B
    IC->>REG: Set tenant mode = read-only for homeRegion=A
    alt Extended outage > 30 min
        IC->>CRDB: Re-pin affected tenants to Region B
        REG-->>PDP: Emit TenantRehomed obligations
    end
    IC->>REG: Trigger warm-up (keyheads) + Refresh broadcast
    Mon-->>IC: SLOs recovered
    IC->>Comms: Resolved update; start failback plan

Failover Script

#!/bin/bash
# failover-to-dr.sh

PRIMARY_REGION="eastus"
DR_REGION="westus"
PRIMARY_AFD_PROFILE="atp-frontdoor-prod"

echo "=== DISASTER RECOVERY: Failover to DR Region ==="

read -p "Confirm failover to ${DR_REGION}? (yes/no): " CONFIRM
if [ "${CONFIRM}" != "yes" ]; then
  echo "Failover cancelled"
  exit 0
fi

# 1. Update Azure Front Door routing
echo "Updating AFD routing to DR region..."
az afd origin update \
  --profile-name "${PRIMARY_AFD_PROFILE}" \
  --resource-group ATP-Prod-RG \
  --origin-group-name atp-origin-group \
  --origin-name atp-primary-eastus \
  --enabled-state Disabled

az afd origin update \
  --profile-name "${PRIMARY_AFD_PROFILE}" \
  --resource-group ATP-Prod-RG \
  --origin-group-name atp-origin-group \
  --origin-name atp-dr-westus \
  --enabled-state Enabled \
  --priority 1

# 2. Failover Service Bus Geo-DR alias (initiated against the secondary namespace)
echo "Failing over Service Bus alias..."
az servicebus georecovery-alias fail-over \
  --namespace-name "atp-sb-prod-${DR_REGION}" \
  --resource-group ATP-Prod-RG \
  --alias atp-sb-prod-alias

# 3. Update tenant registry (set read-only for primary region tenants)
echo "Updating tenant registry..."
az rest --method POST \
  --uri "https://atp-gateway-prod.${DR_REGION}.cloudapp.azure.com/ops/v1/tenants/mode" \
  --headers "Authorization: Bearer ${ACCESS_TOKEN}" \
  --body "{
    \"homeRegion\": \"${PRIMARY_REGION}\",
    \"mode\": \"read-only\",
    \"reason\": \"DR failover\"
  }"

# 4. Trigger cache warm-up and projection refresh
echo "Triggering cache warm-up..."
az rest --method POST \
  --uri "https://atp-gateway-prod.${DR_REGION}.cloudapp.azure.com/ops/v1/cache/warmup" \
  --headers "Authorization: Bearer ${ACCESS_TOKEN}"

# 5. Validate SLOs
echo "Waiting for SLO validation..."
sleep 60

# Check SLO metrics (AFD Standard/Premium profiles live under Microsoft.Cdn)
SLO_STATUS=$(az monitor metrics list \
  --resource "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/ATP-Prod-RG/providers/Microsoft.Cdn/profiles/${PRIMARY_AFD_PROFILE}" \
  --metric "RequestCount" \
  --query "value[0].timeseries[0].data[-1].total" -o tsv)

if [ "${SLO_STATUS%.*}" -gt 0 ] 2>/dev/null; then
  echo "✅ Failover completed successfully"
  echo "Primary region: ${PRIMARY_REGION} (read-only)"
  echo "DR region: ${DR_REGION} (active)"
else
  echo "❌ Failover validation failed"
  exit 1
fi

Failback Procedures

#!/bin/bash
# failback-to-primary.sh

PRIMARY_REGION="eastus"
DR_REGION="westus"

echo "=== DISASTER RECOVERY: Failback to Primary Region ==="

# 1. Verify primary region is healthy
echo "Verifying primary region health..."
PRIMARY_HEALTH=$(az monitor metrics list \
  --resource "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/ATP-Prod-RG/providers/Microsoft.ContainerService/managedClusters/atp-aks-prod-${PRIMARY_REGION}" \
  --metric "kube_pod_status_ready" \
  --query "value[0].timeseries[0].data[-1].average" -o tsv)

# [ -lt ] compares integers only, so use awk for the fractional threshold
if ! awk "BEGIN { exit !(${PRIMARY_HEALTH:-0} >= 0.95) }"; then
  echo "❌ Primary region not healthy (${PRIMARY_HEALTH}), aborting failback"
  exit 1
fi

# 2. Sync data from DR to primary (if needed)
echo "Syncing data from DR to primary..."
# (Implementation depends on data sync strategy)

# 3. Update AFD routing (gradual traffic shift)
echo "Shifting traffic back to primary (10% increment)..."
for PERCENTAGE in 10 25 50 75 100; do
  echo "Shifting ${PERCENTAGE}% traffic to primary..."
  # Update AFD routing weights
  sleep 300  # Wait 5 minutes between increments
done

# 4. Failback Service Bus alias
echo "Failing back Service Bus alias..."
az servicebus georecovery-alias fail-over \
  --namespace-name atp-sb-prod-${PRIMARY_REGION} \
  --resource-group ATP-Prod-RG \
  --alias atp-sb-prod-alias

# 5. Update tenant registry
echo "Updating tenant registry (restore primary mode)..."
az rest --method POST \
  --uri "https://atp-gateway-prod.${PRIMARY_REGION}.cloudapp.azure.com/ops/v1/tenants/mode" \
  --headers "Authorization: Bearer ${ACCESS_TOKEN}" \
  --body "{
    \"homeRegion\": \"${PRIMARY_REGION}\",
    \"mode\": \"read-write\",
    \"reason\": \"Failback completed\"
  }"

echo "✅ Failback completed successfully"

DR Drill Procedures

Monthly Drill Checklist:

  • Schedule drill during maintenance window
  • Notify stakeholders (status page, email)
  • Backup current state (catalogs, configs)
  • Execute failover to DR region
  • Validate SLOs in DR region (latency, availability, correctness)
  • Execute sample restore (tenant-scoped)
  • Verify integrity (Merkle roots, signatures)
  • Execute failback to primary
  • Validate SLOs after failback
  • Document findings (RTO/RPO actuals, issues, improvements)
  • Update runbooks based on learnings
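
The RTO/RPO actuals recorded in the drill report are simple timestamp deltas; a hypothetical helper (GNU `date` syntax):

```shell
# Drill-report helper (hypothetical): compute the RTO actual from the
# drill's declared-failure and service-restored timestamps.
FAILURE_DECLARED="2025-10-30T14:00:00Z"
SERVICE_RESTORED="2025-10-30T14:22:00Z"

# Elapsed seconds between declaration and restoration (GNU date)
RTO_ACTUAL=$(( $(date -u -d "${SERVICE_RESTORED}" +%s) - $(date -u -d "${FAILURE_DECLARED}" +%s) ))
echo "RTO actual: $((RTO_ACTUAL / 60)) minutes (production target: 30)"
```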

eDiscovery & Legal Hold

Legal Hold Types

Type | Typical Trigger | Blocks | Expires
---- | --------------- | ------ | -------
LegalHold | Litigation/Discovery | Purge/Redact/Delete | Manual release only
RegulatorExtension | Regulator directive (e.g., retention+) | Purge/Delete (may allow Redact) | Date- or directive-bound
InvestigationHold | Security/forensics | Purge/Delete (optional Redact) | Time-bound with review

Legal Hold Model:

// ConnectSoft.ATP.Platform/Models/LegalHold.cs
public class LegalHold
{
    public string HoldId { get; set; }  // e.g., "lh-01J9ZN5W"
    public string TenantId { get; set; }
    public string Stream { get; set; }  // e.g., "audit.default"
    public HoldPredicate Predicate { get; set; }  // { action: ["Export.Requested"], timeRange: {...} }
    public HoldState State { get; set; }  // Active, Released
    public List<string> Approvers { get; set; }
    public DateTimeOffset CreatedAt { get; set; }
    public string EvidenceRef { get; set; }  // Blob URI to manifest
}

public class HoldPredicate
{
    public List<string> Actions { get; set; }
    public TimeRange TimeRange { get; set; }
    public Dictionary<string, object> Attributes { get; set; }  // KQL/SQL-lite filters
}

Apply Legal Hold:

#!/bin/bash
# apply-legal-hold.sh

TENANT_ID="acme-corp"
HOLD_ID="lh-$(date +%Y%m%d%H%M%S)"
CASE_ID="litigation-2025-001"

echo "Applying legal hold: ${HOLD_ID}"

# 1. Create legal hold record
az rest --method POST \
  --uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/legal-holds" \
  --headers "Authorization: Bearer ${ACCESS_TOKEN}" \
  --body '{
    "holdId": "'"${HOLD_ID}"'",
    "tenantId": "'"${TENANT_ID}"'",
    "stream": "audit.default",
    "predicate": {
      "timeRange": {
        "from": "2025-01-01T00:00:00Z",
        "to": "2025-12-31T23:59:59Z"
      },
      "actions": ["Export.Requested", "Export.Completed"]
    },
    "state": "Active",
    "approvers": ["legal@connectsoft.dev", "owner@acme-corp.com"],
    "reason": "Litigation: '"${CASE_ID}"'"
  }'

# 2. Apply legal hold to Azure Blob Storage container
# (legal hold tags must be 3-23 alphanumeric characters, so strip separators)
az storage container legal-hold set \
  --account-name atpstorageprodeus \
  --container-name "atp-${TENANT_ID}-hot" \
  --tags "$(echo "${HOLD_ID}" | tr -cd '[:alnum:]' | cut -c1-23)" "$(echo "${CASE_ID}" | tr -cd '[:alnum:]' | cut -c1-23)" \
  --allow-protected-append-writes-all false

echo "✅ Legal hold applied: ${HOLD_ID}"

Release Legal Hold:

#!/bin/bash
# release-legal-hold.sh

HOLD_ID="lh-20251030120000"
TENANT_ID="acme-corp"
APPROVER="legal@connectsoft.dev"

echo "Releasing legal hold: ${HOLD_ID}"

# 1. Release legal hold (requires approver authorization)
az rest --method POST \
  --uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/legal-holds/${HOLD_ID}/release" \
  --headers "Authorization: Bearer ${ACCESS_TOKEN}" \
  --body '{
    "releasedBy": "'"${APPROVER}"'",
    "reason": "Litigation resolved",
    "releaseTimestamp": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
  }'

# 2. Remove legal hold tags from blob container (same alphanumeric tag form)
az storage container legal-hold clear \
  --account-name atpstorageprodeus \
  --container-name "atp-${TENANT_ID}-hot" \
  --tags "$(echo "${HOLD_ID}" | tr -cd '[:alnum:]' | cut -c1-23)"

echo "✅ Legal hold released: ${HOLD_ID}"

Data Subject Access Requests (DSAR)

DSAR Workflow

sequenceDiagram
    autonumber
    participant U as User/Admin
    participant GW as Gateway
    participant DSR as DSAR Orchestrator
    participant SVC as Domain Services
    participant EXP as Export Service
    participant REV as Review Lane

    U->>GW: Submit DSAR (access/erasure)
    GW->>DSR: create_case(tenantId, subjectId, type)
    DSR->>SVC: fanout(workflow tasks per context)
    SVC-->>DSR: status/proofs (export bundle, erasure markers)
    DSR->>EXP: Generate export bundle (with redaction)
    EXP-->>DSR: Export bundle + signed manifest
    DSR->>REV: Route to review lane
    REV->>REV: Review export (minimize PII)
    alt Approved
        REV->>DSR: Approve export
        DSR-->>U: Deliver export (presigned URL)
    else Rejected
        REV->>DSR: Request revision
        DSR->>EXP: Regenerate with additional redaction
    end

DSAR Case Model

// ConnectSoft.ATP.Platform/Models/DsarCase.cs
public class DsarCase
{
    public string CaseId { get; set; }  // e.g., "dsar-241"
    public string TenantId { get; set; }
    public DsarSubject Subjects { get; set; }  // { email: [...], phone: [...] }
    public TimeRange TimeRange { get; set; }
    public Dictionary<string, object> Filters { get; set; }  // { actions: [...] }
    public string Purpose { get; set; }  // "Data Subject Access Request"
    public DsarState State { get; set; }  // Opened, Discovery, Review, Approved, Exported, Closed
    public List<string> Reviewers { get; set; }
    public string PolicyVersion { get; set; }
}

public class DsarSubject
{
    public List<string> Emails { get; set; }
    public List<string> Phones { get; set; }
    public List<string> SubjectIds { get; set; }
}

Create DSAR Export

#!/bin/bash
# create-dsar-export.sh

CASE_ID="dsar-241"
TENANT_ID="acme-corp"
SUBJECT_EMAIL="john@example.com"

echo "Creating DSAR export for case: ${CASE_ID}"

# Build the request body with a heredoc so ${SUBJECT_EMAIL} expands correctly
REQUEST_BODY=$(cat <<EOF
{
  "tenantId": "${TENANT_ID}",
  "caseId": "${CASE_ID}",
  "query": "subject.email == '${SUBJECT_EMAIL}' AND time >= 2025-09-01 AND time <= 2025-10-31",
  "format": "parquet",
  "redaction": {
    "hashSubjects": true,
    "truncateIp": true,
    "dropFields": ["resource.path", "payload.sensitive"]
  },
  "sign": true,
  "encryption": "kms://key-ref",
  "notify": ["privacy@acme-corp.com"]
}
EOF
)

EXPORT_ID=$(az rest --method POST \
  --uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/v1/audit/exports" \
  --headers "Authorization: Bearer ${ACCESS_TOKEN}" \
  --body "${REQUEST_BODY}" \
  --query "exportId" -o tsv)

echo "DSAR export created: ${EXPORT_ID}"

# Monitor export progress
while true; do
  STATUS=$(az rest --method GET \
    --uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/v1/audit/exports/${EXPORT_ID}/status" \
    --headers "Authorization: Bearer ${ACCESS_TOKEN}" \
    --query "status" -o tsv)

  echo "Export status: ${STATUS}"

  if [ "${STATUS}" == "completed" ]; then
    DOWNLOAD_URL=$(az rest --method GET \
      --uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/v1/audit/exports/${EXPORT_ID}/download" \
      --headers "Authorization: Bearer ${ACCESS_TOKEN}" \
      --query "downloadUrl" -o tsv)

    echo "✅ Export completed"
    echo "Download URL: ${DOWNLOAD_URL}"
    break
  elif [ "${STATUS}" == "failed" ]; then
    echo "❌ Export failed"
    exit 1
  fi

  sleep 10
done

Export Manifest Structure

{
  "manifestId": "man_01HZK...",
  "tenantId": "acme-corp",
  "region": "westeurope",
  "format": "parquet",
  "schemaVersion": "2.4.1",
  "counts": {
    "rows": 5423188,
    "files": 12
  },
  "bytes": 87331002881,
  "hashes": [
    {
      "file": "part-0000.parquet",
      "sha256": "b3f3a1b2c4d5e6f7..."
    }
  ],
  "proofRefs": [
    {
      "stream": "aud.gateway",
      "fromSeg": "000130",
      "toSeg": "000145"
    }
  ],
  "encryption": {
    "keyId": "hsm-eu-01",
    "keyVersion": "8"
  },
  "policyVersion": "3.1.0",
  "tsa": {
    "type": "rfc3161",
    "token": "b64:MEUCIQ..."
  },
  "signature": "MEUCIQ..."
}

Integrity & Verification

Merkle Proof Verification

// ConnectSoft.ATP.Integrity/Services/MerkleVerifier.cs
public class MerkleVerifier
{
    public async Task<VerificationResult> VerifySegmentAsync(
        string segmentId,
        Stream segmentData,
        string expectedRootHash,
        List<MerklePathNode> merklePath,
        CancellationToken cancellationToken)
    {
        // 1. Compute leaf hash
        var leafHash = await ComputeSha256HashAsync(segmentData);
        segmentData.Position = 0;

        // 2. Recompute root hash using Merkle path
        var computedRoot = await RecomputeMerkleRootAsync(leafHash, merklePath);

        // 3. Compare with expected root
        if (computedRoot != expectedRootHash)
        {
            return new VerificationResult
            {
                Success = false,
                Reason = $"Merkle root mismatch: expected {expectedRootHash}, got {computedRoot}"
            };
        }

        // 4. Verify digital signature (if provided)
        // (Implementation depends on signature scheme)

        return new VerificationResult
        {
            Success = true,
            LeafHash = leafHash,
            RootHash = computedRoot
        };
    }
}
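`RecomputeMerkleRootAsync` is elided above; its core logic folds the leaf hash up the tree, concatenating with each sibling hash on the left or right as the path dictates. A hedged Python sketch of that fold (the left/right ordering convention is an assumption, not ATP's confirmed scheme):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def recompute_merkle_root(leaf_hash: bytes, path: list) -> bytes:
    """Fold a leaf hash up the tree. `path` entries are (sibling_hash, side)
    tuples, where side is 'left' or 'right' and names where the sibling sits
    relative to the running hash."""
    node = leaf_hash
    for sibling, side in path:
        node = sha256(sibling + node) if side == "left" else sha256(node + sibling)
    return node

# Two-leaf tree: root = H(H(a) || H(b)); both leaves must reproduce the same root
a, b = sha256(b"segment-a"), sha256(b"segment-b")
root = sha256(a + b)
assert recompute_merkle_root(a, [(b, "right")]) == root
assert recompute_merkle_root(b, [(a, "left")]) == root
```

Comparing the folded result against the published root is the whole verification; a single flipped byte anywhere in the segment changes the leaf hash and therefore the computed root.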

Backup Integrity Check

#!/bin/bash
# verify-backup-integrity.sh

RECOVERY_POINT_ID="RP-20251027T080000Z-eu-west-acme"

echo "Verifying backup integrity: ${RECOVERY_POINT_ID}"

# 1. Fetch manifest
MANIFEST=$(az rest --method GET \
  --uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/backups/${RECOVERY_POINT_ID}/manifest" \
  --headers "Authorization: Bearer ${ACCESS_TOKEN}")

# 2. Download packages (jq -c emits one compact JSON object per array element,
#    so each package survives word-splitting as a single loop item)
mapfile -t PACKAGES < <(echo "${MANIFEST}" | jq -c '.packages[]')

for PACKAGE in "${PACKAGES[@]}"; do
  PACKAGE_NAME=$(echo "${PACKAGE}" | jq -r '.name')
  PACKAGE_URI=$(echo "${PACKAGE}" | jq -r '.uri')
  EXPECTED_HASH=$(echo "${PACKAGE}" | jq -r '.sha256')

  echo "Verifying package: ${PACKAGE_NAME}"

  # Download package
  az storage blob download \
    --account-name atpstorageprodeus \
    --container-name backups \
    --name "${PACKAGE_NAME}" \
    --file "/tmp/${PACKAGE_NAME}"

  # Compute hash
  ACTUAL_HASH=$(sha256sum "/tmp/${PACKAGE_NAME}" | cut -d' ' -f1)

  # Compare hashes
  if [ "${ACTUAL_HASH}" != "${EXPECTED_HASH}" ]; then
    echo "❌ Hash mismatch for ${PACKAGE_NAME}"
    echo "  Expected: ${EXPECTED_HASH}"
    echo "  Actual: ${ACTUAL_HASH}"
    exit 1
  else
    echo "✅ Hash verified: ${PACKAGE_NAME}"
  fi
done

# 3. Verify Merkle root
EXPECTED_MERKLE_ROOT=$(echo "${MANIFEST}" | jq -r '.merkleRoot')
# (Merkle root verification logic)

# 4. Verify digital signature
SIGNATURE=$(echo "${MANIFEST}" | jq -r '.signature')
KEY_ID=$(echo "${MANIFEST}" | jq -r '.keyId')
# (Digital signature verification logic)

echo "✅ Backup integrity verified: ${RECOVERY_POINT_ID}"

Operational Procedures

Daily Backup Validation

#!/bin/bash
# daily-backup-validation.sh

echo "=== Daily Backup Validation ==="

# 1. Verify SQL automated backups exist (a populated earliestRestoreDate
#    proves an active PITR window for the database)
EARLIEST_RESTORE=$(az sql db show \
  --name ATP_BackupCatalog \
  --server atp-sql-prod-eus \
  --resource-group ATP-Prod-RG \
  --query "earliestRestoreDate" -o tsv)

if [ -z "${EARLIEST_RESTORE}" ]; then
  echo "❌ No SQL PITR window found"
  exit 1
fi

echo "✅ SQL PITR window starts at: ${EARLIEST_RESTORE}"

# 2. Verify Cosmos DB continuous backup mode
COSMOS_BACKUP_MODE=$(az cosmosdb show \
  --name atp-cosmos-prod \
  --resource-group ATP-Shared-RG \
  --query "backupPolicy.type" -o tsv)

if [ "${COSMOS_BACKUP_MODE}" != "Continuous" ]; then
  echo "❌ Cosmos DB not in continuous backup mode"
  exit 1
fi

echo "✅ Cosmos DB continuous backup enabled"

# 3. Verify blob geo-replication status
GEO_REPL_STATUS=$(az storage account show \
  --name atpstorageprodeus \
  --resource-group ATP-Prod-RG \
  --query "geoReplicationStats.status" -o tsv)

if [ "${GEO_REPL_STATUS}" != "Live" ]; then
  echo "❌ Blob geo-replication not live"
  exit 1
fi

echo "✅ Blob geo-replication live"

# 4. Check backup catalog for recent recovery points
#    (sqlcmd with Azure AD auth; -h -1 -W returns the bare count)
RECENT_BACKUPS=$(sqlcmd -S atp-sql-prod-eus.database.windows.net \
  -d ATP_BackupCatalog -G -h -1 -W \
  -Q "SET NOCOUNT ON; SELECT COUNT(*) FROM BackupCatalog WHERE CreatedAt >= DATEADD(day, -1, GETUTCDATE()) AND Status = 'completed'")

if [ "${RECENT_BACKUPS}" -lt 20 ]; then
  echo "❌ Insufficient recent backups (expected >= 20, got ${RECENT_BACKUPS})"
  exit 1
fi

echo "✅ Recent backups: ${RECENT_BACKUPS}"

echo "✅ Daily backup validation passed"

Monthly Recovery Drill

#!/bin/bash
# monthly-recovery-drill.sh

echo "=== Monthly Recovery Drill ==="

# 1. Select tenant and recovery point
#    (fixed here for illustration; production drills should pick both at random)
TENANT_ID="acme-corp"
RECOVERY_POINT_ID="RP-$(date -d '7 days ago' +%Y%m%dT%H%M%SZ)-eu-west-${TENANT_ID}"

echo "Selected recovery point: ${RECOVERY_POINT_ID}"

# 2. Execute restore to sandbox
RESTORE_JOB_ID=$(az rest --method POST \
  --uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/restores" \
  --headers "Authorization: Bearer ${ACCESS_TOKEN}" \
  --body '{
    "recoveryPointId": "'"${RECOVERY_POINT_ID}"'",
    "mode": "sandbox",
    "target": {
      "tenantId": "'"${TENANT_ID}"'",
      "region": "westeurope"
    },
    "verifyPolicy": {
      "rowCounts": true,
      "samplePercent": 10,
      "proofs": true
    }
  }' \
  --query "restoreJobId" -o tsv)

echo "Restore job started: ${RESTORE_JOB_ID}"

# 3. Monitor restore
# (Wait for completion, log RTO)

# 4. Validate restored data
# (Sample queries, integrity checks)

# 5. Document results (rtoActual/rpoActual below are placeholder minutes;
#    record the measured values from the restore job)
echo "Recording drill results..."
az rest --method POST \
  --uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/drills/record" \
  --headers "Authorization: Bearer ${ACCESS_TOKEN}" \
  --body '{
    "drillId": "drill-'"$(date +%Y%m%d)"'",
    "recoveryPointId": "'"${RECOVERY_POINT_ID}"'",
    "rtoActual": 45,
    "rpoActual": 8,
    "result": "passed",
    "notes": "All checks passed"
  }'

echo "✅ Monthly recovery drill completed"
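The hard-coded `rtoActual`/`rpoActual` values in the drill record above stand in for measured numbers; in practice they are derived from timestamps. A sketch of that derivation (function and field names are hypothetical):

```python
from datetime import datetime

def drill_metrics(restore_started: str, restore_completed: str,
                  last_backup_cutover: str) -> dict:
    """Derive RTO (restore wall time) and RPO (data age at the recovery point),
    both in whole minutes, from ISO-8601 timestamps."""
    parse = datetime.fromisoformat
    started, completed = parse(restore_started), parse(restore_completed)
    cutover = parse(last_backup_cutover)
    return {
        "rtoActualMinutes": int((completed - started).total_seconds() // 60),
        "rpoActualMinutes": int((started - cutover).total_seconds() // 60),
    }

m = drill_metrics("2025-10-27T08:00:00+00:00",   # restore request submitted
                  "2025-10-27T08:45:00+00:00",   # restore completed
                  "2025-10-27T07:52:00+00:00")   # last successful backup cutover
assert m == {"rtoActualMinutes": 45, "rpoActualMinutes": 8}
```

Measuring RPO against the backup cutover (not the incident time) keeps drill numbers comparable across runs.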

Monitoring & Alerting

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `backup_runs_total{result}` | Counter | Backups by result (success/failed) | Failures > baseline (e.g., > 2 in 24h) |
| `backup_bytes_total` | Counter | Total bytes uploaded | Sudden drop/spike (> 50% change) |
| `backup_duration_seconds` | Histogram | Backup wall time | p95 > SLO (e.g., > 2 hours for full) |
| `restore_duration_seconds` | Histogram | Drill/restore time | p95 > RTO target |
| `backup_proof_failures_total` | Counter | Integrity verification failures | Any > 0 |
| `rpo_effective_seconds` | Gauge | Now − last successful cutover | > RPO target (e.g., > 15 minutes) |
| `rto_drill_pass_rate` | Gauge | % drills meeting RTO | < target (e.g., < 95%) |

Prometheus Alert Rules

# alerts-backup-restore.yml
groups:
  - name: backup_restore
    interval: 30s
    rules:
      - alert: BackupFailureRate
        expr: rate(backup_runs_total{result="failed"}[1h]) > 0.1
        for: 15m
        labels:
          severity: critical
          component: backup
        annotations:
          summary: "Backup failure rate exceeded threshold"
          description: "Backup failure rate is {{ $value }} failures/hour (threshold: 0.1)"

      - alert: BackupRPOExceeded
        expr: rpo_effective_seconds > 900  # 15 minutes
        for: 10m
        labels:
          severity: warning
          component: backup
        annotations:
          summary: "RPO exceeded"
          description: "Current RPO is {{ $value }}s (target: 900s)"

      - alert: RestoreRTOExceeded
        expr: histogram_quantile(0.95, rate(restore_duration_seconds_bucket[10m])) > 1800  # 30 minutes
        for: 5m
        labels:
          severity: critical
          component: restore
        annotations:
          summary: "Restore RTO exceeded"
          description: "Restore p95 duration is {{ $value }}s (target: 1800s)"

      - alert: BackupIntegrityFailure
        expr: increase(backup_proof_failures_total[1h]) > 0
        for: 0m
        labels:
          severity: critical
          component: integrity
        annotations:
          summary: "Backup integrity verification failed"
          description: "Backup proof verification failed for recovery point"

Compliance & Evidence

Compliance Requirements

| Framework | Requirement | Control |
|---|---|---|
| SEC 17a-4 | Immutable records, 7-year retention | WORM storage, legal hold, signed manifests |
| HIPAA | Audit logs, integrity, retention | Encryption, access controls, audit trails |
| GDPR | Data subject rights, erasure | DSAR workflows, legal hold, export capabilities |
| SOC 2 | Availability, integrity, confidentiality | Backup/restore procedures, access controls, monitoring |

Evidence Collection

Backup Evidence Pack:
- Recovery point manifest
- Merkle root proofs
- Digital signatures
- TSA tokens (if applicable)
- Backup catalog entries
- Integrity verification results

Restore Evidence Pack:
- Restore log (immutable)
- Verification results
- Approval records (two-person)
- Promotion timestamps
- SLO validation results

eDiscovery Evidence Pack:
- Export manifest (signed)
- Redaction policies applied
- Chain of custody logs
- Delivery receipts
- Legal hold references
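Chain-of-custody logs are typically made tamper-evident with a hash chain: each entry's hash covers the previous entry's hash, so altering any entry breaks every later link. A minimal sketch of the idea, not ATP's actual implementation:

```python
import hashlib
import json

def chain_append(log: list, event: dict) -> None:
    """Append an event whose hash covers the previous entry's hash."""
    prev = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    log.append({"prev": prev, "event": event,
                "hash": hashlib.sha256(payload.encode()).hexdigest()})

def chain_verify(log: list) -> bool:
    """Walk the chain from genesis, recomputing every hash."""
    prev = "0" * 64
    for entry in log:
        payload = json.dumps({"prev": prev, "event": entry["event"]}, sort_keys=True)
        if entry["prev"] != prev or entry["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev = entry["hash"]
    return True

log = []
chain_append(log, {"action": "export_created", "by": "ops@example.com"})
chain_append(log, {"action": "export_delivered", "by": "legal@example.com"})
assert chain_verify(log)
log[0]["event"]["by"] = "attacker@example.com"   # tampering breaks the chain
assert not chain_verify(log)
```

Anchoring the final hash in a signed manifest (as the export manifest above does with `signature` and `tsa`) extends the tamper-evidence beyond the log itself.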


Troubleshooting

Common Issues

Issue: Backup Failed - Storage Unavailable

Symptoms:
- Backup job status = failed
- Error: 503 Service Unavailable from object store

Resolution:
1. Check Azure Storage account status
2. Verify network connectivity (Private Link, firewall rules)
3. Retry backup with exponential backoff
4. If persistent, escalate to Azure support

Issue: Restore Failed - Integrity Verification Failed

Symptoms:
- Restore job status = failed
- Error: 422 Unprocessable Entity - Hash mismatch

Resolution:
1. Review integrity verification logs
2. Compare expected vs. actual hashes
3. Re-download packages and re-verify
4. If corruption confirmed, use earlier recovery point
5. Escalate to integrity service team

Issue: Deletion Blocked - Legal Hold Active

Symptoms:
- Retention policy cannot delete records
- Error: 409 Conflict - Legal hold active

Resolution:
1. Review active legal holds for tenant/scope
2. Verify hold expiration dates
3. If hold should be released, follow release procedure (requires approver)
4. Do not force deletion (violates compliance)
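The "do not force deletion" rule can also be enforced in code: the retention job refuses to delete while any unexpired hold covers the scope. A hypothetical sketch (hold record fields are assumptions):

```python
from datetime import datetime, timezone

def deletion_allowed(holds: list, tenant_id: str, now=None) -> bool:
    """Return False if any active (unexpired or open-ended) legal hold
    covers the tenant; retention deletion must never override a hold."""
    now = now or datetime.now(timezone.utc)
    for hold in holds:
        if hold["tenantId"] != tenant_id:
            continue
        expires = hold.get("expiresAt")
        # A missing expiry means the hold stays active until explicitly released
        if expires is None or datetime.fromisoformat(expires) > now:
            return False
    return True

now = datetime(2025, 10, 30, tzinfo=timezone.utc)
holds = [{"tenantId": "acme-corp", "expiresAt": "2026-01-01T00:00:00+00:00"}]
assert not deletion_allowed(holds, "acme-corp", now)   # hold still active
assert deletion_allowed(holds, "other-tenant", now)    # not covered by any hold
```

Evaluating the guard server-side, before any delete is enqueued, keeps the 409 Conflict above as a backstop rather than the primary control.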


Runbooks & Checklists

Backup Runbook

1. Verify Prerequisites
   - Object store accessible
   - Integrity service healthy
   - Backup catalog available
   - Sufficient storage quota

2. Execute Backup
   - Trigger backup (scheduled or on-demand)
   - Monitor backup progress
   - Verify completion status

3. Validate Backup
   - Check backup catalog entry
   - Verify package hashes
   - Confirm Merkle root exists
   - Validate digital signature

4. Document Results
   - Record recovery point ID
   - Log backup duration
   - Note any issues/warnings

Restore Runbook

1. Request Restore
   - Identify recovery point
   - Define scope (tenant, time range, data classes)
   - Select target (sandbox or production)

2. Execute Restore
   - Submit restore request
   - Monitor restore progress
   - Wait for completion

3. Verify Integrity
   - Check segment checksums
   - Verify Merkle roots
   - Validate signatures
   - Confirm journal continuity

4. Validate Data
   - Sample queries
   - Row count verification
   - Policy enforcement checks

5. Promote (if production)
   - Two-person approval
   - Execute promotion
   - Validate SLOs
   - Document promotion

DR Failover Runbook

1. Declare DR
   - Confirm incident scope
   - Notify stakeholders
   - Activate incident response

2. Execute Failover
   - Update AFD routing
   - Failover Service Bus alias
   - Update tenant registry
   - Trigger cache warm-up

3. Validate DR Region
   - Check SLOs
   - Verify data accessibility
   - Test critical workflows

4. Monitor & Communicate
   - Update status page
   - Send tenant notifications
   - Document failover timeline

5. Plan Failback
   - Verify primary region health
   - Sync data (if needed)
   - Execute gradual failback
   - Validate post-failback SLOs

Summary

This document provides comprehensive operational guidance for ATP's backup, restore, disaster recovery, and eDiscovery strategies. Key takeaways:

  • RPO/RTO: Targets vary by environment/edition (Enterprise: ≤5min RPO, ≤30min RTO)
  • Backups: Automated and on-demand, with integrity verification, encryption, and WORM storage
  • Restores: Tenant-scoped and full-system, with quarantine validation before promotion
  • DR: Multi-region failover with automated orchestration and gradual failback
  • eDiscovery: Legal holds, DSAR workflows, signed exports, compliance evidence
  • Integrity: Merkle proofs, hash chains, digital signatures, tamper-evidence
  • Compliance: SEC 17a-4, HIPAA, GDPR, SOC 2 controls and evidence collection

Next Steps:
- Review and customize RPO/RTO targets for your organization
- Schedule regular recovery drills (monthly)
- Establish legal hold and DSAR procedures
- Configure monitoring and alerting
- Train operations team on runbooks


Document Version: 1.0
Last Updated: 2025-10-30
Maintained By: Platform Operations Team