Backup, Restore, Disaster Recovery & eDiscovery¶
Purpose & Scope¶
Purpose: Comprehensive operational guide for ATP's backup, restore, disaster recovery (DR), and eDiscovery strategies, ensuring data durability, integrity, recoverability, and legal compliance across all environments.
Scope: This document covers:
- Backup Strategies: Automated and on-demand backups for Azure SQL Database, Azure Blob Storage (WORM), Azure Cosmos DB, Redis, and Service Bus, with integrity verification, encryption, and region-coherent storage
- Restore Procedures: Tenant-scoped and full-system restores, point-in-time recovery (PITR), staging/quarantine restores, integrity verification, and recovery drills
- Disaster Recovery: Multi-region failover, RPO/RTO objectives per environment/edition, automated failover procedures, regional DR drills, and failback strategies
- eDiscovery: Legal hold management, data subject access requests (DSAR), export workflows, signed manifests, tamper-evidence, compliance exports (SEC 17a-4, HIPAA, GDPR)
- Operational Excellence: Recovery drills, backup validation, monitoring, alerting, runbooks, compliance evidence collection
Audience: Platform operators, SRE teams, compliance officers, legal/regulatory teams, incident responders, backup administrators
Relationship to Other Documents:
- Architecture: See ../architecture/data-architecture.md for data model, WORM storage, and integrity patterns
- Operations: See runbook.md for day-to-day operations, incident response, and troubleshooting
- Monitoring: See monitoring.md for observability, metrics, and alerting
- Security: See ../hardening/tamper-evidence.md for integrity proofs, hash chains, and digital signatures
- Compliance: See ../platform/privacy-gdpr-hipaa-soc2.md for regulatory requirements
Table of Contents¶
- Overview & Principles
- RPO/RTO Objectives
- Backup Strategy
- Restore Procedures
- Disaster Recovery
- eDiscovery & Legal Hold
- Integrity & Verification
- Operational Procedures
- Monitoring & Alerting
- Compliance & Evidence
- Troubleshooting
- Runbooks & Checklists
Overview & Principles¶
Core Principles¶
- Data Durability: Zero data loss within RPO window; backups are immutable and cryptographically protected
- Region Coherence: Each tenant's authoritative data is backed up in the same region; cross-region replication only when residency policy allows
- Integrity First: All backups include Merkle proofs, hash chains, and digital signatures; restores are verified before promotion
- Legal Compliance: WORM storage, legal holds, tamper-evidence, signed manifests for regulatory compliance (SEC 17a-4, HIPAA, GDPR)
- Quarantine Before Promote: All restores land in read-only quarantine namespace until integrity verification passes
- Evidence-Based: All backup/restore/eDiscovery operations produce auditable evidence suitable for external review
Data Classifications¶
| Data Class | Backup Cadence | Retention | Storage Tier | Legal Hold Support |
|---|---|---|---|---|
| Hot (Append/WORM) | Hourly incremental, Daily full | 7 years (configurable) | Premium Blob (WORM) | Yes |
| Warm (Read Models) | Weekly snapshots (optional, rebuild-first preferred) | 7 years | Standard Blob | Yes |
| Cold (Archives/Exports) | On-demand, lifecycle transitions | 7 years | Archive Blob | Yes |
| Projection DB | Continuous (Azure SQL automated), Hourly incrementals | 35 days PITR, 7 years LTR | Azure SQL + Blob | Yes |
| Search Index | Rebuild from hot (preferred) or weekly snapshots | 7 years | Blob | Yes |
| Cosmos DB | Continuous (change feed) | 30 days continuous, 7 years periodic | Cosmos DB + Blob | Yes |
Backup Architecture¶
┌─────────────────────────────────────────────────────────────────┐
│ Backup Service (Orchestrator) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Scheduler │ │ Workers │ │ Catalog │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│ │ │
├──────────────────┼──────────────────┤
│ │ │
┌────▼────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Append │ │ Projection│ │ Search │
│ Store │ │ DB │ │ Index │
│ (WORM) │ │ │ │ │
└────┬────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
└──────────────────┼──────────────────┘
│
┌───────▼────────┐
│ Integrity Svc │
│ (Merkle/Hash) │
└───────┬────────┘
│
┌───────▼────────┐
│ Object Store │
│ (WORM/Immutable│
│ + KMS) │
└────────────────┘
RPO/RTO Objectives¶
Definitions¶
- RPO (Recovery Point Objective): Maximum acceptable data loss (time window from last successful backup to failure)
- RTO (Recovery Time Objective): Maximum acceptable downtime (time from failure declaration to service restoration)
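To make the definitions concrete, a minimal sketch (hypothetical timestamps) that checks whether a failure scenario stays within a given RPO target:

```python
from datetime import datetime, timedelta, timezone

def rpo_violation(last_backup: datetime, failure: datetime,
                  rpo: timedelta) -> bool:
    """True when the data-loss window exceeds the RPO target."""
    return (failure - last_backup) > rpo

last_backup = datetime(2025, 10, 30, 14, 0, tzinfo=timezone.utc)
failure = datetime(2025, 10, 30, 14, 12, tzinfo=timezone.utc)

# A 12-minute gap violates the Enterprise target of <= 5 minutes...
print(rpo_violation(last_backup, failure, timedelta(minutes=5)))   # True
# ...but still satisfies the Pro target of <= 15 minutes.
print(rpo_violation(last_backup, failure, timedelta(minutes=15)))  # False
```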
RPO/RTO by Environment & Edition¶
| Environment | Edition | RPO Target | RTO Target | Notes |
|---|---|---|---|---|
| Production | Enterprise | ≤ 5 minutes | ≤ 30 minutes | Multi-region active-active, continuous backups |
| Production | Pro | ≤ 15 minutes | ≤ 30 minutes | Single region, hourly incrementals |
| Production | Free | ≤ 24 hours | ≤ 2 hours | Daily backups, recreate from Git + IaC acceptable |
| Staging | All | ≤ 1 hour | ≤ 1 hour | Production-like validation |
| Test | All | ≤ 12 hours | ≤ 2 hours | Daily backups, restore within business hours |
| Dev | All | ≤ 24 hours | ≤ 4 hours | Recreate from Git + IaC preferred over restore |
RPO/RTO by Scenario¶
| Scenario | RPO Target | RTO Target | Procedure |
|---|---|---|---|
| Zonal failure (within region) | ~ 0 minutes | ≤ 15 minutes | ZRS + multi-zone node pools; AFD keeps region healthy |
| Regional outage (cross-region allowed) | ≤ 5 minutes | 15-60 minutes | AFD cutover + ASB alias + HOT replication lag |
| Regional outage (strict residency pair) | ≤ 15 minutes | 1-4 hours | Same-jurisdiction DR restore + replay WARM |
| AKS control-plane impairment | ~ 0 minutes | 30-60 minutes | Shift to sibling cluster or re-create node pools |
| Key Vault/HSM incident | ~ 0 minutes | 1-2 hours | Restore HSM backup; dual-sign window if needed |
| Data corruption | Variable | 30-60 minutes | PITR to last known good point; tenant-scoped restore |
Backup Strategy¶
Azure SQL Database Backups¶
Automated Backups¶
Production:
- Full Backup: Weekly (Sunday 2:00 AM UTC)
- Differential Backup: Every 12-24 hours (automated)
- Transaction Log Backup: Every 5-10 minutes (automated)
- Retention:
  - Short-term (PITR): 35 days
  - Long-term retention (LTR): Weekly/Monthly/Yearly backups retained up to 10 years
- Redundancy: Geo-redundant (replicated to paired region)
Configuration:
# Configure backup storage redundancy
az sql db update \
--name ATP_Prod \
--resource-group ATP-Prod-RG \
--server atp-sql-prod-eus \
--backup-storage-redundancy Geo
# Configure short-term (PITR) retention to 35 days
az sql db str-policy set \
--name ATP_Prod \
--resource-group ATP-Prod-RG \
--server atp-sql-prod-eus \
--retention-days 35
# Configure long-term retention policy
az sql db ltr-policy set \
--name ATP_Prod \
--server atp-sql-prod-eus \
--resource-group ATP-Prod-RG \
--weekly-retention "P4W" \
--monthly-retention "P12M" \
--yearly-retention "P10Y" \
--week-of-year 1
Point-in-Time Restore (PITR):
# Restore to specific point in time
az sql db restore \
--dest-name ATP_Prod_Restore \
--name ATP_Prod \
--resource-group ATP-Prod-RG \
--server atp-sql-prod-eus \
--time "2025-10-30T14:30:00Z"
Manual Backups (BACPAC Export)¶
Weekly Full Backup:
#!/bin/bash
# weekly-backup-prod.sh
BACKUP_DATE=$(date +%Y%m%d)
BACKUP_FILE="prod-full-${BACKUP_DATE}.bacpac"
STORAGE_URI="https://atpstorageprodeus.blob.core.windows.net/backups/weekly/${BACKUP_FILE}"
echo "Creating weekly full backup: ${BACKUP_FILE}"
az sql db export \
--name ATP_Prod \
--resource-group ATP-Prod-RG \
--server atp-sql-prod-eus \
--admin-user $(az keyvault secret show --vault-name atp-keyvault-prod-eus --name SqlAdminUser --query value -o tsv) \
--admin-password $(az keyvault secret show --vault-name atp-keyvault-prod-eus --name SqlAdminPassword --query value -o tsv) \
--storage-key $(az storage account keys list --account-name atpstorageprodeus --query "[0].value" -o tsv) \
--storage-key-type StorageAccessKey \
--storage-uri "${STORAGE_URI}"
# Tag backup with metadata
az storage blob metadata update \
--account-name atpstorageprodeus \
--container-name backups \
--name "weekly/${BACKUP_FILE}" \
--metadata \
type=full \
createdAt=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
retentionYears=7 \
environment=Production
echo "✅ Weekly full backup completed: ${BACKUP_FILE}"
Azure Blob Storage Backups (WORM)¶
WORM Storage Configuration¶
Production Container Setup:
# Enable versioning and immutable storage
az storage account blob-service-properties update \
--account-name atpstorageprodeus \
--resource-group ATP-Prod-RG \
--enable-versioning true \
--enable-delete-retention true \
--delete-retention-days 90
# Create container with immutable storage enabled
az storage container create \
--account-name atpstorageprodeus \
--name audit-events \
--public-access off \
--metadata \
purpose="audit-trail-worm" \
compliance="SEC-17a-4" \
retentionYears="7"
# Enable a time-based immutability (WORM) policy: 2555 days = 7 years
az storage container immutability-policy create \
--resource-group ATP-Prod-RG \
--account-name atpstorageprodeus \
--container-name audit-events \
--period 2555 \
--allow-protected-append-writes false
# Lock immutability policy (irreversible; requires the policy's current etag)
ETAG=$(az storage container immutability-policy show \
--resource-group ATP-Prod-RG \
--account-name atpstorageprodeus \
--container-name audit-events \
--query etag -o tsv)
az storage container immutability-policy lock \
--resource-group ATP-Prod-RG \
--account-name atpstorageprodeus \
--container-name audit-events \
--if-match "${ETAG}"
Segment Backup Strategy¶
Segment Snapshot Workflow:
// ConnectSoft.ATP.Backup/Services/SegmentBackupService.cs
public class SegmentBackupService
{
    private readonly IBlobContainerClient _blobContainer;
    private readonly IIntegrityService _integrityService;
    private readonly IBackupCatalog _catalog;

    public async Task<BackupResult> CreateSegmentBackupAsync(
        string tenantId,
        string segmentId,
        Stream segmentData,
        CancellationToken cancellationToken)
    {
        // 1. Compute segment hash and record count, rewinding before each pass
        var segmentHash = await ComputeSha256HashAsync(segmentData);
        segmentData.Position = 0;
        var recordCount = await CountRecordsAsync(segmentData);
        segmentData.Position = 0;

        // 2. Create segment blob with WORM policy
        var blobName = $"segments/{tenantId}/{segmentId}.jsonl.gz";
        var blobClient = _blobContainer.GetBlobClient(blobName);
        var uploadOptions = new BlobUploadOptions
        {
            Metadata = new Dictionary<string, string>
            {
                ["tenantId"] = tenantId,
                ["segmentId"] = segmentId,
                ["createdAt"] = DateTimeOffset.UtcNow.ToString("O"),
                ["segmentHash"] = segmentHash,
                ["format"] = "jsonl.gz"
            }
        };
        await blobClient.UploadAsync(segmentData, uploadOptions, cancellationToken);

        // 3. Create manifest blob
        var manifest = new SegmentManifest
        {
            SegmentId = segmentId,
            TenantId = tenantId,
            SegmentHash = segmentHash,
            CreatedAt = DateTimeOffset.UtcNow,
            BlobUri = blobClient.Uri.ToString(),
            RecordCount = recordCount
        };
        var manifestBlobName = $"manifests/{tenantId}/{segmentId}.manifest.json";
        var manifestBlob = _blobContainer.GetBlobClient(manifestBlobName);
        await manifestBlob.UploadAsync(
            new BinaryData(JsonSerializer.Serialize(manifest)),
            cancellationToken: cancellationToken);

        // 4. Register in backup catalog
        await _catalog.RegisterRecoveryPointAsync(new RecoveryPoint
        {
            Id = $"RP-{DateTimeOffset.UtcNow:yyyyMMddTHHmmssZ}-{tenantId}-{segmentId}",
            TenantId = tenantId,
            SegmentId = segmentId,
            Type = BackupType.Incremental,
            CreatedAt = DateTimeOffset.UtcNow,
            SegmentHash = segmentHash,
            ManifestUri = manifestBlob.Uri.ToString()
        }, cancellationToken);

        return new BackupResult
        {
            Success = true,
            SegmentHash = segmentHash,
            BlobUri = blobClient.Uri.ToString(),
            ManifestUri = manifestBlob.Uri.ToString()
        };
    }
}
Azure Cosmos DB Backups¶
Continuous Backup Mode¶
# Enable continuous backup mode (an account-level setting, not per-database)
az cosmosdb update \
--name atp-cosmos-prod \
--resource-group ATP-Shared-RG \
--backup-policy-type Continuous \
--continuous-tier Continuous30Days
# Point-in-time restore from continuous backup into a new account
az cosmosdb restore \
--account-name atp-cosmos-prod \
--resource-group ATP-Shared-RG \
--target-database-account-name atp-cosmos-prod-restored \
--restore-timestamp "2025-10-30T14:30:00Z" \
--location eastus
Redis Backups¶
Production (Premium tier):
# Configure RDB snapshots (Premium tier; rdb-storage-connection-string must also be set)
az redis update \
--name atp-redis-prod \
--resource-group ATP-Shared-RG \
--set \
redisConfiguration.rdb-backup-enabled=true \
redisConfiguration.rdb-backup-frequency=15 \
redisConfiguration.rdb-backup-max-snapshot-count=7
Backup Catalog¶
Catalog Schema:
CREATE TABLE BackupCatalog (
    RecoveryPointId VARCHAR(128) PRIMARY KEY,
    TenantId VARCHAR(128) NOT NULL,
    Region VARCHAR(64) NOT NULL,
    BackupType VARCHAR(32) NOT NULL,      -- 'full', 'incremental', 'point-in-time'
    CreatedAt DATETIME2 NOT NULL,
    CompletedAt DATETIME2 NULL,
    Status VARCHAR(32) NOT NULL,          -- 'in_progress', 'completed', 'failed'
    DataClasses NVARCHAR(MAX) NOT NULL,   -- JSON: ["hot", "warm", "cold"]
    Packages NVARCHAR(MAX) NOT NULL,      -- JSON: [{name, uri, hash, bytes}, ...]
    MerkleRoot VARCHAR(64) NULL,
    Signature NVARCHAR(MAX) NULL,
    KeyId VARCHAR(256) NULL,
    PolicyVersion VARCHAR(32) NOT NULL,
    INDEX idx_tenant_created (TenantId, CreatedAt),
    INDEX idx_region_status (Region, Status, CreatedAt)
);
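The catalog drives recovery-point selection during restore: the workflow picks the latest completed point at or before the requested time. A minimal sketch of that query over in-memory rows (hypothetical helper name pick_recovery_point):

```python
from datetime import datetime, timezone

catalog = [
    {"id": "RP-20251027T080000Z", "status": "completed",
     "created_at": datetime(2025, 10, 27, 8, tzinfo=timezone.utc)},
    {"id": "RP-20251028T080000Z", "status": "failed",
     "created_at": datetime(2025, 10, 28, 8, tzinfo=timezone.utc)},
    {"id": "RP-20251029T080000Z", "status": "completed",
     "created_at": datetime(2025, 10, 29, 8, tzinfo=timezone.utc)},
]

def pick_recovery_point(rows, not_after):
    """Latest completed recovery point created at or before `not_after`."""
    candidates = [r for r in rows
                  if r["status"] == "completed" and r["created_at"] <= not_after]
    return max(candidates, key=lambda r: r["created_at"], default=None)

best = pick_recovery_point(catalog, datetime(2025, 10, 30, tzinfo=timezone.utc))
print(best["id"])  # RP-20251029T080000Z (the failed point is skipped)
```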
Restore Procedures¶
Tenant-Scoped Restore¶
Restore Request Workflow¶
sequenceDiagram
autonumber
participant OP as Operator
participant CAT as Backup Catalog
participant STO as Backup Store
participant RST as Restore Controller
participant INT as Integrity Verifier
participant QUAR as Quarantine Namespace
OP->>CAT: Request restore {tenant, timeRange, scope}
CAT-->>OP: Best recovery point + manifests
OP->>RST: Start restore(jobId, recoveryPointId)
RST->>STO: Fetch packages (Private Link)
RST->>INT: Verify checksums, Merkle roots, signatures
INT-->>RST: Verification result + evidence
alt Verification passed
RST->>QUAR: Restore to quarantine (read-only)
RST-->>OP: Restore completed, ready for validation
OP->>RST: Validate restored data (sample queries)
RST-->>OP: Validation results
alt Validation passed
OP->>RST: Approve promotion (two-person approval)
RST->>QUAR: Promote to active namespace
RST-->>OP: Promotion completed
end
else Verification failed
RST-->>OP: Restore failed, integrity check failed
end
Restore Command Example¶
#!/bin/bash
# restore-tenant.sh
TENANT_ID="acme-corp"
RESTORE_FROM="2025-10-01T00:00:00Z"
RESTORE_TO="2025-10-31T23:59:59Z"
RECOVERY_POINT_ID="RP-20251027T080000Z-eu-west-acme"
echo "Starting tenant restore: ${TENANT_ID}"
# 1. Request restore
RESTORE_JOB_ID=$(az rest --method POST \
--uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/restores" \
--headers "Authorization: Bearer ${ACCESS_TOKEN}" \
--body '{
"recoveryPointId": "'"${RECOVERY_POINT_ID}"'",
"mode": "sandbox",
"target": {
"tenantId": "'"${TENANT_ID}"'",
"region": "westeurope"
},
"scope": {
"timeRange": {
"from": "'"${RESTORE_FROM}"'",
"to": "'"${RESTORE_TO}"'"
},
"dataClasses": ["hot", "warm"]
},
"verifyPolicy": {
"rowCounts": true,
"samplePercent": 5,
"proofs": true,
"checksums": true
}
}' \
--query "restoreJobId" -o tsv)
echo "Restore job started: ${RESTORE_JOB_ID}"
# 2. Monitor restore progress
while true; do
STATUS=$(az rest --method GET \
--uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/restores/${RESTORE_JOB_ID}" \
--headers "Authorization: Bearer ${ACCESS_TOKEN}" \
--query "status" -o tsv)
echo "Restore status: ${STATUS}"
if [ "${STATUS}" == "completed" ]; then
echo "✅ Restore completed successfully"
break
elif [ "${STATUS}" == "failed" ]; then
echo "❌ Restore failed"
exit 1
fi
sleep 10
done
# 3. Validate restored data
echo "Validating restored data..."
# Sample query to verify data integrity
# sqlcmd with Azure AD auth (-G); the az CLI has no direct T-SQL query command
RECORD_COUNT=$(sqlcmd -S atp-sql-prod-eus.database.windows.net \
-d ATP_Prod_Restore_Temp -G -h -1 -W \
-Q "SET NOCOUNT ON; SELECT COUNT(*) FROM AuditRecords WHERE TenantId = '${TENANT_ID}' AND CreatedAt >= '${RESTORE_FROM}' AND CreatedAt <= '${RESTORE_TO}'")
echo "Restored records: ${RECORD_COUNT}"
# 4. Approve promotion (if validation passed)
read -p "Approve promotion to production? (yes/no): " APPROVE
if [ "${APPROVE}" == "yes" ]; then
az rest --method POST \
--uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/restores/${RESTORE_JOB_ID}/promote" \
--headers "Authorization: Bearer ${ACCESS_TOKEN}" \
--body '{
"approvedBy": ["operator1@connectsoft.dev", "operator2@connectsoft.dev"],
"approvalTimestamp": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
}'
echo "✅ Promotion approved and completed"
fi
Point-in-Time Restore (PITR)¶
SQL Database PITR¶
#!/bin/bash
# pitr-restore.sh
RESTORE_TIME="2025-10-30T14:30:00Z"
TARGET_DB_NAME="ATP_Prod_Restore_$(date +%Y%m%d_%H%M%S)"
echo "Restoring to point in time: ${RESTORE_TIME}"
az sql db restore \
--dest-name "${TARGET_DB_NAME}" \
--name ATP_Prod \
--resource-group ATP-Prod-RG \
--server atp-sql-prod-eus \
--time "${RESTORE_TIME}"
echo "✅ PITR restore completed: ${TARGET_DB_NAME}"
# Verify restore (sqlcmd with Azure AD auth -G; az CLI has no T-SQL query command)
RECORD_COUNT=$(sqlcmd -S atp-sql-prod-eus.database.windows.net \
-d "${TARGET_DB_NAME}" -G -h -1 -W \
-Q "SET NOCOUNT ON; SELECT COUNT(*) FROM AuditRecords")
echo "Restored records: ${RECORD_COUNT}"
Full System Restore (Disaster Recovery)¶
See Disaster Recovery section below.
Disaster Recovery¶
Multi-Region Failover¶
Failover Decision Matrix¶
| Trigger | Condition | Action | RTO Target |
|---|---|---|---|
| AFD Health Probe | Primary region unhealthy > 5 minutes | Automated failover to DR region | ≤ 15 minutes |
| SLO Burn Rate | Error budget exhausted > threshold | Manual failover decision | ≤ 30 minutes |
| Operator Declaration | Incident commander declares DR | Manual failover execution | ≤ 60 minutes |
| Regional Outage | Azure region status = Down | Automated failover | ≤ 15 minutes |
Failover Orchestration¶
sequenceDiagram
autonumber
participant Mon as Monitor/SLO
participant IC as Incident Commander
participant AFD as Azure Front Door
participant BUS as Service Bus (Geo-DR)
participant REG as Registry
participant PDP as Policy Engine
participant CRDB as Database
Mon-->>IC: Region A unhealthy / SLO burn
IC->>AFD: Disable Region A origins; 100% to Region B
IC->>BUS: Flip Geo-DR alias to Namespace B
IC->>REG: Set tenant mode = read-only for homeRegion=A
alt Extended outage > 30 min
IC->>CRDB: Re-pin affected tenants to Region B
REG-->>PDP: Emit TenantRehomed obligations
end
IC->>REG: Trigger warm-up (keyheads) + Refresh broadcast
Mon-->>IC: SLOs recovered
IC->>Comms: Resolved update; start failback plan
Failover Script¶
#!/bin/bash
# failover-to-dr.sh
PRIMARY_REGION="eastus"
DR_REGION="westus"
PRIMARY_AFD_PROFILE="atp-frontdoor-prod"
echo "=== DISASTER RECOVERY: Failover to DR Region ==="
read -p "Confirm failover to ${DR_REGION}? (yes/no): " CONFIRM
if [ "${CONFIRM}" != "yes" ]; then
echo "Failover cancelled"
exit 0
fi
# 1. Update Azure Front Door routing (disable primary origin, promote DR origin)
echo "Updating AFD routing to DR region..."
az afd origin update \
--profile-name "${PRIMARY_AFD_PROFILE}" \
--resource-group ATP-Prod-RG \
--origin-group-name atp-origin-group \
--origin-name atp-primary-eastus \
--enabled-state Disabled
az afd origin update \
--profile-name "${PRIMARY_AFD_PROFILE}" \
--resource-group ATP-Prod-RG \
--origin-group-name atp-origin-group \
--origin-name atp-dr-westus \
--enabled-state Enabled \
--priority 1
# 2. Failover Service Bus Geo-DR alias (must be initiated on the SECONDARY namespace)
echo "Failing over Service Bus alias..."
az servicebus georecovery-alias fail-over \
--namespace-name atp-sb-prod-westus \
--resource-group ATP-Prod-RG \
--alias atp-sb-prod-alias
# 3. Update tenant registry (set read-only for primary region tenants)
echo "Updating tenant registry..."
az rest --method POST \
--uri "https://atp-gateway-prod.${DR_REGION}.cloudapp.azure.com/ops/v1/tenants/mode" \
--headers "Authorization: Bearer ${ACCESS_TOKEN}" \
--body "{
\"homeRegion\": \"${PRIMARY_REGION}\",
\"mode\": \"read-only\",
\"reason\": \"DR failover\"
}"
# 4. Trigger cache warm-up and projection refresh
echo "Triggering cache warm-up..."
az rest --method POST \
--uri "https://atp-gateway-prod.${DR_REGION}.cloudapp.azure.com/ops/v1/cache/warmup" \
--headers "Authorization: Bearer ${ACCESS_TOKEN}"
# 5. Validate SLOs
echo "Waiting for SLO validation..."
sleep 60
# Check traffic is flowing through the AFD profile (Standard/Premium lives under Microsoft.Cdn)
SLO_STATUS=$(az monitor metrics list \
--resource /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/ATP-Prod-RG/providers/Microsoft.Cdn/profiles/${PRIMARY_AFD_PROFILE} \
--metric "RequestCount" \
--query "value[0].timeseries[0].data[-1].total" -o tsv)
if [ -n "${SLO_STATUS}" ] && [ "${SLO_STATUS%.*}" -gt 0 ]; then
echo "✅ Failover completed successfully"
echo "Primary region: ${PRIMARY_REGION} (read-only)"
echo "DR region: ${DR_REGION} (active)"
else
echo "❌ Failover validation failed"
exit 1
fi
Failback Procedures¶
#!/bin/bash
# failback-to-primary.sh
PRIMARY_REGION="eastus"
DR_REGION="westus"
echo "=== DISASTER RECOVERY: Failback to Primary Region ==="
# 1. Verify primary region is healthy
echo "Verifying primary region health..."
PRIMARY_HEALTH=$(az monitor metrics list \
--resource /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/ATP-Prod-RG/providers/Microsoft.ContainerService/managedClusters/atp-aks-prod-${PRIMARY_REGION} \
--metric "kube_pod_status_ready" \
--query "value[0].timeseries[0].data[-1].average" -o tsv)
# Bash integer tests cannot compare floats; use awk for the 0.95 readiness threshold
if ! awk "BEGIN { exit !(${PRIMARY_HEALTH:-0} >= 0.95) }"; then
echo "❌ Primary region not healthy (${PRIMARY_HEALTH}), aborting failback"
exit 1
fi
# 2. Sync data from DR to primary (if needed)
echo "Syncing data from DR to primary..."
# (Implementation depends on data sync strategy)
# 3. Update AFD routing (staged traffic shift back to primary)
echo "Shifting traffic back to primary in stages..."
for PERCENTAGE in 10 25 50 75 100; do
echo "Shifting ${PERCENTAGE}% traffic to primary..."
# Update AFD routing weights (e.g., az afd origin update --weight ...)
sleep 300 # Wait 5 minutes between increments
done
# 4. Failback Service Bus alias (re-establish pairing first, then initiate
#    failover on the current secondary, i.e., the original primary namespace)
echo "Failing back Service Bus alias..."
az servicebus georecovery-alias fail-over \
--namespace-name atp-sb-prod-${PRIMARY_REGION} \
--resource-group ATP-Prod-RG \
--alias atp-sb-prod-alias
# 5. Update tenant registry
echo "Updating tenant registry (restore primary mode)..."
az rest --method POST \
--uri "https://atp-gateway-prod.${PRIMARY_REGION}.cloudapp.azure.com/ops/v1/tenants/mode" \
--headers "Authorization: Bearer ${ACCESS_TOKEN}" \
--body "{
\"homeRegion\": \"${PRIMARY_REGION}\",
\"mode\": \"read-write\",
\"reason\": \"Failback completed\"
}"
echo "✅ Failback completed successfully"
DR Drill Procedures¶
Monthly Drill Checklist:
- Schedule drill during maintenance window
- Notify stakeholders (status page, email)
- Backup current state (catalogs, configs)
- Execute failover to DR region
- Validate SLOs in DR region (latency, availability, correctness)
- Execute sample restore (tenant-scoped)
- Verify integrity (Merkle roots, signatures)
- Execute failback to primary
- Validate SLOs after failback
- Document findings (RTO/RPO actuals, issues, improvements)
- Update runbooks based on learnings
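Drill findings should record measured RTO/RPO actuals against the targets above. A minimal sketch of the arithmetic (hypothetical drill timestamps):

```python
from datetime import datetime, timedelta, timezone

declared = datetime(2025, 11, 2, 9, 0, tzinfo=timezone.utc)    # failure declared
restored = datetime(2025, 11, 2, 9, 24, tzinfo=timezone.utc)   # service restored in DR
last_backup = datetime(2025, 11, 2, 8, 57, tzinfo=timezone.utc)  # last successful backup

rto_actual = restored - declared      # 24 minutes of downtime
rpo_actual = declared - last_backup   # 3 minutes of potential data loss

# Compare against the Enterprise production targets (<= 30 min RTO, <= 5 min RPO).
print(rto_actual <= timedelta(minutes=30))  # True
print(rpo_actual <= timedelta(minutes=5))   # True
```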
eDiscovery & Legal Hold¶
Legal Hold Management¶
Legal Hold Types¶
| Type | Typical Trigger | Blocks | Expires |
|---|---|---|---|
| LegalHold | Litigation/Discovery | Purge/Redact/Delete | Manual release only |
| RegulatorExtension | Regulator directive (e.g., retention+) | Purge/Delete (may allow Redact) | Date- or directive-bound |
| InvestigationHold | Security/forensics | Purge/Delete (optional Redact) | Time-bound with review |
Legal Hold Record Model¶
// ConnectSoft.ATP.Platform/Models/LegalHold.cs
public class LegalHold
{
    public string HoldId { get; set; }            // e.g., "lh-01J9ZN5W"
    public string TenantId { get; set; }
    public string Stream { get; set; }            // e.g., "audit.default"
    public HoldPredicate Predicate { get; set; }  // { action: ["Export.Requested"], timeRange: {...} }
    public HoldState State { get; set; }          // Active, Released
    public List<string> Approvers { get; set; }
    public DateTimeOffset CreatedAt { get; set; }
    public string EvidenceRef { get; set; }       // Blob URI to manifest
}

public class HoldPredicate
{
    public List<string> Actions { get; set; }
    public TimeRange TimeRange { get; set; }
    public Dictionary<string, object> Attributes { get; set; }  // KQL/SQL-lite filters
}
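Before any purge/redact/delete runs, each record is checked against every active hold predicate. A minimal sketch of that match (hypothetical record shape; the real predicate also supports attribute filters):

```python
from datetime import datetime, timezone

def record_on_hold(record: dict, predicate: dict) -> bool:
    """True when the record matches the hold's action list and time range."""
    if record["action"] not in predicate["actions"]:
        return False
    start, end = predicate["time_range"]
    return start <= record["timestamp"] <= end

predicate = {
    "actions": ["Export.Requested", "Export.Completed"],
    "time_range": (datetime(2025, 1, 1, tzinfo=timezone.utc),
                   datetime(2025, 12, 31, tzinfo=timezone.utc)),
}
record = {"action": "Export.Requested",
          "timestamp": datetime(2025, 6, 15, tzinfo=timezone.utc)}

print(record_on_hold(record, predicate))  # True -> purge/delete must be blocked
```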
Apply Legal Hold¶
#!/bin/bash
# apply-legal-hold.sh
TENANT_ID="acme-corp"
HOLD_ID="lh-$(date +%Y%m%d%H%M%S)"
CASE_ID="litigation-2025-001"
echo "Applying legal hold: ${HOLD_ID}"
# 1. Create legal hold record
az rest --method POST \
--uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/legal-holds" \
--headers "Authorization: Bearer ${ACCESS_TOKEN}" \
--body '{
"holdId": "'"${HOLD_ID}"'",
"tenantId": "'"${TENANT_ID}"'",
"stream": "audit.default",
"predicate": {
"timeRange": {
"from": "2025-01-01T00:00:00Z",
"to": "2025-12-31T23:59:59Z"
},
"actions": ["Export.Requested", "Export.Completed"]
},
"state": "Active",
"approvers": ["legal@connectsoft.dev", "owner@acme-corp.com"],
"reason": "Litigation: '"${CASE_ID}"'"
}'
# 2. Apply legal hold to Azure Blob Storage container
# (legal-hold tags must be 3-23 alphanumeric characters; strip hyphens)
az storage container legal-hold set \
--account-name atpstorageprodeus \
--container-name "atp-${TENANT_ID}-hot" \
--tags "${HOLD_ID//-/}" "${CASE_ID//-/}" \
--allow-protected-append-writes-all false
echo "✅ Legal hold applied: ${HOLD_ID}"
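Azure legal-hold tags must be 3-23 alphanumeric characters, so identifiers such as lh-20251030120000 need normalizing before they are passed to the CLI. A minimal sketch of that normalization (hypothetical helper name):

```python
import re

def to_legal_hold_tag(value: str) -> str:
    """Strip non-alphanumerics and clamp to Azure's 3-23 character tag limit."""
    tag = re.sub(r"[^a-zA-Z0-9]", "", value)[:23]
    if len(tag) < 3:
        raise ValueError(f"tag too short after normalization: {value!r}")
    return tag.lower()

print(to_legal_hold_tag("lh-20251030120000"))    # lh20251030120000
print(to_legal_hold_tag("litigation-2025-001"))  # litigation2025001
```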
Release Legal Hold¶
#!/bin/bash
# release-legal-hold.sh
TENANT_ID="acme-corp"
HOLD_ID="lh-20251030120000"
APPROVER="legal@connectsoft.dev"
echo "Releasing legal hold: ${HOLD_ID}"
# 1. Release legal hold (requires approver authorization)
az rest --method POST \
--uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/legal-holds/${HOLD_ID}/release" \
--headers "Authorization: Bearer ${ACCESS_TOKEN}" \
--body '{
"releasedBy": "'"${APPROVER}"'",
"reason": "Litigation resolved",
"releaseTimestamp": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
}'
# 2. Remove legal hold tags from blob container
# (legal-hold tags are alphanumeric only; strip hyphens to match the applied tag)
az storage container legal-hold clear \
--account-name atpstorageprodeus \
--container-name "atp-${TENANT_ID}-hot" \
--tags "${HOLD_ID//-/}"
echo "✅ Legal hold released: ${HOLD_ID}"
Data Subject Access Requests (DSAR)¶
DSAR Workflow¶
sequenceDiagram
autonumber
participant U as User/Admin
participant GW as Gateway
participant DSR as DSAR Orchestrator
participant SVC as Domain Services
participant EXP as Export Service
participant REV as Review Lane
U->>GW: Submit DSAR (access/erasure)
GW->>DSR: create_case(tenantId, subjectId, type)
DSR->>SVC: fanout(workflow tasks per context)
SVC-->>DSR: status/proofs (export bundle, erasure markers)
DSR->>EXP: Generate export bundle (with redaction)
EXP-->>DSR: Export bundle + signed manifest
DSR->>REV: Route to review lane
REV->>REV: Review export (minimize PII)
alt Approved
REV->>DSR: Approve export
DSR-->>U: Deliver export (presigned URL)
else Rejected
REV->>DSR: Request revision
DSR->>EXP: Regenerate with additional redaction
end
DSAR Case Model¶
// ConnectSoft.ATP.Platform/Models/DsarCase.cs
public class DsarCase
{
    public string CaseId { get; set; }            // e.g., "dsar-241"
    public string TenantId { get; set; }
    public DsarSubject Subjects { get; set; }     // { email: [...], phone: [...] }
    public TimeRange TimeRange { get; set; }
    public Dictionary<string, object> Filters { get; set; }  // { actions: [...] }
    public string Purpose { get; set; }           // "Data Subject Access Request"
    public DsarState State { get; set; }          // Opened, Discovery, Review, Approved, Exported, Closed
    public List<string> Reviewers { get; set; }
    public string PolicyVersion { get; set; }
}

public class DsarSubject
{
    public List<string> Emails { get; set; }
    public List<string> Phones { get; set; }
    public List<string> SubjectIds { get; set; }
}
Create DSAR Export¶
#!/bin/bash
# create-dsar-export.sh
CASE_ID="dsar-241"
TENANT_ID="acme-corp"
SUBJECT_EMAIL="john@example.com"
echo "Creating DSAR export for case: ${CASE_ID}"
EXPORT_ID=$(az rest --method POST \
--uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/v1/audit/exports" \
--headers "Authorization: Bearer ${ACCESS_TOKEN}" \
--body '{
"tenantId": "'"${TENANT_ID}"'",
"caseId": "'"${CASE_ID}"'",
"query": "subject.email == '\'"${SUBJECT_EMAIL}"\'' AND time >= 2025-09-01 AND time <= 2025-10-31",
"format": "parquet",
"redaction": {
"hashSubjects": true,
"truncateIp": true,
"dropFields": ["resource.path", "payload.sensitive"]
},
"sign": true,
"encryption": "kms://key-ref",
"notify": ["privacy@acme-corp.com"]
}' \
--query "exportId" -o tsv)
echo "DSAR export created: ${EXPORT_ID}"
# Monitor export progress
while true; do
STATUS=$(az rest --method GET \
--uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/v1/audit/exports/${EXPORT_ID}/status" \
--headers "Authorization: Bearer ${ACCESS_TOKEN}" \
--query "status" -o tsv)
echo "Export status: ${STATUS}"
if [ "${STATUS}" == "completed" ]; then
DOWNLOAD_URL=$(az rest --method GET \
--uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/v1/audit/exports/${EXPORT_ID}/download" \
--headers "Authorization: Bearer ${ACCESS_TOKEN}" \
--query "downloadUrl" -o tsv)
echo "✅ Export completed"
echo "Download URL: ${DOWNLOAD_URL}"
break
elif [ "${STATUS}" == "failed" ]; then
echo "❌ Export failed"
exit 1
fi
sleep 10
done
Export Manifest Structure¶
{
"manifestId": "man_01HZK...",
"tenantId": "acme-corp",
"region": "westeurope",
"format": "parquet",
"schemaVersion": "2.4.1",
"counts": {
"rows": 5423188,
"files": 12
},
"bytes": 87331002881,
"hashes": [
{
"file": "part-0000.parquet",
"sha256": "b3f3a1b2c4d5e6f7..."
}
],
"proofRefs": [
{
"stream": "aud.gateway",
"fromSeg": "000130",
"toSeg": "000145"
}
],
"encryption": {
"keyId": "hsm-eu-01",
"keyVersion": "8"
},
"policyVersion": "3.1.0",
"tsa": {
"type": "rfc3161",
"token": "b64:MEUCIQ..."
},
"signature": "MEUCIQ..."
}
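Verifying an export bundle against its manifest is a per-file SHA-256 comparison over the `hashes` array. A minimal sketch, with in-memory bytes standing in for downloaded files:

```python
import hashlib

def verify_files(manifest_hashes, files: dict) -> bool:
    """Compare each file's SHA-256 against its manifest entry; fail on any miss."""
    for entry in manifest_hashes:
        data = files.get(entry["file"])
        if data is None:
            return False  # file listed in manifest but not present
        if hashlib.sha256(data).hexdigest() != entry["sha256"]:
            return False  # content does not match the signed manifest
    return True

payload = b"example parquet bytes"
manifest_hashes = [{"file": "part-0000.parquet",
                    "sha256": hashlib.sha256(payload).hexdigest()}]

print(verify_files(manifest_hashes, {"part-0000.parquet": payload}))      # True
print(verify_files(manifest_hashes, {"part-0000.parquet": b"tampered"}))  # False
```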
Integrity & Verification¶
Merkle Proof Verification¶
// ConnectSoft.ATP.Integrity/Services/MerkleVerifier.cs
public class MerkleVerifier
{
    public async Task<VerificationResult> VerifySegmentAsync(
        string segmentId,
        Stream segmentData,
        string expectedRootHash,
        List<MerklePathNode> merklePath,
        CancellationToken cancellationToken)
    {
        // 1. Compute leaf hash
        var leafHash = await ComputeSha256HashAsync(segmentData);
        segmentData.Position = 0;

        // 2. Recompute root hash using Merkle path
        var computedRoot = await RecomputeMerkleRootAsync(leafHash, merklePath);

        // 3. Compare with expected root
        if (computedRoot != expectedRootHash)
        {
            return new VerificationResult
            {
                Success = false,
                Reason = $"Merkle root mismatch: expected {expectedRootHash}, got {computedRoot}"
            };
        }

        // 4. Verify digital signature (if provided)
        // (Implementation depends on signature scheme)

        return new VerificationResult
        {
            Success = true,
            LeafHash = leafHash,
            RootHash = computedRoot
        };
    }
}
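The root recomputation walks the audit path from leaf to root, hashing the current node with its sibling at each level. A minimal sketch of the same math (SHA-256 over concatenated child hashes; "left"/"right" marks which side the sibling sits on; a hypothetical tree ordering, since the wire format of the path is defined elsewhere):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def recompute_root(leaf: bytes, path: list[tuple[str, bytes]]) -> bytes:
    """Fold the leaf hash up the tree using the sibling hashes on the path."""
    node = leaf
    for side, sibling in path:
        node = sha256(sibling + node) if side == "left" else sha256(node + sibling)
    return node

# Build a tiny 4-leaf tree and verify leaf index 2 against its root.
leaves = [sha256(f"segment-{i}".encode()) for i in range(4)]
l01 = sha256(leaves[0] + leaves[1])
l23 = sha256(leaves[2] + leaves[3])
root = sha256(l01 + l23)

path = [("right", leaves[3]), ("left", l01)]  # siblings for leaf 2, bottom-up
print(recompute_root(leaves[2], path) == root)  # True
```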
Backup Integrity Check¶
#!/bin/bash
# verify-backup-integrity.sh
RECOVERY_POINT_ID="RP-20251027T080000Z-eu-west-acme"
echo "Verifying backup integrity: ${RECOVERY_POINT_ID}"
# 1. Fetch manifest
MANIFEST=$(az rest --method GET \
--uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/backups/${RECOVERY_POINT_ID}/manifest" \
--headers "Authorization: Bearer ${ACCESS_TOKEN}")
# 2. Download and verify each package (jq -c emits one package object per line)
while read -r PACKAGE; do
PACKAGE_NAME=$(echo "${PACKAGE}" | jq -r '.name')
PACKAGE_URI=$(echo "${PACKAGE}" | jq -r '.uri')
EXPECTED_HASH=$(echo "${PACKAGE}" | jq -r '.hash')
echo "Verifying package: ${PACKAGE_NAME}"
# Download package (blob names may contain path segments; flatten locally)
LOCAL_FILE="/tmp/$(basename "${PACKAGE_NAME}")"
az storage blob download \
--account-name atpstorageprodeus \
--container-name backups \
--name "${PACKAGE_NAME}" \
--file "${LOCAL_FILE}"
# Compute hash
ACTUAL_HASH=$(sha256sum "${LOCAL_FILE}" | cut -d' ' -f1)
# Compare hashes
if [ "${ACTUAL_HASH}" != "${EXPECTED_HASH}" ]; then
echo "❌ Hash mismatch for ${PACKAGE_NAME}"
echo " Expected: ${EXPECTED_HASH}"
echo " Actual: ${ACTUAL_HASH}"
exit 1
else
echo "✅ Hash verified: ${PACKAGE_NAME}"
fi
done < <(echo "${MANIFEST}" | jq -c '.packages[]')
# (process substitution keeps the loop in the main shell so `exit 1` aborts the script)
# 3. Verify Merkle root
EXPECTED_MERKLE_ROOT=$(echo "${MANIFEST}" | jq -r '.merkleRoot')
# (Merkle root verification logic)
# 4. Verify digital signature
SIGNATURE=$(echo "${MANIFEST}" | jq -r '.signature')
KEY_ID=$(echo "${MANIFEST}" | jq -r '.keyId')
# (Digital signature verification logic)
echo "✅ Backup integrity verified: ${RECOVERY_POINT_ID}"
Operational Procedures¶
Daily Backup Validation¶
#!/bin/bash
# daily-backup-validation.sh
echo "=== Daily Backup Validation ==="
# 1. Verify SQL automated backups exist (earliestRestoreDate confirms PITR coverage)
LATEST_SQL_BACKUP=$(az sql db show \
--name ATP_Prod \
--server atp-sql-prod-eus \
--resource-group ATP-Prod-RG \
--query "earliestRestoreDate" -o tsv)
if [ -z "${LATEST_SQL_BACKUP}" ]; then
echo "❌ No SQL backups found"
exit 1
fi
echo "✅ SQL PITR window starts at: ${LATEST_SQL_BACKUP}"
# 2. Verify Cosmos DB continuous backup mode
COSMOS_BACKUP_MODE=$(az cosmosdb show \
  --name atp-cosmos-prod \
  --resource-group ATP-Shared-RG \
  --query "backupPolicy.type" -o tsv)
if [ "${COSMOS_BACKUP_MODE}" != "Continuous" ]; then
  echo "❌ Cosmos DB not in continuous backup mode"
  exit 1
fi
echo "✅ Cosmos DB continuous backup enabled"
# 3. Verify blob geo-replication status (geoReplicationStats is only
# returned when explicitly expanded)
GEO_REPL_STATUS=$(az storage account show \
  --name atpstorageprodeus \
  --resource-group ATP-Prod-RG \
  --expand geoReplicationStats \
  --query "geoReplicationStats.status" -o tsv)
if [ "${GEO_REPL_STATUS}" != "Live" ]; then
  echo "❌ Blob geo-replication not live"
  exit 1
fi
echo "✅ Blob geo-replication live"
# 4. Check backup catalog for recent recovery points (the az CLI has no
# T-SQL query subcommand; use sqlcmd with AAD auth instead)
RECENT_BACKUPS=$(sqlcmd -S atp-sql-prod-eus.database.windows.net -d ATP_BackupCatalog \
  -G -h -1 -W \
  -Q "SET NOCOUNT ON; SELECT COUNT(*) FROM BackupCatalog WHERE CreatedAt >= DATEADD(day, -1, GETUTCDATE()) AND Status = 'completed'")
if [ "${RECENT_BACKUPS}" -lt 20 ]; then
  echo "❌ Insufficient recent backups (expected >= 20, got ${RECENT_BACKUPS})"
  exit 1
fi
echo "✅ Recent backups: ${RECENT_BACKUPS}"
echo "✅ Daily backup validation passed"
Monthly Recovery Drill¶
#!/bin/bash
# monthly-recovery-drill.sh
echo "=== Monthly Recovery Drill ==="
# 1. Select tenant and recovery point (hardcoded here for illustration;
# production drills should pick a random tenant and look the recovery point
# ID up in the backup catalog rather than constructing it from wall-clock time)
TENANT_ID="acme-corp"
RECOVERY_POINT_ID="RP-$(date -d '7 days ago' +%Y%m%dT%H%M%SZ)-eu-west-${TENANT_ID}"
echo "Selected recovery point: ${RECOVERY_POINT_ID}"
# 2. Execute restore to sandbox
RESTORE_JOB_ID=$(az rest --method POST \
--uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/restores" \
--headers "Authorization: Bearer ${ACCESS_TOKEN}" \
--body '{
"recoveryPointId": "'"${RECOVERY_POINT_ID}"'",
"mode": "sandbox",
"target": {
"tenantId": "'"${TENANT_ID}"'",
"region": "westeurope"
},
"verifyPolicy": {
"rowCounts": true,
"samplePercent": 10,
"proofs": true
}
}' \
--query "restoreJobId" -o tsv)
echo "Restore job started: ${RESTORE_JOB_ID}"
# 3. Monitor restore
# (Wait for completion, log RTO)
# 4. Validate restored data
# (Sample queries, integrity checks)
# 5. Document results (rtoActual/rpoActual below are placeholder values;
# record the durations actually measured in steps 3-4)
echo "Recording drill results..."
az rest --method POST \
--uri "https://atp-gateway-prod.westeurope.cloudapp.azure.com/ops/v1/drills/record" \
--headers "Authorization: Bearer ${ACCESS_TOKEN}" \
--body '{
"drillId": "drill-'"$(date +%Y%m%d)"'",
"recoveryPointId": "'"${RECOVERY_POINT_ID}"'",
"rtoActual": 45,
"rpoActual": 8,
"result": "passed",
"notes": "All checks passed"
}'
echo "✅ Monthly recovery drill completed"
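The drill record above posts fixed `rtoActual`/`rpoActual` values; in a real drill they should be measured. A minimal sketch of deriving them, assuming RTO is minutes from restore submission to completion and RPO is minutes between the recovery-point timestamp and the last committed write. The timestamps below are hypothetical stand-ins for values read from the restore job and the backup catalog:

```shell
#!/bin/bash
# drill-metrics.sh — sketch of deriving rtoActual/rpoActual for the drill
# record. The ISO-8601 inputs are hypothetical; in practice they come from
# the restore job status and the backup catalog.
RESTORE_SUBMITTED="2025-10-27T08:00:00Z"
RESTORE_COMPLETED="2025-10-27T08:45:00Z"
RECOVERY_POINT_TS="2025-10-27T07:52:00Z"
LAST_WRITE_TS="2025-10-27T08:00:00Z"

to_epoch() { date -u -d "$1" +%s; }   # GNU date; use `gdate` on macOS

# RTO: submission -> completion; RPO: recovery point -> last committed write
RTO_ACTUAL_MIN=$(( ( $(to_epoch "$RESTORE_COMPLETED") - $(to_epoch "$RESTORE_SUBMITTED") ) / 60 ))
RPO_ACTUAL_MIN=$(( ( $(to_epoch "$LAST_WRITE_TS") - $(to_epoch "$RECOVERY_POINT_TS") ) / 60 ))

echo "rtoActual=${RTO_ACTUAL_MIN} rpoActual=${RPO_ACTUAL_MIN}"
# → rtoActual=45 rpoActual=8
```

These computed values would then be substituted into the `/ops/v1/drills/record` body in place of the hardcoded numbers.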
Monitoring & Alerting¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `backup_runs_total{result}` | Counter | Backups by result (success/failed) | Failures > baseline (e.g., > 2 in 24h) |
| `backup_bytes_total` | Counter | Total bytes uploaded | Sudden drop/spike (> 50% change) |
| `backup_duration_seconds` | Histogram | Backup wall time | p95 > SLO (e.g., > 2 hours for full) |
| `restore_duration_seconds` | Histogram | Drill/restore time | p95 > RTO target |
| `backup_proof_failures_total` | Counter | Integrity verification failures | Any > 0 |
| `rpo_effective_seconds` | Gauge | Now − last successful cutover | > RPO target (e.g., > 15 minutes) |
| `rto_drill_pass_rate` | Gauge | % of drills meeting RTO | < target (e.g., < 95%) |
Prometheus Alert Rules¶
# alerts-backup-restore.yml
groups:
  - name: backup_restore
    interval: 30s
    rules:
      - alert: BackupFailureRate
        expr: rate(backup_runs_total{result="failed"}[1h]) > 0.1
        for: 15m
        labels:
          severity: critical
          component: backup
        annotations:
          summary: "Backup failure rate exceeded threshold"
          description: "Backup failure rate is {{ $value }} failures/sec over 1h (threshold: 0.1)"
      - alert: BackupRPOExceeded
        expr: rpo_effective_seconds > 900  # 15 minutes
        for: 10m
        labels:
          severity: warning
          component: backup
        annotations:
          summary: "RPO exceeded"
          description: "Current RPO is {{ $value }}s (target: 900s)"
      - alert: RestoreRTOExceeded
        # p95 must be derived from the histogram buckets;
        # restore_duration_seconds{p95} is not valid PromQL
        expr: histogram_quantile(0.95, sum(rate(restore_duration_seconds_bucket[10m])) by (le)) > 1800  # 30 minutes
        for: 5m
        labels:
          severity: critical
          component: restore
        annotations:
          summary: "Restore RTO exceeded"
          description: "Restore p95 duration is {{ $value }}s (target: 1800s)"
      - alert: BackupIntegrityFailure
        # Fire on any new failure, not on the counter's lifetime value
        expr: increase(backup_proof_failures_total[1h]) > 0
        for: 0m
        labels:
          severity: critical
          component: integrity
        annotations:
          summary: "Backup integrity verification failed"
          description: "Backup proof verification failed for recovery point"
Compliance & Evidence¶
Compliance Requirements¶
| Framework | Requirement | Control |
|---|---|---|
| SEC 17a-4 | Immutable records, 7-year retention | WORM storage, legal hold, signed manifests |
| HIPAA | Audit logs, integrity, retention | Encryption, access controls, audit trails |
| GDPR | Data subject rights, erasure | DSAR workflows, legal hold, export capabilities |
| SOC 2 | Availability, integrity, confidentiality | Backup/restore procedures, access controls, monitoring |
Evidence Collection¶
**Backup Evidence Pack:**

- Recovery point manifest
- Merkle root proofs
- Digital signatures
- TSA tokens (if applicable)
- Backup catalog entries
- Integrity verification results

**Restore Evidence Pack:**

- Restore log (immutable)
- Verification results
- Approval records (two-person)
- Promotion timestamps
- SLO validation results

**eDiscovery Evidence Pack:**

- Export manifest (signed)
- Redaction policies applied
- Chain of custody logs
- Delivery receipts
- Legal hold references
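Each pack's integrity can be anchored with an ordinary hash manifest before signing. A minimal sketch, with hypothetical file names and the digital-signature/TSA steps omitted:

```shell
#!/bin/bash
# evidence-pack.sh — sketch: build a hash manifest over an evidence pack
# and verify it later. File names are hypothetical; signing the SHA256SUMS
# file (and obtaining a TSA token) would follow as separate steps.
set -euo pipefail
PACK_DIR=$(mktemp -d)
printf 'manifest-contents' > "${PACK_DIR}/recovery-point-manifest.json"
printf 'proof-contents'    > "${PACK_DIR}/merkle-proofs.json"

# Build the pack's hash manifest
( cd "${PACK_DIR}" && sha256sum *.json > SHA256SUMS )

# Later (or on the receiving side): verify nothing in the pack changed
( cd "${PACK_DIR}" && sha256sum -c SHA256SUMS )
```

Any subsequent modification to a pack file makes `sha256sum -c` fail, giving reviewers a cheap tamper check independent of the signature layer.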
Troubleshooting¶
Common Issues¶
Issue: Backup Failed - Storage Unavailable¶
Symptoms:
- Backup job status = failed
- Error: 503 Service Unavailable from object store
Resolution:

1. Check Azure Storage account status
2. Verify network connectivity (Private Link, firewall rules)
3. Retry backup with exponential backoff
4. If persistent, escalate to Azure support
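The exponential backoff in step 3 of the resolution can be sketched as a small wrapper; `trigger_backup` below is a stand-in for the real backup invocation and simply simulates a transient failure:

```shell
#!/bin/bash
# retry-backoff.sh — sketch: retry a flaky command with exponential backoff.
# `trigger_backup` is hypothetical; in production it would be the actual
# backup trigger (e.g., the ops API call).
retry_with_backoff() {
  local max_attempts=$1; shift
  local delay=1 attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if "$@"; then
      return 0
    fi
    echo "Attempt ${attempt}/${max_attempts} failed; retrying in ${delay}s" >&2
    sleep "${delay}"
    delay=$(( delay * 2 ))   # 1s, 2s, 4s, ... (cap and jitter omitted for brevity)
  done
  return 1
}

ATTEMPTS=0
trigger_backup() {  # fails twice, then succeeds — simulates a transient 503
  ATTEMPTS=$(( ATTEMPTS + 1 ))
  [ "${ATTEMPTS}" -ge 3 ]
}

retry_with_backoff 5 trigger_backup && echo "Backup triggered after ${ATTEMPTS} attempts"
# prints "Backup triggered after 3 attempts"
```

A production version would also cap the maximum delay and add random jitter to avoid synchronized retry storms across tenants.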
Issue: Restore Failed - Integrity Verification Failed¶
Symptoms:
- Restore job status = failed
- Error: 422 Unprocessable Entity - Hash mismatch
Resolution:

1. Review integrity verification logs
2. Compare expected vs. actual hashes
3. Re-download packages and re-verify
4. If corruption confirmed, use earlier recovery point
5. Escalate to integrity service team
Issue: Legal Hold Prevents Deletion¶
Symptoms:
- Retention policy cannot delete records
- Error: 409 Conflict - Legal hold active
Resolution:

1. Review active legal holds for tenant/scope
2. Verify hold expiration dates
3. If hold should be released, follow release procedure (requires approver)
4. Do not force deletion (violates compliance)
Runbooks & Checklists¶
Backup Runbook¶
1. **Verify Prerequisites**
   - Object store accessible
   - Integrity service healthy
   - Backup catalog available
   - Sufficient storage quota
2. **Execute Backup**
   - Trigger backup (scheduled or on-demand)
   - Monitor backup progress
   - Verify completion status
3. **Validate Backup**
   - Check backup catalog entry
   - Verify package hashes
   - Confirm Merkle root exists
   - Validate digital signature
4. **Document Results**
   - Record recovery point ID
   - Log backup duration
   - Note any issues/warnings
Restore Runbook¶
1. **Request Restore**
   - Identify recovery point
   - Define scope (tenant, time range, data classes)
   - Select target (sandbox or production)
2. **Execute Restore**
   - Submit restore request
   - Monitor restore progress
   - Wait for completion
3. **Verify Integrity**
   - Check segment checksums
   - Verify Merkle roots
   - Validate signatures
   - Confirm journal continuity
4. **Validate Data**
   - Sample queries
   - Row count verification
   - Policy enforcement checks
5. **Promote (if production)**
   - Two-person approval
   - Execute promotion
   - Validate SLOs
   - Document promotion
DR Failover Runbook¶
1. **Declare DR**
   - Confirm incident scope
   - Notify stakeholders
   - Activate incident response
2. **Execute Failover**
   - Update AFD routing
   - Failover Service Bus alias
   - Update tenant registry
   - Trigger cache warm-up
3. **Validate DR Region**
   - Check SLOs
   - Verify data accessibility
   - Test critical workflows
4. **Monitor & Communicate**
   - Update status page
   - Send tenant notifications
   - Document failover timeline
5. **Plan Failback**
   - Verify primary region health
   - Sync data (if needed)
   - Execute gradual failback
   - Validate post-failback SLOs
Summary¶
This document provides comprehensive operational guidance for ATP's backup, restore, disaster recovery, and eDiscovery strategies. Key takeaways:
- RPO/RTO: Targets vary by environment/edition (Enterprise: ≤5min RPO, ≤30min RTO)
- Backups: Automated and on-demand, with integrity verification, encryption, and WORM storage
- Restores: Tenant-scoped and full-system, with quarantine validation before promotion
- DR: Multi-region failover with automated orchestration and gradual failback
- eDiscovery: Legal holds, DSAR workflows, signed exports, compliance evidence
- Integrity: Merkle proofs, hash chains, digital signatures, tamper-evidence
- Compliance: SEC 17a-4, HIPAA, GDPR, SOC 2 controls and evidence collection
Next Steps:

- Review and customize RPO/RTO targets for your organization
- Schedule regular recovery drills (monthly)
- Establish legal hold and DSAR procedures
- Configure monitoring and alerting
- Train operations team on runbooks
Document Version: 1.0
Last Updated: 2025-10-30
Maintained By: Platform Operations Team