Kubernetes backup and disaster recovery (DR) are critical components of any production environment. As organizations increasingly rely on Kubernetes for mission-critical applications, having robust backup and recovery mechanisms becomes essential to ensure business continuity in the face of unexpected failures, data corruption, or cluster outages.
A well-designed backup and DR strategy should address not only the Kubernetes cluster state but also the application data and supporting infrastructure.
```mermaid
graph TD
    A[Kubernetes Backup & DR] --> B[Cluster State Backup]
    A --> C[Application Data Backup]
    A --> D[DR Planning]
    A --> E[Testing & Validation]
    B --> B1[etcd Backup]
    B --> B2[Resource Definitions]
    B --> B3[Certificates & Auth]
    C --> C1[PV Snapshots]
    C --> C2[Database Backups]
    C --> C3[External Storage]
    D --> D1[RTO Planning]
    D --> D2[RPO Planning]
    D --> D3[Runbook Creation]
    E --> E1[Scheduled Tests]
    E --> E2[Simulated Failures]
    E --> E3[Process Validation]
```
The Kubernetes cluster state comprises several critical components that need to be included in your backup strategy:
- **etcd Database**: The central datastore for all cluster configuration and state
- **API Server Configuration**: Including certificates, authentication, and authorization settings
- **Controller Manager Settings**: Configuration for core controllers
- **Scheduler Configuration**: Policies and configurations that determine pod scheduling
- **Custom Resource Definitions (CRDs)**: Extensions to the Kubernetes API
- **RBAC Configuration**: Role-based access control policies and bindings
- **Namespace Configurations**: Resource quotas, network policies, and other namespace-specific settings
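Alongside etcd snapshots, it can be useful to export resource definitions as plain YAML so they can be reviewed or re-applied independently of a full etcd restore. A minimal sketch (the output path and the resource kinds listed are illustrative; adjust them to what matters in your cluster):

```bash
#!/bin/bash
# Export key API resources as YAML for offline review or re-apply.
OUTPUT_DIR="/backup/resource-definitions/$(date +%Y-%m-%d)"
mkdir -p "${OUTPUT_DIR}"

# Cluster-scoped resources worth keeping alongside etcd snapshots
kubectl get crds,clusterroles,clusterrolebindings,storageclasses -o yaml \
  > "${OUTPUT_DIR}/cluster-scoped.yaml"

# Namespaced resources, one file per namespace
for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
  kubectl get deployments,statefulsets,services,configmaps,ingresses \
    -n "${ns}" -o yaml > "${OUTPUT_DIR}/${ns}.yaml"
done
```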
Application data includes all the information that your applications need to function properly:

- **Persistent Volumes (PVs)**: Storage resources provisioned by the cluster
- **PersistentVolumeClaims (PVCs)**: Storage requests by applications
- **ConfigMaps and Secrets**: Application configuration and sensitive data
- **StatefulSet Data**: Ordered deployment and scaling of pods with persistent storage
- **Database Contents**: Backups of databases running in the cluster
- **Application-specific Files**: Custom data files that applications might generate
- **In-memory Data**: Consider how to handle data that exists only in memory

Don't forget the infrastructure components that support your Kubernetes environment:
- **Cloud Provider Resources**: Load balancers, security groups, and other cloud resources
- **Network Configuration**: DNS settings, ingress controllers, and network policies
- **Storage Infrastructure**: Storage classes, provisioners, and external storage systems
- **Authentication Systems**: External identity providers and authentication mechanisms
- **Monitoring and Logging Systems**: Metrics, alerts, and log aggregation configurations
- **CI/CD Pipelines**: Deployment configurations and automation scripts
- **Custom Scripts and Tools**: Any custom scripts or tools used to manage the cluster

The etcd database is the primary datastore for all Kubernetes objects and cluster state. Regular backups of etcd are crucial:
```bash
# Create a snapshot of etcd
ETCDCTL_API=3 etcdctl --endpoints=https://[ENDPOINT]:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d-%H-%M-%S).db

# Verify the snapshot
ETCDCTL_API=3 etcdctl --write-out=table snapshot status \
  /backup/etcd-snapshot-latest.db
```
Setting up automated snapshots using a CronJob:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: etcd-backup
              image: k8s.gcr.io/etcd:3.5.1-0
              command:
                - /bin/sh
                - -c
                - |
                  ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/server.crt \
                    --key=/etc/kubernetes/pki/etcd/server.key \
                    snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d-%H-%M-%S).db && \
                  echo "Backup completed successfully"
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          restartPolicy: OnFailure
          hostNetwork: true
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
                type: DirectoryOrCreate
            - name: backup
              hostPath:
                path: /var/lib/etcd-backup
                type: DirectoryOrCreate
```
For persistent volumes, you can leverage the VolumeSnapshot API to create point-in-time snapshots:
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snapshot
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass
  source:
    persistentVolumeClaimName: postgres-data-pvc
```
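To recover from such a snapshot, a new PVC can reference it as a `dataSource`. A minimal sketch (the claim name, storage class, and size are illustrative and assume the same CSI driver as above):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
spec:
  storageClassName: csi-hostpath-sc
  dataSource:
    name: postgres-data-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```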
Velero is a powerful open-source tool for backing up and restoring Kubernetes cluster resources and persistent volumes:
```bash
# Install Velero CLI
brew install velero  # macOS
# or
wget https://github.com/vmware-tanzu/velero/releases/download/v1.9.0/velero-v1.9.0-linux-amd64.tar.gz  # Linux

# Install Velero in your cluster with AWS S3 storage
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.5.0 \
  --bucket velero-backup \
  --backup-location-config region=us-west-2 \
  --snapshot-location-config region=us-west-2 \
  --secret-file ./credentials-velero
```
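Once installed, backups and restores can be driven directly from the CLI. A few example commands (the backup and namespace names are illustrative):

```bash
# Back up a single namespace on demand
velero backup create app-backup --include-namespaces my-app

# Inspect backup status and contents
velero backup describe app-backup

# Restore it into the same or another cluster
velero restore create --from-backup app-backup
```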
Kasten K10 is a Kubernetes-native data management platform that provides enterprise operations teams with easy-to-use, scalable, and secure backup/restore, disaster recovery, and application mobility:
```bash
# Add Kasten Helm repository
helm repo add kasten https://charts.kasten.io/

# Install Kasten K10
helm install k10 kasten/k10 --namespace=kasten-io --create-namespace

# Create a backup policy
kubectl apply -f - <<EOF
apiVersion: config.kio.kasten.io/v1alpha1
kind: Policy
metadata:
  name: daily-backup
  namespace: kasten-io
spec:
  frequency: "@daily"
  retention:
    daily: 7
    weekly: 4
    monthly: 12
    yearly: 7
  selector:
    matchExpressions:
      - key: k10.kasten.io/backup
        operator: In
        values:
          - "true"
  actions:
    - action: backup
EOF
```
Stash is a Kubernetes operator that backs up your volumes, databases, and cluster resources in a cloud-native way:
```bash
# Install Stash
kubectl create -f https://github.com/stashed/installer/raw/master/crds/stash-catalog-crds.yaml
helm repo add appscode https://charts.appscode.com/stable/
helm repo update
helm install stash appscode/stash \
  --version v2022.02.22 \
  --namespace kube-system

# Create a backup configuration
kubectl apply -f - <<EOF
apiVersion: stash.appscode.com/v1beta1
kind: BackupConfiguration
metadata:
  name: deployment-backup
  namespace: demo
spec:
  schedule: "*/5 * * * *"
  repository:
    name: deployment-backup
  target:
    ref:
      apiVersion: apps/v1
      kind: Deployment
      name: stash-demo
  retentionPolicy:
    name: keep-last-5
    keepLast: 5
    prune: true
EOF
```
A comprehensive disaster recovery plan should be established before disasters occur. This planning involves:
Proper disaster recovery planning can mean the difference between minutes of downtime and days of service interruption.
- **Risk Assessment**: Identify potential disaster scenarios and their impact
- **Recovery Objectives**: Define Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO)
- **Backup Strategy**: Determine what to back up, how often, and where to store backups
- **Recovery Procedures**: Document step-by-step recovery procedures for different scenarios
- **Testing Plan**: Establish regular testing of backup and recovery procedures
- **Team Responsibilities**: Define roles and responsibilities during a disaster
- **Communication Plan**: Establish communication protocols during recovery operations
- **External Dependencies**: Document dependencies on external services and providers

RPO defines the maximum acceptable amount of data loss, measured in time. Factors to consider:
- **Data Change Rate**: How quickly does your data change?
- **Business Impact**: What is the cost of lost data?
- **Backup Frequency**: How often should backups be performed?
- **Storage Requirements**: How much storage is needed for frequent backups?
- **Network Bandwidth**: Can your network support your backup frequency?

Not all data has the same importance. Classify your data:
- **Critical Data**: Essential for business operations (RPO: minutes to hours)
- **Important Data**: Necessary but not immediately critical (RPO: hours to a day)
- **Archival Data**: Historical data needed for compliance (RPO: days to weeks)

Consider a multi-tiered storage approach:
- **On-site Backups**: For quick recovery from common failures
- **Off-site Backups**: Protection against site-wide disasters
- **Multi-region Backups**: For geographic redundancy
- **Air-gapped Backups**: Protection against ransomware and malicious attacks
- **Backup Encryption**: Ensure data security at rest and in transit

RTO defines how quickly you need to recover after a disaster, and your backup schedules should be designed with it in mind. For example, a daily Velero schedule for an application namespace:
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: app-daily-backup
  namespace: velero
spec:
  schedule: "0 1 * * *"  # Daily at 1 AM
  template:
    includedNamespaces:
      - app-namespace
    includedResources:
      - deployments
      - statefulsets
      - services
      - configmaps
      - secrets
    storageLocation: default
    volumeSnapshotLocations:
      - default
```
Implement a regular backup schedule for critical components:
- **Cluster State Backups**: Daily etcd snapshots
- **Application State**: Hourly or daily, based on criticality
- **PV Data**: Volume snapshots aligned with application backups
- **Configuration Backups**: After any significant changes

Velero schedule example for namespace backups:
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-namespace-backup
spec:
  schedule: "0 0 * * *"  # Every day at midnight
  template:
    includedNamespaces:
      - production
      - database
      - middleware
```
Setting up backup rotation to manage storage efficiently:
```bash
# Velero backup retention
velero backup create daily-backup-$(date +%Y-%m-%d) \
  --include-namespaces=production \
  --ttl 720h   # 30 days retention
```

```bash
#!/bin/bash
# Cleanup script example
# Delete backups older than 7 days but keep weekly backups
for backup in $(velero backup get | grep daily | awk '{print $1}'); do
  creation_date=$(velero backup get "$backup" -o json | jq -r '.metadata.creationTimestamp')
  # Complex logic to determine retention based on age and day of week
done
```
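One possible way to flesh out that retention logic is sketched below: it keeps any backup created on a Sunday and deletes the rest once they are older than seven days. The cutoff and the choice of Sunday as the "weekly" keeper are assumptions; adjust them to your own policy.

```bash
#!/bin/bash
# Illustrative retention sketch: keep Sunday backups, delete other daily
# backups older than 7 days. Requires GNU date and jq.
CUTOFF=$(date -d "7 days ago" +%s)

for backup in $(velero backup get | grep daily | awk '{print $1}'); do
  created=$(velero backup get "$backup" -o json | jq -r '.metadata.creationTimestamp')
  created_epoch=$(date -d "$created" +%s)
  day_of_week=$(date -d "$created" +%u)   # 7 = Sunday

  if [ "$created_epoch" -lt "$CUTOFF" ] && [ "$day_of_week" -ne 7 ]; then
    echo "Deleting $backup (created $created)"
    velero backup delete "$backup" --confirm
  fi
done
```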
Before initiating recovery, take these preparatory steps:
1. **Assessment**: Evaluate the extent and nature of the disaster
2. **Team Assembly**: Gather the recovery team as defined in your DR plan
3. **Communication**: Notify stakeholders of the incident and expected recovery time
4. **Resource Allocation**: Ensure necessary resources are available for recovery
5. **Backup Verification**: Verify the integrity of the backup to be used
6. **Environment Preparation**: Prepare the target environment for restoration
7. **Recovery Plan Review**: Review the specific recovery procedures to be used

Execute the recovery process following these steps:
1. **Infrastructure Restoration**: Set up the necessary infrastructure components
2. **etcd Restoration**: Restore the etcd database if needed
3. **Cluster State Recovery**: Apply the backed-up Kubernetes objects
4. **Volume Restoration**: Restore persistent volumes
5. **Application Deployment**: Deploy applications in the correct order
6. **Configuration Application**: Apply ConfigMaps and Secrets
7. **Connectivity Verification**: Ensure all components can communicate properly
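For the etcd restoration step, the snapshot created earlier can be restored with etcdctl. A minimal sketch (the data directory and snapshot path are illustrative; on kubeadm clusters the etcd static pod manifest must then be pointed at the restored directory):

```bash
# Restore the snapshot into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-latest.db \
  --data-dir /var/lib/etcd-restored

# Point etcd at the restored data directory, e.g. by updating the hostPath
# volume in /etc/kubernetes/manifests/etcd.yaml, then let the static pod restart.
```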
After the recovery is complete:

1. **Validation**: Verify application functionality and data integrity
2. **Performance Check**: Ensure the restored system performs adequately
3. **Security Verification**: Confirm security controls are properly restored
4. **Documentation**: Document the recovery process, including any issues encountered
5. **Root Cause Analysis**: Identify the cause of the disaster to prevent recurrence
6. **Improvement Planning**: Identify areas for improvement in the recovery process
7. **Stakeholder Communication**: Inform stakeholders of recovery completion

Regularly test your backups to ensure they can be successfully restored:
- **Scheduled Testing**: Implement regular backup restoration tests
- **Integrity Checks**: Verify backup data integrity
- **Restoration Time Measurement**: Track how long restores take to validate RTO
- **Application Functionality**: Test restored applications for correct functionality
- **Data Consistency**: Verify data consistency after restoration
```bash
# Velero backup validation
velero backup describe my-backup
velero restore create --from-backup my-backup --namespace-mappings source-ns:test-ns
```

```bash
#!/bin/bash
# Script to check restoration status
restore_name="test-restore-$(date +%Y%m%d)"
velero restore create "$restore_name" --from-backup daily-backup --namespace-mappings production:validation

# Wait for restore to complete
while [[ $(velero restore get "$restore_name" -o json | jq -r '.status.phase') != "Completed" ]]; do
  echo "Waiting for restore to complete..."
  sleep 30
done

# Validate restoration
kubectl get pods -n validation
# Run application-specific validation tests
```
Conduct regular disaster recovery tests:
- **Tabletop Exercises**: Walk through recovery procedures without actual execution
- **Functional Tests**: Test specific recovery components in isolation
- **Full Recovery Tests**: Periodically perform complete recovery exercises
- **Chaos Engineering**: Introduce controlled failures to test recovery procedures

Implement a regular testing schedule:
- **Weekly**: Automated backup verification tests
- **Monthly**: Functional recovery of critical components
- **Quarterly**: Full disaster recovery exercises
- **Annually**: Major disaster simulation involving all teams

Include various test scenarios in your testing plan:
- **Node Failure**: Simulate the failure of one or more nodes
- **Storage Failure**: Test recovery from storage system failures
- **Network Partition**: Simulate network segmentation issues
- **Data Corruption**: Test recovery from corrupted data
- **Full Cluster Loss**: Practice recovery from complete cluster failure
- **Multi-region Failover**: Test geographic failover procedures
- **Security Incident Recovery**: Practice recovery from security breaches
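A simple way to rehearse the node-failure scenario without touching the underlying infrastructure is to cordon and drain a node and watch workloads reschedule. The node name below is a placeholder:

```bash
# Simulate a node failure by evicting its workloads
kubectl cordon worker-node-1
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data

# Observe pods rescheduling onto the remaining nodes
kubectl get pods -A -o wide --watch

# Return the node to service afterwards
kubectl uncordon worker-node-1
```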
For geographic redundancy and recovery:

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws-us-east-1
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero-backup
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
```
Recovering in another region from a replicated backup:
```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: cross-region-recovery
  namespace: velero
spec:
  backupName: production-backup
  includedNamespaces:
    - production
  excludedNamespaces:
    - kube-system
    - velero
    - monitoring
```
Restoring workloads across clusters:
```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: cluster-migration
  namespace: velero
spec:
  backupName: source-cluster-backup
  includedNamespaces:
    - production
  excludedNamespaces:
    - kube-system
    - velero
    - monitoring
  namespaceMapping:
    production: production-dr
```
- **3-2-1 Backup Strategy**: Maintain at least 3 copies of data, on 2 different storage types, with 1 copy off-site
- **Immutable Backups**: Implement write-once-read-many (WORM) backup storage
- **Encrypted Backups**: Always encrypt backup data at rest and in transit
- **Regular Testing**: Test recovery processes regularly
- **Automated Validation**: Implement automated backup validation
- **Documentation**: Maintain up-to-date recovery documentation
- **Team Training**: Ensure team members are trained on recovery procedures
- **Backup Access Control**: Implement strict access controls for backup data
- **Backup Monitoring**: Set up alerting for backup failures
- **Recovery Simulation**: Conduct periodic recovery simulations
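As one way to get immutable (WORM) backup storage on AWS, S3 Object Lock can enforce a retention period on the backup bucket. A hedged sketch (the bucket name, region, and retention period are placeholders, and Object Lock must be enabled when the bucket is created):

```bash
# Create the bucket with Object Lock enabled (required at creation time)
aws s3api create-bucket \
  --bucket velero-backup-immutable \
  --region us-west-2 \
  --create-bucket-configuration LocationConstraint=us-west-2 \
  --object-lock-enabled-for-bucket

# Enforce a default 30-day compliance retention on new objects
aws s3api put-object-lock-configuration \
  --bucket velero-backup-immutable \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}}
  }'
```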
Detailed etcd backup script with health verification:

```bash
#!/bin/bash
# Comprehensive etcd backup script with validation

# Set variables
BACKUP_DIR="/var/etcd-backup"
BACKUP_COUNT=7   # Keep last 7 backups
DATE=$(date +%Y-%m-%d-%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/etcd-snapshot-${DATE}.db"
LOG_FILE="${BACKUP_DIR}/backup-${DATE}.log"

# Ensure backup directory exists
mkdir -p "${BACKUP_DIR}"

# Log function
log() {
  echo "[$(date +%Y-%m-%d-%H:%M:%S)] $1" | tee -a "${LOG_FILE}"
}

log "Starting etcd backup process"

# Check etcd health before backup
log "Checking etcd health"
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
if [ $? -ne 0 ]; then
  log "ERROR: etcd is not healthy, aborting backup"
  exit 1
fi

# Create snapshot
log "Creating etcd snapshot: ${BACKUP_FILE}"
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save "${BACKUP_FILE}"
if [ $? -ne 0 ]; then
  log "ERROR: Failed to create etcd snapshot"
  exit 1
fi

# Verify snapshot
log "Verifying snapshot"
ETCDCTL_API=3 etcdctl --write-out=table snapshot status "${BACKUP_FILE}"
if [ $? -ne 0 ]; then
  log "WARNING: Snapshot verification failed"
else
  log "Snapshot verification successful"
fi

# Encrypt backup
log "Encrypting backup"
openssl enc -aes-256-cbc -salt -in "${BACKUP_FILE}" -out "${BACKUP_FILE}.enc" -k "${ENCRYPTION_KEY}"
if [ $? -eq 0 ]; then
  log "Backup encrypted successfully"
  # Remove unencrypted backup
  rm "${BACKUP_FILE}"
else
  log "WARNING: Encryption failed, keeping unencrypted backup"
fi

# Cleanup old backups
log "Cleaning up old backups"
ls -1tr "${BACKUP_DIR}"/etcd-snapshot-*.db* | head -n -${BACKUP_COUNT} | xargs -r rm
ls -1tr "${BACKUP_DIR}"/backup-*.log | head -n -${BACKUP_COUNT} | xargs -r rm

log "Backup process completed"

# Copy to remote storage (optional)
if [ -n "${REMOTE_STORAGE}" ]; then
  log "Copying backup to remote storage"
  # Add your preferred remote copy method here (aws s3 cp, gsutil cp, etc.)
fi
```
Setting up Velero with multi-location backup and retention policies:
```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws-primary
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero-backup
  config:
    region: us-west-2
    profile: "default"
```
Configure multiple storage locations:
```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws-secondary
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero-backup-dr
    prefix: cluster-1
  config:
    region: us-east-1
    profile: "dr-profile"
```
Configure a comprehensive backup schedule with hooks:
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: comprehensive-backup
  namespace: velero
spec:
  schedule: "0 1 * * *"  # Daily at 1 AM
  template:
    includedNamespaces:
      - default
    excludedNamespaces:
      - kube-system
    includedResources:
      - "*"
    excludedResources:
      - "nodes"
      - "events"
    labelSelector:
      matchExpressions:
        - key: backup
          operator: In
          values:
            - "true"
    snapshotVolumes: true
    storageLocation: aws-primary
    volumeSnapshotLocations:
      - aws-primary
    hooks:
      resources:
        - name: database-backup-hooks
          includedNamespaces:
            - database
          excludedNamespaces:
            - kube-system
          labelSelector:
            matchLabels:
              backup-hook: "true"
          pre:
            - exec:
                container: database
                command:
                  - /bin/bash
                  - -c
                  - "pg_dump -U postgres -d mydb > /backup/db.sql"
                onError: Fail
                timeout: 5m
          post:
            - exec:
                container: database
                command:
                  - /bin/bash
                  - -c
                  - "echo 'Backup completed' >> /backup/log.txt"
                onError: Continue
                timeout: 1m
```
Implementing RTO-driven recovery plans:
- **Critical Services (RTO < 1 hour)**:
  - Deploy automation for immediate restoration
  - Maintain warm standby environments
  - Implement automated health checks and failover
  - Configure auto-scaling for rapid capacity restoration
- **Essential Services (RTO < 4 hours)**:
  - Prepare semi-automated recovery scripts
  - Document manual intervention steps
  - Ensure backup availability in multiple regions
  - Test recovery procedures quarterly
- **Non-Critical Services (RTO < 24 hours)**:
  - Document manual recovery procedures
  - Include in batch recovery plans
  - Test recovery procedures bi-annually
```bash
#!/bin/bash
# Example automation script for critical service recovery
# send_notification / send_alert are assumed to be defined elsewhere
# (e.g. wrappers around Slack or PagerDuty webhooks)

# Set environment variables
export KUBECONFIG=/path/to/dr-cluster-kubeconfig

# Restore critical namespaces first
velero restore create critical-restore \
  --from-backup latest-backup \
  --include-namespaces critical-services \
  --wait

# Check restoration status
if [ $? -eq 0 ]; then
  echo "Critical services restored successfully"
  # Verify critical pods are running
  RUNNING_PODS=$(kubectl get pods -n critical-services | grep Running | wc -l)
  EXPECTED_PODS=10   # Adjust based on your environment
  if [ "$RUNNING_PODS" -ge "$EXPECTED_PODS" ]; then
    echo "Critical service verification passed"
    # Notify success
    send_notification "Critical services restored successfully"
  else
    echo "Critical service verification failed"
    # Trigger manual intervention
    send_alert "Critical service restoration needs manual verification"
  fi
else
  echo "Critical service restoration failed"
  # Trigger manual intervention
  send_alert "Critical service restoration failed, immediate action required"
fi
```
Tiered backup scheduling based on RPO requirements:
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: critical-data-backup
  namespace: velero
spec:
  schedule: "*/15 * * * *"  # Every 15 minutes
  template:
    includedNamespaces:
      - financial-data
      - customer-records
    labelSelector:
      matchLabels:
        criticality: "highest"
```
For data with less stringent RPO:
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: standard-data-backup
  namespace: velero
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  template:
    includedNamespaces:
      - marketing
      - analytics
    excludedNamespaces:
      - testing
      - development
      - staging
```
Setting up cross-region replication for disaster recovery:
```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: multi-region
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: primary-backup-bucket
    prefix: cluster-main
  config:
    region: us-west-2
    s3ForcePathStyle: "true"
    s3Url: "https://s3.us-west-2.amazonaws.com"
  # Note: a replication block like the one below is not part of the upstream
  # BackupStorageLocation spec; cross-region copies are typically handled by
  # S3 bucket replication or by defining additional BackupStorageLocations.
  replication:
    - region: us-east-1
      bucket: dr-backup-bucket
    - region: eu-west-1
      bucket: eu-backup-bucket
```
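If you rely on S3-level replication instead, the backup bucket can be configured directly with the AWS CLI. A hedged sketch (the role ARN, bucket names, and rule ID are placeholders, and both buckets need versioning enabled):

```bash
# Enable versioning on both buckets (required for replication)
aws s3api put-bucket-versioning --bucket primary-backup-bucket \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket dr-backup-bucket \
  --versioning-configuration Status=Enabled

# Replicate new backup objects to the DR bucket
aws s3api put-bucket-replication --bucket primary-backup-bucket \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [{
      "ID": "backups-to-dr",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {"Prefix": "cluster-main/"},
      "DeleteMarkerReplication": {"Status": "Disabled"},
      "Destination": {"Bucket": "arn:aws:s3:::dr-backup-bucket"}
    }]
  }'
```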
Implementing a custom backup job for critical application state:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: app-state-backup
spec:
  schedule: "0 2 * * *"  # Example schedule; adjust to your RPO
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup-container
              image: custom-backup-tool:v1.2
              command: ["/scripts/backup.sh"]
              env:
                - name: BACKUP_DESTINATION
                  value: "s3://app-state-backups"
                - name: APP_NAMESPACE
                  value: "production"
                - name: ENCRYPTION_KEY
                  value: "base64:a2V5MjAyMw=="
          restartPolicy: OnFailure
```
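Hard-coding the encryption key in the manifest is risky; in practice it would usually come from a Secret. A small sketch of the same env entry using a `secretKeyRef` (the Secret name and key below are assumptions):

```yaml
env:
  - name: ENCRYPTION_KEY
    valueFrom:
      secretKeyRef:
        name: backup-encryption   # hypothetical Secret holding the key
        key: encryption-key
```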
Example of PostgreSQL operator with built-in backup and recovery:
```yaml
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: acid-postgresql-cluster
spec:
  teamId: "database"
  volume:
    size: 10Gi
  numberOfInstances: 3
  users:
    zalando:   # database owner
      - superuser
      - createdb
  databases:
    foo: zalando
  postgresql:
    version: "14"
    parameters:
      shared_buffers: "32MB"
      max_connections: "100"
      archive_mode: "on"
      archive_command: "wal-g wal-push %p"
      archive_timeout: "60"
  backup:
    schedule: "0 3 * * *"
    retention: 10
    storageConfiguration:
      s3:
        bucket: "postgres-backups"
        region: "eu-central-1"
        aws_access_key_id: "AKIAIOSFODNN7EXAMPLE"
        aws_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
```
Creating a Python-based validation framework for testing recovery:
```python
#!/usr/bin/env python3
# recovery_validator.py - Automated recovery validation framework

import subprocess
import logging
import json
import time
from datetime import datetime


class RecoveryValidator:
    def __init__(self, namespace, backup_name):
        self.namespace = namespace
        self.backup_name = backup_name
        self._setup_logging()

    def _setup_logging(self):
        """Configure logging for the validator"""
        self.logger = logging.getLogger("recovery-validator")
        self.logger.setLevel(logging.INFO)
        handler = logging.FileHandler(f"recovery-test-{datetime.now().strftime('%Y%m%d-%H%M%S')}.log")
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        self.logger.addHandler(handler)

    def validate_recovery(self):
        """Run the recovery validation process"""
        self.logger.info(f"Starting recovery validation for backup {self.backup_name} in namespace {self.namespace}")

        # Step 1: Create a test namespace
        test_namespace = f"recovery-test-{int(time.time())}"
        self.logger.info(f"Creating test namespace {test_namespace}")
        subprocess.run(["kubectl", "create", "namespace", test_namespace])

        # Step 2: Restore from backup to test namespace
        self.logger.info(f"Restoring backup {self.backup_name} to namespace {test_namespace}")
        restore_name = f"test-restore-{int(time.time())}"
        subprocess.run([
            "velero", "restore", "create", restore_name,
            "--from-backup", self.backup_name,
            "--namespace-mappings", f"{self.namespace}:{test_namespace}"
        ])

        # Step 3: Wait for restore to complete
        self.logger.info("Waiting for restore to complete...")
        while True:
            result = subprocess.run(
                ["velero", "restore", "get", restore_name, "-o", "json"],
                capture_output=True, text=True
            )
            restore_info = json.loads(result.stdout)
            phase = restore_info.get("status", {}).get("phase", "")
            if phase == "Completed":
                self.logger.info("Restore completed successfully")
                break
            elif phase in ["PartiallyFailed", "Failed"]:
                self.logger.error(f"Restore failed with phase: {phase}")
                break
            self.logger.info(f"Restore in progress, current phase: {phase}")
            time.sleep(30)

        # Step 4: Validate application functionality
        self.logger.info("Validating application functionality...")
        # Add your application-specific validation logic here
        # For example, checking if pods are running, services are responding, etc.
        pod_status = subprocess.run(
            ["kubectl", "get", "pods", "-n", test_namespace],
            capture_output=True, text=True
        )
        self.logger.info(f"Pod status in test namespace:\n{pod_status.stdout}")

        # Step 5: Cleanup
        self.logger.info(f"Cleaning up test namespace {test_namespace}")
        subprocess.run(["kubectl", "delete", "namespace", test_namespace])

        self.logger.info("Recovery validation completed")
        return True
```
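Invoking the validator is then a one-liner; the namespace and backup name below are illustrative:

```python
if __name__ == "__main__":
    # Validate the latest production backup in a throwaway namespace
    validator = RecoveryValidator(namespace="production", backup_name="daily-backup")
    validator.validate_recovery()
```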
Regular testing ensures your recovery procedures work when needed:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: recovery-test
spec:
  schedule: "0 0 * * 0"  # Weekly on Sunday
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: recovery-test
              image: recovery-test:v1
              env:
                - name: TEST_NAMESPACE
                  value: recovery-test
                - name: BACKUP_NAME
                  value: latest
                - name: SLACK_WEBHOOK
                  valueFrom:
                    secretKeyRef:
                      name: notifications
                      key: slack-webhook
              volumeMounts:
                - name: kubeconfig
                  mountPath: /root/.kube
                  readOnly: true
          volumes:
            - name: kubeconfig
              secret:
                secretName: recovery-test-kubeconfig
          restartPolicy: OnFailure
```
Setting up Prometheus alerts for backup status:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-monitoring
spec:
  groups:
    - name: backup.rules
      rules:
        - alert: BackupFailed
          expr: |
            velero_backup_failure_total > 0
          for: 1h
          labels:
            severity: critical
            team: sre
          annotations:
            description: "Backup {{ $labels.backup }} has failed"
            runbook_url: "https://wiki.example.com/backup-failure"
        - alert: BackupTooOld
          expr: |
            time() - max(velero_backup_last_successful_timestamp) > 86400
          for: 1h
          labels:
            severity: warning
            team: sre
          annotations:
            description: "No successful backup in the last 24 hours"
            runbook_url: "https://wiki.example.com/backup-age"
        - alert: RestoreFailed
          expr: |
            velero_restore_failure_total > 0
          for: 10m
          labels:
            severity: critical
            team: sre
          annotations:
            description: "Restore {{ $labels.restore }} has failed"
            runbook_url: "https://wiki.example.com/restore-failure"
```
Creating a Grafana dashboard for backup visualization:
```yaml
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: backup-dashboard
spec:
  json: |
    {
      "title": "Backup Status Dashboard",
      "panels": [
        {
          "title": "Backup Success Rate",
          "type": "graph",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "sum(rate(velero_backup_success_total[24h])) / sum(rate(velero_backup_attempt_total[24h]))",
              "legendFormat": "Success Rate"
            }
          ]
        },
        {
          "title": "Backup Duration",
          "type": "graph",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "velero_backup_duration_seconds",
              "legendFormat": "{{backup}}"
            }
          ]
        },
        {
          "title": "Backup Frequency",
          "type": "stat",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "time() - max(velero_backup_last_successful_timestamp)",
              "legendFormat": "Time Since Last Backup"
            }
          ]
        }
      ]
    }
```
# Disaster Recovery Runbook
## Prerequisites
- Access to backup storage
- Kubernetes cluster admin credentials
- DNS management access
- Cloud provider access
## Recovery Steps
1. **Initial Assessment**

   ```bash
   # Check backup status
   velero backup get

   # Verify latest backup
   velero backup describe <backup-name>
   ```

2. **Infrastructure Setup**

   ```bash
   # Create new cluster if needed
   terraform apply -var-file=dr.tfvars

   # Configure kubectl
   aws eks update-kubeconfig --name dr-cluster
   ```

3. **Data Restoration**

   ```bash
   # Restore from backup
   velero restore create --from-backup <backup-name>

   # Monitor restoration
   velero restore logs <restore-name>
   ```

4. **Validation**

   ```bash
   # Run validation script
   ./validate-recovery.py

   # Check critical services
   kubectl get pods -n critical-ns
   ```

5. **Traffic Cutover**

   ```bash
   # Update DNS
   aws route53 change-resource-record-sets \
     --hosted-zone-id <zone-id> \
     --change-batch file://dns-update.json
   ```

## Troubleshooting

- **Backup restoration fails**
  - Check storage permissions
  - Verify backup integrity
  - Check resource quotas
- **Services not starting**
  - Check pod logs
  - Verify config maps and secrets
  - Check network policies

## Escalation Contacts

- Primary: @oncall-sre
- Secondary: @backup-team
- Escalation: @cto

## Post-Recovery Tasks

- Document incident
- Update runbook if needed
- Schedule post-mortem
- Review backup strategy
## Conclusion
A comprehensive backup and disaster recovery strategy is essential for maintaining business continuity in Kubernetes environments. By following the practices outlined in this guide, organizations can ensure they're prepared to recover quickly from various failure scenarios, minimizing downtime and data loss.
::alert{type="success"}
Remember to regularly test your backup and recovery procedures through automated and manual processes. Documentation and team training are crucial components of a successful DR strategy.
::
::alert{type="warning"}
The most effective disaster recovery plan is the one that has been thoroughly tested before a real disaster occurs. Never assume your backups will work without validation.
::