Welcome to from-docker-to-kubernetes

Backup and Disaster Recovery

Implementing comprehensive backup solutions and disaster recovery strategies for Kubernetes clusters

Introduction to Kubernetes Backup and DR

Kubernetes backup and disaster recovery (DR) are critical components of any production environment. As organizations increasingly rely on Kubernetes for mission-critical applications, having robust backup and recovery mechanisms becomes essential to ensure business continuity in the face of unexpected failures, data corruption, or cluster outages.

graph TD
    A[Kubernetes Backup & DR] --> B[Cluster State Backup]
    A --> C[Application Data Backup]
    A --> D[DR Planning]
    A --> E[Testing & Validation]
    
    B --> B1[etcd Backup]
    B --> B2[Resource Definitions]
    B --> B3[Certificates & Auth]
    
    C --> C1[PV Snapshots]
    C --> C2[Database Backups]
    C --> C3[External Storage]
    
    D --> D1[RTO Planning]
    D --> D2[RPO Planning]
    D --> D3[Runbook Creation]
    
    E --> E1[Scheduled Tests]
    E --> E2[Simulated Failures]
    E --> E3[Process Validation]

Core Components to Backup

Cluster State

The Kubernetes cluster state comprises several critical components that need to be included in your backup strategy:

  1. etcd Database: The central datastore for all cluster configuration and state
  2. API Server Configuration: Including certificates, authentication, and authorization settings
  3. Controller Manager Settings: Configuration for core controllers
  4. Scheduler Configuration: Policies and configurations that determine pod scheduling
  5. Custom Resource Definitions (CRDs): Extensions to the Kubernetes API
  6. RBAC Configuration: Role-based access control policies and bindings
  7. Namespace Configurations: Resource quotas, network policies, and other namespace-specific settings

Application Data

Application data includes all the information that your applications need to function properly:

  1. Persistent Volumes (PVs): Storage resources provisioned by the cluster
  2. PersistentVolumeClaims (PVCs): Storage requests by applications
  3. ConfigMaps and Secrets: Application configuration and sensitive data
  4. StatefulSet Data: Ordered deployment and scaling of pods with persistent storage
  5. Database Contents: Backup of databases running in the cluster
  6. Application-specific Files: Custom data files that applications might generate
  7. In-memory Data: Consider how to handle data that exists only in memory

Supporting Infrastructure

Don't forget the infrastructure components that support your Kubernetes environment:

  1. Cloud Provider Resources: Load balancers, security groups, and other cloud resources
  2. Network Configuration: DNS settings, ingress controllers, and network policies
  3. Storage Infrastructure: Storage classes, provisioners, and external storage systems
  4. Authentication Systems: External identity providers and authentication mechanisms
  5. Monitoring and Logging Systems: Metrics, alerts, and log aggregation configurations
  6. CI/CD Pipelines: Deployment configurations and automation scripts
  7. Custom Scripts and Tools: Any custom scripts or tools used to manage the cluster

Backup Strategies

etcd Backup

The etcd database is the primary datastore for all Kubernetes objects and cluster state. Regular backups of etcd are crucial:

# Create a snapshot of etcd
ETCDCTL_API=3 etcdctl --endpoints=https://[ENDPOINT]:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d-%H-%M-%S).db

# Verify the snapshot
ETCDCTL_API=3 etcdctl --write-out=table snapshot status \
  /backup/etcd-snapshot-latest.db

Setting up automated snapshots using a CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: etcd-backup
            image: k8s.gcr.io/etcd:3.5.1-0
            command:
            - /bin/sh
            - -c
            - |
              ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key \
                snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d-%H-%M-%S).db && \
              echo "Backup completed successfully"
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          hostNetwork: true
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
              type: DirectoryOrCreate
          - name: backup
            hostPath:
              path: /var/lib/etcd-backup
              type: DirectoryOrCreate

Persistent Volume Backup

For persistent volumes, you can leverage the VolumeSnapshot API to create point-in-time snapshots:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snapshot
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass
  source:
    persistentVolumeClaimName: postgres-data-pvc

Backup Tools and Solutions

Velero

Velero is a powerful open-source tool for backing up and restoring Kubernetes cluster resources and persistent volumes:

# Install Velero CLI
brew install velero  # macOS
# or
wget https://github.com/vmware-tanzu/velero/releases/download/v1.9.0/velero-v1.9.0-linux-amd64.tar.gz  # Linux

# Install Velero in your cluster with AWS S3 storage
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.5.0 \
  --bucket velero-backup \
  --backup-location-config region=us-west-2 \
  --snapshot-location-config region=us-west-2 \
  --secret-file ./credentials-velero

Kasten K10

Kasten K10 is a Kubernetes-native data management platform that provides enterprise operations teams with easy-to-use, scalable, and secure backup/restore, disaster recovery, and application mobility:

# Add Kasten Helm repository
helm repo add kasten https://charts.kasten.io/

# Install Kasten K10
helm install k10 kasten/k10 --namespace=kasten-io --create-namespace

# Create a backup policy
kubectl apply -f - <<EOF
apiVersion: config.kio.kasten.io/v1alpha1
kind: Policy
metadata:
  name: daily-backup
  namespace: kasten-io
spec:
  frequency: "@daily"
  retention:
    daily: 7
    weekly: 4
    monthly: 12
    yearly: 7
  selector:
    matchExpressions:
      - key: k10.kasten.io/backup
        operator: In
        values:
          - "true"
  actions:
    - action: backup
EOF

Stash

Stash is a Kubernetes operator that takes backup of your volumes, databases and clusters in cloud native way:

# Install Stash
kubectl create -f https://github.com/stashed/installer/raw/master/crds/stash-catalog-crds.yaml
helm repo add appscode https://charts.appscode.com/stable/
helm repo update
helm install stash appscode/stash \
  --version v2022.02.22 \
  --namespace kube-system

# Create a backup configuration
kubectl apply -f - <<EOF
apiVersion: stash.appscode.com/v1beta1
kind: BackupConfiguration
metadata:
  name: deployment-backup
  namespace: demo
spec:
  schedule: "*/5 * * * *"
  repository:
    name: deployment-backup
  target:
    ref:
      apiVersion: apps/v1
      kind: Deployment
      name: stash-demo
  retentionPolicy:
    name: keep-last-5
    keepLast: 5
    prune: true
EOF

Disaster Recovery Planning

A comprehensive disaster recovery plan should be established before disasters occur. This planning involves:

  1. Risk Assessment: Identify potential disaster scenarios and their impact
  2. Recovery Objectives: Define Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO)
  3. Backup Strategy: Determine what to back up, how often, and where to store backups
  4. Recovery Procedures: Document step-by-step recovery procedures for different scenarios
  5. Testing Plan: Establish regular testing of backup and recovery procedures
  6. Team Responsibilities: Define roles and responsibilities during a disaster
  7. Communication Plan: Establish communication protocols during recovery operations
  8. External Dependencies: Document dependencies on external services and providers

Recovery Point Objective (RPO)

RPO defines the maximum acceptable amount of data loss measured in time. Factors to consider:

  1. Data Change Rate: How quickly does your data change?
  2. Business Impact: What is the cost of lost data?
  3. Backup Frequency: How often should backups be performed?
  4. Storage Requirements: How much storage is needed for frequent backups?
  5. Network Bandwidth: Can your network support your backup frequency?

Data Criticality Assessment

Not all data has the same importance. Classify your data:

  1. Critical Data: Essential for business operations (RPO: minutes to hours)
  2. Important Data: Necessary but not immediately critical (RPO: hours to a day)
  3. Archival Data: Historical data needed for compliance (RPO: days to weeks)

Backup Storage Strategy

Consider a multi-tiered storage approach:

  1. On-site Backups: For quick recovery from common failures
  2. Off-site Backups: Protection against site-wide disasters
  3. Multi-region Backups: For geographic redundancy
  4. Air-gapped Backups: Protection against ransomware and malicious attacks
  5. Backup Encryption: Ensure data security at rest and in transit

Recovery Time Objective (RTO)

RTO defines how quickly you need to recover after a disaster:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: app-daily-backup
  namespace: velero
spec:
  schedule: "0 1 * * *"  # Daily at 1 AM
  template:
    includedNamespaces:
    - app-namespace
    includedResources:
    - deployments
    - statefulsets
    - services
    - configmaps
    - secrets
    storageLocation: default
    volumeSnapshotLocations:
    - default

Implementation Guide

Setting Up Automated Backups

Implement a regular backup schedule for critical components:

  1. Cluster State Backups: Daily etcd snapshots
  2. Application State: Hourly or daily based on criticality
  3. PV Data: Volume snapshots aligned with application backup
  4. Configuration Backups: After any significant changes

Velero schedule example for namespace backup:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-namespace-backup
spec:
  schedule: "0 0 * * *"  # Every day at midnight
  template:
    includedNamespaces:
    - production
    - database
    - middleware

Setting up backup rotation to manage storage efficiently:

# Velero backup retention
velero backup create daily-backup-$(date +%Y-%m-%d) \
  --include-namespaces=production \
  --ttl 720h  # 30 days retention

# Cleanup script example
#!/bin/bash
# Delete backups older than 7 days but keep weekly backups
for backup in $(velero backup get | grep daily | awk '{print $1}'); do
  creation_date=$(velero backup get $backup -o json | jq -r '.metadata.creationTimestamp')
  # Complex logic to determine retention based on age and day of week
done

Disaster Recovery Procedures

Preparation Phase

Before initiating recovery, take these preparatory steps:

  1. Assessment: Evaluate the extent and nature of the disaster
  2. Team Assembly: Gather the recovery team as defined in your DR plan
  3. Communication: Notify stakeholders of the incident and expected recovery time
  4. Resource Allocation: Ensure necessary resources are available for recovery
  5. Backup Verification: Verify the integrity of the backup to be used
  6. Environment Preparation: Prepare the target environment for restoration
  7. Recovery Plan Review: Review the specific recovery procedures to be used

Recovery Execution

Execute the recovery process following these steps:

  1. Infrastructure Restoration: Set up the necessary infrastructure components
  2. etcd Restoration: Restore the etcd database if needed
  3. Cluster State Recovery: Apply the backed-up Kubernetes objects
  4. Volume Restoration: Restore persistent volumes
  5. Application Deployment: Deploy applications in the correct order
  6. Configuration Application: Apply ConfigMaps and Secrets
  7. Connectivity Verification: Ensure all components can communicate properly

Post-Recovery

After the recovery is complete:

  1. Validation: Verify application functionality and data integrity
  2. Performance Check: Ensure the restored system performs adequately
  3. Security Verification: Confirm security controls are properly restored
  4. Documentation: Document the recovery process, including any issues encountered
  5. Root Cause Analysis: Identify the cause of the disaster to prevent recurrence
  6. Improvement Planning: Identify areas for improvement in the recovery process
  7. Stakeholder Communication: Inform stakeholders of recovery completion

Testing and Validation

Backup Validation

Regularly test your backups to ensure they can be successfully restored:

  1. Scheduled Testing: Implement regular backup restoration tests
  2. Integrity Checks: Verify backup data integrity
  3. Restoration Time Measurement: Track how long restores take to validate RTO
  4. Application Functionality: Test restored applications for correct functionality
  5. Data Consistency: Verify data consistency after restoration
# Velero backup validation
velero backup describe my-backup
velero restore create --from-backup my-backup --namespace-mappings source-ns:test-ns

# Script to check restoration status
#!/bin/bash
restore_name="test-restore-$(date +%Y%m%d)"
velero restore create $restore_name --from-backup daily-backup --namespace-mappings production:validation

# Wait for restore to complete
while [[ $(velero restore get $restore_name -o jsonpath='{.status.phase}') != "Completed" ]]; do
  echo "Waiting for restore to complete..."
  sleep 30
done

# Validate restoration
kubectl get pods -n validation
# Run application-specific validation tests

Recovery Testing

Conduct regular disaster recovery tests:

  1. Tabletop Exercises: Walk through recovery procedures without actual execution
  2. Functional Tests: Test specific recovery components in isolation
  3. Full Recovery Tests: Periodically perform complete recovery exercises
  4. Chaos Engineering: Introduce controlled failures to test recovery procedures

Regular Testing Schedule

Implement a regular testing schedule:

  1. Weekly: Automated backup verification tests
  2. Monthly: Functional recovery of critical components
  3. Quarterly: Full disaster recovery exercises
  4. Annually: Major disaster simulation involving all teams

Test Scenarios

Include various test scenarios in your testing plan:

  1. Node Failure: Simulate the failure of one or more nodes
  2. Storage Failure: Test recovery from storage system failures
  3. Network Partition: Simulate network segmentation issues
  4. Data Corruption: Test recovery from corrupted data
  5. Full Cluster Loss: Practice recovery from complete cluster failure
  6. Multi-region Failover: Test geographic failover procedures
  7. Security Incident Recovery: Practice recovery from security breaches

Advanced Recovery Scenarios

Cross-Region Recovery

For geographic redundancy and recovery:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws-us-east-1
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero-backup
  config:
    region: us-east-1
    s3ForcePathStyle: "true"

Setting up multi-region replicated backups:

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: cross-region-recovery
  namespace: velero
spec:
  backupName: production-backup
  includedNamespaces:
  - production
  excludedNamespaces:
  - kube-system
  - velero
  - monitoring

Multi-Cluster Recovery

Restoring workloads across clusters:

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: cluster-migration
  namespace: velero
spec:
  backupName: source-cluster-backup
  includedNamespaces:
  - production
  excludedNamespaces:
  - kube-system
  - velero
  - monitoring
  namespaceMapping:
    production: production-dr

Best Practices and Recommendations

  1. 3-2-1 Backup Strategy: Maintain at least 3 copies of data, on 2 different storage types, with 1 copy off-site
  2. Immutable Backups: Implement write-once-read-many (WORM) backup storage
  3. Encrypted Backups: Always encrypt backup data at rest and in transit
  4. Regular Testing: Test recovery processes regularly
  5. Automated Validation: Implement automated backup validation
  6. Documentation: Maintain up-to-date recovery documentation
  7. Team Training: Ensure team members are trained on recovery procedures
  8. Backup Access Control: Implement strict access controls for backup data
  9. Backup Monitoring: Set up alerting for backup failures
  10. Recovery Simulation: Conduct periodic recovery simulations

Advanced Backup Strategies

Etcd Backup Implementation

Detailed etcd backup script with health verification:

#!/bin/bash
# Comprehensive etcd backup script with validation

# Set variables
BACKUP_DIR="/var/etcd-backup"
BACKUP_COUNT=7  # Keep last 7 backups
DATE=$(date +%Y-%m-%d-%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/etcd-snapshot-${DATE}.db"
LOG_FILE="${BACKUP_DIR}/backup-${DATE}.log"

# Ensure backup directory exists
mkdir -p ${BACKUP_DIR}

# Log function
log() {
  echo "[$(date +%Y-%m-%d-%H:%M:%S)] $1" | tee -a ${LOG_FILE}
}

log "Starting etcd backup process"

# Check etcd health before backup
log "Checking etcd health"
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

if [ $? -ne 0 ]; then
  log "ERROR: etcd is not healthy, aborting backup"
  exit 1
fi

# Create snapshot
log "Creating etcd snapshot: ${BACKUP_FILE}"
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save ${BACKUP_FILE}

if [ $? -ne 0 ]; then
  log "ERROR: Failed to create etcd snapshot"
  exit 1
fi

# Verify snapshot
log "Verifying snapshot"
ETCDCTL_API=3 etcdctl --write-out=table snapshot status ${BACKUP_FILE}

if [ $? -ne 0 ]; then
  log "WARNING: Snapshot verification failed"
else
  log "Snapshot verification successful"
fi

# Encrypt backup
log "Encrypting backup"
openssl enc -aes-256-cbc -salt -in ${BACKUP_FILE} -out ${BACKUP_FILE}.enc -k ${ENCRYPTION_KEY}

if [ $? -eq 0 ]; then
  log "Backup encrypted successfully"
  # Remove unencrypted backup
  rm ${BACKUP_FILE}
else
  log "WARNING: Encryption failed, keeping unencrypted backup"
fi

# Cleanup old backups
log "Cleaning up old backups"
ls -1tr ${BACKUP_DIR}/etcd-snapshot-*.db* | head -n -${BACKUP_COUNT} | xargs -r rm
ls -1tr ${BACKUP_DIR}/backup-*.log | head -n -${BACKUP_COUNT} | xargs -r rm

log "Backup process completed"

# Copy to remote storage (optional)
if [ -n "${REMOTE_STORAGE}" ]; then
  log "Copying backup to remote storage"
  # Add your preferred remote copy method here (aws s3 cp, gsutil cp, etc.)
fi

Velero Advanced Configuration

Setting up Velero with multi-location backup and retention policies:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws-primary
  namespace: velero
spec:
  provider: aws
  config:
    region: us-west-2
    profile: "default"

Configure multiple storage locations:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws-secondary
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero-backup-dr
    prefix: cluster-1
  config:
    region: us-east-1
    profile: "dr-profile"

Configure a comprehensive backup schedule with hooks:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: comprehensive-backup
  namespace: velero
spec:
  schedule: "0 1 * * *"  # Daily at 1 AM
  template:
    includedNamespaces:
    - default
    excludedNamespaces:
    - kube-system
    includedResources:
    - "*"
    excludedResources:
    - "nodes"
    - "events"
    labelSelector:
      matchExpressions:
      - key: backup
        operator: In
        values:
        - "true"
    snapshotVolumes: true
    storageLocation: aws-primary
    volumeSnapshotLocations:
    - aws-primary
    hooks:
      resources:
        name: database-backup-hooks
        includedNamespaces:
        - database
        excludedNamespaces:
        - kube-system
        labelSelector:
          matchLabels:
            backup-hook: "true"
        pre:
          exec:
            container: database
            command:
            - /bin/bash
            - -c
            - "pg_dump -U postgres -d mydb > /backup/db.sql"
            onError: Fail
            timeout: 5m
        post:
          exec:
            container: database
            command:
            - /bin/bash
            - -c
            - "echo 'Backup completed' >> /backup/log.txt"
            onError: Continue
            timeout: 1m

Disaster Recovery Implementation

Recovery Time Objective (RTO) Planning

Implementing RTO-driven recovery plans:

  1. Critical Services (RTO < 1 hour):
    • Deploy automation for immediate restoration
    • Maintain warm standby environments
    • Implement automated health checks and failover
    • Configure auto-scaling for rapid capacity restoration
  2. Essential Services (RTO < 4 hours):
    • Prepare semi-automated recovery scripts
    • Document manual intervention steps
    • Ensure backup availability in multiple regions
    • Test recovery procedures quarterly
  3. Non-Critical Services (RTO < 24 hours):
    • Document manual recovery procedures
    • Include in batch recovery plans
    • Test recovery procedures bi-annually
# Example automation script for critical service recovery
#!/bin/bash
# Quick recovery script for critical services

# Set environment variables
export KUBECONFIG=/path/to/dr-cluster-kubeconfig

# Restore critical namespaces first
velero restore create critical-restore \
  --from-backup latest-backup \
  --include-namespaces critical-services \
  --wait

# Check restoration status
if [ $? -eq 0 ]; then
  echo "Critical services restored successfully"
  
  # Verify critical pods are running
  RUNNING_PODS=$(kubectl get pods -n critical-services | grep Running | wc -l)
  EXPECTED_PODS=10  # Adjust based on your environment
  
  if [ $RUNNING_PODS -ge $EXPECTED_PODS ]; then
    echo "Critical service verification passed"
    # Notify success
    send_notification "Critical services restored successfully"
  else
    echo "Critical service verification failed"
    # Trigger manual intervention
    send_alert "Critical service restoration needs manual verification"
  fi
else
  echo "Critical service restoration failed"
  # Trigger manual intervention
  send_alert "Critical service restoration failed, immediate action required"
fi

Recovery Point Objective (RPO) Implementation

Tiered backup scheduling based on RPO requirements:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: critical-data-backup
  namespace: velero
spec:
  schedule: "*/15 * * * *"  # Every 15 minutes
  template:
    includedNamespaces:
    - financial-data
    - customer-records
    labelSelector:
      matchLabels:
        criticality: "highest"

For data with less stringent RPO:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: standard-data-backup
  namespace: velero
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  template:
    includedNamespaces:
    - marketing
    - analytics
    excludedNamespaces:
    - testing
    - development
    - staging

Advanced Recovery Scenarios

Multi-Region Recovery

Setting up cross-region replication for disaster recovery:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: multi-region
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: primary-backup-bucket
    prefix: cluster-main
  config:
    region: us-west-2
    s3ForcePathStyle: "true"
    s3Url: "https://s3.us-west-2.amazonaws.com"
    replication:
      - region: us-east-1
        bucket: dr-backup-bucket
      - region: eu-west-1
        bucket: eu-backup-bucket

Implementing a custom backup job for critical application state:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: app-state-backup
spec:
  template:
    spec:
      containers:
      - name: backup-container
        image: custom-backup-tool:v1.2
        command: ["/scripts/backup.sh"]
        env:
        - name: BACKUP_DESTINATION
          value: "s3://app-state-backups"
        - name: APP_NAMESPACE
          value: "production"
        - name: ENCRYPTION_KEY
          value: "base64:a2V5MjAyMw=="
      restartPolicy: OnFailure

Database-Specific Recovery

Example of PostgreSQL operator with built-in backup and recovery:

apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: acid-postgresql-cluster
spec:
  teamId: "database"
  volume:
    size: 10Gi
  numberOfInstances: 3
  users:
    zalando:  # database owner
    - superuser
    - createdb
  databases:
    foo: zalando
  postgresql:
    version: "14"
    parameters:
      shared_buffers: "32MB"
      max_connections: "100"
      archive_mode: "on"
      archive_command: "wal-g wal-push %p"
      archive_timeout: "60"
  backup:
    schedule: "0 3 * * *"
    retention: 10
    storageConfiguration:
      s3:
        bucket: "postgres-backups"
        region: "eu-central-1"
        aws_access_key_id: "AKIAIOSFODNN7EXAMPLE"
        aws_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

Automated Recovery Testing

Recovery Validation Framework

Creating a Python-based validation framework for testing recovery:

#!/usr/bin/env python3
# recovery_validator.py - Automated recovery validation framework

import subprocess
import logging
import json
import time
from datetime import datetime

class RecoveryValidator:
    def __init__(self, namespace, backup_name):
        self.namespace = namespace
        self.backup_name = backup_name
        self._setup_logging()
        
    def _setup_logging(self):
        """Configure logging for the validator"""
        self.logger = logging.getLogger("recovery-validator")
        self.logger.setLevel(logging.INFO)
        
        handler = logging.FileHandler(f"recovery-test-{datetime.now().strftime('%Y%m%d-%H%M%S')}.log")
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        self.logger.addHandler(handler)
        
    def validate_recovery(self):
        """Run the recovery validation process"""
        self.logger.info(f"Starting recovery validation for backup {self.backup_name} in namespace {self.namespace}")
        
        # Step 1: Create a test namespace
        test_namespace = f"recovery-test-{int(time.time())}"
        self.logger.info(f"Creating test namespace {test_namespace}")
        subprocess.run(["kubectl", "create", "namespace", test_namespace])
        
        # Step 2: Restore from backup to test namespace
        self.logger.info(f"Restoring backup {self.backup_name} to namespace {test_namespace}")
        restore_name = f"test-restore-{int(time.time())}"
        subprocess.run([
            "velero", "restore", "create", restore_name,
            "--from-backup", self.backup_name,
            "--namespace-mappings", f"{self.namespace}:{test_namespace}"
        ])
        
        # Step 3: Wait for restore to complete
        self.logger.info("Waiting for restore to complete...")
        while True:
            result = subprocess.run(
                ["velero", "restore", "get", restore_name, "-o", "json"],
                capture_output=True, text=True
            )
            restore_info = json.loads(result.stdout)
            phase = restore_info.get("status", {}).get("phase", "")
            
            if phase == "Completed":
                self.logger.info("Restore completed successfully")
                break
            elif phase in ["PartiallyFailed", "Failed"]:
                self.logger.error(f"Restore failed with phase: {phase}")
                break
            
            self.logger.info(f"Restore in progress, current phase: {phase}")
            time.sleep(30)
        
        # Step 4: Validate application functionality
        self.logger.info(f"Validating application functionality...")
        # Add your application-specific validation logic here
        # For example, checking if pods are running, services are responding, etc.
        pod_status = subprocess.run(
            ["kubectl", "get", "pods", "-n", test_namespace],
            capture_output=True, text=True
        )
        self.logger.info(f"Pod status in test namespace:\n{pod_status.stdout}")
            
        # Step 5: Cleanup
        self.logger.info(f"Cleaning up test namespace {test_namespace}")
        subprocess.run(["kubectl", "delete", "namespace", test_namespace])
            
        self.logger.info("Recovery validation completed")
        return True

Automated Testing Schedule

Regular testing ensures your recovery procedures work when needed:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: recovery-test
spec:
  schedule: "0 0 * * 0"  # Weekly on Sunday
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: recovery-test
            image: recovery-test:v1
            env:
            - name: TEST_NAMESPACE
              value: recovery-test
            - name: BACKUP_NAME
              value: latest
            - name: SLACK_WEBHOOK
              valueFrom:
                secretKeyRef:
                  name: notifications
                  key: slack-webhook
            volumeMounts:
            - name: kubeconfig
              mountPath: /root/.kube
              readOnly: true
          volumes:
          - name: kubeconfig
            secret:
              secretName: recovery-test-kubeconfig
          restartPolicy: OnFailure

Monitoring and Alerting

Backup Status Monitoring

Setting up Prometheus alerts for backup status:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-monitoring
spec:
  groups:
  - name: backup.rules
    rules:
    - alert: BackupFailed
      expr: |
        velero_backup_failure_total > 0
      for: 1h
      labels:
        severity: critical
        team: sre
      annotations:
        description: "Backup {{ $labels.backup }} has failed"
        runbook_url: "https://wiki.example.com/backup-failure"
    
    - alert: BackupTooOld
      expr: |
        time() - max(velero_backup_last_successful_timestamp) > 86400
      for: 1h
      labels:
        severity: warning
        team: sre
      annotations:
        description: "No successful backup in the last 24 hours"
        runbook_url: "https://wiki.example.com/backup-age"
    
    - alert: RestoreFailed
      expr: |
        velero_restore_failure_total > 0
      for: 10m
      labels:
        severity: critical
        team: sre
      annotations:
        description: "Restore {{ $labels.restore }} has failed"
        runbook_url: "https://wiki.example.com/restore-failure"

Creating a Grafana dashboard for backup visualization:

apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: backup-dashboard
spec:
  json: |
    {
      "title": "Backup Status Dashboard",
      "panels": [
        {
          "title": "Backup Success Rate",
          "type": "graph",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "sum(rate(velero_backup_success_total[24h])) / sum(rate(velero_backup_attempt_total[24h]))",
              "legendFormat": "Success Rate"
            }
          ]
        },
        {
          "title": "Backup Duration",
          "type": "graph",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "velero_backup_duration_seconds",
              "legendFormat": "{{backup}}"
            }
          ]
        },
        {
          "title": "Backup Frequency",
          "type": "stat",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "time() - max(velero_backup_last_successful_timestamp)",
              "legendFormat": "Time Since Last Backup"
            }
          ]
        }
      ]
    }

Documentation and Runbooks

Recovery Runbook Template

# Disaster Recovery Runbook

## Prerequisites
- Access to backup storage
- Kubernetes cluster admin credentials
- DNS management access
- Cloud provider access

## Recovery Steps

1. **Initial Assessment**
```bash
# Check backup status
velero backup get
   
# Verify latest backup
velero backup describe <backup-name>
  1. Infrastructure Setup
# Create new cluster if needed
terraform apply -var-file=dr.tfvars
   
# Configure kubectl
aws eks update-kubeconfig --name dr-cluster
  1. Data Restoration
# Restore from backup
velero restore create --from-backup <backup-name>
   
# Monitor restoration
velero restore logs <restore-name>
  1. Validation
# Run validation script
./validate-recovery.py
   
# Check critical services
kubectl get pods -n critical-ns
  1. Traffic Cutover
# Update DNS
aws route53 change-resource-record-sets \
 --hosted-zone-id <zone-id> \
 --change-batch file://dns-update.json

Troubleshooting

Common Issues

  1. Backup restoration fails
    • Check storage permissions
    • Verify backup integrity
    • Check resource quotas
  2. Services not starting
    • Check pod logs
    • Verify config maps and secrets
    • Check network policies

Contacts

  • Primary: @oncall-sre
  • Secondary: @backup-team
  • Escalation: @cto

Post-Recovery Tasks

  1. Document incident
  2. Update runbook if needed
  3. Schedule post-mortem
  4. Review backup strategy

## Conclusion

A comprehensive backup and disaster recovery strategy is essential for maintaining business continuity in Kubernetes environments. By following the practices outlined in this guide, organizations can ensure they're prepared to recover quickly from various failure scenarios, minimizing downtime and data loss.

::alert{type="success"}
Remember to regularly test your backup and recovery procedures through automated and manual processes. Documentation and team training are crucial components of a successful DR strategy.
::

::alert{type="warning"}
The most effective disaster recovery plan is the one that has been thoroughly tested before a real disaster occurs. Never assume your backups will work without validation.
::