StatefulSets & DaemonSets

Managing stateful applications and node-level daemons in Kubernetes

Introduction to StatefulSets

StatefulSets are specialized workload resources in Kubernetes designed for applications that require stable networking, persistent storage, and ordered, graceful deployment and scaling. Unlike Deployments, which treat pods as interchangeable, StatefulSets maintain a sticky identity for each pod, ensuring predictability in distributed systems.

StatefulSets are ideal for applications such as:

  • Distributed databases (MongoDB, Cassandra, MySQL)
  • Message brokers with persistent storage (Kafka, RabbitMQ)
  • Applications requiring stable network identifiers
  • Systems that need ordered scaling, updates, or rollbacks

The StatefulSet controller creates pods with predictable naming patterns (e.g., app-0, app-1, app-2) and ensures stable network identities through headless services, making them suitable for clustered applications where peers need to discover each other.
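
The headless service referenced by a StatefulSet is just a regular Service with clusterIP set to None. A minimal sketch, matching the mongodb StatefulSet defined later in this guide:

apiVersion: v1
kind: Service
metadata:
  name: mongodb
spec:
  clusterIP: None  # Headless: no virtual IP; DNS resolves directly to individual pod IPs
  selector:
    app: mongodb
  ports:
  - port: 27017
    name: mongodb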

StatefulSet Core Concepts

Pod Identity

  • Stable network identity: Each pod maintains the same hostname and DNS record across rescheduling
  • Ordinal indices: Pods are named with sequential indices (0, 1, 2...) that never change
  • DNS entries: Each pod gets a DNS record: pod-name.service-name.namespace.svc.cluster.local (see the lookup example after this list)
  • Sticky identity: Pods maintain their identity even after being rescheduled
  • Predictable naming: Facilitates peer discovery and leader election
  • Service targeting: Allows targeting specific instances by ordinal index
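
A pod's stable DNS record can be verified from a throwaway pod; a quick check, assuming the mongodb example from this guide running in the default namespace:

# Resolve the per-pod DNS record from a temporary busybox pod
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup mongodb-0.mongodb.default.svc.cluster.local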

Ordered Deployment

  • Sequential creation: Pods are created one at a time in order (0, 1, 2...)
  • Ordered readiness: Each pod must be running and ready before the next one starts
  • Predictable bootstrapping: Important for databases that need primary node first
  • Failure handling: If any pod fails to become ready, subsequent creations are blocked
  • Manual intervention: May require troubleshooting when deployment stalls
  • Initialization control: Ensures proper cluster formation for distributed systems

Ordered Scaling

  • Controlled scale-up: Pods are created in ascending order (0, 1, 2...)
  • Controlled scale-down: Pods are terminated in descending order (N-1, N-2, ... 0); see the scaling commands after this list
  • Safety guarantees: Prevents split-brain scenarios in clustered applications
  • Data consistency: Helps maintain consensus in distributed systems
  • Replica management: Total ordering of pod lifecycle operations
  • Predictable behavior: Critical for stateful applications with peer awareness
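
To observe this ordering in practice, scale the StatefulSet and watch the pods (commands assume the mongodb example from this guide):

# Scale up: mongodb-3 and mongodb-4 are created one at a time, in order
kubectl scale statefulset mongodb --replicas=5

# Scale down: mongodb-4 is terminated first, then mongodb-3
kubectl scale statefulset mongodb --replicas=3

# Watch the ordered creation and termination as it happens
kubectl get pods -l app=mongodb -w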

Persistent Storage

  • Stable storage: Each pod gets its own PersistentVolumeClaim
  • Storage retention: PVCs are preserved when pods are rescheduled
  • Volume templates: Automatically create PVCs for each pod
  • Storage identity: PVCs maintain a 1:1 relationship with pods
  • Deletion protection: Storage survives pod termination by default
  • Data persistence: Critical for databases and other stateful workloads

StatefulSet YAML Definition

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  selector:
    matchLabels:
      app: mongodb
  serviceName: "mongodb" # Headless service for network identity
  replicas: 3
  updateStrategy:
    type: RollingUpdate  # or OnDelete
  podManagementPolicy: OrderedReady  # or Parallel
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: mongodb
        image: mongo:4.4
        ports:
        - containerPort: 27017
          name: mongodb
        volumeMounts:
        - name: data
          mountPath: /data/db
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
  volumeClaimTemplates:  # PVC template for each pod
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "standard"
      resources:
        requests:
          storage: 10Gi

This StatefulSet will create three MongoDB pods (mongodb-0, mongodb-1, mongodb-2) with dedicated persistent storage for each pod.
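
A quick way to verify the result after applying the manifest (the filename is illustrative); PVCs follow the <template-name>-<statefulset-name>-<ordinal> naming convention:

kubectl apply -f mongodb-statefulset.yaml  # hypothetical filename for the manifest above

# Pods: mongodb-0, mongodb-1, mongodb-2
kubectl get pods -l app=mongodb

# PVCs: data-mongodb-0, data-mongodb-1, data-mongodb-2
kubectl get pvc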

StatefulSet Lifecycle

The StatefulSet controller manages the full lifecycle of the pods it creates:

  1. Creation
    • Pods are created in strict sequential order, starting with index 0
    • Each new pod waits for the previous pod to be running and ready
    • If a pod fails to become ready, creation of subsequent pods is blocked
  2. Updates
    • With the RollingUpdate strategy, pods are updated in reverse ordinal order, starting with the highest index
    • Each pod is updated one at a time, waiting for readiness before proceeding
    • The rollout halts if an updated pod fails to become ready
  3. Scaling
    • When scaling up, new pods are created in sequential order
    • When scaling down, pods are terminated in reverse order
    • Scaling operations respect the same ordering guarantees as creation
  4. Deletion
    • When a StatefulSet is deleted, pods are terminated in reverse order
    • PersistentVolumeClaims are not deleted automatically
    • Orphaned PVCs must be deleted manually if no longer needed; see the cleanup commands after this list
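
A sketch of the corresponding cleanup commands, assuming the mongodb example from this guide:

# Delete the StatefulSet; pods are terminated in reverse ordinal order
kubectl delete statefulset mongodb

# Alternatively, delete only the StatefulSet object and leave its pods running
kubectl delete statefulset mongodb --cascade=orphan

# PVCs survive and must be removed explicitly once the data is no longer needed
kubectl delete pvc data-mongodb-0 data-mongodb-1 data-mongodb-2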

StatefulSet Usage Scenarios

Distributed Database Cluster

StatefulSets are ideal for database clusters where each node needs:

  • A stable identity for peer discovery
  • Persistent storage that survives restarts
  • Predictable ordering for primary/secondary initialization

For example, a Cassandra cluster requires nodes to know about each other and join the ring in a controlled manner:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: cassandra:3.11
        ports:
        - containerPort: 7000
          name: intra-node
        - containerPort: 7001
          name: tls-intra-node
        - containerPort: 7199
          name: jmx
        - containerPort: 9042
          name: cql
        env:
        - name: CASSANDRA_SEEDS
          value: "cassandra-0.cassandra.default.svc.cluster.local"
        - name: MAX_HEAP_SIZE
          value: 512M
        - name: HEAP_NEWSIZE
          value: 100M
        volumeMounts:
        - name: cassandra-data
          mountPath: /var/lib/cassandra
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: standard
      resources:
        requests:
          storage: 10Gi

Replicated Key-Value Store

For services like etcd or ZooKeeper that require quorum-based consensus:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  selector:
    matchLabels:
      app: etcd
  serviceName: etcd
  replicas: 3
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.4.13
        command:
        - /bin/sh
        - -c
        - |
          PEERS=""
          for i in $(seq 0 $((${REPLICAS}-1))); do
            PEERS="${PEERS}${PEERS:+,}etcd-${i}=http://etcd-${i}.etcd:2380"
          done
          exec etcd --name ${HOSTNAME} \
            --listen-peer-urls http://0.0.0.0:2380 \
            --listen-client-urls http://0.0.0.0:2379 \
            --advertise-client-urls http://${HOSTNAME}.etcd:2379 \
            --initial-advertise-peer-urls http://${HOSTNAME}.etcd:2380 \
            --initial-cluster-token etcd-cluster \
            --initial-cluster ${PEERS} \
            --initial-cluster-state new \
            --data-dir /var/lib/etcd/data
        env:
        - name: REPLICAS
          value: "3"
        ports:
        - containerPort: 2379
          name: client
        - containerPort: 2380
          name: peer
        volumeMounts:
        - name: data
          mountPath: /var/lib/etcd
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
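
Once all three members are ready, cluster health can be checked from inside one of the pods (a quick check against the manifest above):

# List members and check endpoint health from etcd-0
kubectl exec etcd-0 -- etcdctl --endpoints=http://localhost:2379 member list
kubectl exec etcd-0 -- etcdctl --endpoints=http://localhost:2379 endpoint health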

Introduction to DaemonSets

DaemonSets ensure that a copy of a pod runs on all (or a subset of) nodes in a Kubernetes cluster. As nodes are added to the cluster, pods are automatically added to them; as nodes are removed, those pods are garbage collected. Deleting a DaemonSet cleans up all the pods it created.

DaemonSets are ideal for:

  • Node monitoring agents (e.g., Prometheus Node Exporter)
  • Log collectors (Fluentd, Filebeat, Logstash)
  • Network plugins (CNI providers like Calico, Weave)
  • Storage daemons (Ceph, GlusterFS)
  • Security agents that need to run on every node

The key characteristic of a DaemonSet is that it places exactly one pod instance per qualifying node, making it perfect for infrastructure-level services.

DaemonSet Core Concepts

Node Coverage

  • One pod per node: Ensures a single instance runs on each node
  • Automatic pod creation: New pods are created when nodes join the cluster
  • Automatic cleanup: Pods are removed when nodes are drained or deleted
  • Node selection: Can target specific nodes using nodeSelector or node affinity
  • Taint tolerance: Can run on nodes with taints by specifying tolerations
  • Guaranteed presence: Critical for node-level monitoring and services

Pod Scheduling

  • Bypass scheduler: Historically, the DaemonSet controller placed pods directly by setting spec.nodeName, bypassing the scheduler
  • Modern scheduling: Since Kubernetes 1.12, the default scheduler places DaemonSet pods using per-node affinity rules
  • Placement guarantee: Ensures pods run even on nodes that might be unschedulable
  • Update strategy: Controls how pods are updated across the cluster
  • Pod management policy: Determines if pods are created in parallel or sequentially
  • Revision history: Tracks changes for rollback purposes

Use Cases

  • Cluster networking: CNI plugin components on every node
  • Monitoring: Node exporters for metrics collection
  • Log aggregation: Collecting logs from every node
  • Storage provisioners: Node-level storage drivers
  • Security scanners: Vulnerability detection on each node
  • Service meshes: Proxies or sidecars that need node-wide presence

DaemonSet YAML Definition

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
  labels:
    app: fluentd
spec:
  selector:
    matchLabels:
      app: fluentd
  updateStrategy:
    type: RollingUpdate    # or OnDelete
    rollingUpdate:
      maxUnavailable: 1    # Number of pods that can be unavailable during update
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      tolerations:         # Allow scheduling on master/control-plane nodes
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.14
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

This DaemonSet deploys Fluentd for log collection on every node in the cluster, including control plane nodes (via tolerations).
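
Node coverage can be verified by comparing the DaemonSet status with the node list:

# DESIRED/CURRENT/READY should match the number of qualifying nodes
kubectl get daemonset fluentd -n kube-system

# One fluentd pod per node, visible in the NODE column
kubectl get pods -n kube-system -l app=fluentd -o wide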

DaemonSet Scheduling

DaemonSet pods are scheduled on nodes in two different ways, depending on the Kubernetes version:

  1. Traditional (pre-1.12): The DaemonSet controller created pods directly on nodes by setting spec.nodeName, bypassing the scheduler
  2. Modern (1.12+): The DaemonSet controller creates pods with a node affinity term targeting each node and lets the default scheduler place them; the required tolerations (for example, for node.kubernetes.io/unschedulable) are added automatically

The modern approach offers several advantages:

  • Respects resource constraints and quality of service
  • Allows pod preemption
  • Provides consistent scheduling behavior
  • Integrates with scheduling features like pod priority

Because DaemonSet pods automatically tolerate the node.kubernetes.io/unschedulable taint, they are placed even on cordoned nodes where normal pods are not scheduled.

Node Selection for DaemonSets

DaemonSets can target specific nodes using three mechanisms:

  • nodeSelector: Simple node-label matching in the pod template
  • Node affinity: Expressive, rule-based node selection (required or preferred rules)
  • Taints and tolerations: Tolerations allow DaemonSet pods onto nodes that repel other workloads

These mechanisms can be combined to create sophisticated node targeting rules for DaemonSet pods.
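
A sketch combining all three in a DaemonSet pod template (the disktype label and the architecture values are illustrative):

spec:
  template:
    spec:
      nodeSelector:                      # 1. Simple label match
        disktype: ssd
      affinity:                          # 2. Expressive node affinity rules
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: ["amd64", "arm64"]
      tolerations:                       # 3. Tolerate tainted nodes
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule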

DaemonSet Updates

DaemonSets support two update strategies:

  1. OnDelete: Pods are only updated when manually deleted
    • No automatic updates
    • Administrator controls timing
    • Manual rollout across the cluster
    • Good for critical infrastructure components
  2. RollingUpdate (default): Pods are automatically updated in controlled fashion
    • Controlled by maxUnavailable (default: 1)
    • Updates occur in batches
    • Waits for updated pods to become ready before continuing
    • Supports rollback through tracked revision history

Rolling updates allow for controlled deployment of new versions:

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%  # No more than 25% of nodes will have unavailable pods
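
Rollout progress can be monitored with the standard rollout commands (using the fluentd example from this guide):

# Watch the rolling update progress across nodes
kubectl rollout status daemonset/fluentd -n kube-system

# Check which pods have been replaced so far
kubectl get pods -n kube-system -l app=fluentd -o wide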

DaemonSet Usage Scenarios

Monitoring Agent

Deploy a Prometheus Node Exporter on every node to collect system metrics:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.3.1
        args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        ports:
        - containerPort: 9100
          protocol: TCP
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys

Storage Provisioner

Deploy a storage provisioner daemon on selected nodes with specific hardware:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ceph-osd
  namespace: storage
spec:
  selector:
    matchLabels:
      app: ceph-osd
  template:
    metadata:
      labels:
        app: ceph-osd
    spec:
      nodeSelector:
        storage-node: "true"
      containers:
      - name: ceph-osd
        image: ceph/daemon:latest
        args: ["osd"]
        env:
        - name: CEPH_DAEMON
          value: osd
        - name: OSD_DEVICE
          value: /dev/sdb
        securityContext:
          privileged: true
        volumeMounts:
        - name: devices
          mountPath: /dev
        - name: ceph
          mountPath: /var/lib/ceph
      volumes:
      - name: devices
        hostPath:
          path: /dev
      - name: ceph
        hostPath:
          path: /var/lib/ceph

StatefulSets vs DaemonSets vs Deployments

Understanding when to use each controller type is crucial for proper application architecture:

| Feature | StatefulSet | DaemonSet | Deployment |
|---|---|---|---|
| Use case | Stateful applications | Node-level services | Stateless applications |
| Pod identity | Stable, persistent | Tied to node | Interchangeable |
| Scaling | Ordered, sequential | One per node | Any order, parallel |
| Storage | Persistent per pod | Usually host mounts | Usually ephemeral |
| Networking | Stable DNS names | Usually host network | Service abstraction |
| Update strategy | Ordered, in-place | Per node, controlled | Controlled replacement |
| Node placement | Any suitable node | One per matching node | Any suitable node |
| Typical examples | Databases, message queues | Monitoring, logging agents | Web servers, API services |

Advanced Features and Patterns

StatefulSet Update Partitioning

Update partitioning allows rolling out changes to a subset of pods:

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2  # Only pods with ordinal >= 2 will be updated

This creates a controlled update where only some pods receive the new version, enabling advanced rollout strategies.
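
A typical canary flow with the mongodb example (the image tag is illustrative): update only the highest ordinals first, then lower the partition to finish the rollout:

# Update only pods with ordinal >= 2 (mongodb-2 acts as the canary)
kubectl patch statefulset mongodb -p \
  '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
kubectl set image statefulset/mongodb mongodb=mongo:4.4.18

# Once the canary looks healthy, lower the partition to update the rest
kubectl patch statefulset mongodb -p \
  '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'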

Parallel Pod Management

For StatefulSets that don't require ordered deployment:

spec:
  podManagementPolicy: Parallel  # Instead of OrderedReady

This allows pods to be created or terminated in parallel, speeding up scaling operations when strict ordering isn't needed.

DaemonSet Rolling Back

To roll back a DaemonSet to a previous revision:

# Check revision history
kubectl rollout history daemonset fluentd -n kube-system

# Roll back to previous revision
kubectl rollout undo daemonset fluentd -n kube-system

# Roll back to specific revision
kubectl rollout undo daemonset fluentd -n kube-system --to-revision=2

Controlling Pod Deletion Cost

The controller.kubernetes.io/pod-deletion-cost annotation is honored by the ReplicaSet controller (and thus Deployments) during scale-down, not by DaemonSets or StatefulSets, which have their own deletion ordering; pods with lower cost are removed first:

spec:
  template:
    metadata:
      annotations:
        controller.kubernetes.io/pod-deletion-cost: "100"  # Higher values deleted later

This can be useful when some replicas are more expensive to lose than others.

Troubleshooting

Common issues and their solutions, followed by a set of diagnostic commands:

  1. StatefulSet pod stuck in Pending
    • Check PVC provisioning (kubectl get pvc)
    • Verify storage class exists and provisioner is running
    • Check for resource constraints on nodes
  2. DaemonSet not scheduling on a node
    • Verify node labels match nodeSelector
    • Check for taints that need tolerations
    • Examine node conditions and capacity
  3. StatefulSet update stuck
    • Check readiness probes in updated pods
    • Identify which ordinal index is blocking the rollout
    • May need manual intervention if pod can't become ready
  4. DaemonSet creates too many or too few pods
    • Verify node selectors and affinities
    • Check if nodes are cordoned or have unexpected taints
    • Ensure nodes are in Ready state
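
A starting set of diagnostic commands for these cases (resource names assume the examples in this guide):

# Inspect a stuck StatefulSet pod and its storage
kubectl describe pod mongodb-0
kubectl get pvc
kubectl get events --sort-by=.metadata.creationTimestamp

# Inspect DaemonSet scheduling on a specific node
kubectl describe node <node-name>
kubectl get daemonset fluentd -n kube-system -o wide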

Conclusion

StatefulSets and DaemonSets are specialized controllers that extend Kubernetes' capabilities beyond simple stateless applications. StatefulSets enable complex distributed systems with state, while DaemonSets ensure critical services run on every appropriate node.

Understanding when and how to use these controllers is crucial for building robust, scalable Kubernetes applications. By following the patterns and best practices outlined in this guide, you can effectively leverage these powerful resources to build sophisticated application architectures on Kubernetes.