StatefulSets & DaemonSets

Managing stateful applications and node-level daemons in Kubernetes

Introduction to StatefulSets

StatefulSets are specialized workload resources in Kubernetes designed for applications that require stable networking, persistent storage, and ordered, graceful deployment and scaling. Unlike Deployments, which treat pods as interchangeable, StatefulSets maintain a sticky identity for each pod, ensuring predictability in distributed systems.

StatefulSets are ideal for applications such as:

  • Distributed databases (MongoDB, Cassandra, MySQL)
  • Message brokers with persistent storage (Kafka, RabbitMQ)
  • Applications requiring stable network identifiers
  • Systems that need ordered scaling, updates, or rollbacks

The StatefulSet controller creates pods with predictable naming patterns (e.g., app-0, app-1, app-2) and ensures stable network identities through headless services, making them suitable for clustered applications where peers need to discover each other.
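
The headless service referenced by a StatefulSet is just a regular Service with clusterIP set to None. A minimal sketch, matching the mongodb StatefulSet defined later in this guide:

apiVersion: v1
kind: Service
metadata:
  name: mongodb
spec:
  clusterIP: None  # Headless: no virtual IP; DNS resolves directly to individual pod IPs
  selector:
    app: mongodb
  ports:
  - port: 27017
    name: mongodb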

StatefulSet Core Concepts

Pod Identity

  • Stable network identity: Each pod maintains the same hostname and DNS record across rescheduling
  • Ordinal indices: Pods are named with sequential indices (0, 1, 2...) that never change
  • DNS entries: Each pod gets a DNS record: pod-name.service-name.namespace.svc.cluster.local (see the lookup example after this list)
  • Sticky identity: Pods maintain their identity even after being rescheduled
  • Predictable naming: Facilitates peer discovery and leader election
  • Service targeting: Allows targeting specific instances by ordinal index
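
A pod's stable DNS record can be verified from a throwaway pod; a quick check, assuming the mongodb example from this guide running in the default namespace:

# Resolve the per-pod DNS record from a temporary busybox pod
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup mongodb-0.mongodb.default.svc.cluster.local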

Ordered Deployment

  • Sequential creation: Pods are created one at a time in order (0, 1, 2...)
  • Ordered readiness: Each pod must be running and ready before the next one starts
  • Predictable bootstrapping: Important for databases that need primary node first
  • Failure handling: If any pod fails to become ready, subsequent creations are blocked
  • Manual intervention: May require troubleshooting when deployment stalls
  • Initialization control: Ensures proper cluster formation for distributed systems

Ordered Scaling

  • Controlled scale-up: Pods are created in ascending order (0, 1, 2...)
  • Controlled scale-down: Pods are terminated in descending order (N-1, N-2, ... 0); see the scaling commands after this list
  • Safety guarantees: Prevents split-brain scenarios in clustered applications
  • Data consistency: Helps maintain consensus in distributed systems
  • Replica management: Total ordering of pod lifecycle operations
  • Predictable behavior: Critical for stateful applications with peer awareness
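
To observe this ordering in practice, scale the StatefulSet and watch the pods (commands assume the mongodb example from this guide):

# Scale up: mongodb-3 and mongodb-4 are created one at a time, in order
kubectl scale statefulset mongodb --replicas=5

# Scale down: mongodb-4 is terminated first, then mongodb-3
kubectl scale statefulset mongodb --replicas=3

# Watch the ordered creation and termination as it happens
kubectl get pods -l app=mongodb -w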

Persistent Storage

  • Stable storage: Each pod gets its own PersistentVolumeClaim
  • Storage retention: PVCs are preserved when pods are rescheduled
  • Volume templates: Automatically create PVCs for each pod
  • Storage identity: PVCs maintain a 1:1 relationship with pods
  • Deletion protection: Storage survives pod termination by default
  • Data persistence: Critical for databases and other stateful workloads

StatefulSet YAML Definition

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  selector:
    matchLabels:
      app: mongodb
  serviceName: "mongodb" # Headless service for network identity
  replicas: 3
  updateStrategy:
    type: RollingUpdate  # or OnDelete
  podManagementPolicy: OrderedReady  # or Parallel
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: mongodb
        image: mongo:4.4
        ports:
        - containerPort: 27017
          name: mongodb
        volumeMounts:
        - name: data
          mountPath: /data/db
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
  volumeClaimTemplates:  # PVC template for each pod
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "standard"
      resources:
        requests:
          storage: 10Gi

This StatefulSet will create three MongoDB pods (mongodb-0, mongodb-1, mongodb-2) with dedicated persistent storage for each pod.
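
A quick way to verify the result after applying the manifest (the filename is illustrative); PVCs follow the <template-name>-<statefulset-name>-<ordinal> naming convention:

kubectl apply -f mongodb-statefulset.yaml  # hypothetical filename for the manifest above

# Pods: mongodb-0, mongodb-1, mongodb-2
kubectl get pods -l app=mongodb

# PVCs: data-mongodb-0, data-mongodb-1, data-mongodb-2
kubectl get pvc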

StatefulSet Lifecycle

The StatefulSet controller manages the full lifecycle of the pods it creates:

  1. Creation
    • Pods are created in strict sequential order, starting with index 0
    • Each new pod waits for the previous pod to be running and ready
    • If a pod fails to become ready, creation of subsequent pods is blocked
  2. Updates
    • With the RollingUpdate strategy, pods are updated in reverse ordinal order, starting with the highest index
    • Each pod is updated one at a time, waiting for readiness before proceeding
    • The rollout halts if an updated pod fails to become ready
  3. Scaling
    • When scaling up, new pods are created in sequential order
    • When scaling down, pods are terminated in reverse order
    • Scaling operations respect the same ordering guarantees as creation
  4. Deletion
    • When a StatefulSet is deleted, pods are terminated in reverse order
    • PersistentVolumeClaims are not deleted automatically
    • Orphaned PVCs must be deleted manually if no longer needed; see the cleanup commands after this list
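
A sketch of the corresponding cleanup commands, assuming the mongodb example from this guide:

# Delete the StatefulSet; pods are terminated in reverse ordinal order
kubectl delete statefulset mongodb

# Alternatively, delete only the StatefulSet object and leave its pods running
kubectl delete statefulset mongodb --cascade=orphan

# PVCs survive and must be removed explicitly once the data is no longer needed
kubectl delete pvc data-mongodb-0 data-mongodb-1 data-mongodb-2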

StatefulSet Usage Scenarios

Distributed Database Cluster

StatefulSets are ideal for database clusters where each node needs:

  • A stable identity for peer discovery
  • Persistent storage that survives restarts
  • Predictable ordering for primary/secondary initialization

For example, a Cassandra cluster requires nodes to know about each other and join the ring in a controlled manner:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: cassandra:3.11
        ports:
        - containerPort: 7000
          name: intra-node
        - containerPort: 7001
          name: tls-intra-node
        - containerPort: 7199
          name: jmx
        - containerPort: 9042
          name: cql
        env:
        - name: CASSANDRA_SEEDS
          value: "cassandra-0.cassandra.default.svc.cluster.local"
        - name: MAX_HEAP_SIZE
          value: 512M
        - name: HEAP_NEWSIZE
          value: 100M
        volumeMounts:
        - name: cassandra-data
          mountPath: /var/lib/cassandra
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: standard
      resources:
        requests:
          storage: 10Gi

Replicated Key-Value Store

For services like etcd or ZooKeeper that require quorum-based consensus:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  selector:
    matchLabels:
      app: etcd
  serviceName: etcd
  replicas: 3
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.4.13
        command:
        - /bin/sh
        - -c
        - |
          PEERS=""
          for i in $(seq 0 $((${REPLICAS}-1))); do
            PEERS="${PEERS}${PEERS:+,}etcd-${i}=http://etcd-${i}.etcd:2380"
          done
          exec etcd --name ${HOSTNAME} \
            --listen-peer-urls http://0.0.0.0:2380 \
            --listen-client-urls http://0.0.0.0:2379 \
            --advertise-client-urls http://${HOSTNAME}.etcd:2379 \
            --initial-advertise-peer-urls http://${HOSTNAME}.etcd:2380 \
            --initial-cluster-token etcd-cluster \
            --initial-cluster ${PEERS} \
            --initial-cluster-state new \
            --data-dir /var/lib/etcd/data
        env:
        - name: REPLICAS
          value: "3"
        ports:
        - containerPort: 2379
          name: client
        - containerPort: 2380
          name: peer
        volumeMounts:
        - name: data
          mountPath: /var/lib/etcd
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
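
Once all three members are ready, cluster health can be checked from inside one of the pods (a quick check against the manifest above):

# List members and check endpoint health from etcd-0
kubectl exec etcd-0 -- etcdctl --endpoints=http://localhost:2379 member list
kubectl exec etcd-0 -- etcdctl --endpoints=http://localhost:2379 endpoint health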

Introduction to DaemonSets

DaemonSets ensure that a copy of a pod runs on all (or a subset of) nodes in a Kubernetes cluster. As nodes are added to the cluster, pods are automatically added to them; as nodes are removed, those pods are garbage collected. Deleting a DaemonSet cleans up all the pods it created.

DaemonSets are ideal for:

  • Node monitoring agents (e.g., Prometheus Node Exporter)
  • Log collectors (Fluentd, Filebeat, Logstash)
  • Network plugins (CNI providers like Calico, Weave)
  • Storage daemons (Ceph, GlusterFS)
  • Security agents that need to run on every node

The key characteristic of a DaemonSet is that it places exactly one pod instance per qualifying node, making it perfect for infrastructure-level services.

DaemonSet Core Concepts

Node Coverage

  • One pod per node: Ensures a single instance runs on each node
  • Automatic pod creation: New pods are created when nodes join the cluster
  • Automatic cleanup: Pods are removed when nodes are drained or deleted
  • Node selection: Can target specific nodes using nodeSelector or node affinity
  • Taint tolerance: Can run on nodes with taints by specifying tolerations
  • Guaranteed presence: Critical for node-level monitoring and services

Pod Scheduling

  • Bypass scheduler: Historically, the DaemonSet controller placed pods directly by setting spec.nodeName, bypassing the scheduler
  • Modern scheduling: Since Kubernetes 1.12, the default scheduler places DaemonSet pods using per-node affinity rules
  • Placement guarantee: Ensures pods run even on nodes that might be unschedulable
  • Update strategy: Controls how pods are updated across the cluster
  • Pod management policy: Determines if pods are created in parallel or sequentially
  • Revision history: Tracks changes for rollback purposes

Use Cases

  • Cluster networking: CNI plugin components on every node
  • Monitoring: Node exporters for metrics collection
  • Log aggregation: Collecting logs from every node
  • Storage provisioners: Node-level storage drivers
  • Security scanners: Vulnerability detection on each node
  • Service meshes: Proxies or sidecars that need node-wide presence

DaemonSet YAML Definition

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
  labels:
    app: fluentd
spec:
  selector:
    matchLabels:
      app: fluentd
  updateStrategy:
    type: RollingUpdate    # or OnDelete
    rollingUpdate:
      maxUnavailable: 1    # Number of pods that can be unavailable during update
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      tolerations:         # Allow scheduling on master/control-plane nodes
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.14
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

This DaemonSet deploys Fluentd for log collection on every node in the cluster, including control plane nodes (via tolerations).
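
Node coverage can be verified by comparing the DaemonSet status with the node list:

# DESIRED/CURRENT/READY should match the number of qualifying nodes
kubectl get daemonset fluentd -n kube-system

# One fluentd pod per node, visible in the NODE column
kubectl get pods -n kube-system -l app=fluentd -o wide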

DaemonSet Scheduling

DaemonSet pods are scheduled on nodes in two different ways, depending on the Kubernetes version:

  1. Traditional (pre-1.12): The DaemonSet controller created pods directly on nodes by setting spec.nodeName, bypassing the scheduler
  2. Modern (1.12+): The DaemonSet controller creates pods with a node affinity term targeting each node and lets the default scheduler place them; the required tolerations (for example, for node.kubernetes.io/unschedulable) are added automatically

The modern approach offers several advantages:

  • Respects resource constraints and quality of service
  • Allows pod preemption
  • Provides consistent scheduling behavior
  • Integrates with scheduling features like pod priority

Because DaemonSet pods automatically tolerate the node.kubernetes.io/unschedulable taint, they are placed even on cordoned nodes where normal pods are not scheduled.

Node Selection for DaemonSets

DaemonSets can target specific nodes using three mechanisms:

  • nodeSelector: Simple node-label matching in the pod template
  • Node affinity: Expressive, rule-based node selection (required or preferred rules)
  • Taints and tolerations: Tolerations allow DaemonSet pods onto nodes that repel other workloads

These mechanisms can be combined to create sophisticated node targeting rules for DaemonSet pods.
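
A sketch combining all three in a DaemonSet pod template (the disktype label and the architecture values are illustrative):

spec:
  template:
    spec:
      nodeSelector:                      # 1. Simple label match
        disktype: ssd
      affinity:                          # 2. Expressive node affinity rules
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: ["amd64", "arm64"]
      tolerations:                       # 3. Tolerate tainted nodes
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule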

DaemonSet Updates

DaemonSets support two update strategies:

  1. OnDelete: Pods are only updated when manually deleted
    • No automatic updates
    • Administrator controls timing
    • Manual rollout across the cluster
    • Good for critical infrastructure components
  2. RollingUpdate (default): Pods are automatically updated in controlled fashion
    • Controlled by maxUnavailable (default: 1)
    • Updates occur in batches
    • Waits for updated pods to become ready before continuing
    • Supports rollback through tracked revision history

Rolling updates allow for controlled deployment of new versions:

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%  # No more than 25% of nodes will have unavailable pods
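
Rollout progress can be monitored with the standard rollout commands (using the fluentd example from this guide):

# Watch the rolling update progress across nodes
kubectl rollout status daemonset/fluentd -n kube-system

# Check which pods have been replaced so far
kubectl get pods -n kube-system -l app=fluentd -o wide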

DaemonSet Usage Scenarios

Monitoring Agent

Deploy a Prometheus Node Exporter on every node to collect system metrics:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.3.1
        args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        ports:
        - containerPort: 9100
          protocol: TCP
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys

Storage Provisioner

Deploy a storage provisioner daemon on selected nodes with specific hardware:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ceph-osd
  namespace: storage
spec:
  selector:
    matchLabels:
      app: ceph-osd
  template:
    metadata:
      labels:
        app: ceph-osd
    spec:
      nodeSelector:
        storage-node: "true"
      containers:
      - name: ceph-osd
        image: ceph/daemon:latest
        args: ["osd"]
        env:
        - name: CEPH_DAEMON
          value: osd
        - name: OSD_DEVICE
          value: /dev/sdb
        securityContext:
          privileged: true
        volumeMounts:
        - name: devices
          mountPath: /dev
        - name: ceph
          mountPath: /var/lib/ceph
      volumes:
      - name: devices
        hostPath:
          path: /dev
      - name: ceph
        hostPath:
          path: /var/lib/ceph

StatefulSets vs DaemonSets vs Deployments

Understanding when to use each controller type is crucial for proper application architecture:

| Feature | StatefulSet | DaemonSet | Deployment |
|---|---|---|---|
| Use case | Stateful applications | Node-level services | Stateless applications |
| Pod identity | Stable, persistent | Tied to node | Interchangeable |
| Scaling | Ordered, sequential | One per node | Any order, parallel |
| Storage | Persistent per pod | Usually host mounts | Usually ephemeral |
| Networking | Stable DNS names | Usually host network | Service abstraction |
| Update strategy | Ordered, in-place | Per node, controlled | Controlled replacement |
| Node placement | Any suitable node | One per matching node | Any suitable node |
| Typical examples | Databases, message queues | Monitoring, logging agents | Web servers, API services |

Advanced Features and Patterns

StatefulSet Update Partitioning

Update partitioning allows rolling out changes to a subset of pods:

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2  # Only pods with ordinal >= 2 will be updated

This creates a controlled update where only some pods receive the new version, enabling advanced rollout strategies.
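
A typical canary flow with the mongodb example (the image tag is illustrative): update only the highest ordinals first, then lower the partition to finish the rollout:

# Update only pods with ordinal >= 2 (mongodb-2 acts as the canary)
kubectl patch statefulset mongodb -p \
  '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
kubectl set image statefulset/mongodb mongodb=mongo:4.4.18

# Once the canary looks healthy, lower the partition to update the rest
kubectl patch statefulset mongodb -p \
  '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'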

Parallel Pod Management

For StatefulSets that don't require ordered deployment:

spec:
  podManagementPolicy: Parallel  # Instead of OrderedReady

This allows pods to be created or terminated in parallel, speeding up scaling operations when strict ordering isn't needed.

DaemonSet Rolling Back

To roll back a DaemonSet to a previous revision:

# Check revision history
kubectl rollout history daemonset fluentd -n kube-system

# Roll back to previous revision
kubectl rollout undo daemonset fluentd -n kube-system

# Roll back to specific revision
kubectl rollout undo daemonset fluentd -n kube-system --to-revision=2

Controlling Pod Deletion Cost

The controller.kubernetes.io/pod-deletion-cost annotation is honored by the ReplicaSet controller (and thus Deployments) during scale-down, not by DaemonSets or StatefulSets, which have their own deletion ordering; pods with lower cost are removed first:

spec:
  template:
    metadata:
      annotations:
        controller.kubernetes.io/pod-deletion-cost: "100"  # Higher values deleted later

This can be useful when some replicas are more expensive to lose than others.

Troubleshooting

Common issues and their solutions, followed by a set of diagnostic commands:

  1. StatefulSet pod stuck in Pending
    • Check PVC provisioning (kubectl get pvc)
    • Verify storage class exists and provisioner is running
    • Check for resource constraints on nodes
  2. DaemonSet not scheduling on a node
    • Verify node labels match nodeSelector
    • Check for taints that need tolerations
    • Examine node conditions and capacity
  3. StatefulSet update stuck
    • Check readiness probes in updated pods
    • Identify which ordinal index is blocking the rollout
    • May need manual intervention if pod can't become ready
  4. DaemonSet creates too many or too few pods
    • Verify node selectors and affinities
    • Check if nodes are cordoned or have unexpected taints
    • Ensure nodes are in Ready state
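
A starting set of diagnostic commands for these cases (resource names assume the examples in this guide):

# Inspect a stuck StatefulSet pod and its storage
kubectl describe pod mongodb-0
kubectl get pvc
kubectl get events --sort-by=.metadata.creationTimestamp

# Inspect DaemonSet scheduling on a specific node
kubectl describe node <node-name>
kubectl get daemonset fluentd -n kube-system -o wide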

Conclusion

StatefulSets and DaemonSets are specialized controllers that extend Kubernetes' capabilities beyond simple stateless applications. StatefulSets enable complex distributed systems with state, while DaemonSets ensure critical services run on every appropriate node.

Understanding when and how to use these controllers is crucial for building robust, scalable Kubernetes applications. By following the patterns and best practices outlined in this guide, you can effectively leverage these powerful resources to build sophisticated application architectures on Kubernetes.