
Advanced Scheduling & Affinity

Understanding Kubernetes advanced scheduling capabilities and affinity rules

Introduction to Kubernetes Scheduling

Kubernetes scheduling is the process of assigning pods to nodes in the cluster. While the default scheduler works well for many scenarios, advanced applications often require more sophisticated placement strategies. Kubernetes provides powerful mechanisms to influence scheduling decisions, allowing administrators and developers to optimize for hardware efficiency, workload co-location, and availability.

The Kubernetes scheduler is a control plane component that watches for newly created pods with no assigned node and selects a node for them to run on. This decision-making process considers:

  • Resource requirements and availability
  • Hardware/software constraints
  • Affinity and anti-affinity specifications
  • Taints and tolerations
  • Priority and preemption
  • Custom scheduler policies

Advanced scheduling capabilities allow you to implement complex deployment strategies, enforce business requirements, and optimize resource utilization across your cluster.
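
When the scheduler cannot satisfy these constraints, the pod stays Pending and the reason is recorded as events on the pod. A quick way to investigate (the pod name here is just a placeholder):

# Look at the Events section for FailedScheduling messages
kubectl describe pod my-pending-pod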

Node Selectors

Node selectors provide a simple way to constrain pods to nodes with specific labels. This is the most straightforward way to control pod placement.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-machine-learning
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.2.2-base-ubuntu20.04
  nodeSelector:
    hardware: gpu
    gpu-type: nvidia-tesla

To use node selectors:

  1. Label your nodes appropriately:
    kubectl label nodes node1 hardware=gpu gpu-type=nvidia-tesla
    
  2. Add a nodeSelector field to your pod specification targeting those labels
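
To confirm the labels are in place before deploying, you can query nodes by those labels (values taken from the example above):

# List only the nodes that carry both labels
kubectl get nodes -l hardware=gpu,gpu-type=nvidia-tesla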

While simple to use, node selectors have limitations:

  • Only support equality-based requirements
  • Cannot express more complex conditions (OR, NOT operations)
  • Limited expressiveness for complex deployment scenarios

For more sophisticated requirements, Kubernetes offers node affinity.

Node Affinity

Node Affinity Basics

  • More expressive matching: Supports complex logical operations beyond simple equality
  • Rich selector syntax: Uses the same selector syntax as labels
  • Two types:
    • requiredDuringSchedulingIgnoredDuringExecution: Hard requirement (must be met)
    • preferredDuringSchedulingIgnoredDuringExecution: Soft preference (preferred but not mandatory)
  • Operators: Includes In, NotIn, Exists, DoesNotExist, Gt, Lt
  • Weight-based preferences: Assign priorities to different requirements
  • Future support: A requiredDuringSchedulingRequiredDuringExecution variant has been proposed but is not yet implemented

Hard Requirements Example

apiVersion: v1
kind: Pod
metadata:
  name: nginx-with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
            - arm64

This pod will only be scheduled on Linux nodes with either amd64 or arm64 architecture.

Soft Preferences Example

apiVersion: v1
kind: Pod
metadata:
  name: nginx-with-preferred-node-affinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: availability-zone
            operator: In
            values:
            - zone1
      - weight: 20
        preference:
          matchExpressions:
          - key: node-type
            operator: In
            values:
            - compute-optimized

This pod prefers to run in zone1 (weight 80) and on compute-optimized nodes (weight 20), but will still run elsewhere if these preferences cannot be satisfied.

The "IgnoredDuringExecution" part of these rules means that if a node's labels change after a pod is scheduled, the pod won't be evicted. Future versions of Kubernetes may support "RequiredDuringExecution" which would evict pods when nodes no longer satisfy the requirements.

Pod Affinity and Anti-Affinity

While node affinity controls which nodes pods can run on, pod affinity and anti-affinity control how pods are scheduled relative to other pods.

Topology Key

The topologyKey field is crucial for pod affinity/anti-affinity. It defines the domain over which the rule applies:

  • kubernetes.io/hostname: Node level (most restrictive)
  • topology.kubernetes.io/zone: Availability zone level
  • topology.kubernetes.io/region: Region level (least restrictive)
  • Custom topology domains: Any node label can be used

A broader topology key (such as zone or region) applies the rule across larger failure domains, so pods only need to satisfy it at that coarse level; a narrower key such as kubernetes.io/hostname enforces the rule node by node, giving the most fine-grained control over placement.
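
On most managed clusters the zone and region labels are set automatically; you can check what your nodes carry with:

# Show nodes with their zone and region labels as extra columns
kubectl get nodes -L topology.kubernetes.io/zone -L topology.kubernetes.io/region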

Complete Pod Affinity Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      affinity:
        # Attract to cache pods (same zone is sufficient)
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - cache
              topologyKey: topology.kubernetes.io/zone
        
        # Repel from other web-server pods (different nodes required)
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-server
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web-app
        image: nginx:1.19

This deployment:

  1. Prefers to place pods in zones that have cache pods
  2. Strictly requires spreading web-server pods across different nodes

Taints and Tolerations

Taints and tolerations work together to ensure pods are not scheduled onto inappropriate nodes:

  • Taints are applied to nodes to repel certain pods
  • Tolerations are applied to pods to allow (but not require) scheduling on nodes with matching taints

This mechanism is complementary to node affinity:

  • Node affinity is a property of pods that attracts them to nodes
  • Taints are a property of nodes that repel pods
  • Tolerations allow pods to overcome taints

Tainting Nodes

Nodes can be tainted with key-value pairs and an effect:

# Format: kubectl taint nodes <node-name> <key>=<value>:<effect>
kubectl taint nodes node1 dedicated=special-workload:NoSchedule

The three possible effects are:

  • NoSchedule: Pods won't be scheduled on the node (unless they have matching tolerations)
  • PreferNoSchedule: The system will try to avoid placing pods on the node (soft version)
  • NoExecute: New pods won't be scheduled AND existing pods will be evicted if they don't have matching tolerations

Kubernetes automatically adds some taints to nodes with issues:

  • node.kubernetes.io/not-ready: Node is not ready
  • node.kubernetes.io/unreachable: Node is unreachable from the node controller
  • node.kubernetes.io/pid-pressure: Node has PID pressure
  • node.kubernetes.io/memory-pressure: Node has memory pressure
  • node.kubernetes.io/disk-pressure: Node has disk pressure
  • node.kubernetes.io/network-unavailable: Node's network is unavailable
  • node.kubernetes.io/unschedulable: Node is cordoned
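
You can inspect a node's current taints, and remove a taint you applied earlier by appending a minus sign:

# Show the taints currently applied to a node (node name is illustrative)
kubectl describe node node1 | grep -i taints

# Remove the taint added in the earlier example
kubectl taint nodes node1 dedicated=special-workload:NoSchedule-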

Adding Tolerations to Pods

Pods can specify tolerations to match node taints:

apiVersion: v1
kind: Pod
metadata:
  name: specialized-pod
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "special-workload"
    effect: "NoSchedule"
  containers:
  - name: specialized-container
    image: specialized-image

Tolerations can use two operators:

  • Equal: Matches when key, value, and effect are equal
  • Exists: Matches when key and effect exist (ignores value)

You can also create broader tolerations (see the example after this list):

  • Omitting effect matches all effects with the given key
  • Omitting key (with operator Exists) matches all taints
  • Setting operator: Exists with no value matches any value
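
For example, a pod that should tolerate every taint (a pattern sometimes used for node-level agents) needs only a single blanket toleration:

tolerations:
- operator: "Exists"   # no key, value, or effect: matches all taints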

Tolerations and Eviction

The NoExecute effect evicts running pods that do not tolerate the taint. An optional tolerationSeconds field on NoExecute tolerations controls how long a tolerating pod may keep running after the taint is added:

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300  # Pod is evicted after 300 seconds

The tolerationSeconds field defines how long the pod can keep running on a node with the matching taint before being evicted. Kubernetes automatically adds tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300 to pods that do not specify them, which is why pods on an unreachable node are typically evicted after about five minutes.

Use Cases for Taints and Tolerations

  1. Dedicated Nodes
    # Taint the nodes
    kubectl taint nodes node1 dedicated=gpu:NoSchedule
    
    # Only pods with matching toleration will run on this node
    
  2. Special Hardware
    # Pod with toleration for special hardware
    tolerations:
    - key: "hardware"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
    
  3. Control Plane Isolation
    # Kubernetes automatically taints control plane nodes
    node-role.kubernetes.io/control-plane:NoSchedule
    
    # Toleration to allow running on control plane nodes
    tolerations:
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Exists"
      effect: "NoSchedule"
    

Node Affinity vs Taints and Tolerations

Both mechanisms control pod placement but with different approaches:

Feature                 | Node Affinity           | Taints and Tolerations
------------------------|-------------------------|----------------------------------------
Primary purpose         | Attract pods to nodes   | Repel pods from nodes
Direction               | Pods select nodes       | Nodes reject pods
Default behavior        | Pods go anywhere        | Pods avoid tainted nodes
Implementation          | Pod specification       | Node configuration + pod specification
Selection mechanism     | Node labels             | Node taints + pod tolerations
Effect on existing pods | None                    | Can evict (with NoExecute)
Logical operations      | Complex (AND, OR, NOT)  | Simple matching

For complete control over pod placement, you often need to use both mechanisms together (a combined example follows this list):

  • Node affinity to ensure pods go to specific nodes
  • Taints and tolerations to ensure only specific pods go to those nodes
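
A minimal sketch of the combined pattern for a dedicated node pool (the team=data-platform label, taint, and pod name are illustrative):

# Label and taint the dedicated nodes
kubectl label nodes node1 team=data-platform
kubectl taint nodes node1 team=data-platform:NoSchedule

# Pod that both targets and tolerates those nodes
apiVersion: v1
kind: Pod
metadata:
  name: data-platform-job
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: team
            operator: In
            values:
            - data-platform
  tolerations:
  - key: "team"
    operator: "Equal"
    value: "data-platform"
    effect: "NoSchedule"
  containers:
  - name: worker
    image: busybox:1.36
    command: ["sleep", "3600"]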

Pod Priority and Preemption

Priority and preemption allow you to influence the scheduling order and eviction behavior when resources are constrained:

PriorityClass

PriorityClass is a cluster-wide object that defines priority levels:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical production workloads"

The value field defines the priority (higher numbers = higher priority). The globalDefault field (when true) makes this the default for pods without a specified priority.

Assigning Priority to Pods

apiVersion: v1
kind: Pod
metadata:
  name: critical-web-app
spec:
  priorityClassName: high-priority
  containers:
  - name: web-app
    image: nginx

This pod will have priority 1000000 based on the referenced PriorityClass.

Preemption Behavior

When a higher-priority pod cannot be scheduled due to resource constraints:

  1. The scheduler identifies lower-priority pods that could be evicted
  2. The scheduler preempts (evicts) enough lower-priority pods
  3. The higher-priority pod can then be scheduled
  4. Preempted pods go back to pending state and may be rescheduled

Preemption respects PodDisruptionBudget constraints when possible but may violate them if necessary for critical workloads.
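
If you want pods to benefit from a higher scheduling priority without ever evicting other workloads, PriorityClass also supports a preemptionPolicy field; a sketch (the class name and value are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 500000
preemptionPolicy: Never        # jump ahead in the scheduling queue, but never evict other pods
globalDefault: false
description: "High priority without preemption"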

Priority Class Examples

# Mission-critical workloads (the system- name prefix and values above 1 billion are reserved for Kubernetes itself)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: cluster-critical
value: 1000000000
globalDefault: false
description: "Mission-critical pods that should be scheduled ahead of all other workloads"
---
# Production workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-high
value: 900000
globalDefault: false
description: "High priority production workloads"
---
# Development workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: development
value: -100000
globalDefault: false
description: "Development workloads that should yield to production"

Kubernetes automatically creates two PriorityClasses:

  • system-cluster-critical (2000000000)
  • system-node-critical (2000001000)

These are used for critical system components.
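
You can list every PriorityClass in the cluster, including the built-in system classes, with:

kubectl get priorityclasses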

Custom Scheduler

In some cases, the default scheduler may not meet your specific requirements. Kubernetes allows you to deploy custom schedulers that can run alongside the default scheduler:

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: my-custom-scheduler  # Specify custom scheduler
  containers:
  - name: container
    image: nginx

Custom schedulers can implement specialized algorithms for:

  • GPU scheduling
  • Topology-aware placement
  • Custom resource management
  • Domain-specific optimizations
  • Experimental scheduling policies
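
To confirm which scheduler actually placed a pod, you can inspect its Scheduled event (using the pod name from the example above):

kubectl get events --field-selector involvedObject.name=custom-scheduled-pod,reason=Scheduled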

Pod Topology Spread Constraints

Topology spread constraints provide a declarative way to distribute pods across failure domains such as regions, zones, and nodes:

apiVersion: v1
kind: Pod
metadata:
  name: web-server
  labels:
    app: web-server
spec:
  topologySpreadConstraints:
  - maxSkew: 1                     # Maximum difference between domains
    topologyKey: topology.kubernetes.io/zone  # Spread across zones
    whenUnsatisfiable: DoNotSchedule  # Hard requirement
    labelSelector:                 # Pods to count for spreading
      matchLabels:
        app: web-server
  - maxSkew: 2                     # Maximum difference between domains
    topologyKey: kubernetes.io/hostname  # Spread across nodes
    whenUnsatisfiable: ScheduleAnyway  # Soft requirement
    labelSelector:
      matchLabels:
        app: web-server
  containers:
  - name: nginx
    image: nginx

The key parameters are:

  • maxSkew: The maximum allowed difference between the number of matching pods in any two topology domains (for example, if three zones hold 3, 2, and 2 matching pods, the skew is 1)
  • topologyKey: The key of node labels defining the topology domain
  • whenUnsatisfiable: What to do if the constraint cannot be satisfied
    • DoNotSchedule: Treat as a hard requirement
    • ScheduleAnyway: Treat as a soft preference
  • labelSelector: Which pods to count when calculating the spread

Difference Between Topology Spread and Pod Anti-Affinity

While both can spread pods across domains, they work differently:

Feature      | Topology Spread Constraints    | Pod Anti-Affinity
-------------|--------------------------------|--------------------------------
Primary goal | Even numerical distribution    | Separation from specific pods
Mechanism    | Count-based balancing          | Binary avoidance
Flexibility  | Configurable allowed skew      | All-or-nothing
Complexity   | Simpler to express             | More complex for even spreading
Control      | Fine-grained numerical control | Boolean logic control

Advanced Scheduling Scenarios

Multi-Zone High Availability

Distributing a StatefulSet across multiple availability zones:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - mongodb
            topologyKey: topology.kubernetes.io/zone
      containers:
      - name: mongodb
        image: mongo:4.4

This ensures each MongoDB replica runs in a different availability zone for maximum resilience. Because the rule is a hard requirement, the cluster must span at least three zones; otherwise some replicas will remain Pending.

GPU Workload Placement

Placing machine learning workloads on GPU nodes:

apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  priorityClassName: high-priority
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: accelerator
            operator: In
            values:
            - nvidia-tesla-v100
  containers:
  - name: ml-container
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 2

This pod requires nodes with specific GPU hardware and requests 2 GPUs.
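
For this manifest to schedule, the target nodes need the accelerator label, and the nvidia.com/gpu resource must be exposed by the NVIDIA device plugin running on those nodes. Labeling is the same kubectl step as before (the node name is illustrative):

kubectl label nodes gpu-node-1 accelerator=nvidia-tesla-v100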

Mixed Workload Co-location

Co-locating complementary workloads for resource efficiency:

# CPU-intensive workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-intensive
spec:
  replicas: 5
  selector:
    matchLabels:
      workload-type: cpu-intensive
  template:
    metadata:
      labels:
        workload-type: cpu-intensive
    spec:
      containers:
      - name: cpu-app
        image: cpu-app:latest
        resources:
          requests:
            cpu: 2
            memory: 1Gi
          limits:
            cpu: 3
            memory: 1.5Gi
---
# Memory-intensive workload with co-location preference
apiVersion: apps/v1
kind: Deployment
metadata:
  name: memory-intensive
spec:
  replicas: 5
  selector:
    matchLabels:
      workload-type: memory-intensive
  template:
    metadata:
      labels:
        workload-type: memory-intensive
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: workload-type
                  operator: In
                  values:
                  - cpu-intensive
              topologyKey: kubernetes.io/hostname
      containers:
      - name: memory-app
        image: memory-app:latest
        resources:
          requests:
            cpu: 0.5
            memory: 4Gi
          limits:
            cpu: 1
            memory: 6Gi

This setup places memory-intensive workloads on the same nodes as CPU-intensive workloads for complementary resource usage.
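
A quick way to verify the co-location after rollout is to list both workloads sorted by node (labels taken from the manifests above):

kubectl get pods -l 'workload-type in (cpu-intensive,memory-intensive)' -o wide --sort-by=.spec.nodeName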

Practical Implementation Tips

Performance Considerations

Advanced scheduling features can impact scheduler performance:

  1. Pod affinity/anti-affinity (inter-pod affinity)
    • Has O(n²) computational complexity in the worst case, making it considerably more expensive than node affinity
    • Can significantly slow down scheduling in large clusters with many pods and complex rules
    • Prefer wider topology domains where possible, or use topology spread constraints for better performance
  2. Scheduler throughput
    • Complex rules reduce scheduler throughput
    • Monitor scheduler latency in large clusters
    • Consider using multiple scheduler profiles for different workloads (see the sketch after this list)
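
As a sketch of that last point, a kube-scheduler configuration file (passed to the scheduler via its --config flag) can define an additional profile that disables the expensive inter-pod affinity scoring plugin; the high-throughput profile name here is illustrative:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: high-throughput
  plugins:
    score:
      disabled:
      - name: InterPodAffinity   # skip inter-pod affinity scoring for this profile

Pods opt into the lighter profile by setting spec.schedulerName: high-throughput.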

Best Practices

  1. Start simple
    • Begin with node selectors for basic constraints
    • Add node affinity for more complex requirements
    • Use pod affinity/anti-affinity only when necessary
    • Apply taints and tolerations for dedicated nodes
  2. Balance flexibility and constraints
    • Use soft preferences (preferredDuringScheduling) when possible
    • Reserve hard requirements for critical needs
    • Design for failure by allowing flexibility in placement
  3. Documentation
    • Document your node labeling scheme
    • Create clear diagrams for complex scheduling rules
    • Establish consistent naming conventions for labels and taints
  4. Monitoring
    • Watch for pods stuck in Pending state (see the commands after this list)
    • Monitor scheduler latency metrics
    • Set up alerts for scheduling failures
    • Regularly review scheduling decisions
  5. Testing
    • Simulate node failures to verify resilience
    • Test scheduling behavior in non-production environments
    • Verify that scheduling rules work as expected during upgrades
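
Two quick checks that cover the monitoring points above:

# Pods the scheduler has not been able to place
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Recent scheduling failures and their reasons
kubectl get events --all-namespaces --field-selector=reason=FailedScheduling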

Conclusion

Kubernetes advanced scheduling features provide powerful tools for optimizing workload placement. By using node affinity, pod affinity/anti-affinity, taints and tolerations, and other scheduling mechanisms, you can create sophisticated deployment strategies that maximize performance, availability, and resource utilization.

The key to successful scheduling is understanding your application requirements and cluster topology, then applying the right combination of scheduling features to achieve your goals. Start with simple rules and gradually add complexity as needed, always testing and validating that your scheduling decisions behave as expected.

By mastering these advanced scheduling concepts, you can ensure your Kubernetes workloads are deployed optimally across your infrastructure, improving reliability and efficiency.