
Advanced Scheduling & Affinity

Understanding Kubernetes advanced scheduling capabilities and affinity rules

Introduction to Kubernetes Scheduling

Kubernetes scheduling is the process of assigning pods to nodes in the cluster. While the default scheduler works well for many scenarios, advanced applications often require more sophisticated placement strategies. Kubernetes provides powerful mechanisms to influence scheduling decisions, allowing administrators and developers to optimize for hardware efficiency, workload co-location, and availability.

The Kubernetes scheduler is a control plane component that watches for newly created pods with no assigned node and selects a node for them to run on. This decision-making process considers:

  • Resource requirements and availability
  • Hardware/software constraints
  • Affinity and anti-affinity specifications
  • Taints and tolerations
  • Priority and preemption
  • Custom scheduler policies

Advanced scheduling capabilities allow you to implement complex deployment strategies, enforce business requirements, and optimize resource utilization across your cluster.
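
When the scheduler cannot satisfy these constraints, the pod stays Pending and the reason is recorded as events on the pod. A quick way to investigate (the pod name here is just a placeholder):

# Look at the Events section for FailedScheduling messages
kubectl describe pod my-pending-pod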

Node Selectors

Node selectors provide a simple way to constrain pods to nodes with specific labels. This is the most straightforward way to control pod placement.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-machine-learning
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.2.2-base-ubuntu20.04
  nodeSelector:
    hardware: gpu
    gpu-type: nvidia-tesla

To use node selectors:

  1. Label your nodes appropriately:
    kubectl label nodes node1 hardware=gpu gpu-type=nvidia-tesla
    
  2. Add a nodeSelector field to your pod specification targeting those labels
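
To confirm the labels are in place before deploying, you can query nodes by those labels (values taken from the example above):

# List only the nodes that carry both labels
kubectl get nodes -l hardware=gpu,gpu-type=nvidia-tesla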

While simple to use, node selectors have limitations:

  • Only support equality-based requirements
  • Cannot express more complex conditions (OR, NOT operations)
  • Limited expressiveness for complex deployment scenarios

For more sophisticated requirements, Kubernetes offers node affinity.

Node Affinity

Node Affinity Basics

  • More expressive matching: Supports complex logical operations beyond simple equality
  • Rich selector syntax: Uses the same selector syntax as labels
  • Two types:
    • requiredDuringSchedulingIgnoredDuringExecution: Hard requirement (must be met)
    • preferredDuringSchedulingIgnoredDuringExecution: Soft preference (preferred but not mandatory)
  • Operators: Includes In, NotIn, Exists, DoesNotExist, Gt, Lt
  • Weight-based preferences: Assign priorities to different requirements
  • Future support: A requiredDuringSchedulingRequiredDuringExecution variant has been proposed but is not yet implemented

Hard Requirements Example

apiVersion: v1
kind: Pod
metadata:
  name: nginx-with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
            - arm64

This pod will only be scheduled on Linux nodes with either amd64 or arm64 architecture.

Soft Preferences Example

apiVersion: v1
kind: Pod
metadata:
  name: nginx-with-preferred-node-affinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: availability-zone
            operator: In
            values:
            - zone1
      - weight: 20
        preference:
          matchExpressions:
          - key: node-type
            operator: In
            values:
            - compute-optimized

This pod prefers to run in zone1 (weight 80) and on compute-optimized nodes (weight 20), but will still run elsewhere if these preferences cannot be satisfied.

The "IgnoredDuringExecution" part of these rules means that if a node's labels change after a pod is scheduled, the pod won't be evicted. Future versions of Kubernetes may support "RequiredDuringExecution" which would evict pods when nodes no longer satisfy the requirements.

Pod Affinity and Anti-Affinity

While node affinity controls which nodes pods can run on, pod affinity and anti-affinity control how pods are scheduled relative to other pods.

Topology Key

The topologyKey field is crucial for pod affinity/anti-affinity. It defines the domain over which the rule applies:

  • kubernetes.io/hostname: Node level (most restrictive)
  • topology.kubernetes.io/zone: Availability zone level
  • topology.kubernetes.io/region: Region level (least restrictive)
  • Custom topology domains: Any node label can be used

A broader topology key (such as zone or region) applies the rule across larger failure domains, so pods only need to satisfy it at that coarse level; a narrower key such as kubernetes.io/hostname enforces the rule node by node, giving the most fine-grained control over placement.
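
On most managed clusters the zone and region labels are set automatically; you can check what your nodes carry with:

# Show nodes with their zone and region labels as extra columns
kubectl get nodes -L topology.kubernetes.io/zone -L topology.kubernetes.io/region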

Complete Pod Affinity Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      affinity:
        # Attract to cache pods (same zone is sufficient)
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - cache
              topologyKey: topology.kubernetes.io/zone
        
        # Repel from other web-server pods (different nodes required)
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-server
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web-app
        image: nginx:1.19

This deployment:

  1. Prefers to place pods in zones that have cache pods
  2. Strictly requires spreading web-server pods across different nodes

Taints and Tolerations

Taints and tolerations work together to ensure pods are not scheduled onto inappropriate nodes:

  • Taints are applied to nodes to repel certain pods
  • Tolerations are applied to pods to allow (but not require) scheduling on nodes with matching taints

This mechanism is complementary to node affinity:

  • Node affinity is a property of pods that attracts them to nodes
  • Taints are a property of nodes that repel pods
  • Tolerations allow pods to overcome taints

Tainting Nodes

Nodes can be tainted with key-value pairs and an effect:

# Format: kubectl taint nodes <node-name> <key>=<value>:<effect>
kubectl taint nodes node1 dedicated=special-workload:NoSchedule

The three possible effects are:

  • NoSchedule: Pods won't be scheduled on the node (unless they have matching tolerations)
  • PreferNoSchedule: The system will try to avoid placing pods on the node (soft version)
  • NoExecute: New pods won't be scheduled AND existing pods will be evicted if they don't have matching tolerations

Kubernetes automatically adds some taints to nodes with issues:

  • node.kubernetes.io/not-ready: Node is not ready
  • node.kubernetes.io/unreachable: Node is unreachable from the node controller
  • node.kubernetes.io/pid-pressure: Node has PID pressure
  • node.kubernetes.io/memory-pressure: Node has memory pressure
  • node.kubernetes.io/disk-pressure: Node has disk pressure
  • node.kubernetes.io/network-unavailable: Node's network is unavailable
  • node.kubernetes.io/unschedulable: Node is cordoned
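
You can inspect a node's current taints, and remove a taint you applied earlier by appending a minus sign:

# Show the taints currently applied to a node (node name is illustrative)
kubectl describe node node1 | grep -i taints

# Remove the taint added in the earlier example
kubectl taint nodes node1 dedicated=special-workload:NoSchedule-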

Adding Tolerations to Pods

Pods can specify tolerations to match node taints:

apiVersion: v1
kind: Pod
metadata:
  name: specialized-pod
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "special-workload"
    effect: "NoSchedule"
  containers:
  - name: specialized-container
    image: specialized-image

Tolerations can use two operators:

  • Equal: Matches when key, value, and effect are equal
  • Exists: Matches when key and effect exist (ignores value)

You can also create broader tolerations (see the example after this list):

  • Omitting effect matches all effects with the given key
  • Omitting key (with operator Exists) matches all taints
  • Setting operator: Exists with no value matches any value
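
For example, a pod that should tolerate every taint (a pattern sometimes used for node-level agents) needs only a single blanket toleration:

tolerations:
- operator: "Exists"   # no key, value, or effect: matches all taints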

Tolerations and Eviction

The NoExecute effect evicts running pods that do not tolerate the taint. An optional tolerationSeconds field on NoExecute tolerations controls how long a tolerating pod may keep running after the taint is added:

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300  # Pod is evicted after 300 seconds

The tolerationSeconds field defines how long the pod can keep running on a node with the matching taint before being evicted. Kubernetes automatically adds tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300 to pods that do not specify them, which is why pods on an unreachable node are typically evicted after about five minutes.

Use Cases for Taints and Tolerations

  1. Dedicated Nodes
    # Taint the nodes
    kubectl taint nodes node1 dedicated=gpu:NoSchedule
    
    # Only pods with matching toleration will run on this node
    
  2. Special Hardware
    # Pod with toleration for special hardware
    tolerations:
    - key: "hardware"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
    
  3. Control Plane Isolation
    # Kubernetes automatically taints control plane nodes
    node-role.kubernetes.io/control-plane:NoSchedule
    
    # Toleration to allow running on control plane nodes
    tolerations:
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Exists"
      effect: "NoSchedule"
    

Node Affinity vs Taints and Tolerations

Both mechanisms control pod placement but with different approaches:

Feature                 | Node Affinity           | Taints and Tolerations
------------------------|-------------------------|----------------------------------------
Primary purpose         | Attract pods to nodes   | Repel pods from nodes
Direction               | Pods select nodes       | Nodes reject pods
Default behavior        | Pods go anywhere        | Pods avoid tainted nodes
Implementation          | Pod specification       | Node configuration + pod specification
Selection mechanism     | Node labels             | Node taints + pod tolerations
Effect on existing pods | None                    | Can evict (with NoExecute)
Logical operations      | Complex (AND, OR, NOT)  | Simple matching

For complete control over pod placement, you often need to use both mechanisms together (a combined example follows this list):

  • Node affinity to ensure pods go to specific nodes
  • Taints and tolerations to ensure only specific pods go to those nodes
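
A minimal sketch of the combined pattern for a dedicated node pool (the team=data-platform label, taint, and pod name are illustrative):

# Label and taint the dedicated nodes
kubectl label nodes node1 team=data-platform
kubectl taint nodes node1 team=data-platform:NoSchedule

# Pod that both targets and tolerates those nodes
apiVersion: v1
kind: Pod
metadata:
  name: data-platform-job
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: team
            operator: In
            values:
            - data-platform
  tolerations:
  - key: "team"
    operator: "Equal"
    value: "data-platform"
    effect: "NoSchedule"
  containers:
  - name: worker
    image: busybox:1.36
    command: ["sleep", "3600"]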

Pod Priority and Preemption

Priority and preemption allow you to influence the scheduling order and eviction behavior when resources are constrained:

PriorityClass

PriorityClass is a cluster-wide object that defines priority levels:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical production workloads"

The value field defines the priority (higher numbers = higher priority). The globalDefault field (when true) makes this the default for pods without a specified priority.

Assigning Priority to Pods

apiVersion: v1
kind: Pod
metadata:
  name: critical-web-app
spec:
  priorityClassName: high-priority
  containers:
  - name: web-app
    image: nginx

This pod will have priority 1000000 based on the referenced PriorityClass.

Preemption Behavior

When a higher-priority pod cannot be scheduled due to resource constraints:

  1. The scheduler identifies lower-priority pods that could be evicted
  2. The scheduler preempts (evicts) enough lower-priority pods
  3. The higher-priority pod can then be scheduled
  4. Preempted pods go back to pending state and may be rescheduled

Preemption respects PodDisruptionBudget constraints when possible but may violate them if necessary for critical workloads.
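
If you want pods to benefit from a higher scheduling priority without ever evicting other workloads, PriorityClass also supports a preemptionPolicy field; a sketch (the class name and value are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 500000
preemptionPolicy: Never        # jump ahead in the scheduling queue, but never evict other pods
globalDefault: false
description: "High priority without preemption"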

Priority Class Examples

# Mission-critical workloads (the system- name prefix and values above 1 billion are reserved for Kubernetes itself)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: cluster-critical
value: 1000000000
globalDefault: false
description: "Mission-critical pods that should be scheduled ahead of all other workloads"
---
# Production workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-high
value: 900000
globalDefault: false
description: "High priority production workloads"
---
# Development workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: development
value: -100000
globalDefault: false
description: "Development workloads that should yield to production"

Kubernetes automatically creates two PriorityClasses:

  • system-cluster-critical (2000000000)
  • system-node-critical (2000001000)

These are used for critical system components.
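
You can list every PriorityClass in the cluster, including the built-in system classes, with:

kubectl get priorityclasses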

Custom Scheduler

In some cases, the default scheduler may not meet your specific requirements. Kubernetes allows you to deploy custom schedulers that can run alongside the default scheduler:

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: my-custom-scheduler  # Specify custom scheduler
  containers:
  - name: container
    image: nginx

Custom schedulers can implement specialized algorithms for:

  • GPU scheduling
  • Topology-aware placement
  • Custom resource management
  • Domain-specific optimizations
  • Experimental scheduling policies
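
To confirm which scheduler actually placed a pod, you can inspect its Scheduled event (using the pod name from the example above):

kubectl get events --field-selector involvedObject.name=custom-scheduled-pod,reason=Scheduled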

Pod Topology Spread Constraints

Topology spread constraints provide a declarative way to distribute pods across failure domains such as regions, zones, and nodes:

apiVersion: v1
kind: Pod
metadata:
  name: web-server
  labels:
    app: web-server
spec:
  topologySpreadConstraints:
  - maxSkew: 1                     # Maximum difference between domains
    topologyKey: topology.kubernetes.io/zone  # Spread across zones
    whenUnsatisfiable: DoNotSchedule  # Hard requirement
    labelSelector:                 # Pods to count for spreading
      matchLabels:
        app: web-server
  - maxSkew: 2                     # Maximum difference between domains
    topologyKey: kubernetes.io/hostname  # Spread across nodes
    whenUnsatisfiable: ScheduleAnyway  # Soft requirement
    labelSelector:
      matchLabels:
        app: web-server
  containers:
  - name: nginx
    image: nginx

The key parameters are:

  • maxSkew: The maximum allowed difference between the number of matching pods in any two topology domains (for example, if three zones hold 3, 2, and 2 matching pods, the skew is 1)
  • topologyKey: The key of node labels defining the topology domain
  • whenUnsatisfiable: What to do if the constraint cannot be satisfied
    • DoNotSchedule: Treat as a hard requirement
    • ScheduleAnyway: Treat as a soft preference
  • labelSelector: Which pods to count when calculating the spread

Difference Between Topology Spread and Pod Anti-Affinity

While both can spread pods across domains, they work differently:

Feature      | Topology Spread Constraints    | Pod Anti-Affinity
-------------|--------------------------------|--------------------------------
Primary goal | Even numerical distribution    | Separation from specific pods
Mechanism    | Count-based balancing          | Binary avoidance
Flexibility  | Configurable allowed skew      | All-or-nothing
Complexity   | Simpler to express             | More complex for even spreading
Control      | Fine-grained numerical control | Boolean logic control

Advanced Scheduling Scenarios

Multi-Zone High Availability

Distributing a StatefulSet across multiple availability zones:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - mongodb
            topologyKey: topology.kubernetes.io/zone
      containers:
      - name: mongodb
        image: mongo:4.4

This ensures each MongoDB replica runs in a different availability zone for maximum resilience. Because the rule is a hard requirement, the cluster must span at least three zones; otherwise some replicas will remain Pending.

GPU Workload Placement

Placing machine learning workloads on GPU nodes:

apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  priorityClassName: high-priority
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: accelerator
            operator: In
            values:
            - nvidia-tesla-v100
  containers:
  - name: ml-container
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 2

This pod requires nodes with specific GPU hardware and requests 2 GPUs.
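
For this manifest to schedule, the target nodes need the accelerator label, and the nvidia.com/gpu resource must be exposed by the NVIDIA device plugin running on those nodes. Labeling is the same kubectl step as before (the node name is illustrative):

kubectl label nodes gpu-node-1 accelerator=nvidia-tesla-v100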

Mixed Workload Co-location

Co-locating complementary workloads for resource efficiency:

# CPU-intensive workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-intensive
spec:
  replicas: 5
  selector:
    matchLabels:
      workload-type: cpu-intensive
  template:
    metadata:
      labels:
        workload-type: cpu-intensive
    spec:
      containers:
      - name: cpu-app
        image: cpu-app:latest
        resources:
          requests:
            cpu: 2
            memory: 1Gi
          limits:
            cpu: 3
            memory: 1.5Gi
---
# Memory-intensive workload with co-location preference
apiVersion: apps/v1
kind: Deployment
metadata:
  name: memory-intensive
spec:
  replicas: 5
  selector:
    matchLabels:
      workload-type: memory-intensive
  template:
    metadata:
      labels:
        workload-type: memory-intensive
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: workload-type
                  operator: In
                  values:
                  - cpu-intensive
              topologyKey: kubernetes.io/hostname
      containers:
      - name: memory-app
        image: memory-app:latest
        resources:
          requests:
            cpu: 0.5
            memory: 4Gi
          limits:
            cpu: 1
            memory: 6Gi

This setup places memory-intensive workloads on the same nodes as CPU-intensive workloads for complementary resource usage.
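
A quick way to verify the co-location after rollout is to list both workloads sorted by node (labels taken from the manifests above):

kubectl get pods -l 'workload-type in (cpu-intensive,memory-intensive)' -o wide --sort-by=.spec.nodeName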

Practical Implementation Tips

Performance Considerations

Advanced scheduling features can impact scheduler performance:

  1. Pod affinity/anti-affinity (inter-pod affinity)
    • Has O(n²) computational complexity in the worst case, making it considerably more expensive than node affinity
    • Can significantly slow down scheduling in large clusters with many pods and complex rules
    • Prefer wider topology domains where possible, or use topology spread constraints for better performance
  2. Scheduler throughput
    • Complex rules reduce scheduler throughput
    • Monitor scheduler latency in large clusters
    • Consider using multiple scheduler profiles for different workloads (see the sketch after this list)
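
As a sketch of that last point, a kube-scheduler configuration file (passed to the scheduler via its --config flag) can define an additional profile that disables the expensive inter-pod affinity scoring plugin; the high-throughput profile name here is illustrative:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: high-throughput
  plugins:
    score:
      disabled:
      - name: InterPodAffinity   # skip inter-pod affinity scoring for this profile

Pods opt into the lighter profile by setting spec.schedulerName: high-throughput.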

Best Practices

  1. Start simple
    • Begin with node selectors for basic constraints
    • Add node affinity for more complex requirements
    • Use pod affinity/anti-affinity only when necessary
    • Apply taints and tolerations for dedicated nodes
  2. Balance flexibility and constraints
    • Use soft preferences (preferredDuringScheduling) when possible
    • Reserve hard requirements for critical needs
    • Design for failure by allowing flexibility in placement
  3. Documentation
    • Document your node labeling scheme
    • Create clear diagrams for complex scheduling rules
    • Establish consistent naming conventions for labels and taints
  4. Monitoring
    • Watch for pods stuck in Pending state (see the commands after this list)
    • Monitor scheduler latency metrics
    • Set up alerts for scheduling failures
    • Regularly review scheduling decisions
  5. Testing
    • Simulate node failures to verify resilience
    • Test scheduling behavior in non-production environments
    • Verify that scheduling rules work as expected during upgrades
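
Two quick checks that cover the monitoring points above:

# Pods the scheduler has not been able to place
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Recent scheduling failures and their reasons
kubectl get events --all-namespaces --field-selector=reason=FailedScheduling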

Conclusion

Kubernetes advanced scheduling features provide powerful tools for optimizing workload placement. By using node affinity, pod affinity/anti-affinity, taints and tolerations, and other scheduling mechanisms, you can create sophisticated deployment strategies that maximize performance, availability, and resource utilization.

The key to successful scheduling is understanding your application requirements and cluster topology, then applying the right combination of scheduling features to achieve your goals. Start with simple rules and gradually add complexity as needed, always testing and validating that your scheduling decisions behave as expected.

By mastering these advanced scheduling concepts, you can ensure your Kubernetes workloads are deployed optimally across your infrastructure, improving reliability and efficiency.