Advanced Scheduling & Affinity
Understanding Kubernetes advanced scheduling capabilities and affinity rules
Introduction to Kubernetes Scheduling
Kubernetes scheduling is the process of assigning pods to nodes in the cluster. While the default scheduler works well for many scenarios, advanced applications often require more sophisticated placement strategies. Kubernetes provides powerful mechanisms to influence scheduling decisions, allowing administrators and developers to optimize for hardware efficiency, workload co-location, and availability.
The Kubernetes scheduler is a control plane component that watches for newly created pods with no assigned node and selects a node for them to run on. This decision-making process considers:
- Resource requirements and availability
- Hardware/software constraints
- Affinity and anti-affinity specifications
- Taints and tolerations
- Priority and preemption
- Custom scheduler policies
Advanced scheduling capabilities allow you to implement complex deployment strategies, enforce business requirements, and optimize resource utilization across your cluster.
Node Selectors
Node selectors provide a simple way to constrain pods to nodes with specific labels. This is the most straightforward way to control pod placement.
To use node selectors:
- Label your nodes appropriately
- Add a `nodeSelector` field to your pod specification targeting those labels (both steps are sketched below)
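A minimal sketch of both steps, assuming an illustrative `disktype=ssd` label (node name and image are placeholders):

```yaml
# Label a node (illustrative):
#   kubectl label nodes node1 disktype=ssd
apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  nodeSelector:
    disktype: ssd        # pod only schedules onto nodes carrying this label
  containers:
  - name: app
    image: nginx         # placeholder image
```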
While simple to use, node selectors have limitations:
- Only support equality-based requirements
- Cannot express more complex conditions (OR, NOT operations)
- Limited expressiveness for complex deployment scenarios
For more sophisticated requirements, Kubernetes offers node affinity.
Node Affinity
Node Affinity Basics
- More expressive matching: Supports complex logical operations beyond simple equality
- Rich selector syntax: Uses the same selector syntax as labels
- Two types:
  - `requiredDuringSchedulingIgnoredDuringExecution`: Hard requirement (must be met)
  - `preferredDuringSchedulingIgnoredDuringExecution`: Soft preference (preferred but not mandatory)
- Operators: Includes `In`, `NotIn`, `Exists`, `DoesNotExist`, `Gt`, `Lt`
- Weight-based preferences: Assign priorities to different requirements
- Future support: Will eventually support `requiredDuringSchedulingRequiredDuringExecution`
Hard Requirements Example
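A sketch of such a pod spec (the pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: os-arch-constrained
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values: ["linux"]
          - key: kubernetes.io/arch
            operator: In
            values: ["amd64", "arm64"]
  containers:
  - name: app
    image: nginx         # placeholder image
```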
This pod will only be scheduled on Linux nodes with either amd64 or arm64 architecture.
Soft Preferences Example
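A sketch of such a pod spec; the `node-type=compute-optimized` label is an assumed custom node label:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: zone-preferred
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["zone1"]
      - weight: 20
        preference:
          matchExpressions:
          - key: node-type            # assumed custom node label
            operator: In
            values: ["compute-optimized"]
  containers:
  - name: app
    image: nginx         # placeholder image
```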
This pod prefers to run in zone1 (weight 80) and on compute-optimized nodes (weight 20), but will still run elsewhere if these preferences cannot be satisfied.
The "IgnoredDuringExecution" part of these rules means that if a node's labels change after a pod is scheduled, the pod won't be evicted. Future versions of Kubernetes may support "RequiredDuringExecution" which would evict pods when nodes no longer satisfy the requirements.
Pod Affinity and Anti-Affinity
While node affinity controls which nodes pods can run on, pod affinity and anti-affinity control how pods are scheduled relative to other pods.
Pod Affinity
Pod affinity attracts pods to nodes that already run specific pods. This is useful for:
- Co-locating related workloads
  - Placing frontend and backend components together
  - Reducing network latency between communicating services
  - Optimizing cache sharing between applications
- Hardware utilization
  - Packing complementary workloads on the same node
  - Balancing CPU-intensive and memory-intensive applications
  - Maximizing resource efficiency
Example pod affinity rule:
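A sketch of such a rule, shown as the `affinity` stanza of a pod spec (node-level `topologyKey` assumed):

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values: ["redis"]
      topologyKey: kubernetes.io/hostname   # co-locate at the node level
```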
This ensures the pod runs on nodes that are already running pods with the label `app=redis`.
Pod Anti-Affinity
Pod anti-affinity repels pods from nodes that already run specific pods. This is useful for:
- High availability
  - Spreading replicas across different nodes
  - Reducing correlated failures
  - Improving fault tolerance
- Resource contention
  - Preventing competing workloads on the same node
  - Avoiding noisy neighbor problems
  - Ensuring quality of service
Example pod anti-affinity rule:
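A sketch of such a rule, again as the `affinity` stanza of a pod spec (node-level `topologyKey` assumed):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values: ["web-server"]
      topologyKey: kubernetes.io/hostname   # never share a node with app=web-server pods
```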
This ensures the pod doesn't run on nodes that are already running pods with the label `app=web-server`.
Topology Key
The `topologyKey` field is crucial for pod affinity/anti-affinity. It defines the domain over which the rule applies:
- `kubernetes.io/hostname`: Node level (most restrictive)
- `topology.kubernetes.io/zone`: Availability zone level
- `topology.kubernetes.io/region`: Region level (least restrictive)
- Custom topology domains: Any node label can be used
Using a broader topology key creates a wider domain for pod distribution, while a narrower key creates more concentrated placement.
Complete Pod Affinity Example
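A sketch of such a deployment, assuming the cache pods carry an `app=cache` label (names and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["cache"]               # assumed label on the cache pods
              topologyKey: topology.kubernetes.io/zone
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["web-server"]
            topologyKey: kubernetes.io/hostname   # one web-server pod per node
      containers:
      - name: web-server
        image: nginx                              # placeholder image
```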
This deployment:
- Prefers to place pods in zones that have cache pods
- Strictly requires spreading web-server pods across different nodes
Taints and Tolerations
Taints and tolerations work together to ensure pods are not scheduled onto inappropriate nodes:
- Taints are applied to nodes to repel certain pods
- Tolerations are applied to pods to allow (but not require) scheduling on nodes with matching taints
This mechanism is complementary to node affinity:
- Node affinity is a property of pods that attracts them to nodes
- Taints are a property of nodes that repel pods
- Tolerations allow pods to overcome taints
Tainting Nodes
Nodes can be tainted with key-value pairs and an effect:
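For example, with kubectl (node name, key, and value are illustrative):

```sh
# Add a taint that repels pods lacking a matching toleration
kubectl taint nodes node1 dedicated=gpu:NoSchedule

# Remove the same taint (note the trailing dash)
kubectl taint nodes node1 dedicated=gpu:NoSchedule-
```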
The three possible effects are:
- `NoSchedule`: Pods won't be scheduled on the node (unless they have matching tolerations)
- `PreferNoSchedule`: The system will try to avoid placing pods on the node (soft version)
- `NoExecute`: New pods won't be scheduled AND existing pods will be evicted if they don't have matching tolerations
Kubernetes automatically adds some taints to nodes with issues:
- `node.kubernetes.io/not-ready`: Node is not ready
- `node.kubernetes.io/unreachable`: Node is unreachable from the node controller
- `node.kubernetes.io/out-of-disk`: Node has no free disk space
- `node.kubernetes.io/memory-pressure`: Node has memory pressure
- `node.kubernetes.io/disk-pressure`: Node has disk pressure
- `node.kubernetes.io/network-unavailable`: Node's network is unavailable
- `node.kubernetes.io/unschedulable`: Node is cordoned
Adding Tolerations to Pods
Pods can specify tolerations to match node taints:
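A sketch of a pod-spec fragment tolerating the taint above (key and value are illustrative):

```yaml
tolerations:
- key: "dedicated"
  operator: "Equal"          # key, value, and effect must all match the taint
  value: "gpu"
  effect: "NoSchedule"
- key: "example.com/special"
  operator: "Exists"         # matches any value for this key
  effect: "NoSchedule"
```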
Tolerations can use two operators:
- `Equal`: Matches when key, value, and effect are equal
- `Exists`: Matches when key and effect exist (ignores value)
You can also create broader tolerations:
- Omitting `effect` matches all effects with the given key
- Omitting `key` (with operator `Exists`) matches all taints
- Setting `operator: Exists` with no `value` matches any value
Tolerations and Eviction
The `NoExecute` effect evicts pods that don't tolerate the taint. For `NoExecute` tolerations, an optional `tolerationSeconds` field controls this behavior:
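A sketch of a `NoExecute` toleration with a bounded eviction delay:

```yaml
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # stay up to 5 minutes after the taint appears
```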
The `tolerationSeconds` field defines how long the pod can keep running on a node with the matching taint before being evicted.
Use Cases for Taints and Tolerations
- Dedicated Nodes: reserve a set of nodes for a particular team or workload by tainting them, and give only that workload a matching toleration
- Special Hardware: keep general workloads off nodes with GPUs or other scarce hardware
- Control Plane Isolation: keep user workloads off control plane nodes, which carry the `node-role.kubernetes.io/control-plane:NoSchedule` taint
Node Affinity vs Taints and Tolerations
Both mechanisms control pod placement but with different approaches:
| Feature | Node Affinity | Taints and Tolerations |
|---|---|---|
| Primary purpose | Attract pods to nodes | Repel pods from nodes |
| Direction | Pods select nodes | Nodes reject pods |
| Default behavior | Pods go anywhere | Pods avoid tainted nodes |
| Implementation | Pod specification | Node configuration + Pod specification |
| Selection mechanism | Node labels | Node taints + Pod tolerations |
| Effect on existing pods | None | Can evict (with NoExecute) |
| Logical operations | Complex (AND, OR, NOT) | Simple matching |
For complete control over pod placement, you often need to use both mechanisms together:
- Node affinity to ensure pods go to specific nodes
- Taints and tolerations to ensure only specific pods go to those nodes
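A sketch of the combined pattern, assuming an illustrative `dedicated=analytics` node label and taint:

```yaml
# Node setup (illustrative):
#   kubectl label nodes node1 dedicated=analytics
#   kubectl taint nodes node1 dedicated=analytics:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: analytics-worker
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values: ["analytics"]      # only go to the dedicated nodes
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "analytics"
    effect: "NoSchedule"               # and be allowed onto them
  containers:
  - name: worker
    image: busybox                     # placeholder image
    command: ["sleep", "3600"]
```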
Pod Priority and Preemption
Priority and preemption allow you to influence the scheduling order and eviction behavior when resources are constrained:
PriorityClass
PriorityClass is a cluster-wide object that defines priority levels:
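A sketch of a PriorityClass (the name and description are illustrative; the value matches the pod example below):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "For important application workloads."
```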
The `value` field defines the priority (higher numbers = higher priority). The `globalDefault` field (when true) makes this the default for pods without a specified priority.
Assigning Priority to Pods
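A sketch of a pod referencing the PriorityClass above (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  priorityClassName: high-priority   # resolved to priority value 1000000
  containers:
  - name: app
    image: nginx                     # placeholder image
```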
This pod will have priority 1000000 based on the referenced PriorityClass.
Preemption Behavior
When a higher-priority pod cannot be scheduled due to resource constraints:
- The scheduler identifies lower-priority pods that could be evicted
- The scheduler preempts (evicts) enough lower-priority pods
- The higher-priority pod can then be scheduled
- Preempted pods go back to pending state and may be rescheduled
Preemption respects PodDisruptionBudget constraints when possible but may violate them if necessary for critical workloads.
Priority Class Examples
Kubernetes automatically creates two PriorityClasses:
- `system-cluster-critical` (2000000000)
- `system-node-critical` (2000001000)
These are used for critical system components.
Custom Scheduler
In some cases, the default scheduler may not meet your specific requirements. Kubernetes allows you to deploy custom schedulers that can run alongside the default scheduler:
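A pod opts into a custom scheduler through `schedulerName`; a minimal sketch (the scheduler name is illustrative and must match the deployed scheduler):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: my-custom-scheduler   # assumed name of the custom scheduler
  containers:
  - name: app
    image: nginx                       # placeholder image
```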
Custom schedulers can implement specialized algorithms for:
- GPU scheduling
- Topology-aware placement
- Custom resource management
- Domain-specific optimizations
- Experimental scheduling policies
Pod Topology Spread Constraints
Topology spread constraints provide a declarative way to distribute pods across failure domains such as regions, zones, and nodes:
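A sketch of a pod using a zone-level spread constraint (the `app=web` label and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spread-example
  labels:
    app: web
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web           # count pods with this label when computing skew
  containers:
  - name: app
    image: nginx           # placeholder image
```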
The key parameters are:
- `maxSkew`: The maximum difference between the number of pods in any two topology domains
- `topologyKey`: The key of node labels defining the topology domain
- `whenUnsatisfiable`: What to do if the constraint cannot be satisfied
  - `DoNotSchedule`: Treat as a hard requirement
  - `ScheduleAnyway`: Treat as a soft preference
- `labelSelector`: Which pods to count when calculating the spread
Difference Between Topology Spread and Pod Anti-Affinity
While both can spread pods across domains, they work differently:
| Feature | Topology Spread Constraints | Pod Anti-Affinity |
|---|---|---|
| Primary goal | Even numerical distribution | Separation from specific pods |
| Mechanism | Count-based balancing | Binary avoidance |
| Flexibility | Configure allowed skew | All-or-nothing |
| Complexity | Simpler to express | More complex for even spreading |
| Control | Fine-grained numerical control | Boolean logic control |
Advanced Scheduling Scenarios
Multi-Zone High Availability
Distributing a StatefulSet across multiple availability zones:
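A sketch of such a StatefulSet, using zone-level pod anti-affinity (names, image, and replica count are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  serviceName: mongodb
  replicas: 3
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: mongodb
            topologyKey: topology.kubernetes.io/zone   # at most one replica per zone
      containers:
      - name: mongodb
        image: mongo                                   # placeholder image
```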
This ensures each MongoDB replica is in a different availability zone for maximum resilience.
GPU Workload Placement
Placing machine learning workloads on GPU nodes:
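A sketch of such a pod, assuming a custom `gpu-type` node label and the standard `nvidia.com/gpu` device-plugin resource:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type                     # assumed custom node label
            operator: In
            values: ["nvidia-a100"]           # assumed label value
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 2                     # request 2 GPUs
```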
This pod requires nodes with specific GPU hardware and requests 2 GPUs.
Mixed Workload Co-location
Co-locating complementary workloads for resource efficiency:
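A sketch of the affinity stanza on the memory-intensive deployment, assuming the CPU-intensive pods carry a `workload-type=cpu-intensive` label:

```yaml
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            workload-type: cpu-intensive   # assumed label on the CPU-heavy pods
        topologyKey: kubernetes.io/hostname
```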
This setup places memory-intensive workloads on the same nodes as CPU-intensive workloads for complementary resource usage.
Practical Implementation Tips
Testing Scheduling Rules
Always test your scheduling rules to ensure they work as expected:
- Dry-run checks (see the command sketch after this list)
- Describe pods to see scheduling decisions
  - Look at the "Events" section to see the scheduling explanation
- Use pod conditions
- Check node affinity score
  - Look for FailedScheduling events with detailed explanations
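A sketch of the corresponding kubectl checks (manifest path and pod name are placeholders):

```sh
# Validate a manifest against the API server without creating it
kubectl apply -f pod.yaml --dry-run=server

# Inspect scheduling decisions and failures in the Events section
kubectl describe pod <pod-name>

# Check pod conditions (e.g. PodScheduled) directly
kubectl get pod <pod-name> -o jsonpath='{.status.conditions}'
```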
Common Pitfalls
- Overly restrictive requirements
  - Combining too many hard constraints may make pods unschedulable
  - Start with soft preferences and only add hard requirements when necessary
  - Monitor for pods stuck in Pending state
- Topology key errors
  - Using non-existent topology keys will silently fail
  - Verify that topology keys exist on your nodes
  - Common keys: `kubernetes.io/hostname`, `topology.kubernetes.io/zone`
- Label and selector mismatches
  - Double-check that selectors match your intended pods
  - Test with simple examples before complex deployments
  - Use `kubectl get pods --show-labels` to verify labels
- Combining incompatible rules
  - Node affinity, pod affinity, and taints may conflict
  - Ensure all scheduling constraints can be satisfied simultaneously
  - Create diagrams for complex scheduling scenarios
Performance Considerations
Advanced scheduling features can impact scheduler performance:
- Pod affinity/anti-affinity
  - Has O(n²) computational complexity in the worst case
  - Can significantly slow down scheduling in large clusters
  - Use pod anti-affinity with wider topology domains when possible
- Inter-pod affinity
  - More expensive than node affinity
  - Be careful with large numbers of pods and complex rules
  - Consider using topology spread constraints instead for better performance
- Scheduler throughput
  - Complex rules reduce scheduler throughput
  - Monitor scheduler latency in large clusters
  - Consider using multiple scheduler profiles for different workloads
Best Practices
- Start simple
  - Begin with node selectors for basic constraints
  - Add node affinity for more complex requirements
  - Use pod affinity/anti-affinity only when necessary
  - Apply taints and tolerations for dedicated nodes
- Balance flexibility and constraints
  - Use soft preferences (`preferredDuringScheduling`) when possible
  - Reserve hard requirements for critical needs
  - Design for failure by allowing flexibility in placement
- Documentation
  - Document your node labeling scheme
  - Create clear diagrams for complex scheduling rules
  - Establish consistent naming conventions for labels and taints
- Monitoring
  - Watch for pods stuck in Pending state
  - Monitor scheduler latency metrics
  - Set up alerts for scheduling failures
  - Regularly review scheduling decisions
- Testing
  - Simulate node failures to verify resilience
  - Test scheduling behavior in non-production environments
  - Verify that scheduling rules work as expected during upgrades
Conclusion
Kubernetes advanced scheduling features provide powerful tools for optimizing workload placement. By using node affinity, pod affinity/anti-affinity, taints and tolerations, and other scheduling mechanisms, you can create sophisticated deployment strategies that maximize performance, availability, and resource utilization.
The key to successful scheduling is understanding your application requirements and cluster topology, then applying the right combination of scheduling features to achieve your goals. Start with simple rules and gradually add complexity as needed, always testing and validating that your scheduling decisions behave as expected.
By mastering these advanced scheduling concepts, you can ensure your Kubernetes workloads are deployed optimally across your infrastructure, improving reliability and efficiency.