
API Priority and Fairness

Understanding Kubernetes API Priority and Fairness for managing request flows and preventing API server overload


API Priority and Fairness (APF) is a Kubernetes feature that ensures fair handling of requests to the API server, preventing resource starvation during high traffic periods and improving cluster stability. This critical component manages how the Kubernetes API server processes concurrent requests, replacing the older max-inflight request limiting with a more sophisticated approach that categorizes and prioritizes requests based on their source, type, and importance.

Core Concepts

Flow Control

  • Request categorization: Classifies incoming requests based on attributes like user, namespace, and resource type
  • Priority levels: Assigns different levels of importance to request categories, with critical system requests getting higher priority
  • Queue management: Manages separate request queues for different priority levels to prevent high-priority requests from being blocked
  • Traffic shaping: Controls the rate at which requests are processed to prevent API server overload
  • Starvation prevention: Ensures that even low-priority requests eventually get processed, preventing complete starvation

Fair Queuing

  • Request dispatching: Efficiently distributes requests to available server resources
  • Flow distinction: Groups related requests into "flows" based on shared characteristics
  • Shuffle sharding: Uses a technique to distribute requests across multiple queues to reduce the impact of noisy neighbors
  • Concurrency limits: Controls how many requests from each priority level can execute simultaneously
  • FIFO queuing within flows: Processes requests in order within each flow to maintain fairness
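The queue-assignment idea behind shuffle sharding and fair queuing can be sketched in a few lines of Python. This is an illustration only, not the API server's actual implementation: the flow keys, the SHA-256 hashing, and the helper names are invented for the example.

```python
import hashlib

def hand_for_flow(flow_key: str, queues: int = 64, hand_size: int = 8) -> list[int]:
    """Derive a deterministic pseudo-random 'hand' of candidate queues for a flow."""
    digest = hashlib.sha256(flow_key.encode()).digest()
    available = list(range(queues))
    hand = []
    for byte in digest[:hand_size]:
        # Remove each chosen queue from the pool so the hand has no duplicates
        hand.append(available.pop(byte % len(available)))
    return hand

def enqueue(flow_key: str, queue_lengths: list[int]) -> int:
    """Dispatch a request to the shortest queue within its flow's hand."""
    hand = hand_for_flow(flow_key, queues=len(queue_lengths))
    target = min(hand, key=lambda q: queue_lengths[q])
    queue_lengths[target] += 1
    return target

lengths = [0] * 64
chosen = enqueue("user:alice", lengths)
print(f"request from alice landed in queue {chosen}")
```

Because each flow hashes to the same small hand of queues, a flooding flow can only fill its own hand, while other flows usually still have uncongested queues available in theirs.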

Key Components

Priority Level Configuration

apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: PriorityLevelConfiguration
metadata:
  name: global-default
spec:
  type: Limited  # Can be 'Limited' or 'Exempt' (for critical system components)
  limited:
    nominalConcurrencyShares: 20  # Relative share of concurrency - higher values get more concurrent requests (named assuredConcurrencyShares before v1beta3)
    limitResponse:
      type: Reject  # What happens when the concurrency limit is hit - 'Reject' or 'Queue'
      queuing:  # This block is only consulted when type is 'Queue'; it is ignored for 'Reject'
        queues: 128  # Number of queues (more queues reduce "noisy neighbor" problems)
        queueLengthLimit: 50  # Maximum number of requests per queue
        handSize: 8  # Number of queues a request can potentially be assigned to (shuffle sharding parameter)

This example defines a priority level named "global-default" that receives 20 concurrency shares. Because the limit response type is Reject, new requests are rejected rather than queued once the concurrency limit is reached. The queuing block takes effect only if the type is changed to Queue, in which case requests are spread across 128 queues of up to 50 requests each, using shuffle sharding with a hand size of 8.

Flow Schema Configuration

Basic Flow Schema

apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: service-accounts
spec:
  priorityLevelConfiguration:
    name: service-level  # References the PriorityLevelConfiguration to use
  distinguisherMethod:
    type: ByUser  # How to distinguish between flows - can be 'ByUser' or 'ByNamespace'
  matchingPrecedence: 1000  # Lower values have higher precedence when multiple FlowSchemas match
  rules:  # Define which requests match this FlowSchema
  - subjects:  # Who the rule applies to
    - kind: ServiceAccount  # Applies to all ServiceAccounts
    resourceRules:  # What API resources the rule applies to
    - apiGroups: ["*"]  # All API groups
      resources: ["*"]  # All resources
      verbs: ["*"]  # All verbs (GET, POST, etc.)

Rules and Matching

rules:
- subjects:  # The entities this rule applies to
  - kind: ServiceAccount  # A specific service account
    serviceAccount:
      name: controller  # Name of the service account
      namespace: kube-system  # Namespace of the service account
  - kind: User  # A specific user
    name: system:kube-controller-manager  # Name of the user
  resourceRules:  # Rules for API resources
  - apiGroups: ["*"]  # All API groups
    resources: ["*"]  # All resources
    verbs: ["*"]  # All verbs
  nonResourceRules:  # Rules for non-resource endpoints like /healthz
  - nonResourceURLs: ["*"]  # All non-resource URLs
    verbs: ["*"]  # All verbs

This example shows how to create a FlowSchema that directs requests from service accounts to a priority level called "service-level". The distinguisher method "ByUser" means each user gets its own flow. The rules section defines which requests match this FlowSchema, in this case all requests from any ServiceAccount. The second example shows more specific matching for a particular service account and user, with rules for both resource and non-resource endpoints.


Request Classification

graph TD
    A[API Request] --> B{Match FlowSchema?}
    B -->|Yes| C[Assign Priority Level]
    B -->|No| D[Use catch-all]
    C --> E{Queue Full?}
    E -->|Yes| F[Apply limitResponse]
    E -->|No| G[Queue Request]
    G --> H[Process by Priority]

When a request arrives at the API server:

  1. The system evaluates all FlowSchemas in order of matchingPrecedence to find a match
  2. Once matched, the request is assigned to the priority level specified by that FlowSchema
  3. The system determines if the priority level's concurrency limit is reached
  4. If the limit is reached, the system either queues or rejects the request based on the limitResponse configuration
  5. Queued requests are placed into one of the available queues using shuffle sharding
  6. Requests are processed according to their priority level and position in queue
  7. If no FlowSchema matches, the request is handled by the mandatory catch-all FlowSchema, which assigns it to the "catch-all" priority level (deliberately given a small concurrency share so unmatched traffic cannot crowd out classified requests)

This classification ensures that even during heavy API server load, critical operations continue to function while less important requests may be queued or rejected.
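The selection logic in steps 1-2 can be sketched roughly as follows. This is a simplified illustration, not the API server's code; the two schema definitions are hypothetical stand-ins for real FlowSchema objects:

```python
# Among all FlowSchemas whose rules match the request, the one with the
# lowest matchingPrecedence wins; "catch-all" matches everything last.
flow_schemas = [
    {"name": "catch-all", "matchingPrecedence": 10000,
     "priorityLevel": "catch-all",
     "matches": lambda req: True},
    {"name": "service-accounts", "matchingPrecedence": 1000,
     "priorityLevel": "service-level",
     "matches": lambda req: req["user"].startswith("system:serviceaccount:")},
]

def classify(request: dict) -> tuple[str, str]:
    """Return (flow schema name, priority level) for a request."""
    candidates = [fs for fs in flow_schemas if fs["matches"](request)]
    chosen = min(candidates, key=lambda fs: fs["matchingPrecedence"])
    return chosen["name"], chosen["priorityLevel"]

print(classify({"user": "system:serviceaccount:default:builder"}))
# ('service-accounts', 'service-level')
print(classify({"user": "alice"}))
# ('catch-all', 'catch-all')
```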

Advanced Configuration

Exempt Priority Level

apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: PriorityLevelConfiguration
metadata:
  name: exempt-workloads
spec:
  type: Exempt  # 'Exempt' type means these requests bypass all APF limits
  exempt: {}    # No additional configuration needed for exempt requests

The Exempt priority level is special - requests assigned to this level bypass all concurrency limits and queue management. This should be used very sparingly and only for truly critical components where any delay could cause serious cluster issues. Overuse of exempt priority levels can defeat the purpose of API Priority and Fairness.

Configuring Queue Parameters

apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: PriorityLevelConfiguration
metadata:
  name: analytics-priority
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 10    # Relative priority compared to other levels (named assuredConcurrencyShares before v1beta3)
    limitResponse:
      type: Queue                   # Queue requests rather than rejecting them
      queuing:
        queues: 64                  # Number of queues - higher values reduce "noisy neighbor" problems
        queueLengthLimit: 50        # Max queue length - prevents unbounded memory growth
        handSize: 8                 # Shuffle sharding hand size - balances fairness and isolation

Shuffle sharding is a key concept in APF that helps balance fairness with isolation. With a handSize of 8 and 64 queues, each flow can be assigned to 8 of the 64 queues. This means that if one flow is very busy, it will only impact a fraction of the queues, reducing the "noisy neighbor" problem where a busy tenant impacts others.
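The isolation benefit can be quantified: for one busy flow to block another flow completely, the second flow's entire hand would have to fall inside the busy flow's hand. A quick back-of-the-envelope calculation (illustrative only, using the parameters above):

```python
from math import comb

queues, hand_size = 64, 8

# Number of distinct hands a flow can be dealt
total_hands = comb(queues, hand_size)
print(total_hands)  # 4426165368

# Probability that another flow's hand falls entirely inside a given
# busy flow's 8 queues: only the identical hand does, so 1 / C(64, 8)
p_total_overlap = comb(hand_size, hand_size) / total_hands
print(p_total_overlap)  # roughly 2.3e-10
```

In other words, with these parameters it is vanishingly unlikely that a single busy flow leaves another flow with no uncongested queue at all.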

The optimal values for these parameters depend on your cluster size, workload patterns, and priorities:

  • Higher queue counts provide better isolation but consume more memory
  • Larger handSize improves fairness but reduces isolation
  • Higher queueLengthLimit allows more requests to be queued but increases memory usage


Enabling APF

# kube-apiserver configuration
--feature-gates=APIPriorityAndFairness=true   # Enable the APF feature gate (only needed on pre-GA releases)
--enable-priority-and-fairness=true           # Enable the APF functionality
--max-requests-inflight=400                   # With APF enabled, summed with the flag below to form the total concurrency limit
--max-mutating-requests-inflight=200          # APF divides this combined total among its priority levels

Starting with Kubernetes 1.20, API Priority and Fairness became beta and is enabled by default; in Kubernetes 1.23 and later the feature gate is no longer required. The feature graduated to GA in Kubernetes 1.29 with the flowcontrol.apiserver.k8s.io/v1 API. You may still need to explicitly set the --enable-priority-and-fairness=true flag in some environments.

When transitioning from the older max-inflight limiting approach, note that APF does not discard those flags: with APF enabled, the API server sums --max-requests-inflight and --max-mutating-requests-inflight to compute the total concurrency limit that APF divides among its priority levels. Rather than setting them to 0, size them to reflect the overall request concurrency your API server should allow, and let APF decide how that capacity is shared.

Use Cases

Critical API Protection

  • Protect system components: Ensure core Kubernetes components like controllers and schedulers always have API access
  • Ensure cluster operation: Maintain cluster health monitoring and management operations even under heavy load
  • Prevent admin lockout: Guarantee administrators can always access the cluster to fix issues, even during overload
  • Critical path prioritization: Ensure that operations in the critical path (like pod scheduling) take precedence
  • Control plane protection: Shield the Kubernetes control plane from being overwhelmed by application workloads

For example, in a production cluster running hundreds of applications, during a mass restart event, APF ensures the scheduler and controllers can still function to restore the system, while potentially delaying less critical application-initiated API calls.
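As a sketch of admin-lockout protection, a FlowSchema along these lines could route requests from an administrators group to a high-priority level. The group name "cluster-admins" and the schema name are illustrative; "workload-high" assumes the built-in priority level of that name, and Kubernetes already ships schemas such as system-leader-election for its own components:

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: cluster-admin-access   # illustrative name
spec:
  priorityLevelConfiguration:
    name: workload-high        # assumes this built-in priority level
  matchingPrecedence: 100      # low value = evaluated before most other schemas
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: Group
      group:
        name: cluster-admins   # hypothetical admin group
    resourceRules:
    - apiGroups: ["*"]
      resources: ["*"]
      verbs: ["*"]
      clusterScope: true
      namespaces: ["*"]
```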

Multi-tenant Fairness

  • Resource fairness between tenants: Ensure each tenant gets their fair share of API server resources
  • Prevent tenant starvation: Stop aggressive tenants from consuming all API resources and starving others
  • Noisy neighbor mitigation: Isolate the impact of chatty applications to prevent them from affecting others
  • Workload isolation: Separate production and development workloads to ensure production takes priority
  • Shared cluster management: Enable multiple teams to share a cluster without interfering with each other

For example, in a shared development cluster where multiple teams deploy applications, APF prevents one team's CI/CD pipeline that makes hundreds of API calls per minute from blocking another team's ability to deploy or debug their applications.
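Per-tenant fairness can be sketched with a FlowSchema that distinguishes flows by namespace, so each tenant namespace gets its own flow within a shared priority level. Names here are illustrative, and "workload-low" assumes the built-in priority level of that name:

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: tenant-workloads       # illustrative name
spec:
  priorityLevelConfiguration:
    name: workload-low         # assumes this built-in priority level
  matchingPrecedence: 8000
  distinguisherMethod:
    type: ByNamespace          # each namespace becomes its own flow
  rules:
  - subjects:
    - kind: Group
      group:
        name: system:serviceaccounts  # all service accounts
    resourceRules:
    - apiGroups: ["*"]
      resources: ["*"]
      verbs: ["*"]
      clusterScope: true
      namespaces: ["*"]
```

With fair queuing across flows, one namespace's burst of API calls competes only with its own flow rather than with every other tenant's requests.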


Troubleshooting

Common Issues

  • Too many rejected requests: Indicates insufficient concurrency limits or queue sizes for the workload
  • Unexpected request queuing: Requests may be categorized into a different priority level than expected
  • Priority level misconfiguration: Incorrect concurrency shares or queue parameters causing bottlenecks
  • Flow schema conflicts: Multiple flow schemas matching the same requests with unclear precedence
  • Performance degradation: Overall API server slowdown due to APF overhead or misconfiguration
  • System component disruption: Critical components being incorrectly throttled due to wrong priority levels
  • Inconsistent request handling: Some requests processed promptly while similar ones face delays

Diagnostic Approaches

# Check APF configuration
kubectl get prioritylevelconfigurations.flowcontrol.apiserver.k8s.io
kubectl get flowschemas.flowcontrol.apiserver.k8s.io

# View metrics
kubectl get --raw /metrics | grep apiserver_flowcontrol

# Check for rejected requests
kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total

# Check current queue status
kubectl get --raw /metrics | grep apiserver_flowcontrol_current_inqueue_requests

# Check execution and wait times
kubectl get --raw /metrics | grep apiserver_flowcontrol_request_execution_seconds
kubectl get --raw /metrics | grep apiserver_flowcontrol_request_wait_duration_seconds

# View detailed configuration
kubectl describe prioritylevelconfiguration <name>
kubectl describe flowschema <name>

# Inspect live APF state; dump_requests shows which flow schema is handling each queued request
kubectl get --raw /debug/api_priority_and_fairness/dump_priority_levels
kubectl get --raw /debug/api_priority_and_fairness/dump_requests

# Examine which requests are being queued or rejected
kubectl logs -n kube-system -l component=kube-apiserver | grep -i "priority and fairness"

When troubleshooting APF issues:

  1. First identify if requests are being rejected, queued, or just slow
  2. Determine which priority level and flow schema are handling the affected requests
  3. Check if concurrency limits or queue parameters need adjustment
  4. Verify that critical requests are being classified into the appropriate priority levels
  5. Look for patterns in the metrics that might indicate misconfigurations


Advanced Topics

Multiple API Servers

  • Configuration consistency: Ensure all API servers have identical APF configurations to prevent inconsistent request handling
  • Load balancing implications: Consider how load balancers distribute requests across API servers when designing APF settings
  • HA considerations: In high-availability setups, ensure APF doesn't interfere with API server failover mechanisms
  • Metric aggregation: Aggregate APF metrics across all API servers for a complete view of request handling
  • Cross-server fairness: Understand that APF operates independently on each API server, so global fairness requires coordination

When running multiple API servers, each server maintains its own independent APF system. This means a request that might be queued on one API server could be immediately processed on another. Consider using consistent hashing at the load balancer level to direct similar requests to the same API server for more predictable behavior.

Custom Metrics

  • Creating custom APF metrics: Develop additional metrics to monitor specific aspects of API request handling
  • Integrating with monitoring: Configure Prometheus to collect and alert on APF metrics
  • Setting alerts: Create alerts for excessive request rejection or queuing to proactively address issues
  • Performance dashboards: Build dashboards that visualize APF behavior over time
  • Historical trend analysis: Analyze long-term trends to optimize APF configurations

Example Prometheus alert for detecting excessive API request rejection:

- alert: KubeAPIServerRejectedRequests
  expr: sum(rate(apiserver_flowcontrol_rejected_requests_total[5m])) by (priority_level, flow_schema) > 10
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Kubernetes API server is rejecting too many requests"
    description: >-
      API server is rejecting requests for priority level {{ $labels.priority_level }}
      and flow schema {{ $labels.flow_schema }} at {{ $value }} requests per second
      (5-minute rate), sustained for at least 15 minutes.

Compatibility and Version Support

Migration Considerations

From max-in-flight Limits

  • Legacy settings: --max-requests-inflight and --max-mutating-requests-inflight provide simple global limits on concurrent requests
  • More granular control with APF: APF provides fine-grained control based on request attributes rather than simple global limits
  • Transition strategy: Run both systems in parallel initially, then gradually shift reliance to APF
  • Behavioral differences: APF queues and prioritizes requests rather than simply rejecting them when limits are reached
  • Performance impact assessment: Monitor API server performance during transition to ensure no degradation

The traditional max-in-flight limits are a blunt instrument that treats all requests equally. For example, with --max-requests-inflight=400, the 401st request is rejected regardless of its importance. APF, in contrast, might allow a critical system request through while queuing a less important request, even if the overall number of requests is high.

Upgrade Path

# Check existing settings (on kubeadm clusters the API server runs as a static Pod)
kubectl -n kube-system get pod -l component=kube-apiserver -o yaml | grep max-

# Gradually transition
# 1. Enable APF alongside the existing max-in-flight settings
# 2. Monitor APF metrics
# 3. Adjust APF configurations
# 4. Size the max-in-flight flags (their sum becomes APF's total concurrency limit)

When transitioning from max-in-flight to APF, follow these steps:

  1. Initial audit: Document current settings and understand their impact
    kubectl -n kube-system get pod -l component=kube-apiserver -o yaml | grep max-requests-inflight
    kubectl -n kube-system get pod -l component=kube-apiserver -o yaml | grep max-mutating-requests-inflight
    
  2. Enable APF: Turn on APF without disabling max-in-flight
    # kube-apiserver configuration
    --feature-gates=APIPriorityAndFairness=true
    --enable-priority-and-fairness=true
    # Keep existing max-in-flight settings during transition
    
  3. Create baseline configurations: Establish initial APF configurations that approximate your current limits
    apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
    kind: PriorityLevelConfiguration
    metadata:
      name: workload-standard
    spec:
      type: Limited
      limited:
        nominalConcurrencyShares: 100
        limitResponse:
          type: Queue
          queuing:
            queues: 64
            queueLengthLimit: 50
            handSize: 6
    
  4. Monitor and adjust: Use metrics to fine-tune APF settings while still protected by max-in-flight
    # Track rejection rates
    kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total
    # Monitor queue lengths
    kubectl get --raw /metrics | grep apiserver_flowcontrol_current_inqueue_requests
    
  5. Final transition: Once confident in APF behavior, size the max-in-flight flags deliberately
    # kube-apiserver configuration
    # With APF enabled, the sum of these two flags becomes the total
    # concurrency limit that APF divides among priority levels
    --max-requests-inflight=400
    --max-mutating-requests-inflight=200