Job and CronJob Enhancements

Understanding the latest enhancements to Kubernetes Jobs and CronJobs for improved batch processing capabilities

Kubernetes Jobs and CronJobs have received significant enhancements to improve their reliability, flexibility, and performance for batch processing workloads. These improvements address key challenges in batch processing, including better error handling, more efficient parallel processing, improved lifecycle management, and enhanced scheduling capabilities.

Job Fundamentals

Basic Job Concepts

  • One-time task execution: Jobs run pods that execute until successful completion, unlike long-running services
  • Completions tracking: Jobs monitor and record the number of successfully completed pods
  • Parallelism control: Jobs can run multiple pods concurrently to process work in parallel
  • Restart policies: Configure how pods behave after failure (Never or OnFailure; Always is not allowed for Jobs)
  • Completion handling: Determine when a job is considered complete based on success criteria

Job Enhancements

  • Indexed jobs: Assign unique indices to pods for coordinated distributed processing
  • Job tracking with finalizers: Prevent pods from being removed before their completions are recorded, ensuring accurate completion tracking
  • Backoff limits: Control retry behavior with sophisticated backoff mechanisms
  • Pod failure policy: Define specific actions for different failure scenarios
  • Suspend capability: Pause and resume jobs for debugging or resource management
  • Completion mode options: Choose between NonIndexed and Indexed completion modes to control how completions are counted

Indexed Jobs

Indexed Jobs assign each pod a unique completion index from 0 to completions - 1, exposed through the batch.kubernetes.io/job-completion-index annotation and the JOB_COMPLETION_INDEX environment variable. Each pod can therefore claim its own slice of the work deterministically, without an external coordination service.
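
A minimal sketch of an indexed Job (the echo command stands in for real work):

apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-job
spec:
  completions: 5              # Indices 0 through 4, each must succeed once
  parallelism: 2              # At most 2 pods run concurrently
  completionMode: Indexed     # Assign a unique index to every pod
  template:
    spec:
      containers:
      - name: worker
        image: busybox:1.36
        # JOB_COMPLETION_INDEX is injected automatically in Indexed mode
        command: ["sh", "-c", "echo processing partition $JOB_COMPLETION_INDEX"]
      restartPolicy: Never

If a pod with a given index fails, its replacement receives the same index, so each partition is eventually processed exactly once.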

Pod Failure Policy

Pod Failure Policy provides fine-grained control over how different types of pod failures are handled, allowing for more sophisticated error handling strategies:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-pod-failure-policy
spec:
  completions: 12          # Need 12 successful completions
  parallelism: 3           # Run up to 3 pods at once
  completionMode: Indexed  # Required for the JOB_COMPLETION_INDEX reference below
  template:
    spec:
      containers:
      - name: main
        image: batch-job:latest
        command: ["worker", "--batch-id=$(JOB_COMPLETION_INDEX)"]
      restartPolicy: Never
  backoffLimit: 6        # Allow up to 6 retries for failures counted against the backoff limit
  podFailurePolicy:      # Define custom handling for different failure scenarios
    rules:
    - action: FailJob    # Immediately fail the entire job
      onExitCodes:       # When the container exits with code 42
        containerName: main
        operator: In     # Exit code is in the list
        values: [42]     # List of exit codes for this rule
    - action: Ignore     # Don't count against backoffLimit
      onExitCodes:       # When container exits with codes 5, 6, or 7
        containerName: main
        operator: In
        values: [5, 6, 7]
    - action: Count      # Count against the backoffLimit (default behavior)
      onPodConditions:   # When pod has the DisruptionTarget condition
      - type: DisruptionTarget

This policy allows for:

  • FailJob: Immediately terminate the job when certain critical errors occur (exit code 42)
  • Ignore: Don't count certain expected or recoverable errors against the retry limit (exit codes 5, 6, 7)
  • Count: Standard behavior of counting the failure against backoffLimit (for disruption events)

Pod failure policies are especially useful for:

  • Failing fast when unrecoverable errors occur
  • Preserving retry budget for transient failures
  • Distinguishing between application errors and infrastructure issues
  • Implementing graceful degradation strategies
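
Policies like the one above only work if the worker process maps error classes to stable exit codes. A hypothetical entrypoint wrapper for the main container (the worker binary and the code mapping are illustrative):

#!/bin/sh
# Hypothetical wrapper: translate the worker's error classes into the
# exit codes referenced by the podFailurePolicy rules above
worker --batch-id="$JOB_COMPLETION_INDEX"
rc=$?
case "$rc" in
  0)   exit 0 ;;        # Success
  13)  exit 42 ;;       # Unrecoverable error -> FailJob rule
  111) exit 5 ;;        # Transient dependency outage -> Ignore rule
  *)   exit "$rc" ;;    # Anything else counts against backoffLimit
esac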

Job Termination and Cleanup

Job TTL Controller

apiVersion: batch/v1
kind: Job
metadata:
  name: cleanup-job
spec:
  ttlSecondsAfterFinished: 100  # Job will be automatically deleted 100 seconds after completion
  template:
    spec:
      containers:
      - name: worker
        image: batch-processor:v1
      restartPolicy: Never

The TTL Controller automatically cleans up finished Jobs (both succeeded and failed) after a specified time period. This prevents the accumulation of completed Job objects in the cluster, which can cause performance degradation in the Kubernetes API server and etcd.

Benefits of the TTL Controller:

  • Reduces API server and etcd load
  • Prevents namespace clutter
  • Automates lifecycle management
  • Configurable retention periods
  • Works for both regular Jobs and CronJob-created Jobs
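
The field is mutable, so a TTL can also be added to a Job after creation; for example, on a finished job (report-job is a placeholder name):

# Give an existing job a one-hour TTL after it finishes
kubectl patch job report-job --type=merge -p '{"spec":{"ttlSecondsAfterFinished":3600}}'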

Finalizers for Tracking

  • Guarantees job completion tracking: The batch.kubernetes.io/job-tracking finalizer is placed on the job's pods so the controller can record each completion even if pods are deleted
  • Prevents premature pod removal: A finished pod cannot be fully deleted until the job controller has accounted for it and removed the finalizer
  • Maintains accurate job history: Ensures completion records are accurately maintained for metrics and history
  • Supports reliable metrics: Provides consistent data for monitoring systems tracking job success/failure rates
  • Ensures proper resource cleanup: Coordinates proper cleanup of all job-related resources

The job tracking finalizer addresses race conditions that could occur when pods complete but the job controller hasn't yet recorded the completion, ensuring that job status is always accurate.
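
You can observe the finalizer directly on a job's pods (my-job is a placeholder; job pods carry the job-name label):

# List each pod of the job together with its finalizers
kubectl get pods -l job-name=my-job \
  -o custom-columns=NAME:.metadata.name,FINALIZERS:.metadata.finalizers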

Suspend Capability

Setting .spec.suspend to true tells the job controller to stop creating pods (and to delete any that are already running) until the field is set back to false. This makes it possible to create jobs ahead of time and release them later, pause work during resource contention, or hold a job for debugging.
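
A minimal sketch (the job name and image are illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: suspendable-job
spec:
  suspend: true           # Created suspended; no pods start until resumed
  completions: 4
  parallelism: 2
  template:
    spec:
      containers:
      - name: worker
        image: batch-processor:v1
      restartPolicy: Never

Toggle execution by patching the field:

# Resume the job
kubectl patch job suspendable-job --type=merge -p '{"spec":{"suspend":false}}'

# Suspend it again (active pods are terminated; work restarts on resume)
kubectl patch job suspendable-job --type=merge -p '{"spec":{"suspend":true}}'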

CronJob Improvements

CronJobs have been enhanced with several features to improve reliability, timezone support, and history management:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: reliable-cronjob
spec:
  schedule: "*/10 * * * *"   # Run every 10 minutes (cron syntax)
  timeZone: "America/New_York"  # Use New York timezone for scheduling
  concurrencyPolicy: Forbid   # Don't allow concurrent executions
  startingDeadlineSeconds: 200  # Allow job to start up to 200s late
  successfulJobsHistoryLimit: 3  # Keep history of 3 successful jobs
  failedJobsHistoryLimit: 1   # Keep history of 1 failed job
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: periodic-task
            image: cron-worker:v2
          restartPolicy: OnFailure  # Restart container if it fails

Key CronJob improvements include:

  1. Timezone Support: Specify schedules in any IANA timezone rather than only UTC
  2. Improved Controller Reliability: More robust handling of missed schedules
  3. Configurable History Limits: Control how many successful and failed jobs are retained
  4. Concurrency Policies:
    • Allow: Allow concurrent jobs (default)
    • Forbid: Skip new job if previous job still running
    • Replace: Cancel running job and start new one
  5. Starting Deadline: Define how late a job can start before being considered missed
  6. Stability Enhancements: Reduced API server load with optimized controller behavior
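
Because a CronJob is just a template for Jobs, you can also trigger a run on demand without waiting for the next scheduled tick:

# Create a one-off Job from the CronJob's template (manual-run-1 is an arbitrary name)
kubectl create job --from=cronjob/reliable-cronjob manual-run-1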

Performance Enhancements

Job API Optimization

  • Reduced API server load: Fewer status updates and optimized watch patterns decrease API server pressure
  • Optimized status updates: Updates are batched and throttled to minimize API calls
  • Better scaling for large jobs: Efficient handling of jobs with thousands of completions
  • Improved controller efficiency: Enhanced algorithms in the job controller reduce CPU and memory usage
  • Reduced etcd pressure: Fewer and smaller writes to etcd improve overall cluster performance

These optimizations are particularly important for large-scale batch processing where hundreds or thousands of pods might be managed by a single job, addressing previous performance bottlenecks that could affect the entire cluster.

Tracking Finalizers

apiVersion: batch/v1
kind: Job
metadata:
  name: high-volume-job
  # The batch.kubernetes.io/job-tracking finalizer is placed on this job's
  # pods by the control plane automatically; it should not be set by hand
spec:
  completions: 1000  # Large number of required completions
  parallelism: 50    # Run up to 50 pods concurrently
  template:
    spec:
      containers:
      - name: worker
        image: batch-processor:v1
      restartPolicy: Never

The job tracking finalizer is particularly valuable for high-volume jobs because:

  1. It prevents race conditions where pods complete but the job controller hasn't processed the completion
  2. It ensures accurate tracking even if the job controller temporarily fails or restarts
  3. It maintains the integrity of completion counts for jobs with large numbers of pods
  4. It provides consistency guarantees even under high cluster load
  5. It works seamlessly with the indexed job feature for reliable distributed processing

For jobs with thousands of completions, these guarantees prevent subtle bugs and inconsistencies that could occur in earlier Kubernetes versions.

Advanced Job Patterns

Timezone Support for CronJobs

Timezone support allows scheduling jobs according to local time in specific geographic regions, making it easier to coordinate batch processing with business operations around the world:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: timezone-aware-job
spec:
  schedule: "0 7 * * *"        # Run at 7:00 AM
  timeZone: "Europe/Paris"     # Using Paris local time
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: morning-task
            image: daily-report:v1
          restartPolicy: OnFailure

Benefits of timezone support:

  • Business hour alignment: Schedule jobs during specific business hours in different regions
  • Maintenance window coordination: Align batch processing with regional maintenance windows
  • User experience optimization: Schedule customer-facing operations at appropriate local times
  • Regulatory compliance: Execute jobs at legally required times in various jurisdictions
  • Global operations: Manage global operations with region-specific schedules

The CronJob controller automatically handles daylight saving time adjustments within the specified timezone, ensuring jobs run at the expected local time throughout the year without manual intervention.

Scalability Improvements

Large Job Support

  • Optimized status tracking: Efficient algorithms for tracking thousands of pods without overwhelming the API server
  • Reduced API call volume: Batch updates and throttling mechanisms minimize API calls for job status updates
  • Efficient completion tracking: Improved indexing and caching of pod completion status
  • Memory optimizations: Reduced memory footprint in the job controller for large-scale jobs
  • Backoff management: Sophisticated retry mechanisms that prevent thundering herd problems during retries

These improvements address critical scalability bottlenecks that previously limited the size and performance of batch workloads in Kubernetes, enabling much larger batch processing operations.

High-volume Considerations

apiVersion: batch/v1
kind: Job
metadata:
  name: large-scale-job
spec:
  completions: 10000                # Very large number of completions
  parallelism: 100                  # High concurrency
  backoffLimit: 10                  # Allow up to 10 retries
  template:
    spec:
      containers:
      - name: worker
        image: batch-processor:latest
        resources:                  # Resource constraints are crucial for large jobs
          requests:
            memory: "64Mi"          # Memory request per pod
            cpu: "100m"             # CPU request per pod (0.1 core)
          limits:
            memory: "128Mi"         # Memory limit per pod
            cpu: "200m"             # CPU limit per pod (0.2 core)
      restartPolicy: Never

For high-volume jobs with thousands of pods, consider these additional practices:

  1. Set appropriate resource requests/limits: Prevent resource contention and ensure predictable performance
  2. Use node anti-affinity: Spread pods across nodes to avoid overwhelming individual nodes (see the sketch after this list)
  3. Implement pod disruption budgets: Protect critical batch workloads during cluster maintenance
  4. Consider pod priority classes: Ensure important batch jobs get resources when needed
  5. Monitor cluster-wide impact: Watch for API server and etcd performance when running very large jobs
  6. Use indexed completion mode: Enables more efficient tracking for jobs with many completions
  7. Implement proper failure handling: Use pod failure policies to handle different error scenarios
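
A minimal sketch of items 2 and 4 combined (the batch-high PriorityClass is assumed to already exist in the cluster):

apiVersion: batch/v1
kind: Job
metadata:
  name: spread-batch-job
spec:
  completions: 1000
  parallelism: 100
  template:
    metadata:
      labels:
        app: spread-batch-job
    spec:
      priorityClassName: batch-high     # Assumed pre-existing PriorityClass
      affinity:
        podAntiAffinity:                # Prefer spreading pods across nodes
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: spread-batch-job
              topologyKey: kubernetes.io/hostname
      containers:
      - name: worker
        image: batch-processor:v1
      restartPolicy: Never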

Best Practices

Drawing the earlier sections together, the practices that pay off most in production batch workloads are:

  • Set ttlSecondsAfterFinished on every Job so finished objects are cleaned up automatically
  • Use Indexed completion mode for partitionable work so each pod gets a deterministic share
  • Define a podFailurePolicy to fail fast on unrecoverable errors and preserve the retry budget for transient ones
  • Always set resource requests and limits; large jobs without them can destabilize nodes
  • On CronJobs, set concurrencyPolicy, startingDeadlineSeconds, and the history limits explicitly rather than relying on defaults

Practical Examples

Data Processing Pipeline

apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing-pipeline
spec:
  completions: 100            # Process 100 data shards
  parallelism: 10             # Process 10 shards concurrently
  completionMode: Indexed     # Use indexed mode for deterministic work distribution
  template:
    spec:
      containers:
      - name: processor
        image: data-processor:v1
        env:
        - name: SHARD_INDEX   # Make index available to container
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        volumeMounts:
        - name: data-volume   # Mount shared data volume
          mountPath: /data
        command:
        - "/bin/sh"
        - "-c"
        - "process-shard --input=/data/input --shard=$SHARD_INDEX --output=/data/output"
      volumes:
      - name: data-volume     # Define persistent storage for data
        persistentVolumeClaim:
          claimName: data-pvc
      restartPolicy: Never    # Don't restart failed pods

This data processing example demonstrates several advanced features:

  1. Indexed completion mode: Each pod knows exactly which data shard to process
  2. Controlled parallelism: Limits concurrent processing to avoid resource contention
  3. Persistent volume integration: Provides shared storage for input and output data
  4. Automatic work distribution: Kubernetes ensures each shard is processed exactly once
  5. Fault tolerance: If a pod fails, a new pod with the same index is created

This pattern is ideal for batch processing of large datasets, ETL workflows, and distributed computation tasks that can be partitioned.

Scheduled Database Backup

apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
spec:
  schedule: "0 2 * * *"             # Run daily at 2 AM
  timeZone: "UTC"                   # Using UTC time
  concurrencyPolicy: Forbid         # Don't allow overlapping executions
  successfulJobsHistoryLimit: 7     # Keep 7 days of successful backups
  failedJobsHistoryLimit: 3         # Keep 3 failed jobs for troubleshooting
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 86400  # Clean up completed jobs after 24 hours
      template:
        spec:
          containers:
          - name: backup
            image: db-backup:v3
            env:
            - name: DB_HOST              # Database connection details
              value: postgres-svc
            # Kubernetes does not expand $(date ...) in env values, so the
            # date-stamped path is built by the shell at run time
            # (assumes the image ships pg_dump and gzip)
            command: ["/bin/sh", "-c"]
            args:
            - pg_dump -h "$DB_HOST" | gzip > "/backups/$(date +%Y%m%d).sql.gz"
            volumeMounts:
            - name: backup-volume    # Mount backup storage
              mountPath: /backups
          volumes:
          - name: backup-volume      # Persistent storage for backups
            persistentVolumeClaim:
              claimName: backup-pvc
          restartPolicy: OnFailure   # Retry container if it fails

This database backup example demonstrates several CronJob best practices:

  1. Scheduled execution: Runs automatically at a specific time every day
  2. Concurrency control: Prevents overlapping backups that could cause conflicts
  3. History management: Maintains a week of successful backup history
  4. Automatic cleanup: Uses TTL controller to remove old Job objects
  5. Persistent storage: Ensures backups are stored durably outside of pods
  6. Dynamic naming: Creates date-stamped backup files
  7. Failure handling: Uses OnFailure restart policy for resilience

Troubleshooting

When a Job or CronJob misbehaves, the usual first pass is its status conditions, its pods' logs, and the events it emitted (my-job and my-cronjob below are placeholder names):
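
# Inspect status, conditions, failure counts, and recent events
kubectl describe job my-job

# Read logs from the job's pods (job pods carry the job-name label)
kubectl logs -l job-name=my-job --all-containers

# For CronJobs: check the last schedule time and recently spawned jobs
kubectl describe cronjob my-cronjob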

Migration Considerations

Upgrading from Older Versions

  • Job API version changes: Migration from batch/v1beta1 to batch/v1 for CronJob resources
  • CronJob stability improvements: More reliable scheduling behavior in newer versions
  • Feature gate requirements: Some features require specific feature gates to be enabled
  • Controller behavior changes: More efficient job tracking and status updates
  • Performance impacts: Improved scalability for large jobs with many completions

Kubernetes 1.21 promoted CronJobs to stable (batch/v1), requiring migration from the beta API:

# Find CronJobs using the beta API
kubectl get cronjobs.v1beta1.batch --all-namespaces

# Convert beta CronJobs to v1 (YAML output so the sed pattern matches)
kubectl get cronjobs.v1beta1.batch -o yaml | \
  sed 's/apiVersion: batch\/v1beta1/apiVersion: batch\/v1/g' | \
  kubectl apply -f -

Feature Gates

Some enhancements may require enabling specific feature gates:

# kube-apiserver and kube-controller-manager flags
--feature-gates=JobPodFailurePolicy=true,JobMutableNodeSchedulingDirectives=true,JobBackoffLimitPerIndex=true

Feature gates timeline:

  • JobTrackingWithFinalizers: Beta in 1.23, stable in 1.26
  • SuspendJob: Beta in 1.22, stable in 1.24
  • JobPodFailurePolicy: Alpha in 1.25, beta in 1.26
  • JobMutableNodeSchedulingDirectives: Beta in 1.23, stable in 1.27
  • JobBackoffLimitPerIndex: Alpha in 1.28
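
On clusters that expose API server metrics, the kubernetes_feature_enabled metric (available in recent releases) reports which gates are on; this assumes you have permission to read the raw metrics endpoint:

# Check whether a specific gate is enabled on the API server
kubectl get --raw /metrics | grep kubernetes_feature_enabled | grep JobPodFailurePolicy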

Future Directions

Work on the batch APIs continues upstream: features such as JobBackoffLimitPerIndex (per-index retry budgets, alpha in 1.28) are still graduating through the feature-gate pipeline, pointing toward finer-grained failure handling and better support for very large parallel workloads.

Reference and Integration

Integration with Other Services

  • Argo Workflows for complex pipelines: Extends Kubernetes jobs with DAG-based workflows, dependencies, and artifacts
    # Argo Workflow example
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      name: data-pipeline
    spec:
      entrypoint: process-data
      templates:
      - name: process-data
        dag:
          tasks:
          - name: extract
            template: extract-task
          - name: transform
            template: transform-task
            dependencies: [extract]
          - name: load
            template: load-task
            dependencies: [transform]
    
  • Tekton for CI/CD integration: Integrates Jobs into CI/CD pipelines with rich features
    # Tekton TaskRun example
    apiVersion: tekton.dev/v1
    kind: TaskRun
    metadata:
      name: build-and-test
    spec:
      taskRef:
        name: build-test-task
      params:
      - name: repo-url
        value: "https://github.com/example/repo"
    
  • Prometheus for job metrics: Monitor job performance and success rates
    # Prometheus recording rule examples (built on kube-state-metrics job metrics)
    - record: job_success_ratio
      expr: sum(kube_job_status_succeeded) / sum(kube_job_spec_completions)
    
    - record: job_completion_time
      expr: kube_job_status_completion_time - kube_job_status_start_time
    
  • Event-driven architectures: Trigger jobs based on events from Kubernetes or external systems
    # Using KEDA ScaledJob example
    apiVersion: keda.sh/v1alpha1
    kind: ScaledJob
    metadata:
      name: event-processor
    spec:
      jobTargetRef:
        template:
          spec:
            containers:
            - name: processor
              image: event-processor:v1
            restartPolicy: Never    # Jobs require Never or OnFailure
      triggers:
      - type: kafka
        metadata:
          bootstrapServers: kafka:9092
          consumerGroup: job-processor
          topic: events
          lagThreshold: "100"
    
  • Serverless frameworks: Use Jobs as compute backends for serverless workloads
    # Knative example
    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: batch-function
    spec:
      template:
        spec:
          containers:
          - image: batch-processor:v1
            env:
            - name: BATCH_SIZE
              value: "1000"
    

Kubernetes Events

Jobs and CronJobs emit events that can be monitored:

# Watch Job events in real-time
kubectl get events --field-selector involvedObject.kind=Job,involvedObject.name=my-job --watch

# Filter for specific event types
kubectl get events --field-selector involvedObject.kind=Job,reason=SuccessfulCreate

# Monitor CronJob schedule events
kubectl get events --field-selector involvedObject.kind=CronJob

# View events across multiple jobs
kubectl get events --field-selector involvedObject.kind=Job,involvedObject.namespace=batch-jobs

Important events to monitor:

  • SuccessfulCreate: Job controller created a pod
  • FailedCreate: Job controller failed to create a pod
  • Completed: Job completed successfully
  • BackoffLimitExceeded: Job failed after exhausting its retry budget
  • SawCompletedJob: CronJob controller observed one of its jobs finishing
  • TooManyMissedTimes: CronJob missed too many scheduled start times (check startingDeadlineSeconds)