Docker for AI/ML Workloads
Learn how to effectively containerize, deploy, and orchestrate AI and machine learning workloads with Docker
Docker provides an excellent platform for developing, training, and deploying AI and machine learning models, offering reproducibility, portability, and scalability for complex AI workflows. By containerizing AI/ML environments, data scientists and ML engineers can ensure consistent execution across development, testing, and production systems while eliminating the notorious "it works on my machine" problem that often plagues complex ML dependencies. Docker also enables efficient collaboration between teams, version control of entire environments, and seamless integration with orchestration tools for distributed training and inference.
AI/ML Development Environment
Base Images for AI/ML
- NVIDIA CUDA images for GPU workloads
- Pre-configured with CUDA drivers and libraries
- Various versions available to match specific CUDA requirements
- Optimized for different GPU architectures (Pascal, Volta, Turing, Ampere)
- Base layer for building custom deep learning environments
- Example:
nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
- TensorFlow official images
- Complete environments with TensorFlow pre-installed
- Available in CPU and GPU variants
- Jupyter Notebook integration in many images
- Consistent versioning with TensorFlow releases
- Example:
tensorflow/tensorflow:2.12.0-gpu
- PyTorch container ecosystem
- Official PyTorch installations with CUDA support
- Optimized for performance with GPU acceleration
- Includes common PyTorch libraries and extensions
- Various Python version options
- Example:
pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
- Scikit-learn and data science stacks
- Comprehensive Python data science toolkits
- Pandas, NumPy, Matplotlib, and other common libraries
- Ready-to-use Jupyter environments
- Optimized for data processing pipelines
- Example:
jupyter/scipy-notebook:python-3.10
- Specialized deep learning images
- Domain-specific containers (NLP, computer vision, etc.)
- Pre-trained models and frameworks
- Hugging Face Transformers, Detectron2, etc.
- Production-optimized inference containers
- Example:
huggingface/transformers-pytorch-gpu:4.29.2
Setting Up GPU Support
- NVIDIA Container Toolkit (formerly nvidia-docker)
- System component that enables GPU access from containers
- Provides runtime extensions for Docker
- Manages NVIDIA driver mapping between host and container
- Essential prerequisite for GPU-accelerated containers
- Installation (after adding NVIDIA's apt repository):
apt-get install -y nvidia-container-toolkit
- Then register the runtime with nvidia-ctk runtime configure --runtime=docker and restart the Docker daemon
- GPU passthrough configuration
- Exposes specific GPUs to containers
- Controls which containers can access which GPUs
- Enables fine-grained resource allocation
- Configured with the --gpus flag or in docker-compose
- Example:
docker run --gpus '"device=0,1"' nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 nvidia-smi
- Driver compatibility considerations
- Container CUDA version must be compatible with host driver
- Driver version must support the required CUDA version
- Compatibility matrix available in NVIDIA documentation
- Minimum driver version depends on CUDA toolkit version
- Best practice: Use container CUDA version ≤ host driver CUDA capability
- Resource allocation
- Memory limits for preventing OOM errors
- GPU memory monitoring and management
- NVIDIA MPS for shared GPU access
- Using nvidia-smi for resource monitoring
- Example:
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
- Multi-GPU setups
- Data parallelism across multiple GPUs
- Managing process-to-GPU mapping
- NCCL configuration for inter-GPU communication
- NVLink considerations for high-bandwidth connections
- Optimizing container placement for GPU topology
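As a quick sanity check for the GPU passthrough and driver-compatibility points above, a short script run inside the container can confirm what the framework actually sees. This is a minimal sketch using PyTorch; any framework's device query works equally well:

```python
# Run inside a container started with --gpus to confirm GPU visibility.
import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")
```

If CUDA is unavailable here while nvidia-smi works on the host, the NVIDIA Container Toolkit configuration or the --gpus flag is usually the culprit.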
Essential Docker Images for AI/ML
Optimizing Dockerfiles for ML
For efficient AI/ML workflows, optimize your Dockerfiles with these best practices:
- Use multi-stage builds to reduce image size
- Separate build environments from runtime environments
- Compile dependencies in build stage and copy only artifacts
- Significantly reduces final image size (often by 70-90%)
- Improves security by eliminating build tools in production
- Example: Build stage for compiling custom ops, runtime stage for inference
- Layer dependencies strategically (least-to-most changing)
- Order installations from most stable to most frequently changing
- System packages first, then framework dependencies, then project code
- Maximizes Docker's layer caching for faster rebuilds
- Groups related installations in single RUN commands to reduce layers
- Example: OS packages → Python packages → Framework → Custom code
- Include only necessary libraries and tools
- Avoid installing development packages in production images
- Use slim/runtime variants of base images when possible
- Remove temporary files, caches, and documentation
- Consider smaller base distros with care (Alpine's musl libc often lacks prebuilt Python ML wheels)
- Example: Use apt-get install --no-install-recommends and clean the apt cache
- Cache model weights and datasets appropriately
- Use volume mounts for large datasets instead of including in image
- Implement download checkpointing to resume interrupted transfers
- Consider multi-stage downloads with verification in Dockerfile
- Use build args to control which models are included
- Example: Implement dataset version control with content-addressable storage
- Configure environment variables for optimal performance
- Set framework-specific optimization flags
- Configure threading and parallelism settings
- Enable hardware-specific acceleration features
- Tune memory allocation and garbage collection
- Example: Set TF_ENABLE_ONEDNN_OPTS=1 for Intel CPU optimization
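The weight- and dataset-caching bullets above can be made concrete with a small, hypothetical download helper that resumes interrupted transfers and verifies a checksum before an image layer (or container startup) proceeds. The URL, path, and hash environment variables below are illustrative assumptions, not a prescribed interface:

```python
"""Sketch: resumable model/dataset download with checksum verification."""
import hashlib
import os
import requests

def fetch(url: str, dest: str, sha256: str, chunk_mb: int = 8) -> None:
    # Resume from a partial file if a previous download was interrupted.
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        mode = "ab" if offset and r.status_code == 206 else "wb"
        with open(dest, mode) as f:
            for chunk in r.iter_content(chunk_size=chunk_mb * 1024 * 1024):
                f.write(chunk)
    # Verify integrity before the layer is committed or training starts.
    digest = hashlib.sha256()
    with open(dest, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(block)
    if digest.hexdigest() != sha256:
        raise RuntimeError(f"Checksum mismatch for {dest}")

if __name__ == "__main__":
    # MODEL_URL and MODEL_SHA256 are hypothetical build args / env vars.
    fetch(os.environ["MODEL_URL"], "/models/weights.bin", os.environ["MODEL_SHA256"])
```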
Data Management Strategies
Volume Mounting for Datasets
- Mount large datasets as volumes
- Avoids copying data into containers
- Enables sharing datasets across multiple containers
- Persists data beyond container lifecycle
- Improves build time and reduces image size
- Example:
docker run -v /host/data:/data:ro tensorflow/tensorflow:latest-gpu
- Configure data caching
- Implement multi-level caching strategies
- Use tmpfs mounts for high-speed temporary storage
- Leverage SSD for intermediate datasets and HDD for archival
- Cache preprocessed data to avoid redundant computation
- Example:
--mount type=tmpfs,destination=/cache,tmpfs-size=16g
- Structure data directories
- Organize by dataset, version, and splits (train/val/test)
- Implement consistent naming conventions
- Create metadata files documenting dataset properties
- Consider columnar formats (Parquet, Arrow) for efficiency
- Design for parallel access patterns in distributed training
- Version datasets effectively
- Implement content-addressable storage patterns
- Use dataset versioning tools (DVC, Pachyderm, etc.)
- Create dataset manifests with checksums
- Track dataset lineage and transformations
- Consider ACID-compliant dataset management
- Optimize I/O operations
- Use memory mapping for large files
- Implement asynchronous data loading pipelines
- Consider data compression tradeoffs
- Tune buffer sizes for specific storage systems
- Example: TensorFlow tf.data API with prefetching and parallelism
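Tying the volume-mounting and I/O bullets together, here is a sketch of a tf.data input pipeline reading from a read-only dataset mount with parallel, prefetched loading; the paths and batch size are illustrative:

```python
import tensorflow as tf

DATA_DIR = "/data"  # mounted read-only, e.g. -v /host/data:/data:ro

files = tf.data.Dataset.list_files(f"{DATA_DIR}/train/*.tfrecord", shuffle=True)
dataset = (
    files.interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=10_000)
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)  # overlap input loading with GPU compute
)
```

The same pattern applies to a PyTorch DataLoader with num_workers and pin_memory tuned for the mounted storage.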
Model Storage and Versioning
- Efficient model serialization
- Choose appropriate serialization formats (saved_model, ONNX, TorchScript)
- Optimize for size vs. loading speed tradeoffs
- Consider quantization for model compression
- Implement selective parameter saving
- Example: Convert large models to fp16 precision for storage
- Model registry integration
- Connect to MLflow, Weights & Biases, or custom registries
- Implement automated versioning on training completion
- Tag models with metadata (metrics, dataset version, etc.)
- Support model lineage tracking
- Example:
mlflow.tensorflow.log_model(model, "model")
- Version control for models
- Store models with semantic versioning
- Implement immutable model artifacts
- Track model dependencies and environment
- Manage experimental vs. production models
- Example: Use Git LFS or specialized model versioning tools
- Artifact management
- Define lifecycle policies for model retention
- Implement access control for model artifacts
- Store evaluation metrics alongside models
- Include sample inputs/outputs with models
- Track compute resources used for training
- Reproducible model loading
- Store model configuration separately from weights
- Document initialization procedures
- Version model loaders alongside models
- Implement model compatibility checking
- Example: Create model cards with reproducibility instructions
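As one sketch of registry integration, assuming an MLflow tracking server reachable from the training container via MLFLOW_TRACKING_URI; the model, metric values, and names below are placeholders:

```python
import mlflow
import mlflow.pytorch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder for the trained model

with mlflow.start_run():
    mlflow.log_params({"dataset_version": "v3", "epochs": 20})  # lineage metadata
    mlflow.log_metric("val_accuracy", 0.93)                     # illustrative value
    mlflow.pytorch.log_model(
        model,
        artifact_path="model",
        registered_model_name="example-classifier",  # creates a new registry version
    )
```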
Training Workflows
Containerized training requires careful resource management to ensure stability, efficiency, and resilience:
- Configure memory limits appropriately for your model size
- Account for model parameters, gradients, optimizer states, and batch size
- Include buffer for framework overhead (often 20-30% extra)
- Set hard limits to prevent OOM crashes affecting other containers
- Consider gradient accumulation for large models with memory constraints
- Example:
docker run --memory=24g --memory-reservation=20g tensorflow/tensorflow
- Enable GPU access with proper runtime configurations
- Install NVIDIA Container Toolkit on host system
- Use the --gpus flag with appropriate constraints
- Set CUDA_VISIBLE_DEVICES for framework-level GPU selection
- Configure GPU memory growth settings to prevent over-allocation
- Example:
docker run --gpus '"device=0,1"' --shm-size=1g pytorch/pytorch
- Implement checkpointing for recovery
- Save checkpoints to persistent volumes, not container filesystem
- Implement regular checkpoint intervals (e.g., every N batches or epochs)
- Use async checkpointing to minimize training interruption
- Implement checkpoint rotation policy to manage storage
- Include metadata for resuming from exact position
- Example:
tf.keras.callbacks.ModelCheckpoint('/mnt/checkpoints/model_epoch_{epoch:02d}.h5')
- Monitor resource utilization during training
- Implement logging of GPU memory, CPU usage, I/O wait times
- Use tools like nvidia-smi, cAdvisor, or custom monitoring
- Alert on resource exhaustion before failure occurs
- Track batch throughput and training speed over time
- Example: Include TensorBoard profiling for performance analysis
- Consider distributed training patterns for large models
- Implement data parallelism for dataset-scale challenges
- Use model parallelism for very large model architectures
- Configure proper communication protocols between nodes
- Implement gradient compression for bandwidth-constrained setups
- Manage synchronization points to prevent stragglers
- Example: Use Horovod or PyTorch DDP for distributed training
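A minimal sketch of the checkpoint-to-volume pattern, assuming /mnt/checkpoints is a mounted persistent volume and using a toy PyTorch model as a stand-in for real training:

```python
import os
import torch
import torch.nn as nn

CKPT = "/mnt/checkpoints/last.pt"
model = nn.Linear(32, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

start_epoch = 0
if os.path.exists(CKPT):  # resume after container restart or preemption
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 10):
    loss = model(torch.randn(8, 32)).sum()  # stand-in for a real training step
    opt.zero_grad(); loss.backward(); opt.step()
    tmp = CKPT + ".tmp"  # write-then-rename so an interrupted save never corrupts last.pt
    torch.save({"model": model.state_dict(),
                "optimizer": opt.state_dict(),
                "epoch": epoch}, tmp)
    os.replace(tmp, CKPT)
```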
Inference Serving
Model Serving Options
- TensorFlow Serving
- Production-grade serving system for TensorFlow models
- Supports model versioning and hot swapping
- Highly optimized for TensorFlow SavedModel format
- Provides both gRPC and REST APIs
- Enables batching and high-performance inference
- Example:
tensorflow/serving:2.12.0
- NVIDIA Triton Inference Server
- Multi-framework inference server (TensorFlow, PyTorch, ONNX, etc.)
- Dynamic batching and sequence batching
- Concurrent model execution
- Model ensemble support
- Optimized for NVIDIA GPUs with TensorRT integration
- Example:
nvcr.io/nvidia/tritonserver:23.04-py3
- TorchServe
- Production serving system for PyTorch models
- Model versioning and management
- REST and gRPC endpoints
- A/B testing capabilities
- Custom handlers for preprocessing/postprocessing
- Example:
pytorch/torchserve:0.7.1-gpu
- ONNX Runtime
- Cross-platform inference engine for ONNX models
- Hardware acceleration across devices (CPU, GPU, and other accelerators) via execution providers
- Quantization and optimization support
- Wide framework compatibility
- Graph optimizations for performance
- Example:
mcr.microsoft.com/azureml/onnxruntime:latest
- Custom REST API services
- Flexible APIs built on Flask, FastAPI, or other frameworks
- Complete control over request handling and processing
- Easy integration with business logic
- Custom authentication and authorization
- Tailored scaling and deployment options
- Example: FastAPI with model loading on startup
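For the custom REST API option, here is a minimal FastAPI sketch that loads a model once at container startup and exposes a health probe; the placeholder model, request schema, and /models path are illustrative assumptions:

```python
from contextlib import asynccontextmanager

import torch
import torch.nn as nn
from fastapi import FastAPI
from pydantic import BaseModel

class PredictRequest(BaseModel):
    features: list[float]  # this placeholder model expects 4 features

state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load weights once per container, not per request.
    model = nn.Linear(4, 2)
    # model.load_state_dict(torch.load("/models/model.pt", map_location="cpu"))
    model.eval()
    state["model"] = model
    yield
    state.clear()

app = FastAPI(lifespan=lifespan)

@app.get("/health")
def health():
    return {"status": "ok"}  # used by container HEALTHCHECK / orchestrator probes

@app.post("/predict")
def predict(req: PredictRequest):
    with torch.no_grad():
        scores = state["model"](torch.tensor([req.features]))
    return {"scores": scores.squeeze(0).tolist()}
```

Inside the container this would typically be launched with uvicorn (e.g. uvicorn app:app --host 0.0.0.0 --port 8080) and paired with a Docker HEALTHCHECK hitting /health.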
Optimizing Inference Containers
Distributed Training
MLOps Integration
Containers enable end-to-end MLOps workflows, creating a seamless machine learning lifecycle:
- Version control for code, data, and models
- Git integration for code versioning
- DVC or similar tools for dataset versioning
- Model registries for artifact versioning
- Container registries for environment versioning
- Example:
git commit + dvc push + mlflow.log_model() + docker push
- Automated CI/CD for model training and deployment
- Containerized training triggered by code/data changes
- Automated testing of model performance
- Continuous deployment of models meeting quality thresholds
- Environment consistency between development and production
- Example: GitHub Actions workflow that trains, evaluates, and deploys models
- A/B testing and canary deployments for models
- Multiple model versions deployed simultaneously
- Traffic splitting between model versions
- Gradual rollout of new models
- Automated rollback based on performance metrics
- Example: Kubernetes with traffic splitting between model services
- Model monitoring and performance tracking
- Runtime performance metrics collection
- Model drift detection
- Prediction quality monitoring
- Resource utilization tracking
- Example: Prometheus + Grafana dashboards for model performance
- Reproducible experimentation and tracking
- Experiment containerization for perfect reproducibility
- Hyperparameter tracking and comparison
- Training metrics visualization
- Experiment metadata management
- Example: Weights & Biases or MLflow integrated with containerized training
Orchestrating ML Pipelines
Pipeline Components
- Data preparation containers
- Data validation and cleaning
- Format conversion and normalization
- Feature extraction from raw data
- Dataset splitting (train/validation/test)
- Data augmentation for training
- Example: Apache Beam or Luigi containers
- Feature engineering services
- Feature transformation pipelines
- Feature selection algorithms
- Dimensionality reduction
- Feature encoding and normalization
- Feature store integration
- Example: Feast or custom feature services
- Model training jobs
- Hyperparameter optimization
- Model fitting and validation
- Cross-validation execution
- Checkpoint management
- Distributed training coordination
- Example: Containers with TensorFlow, PyTorch, etc.
- Evaluation workers
- Model performance assessment
- Metric calculation and validation
- A/B comparison with baseline models
- Threshold determination
- Test dataset evaluation
- Example: Custom containers running evaluation scripts
- Deployment services
- Model packaging for production
- Serving infrastructure setup
- Canary deployment handling
- Versioning and rollback support
- Integration with API gateways
- Example: KServe or TensorFlow Serving containers
- Monitoring components
- Data drift detection
- Model performance tracking
- Resource utilization monitoring
- Prediction logging and analysis
- Alert generation for degradation
- Example: Prometheus exporters and Grafana dashboards
Workflow Orchestration
Resource Management
Hyperparameter Optimization
Containerize hyperparameter optimization workloads for scalable and reproducible tuning:
- Package search algorithms in containers
- Wrap optimization libraries (Optuna, Ray Tune, Hyperopt) in containers
- Standardize optimization interfaces across frameworks
- Configure algorithm-specific parameters via environment variables
- Support various search strategies (Bayesian, grid, random, evolutionary)
- Example:
docker run -e SEARCH_SPACE='{"lr": [0.001, 0.01]}' optuna-container
- Parallelize trials across multiple containers
- Distribute independent trials across worker containers
- Implement master-worker pattern for coordinated search
- Use shared storage for coordination and results
- Scale dynamically based on available resources
- Example: Kubernetes Jobs for parallel hyperparameter trials
- Implement resource-aware scheduling
- Set appropriate resource limits for each trial
- Prioritize promising trials based on early results
- Implement early stopping for underperforming trials
- Balance exploration vs. exploitation in resource allocation
- Example: Kubernetes pod resource requests/limits with priority classes
- Persist and version optimization results
- Store comprehensive trial data and configurations
- Implement results database with query capabilities
- Version control hyperparameter search spaces
- Maintain reproducibility through configuration snapshots
- Example: PostgreSQL container for structured optimization results
- Integrate with experiment tracking systems
- Connect optimization with MLflow, W&B, or TensorBoard
- Visualize optimization progress in real-time
- Compare multiple optimization runs
- Analyze parameter importance and interactions
- Example: Optuna dashboard or Ray Tune with TensorBoard integration
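A sketch of a worker container for parallel trials, assuming a shared Optuna storage backend (for example a PostgreSQL container) reachable through an OPTUNA_STORAGE environment variable; the objective function here is a stand-in for real training:

```python
import os
import optuna

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    # ... train the model here and return the validation metric ...
    return (lr - 0.01) ** 2 + batch_size * 1e-5  # stand-in for validation loss

if __name__ == "__main__":
    study = optuna.create_study(
        study_name="tuning-demo",
        direction="minimize",
        storage=os.environ.get("OPTUNA_STORAGE", "sqlite:///optuna.db"),
        load_if_exists=True,  # identical worker containers all join the same study
        pruner=optuna.pruners.MedianPruner(),  # early-stop underperforming trials
    )
    study.optimize(objective, n_trials=int(os.environ.get("N_TRIALS", "20")))
```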
Deployment Architectures
Edge Deployment
- Optimized containers for edge devices
- Minimized container size for limited storage
- Platform-specific builds (ARM, x86, RISC-V)
- Specialized base images (Alpine, Distroless)
- Static linking to reduce dependencies
- Example: FROM arm32v7/python:3.9-slim for Raspberry Pi deployment
- Model quantization and pruning
- Int8/FP16 quantization for reduced memory footprint
- Weight pruning for smaller model size
- Knowledge distillation for compact student models
- Post-training optimization techniques
- Example: TensorFlow Lite models with 75% size reduction
- Runtime optimization
- Hardware-specific acceleration (NEON, AVX)
- Memory mapping for efficient loading
- Thread and process optimization
- Batch size tuning for latency vs throughput
- Example: ONNX Runtime with custom execution providers
- Resource-constrained environments
- CPU/RAM/storage limitations management
- Thermal and power consumption considerations
- Offline operation capabilities
- Graceful degradation under resource pressure
- Example: Container configured with --memory=512m --cpus=0.5
- Update strategies for edge models
- Delta updates to minimize bandwidth
- A/B model deployment for validation
- Rollback mechanisms for failed updates
- Version compatibility verification
- Example: Container image layering for efficient updates
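One of several compression paths mentioned above, sketched with PyTorch post-training dynamic quantization; the model is a placeholder, and TensorFlow Lite or ONNX Runtime offer comparable flows:

```python
import torch
import torch.nn as nn

# Placeholder network standing in for a trained float32 model.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Post-training dynamic quantization: weights stored as int8, activations
# quantized on the fly; Linear/LSTM layers are typically ~4x smaller on CPU.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

torch.save(quantized.state_dict(), "/tmp/model_int8.pt")  # ship this in the edge image
```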
Cloud Deployment
- Scalable inference APIs
- RESTful and gRPC API interfaces
- Stateless design for horizontal scaling
- Asynchronous processing for batch requests
- Client libraries for multiple languages
- Example: KServe or TorchServe behind API Gateway
- Auto-scaling model servers
- Horizontal pod autoscaling based on CPU/memory/custom metrics
- Prediction request queue-based scaling
- Minimum replicas for baseline performance
- GPU utilization-based scaling policies
- Example: Kubernetes HPA with custom metrics from Prometheus
- Load balancing strategies
- Round-robin for stateless inference
- Session affinity for stateful models
- Weighted distribution based on instance capacity
- Latency-based routing for global deployments
- Example: Cloud load balancer with health checks
- High-availability configurations
- Multi-zone and multi-region deployments
- Automated failover mechanisms
- Redundant model server instances
- State replication where needed
- Example: Multi-regional Kubernetes clusters with PodDisruptionBudget
- Cloud-native integrations
- Managed Kubernetes services (EKS, GKE, AKS)
- Serverless inference (AWS Lambda, Cloud Run, Azure Functions)
- Cloud monitoring and logging integration
- Identity and access management integration
- Example: AWS SageMaker with auto-scaling inference endpoints
Performance Optimization
Real-world ML Use Cases
Common containerized ML applications and their containerization patterns:
- Recommendation engines with scalable inference
- Microservice architecture with separate retrieval and ranking services
- Redis or Elasticsearch containers for candidate generation
- Batch prediction containers for offline feature generation
- Real-time feature servers for online inference
- A/B testing infrastructure for recommendation strategies
- Example: E-commerce product recommendations using matrix factorization
- Natural language processing pipelines
- Text preprocessing containers (tokenization, normalization)
- Model containers for specific NLP tasks (classification, NER, summarization)
- Multi-stage pipelines with intermediate result caching
- Language-specific processing containers
- Scalable transformer model deployment with optimized inference
- Example: Customer support automation with BERT-based intent classification
- Computer vision services with GPU acceleration
- Image preprocessing containers for normalization and augmentation
- Object detection services with GPU acceleration
- Video processing pipelines with frame extraction
- Model ensembles for improved accuracy
- Edge deployment for camera integration
- Example: Manufacturing quality control with defect detection models
- Time series forecasting with model versioning
- Data ingestion containers for time series collection
- Feature engineering specific to temporal data
- Multiple forecasting models for different time horizons
- Backtesting frameworks for model evaluation
- Versioned model deployment for forecast comparison
- Example: Financial market prediction with ensemble LSTM models
- Anomaly detection systems with streaming data
- Stream processing containers (Kafka Streams, Flink)
- Online learning models for concept drift adaptation
- Threshold computation containers for alert generation
- Dashboard containers for anomaly visualization
- Alert management and notification services
- Example: Network security monitoring with unsupervised anomaly detection
- Reinforcement learning environments
- Simulation environment containers for agent training
- Experience replay databases for offline learning
- Distributed training with parameter server architecture
- Policy serving containers for agent deployment
- Monitoring services for reward tracking
- Example: Industrial process optimization with PPO algorithms
Advanced Topics
Multi-node Training
- Container orchestration for distributed training
- Kubernetes for managing training pods across nodes
- Custom operators for ML workloads (KubeFlow, Ray)
- Resource allocation optimization for heterogeneous clusters
- Training job scheduling and prioritization
- Example: Kubernetes StatefulSets for ordered pod creation
- Parameter servers and workers
- Architectural patterns for distributed optimization
- Sharded parameter servers for large models
- Asynchronous vs. synchronous parameter updates
- Communication topology optimization
- Example: TensorFlow tf.distribute.ParameterServerStrategy
- Network optimization for data transfer
- RDMA/RoCE for high-speed GPU communication
- Gradient compression techniques to reduce bandwidth
- Topology-aware pod placement
- Custom container networking plugins
- Example: NVIDIA NCCL with InfiniBand for GPU communication
- Checkpoint management
- Distributed checkpoint coordination
- Incremental checkpointing strategies
- Cloud storage integration for durability
- Checkpoint validation and corruption detection
- Example: Distributed TensorFlow CheckpointManager
- Failure recovery strategies
- Preemption-aware training processes
- Automatic worker replacement
- Elastic training group management
- Gradual scaling with minimal recomputation
- Example: PyTorch Elastic for fault-tolerant training
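A minimal sketch of data-parallel training with PyTorch DDP, assuming the containers are launched by torchrun or an ML operator that provides RANK, WORLD_SIZE, MASTER_ADDR, and LOCAL_RANK; the model and data are placeholders:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main() -> None:
    # Uses the env:// rendezvous populated by torchrun / the orchestrator.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    model = nn.Linear(64, 8).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if device.type == "cuda" else None)
    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-3)

    for _ in range(100):  # stand-in training loop; gradients sync across workers
        x = torch.randn(32, 64, device=device)
        loss = ddp_model(x).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Each node's container would then run something like torchrun --nnodes=2 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=master:29500 train.py.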
Federated Learning
- Container-based federated learning nodes
- Self-contained training environments on edge devices
- Minimal runtime dependencies for diverse deployments
- Standardized APIs for model and update exchange
- Resource-constrained container optimization
- Example: TensorFlow Federated client containers
- Secure aggregation strategies
- Cryptographic protocols in containerized services
- Secure multi-party computation containers
- Zero-knowledge proof systems
- Threshold cryptography implementations
- Example: PySyft containers for secure aggregation
- Privacy-preserving techniques
- Differential privacy implementation containers
- Local vs. global privacy budget management
- Privacy-preserving preprocessing pipelines
- Anonymization service containers
- Example: TensorFlow Privacy with configurable DP parameters
- Edge-to-cloud coordination
- Asynchronous update mechanisms
- Connection management for intermittent availability
- Bandwidth-aware synchronization strategies
- Multi-tier aggregation hierarchies
- Example: MQTT-based communication for lightweight coordination
- Model update synchronization
- FedAvg and advanced aggregation algorithms
- Weight divergence monitoring
- Conflict resolution for concurrent updates
- Version control for model iterations
- Example: Flower framework for federated learning orchestration
Best Practices
Follow these guidelines for AI/ML containers to ensure reliability, scalability, and security:
- Version lock all dependencies for reproducibility
- Use explicit versions for all packages in requirements.txt
- Pin operating system packages to specific versions
- Hash dependencies for absolute consistency
- Maintain a dependency inventory with security metadata
- Consider tools like pip-compile or poetry for dependency management
- Example:
tensorflow==2.12.0 numpy==1.24.3 pandas==2.0.1
- Implement proper error handling and recovery
- Graceful degradation for missing dependencies or models
- Comprehensive exception handling with appropriate logging
- Automatic retry mechanisms with backoff strategies
- Monitoring hooks for critical failures
- Health check endpoints for orchestration systems
- Example: Circuit breakers for external service dependencies
- Design for scalability from the beginning
- Stateless design where possible for horizontal scaling
- Efficient resource utilization with proper memory management
- Optimized I/O patterns for high-throughput scenarios
- Parameterized performance controls (batch size, threading)
- Load testing with realistic data volumes
- Example: Asynchronous prediction APIs with configurable worker pools
- Separate compute-intensive and serving workloads
- Different container configurations for training and inference
- Specialized resource allocations based on workload type
- Right-sized containers for each stage of the pipeline
- Batch processing containers vs. low-latency serving containers
- Separate scaling policies for different workload types
- Example: GPU-enabled training containers with CPU-only inference services
- Implement comprehensive logging and monitoring
- Structured logging with consistent formats
- Performance metrics collection at appropriate granularity
- Distributed tracing for complex pipelines
- Alerting on critical model performance degradation
- Resource utilization tracking with time-series data
- Example: Prometheus metrics for model latency, throughput, and accuracy
- Ensure proper security for model artifacts
- Access control for model files and parameters
- Encryption for sensitive model weights
- Vulnerability scanning in container images
- Secure communication between distributed components
- Model lineage tracking for auditability
- Example: HashiCorp Vault for managing model access credentials
- Optimize container size for each stage of the ML lifecycle
- Multi-stage builds to eliminate build dependencies
- Minimal base images appropriate for each phase
- Layer optimization to reduce image size and build time
- Development containers with debugging tools
- Production containers with minimal attack surface
- Example: Development container with debugging tools vs. slim production container
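As an example of the error-handling guidance above, a small retry-with-exponential-backoff wrapper for calls to external dependencies (feature store, model registry, and so on); the names and limits are illustrative:

```python
import logging
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("serving")

def with_retries(fn: Callable[[], T], attempts: int = 5, base_delay: float = 0.5) -> T:
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # narrow to the dependency's error types in practice
            if attempt == attempts:
                log.error("giving up after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            log.warning("attempt %d failed (%s); retrying in %.2fs", attempt, exc, delay)
            time.sleep(delay)
    raise RuntimeError("unreachable")
```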
Troubleshooting Guide
Common Issues
- GPU not detected in container
- NVIDIA Container Toolkit not installed or configured properly
- Incorrect --gpus flag usage or missing GPU capabilities
- Driver/CUDA version incompatibility
- GPU visibility issues in nested virtualization
- Permission problems accessing GPU devices
- Example error: could not select device driver "" with capabilities: [[gpu]]
- Out of memory errors during training
- Batch size too large for available GPU memory
- Memory leaks from not releasing tensors properly
- Insufficient container memory limits
- Fragmented GPU memory after long training
- Multiple processes competing for same GPU
- Example error: "CUDA out of memory. Tried to allocate 2.00 GiB"
- Model loading failures
- Incompatible serialization format versions
- Missing model files or incorrect paths
- Framework version mismatches between save/load
- Corrupted model files from interrupted saves
- Insufficient permissions for model directories
- Example error: "Error loading model: KeyError: 'unexpected key in state_dict'"
- Performance degradation
- CPU throttling due to thermal issues
- Resource contention with other containers
- Inefficient data loading creating bottlenecks
- Network saturation in distributed training
- Suboptimal container resource limits
- Example symptom: Training iterations becoming progressively slower
- Data access bottlenecks
- Inefficient volume mounts or network storage
- Missing data caching strategies
- Sequential data access patterns
- Improper buffer sizes for I/O operations
- Container networking limitations
- Example symptom: High wait times in I/O profiling
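When chasing the out-of-memory and performance issues listed above, a quick in-container report of GPU allocator state often narrows things down; a minimal PyTorch sketch:

```python
import torch

def report_gpu_memory(tag: str = "") -> None:
    """Log allocator state inside the container to help diagnose OOM or fragmentation."""
    if not torch.cuda.is_available():
        print("no CUDA device visible")
        return
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)
        print(f"[{tag}] gpu{i}: "
              f"allocated={torch.cuda.memory_allocated(i) / 2**30:.2f} GiB "
              f"reserved={torch.cuda.memory_reserved(i) / 2**30:.2f} GiB "
              f"free={free / 2**30:.2f} / total={total / 2**30:.2f} GiB")

if __name__ == "__main__":
    report_gpu_memory("startup")
```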