MLOps on Kubernetes: A Practical Guide
Setting up a complete MLOps pipeline on Kubernetes with auto-scaling, model versioning, and monitoring.
Why Kubernetes for ML Workloads?
Machine learning systems have unique infrastructure demands that do not map cleanly onto traditional web application deployment patterns. Training jobs need GPU nodes that should scale to zero when idle. Inference services need autoscaling based on request latency, not just CPU utilization. Feature pipelines need scheduled execution with complex dependency graphs. Data scientists want to experiment without breaking production.
Kubernetes handles all of these requirements -- but only if you set it up correctly. After building MLOps platforms at multiple organizations, I have converged on a set of patterns and tools that work reliably. This guide walks through the architecture end to end.
The Reference Architecture
Our MLOps platform on Kubernetes consists of five layers:
- Compute Layer: EKS/GKE cluster with heterogeneous node pools (CPU, GPU, high-memory)
- Orchestration Layer: Argo Workflows for training pipelines, Argo CD for deployment
- Serving Layer: KServe (formerly KFServing) for model inference with autoscaling
- Storage Layer: S3-compatible object storage for artifacts, PostgreSQL for metadata
- Observability Layer: Prometheus, Grafana, and custom ML-specific metrics
Let me walk through each layer and the decisions behind them.
Compute: Node Pools and Scheduling
The cluster runs three node pools:
```yaml
# GPU pool for training jobs
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-training
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["g5.xlarge", "g5.2xlarge"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  limits:
    resources:
      nvidia.com/gpu: "8"
  ttlSecondsAfterEmpty: 300
```

We use Karpenter instead of Cluster Autoscaler for GPU nodes. Karpenter provisions the right instance type for each workload and -- critically -- scales to zero when no training jobs are running. GPU instances are expensive, and idle GPU nodes are the single largest waste in most ML platforms. The ttlSecondsAfterEmpty: 300 setting tears down nodes five minutes after their last pod completes.
For inference, we run a separate pool of CPU-optimized instances (c6i family) with the standard Kubernetes Horizontal Pod Autoscaler. Inference workloads have more predictable resource requirements and benefit from always-on capacity for low-latency responses.
Training Pipelines with Argo Workflows
Every training pipeline is defined as an Argo Workflow DAG. A typical pipeline has these stages:
- Data validation: Schema checks, drift detection against the training baseline
- Feature engineering: Transformations executed as containerized steps
- Training: The actual model training, potentially distributed across multiple GPUs
- Evaluation: Automated metrics computation against held-out test sets
- Registration: Pushing the model artifact and metadata to the model registry
- Promotion gate: Automated check -- does this model beat the current production model on key metrics?
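Wired together as an Argo Workflows DAG, the pipeline skeleton looks roughly like the following. This is a sketch: the task and template names are illustrative, and the container and script templates behind each task are omitted.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fraud-detector-training-
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: validate-data
            template: validate-data
          - name: build-features
            template: build-features
            dependencies: [validate-data]
          - name: train
            template: train
            dependencies: [build-features]
          - name: evaluate
            template: evaluate
            dependencies: [train]
          - name: register
            template: register
            dependencies: [evaluate]
          - name: promotion-gate
            template: promotion-gate
            dependencies: [register]
    # ...container and script templates for each task go here...
```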
The promotion gate is essential. Without it, you end up with manual "looks good to me" approvals that either bottleneck the pipeline or get rubber-stamped. Our gate checks three conditions: the candidate model must exceed the production model's accuracy by a configurable threshold (typically 0.5%), must not regress on any monitored fairness metric, and must meet latency requirements when served on the target hardware.
```yaml
- name: promotion-gate
  script:
    image: mlops/evaluator:latest
    command: [python]
    source: |
      import sys

      from mlops.registry import get_production_model, get_candidate_model
      from mlops.evaluation import compare_models

      prod = get_production_model("{{workflow.parameters.model_name}}")
      candidate = get_candidate_model("{{workflow.parameters.run_id}}")
      result = compare_models(prod, candidate, metrics=["accuracy", "f1", "latency_p99"])

      if result.candidate_wins:
          print("PROMOTION: Candidate passes all gates")
      else:
          print(f"BLOCKED: {result.failure_reasons}")
          sys.exit(1)
```

Model Serving with KServe
KServe provides a standardized inference protocol with powerful autoscaling. We define each model as an InferenceService custom resource:
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
  annotations:
    serving.kserve.io/autoscalerClass: hpa
    serving.kserve.io/targetUtilizationPercentage: "70"
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 20
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://models/fraud-detector/v12"
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
```

KServe handles canary deployments natively. When we promote a new model version, we roll it out to 10% of traffic, monitor for 30 minutes, then gradually increase to 100%. If error rates spike during canary, an automated rollback triggers.
The autoscaling configuration deserves attention. We scale on a custom metric -- inference latency p95 -- rather than CPU utilization. ML models can exhibit high latency before CPU saturates, especially with batch preprocessing. A custom HPA metric ensures we scale proactively.
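Concretely, the latency-based scaling looks roughly like the HPA below. This is a sketch that assumes a Prometheus adapter exposes a per-pod inference_latency_p95_ms metric through the custom metrics API; the metric name, target value, and deployment name are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-detector-predictor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-detector-predictor
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_latency_p95_ms
        target:
          type: AverageValue
          averageValue: "150"   # scale out when average p95 exceeds ~150 ms
```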
Feature Store Integration
Feature pipelines run on two cadences: batch (hourly/daily via Argo cron workflows) and real-time (Kafka consumers writing to Redis). The serving layer reads features from both stores at inference time.
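The batch cadence is just an Argo CronWorkflow. A sketch of an hourly job (the image and arguments are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: hourly-feature-build
spec:
  schedule: "0 * * * *"        # top of every hour
  concurrencyPolicy: Forbid    # never let runs overlap
  workflowSpec:
    entrypoint: build-features
    templates:
      - name: build-features
        container:
          image: mlops/feature-builder:latest
          args: ["--window", "1h"]
```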
The critical design decision was making feature computation deterministic and versioned. Every feature transformation is a pure function with a version hash. Training pipelines record which feature versions they used, and the serving layer serves the same versions. This eliminates training-serving skew, which is the most common source of silent model degradation in production.
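A simplified sketch of the idea -- not our actual feature store code; the transformation and hashing helper are illustrative -- is to hash the transformation's source and carry that hash through training metadata and the serving path:

```python
import hashlib
import inspect


def feature_version(fn) -> str:
    """Derive a stable version hash from a transformation's source code."""
    source = inspect.getsource(fn)
    return hashlib.sha256(source.encode()).hexdigest()[:12]


def days_since_last_login(last_login_ts: int, now_ts: int) -> float:
    """A pure feature transformation: same inputs, same output, no hidden state."""
    return (now_ts - last_login_ts) / 86400.0


# Training records feature_version(days_since_last_login) alongside the model;
# serving refuses to run a transformation whose hash does not match.
print(days_since_last_login.__name__, feature_version(days_since_last_login))
```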
Observability: Beyond Standard Metrics
Standard Kubernetes metrics (CPU, memory, pod restarts) are necessary but insufficient for ML workloads. We add three categories of ML-specific monitoring:
Data quality metrics: Input feature distributions are monitored in real time. When the distribution of an input feature drifts beyond a configured threshold (measured by Population Stability Index), an alert fires. This catches upstream data pipeline issues before they corrupt model predictions.
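For reference, PSI is straightforward to compute. A minimal sketch, using baseline-derived quantile bins and the usual rule-of-thumb thresholds (your monitoring stack may bin differently):

```python
import numpy as np


def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a baseline sample (expected) and a live sample (actual).

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift.
    """
    # Bin edges come from the baseline so both samples are bucketed identically.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid division by zero and log(0).
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))
```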
Model performance metrics: We compute rolling accuracy, precision, and recall against delayed ground truth labels. These metrics are exported as Prometheus gauges and visualized in Grafana dashboards per model, per version.
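As a sketch of the export side (using the standard prometheus_client library; metric and label names here are illustrative):

```python
from prometheus_client import Gauge

# One gauge per metric, labelled by model and version so dashboards
# can be broken down per deployment.
ROLLING_ACCURACY = Gauge(
    "model_rolling_accuracy",
    "Rolling accuracy against delayed ground-truth labels",
    ["model_name", "model_version"],
)


def report_accuracy(model_name: str, model_version: str, accuracy: float) -> None:
    ROLLING_ACCURACY.labels(model_name=model_name, model_version=model_version).set(accuracy)


# e.g. after joining predictions with late-arriving labels:
# report_accuracy("fraud-detector", "v12", 0.943)
```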
Business metrics: Ultimately, model quality is measured by business outcomes. We instrument downstream systems to report business KPIs (conversion rate, fraud detection rate, recommendation click-through) and correlate them with model versions. This closes the feedback loop between ML engineering and business value.
GitOps for Everything
Every configuration described above lives in git. Argo CD watches the repository and reconciles the cluster state. This means:
- Model deployments are pull requests, reviewed and auditable
- Rollbacks are git revert operations
- The entire platform state can be reconstructed from the repository
- New environments (staging, production) are directory copies with overrides
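Each deployable unit is an Argo CD Application pointing at a path in the repository. A sketch, with placeholder repository URL, project, and paths:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fraud-detector-prod
  namespace: argocd
spec:
  project: ml-platform
  source:
    repoURL: https://github.com/example-org/mlops-platform.git
    targetRevision: main
    path: environments/production/fraud-detector
  destination:
    server: https://kubernetes.default.svc
    namespace: models
  syncPolicy:
    automated:
      prune: true      # delete resources removed from git
      selfHeal: true   # revert manual drift in the cluster
```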
Lessons and Pitfalls
Do not share GPU nodes between training and inference. Training jobs are bursty and will starve inference pods of resources. Use separate node pools with taints and tolerations.
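In practice that means a taint on the training pool and a matching toleration only on training pods; the key and value below are arbitrary placeholders:

```yaml
# On the GPU provisioner / node group:
taints:
  - key: workload-type
    value: training
    effect: NoSchedule
---
# On training pods (e.g. in the Argo Workflow pod spec):
tolerations:
  - key: workload-type
    operator: Equal
    value: training
    effect: NoSchedule
```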
Invest in artifact storage early. Model artifacts, training data snapshots, and evaluation results accumulate fast. Set up lifecycle policies and a clear naming convention from day one. We use the pattern s3://models/{model_name}/v{version}/ with metadata stored in MLflow.
Make the happy path easy. If deploying a model requires 15 manual steps, people will skip steps. Our data scientists run a single command -- make deploy MODEL=fraud-detector VERSION=12 -- which triggers the entire pipeline from evaluation through canary deployment.
Plan for multi-tenancy from the start. If you will ever serve multiple teams or products, namespace isolation, resource quotas, and RBAC should be in the initial design, not bolted on later.
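A per-team namespace with a ResourceQuota is the minimum viable version of this; the team name and limits below are illustrative:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-fraud-quota
  namespace: team-fraud
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    requests.nvidia.com/gpu: "4"
    count/inferenceservices.serving.kserve.io: "10"
```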
Conclusion
Kubernetes is not the simplest way to serve a single model. But when you need to support multiple models, multiple teams, heterogeneous hardware, and production-grade reliability, it becomes the most cost-effective platform. The key is investing in automation and guardrails so that the complexity of Kubernetes is hidden behind simple, safe interfaces that data scientists actually want to use.