## How to Deploy Local LLMs to Kubernetes
1. Provision GPU nodes with at least 24 GB VRAM and install the NVIDIA GPU Operator via Helm.
2. Verify GPU availability by checking nvidia.com/gpu in node allocatable resources.
3. Select a serving framework — start with a plain vLLM Deployment for single-model setups.
4. Configure GPU resource requests/limits and taint nodes for dedicated inference workloads.
5. Deploy vLLM using the provided Helm chart with your Hugging Face token secret.
6. Expose custom Prometheus metrics (queue depth) via the Prometheus Adapter.
7. Enable HPA autoscaling targeting vllm_queue_depth with tuned stabilization windows.
8. Validate the full pipeline by sending a completion request and confirming a JSON response.
Running local LLMs on Kubernetes gives DevOps teams a self-hosted inference path that is health-checked, autoscaled, and rolling-updatable, without depending on costly cloud API endpoints. This guide walks through the full pipeline: from preparing GPU nodes and installing the NVIDIA GPU Operator, through selecting a serving framework, to deploying a complete Helm chart for vLLM with custom-metric autoscaling.
Deploying these workloads on Kubernetes brings together data privacy guarantees, infrastructure-bound costs that scale with hardware utilization rather than per-token API pricing, and lower latency by keeping model serving inside the cluster boundary (eliminating the network round-trip to external APIs, typically 50 to 200 ms).
Kubernetes fits inference workload management well because its core orchestration primitives, including scheduling, health checks, rolling updates, and horizontal scaling, map directly onto the operational requirements of serving large language models. GPU-aware scheduling ensures pods land on nodes with available accelerators, while liveness and readiness probes guard against serving stale or crashed model instances. Rolling updates then enable zero-downtime model version swaps.
The target audience is DevOps and platform engineers with intermediate Kubernetes experience who want a reproducible, opinionated deployment rather than scattered documentation fragments. For broader context on running models outside cloud APIs, the Running LLMs Locally hub covers the wider set of approaches.
## Prerequisites: Preparing Your Cluster for GPU Workloads
### Hardware and Cluster Requirements
A minimum viable GPU node for serving a 7B-parameter model (such as Mistral 7B or Llama 2 7B) requires an NVIDIA GPU with at least 24 GB of VRAM. 24 GB provides headroom for the KV-cache beyond the ~14 GB weight footprint; a 16 GB GPU is feasible for small batch sizes but will constrain throughput. The NVIDIA A10G and L4 are common choices in cloud environments, while the A100 (40 GB or 80 GB) provides headroom for larger models or higher throughput via increased KV-cache capacity. Each inference node should have sufficient system RAM (at least 32 GB) and fast local or network-attached storage if model weights will be cached locally rather than streamed from object storage.
This guide assumes Kubernetes 1.27 or later. For reference, the individual components have lower minimums: the autoscaling/v2 HPA API requires Kubernetes 1.23+, and GPU Operator 23.x requires Kubernetes 1.24+. Managed Kubernetes services (EKS, GKE, AKS) simplify GPU node provisioning through dedicated node pools with pre-configured machine types. Bare-metal clusters require manual NVIDIA driver management unless the GPU Operator handles it, which is the approach outlined below.
The full walkthrough requires the following:
- Prometheus deployed and configured to scrape pod annotations in the inference namespace (e.g., via kube-prometheus-stack).
- Prometheus Adapter installed and configured (covered in the autoscaling section below).
- A Hugging Face account with Mistral-7B-Instruct-v0.2 model terms accepted and an API token generated. Gated models require authentication; without a token, pod startup will fail.
### Installing the NVIDIA GPU Operator
The NVIDIA GPU Operator automates the full stack needed to run GPU containers on Kubernetes: host NVIDIA drivers, the NVIDIA Container Toolkit, the Kubernetes device plugin that advertises nvidia.com/gpu resources, and optional GPU monitoring via DCGM Exporter. Rather than baking drivers into node images and managing version drift, the Operator deploys everything as DaemonSets.
```plain text
--set devicePlugin.enabled=true \
```
Setting driver.enabled=true tells the Operator to install and manage NVIDIA drivers on the host. On managed cloud node pools where drivers are pre-installed, set this to false to avoid conflicts. The dcgmExporter.enabled=true flag deploys NVIDIA DCGM Exporter, which exposes GPU utilization, temperature, and memory metrics to Prometheus.
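The flags discussed above fit into an install along the following lines. This is a sketch under stated assumptions rather than the guide's exact command: the release name `gpu-operator` and the wrapper function `install_gpu_operator` are illustrative, and the namespace is chosen to match the `gpu-operator` namespace used in the verification steps. Wrapping the commands in a function means nothing runs until you call it.

```shell
# Sketch: install the NVIDIA GPU Operator with Helm.
# Release name and function name are illustrative assumptions.
install_gpu_operator() {
  # Register NVIDIA's Helm repository and refresh the local index
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
  helm repo update

  # Install the operator with the three flags discussed in the text.
  # Set driver.enabled=false on node pools that ship with drivers pre-installed.
  helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --create-namespace \
    --set driver.enabled=true \
    --set devicePlugin.enabled=true \
    --set dcgmExporter.enabled=true
}
```

Call `install_gpu_operator` from a shell that has `helm` configured against the target cluster.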
### Verifying GPU Availability
After the Operator pods reach a Running state (which typically takes 3 to 8 minutes on first install as drivers compile; see the NVIDIA GPU Operator documentation for version-specific timing), verify that GPU resources are visible to the scheduler. You can poll readiness with:
```plain text
kubectl get pods -n gpu-operator -w
```
Wait until all pods show Running or Completed, then verify GPU resources:
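```plain text
# -A 10 so the nvidia.com/gpu line under Allocatable is not cut off
kubectl describe node <gpu-node-name> | grep -A 10 "Allocatable"

# Run a one-off pod that requests a GPU and prints nvidia-smi output
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
  namespace: gpu-operator
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.3.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# Read the nvidia-smi output once the pod has completed
kubectl logs -n gpu-operator gpu-test
```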
The nvidia-smi output should display the GPU model, driver version, and available memory. If nvidia.com/gpu does not appear in allocatable resources, the device plugin DaemonSet likely has not started correctly. Check GPU Operator pod logs in the gpu-operator namespace.
## Choosing a Serving Framework: KServe vs. Ray Serve vs. Simple Deployment
### Option 1: Plain Kubernetes Deployment
A standard Kubernetes Deployment wrapping a vLLM or similar container is the simplest approach. It fits single-model, low-complexity setups where teams want full control over the pod spec and no additional abstractions. You get no extra CRDs, no framework-specific operational overhead, and straightforward debugging. On the other hand, you get no dynamic batching management at the orchestration layer (though vLLM handles continuous batching internally), no multi-model routing, and no traffic splitting unless you add an ingress layer yourself.
### Option 2: Ray Serve on Kubernetes (KubeRay)
Ray Serve, deployed via the KubeRay operator, suits teams running multi-model pipelines or who are deeply invested in the Python ecosystem. A typical use case: a pipeline chaining an embedding model with a reranker and a generator, all managed as a single deployment graph. Ray Serve provides autoscaling at the actor level, dynamic batching, and model composition within that graph. The cost is operational complexity: a Ray head node, worker nodes, and the KubeRay CRDs add moving parts. Resource management across Ray actors and Kubernetes pods creates a two-layer scheduling problem that can be difficult to debug.
### Option 3: KServe (ModelMesh or Serverless)
Teams serving dozens of models at enterprise scale use KServe for its standardized V2 inference protocol. ModelMesh multiplexes many models onto shared GPU pods, making it efficient when serving dozens of smaller models. KServe's serverless (scale-to-zero) mode requires Knative Serving and an ingress controller (Istio or Kourier). KServe's RawDeployment mode requires neither Knative nor a service mesh, making it significantly lighter to operate.
### Decision Matrix
| Criteria | Plain Deployment | Ray Serve (KubeRay) | KServe |
| --- | --- | --- | --- |
| Setup complexity | Low | Medium-High | High |
| Multi-model support | None (manual) | Native (deployment graph) | Native (ModelMesh) |
| Autoscaling granularity | HPA on custom metrics | Per-actor autoscaling | KPA / HPA with Knative |
| Community maturity | Mature (core K8s primitives) | Growing | Established |
| GPU utilization efficiency | One model per GPU | Flexible actor placement | Model multiplexing |
Start with a plain Deployment running vLLM. vLLM's internal continuous batching and PagedAttention memory management handle the serving-layer optimizations, while Kubernetes handles orchestration. Teams can graduate to KServe or Ray Serve as multi-model, canary, or pipeline requirements emerge. For a deeper comparison of serving engines including Ollama, see the Ollama vs vLLM article, which contextualizes why vLLM's throughput characteristics make it a strong choice for production deployments.
## Resource Management: GPU Requests, Limits, and Bin-Packing
### Setting GPU Requests and Limits
GPU resources in Kubernetes behave differently from CPU and memory. The nvidia.com/gpu resource is integer-only and non-overcommittable: a request of 1 reserves one entire GPU, and the standard device plugin does not support fractional requests. If you set both a request and a limit for nvidia.com/gpu, they must be identical; CPU and memory requests may differ from their limits. Time-slicing (configured through the GPU Operator) lets multiple pods share a GPU, but without memory isolation.
For a 7B-parameter model running in float16, model weights alone consume roughly 14 GB of VRAM. The remaining VRAM on a 24 GB GPU serves the KV-cache. Setting CPU requests to 4 cores and memory to 32 GB accounts for tokenization overhead, model loading, and the serving framework's host-side memory.
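The sizing arithmetic above can be written out as a quick back-of-the-envelope calculation. The variable names are illustrative, and the result ignores activation memory and CUDA runtime overhead, so treat the remainder as an upper bound on KV-cache space rather than a measured figure.

```shell
# Back-of-the-envelope VRAM budget in whole (decimal) gigabytes.
params_billion=7     # model size in billions of parameters
bytes_per_param=2    # float16 stores each weight in 2 bytes
gpu_vram_gb=24       # e.g. an A10G or L4

# 1 billion parameters x 1 byte is roughly 1 GB, so the product is already in GB.
weights_gb=$(( params_billion * bytes_per_param ))
remaining_gb=$(( gpu_vram_gb - weights_gb ))

echo "weights: ${weights_gb} GB"
echo "left for KV-cache and runtime overhead: ${remaining_gb} GB"
```

Changing `params_billion` to 13 shows why a 13B float16 model no longer fits on a 24 GB card: the weights alone come to 26 GB.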
### Dealing with Bin-Packing and Fragmentation
Because GPU allocation is all-or-nothing at the device level, a pod using 14 GB on a 24 GB GPU leaves 10 GB stranded: Kubernetes cannot schedule another pod onto that GPU. Two strategies address this. NVIDIA Multi-Instance GPU (MIG), on supported hardware (A100, H100, A30), partitions a physical GPU into isolated instances with dedicated memory and compute slices. NVIDIA Multi-Process Service (MPS) allows multiple processes to share a GPU, though without the memory isolation guarantees of MIG. The A10G and L4 recommended in this guide do not support MIG, so MPS is the option on those GPUs.
At the Kubernetes level, dedicating GPU nodes to inference workloads via taints prevents non-GPU pods from occupying these expensive nodes. Apply the taint to each GPU node:
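A minimal sketch of the taint step follows. The taint `nvidia.com/gpu=present:NoSchedule` and the helper name `taint_gpu_node` are assumptions chosen for illustration; any key, value, and effect work as long as the pod toleration matches them.

```shell
# Hypothetical taint: key, value, and effect are assumptions, not taken from this guide.
GPU_TAINT="nvidia.com/gpu=present:NoSchedule"

taint_gpu_node() {
  # $1 = node name; --overwrite makes the command safe to re-run
  kubectl taint nodes "$1" "$GPU_TAINT" --overwrite
}
```

Call it once per GPU node, for example `taint_gpu_node <gpu-node-name>`, or loop over the output of `kubectl get nodes` filtered by your GPU node label.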
Then in the pod spec, add a matching toleration and node affinity:
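The fragment below sketches what that toleration and node affinity could look like, written to a file so it can be merged into the Deployment's pod template. It assumes the hypothetical taint `nvidia.com/gpu=present:NoSchedule` and assumes GPU nodes carry the `nvidia.com/gpu.present=true` label that the GPU Operator normally applies; verify the label on your nodes with `kubectl get nodes --show-labels` before relying on it. The file name is illustrative.

```shell
# Write a pod-template fragment with a toleration and a required node affinity.
# The quoted 'EOF' keeps the shell from expanding anything inside the heredoc.
cat > gpu-scheduling-patch.yaml <<'EOF'
spec:
  template:
    spec:
      tolerations:
        # Must match the taint applied to the GPU nodes (assumed values)
        - key: nvidia.com/gpu
          operator: Equal
          value: present
          effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  # Label assumed to be set on GPU nodes by the GPU Operator
                  - key: nvidia.com/gpu.present
                    operator: In
                    values: ["true"]
EOF
```

The toleration lets the pod onto tainted nodes, while the affinity rule keeps it off everything else; the two together give dedicated placement.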
## Autoscaling Inference: HPA on Custom Metrics
### Why Standard CPU/Memory HPA Fails for LLMs
LLM inference is GPU-bound and queue-bound. The CPU on an inference node may idle at 10% while the GPU is saturated and dozens of requests wait in the serving queue. A standard HPA targeting CPU utilization will never trigger scale-up under these conditions, so queued requests wait too long.
### Exposing Custom Metrics (Queue Depth)
vLLM exposes a Prometheus-compatible /metrics endpoint with several metrics critical for autoscaling decisions. Before configuring the Prometheus Adapter, verify the exact metric names exposed by vLLM in your version:
```plain text
kubectl exec -n inference <vllm-pod-name> -- curl -s http://localhost:8000/metrics | grep -i "waiting\|cache"
```
Confirm the metric names match those used in the adapter configuration below. Metric names can vary between vLLM versions and scrape setups. vLLM releases commonly expose names with a colon after the prefix (e.g., vllm:num_requests_waiting); the colon resembles the Prometheus recording-rule naming convention, but here it is part of the raw metric name and no recording rule is needed. Other setups surface an underscore form (e.g., vllm_num_requests_waiting), for example after relabeling. Use the exact name returned by the /metrics endpoint.
These metrics need to be surfaced to the Kubernetes HPA controller via the Prometheus Adapter. KEDA ScaledObject configuration for vLLM is outside the scope of this guide; see the KEDA documentation for a Prometheus scaler example.
First, install the Prometheus Adapter if it is not already present in your cluster:
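A sketch of that install, wrapped in a function so nothing runs until it is called. The release name `prometheus-adapter`, the `monitoring` namespace, port 9090, and the function name are assumptions; `prometheus.url` and `prometheus.port` are values of the prometheus-community chart.

```shell
# Sketch: install the Prometheus Adapter from the prometheus-community Helm repo.
# Release name, namespace, and port are illustrative assumptions.
install_prometheus_adapter() {
  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm repo update

  # The URL is quoted so the shell does not treat < and > as redirections
  helm install prometheus-adapter prometheus-community/prometheus-adapter \
    --namespace monitoring \
    --create-namespace \
    --set prometheus.url="http://<prometheus-service>.<prometheus-namespace>.svc" \
    --set prometheus.port=9090
}
```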
Replace <prometheus-service> and <prometheus-namespace> with the actual Prometheus service name and namespace in your cluster (e.g., http://prometheus-kube-prometheus-prometheus.monitoring.svc).
Verify the adapter is running:
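One way to check, sketched as a small helper (the function name is hypothetical). It relies on the adapter registering the `custom.metrics.k8s.io` API group, which is how the HPA controller reaches custom metrics.

```shell
verify_prometheus_adapter() {
  # The APIService should report Available=True once the adapter is serving
  kubectl get apiservice v1beta1.custom.metrics.k8s.io

  # Listing the API group confirms the aggregation layer can reach the adapter
  kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"
}
```

If the second command returns a JSON resource list, the custom metrics API is reachable; an empty `resources` array usually means no adapter rule has matched a Prometheus series yet.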
The Prometheus Adapter configuration translates Prometheus queries into Kubernetes custom metrics API responses. Create prometheus-adapter-config.yaml with the following content. Important: run the verification command above first to confirm the exact metric name. The configuration below uses vllm_num_requests_waiting (underscores). vLLM releases commonly expose the colon form vllm:num_requests_waiting instead, so if your /metrics output shows a different name, adjust the seriesQuery accordingly:
```plain text
- seriesQuery: 'vllm_num_requests_waiting{namespace!="",pod!=""}'
```
This configuration queries vllm_num_requests_waiting, maps it to Kubernetes namespace and pod labels, and exposes it as a custom metric named vllm_queue_depth that the HPA can target.
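A fuller sketch of such a rule, written as Helm values for the adapter chart's `rules.custom` key. The layout as Helm values, the `avg` aggregation in `metricsQuery`, and the exact regular expression are assumptions; the series name, the namespace and pod mapping, and the exposed name vllm_queue_depth follow the description above. Substitute the metric name your /metrics endpoint actually reports.

```shell
# Write the adapter rule to the file name used in this guide.
# The quoted 'EOF' keeps $ and << >> template markers literal.
cat > prometheus-adapter-config.yaml <<'EOF'
rules:
  custom:
    - seriesQuery: 'vllm_num_requests_waiting{namespace!="",pod!=""}'
      resources:
        overrides:
          # Map Prometheus labels to Kubernetes resources
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        # Rename the series to the metric name the HPA will target
        matches: "^vllm_num_requests_waiting$"
        as: "vllm_queue_depth"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
EOF
```

Assuming a release named prometheus-adapter in the monitoring namespace, the file can be applied with `helm upgrade prometheus-adapter prometheus-community/prometheus-adapter -n monitoring --reuse-values -f prometheus-adapter-config.yaml`.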
### Configuring the HPA
Cost note: minReplicas: 1 keeps at least one GPU pod running at all times, which means continuous GPU node cost even during idle periods. On cloud providers, consider using cluster autoscaler node scale-down in combination with this setting, or set minReplicas: 0 if your setup supports scale-to-zero (requires KEDA or Knative).
```plain text
apiVersion: autoscaling/v2
```
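A fuller sketch of an HPA manifest consistent with the settings discussed in this section, written to a file via a heredoc. The HPA name `vllm-hpa`, the Deployment name `vllm-inference-vllm`, the `inference` namespace, and `maxReplicas: 4` are assumptions; `minReplicas: 1`, the target of 5, and the 300-second scale-down window come from the surrounding text.

```shell
cat > vllm-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa                 # hypothetical name
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference-vllm    # assumed Deployment name
  minReplicas: 1
  maxReplicas: 4                 # assumed upper bound; size to your GPU pool
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_queue_depth # custom metric exposed by the Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "5"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
EOF
```

Apply it with `kubectl apply -f vllm-hpa.yaml`, then watch `kubectl get hpa -n inference` to confirm the metric resolves to a number rather than `<unknown>`.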
The averageValue of 5 means the HPA targets no more than 5 waiting requests per pod on average. When queue depth exceeds this, new replicas are requested. The scaleDown.stabilizationWindowSeconds of 300 seconds is critical: vLLM model loading can take 30 to 120 seconds depending on model size and storage speed. For models >13B parameters or on slow NFS/S3-backed PVCs, loading can exceed 300 seconds; tune initialDelaySeconds and stabilization windows accordingly. A 300-second window prevents thrashing but delays capacity reduction after traffic drops. If traffic arrives in bursts with 10-minute peaks, set the stabilization window to match peak duration so the HPA does not scale down mid-burst. Scaling down too aggressively means pods are destroyed and recreated repeatedly. Keeping at least one warm replica (minReplicas: 1) avoids cold-start latency on the first request.
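To see how a target of 5 turns into a replica count, the calculation below applies the formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sample numbers are hypothetical, and the real controller additionally applies a tolerance band, the min/max replica bounds, and the stabilization windows.

```shell
current_replicas=2
avg_waiting_per_pod=12   # hypothetical observed vllm_queue_depth, averaged across pods
target_per_pod=5         # averageValue from the HPA spec

# ceil(a / b) with integers is (a + b - 1) / b
numerator=$(( current_replicas * avg_waiting_per_pod ))
desired_replicas=$(( (numerator + target_per_pod - 1) / target_per_pod ))

echo "desired replicas: ${desired_replicas}"
```

With 24 requests waiting in total and a target of 5 per pod, the controller would ask for 5 replicas, subject to `maxReplicas`.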
## Full Walkthrough: Helm Chart for a vLLM Service
### Chart Structure Overview
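One possible layout, sketched as a scaffold script. The directory name `vllm-chart` and the template file names are assumptions inferred from the section headings in this walkthrough, not a listing of the guide's actual chart; the files are created empty as placeholders.

```shell
chart_dir="vllm-chart"   # hypothetical chart directory name

# A Helm chart needs Chart.yaml, values.yaml, and a templates/ directory
mkdir -p "${chart_dir}/templates"
touch "${chart_dir}/Chart.yaml" \
      "${chart_dir}/values.yaml" \
      "${chart_dir}/templates/deployment.yaml" \
      "${chart_dir}/templates/service.yaml" \
      "${chart_dir}/templates/hpa.yaml"

# Show the resulting files
find "${chart_dir}" -type f | sort
```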
### values.yaml: Configurable Parameters
Important: Verify the latest vLLM image tag at https://github.com/vllm-project/vllm/releases before deploying. The tag below was current at time of writing but may not exist on the registry if the project has moved on.
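A sketch of a values file covering the parameters the templates in this walkthrough reference (`service.type`, `service.port`, and the `hpa.*` keys). The remaining key names, the `vllm/vllm-openai` image repository, `ClusterIP`, port 8000, and `maxReplicas: 4` are assumptions. The image tag is deliberately left as a placeholder.

```shell
cat > values.yaml <<'EOF'
image:
  repository: vllm/vllm-openai
  tag: "<vllm-image-tag>"        # placeholder: pick a tag from the vLLM releases page
model:
  name: mistralai/Mistral-7B-Instruct-v0.2
resources:
  requests:
    cpu: "4"
    memory: 32Gi
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1            # must equal the GPU request
service:
  type: ClusterIP
  port: 8000
hpa:
  enabled: true
  minReplicas: 1
  maxReplicas: 4
  targetQueueDepth: 5
  scaleDownStabilizationSeconds: 300
EOF
```

Keeping sizing and autoscaling knobs in one file means environment-specific overrides can be passed with `-f` or `--set` at install time instead of editing templates.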
### Deployment Template
```plain text
command: ["sleep", "15"]
```
The initialDelaySeconds on the liveness probe is set to 180 seconds, deliberately higher than the readiness probe (120 seconds), to accommodate model loading time. The failureThreshold: 6 on the liveness probe provides 90 seconds of tolerance (6 × 15s) after the initial delay before Kubernetes kills the pod, preventing restart loops under heavy GPU inference load. If the model is larger or storage is slow, both values may need to increase further to prevent Kubernetes from killing the pod during startup.
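The probe arithmetic above, written out so the numbers can be adjusted for a slower-loading model. A `periodSeconds` of 15 is inferred from the "6 × 15s" figure; the combined budget is an approximation, since the kubelet's exact timing depends on when the first probe fires and on probe timeouts.

```shell
liveness_initial_delay=180      # initialDelaySeconds on the liveness probe
liveness_period=15              # periodSeconds (inferred from the text)
liveness_failure_threshold=6    # failureThreshold

# Consecutive failures tolerated, expressed in seconds
tolerance=$(( liveness_period * liveness_failure_threshold ))

# Rough time a container has to become healthy before a restart
startup_budget=$(( liveness_initial_delay + tolerance ))

echo "failure tolerance: ${tolerance}s"
echo "approximate startup budget: ${startup_budget}s"
```

If measured model load time approaches the startup budget, raise `initialDelaySeconds` or `failureThreshold` before deploying.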
### Service Template
```plain text
namespace: {{ .Release.Namespace }}
type: {{ .Values.service.type }}
app: {{ .Release.Name }}-vllm
- port: {{ .Values.service.port }}
```
### HPA Template
```plain text
{{- if .Values.hpa.enabled }}
namespace: {{ .Release.Namespace }}
maxReplicas: {{ .Values.hpa.maxReplicas }}
averageValue: "{{ .Values.hpa.targetQueueDepth | int }}"
stabilizationWindowSeconds: {{ .Values.hpa.scaleDownStabilizationSeconds }}
```
### Deploying and Verifying
Before deploying, create the Hugging Face token secret in the target namespace. This is required for downloading gated models such as Mistral-7B-Instruct-v0.2:
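A minimal sketch of that step. The Secret name `hf-token`, the key `token`, and the helper name are hypothetical; use whatever names your Deployment template reads from. The `inference` namespace matches the one used elsewhere in this guide and must already exist.

```shell
create_hf_secret() {
  # $1 = Hugging Face API token
  kubectl create secret generic hf-token \
    --namespace inference \
    --from-literal=token="$1"
}
```

To keep the token out of shell history, read it interactively first, for example `read -rs HF_TOKEN` followed by `create_hf_secret "$HF_TOKEN"`.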
Security note: Do not pass the token as a plain environment variable or commit it to version control. The Secret-based approach above keeps the token out of your Helm values and pod specs.
Now install the chart:
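```plain text
helm install vllm-inference <chart-path> -n inference --create-namespace
kubectl logs -n inference -l app=vllm-inference-vllm --tail=50
kubectl logs -n inference -l app=vllm-inference-vllm | grep -E "error|401|gated|token"
# Run the port-forward in a separate terminal, then send a test completion request
kubectl port-forward -n inference svc/<service-name> 8000:<service-port>
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "<your prompt>",
    "max_tokens": 64
  }'
```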
A successful response returns a JSON object with the generated completion, confirming the full pipeline works end to end: GPU Operator, device plugin, vLLM container, and Kubernetes networking.
## Implementation Checklist
1. GPU nodes provisioned and labeled.
2. NVIDIA GPU Operator installed and verified.
3. Inference nodes tainted for dedicated workloads.
4. Hugging Face token Secret created in the inference namespace.
5. Model artifacts accessible (PVC, S3, or Hugging Face Hub with valid token).
6. vLLM Helm chart values reviewed for resource sizing.
7. Prometheus deployed and scraping confirmed for vLLM metrics in the inference namespace.
8. Prometheus Adapter installed and custom metrics API available.
9. HPA deployed and tested under synthetic load.
10. Liveness/readiness probes validated (ensure initialDelaySeconds exceeds model load time).
11. Scale-down stabilization window tuned for model load time.
## Where to Go Next
To mature this platform, add model versioning and canary rollouts with KServe, implement A/B traffic splitting for model evaluation, and integrate GPU-aware cost monitoring tools to track inference spend per model and per team.
For a broader view of local LLM tooling options, the Running LLMs Locally guide covers alternative approaches. Teams evaluating serving engines should also review the Ollama vs vLLM comparison to understand where each tool fits in the deployment spectrum.