@@ -36,31 +36,40 @@ This example demonstrates how to deploy a server for AI inference using [vLLM](h
 
 ## Detailed Steps & Explanation
 
-1. Ensure Hugging Face permissions to retrieve model:
+1. Create a namespace. This example uses `vllm-example`, but you can choose any name:
+
+```bash
+kubectl create namespace vllm-example
+```
+
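To confirm the namespace exists before creating resources in it, a quick check along these lines works:

```bash
# Should list vllm-example with STATUS Active
kubectl get namespace vllm-example
```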
+2. Ensure Hugging Face permissions to retrieve model:
 
 ```bash
 # Env var HF_TOKEN contains hugging face account token
-kubectl create secret generic hf-secret \
+# Make sure to use the same namespace as in the previous step
+kubectl create secret generic hf-secret -n vllm-example \
   --from-literal=hf_token=$HF_TOKEN
 ```
 
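The command above expects `HF_TOKEN` to already be set in your shell. If it is not, export it first and, if you like, confirm the secret landed in the right namespace; the token value below is a placeholder you supply:

```bash
# Placeholder: substitute your own Hugging Face access token with read access to the model
export HF_TOKEN=<your-hugging-face-token>

# Optional sanity check: the secret should appear in the vllm-example namespace
kubectl get secret hf-secret -n vllm-example
```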
-2. Apply vLLM server:
+
+3. Apply vLLM server:
 
 ```bash
-kubectl apply -f vllm-deployment.yaml
+# Make sure to use the same namespace as in the previous steps
+kubectl apply -f vllm-deployment.yaml -n vllm-example
 ```
 
 - Wait for deployment to reconcile, creating vLLM pod(s):
 
 ```bash
-kubectl wait --for=condition=Available --timeout=900s deployment/vllm-gemma-deployment
-kubectl get pods -l app=gemma-server -w
+kubectl wait --for=condition=Available --timeout=900s deployment/vllm-gemma-deployment -n vllm-example
+kubectl get pods -l app=gemma-server -w -n vllm-example
 ```
 
 - View vLLM pod logs:
 
 ```bash
-kubectl logs -f -l app=gemma-server
+kubectl logs -f -l app=gemma-server -n vllm-example
 ```
 
 Expected output:
@@ -77,11 +86,12 @@ Expected output:
 ...
 ```
 
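The `vllm-deployment.yaml` applied in step 3 is not included in this diff. Purely as a sketch of the moving parts the steps above rely on (the Hugging Face token pulled from `hf-secret`, a GPU request, and a node selector), a minimal manifest could look roughly like the following; the container image, container port, and the `cloud.google.com/gke-accelerator` label are assumptions, not the contents of the actual file:

```bash
# Hypothetical sketch only; the real vllm-deployment.yaml in this example may differ
kubectl apply -n vllm-example -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
    spec:
      # Assumption: a GKE-style accelerator label; adjust to how your cluster labels GPU nodes
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest   # assumed image
        args: ["--model", "google/gemma-3-1b-it"]
        env:
        - name: HF_TOKEN                 # vLLM reads this to authenticate with Hugging Face
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_token
        ports:
        - containerPort: 8000            # vLLM's default OpenAI-compatible server port
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
```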
-3. Create service:
+4. Create service:
 
 ```bash
 # ClusterIP service on port 8080 in front of vllm deployment
-kubectl apply -f vllm-service.yaml
+# Make sure to use the same namespace as in the previous steps
+kubectl apply -f vllm-service.yaml -n vllm-example
 ```
 
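As with the deployment, `vllm-service.yaml` itself is not part of this diff. A minimal ClusterIP service consistent with the sketch above, selecting the `app: gemma-server` pods and mapping service port 8080 to an assumed container port 8000, might look like this:

```bash
# Hypothetical sketch only; the real vllm-service.yaml in this example may differ
kubectl apply -n vllm-example -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  type: ClusterIP
  selector:
    app: gemma-server
  ports:
  - port: 8080        # service port used by the port-forward below
    targetPort: 8000  # assumed vLLM container port
    protocol: TCP
EOF
```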
 ## Verification / Seeing it Work
@@ -90,18 +100,19 @@ kubectl apply -f vllm-service.yaml
 
 ```bash
 # Forward a local port (e.g., 8080) to the service port (e.g., 8080)
-kubectl port-forward service/vllm-service 8080:8080
+# Make sure to use the same namespace as in the previous steps
+kubectl port-forward service/vllm-service 8080:8080 -n vllm-example
 ```
 
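`kubectl port-forward` blocks, so run it in one terminal and send requests from another. Before the full chat request, a lightweight way to confirm the tunnel and the server are up is the model listing endpoint; vLLM serves the OpenAI-compatible API, so `/v1/models` should report `google/gemma-3-1b-it`:

```bash
# From a second terminal, while the port-forward is running
curl http://localhost:8080/v1/models
```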
 2. Send request to local forwarding port:
 
 ```bash
 curl -X POST http://localhost:8080/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
   "model": "google/gemma-3-1b-it",
   "messages": [{"role": "user", "content": "Explain Quantum Computing in simple terms."}],
   "max_tokens": 100
 }'
 ```
 
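The response comes back in the OpenAI chat-completions format, so the generated text sits under `choices[0].message.content`. If `jq` is installed, a variant of the same request that prints only the answer could look like this:

```bash
# Same request as above, piping the JSON response through jq to print only the generated text
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Explain Quantum Computing in simple terms."}],
    "max_tokens": 100
  }' | jq -r '.choices[0].message.content'
```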
@@ -151,9 +162,11 @@ Node selectors make sure vLLM pods land on Nodes with the correct GPU, and they
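The node-selector behavior mentioned above is easy to check from the cluster side: listing nodes together with their accelerator label shows which nodes a GPU node selector could match. The `cloud.google.com/gke-accelerator` key below is a GKE-specific assumption; other clusters label GPU nodes differently:

```bash
# Show each node's accelerator label (GKE-style; adjust the label key for other clusters)
kubectl get nodes -L cloud.google.com/gke-accelerator
```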
 ## Cleanup
 
 ```bash
-kubectl delete -f vllm-service.yaml
-kubectl delete -f vllm-deployment.yaml
-kubectl delete -f secret/hf_secret
+# Make sure to use the same namespace as in the previous steps
+kubectl delete -f vllm-service.yaml -n vllm-example
+kubectl delete -f vllm-deployment.yaml -n vllm-example
+kubectl delete secret hf-secret -n vllm-example
+kubectl delete namespace vllm-example
 ```
 
 ---