With the rapid growth of cloud computing and artificial intelligence, deploying large language models (LLMs) to production has become a pressing need for many companies and developers. Azure Kubernetes Service (AKS), Microsoft's managed Kubernetes offering, provides an ideal runtime environment for LLM workloads. This guide walks through deploying Mistral-7B and similar large language models on AKS using the vLLM inference engine.
Tip: although this guide uses Mistral-7B as its example, the same approach applies to other open-source models hosted on Hugging Face; only the resource configuration needs to be adjusted accordingly.
First, make sure your Azure environment is ready. The detailed steps follow.

Verify your Azure subscription:

```bash
az account show
```

This command displays the currently active subscription. Confirm that its state is "Enabled" and that a valid billing method is configured.
Install the Azure CLI:

```bash
# Debian/Ubuntu
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
```

```bash
# macOS (Homebrew)
brew install azure-cli
```
Install the Kubernetes tooling:

```bash
az aks install-cli
```

This installs kubectl and the related Kubernetes management tools.
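Before moving on, it can help to confirm that both CLIs actually ended up on your PATH. The small helper below is not part of the official install steps, just a quick sanity check:

```shell
# Report whether a CLI tool is available on PATH.
check_tool() {
  if command -v "$1" > /dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: NOT FOUND"
  fi
}

check_tool az
check_tool kubectl
```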
Define environment variables to simplify the commands that follow:

```bash
export MY_RESOURCE_GROUP_NAME="llm-deployment-rg"
export MY_AKS_CLUSTER_NAME="llm-cluster"
export LOCATION="eastus"
```
Create the resource group:

```bash
az group create --name $MY_RESOURCE_GROUP_NAME --location $LOCATION
```
Create the base AKS cluster:

```bash
az aks create \
  --resource-group $MY_RESOURCE_GROUP_NAME \
  --name $MY_AKS_CLUSTER_NAME \
  --node-count 1 \
  --generate-ssh-keys \
  --network-plugin azure \
  --network-policy azure
```

Note: the initial node count is set to 1 to minimize upfront cost; you can scale out later as demand grows.
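One step worth calling out explicitly: kubectl needs credentials for the new cluster before it can talk to it. Fetching them and confirming connectivity looks like this:

```shell
# Merge the new cluster's credentials into ~/.kube/config.
az aks get-credentials \
  --resource-group $MY_RESOURCE_GROUP_NAME \
  --name $MY_AKS_CLUSTER_NAME

# Confirm kubectl can reach the API server.
kubectl get nodes
```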
The system node pool runs the Kubernetes system components. A recommended configuration:

```bash
az aks nodepool add \
  --resource-group $MY_RESOURCE_GROUP_NAME \
  --cluster-name $MY_AKS_CLUSTER_NAME \
  --name system \
  --mode System \
  --node-count 3 \
  --node-vm-size Standard_D2s_v3
```
Why the D2s_v3 size? With 2 vCPUs and 8 GiB of RAM per node, it provides enough headroom for system components at a modest cost, and three nodes give system pods room to reschedule during upgrades.
The GPU node pool is dedicated to the LLM inference workload:

```bash
az aks nodepool add \
  --resource-group $MY_RESOURCE_GROUP_NAME \
  --cluster-name $MY_AKS_CLUSTER_NAME \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_NC4as_T4_v3 \
  --node-taints sku=gpu:NoSchedule \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3
```
Key configuration choices: Standard_NC4as_T4_v3 provides a single NVIDIA T4 GPU with 16 GiB of VRAM, enough to serve Mistral-7B; the sku=gpu:NoSchedule taint keeps ordinary workloads off these comparatively expensive nodes; and the cluster autoscaler lets the pool grow from 1 to 3 nodes as inference load increases.
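To confirm both pools came up as expected, the Azure CLI's nodepool list command with a JMESPath query gives a compact summary (the field names below follow the CLI's output schema):

```shell
az aks nodepool list \
  --resource-group $MY_RESOURCE_GROUP_NAME \
  --cluster-name $MY_AKS_CLUSTER_NAME \
  --query "[].{name:name, count:count, vmSize:vmSize, state:provisioningState}" \
  --output table
```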
Create the following DaemonSet manifest (save it as nvidia-device-plugin.yaml):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      # Tolerate the GPU node pool taint so the plugin runs on GPU nodes.
      tolerations:
        - key: "sku"
          operator: "Equal"
          value: "gpu"
          effect: "NoSchedule"
      priorityClassName: "system-node-critical"
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
          name: nvidia-device-plugin-ctr
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```
Apply the manifest:

```bash
kubectl apply -f nvidia-device-plugin.yaml
```

Verify the installation:

```bash
kubectl get pods -n kube-system | grep nvidia-device-plugin
```
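Once the plugin pods are running, the GPU should show up under the node's allocatable resources. One way to check, assuming AKS's default agentpool node label:

```shell
kubectl get nodes -l agentpool=gpunp \
  -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```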
Create a PersistentVolumeClaim (save it as volume.yaml):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mistral-7b
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: default
```
Why 50Gi? The Mistral-7B weights alone are roughly 14-15 GiB in half precision, and the remaining space absorbs the Hugging Face download cache plus any additional model revisions you pull later.
Create the Service (save it as service.yaml):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mistral-7b
  namespace: default
spec:
  ports:
    - name: http-mistral-7b
      port: 80
      targetPort: 8000
  selector:
    app: mistral-7b
  type: LoadBalancer
```
A LoadBalancer-type Service provisions an Azure public load balancer with an external IP, exposing the model endpoint outside the cluster and spreading traffic across replicas as you scale out.
Create the Deployment (save it as deployment.yaml):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      # Tolerate the GPU node pool taint so the pod can schedule there.
      tolerations:
        - key: "sku"
          operator: "Equal"
          value: "gpu"
          effect: "NoSchedule"
      containers:
        - name: mistral-7b
          image: vllm/vllm-openai:latest
          args: ["--model", "mistralai/Mistral-7B-Instruct-v0.1"]
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 20G
            requests:
              nvidia.com/gpu: 1
              memory: 6G
          volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: cache-volume
      volumes:
        # Backed by the mistral-7b PVC so model downloads survive restarts.
        - name: cache-volume
          persistentVolumeClaim:
            claimName: mistral-7b
```
Resource sizing notes: the pod requests a full T4 GPU, since GPUs cannot be fractionally scheduled by default; the 6G memory request keeps scheduling flexible, while the 20G limit caps usage safely below the node's total memory so system components retain headroom.
Apply all three manifests:

```bash
kubectl apply -f volume.yaml
kubectl apply -f service.yaml
kubectl apply -f deployment.yaml
```
Check the Pod status:

```bash
kubectl get pods -w
```

View the Service details:

```bash
kubectl get service mistral-7b
```
Get the service IP:

```bash
export SERVICE_IP=$(kubectl get service mistral-7b -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
```
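Azure may take a few minutes to assign the external IP; until then the jsonpath query returns an empty string. A simple polling loop avoids racing ahead:

```shell
SERVICE_IP=""
while [ -z "$SERVICE_IP" ]; do
  SERVICE_IP=$(kubectl get service mistral-7b \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  [ -z "$SERVICE_IP" ] && echo "Waiting for external IP..." && sleep 10
done
echo "Service available at $SERVICE_IP"
```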
Send a test request:

```bash
curl --location "http://$SERVICE_IP/v1/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "prompt": "Explain how Kubernetes works",
    "max_tokens": 50
  }'
```
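The vLLM OpenAI-compatible server also exposes a chat endpoint, which is the more natural fit for an instruct-tuned model such as Mistral-7B-Instruct:

```shell
curl --location "http://$SERVICE_IP/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "messages": [{"role": "user", "content": "Explain how Kubernetes works"}],
    "max_tokens": 50
  }'
```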
Configure a Horizontal Pod Autoscaler (HPA):

```bash
kubectl autoscale deployment mistral-7b --cpu-percent=50 --min=1 --max=5
```

Keep in mind that CPU utilization is only a rough proxy for GPU-bound inference load, so treat these thresholds as a starting point.
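Once the autoscaler is in place, you can inspect its view of the deployment:

```shell
# Current vs. target utilization and the resulting replica count.
kubectl get hpa mistral-7b
kubectl describe hpa mistral-7b
```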
Install Prometheus and Grafana (the charts live in repositories that must be added first):

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus
helm install grafana grafana/grafana
```
Apply a network policy (save it as network-policy.yaml):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-llm-traffic
spec:
  podSelector:
    matchLabels:
      app: mistral-7b
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 8000
```

This admits ingress to the model pods on port 8000 from any namespace while blocking all other inbound traffic.
Use Spot instances:

```bash
az aks nodepool add \
  --resource-group $MY_RESOURCE_GROUP_NAME \
  --cluster-name $MY_AKS_CLUSTER_NAME \
  --name spotnp \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --node-vm-size Standard_NC4as_T4_v3 \
  --node-count 1
```

A --spot-max-price of -1 caps the price at the current pay-as-you-go rate, so the node is never evicted purely for price reasons.
Scheduled scaling:

```bash
kubectl scale deployment mistral-7b --replicas=0
```

Scale the replica count down during off-peak hours to save cost.
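A low-tech way to automate that off-peak scale-down is plain cron on a management host with kubectl configured; the schedule below is only an example:

```
# m h dom mon dow  command
# Scale to zero at 22:00, back to one replica at 07:00.
0 22 * * * kubectl scale deployment mistral-7b --replicas=0
0 7  * * * kubectl scale deployment mistral-7b --replicas=1
```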
Monitor resource utilization:

```bash
az monitor metrics list \
  --resource /subscriptions/{subscriptionId}/resourceGroups/$MY_RESOURCE_GROUP_NAME/providers/Microsoft.ContainerService/managedClusters/$MY_AKS_CLUSTER_NAME \
  --metric "node_cpu_usage_percentage" \
  --interval PT1H
```

(Substitute your own subscription ID for {subscriptionId}.)
GPU not recognized:

```bash
kubectl describe node <gpu-node-name>
kubectl logs -n kube-system <nvidia-device-plugin-pod-name>
```

Check that nvidia.com/gpu appears among the node's allocatable resources, and look for errors in the device plugin logs.
Out-of-memory issues:

```bash
kubectl top pod
```
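If the container is pressing against its memory limit, you can raise the limit in place with a JSON patch; the 24G figure below is only an example, so size it to your node:

```shell
kubectl patch deployment mistral-7b --type='json' -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/resources/limits/memory",
   "value": "24G"}
]'
```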
Rolling update strategy:

```bash
kubectl set image deployment/mistral-7b mistral-7b=vllm/vllm-openai:new-version
kubectl rollout status deployment/mistral-7b
```
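If the new image misbehaves, the deployment's revision history lets you back out quickly:

```shell
# Inspect past revisions, then revert to the previous one.
kubectl rollout history deployment/mistral-7b
kubectl rollout undo deployment/mistral-7b
```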
Back up with Velero:

```bash
velero backup create llm-backup --include-namespaces default
```

Restore from a backup:

```bash
velero restore create --from-backup llm-backup
```
Performance tuning tips:

- Request batching: tune the --max-batch-size and --max-sequence-length parameters so the GPU stays saturated without running out of memory.
- Model quantization: quantized weights cut VRAM usage and can raise throughput, at a small cost in output quality.
- Cache optimization: keep the Hugging Face cache on the persistent volume so pod restarts do not re-download the model weights.
- Concurrency control: cap in-flight requests with the --max-concurrent-requests parameter to keep latency predictable.

In practical deployments I have found that when running Mistral-7B on a T4 GPU, the optimal concurrency is usually between 4 and 8, depending on input and output lengths. Start with low concurrency and increase it gradually while monitoring response times and GPU utilization.