Before setting up the environment for an Intel B60 GPU server, some groundwork is needed. Ubuntu 22.04 LTS, as a long-term-support release, provides a stable base system. The preparation workflow is as follows:

After first logging in to the server, it is advisable to apply some basic configuration right away:
```bash
# Set the time zone (Asia/Shanghai as an example)
sudo timedatectl set-timezone Asia/Shanghai
# Disable services you do not need (adjust to your requirements)
sudo systemctl disable --now apparmor
sudo systemctl disable --now unattended-upgrades
# Harden SSH (edits /etc/ssh/sshd_config in place)
sudo sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin no/' /etc/ssh/sshd_config
sudo sed -i 's/#PasswordAuthentication yes/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo systemctl restart sshd
```
Note: before disabling password login, make sure your SSH public key has been added to ~/.ssh/authorized_keys in the target user's home directory.
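Since sshd (with its default StrictModes) silently ignores key files that are writable by group or others, it is worth checking permissions before locking yourself out of password login. A minimal sketch; the helper name is my own:

```python
import os
import stat
import tempfile

def key_file_ok(path):
    """True if the file exists and is not group/world writable (sshd's StrictModes requirement)."""
    if not os.path.isfile(path):
        return False
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IWGRP | stat.S_IWOTH) == 0

# Demonstrated on a throwaway file rather than the real ~/.ssh/authorized_keys
with tempfile.NamedTemporaryFile(delete=False) as f:
    tmp = f.name
os.chmod(tmp, 0o600)
print(key_file_ok(tmp))  # True
```

Running this against ~/.ssh/authorized_keys (and ~/.ssh itself) before restarting sshd is a cheap safety net.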
Run a full system update and install the basic tooling:

```bash
# Refresh package lists and upgrade all installed packages
sudo apt update && sudo apt full-upgrade -y
# Base build toolchain
sudo apt install -y build-essential cmake git wget curl pkg-config \
    unzip software-properties-common lsb-release ca-certificates \
    gnupg2 apt-transport-https
# Common system utilities
sudo apt install -y htop iotop iftop ncdu tmux tree jq
```
Ubuntu 22.04 ships with Python 3.10 by default; a virtual environment is recommended for managing project dependencies:

```bash
# Base Python environment
sudo apt install -y python3 python3-pip python3-venv python3-dev
# Create a dedicated virtual environment for the project
python3 -m venv ~/venv/b60_gpu
source ~/venv/b60_gpu/bin/activate
# Upgrade pip and install foundational Python packages
pip install --upgrade pip setuptools wheel
pip install numpy pandas matplotlib scikit-learn jupyter
```
The Intel B60 GPU requires dedicated driver support, so add Intel's official package repository:

```bash
# Import the Intel graphics repository key
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
    sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
# Add the repository definition
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] \
https://repositories.intel.com/gpu/ubuntu jammy client" | \
    sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
# Refresh package lists
sudo apt update
```
Install the core driver components required by the Intel B60 GPU:

```bash
# Base driver and runtime components
sudo apt install -y intel-opencl-icd intel-level-zero-gpu level-zero \
    intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2
# Development libraries
sudo apt install -y libigc-dev intel-igc-cm libigdfcl-dev \
    libigfxcmrt-dev level-zero-dev
```
After installation, verify that the GPU is recognized:

```bash
# Check the device nodes
ls /dev/dri
# Inspect the GPU
sudo apt install -y intel-gpu-tools
sudo intel_gpu_top
```

The output should show the Intel GPU device information, including the B60 model name and its current load.
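For provisioning scripts, the same device-node check can be automated without an interactive tool; a small sketch that only inspects the directory listing (the function name is mine, and the card*/renderD* layout is the usual DRM convention):

```python
import os
import re

def find_render_nodes(dri_path="/dev/dri"):
    """Return sorted DRM node names (cardN / renderDNNN) found under dri_path."""
    if not os.path.isdir(dri_path):
        return []
    pat = re.compile(r"^(card\d+|renderD\d+)$")
    return sorted(n for n in os.listdir(dri_path) if pat.match(n))

# On a healthy install this prints at least one card*/renderD* pair
print(find_render_nodes())
```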
Intel oneAPI provides compute libraries optimized for Intel hardware:

```bash
# Import the oneAPI repository key
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
# Add the repository definition
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] \
https://apt.repos.intel.com/oneapi all main" | \
sudo tee /etc/apt/sources.list.d/oneAPI.list
# Refresh package lists
sudo apt update
```
Install the oneAPI base toolkit:

```bash
sudo apt install -y intel-basekit
# Load the environment variables (add this to ~/.bashrc to make it persistent)
source /opt/intel/oneapi/setvars.sh
```
Verify that the oneAPI components installed correctly:

```bash
# Check the compiler
icpx --version
# Linear-algebra smoke test (note: pip-installed NumPy usually links OpenBLAS, not MKL)
python -c "import numpy as np; a = np.random.rand(1000,1000); np.linalg.svd(a)"
```
Install OpenVINO inside the Python virtual environment:

```bash
source ~/venv/b60_gpu/bin/activate
pip install openvino==2024.0.0 openvino-dev==2024.0.0
```
Confirm that OpenVINO can see and use the Intel B60 GPU:

```python
# Test script: test_gpu.py
import openvino as ov

core = ov.Core()
devices = core.available_devices
print("Available devices:", devices)

if 'GPU' in devices:
    print("\nTesting GPU inference...")
    try:
        # Reuse the existing Core instead of constructing a second one
        model = core.compile_model("path/to/your/model.xml", "GPU")
        print("GPU inference setup successful!")
    except Exception as e:
        print(f"GPU setup error: {str(e)}")
else:
    print("GPU device not detected!")
```
Performance-oriented configuration for the B60 GPU:

```python
config = {
    "PERFORMANCE_HINT": "THROUGHPUT",
    "NUM_STREAMS": "4",
    "GPU_THROUGHPUT_STREAMS": "4"
}
compiled_model = core.compile_model(model, "GPU", config)
```
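When several models share the same tuning, the dict above can be built once; a trivial helper sketch (the function name is my own, the keys are exactly the ones used above):

```python
def throughput_config(num_streams=4):
    """Build the OpenVINO THROUGHPUT configuration shown above."""
    return {
        "PERFORMANCE_HINT": "THROUGHPUT",
        "NUM_STREAMS": str(num_streams),
        "GPU_THROUGHPUT_STREAMS": str(num_streams),
    }

print(throughput_config(2))
```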
Install a PyTorch build that uses the oneAPI CPU libraries (these official wheels use MKL/oneDNN on the CPU; running PyTorch on the Intel GPU itself additionally requires intel-extension-for-pytorch):

```bash
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 \
    --index-url https://download.pytorch.org/whl/cpu
```

Verify the PyTorch/oneAPI integration:

```python
import torch
print(torch.__config__.show())
print("MKL available:", torch.backends.mkl.is_available())
```
Install ONNX Runtime with the OpenVINO execution provider:

```bash
pip install onnx onnxruntime-openvino
```

Create a test script that runs an ONNX model on the GPU:

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL

providers = ['OpenVINOExecutionProvider']
provider_options = [{'device_type': 'GPU_FP32'}]
session = ort.InferenceSession("model.onnx", sess_options=sess_options,
                               providers=providers, provider_options=provider_options)
```
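The device_type string encodes both the target device and the precision for the OpenVINO execution provider; a small helper to switch between them (the helper name is my own, and only GPU_FP32/GPU_FP16 are covered here):

```python
def openvino_provider(precision="FP32"):
    """Return (providers, provider_options) for ONNX Runtime's OpenVINO EP."""
    if precision not in ("FP32", "FP16"):
        raise ValueError(f"unsupported precision: {precision}")
    return ["OpenVINOExecutionProvider"], [{"device_type": f"GPU_{precision}"}]

providers, options = openvino_provider("FP16")
print(options)  # [{'device_type': 'GPU_FP16'}]
```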
Install the current Docker Engine:

```bash
# Import Docker's official GPG key
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
    sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
# Add the repository
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Let the current user run docker (newgrp only affects the current shell; log out and back in for a permanent effect)
sudo usermod -aG docker $USER
newgrp docker
```
Enable Intel GPU access from containers. Unlike NVIDIA, Intel GPUs need no special container runtime (the default runc is fine); instead, pass the DRM device nodes into the container:

```bash
# Expose /dev/dri to the container and join the render group so the GPU is usable
docker run --rm \
    --device /dev/dri:/dev/dri \
    --group-add $(stat -c "%g" /dev/dri/renderD128) \
    ubuntu:22.04 ls /dev/dri
```
Create a Dockerfile tailored to the Intel B60 GPU:

```dockerfile
FROM ubuntu:22.04

# Base tooling
RUN apt update && apt install -y wget gnupg2

# Intel graphics repository
RUN wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
    gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg && \
    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] \
https://repositories.intel.com/gpu/ubuntu jammy client" | \
    tee /etc/apt/sources.list.d/intel-gpu-jammy.list

# Drivers and runtimes
RUN apt update && apt install -y \
    intel-opencl-icd intel-level-zero-gpu level-zero \
    intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
    python3 python3-pip

# oneAPI
RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | \
    gpg --dearmor > /usr/share/keyrings/oneapi-archive-keyring.gpg && \
    echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] \
https://apt.repos.intel.com/oneapi all main" | \
    tee /etc/apt/sources.list.d/oneAPI.list && \
    apt update && apt install -y intel-basekit

# Environment variables
ENV PATH=/opt/intel/oneapi/compiler/latest/linux/bin:/opt/intel/oneapi/mkl/latest/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/lib:/opt/intel/oneapi/mkl/latest/lib:$LD_LIBRARY_PATH

# Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Application (the base image provides python3, not python)
WORKDIR /app
COPY . .
CMD ["python3", "app.py"]
```
Install and use the GPU monitoring tool:

```bash
sudo apt install -y intel-gpu-tools
sudo intel_gpu_top
```
Set up host-level monitoring:

```bash
# Install the Prometheus node exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvf node_exporter-*.tar.gz
sudo mv node_exporter-*/node_exporter /usr/local/bin/
sudo useradd -rs /bin/false node_exporter
# Create a systemd service
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF
# Start the service
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
```

Note that node_exporter covers CPU, memory, disk, and network; it has no built-in Intel GPU collector, so GPU utilization still comes from intel_gpu_top (or a separate exporter).
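node_exporter serves plain-text metrics on port 9100 at /metrics; a minimal sketch of parsing one sample line (deliberately not a full Prometheus exposition-format parser):

```python
import re

def parse_sample(line):
    """Parse 'name{labels} value' or 'name value' into (name, float_value), else None."""
    m = re.match(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+([-+0-9.eE]+)$", line.strip())
    return (m.group(1), float(m.group(2))) if m else None

print(parse_sample('node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5'))
# ('node_cpu_seconds_total', 1234.5)
```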
Create a comprehensive benchmarking script:

```python
import time
import numpy as np
import openvino as ov

def benchmark_model(model_path, device="GPU", batch_size=1, num_iter=100):
    core = ov.Core()
    model = core.read_model(model_path)
    # Fix the batch dimension on the model before compiling
    shape = list(model.input(0).shape)
    shape[0] = batch_size
    model.reshape({model.input(0): shape})
    # Performance configuration
    config = {
        "PERFORMANCE_HINT": "THROUGHPUT",
        "NUM_STREAMS": "4",
        "GPU_THROUGHPUT_STREAMS": "4"
    }
    compiled_model = core.compile_model(model, device, config)
    # Prepare input data
    input_data = {compiled_model.input(0): np.random.randn(*shape).astype(np.float32)}
    # Warm-up runs
    for _ in range(10):
        compiled_model(input_data)
    # Timed runs
    start = time.perf_counter()
    for _ in range(num_iter):
        compiled_model(input_data)
    end = time.perf_counter()
    avg_latency = (end - start) * 1000 / num_iter
    throughput = batch_size * num_iter / (end - start)
    print(f"Device: {device}")
    print(f"Batch size: {batch_size}")
    print(f"Average latency: {avg_latency:.2f} ms")
    print(f"Throughput: {throughput:.2f} FPS")
    print("-" * 50)

# Measure performance across batch sizes
for bs in [1, 4, 8, 16]:
    benchmark_model("model.xml", device="GPU", batch_size=bs)
```
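The latency and throughput arithmetic from the script can be isolated so it is easy to sanity-check without a GPU; these are the same formulas as above:

```python
def summarize(elapsed_s, num_iter, batch_size):
    """Average latency in ms and throughput in FPS for a timed loop."""
    avg_latency_ms = elapsed_s * 1000 / num_iter
    fps = batch_size * num_iter / elapsed_s
    return avg_latency_ms, fps

print(summarize(2.0, 100, 4))  # (20.0, 200.0)
```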
Symptom: OpenVINO or system tools cannot detect the Intel B60 GPU.

Troubleshooting steps:

Check that the kernel driver is loaded (depending on kernel version, recent Intel GPUs bind to either i915 or the newer xe driver):

```bash
lsmod | grep -E 'i915|xe'
```

If there is no output, load the module:

```bash
sudo modprobe i915
```

Verify the device nodes:

```bash
ls /dev/dri
```

You should see entries such as card0 and renderD128.

Check the installed driver packages:

```bash
sudo apt list --installed | grep intel
```
Symptom: applications report that OpenCL is unavailable or fail at runtime.

Resolution:

Verify the OpenCL installation:

```bash
sudo apt install -y clinfo
clinfo | grep -i intel
```

Reinstall the runtime:

```bash
sudo apt install --reinstall intel-opencl-icd
```
Symptom: out-of-memory errors during inference.

Suggested mitigations:

```python
core = ov.Core()
core.set_property("GPU", {"INFERENCE_PRECISION_HINT": "f16"})
```

```python
config = {"CACHE_DIR": "cache"}  # enable the model cache
compiled_model = core.compile_model(model, "GPU", config)
```
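A quick back-of-envelope check shows why the f16 precision hint helps with memory pressure; the shape below is purely illustrative:

```python
def tensor_bytes(shape, bytes_per_elem):
    """Raw memory footprint of a dense tensor."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem

shape = (16, 3, 1024, 1024)  # hypothetical input batch
print(tensor_bytes(shape, 4))  # fp32: 201326592 bytes (192 MiB)
print(tensor_bytes(shape, 2))  # fp16: 100663296 bytes, half the footprint
```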
Tuning measures:

Check GPU frequency management (the i915 driver exposes frequency limits via sysfs; the card index may differ on your system):

```bash
cat /sys/class/drm/card0/gt_max_freq_mhz
cat /sys/class/drm/card0/gt_min_freq_mhz
```

Pin the GPU to its maximum frequency:

```bash
cat /sys/class/drm/card0/gt_max_freq_mhz | sudo tee /sys/class/drm/card0/gt_min_freq_mhz
```

Tune thread affinity:

```bash
export OMP_NUM_THREADS=$(nproc)
export KMP_AFFINITY=granularity=fine,compact,1,0
```
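The same exports can be produced programmatically when launching workers from Python (KMP_AFFINITY applies to Intel's OpenMP runtime; the helper name is my own):

```python
import os

def affinity_env(threads=None):
    """Environment variables mirroring the exports above."""
    return {
        "OMP_NUM_THREADS": str(threads or os.cpu_count()),
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
    }

env = {**os.environ, **affinity_env(8)}  # e.g. pass to subprocess.Popen(..., env=env)
```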
Use asynchronous inference (note that wait_all() returns None; outputs must be collected via a callback):

```python
results = {}
infer_queue = ov.AsyncInferQueue(compiled_model, 4)
infer_queue.set_callback(lambda req, i: results.update({i: req.get_output_tensor().data.copy()}))
for i, batch in enumerate(batches):
    infer_queue.start_async(batch, userdata=i)
infer_queue.wait_all()  # blocks until all requests have finished
```
In actual deployments, I have found that the performance of the Intel B60 GPU on Ubuntu 22.04 depends heavily on the driver version and system configuration. Check Intel's official repositories regularly for updates and, before any large-scale rollout, validate the stability and performance of new driver versions in a test environment first. In production, keeping driver and software versions identical across all nodes avoids a whole class of compatibility problems.
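Version consistency across nodes can be checked mechanically: collect a package→version manifest from each node (for example from apt list --installed) and diff them. A sketch with illustrative version numbers:

```python
def manifest_diff(ref, other):
    """Map each mismatching package to its (reference, other) version pair."""
    return {
        pkg: (ref.get(pkg), other.get(pkg))
        for pkg in set(ref) | set(other)
        if ref.get(pkg) != other.get(pkg)
    }

# Hypothetical manifests from two nodes
node_a = {"intel-opencl-icd": "23.43.1", "level-zero": "1.14.0"}
node_b = {"intel-opencl-icd": "23.43.1", "level-zero": "1.13.5"}
print(manifest_diff(node_a, node_b))  # {'level-zero': ('1.14.0', '1.13.5')}
```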