The IBM Power AC922 is a high-performance computing platform built on the POWER9 architecture. Equipped with six Tesla V100-SXM2-16GB GPUs, it offers distinct advantages for AI inference and large-model deployment. This article documents hands-on experience with the full workflow, from OS installation to three-GPU LLM inference, with particular attention to ppc64le-specific configuration and common pitfalls.
Before starting the installation, confirm the basic hardware status of the AC922:
```bash
lspci | grep -i nvidia   # confirm that all 6 V100 GPUs are detected
```
Petitboot is the boot environment specific to the POWER architecture; performing a network installation through it requires extra care:
ISO preparation essentials:
Server-side configuration:
```bash
# Create a mount point and mount the ISO
sudo mkdir -p /mnt/cs9-ppc64
sudo mount -o loop CentOS-Stream-9-latest-ppc64le-dvd1.iso /mnt/cs9-ppc64

# Expose it over HTTP
sudo ln -sfn /mnt/cs9-ppc64 /var/www/html/centos9-ppc64
sudo systemctl restart httpd
```
Kernel boot arguments for Petitboot:
```
inst.repo=http://<server_ip>/centos9-ppc64/
ip=dhcp
inst.ks=http://<server_ip>/ks.cfg   # optional kickstart for automated installation
inst.text                           # force text-mode installation
```
Note: vmlinuz and initrd.img must come from the same image version; a mismatch will cause the installation to fail.
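For reference, a Petitboot entry pointing at the HTTP server usually looks like the sketch below; the ppc/ppc64/ subdirectory is where ppc64le install media typically keeps its boot files, but verify the exact layout of your ISO:
```
Kernel:    http://<server_ip>/centos9-ppc64/ppc/ppc64/vmlinuz
Initrd:    http://<server_ip>/centos9-ppc64/ppc/ppc64/initrd.img
Boot args: inst.repo=http://<server_ip>/centos9-ppc64/ ip=dhcp inst.text
```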
Immediately after installation, apply the following configuration:
```bash
# Disable SELinux enforcement (for GPU driver compatibility)
sudo sed -i 's/SELINUX=enforcing/SELINUX=permissive/g' /etc/selinux/config
sudo setenforce 0

# Configure the EPEL repository
sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm

# Install basic development tools
sudo dnf groupinstall -y "Development Tools"
```
Repositories unavailable:
```bash
sudo sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-*
sudo sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*
```
cmake crash problem:
```text
cmake: undefined symbol: archive_write_add_filter_zstd
```
Solution:
```bash
sudo dnf reinstall -y libarchive cmake
# or install a newer cmake via conda
conda install -c conda-forge cmake
```
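To check which libarchive the cmake binary actually resolves at runtime, the following sanity check helps (the libarchive.so.13 soname is the usual one on EL9 systems; adjust if yours differs):
```bash
# show the libarchive shared object cmake links against
ldd "$(command -v cmake)" | grep -i archive
# confirm the previously missing symbol is exported by the installed library
nm -D /usr/lib64/libarchive.so.13 | grep archive_write_add_filter_zstd
```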
Install kernel headers matching the running kernel:
```bash
uname -r
sudo dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
```
Blacklist the nouveau driver and rebuild the initramfs:
```bash
echo "blacklist nouveau" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
sudo dracut --force
```
Download the ppc64le driver package:
```bash
wget https://us.download.nvidia.com/tesla/550.54.15/NVIDIA-Linux-ppc64le-550.54.15.run
```
Switch to a non-graphical target and run the installer:
```bash
sudo systemctl isolate multi-user.target
sudo sh ./NVIDIA-Linux-ppc64le-550.54.15.run
```
Verify the installation:
```bash
nvidia-smi
# should list all 6 V100 GPUs
lsmod | grep nvidia
# should show the nvidia and nvidia_uvm modules
```
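A more script-friendly verification using standard nvidia-smi query flags; it should print one row per card, six in total:
```bash
nvidia-smi --query-gpu=index,name,memory.total --format=csv
# expected: six rows, each a Tesla V100-SXM2-16GB reporting ~16 GB of memory
```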
Version selection:
Installation commands:
```bash
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/ppc64le/cuda-rhel9.repo
sudo dnf install -y cuda-toolkit-12-4
```
Configure the environment variables:
```bash
echo 'export PATH=/usr/local/cuda/bin:$PATH' | sudo tee /etc/profile.d/cuda.sh
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' | sudo tee -a /etc/profile.d/cuda.sh
source /etc/profile
```
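Before compiling anything, confirm the toolchain actually resolves from the new profile:
```bash
which nvcc       # should resolve to /usr/local/cuda/bin/nvcc
nvcc --version   # should report release 12.4
```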
Create a test program, cuda_test.cu:
```cpp
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int n;
    cudaError_t err = cudaGetDeviceCount(&n);
    printf("cudaGetDeviceCount: %d (%s), n=%d\n",
           err, cudaGetErrorString(err), n);
    return 0;
}
```
Compile and run:
```bash
nvcc cuda_test.cu -o cuda_test
./cuda_test
# expected output: cudaGetDeviceCount: 0 (no error), n=6
```
Install the build dependencies:
```bash
sudo dnf install -y ninja-build ccache git
```
Clone the repository and check out the latest release tag:
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout $(git describe --tags --abbrev=0)
```
An optimized build for the V100 GPUs:
```bash
mkdir -p build && cd build
cmake .. -G Ninja \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES=70 \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLAMA_CUDA_FORCE_DMMV=ON \
    -DLLAMA_CUDA_MMV_Y=2
ninja
```
Parameter notes:
- -DCMAKE_CUDA_ARCHITECTURES=70: targets the V100's Volta architecture (compute capability 7.0)
- -DLLAMA_CUDA_FORCE_DMMV=ON: forces the direct matrix-vector multiplication kernels
- -DLLAMA_CUDA_MMV_Y=2: tunes the y-dimension blocking of the matrix-vector kernels

GCC version problem:
```text
unsupported GNU version! gcc versions later than 13 are not supported!
```
Solution:
```bash
sudo dnf install -y gcc-toolset-13
source /opt/rh/gcc-toolset-13/enable
```
CUDA backend not enabled:
Check that the cmake output includes:
```text
-- GGML CUDA support enabled
-- Found CUDA: /usr/local/cuda (found version "12.4")
```
Convert the model to GGUF format (convert.py takes the model directory as a positional argument and writes to --outfile):
```bash
python3 convert.py models/llama-2-7b \
    --outfile models/llama-2-7b-gguf/ggml-model-f16.gguf --vocab-type bpe
```
Quantize to 4-bit:
```bash
./quantize models/llama-2-7b-gguf/ggml-model-f16.gguf models/llama-2-7b-gguf/ggml-model-q4_0.gguf q4_0
```
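A quick size sanity check; for a 7B model the q4_0 file should land around 3.5-4 GB, roughly a quarter of the ~13 GB f16 file (figures are approximate):
```bash
ls -lh models/llama-2-7b-gguf/
# ggml-model-f16.gguf   ~13 GB
# ggml-model-q4_0.gguf  ~3.8 GB
```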
Basic test command:
```bash
env -i PATH=/usr/local/cuda/bin:/usr/sbin:/usr/bin:/sbin:/bin \
    CUDA_VISIBLE_DEVICES=0 \
    ./main -m models/llama-2-7b-gguf/ggml-model-q4_0.gguf \
    -p "Introduce the IBM Power AC922 server" \
    -n 256 -c 2048 -ngl 20
```
Key parameters:
- -ngl 20: offload the first 20 model layers to the GPU
- -c 2048: context length in tokens
- -n 256: generate 256 tokens

Three-GPU inference within one NVLink group (0,1,2):
```bash
env -i PATH=/usr/local/cuda/bin:/usr/sbin:/usr/bin:/sbin:/bin \
    CUDA_VISIBLE_DEVICES=0,1,2 \
    ./main -m models/llama-2-7b-gguf/ggml-model-q4_0.gguf \
    -p "Compare IBM Power and x86 architectures for AI workloads" \
    -n 512 -c 4096 -ngl 999
```
Performance tuning suggestions (combined in the sketch below):
- --tensor-split: manually apportion VRAM across GPUs
- --main-gpu: designate the primary GPU
- --threads: match the CPU core count
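A sketch combining these flags; the split ratios and thread count are illustrative starting points, not measured optima:
```bash
# --tensor-split: even thirds across the three GPUs
# --main-gpu 0: GPU 0 holds the scratch buffers
# --threads 16: adjust to the physical cores available to the job
CUDA_VISIBLE_DEVICES=0,1,2 ./main \
    -m models/llama-2-7b-gguf/ggml-model-q4_0.gguf \
    -p "test" -n 256 -ngl 999 \
    --tensor-split 0.34,0.33,0.33 --main-gpu 0 --threads 16
```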
View the GPU topology:
```bash
nvidia-smi topo -m
```
Typical AC922 topology:
```
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5
GPU0   X    NV12  NV12  SYS   SYS   SYS
GPU1  NV12   X    NV12  SYS   SYS   SYS
GPU2  NV12  NV12   X    SYS   SYS   SYS
GPU3  SYS   SYS   SYS    X    NV12  NV12
GPU4  SYS   SYS   SYS   NV12   X    NV12
GPU5  SYS   SYS   SYS   NV12  NV12   X
```
Best practices:
- Use CUDA_VISIBLE_DEVICES to isolate a job within a single NVLink group (GPUs 0-2 or 3-5)

Monitor the GPUs and topology live:
```bash
watch -n 1 "nvidia-smi && echo && lstopo --no-io"
```
Profile inference hotspots with nvprof:
```bash
nvprof ./main -m model.gguf -p "test" -n 128
```
Watch the hwmon temperature sensors:
```bash
watch -n 1 "cat /sys/class/hwmon/hwmon*/temp*_input | awk '{print \$1/1000}'"
```
Create /etc/systemd/system/llama-server.service:
```ini
[Unit]
Description=LLaMA Inference Server
After=network.target

[Service]
Environment="PATH=/usr/local/cuda/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="CUDA_VISIBLE_DEVICES=0,1,2"
ExecStart=/opt/llama.cpp/build/bin/server \
    -m /data/models/llama-2-7b-gguf/ggml-model-q4_0.gguf \
    -c 4096 --host 0.0.0.0 --port 8080
Restart=always
User=llama
Group=llama

[Install]
WantedBy=multi-user.target
```
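Reload systemd, start the service, and run a smoke test. The /completion endpoint and JSON fields below follow llama.cpp's built-in server; confirm them against the revision you built:
```bash
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
# quick smoke test against the HTTP API
curl -s http://localhost:8080/completion -d '{"prompt": "Hello", "n_predict": 16}'
```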
Open the firewall port:
```bash
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --reload
```
Run the service as an unprivileged user:
```bash
sudo useradd -r -s /sbin/nologin llama
sudo chown -R llama:llama /opt/llama.cpp /data/models
```
Typical error:
```text
ggml_cuda_init: failed to initialize CUDA: initialization error
```
Troubleshooting steps:
```bash
# 1. Driver and kernel module status
nvidia-smi
lsmod | grep nvidia

# 2. Device nodes present and accessible
ls -l /dev/nvidia*

# 3. Available system memory
free -h
cat /proc/meminfo | grep MemAvailable

# 4. Kernel log for driver errors
dmesg | grep -i nvidia
```
Solutions (when the root cause is exhausted or unevenly distributed VRAM):
```bash
# shift more of the model onto GPU 0, less onto GPUs 1 and 2
--tensor-split 0.4,0.3,0.3
# pin scratch buffers to a specific GPU
--main-gpu 0
```
Monitor GPU power, utilization, clocks, memory, and temperature in real time:
```bash
nvidia-smi dmon -s pucvmet
```
Create the backup script backup_llama.sh:
```bash
#!/bin/bash
BACKUP_DIR=/backup/llama-$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"

# System configuration
cp -r /etc/modprobe.d "$BACKUP_DIR"
cp /etc/profile.d/cuda.sh "$BACKUP_DIR"

# Driver information
nvidia-smi --query > "$BACKUP_DIR/nvidia-smi.txt"
cp /proc/driver/nvidia/version "$BACKUP_DIR"

# Model and code
rsync -a /opt/llama.cpp "$BACKUP_DIR"
rsync -a /data/models "$BACKUP_DIR"

# Archive
tar -czvf "$BACKUP_DIR.tar.gz" "$BACKUP_DIR"
```
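To run the backup on a schedule, a crontab entry like the following works; /opt/scripts/ is only an example location for the script:
```bash
# weekly backup, Sundays at 02:00
0 2 * * 0 /opt/scripts/backup_llama.sh
```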
Post-restore checklist:
```bash
# 1. Driver responds
nvidia-smi
# 2. CUDA runtime works
./cuda_test
# 3. Minimal GPU-offloaded inference succeeds
./main -m model.gguf -p "test" -n 1 -ngl 1
```
Edit /etc/sysctl.conf:
```ini
# Enlarge GPU DMA buffer limits
vm.max_map_count=262144
fs.aio-max-nr=1048576

# Network tuning (for distributed inference)
net.core.rmem_max=16777216
net.core.wmem_max=16777216
```
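Apply the new settings without a reboot:
```bash
sudo sysctl -p
```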
Set environment variables:
```bash
export CUDA_MEMCPY_ASYNC=1
export TF_GPU_ALLOCATOR=cuda_malloc_async
```
Optimize the GPU clocks:
```bash
nvidia-smi -pm 1         # enable persistence mode
nvidia-smi -ac 877,1530  # set V100 application clocks (memory MHz, graphics MHz)
```
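Verify that the application clocks took effect:
```bash
nvidia-smi -q -d CLOCK | grep -A 3 "Applications Clocks"
```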
Test command:
```bash
./perplexity -m models/llama-2-7b-gguf/ggml-model-q4_0.gguf \
    -f test.txt -ngl 99 -t 16
```
Expected metrics:
Three-GPU test:
```bash
mpirun -np 3 ./perplexity -m models/llama-2-7b-gguf/ggml-model-q4_0.gguf \
    -f test.txt -ngl 33 -t 8
```
Ideal scaling ratio:
Check the compatibility matrix:
Staged (canary) upgrade steps:
```bash
# Keep the old driver binaries aside
sudo mv /usr/bin/nvidia-* /tmp/
sudo mv /usr/lib64/libnvidia* /tmp/

# Install the new driver (userspace only, keeping the existing kernel module)
sudo ./NVIDIA-Linux-ppc64le-XXX.XX.run --no-kernel-module
```
Version the model directories:
```
/data/models/
├── llama-2-7b-gguf-v1
└── llama-2-7b-gguf-v2
```
Switch versions atomically via a symlink:
```bash
ln -sfn /data/models/llama-2-7b-gguf-v2 /data/models/current
```
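If the systemd unit's -m argument points at a model under /data/models/current, flipping the symlink plus a service restart completes the switch, and rollback is the same operation in reverse (this assumes the unit file from the deployment section):
```bash
sudo systemctl restart llama-server
# rollback: re-point the symlink at v1 and restart
ln -sfn /data/models/llama-2-7b-gguf-v1 /data/models/current
sudo systemctl restart llama-server
```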
Memory bandwidth advantage:
NVLink performance:
Model partitioning strategy:
Batch processing optimization:
Mixed-precision computation: