In deep learning and AI image generation, DGX-series workstations are a go-to choice for professional users thanks to their GPU horsepower. ComfyUI, a node-based workflow front end for Stable Diffusion, offers lower VRAM usage and greater flexibility than the traditional WebUI. This tutorial tackles a concrete pain point: deploying ComfyUI on a DGX server on top of a Spark distributed environment, so that multiple GPUs can cooperate in a single AI image-generation pipeline.
This setup is a particularly good fit for:
DGX servers typically ship with multiple NVIDIA data-center GPUs (such as the A100 or V100). Before starting, confirm the hardware:
```bash
nvidia-smi   # GPU model and count
lscpu        # CPU configuration
free -h      # memory
df -h        # available storage
```
Note: ComfyUI's VRAM requirement depends on model size; for running SDXL models, at least 24 GB per GPU is recommended.
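To script the VRAM check, the CSV output of `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits` can be parsed and compared against the 24 GB guideline. A minimal sketch; the function names are illustrative, not part of any library:

```python
import csv
import io

def parse_gpu_memory(csv_text):
    """Parse `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits`
    output into a list of (name, MiB) tuples."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) == 2:
            gpus.append((row[0].strip(), int(row[1].strip())))
    return gpus

def meets_sdxl_requirement(gpus, min_mib=24 * 1024):
    """True only if every GPU has at least `min_mib` of VRAM (24 GB for SDXL)."""
    return all(mem >= min_mib for _, mem in gpus)
```

On an A100-40GB box, `parse_gpu_memory` would yield entries like `("NVIDIA A100-SXM4-40GB", 40960)`, which pass the check.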
```bash
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xzf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 /opt/spark
```
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda create -n comfy python=3.10
conda activate comfy
```
Clone the official repository with Git and install the dependencies:
```bash
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
Add the Spark initialization code to main.py:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ComfyUI-Spark") \
    .config("spark.executor.memory", "16g") \
    .config("spark.driver.memory", "8g") \
    .getOrCreate()
```
Then wrap inference so it can be distributed:

```python
def distributed_inference(workflow_json):
    # logic that fans the workflow out to the GPU nodes for execution
    ...
```
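One way to fill in the `distributed_inference` stub is a static round-robin schedule over the workflow's nodes. A sketch under the assumption that `workflow_json` uses ComfyUI's prompt format (a dict keyed by node id); actual dispatch to each GPU is elided:

```python
def assign_round_robin(node_ids, n_gpus):
    """Map each workflow node id to a GPU index, round-robin.
    A simple static schedule; real balancing would weigh per-node cost."""
    return {nid: i % n_gpus for i, nid in enumerate(node_ids)}

def distributed_inference(workflow_json, n_gpus=4):
    """Sketch: group a workflow's nodes into per-GPU sub-batches."""
    assignment = assign_round_robin(sorted(workflow_json.keys()), n_gpus)
    batches = {}
    for nid, gpu in assignment.items():
        batches.setdefault(gpu, []).append(nid)
    return batches
```

For a three-node prompt on two GPUs this yields `{0: ["1", "3"], 1: ["2"]}`: GPU 0 gets two nodes, GPU 1 gets one.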
Store model files on shared storage (such as NFS) so every node sees the same paths:
```
models/
├── checkpoints/
├── loras/
├── vae/
└── ...
```
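The layout above can be created in one go on the shared mount. A small sketch; the subdirectory list beyond `checkpoints`/`loras`/`vae` is an assumption about which model types you will use:

```python
import os

MODEL_DIRS = ["checkpoints", "loras", "vae", "embeddings", "controlnet"]

def create_model_tree(base_path):
    """Create the shared model directory layout under `base_path`
    (e.g. an NFS mount such as /mnt/nfs/models)."""
    for sub in MODEL_DIRS:
        os.makedirs(os.path.join(base_path, sub), exist_ok=True)
    return sorted(os.listdir(base_path))
```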
Point ComfyUI at it in extra_model_paths.yaml:
```yaml
comfyui:
    base_path: /mnt/nfs/models
    checkpoints: checkpoints
    loras: loras
    vae: vae
```
Dynamic load balancing can be added by modifying execution.py:
```python
import numpy as np

def queue_prompt(prompt):
    gpu_util = get_gpu_utilization()          # per-GPU utilization, e.g. via pynvml
    least_loaded = int(np.argmin(gpu_util))   # index of the idlest GPU
    return send_to_device(prompt, device=f"cuda:{least_loaded}")
```
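The selection step itself needs no NumPy and can be tested without a GPU. A self-contained sketch, assuming the utilization numbers come from something like `pynvml.nvmlDeviceGetUtilizationRates(handle).gpu` per device:

```python
def pick_least_loaded(utilizations):
    """Return the index of the GPU with the lowest utilization percentage."""
    if not utilizations:
        raise ValueError("no GPUs reported")
    return min(range(len(utilizations)), key=utilizations.__getitem__)
```

With utilizations `[90, 15, 40]` this picks GPU 1, so the next prompt lands on `cuda:1`.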
Create a systemd unit file at /etc/systemd/system/comfyui.service:
```ini
[Unit]
Description=ComfyUI Service
After=network.target

[Service]
User=comfy
Group=comfy
WorkingDirectory=/opt/ComfyUI
ExecStart=/opt/miniconda3/envs/comfy/bin/python main.py
Restart=always

[Install]
WantedBy=multi-user.target
```
Generate a self-signed TLS certificate:

```bash
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout /etc/ssl/private/comfyui.key \
    -out /etc/ssl/certs/comfyui.crt
```
Then terminate TLS in Nginx and proxy to ComfyUI:

```nginx
server {
    listen 443 ssl;
    server_name yourdomain.com;
    ssl_certificate     /etc/ssl/certs/comfyui.crt;
    ssl_certificate_key /etc/ssl/private/comfyui.key;

    location / {
        proxy_pass http://localhost:8188;
        proxy_set_header Host $host;
        # ComfyUI's frontend uses WebSockets; forward the upgrade headers
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```
Evaluate throughput with an automated test script:
```python
import time

def benchmark():
    test_cases = [
        ("txt2img", "512x512", 20),
        ("img2img", "768x768", 10),
        ("inpainting", "1024x1024", 5),
    ]
    for name, resolution, steps in test_cases:
        start = time.time()
        run_workflow((name, resolution, steps))  # assumed to execute the workflow
        duration = time.time() - start
        print(f"{name}@{resolution}: {steps / duration:.2f} it/s")
```
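The throughput arithmetic in the loop above can be pulled into small pure helpers, which makes the benchmark easier to unit-test without running any workflow. Function names here are illustrative:

```python
def throughput(steps, duration):
    """Iterations per second for a run of `steps` sampling steps."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return steps / duration

def summarize(results):
    """Format (name, resolution, steps, seconds) rows like the benchmark output."""
    return [f"{n}@{r}: {throughput(s, d):.2f} it/s" for n, r, s, d in results]
```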
Tune Spark executor resources. These are launch-time settings, so set them on the builder (or in spark-defaults.conf) rather than on a running session:

```python
spark = SparkSession.builder \
    .config("spark.executor.cores", "4") \
    .config("spark.executor.memoryOverhead", "2g") \
    .getOrCreate()
```
Free GPU memory between large jobs:

```python
import gc
import torch

gc.collect()
torch.cuda.empty_cache()
```
| Error | Likely cause | Fix |
|---|---|---|
| CUDA OOM | Insufficient VRAM | Lower batch_size or the resolution |
| Spark timeout | Network latency | Increase spark.network.timeout |
| Model fails to load | Corrupted file | Re-download the model file |
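The CUDA OOM row in the table can be automated: catch the error and retry with a smaller batch. A hedged sketch in which `run_fn` stands in for your workflow call; it relies only on the fact that PyTorch's OOM surfaces as a RuntimeError whose message contains "out of memory":

```python
def run_with_oom_backoff(run_fn, batch_size, min_batch=1):
    """Retry `run_fn(batch_size)` after a CUDA OOM, halving the batch each time."""
    while batch_size >= min_batch:
        try:
            return run_fn(batch_size)
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise  # unrelated error: re-raise unchanged
            batch_size //= 2
    raise RuntimeError("OOM even at minimum batch size")
```

A run that OOMs at batch 8 and 4 would succeed on the third attempt at batch 2.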
Key log locations:

- /opt/spark/logs/
- /opt/ComfyUI/logs/

Use grep to zero in on problems quickly:
```bash
grep -i "error" /opt/ComfyUI/logs/*.log
grep -A 5 -B 5 "Exception" /opt/spark/logs/*
```
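The same filtering can be done in Python when you want a count rather than the matching lines, e.g. for a monitoring check. A small sketch mirroring the grep patterns above:

```python
import re

def count_log_errors(lines):
    """Count lines mentioning an error or exception, case-insensitively."""
    pat = re.compile(r"error|exception", re.IGNORECASE)
    return sum(1 for line in lines if pat.search(line))
```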
Example image post-processing node:
```python
import torch

class ImageEnhancer:
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "image": ("IMAGE",),
                "contrast": ("FLOAT", {"default": 1.0}),
            }
        }

    RETURN_TYPES = ("IMAGE",)
    FUNCTION = "enhance"
    CATEGORY = "image/postprocessing"

    def enhance(self, image, contrast):
        # Scale distance from mid-gray to adjust contrast, then clamp to [0, 1]
        enhanced_image = ((image - 0.5) * contrast + 0.5).clamp(0.0, 1.0)
        return (enhanced_image,)
```
Split complex workflows into subtasks and fan them out through Spark:
```python
def split_workflow(workflow):
    tasks = []
    for node in workflow["nodes"]:
        if node["type"] == "sampler":
            tasks.append(create_spark_task(node))
    # parallelize lives on the SparkContext, not the SparkSession
    return spark.sparkContext.parallelize(tasks).collect()
```
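The node-selection half of `split_workflow` is pure and can be exercised without a Spark cluster. A minimal sketch, using the same workflow shape assumed above:

```python
def extract_sampler_nodes(workflow):
    """Pick out the sampler nodes that split_workflow would parallelize."""
    return [n for n in workflow["nodes"] if n["type"] == "sampler"]
```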