Deploying object-detection models on edge devices has long been a hot topic in computer vision. The Jetson Nano, NVIDIA's low-cost AI development board, pairs a quad-core ARM Cortex-A57 CPU with a 128-core Maxwell-architecture GPU, making it an attractive platform for running a modern detector like YOLOv7. This article records my complete process of deploying YOLOv7 on the Jetson Nano, covering environment setup, model conversion, and performance optimization.
YOLOv7, one of the most recent releases in the YOLO family, improves markedly on both speed and accuracy. Deploying it on a resource-constrained device like the Jetson Nano still raises several challenges: ARM-architecture compatibility, making full use of the CUDA cores, and converting the model for TensorRT. With this hands-on guide, even developers new to edge computing should be able to complete the deployment.
The Jetson Nano ships with Ubuntu 18.04 and the JetPack SDK preinstalled. Start by updating the system:
```bash
sudo apt update
sudo apt full-upgrade -y
```
Then install the essential development tools:
```bash
sudo apt install -y build-essential cmake git libpython3-dev python3-pip
```
Note: the Jetson Nano's ARM (aarch64) architecture differs from the usual x86 platforms, so many Python packages have no prebuilt wheels and must be compiled from source, which takes a long time.
JetPack already includes CUDA and cuDNN, but you should confirm the environment variables are set correctly. Append the following to the end of ~/.bashrc:
```bash
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PATH=/usr/local/cuda/bin:$PATH
```
After running `source ~/.bashrc`, verify the installation:
```bash
nvcc --version  # should report CUDA 10.2
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2  # check the cuDNN version
```
Although JetPack ships with OpenCV, the bundled build has no CUDA support; for YOLOv7 it is worth recompiling OpenCV (version 4.5.5 here) with CUDA enabled:
```bash
git clone --branch 4.5.5 https://github.com/opencv/opencv.git
mkdir -p opencv/build && cd opencv/build
cmake -D CMAKE_BUILD_TYPE=RELEASE \
      -D CMAKE_INSTALL_PREFIX=/usr/local \
      -D WITH_CUDA=ON \
      -D CUDA_ARCH_BIN=5.3 \
      -D CUDA_ARCH_PTX="" \
      -D WITH_CUDNN=ON \
      -D OPENCV_DNN_CUDA=ON \
      -D ENABLE_FAST_MATH=ON \
      -D CUDA_FAST_MATH=ON \
      -D WITH_CUBLAS=ON \
      -D OPENCV_ENABLE_NONFREE=ON \
      -D WITH_GSTREAMER=ON \
      -D WITH_LIBV4L=ON \
      -D BUILD_opencv_python3=ON \
      -D BUILD_TESTS=OFF \
      -D BUILD_PERF_TESTS=OFF \
      -D BUILD_EXAMPLES=OFF ..
make -j4  # drop to -j2 if the build gets killed for lack of memory
sudo make install
```
Hands-on tip: the build takes roughly 2-3 hours. Use a case with good cooling and power the board from a 5V 4A supply.
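Once the build finishes, a quick sanity check confirms the Python bindings and CUDA support came through (a minimal sketch; a successful CUDA build should report at least one device on the Nano):

```python
import cv2

# Expect the version we just built
print(cv2.__version__)  # 4.5.5
# A CUDA-enabled build reports the Nano's GPU here
print(cv2.cuda.getCudaEnabledDeviceCount())  # 1
```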
With the environment ready, clone the YOLOv7 repository and install its Python dependencies (note that on Jetson, PyTorch is best installed from NVIDIA's prebuilt aarch64 wheels rather than from PyPI):

```bash
git clone https://github.com/WongKinYiu/yolov7.git
cd yolov7
pip3 install -r requirements.txt
```
Download the pretrained weights (using yolov7-tiny as the example):
```bash
wget https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-tiny.pt
```
Test plain PyTorch inference first:
```python
import torch

# YOLOv7 checkpoints store the model object under the 'model' key
model = torch.load('yolov7-tiny.pt', map_location='cpu')['model'].float()
model.eval()
print(model)
```
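For an end-to-end smoke test, the repository's own detect.py can be run against its bundled sample images (a sketch; the flags follow the repo's README, so adjust paths and thresholds to taste):

```bash
python3 detect.py --weights yolov7-tiny.pt --img-size 640 --conf-thres 0.25 --source inference/images
```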
Next, export the weights to ONNX with the repository's export.py:

```bash
python3 export.py --weights yolov7-tiny.pt --grid --simplify --img-size 640 640
```
The exported ONNX model can be inspected with Netron to confirm its structure:
```bash
pip3 install netron
python3 -m netron yolov7-tiny.onnx
```
TensorRT itself is already installed by JetPack, and the nvidia-tensorrt wheels on PyPI (such as 8.2.1.8) are built for x86_64 only, so pip cannot be used on the Nano's aarch64. Instead, simply verify that the JetPack-provided Python bindings are visible:

```bash
python3 -c "import tensorrt; print(tensorrt.__version__)"
```
Convert the ONNX model to a TensorRT engine with the bundled trtexec tool:
```bash
/usr/src/tensorrt/bin/trtexec --onnx=yolov7-tiny.onnx \
    --saveEngine=yolov7-tiny.trt \
    --fp16 \
    --workspace=1024
```
Key parameters:

- `--fp16`: enables FP16 precision, which substantially speeds up inference
- `--workspace`: builder workspace size in MB; on the Jetson Nano keep it at 1024 or below

Create inference_trt.py:
```python
import cv2
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates the CUDA context on import

class YOLOv7_TRT:
    def __init__(self, engine_path):
        self.logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, "rb") as f, trt.Runtime(self.logger) as runtime:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        # Allocate host/device buffers for every input and output binding
        self.inputs, self.outputs, self.bindings = [], [], []
        for binding in self.engine:
            size = trt.volume(self.engine.get_binding_shape(binding))
            dtype = trt.nptype(self.engine.get_binding_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))
            if self.engine.binding_is_input(binding):
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem})

    def infer(self, img):
        # Preprocess: BGR -> RGB (YOLOv7 is trained on RGB), resize to the
        # export size, HWC -> CHW, scale to [0, 1]
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (640, 640))
        img = img.transpose((2, 0, 1)).astype(np.float32) / 255.0
        np.copyto(self.inputs[0]['host'], img.ravel())
        # Inference: copy to device, execute, copy the result back
        cuda.memcpy_htod(self.inputs[0]['device'], self.inputs[0]['host'])
        self.context.execute_v2(bindings=self.bindings)
        cuda.memcpy_dtoh(self.outputs[0]['host'], self.outputs[0]['device'])
        # Postprocess: one row per candidate box,
        # 85 = 4 box coordinates + objectness + 80 class scores
        output = self.outputs[0]['host']
        return output.reshape(1, -1, 85)
```
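A minimal usage sketch (assuming the yolov7-tiny.trt engine built above and a local test image named test.jpg):

```python
model = YOLOv7_TRT('yolov7-tiny.trt')
img = cv2.imread('test.jpg')
detections = model.infer(img)  # raw output, shape (1, N, 85)
print(detections.shape)
```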
If detections look wrong, first check that the inference input size matches the size used at export time:

```python
# Keep the same input size at conversion and inference time
img = cv2.resize(img, (640, 640))  # must match export.py's --img-size
```
If inference is too slow, make sure the engine was built with FP16:

```bash
# Add --fp16 when converting with trtexec
/usr/src/tensorrt/bin/trtexec --onnx=yolov7-tiny.onnx --fp16
```
To overlap data transfers with compute, run inference on an explicit CUDA stream; note that the TensorRT Python API calls this execute_async_v2 (enqueueV2 is the C++ name):

```python
stream = cuda.Stream()
self.context.execute_async_v2(bindings=self.bindings, stream_handle=stream.handle)
stream.synchronize()
```
To process several frames at once, build the engine with dynamic batch shapes. The shape keys must use the ONNX input tensor's actual name (images for YOLOv7 exports), and the ONNX itself must have been exported with a dynamic batch axis:

```bash
# Specify min/opt/max batch sizes at conversion time
/usr/src/tensorrt/bin/trtexec --onnx=yolov7-tiny.onnx \
    --minShapes=images:1x3x640x640 \
    --optShapes=images:4x3x640x640 \
    --maxShapes=images:8x3x640x640
```
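For reference, the corresponding export step would look roughly like this (a sketch; I am assuming the --dynamic-batch flag of the repository's export.py, so confirm with `python3 export.py --help` on your checkout):

```bash
python3 export.py --weights yolov7-tiny.pt --grid --simplify --dynamic-batch --img-size 640 640
```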
| Model variant | Precision | Inference time (ms) | Memory usage (MB) |
|---|---|---|---|
| YOLOv7-tiny | FP32 | 45.2 | 780 |
| YOLOv7-tiny | FP16 | 28.7 | 420 |
| YOLOv7 | FP32 | N/A (out of memory) | - |
| YOLOv7x | FP16 | N/A (out of memory) | - |
Test conditions: Jetson Nano 4GB, JetPack 4.6, power mode set to MAXN.
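To reproduce this kind of measurement, a warm-up pass followed by a timed loop is enough (a minimal sketch using the YOLOv7_TRT class above; img is any preloaded frame):

```python
import time

for _ in range(10):          # warm-up: lets clocks and caches settle
    model.infer(img)

n = 100
start = time.perf_counter()
for _ in range(n):
    model.infer(img)
elapsed = time.perf_counter() - start
print(f"mean latency: {elapsed / n * 1000:.1f} ms")
```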
To put the board into that mode and lock the clocks:

```bash
sudo nvpmodel -m 0   # select the maximum-performance (MAXN) power profile
sudo jetson_clocks   # lock CPU/GPU clocks at their highest frequencies
```
Monitor the board while running:

```bash
tegrastats  # live CPU/GPU frequencies and temperatures
```
Memory is the tightest resource on the 4GB Nano; adding swap helps both the OpenCV build and TensorRT engine construction:

```bash
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# To mount it automatically at boot, add to /etc/fstab:
#   /swapfile none swap sw 0 0
```
If memory is still short, drop to text mode to reclaim what the GUI holds (systemctl isolate is the reliable way to do this):

```bash
sudo systemctl isolate multi-user.target  # stop the GUI to free memory
```
If even FP16 is not fast enough, INT8 quantization can be tried, though it requires a calibration cache built from representative images:

```bash
/usr/src/tensorrt/bin/trtexec --onnx=yolov7-tiny.onnx --int8 --calib=calib.cache
```
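trtexec only consumes an existing cache; one way to produce calib.cache is a small Python calibrator run once through the TensorRT builder. The sketch below is an outline under the TensorRT 8 Python API, not a drop-in tool: the preprocess helper mirrors the infer preprocessing above, and the image directory is hypothetical.

```python
import glob
import cv2
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

def preprocess(path):
    # Same preprocessing as inference: BGR -> RGB, resize, CHW, [0, 1]
    img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (640, 640)).transpose((2, 0, 1))
    return np.ascontiguousarray(img, dtype=np.float32) / 255.0

class Calibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, image_dir, cache_file='calib.cache'):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.cache_file = cache_file
        self.images = glob.glob(image_dir + '/*.jpg')  # hypothetical sample set
        self.index = 0
        self.device_mem = cuda.mem_alloc(3 * 640 * 640 * 4)  # one FP32 frame

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        if self.index >= len(self.images):
            return None  # signals that calibration data is exhausted
        cuda.memcpy_htod(self.device_mem, preprocess(self.images[self.index]))
        self.index += 1
        return [int(self.device_mem)]

    def read_calibration_cache(self):
        return None  # force fresh calibration

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)
```

Building an engine once with this calibrator attached to the builder config writes calib.cache, which the trtexec command above can then reuse.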
Symptom: the engine build aborts with an internal error:

```
[TRT] [E] 2: [optimizer.cpp::computeCosts::1815] Error Code 2: Internal Error
```

Solution: on the Nano this error usually means the builder ran out of memory while evaluating tactics. Reduce --workspace, stop the GUI, enable swap (see above), and rerun the conversion.
Symptom: detection boxes are misplaced or classes are wrong.

Troubleshooting steps: first confirm that the preprocessing matches the export settings (input size, RGB channel order, 1/255 normalization), then dump the engine's binding metadata:
```python
# Inspect the engine's binding names, dtypes, and shapes
for i in range(self.engine.num_bindings):
    name = self.engine.get_binding_name(i)
    dtype = self.engine.get_binding_dtype(i)
    shape = self.engine.get_binding_shape(i)
    print(f"Binding {i}: {name} {dtype} {shape}")
```
Optimization checklist:

- inference input size matches export.py's --img-size (640x640 here)
- FP16 enabled when building the engine
- --workspace kept at or below 1024 MB
- MAXN power mode set and jetson_clocks applied
- swap enabled and the GUI stopped when memory is tight

With the engine validated, it can be wired into a simple video pipeline:
```python
def process_video(trt_model, video_path, output_path=None):
    cap = cv2.VideoCapture(video_path)
    if output_path:
        # Match the writer to the source stream's size and frame rate
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        fps = cap.get(cv2.CAP_PROP_FPS) or 30
        fourcc = cv2.VideoWriter_fourcc(*'mp4v')
        out = cv2.VideoWriter(output_path, fourcc, fps, (w, h))
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        # Inference
        detections = trt_model.infer(frame)
        # Draw results (visualize_detections is defined below)
        vis = visualize_detections(frame, detections)
        if output_path:
            out.write(vis)
        else:
            cv2.imshow('Output', vis)
            if cv2.waitKey(1) == ord('q'):
                break
    cap.release()
    if output_path:
        out.release()
    cv2.destroyAllWindows()
```
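process_video relies on a visualize_detections helper that the original listing leaves undefined. Here is one possible minimal implementation for the raw (1, N, 85) output (a sketch under these assumptions: boxes come out as center-x, center-y, width, height in 640x640 input pixels, thresholds are free to tune, and labels show the numeric class id rather than COCO names):

```python
def visualize_detections(frame, detections, conf_thres=0.25, iou_thres=0.45):
    h, w = frame.shape[:2]
    sx, sy = w / 640.0, h / 640.0  # scale from network input space to frame space
    boxes, scores, class_ids = [], [], []
    for det in detections[0]:
        cls = int(np.argmax(det[5:]))
        score = float(det[4] * det[5 + cls])  # objectness * class confidence
        if score < conf_thres:
            continue
        cx, cy, bw, bh = det[:4]
        x = int((cx - bw / 2) * sx)
        y = int((cy - bh / 2) * sy)
        boxes.append([x, y, int(bw * sx), int(bh * sy)])
        scores.append(score)
        class_ids.append(cls)
    # Non-maximum suppression on the surviving candidates
    idxs = cv2.dnn.NMSBoxes(boxes, scores, conf_thres, iou_thres)
    for i in np.array(idxs).flatten():
        x, y, bw, bh = boxes[i]
        cv2.rectangle(frame, (x, y), (x + bw, y + bh), (0, 255, 0), 2)
        cv2.putText(frame, f"{class_ids[i]}: {scores[i]:.2f}", (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return frame
```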
For a live camera, capture and inference can be decoupled with threads. One caveat: with pycuda.autoinit the CUDA context belongs to the thread that imported it, so a real multi-threaded setup must push/pop the context in the worker thread.

```python
from threading import Thread
import queue

class CameraThread(Thread):
    """Grabs frames from a camera and pushes them onto a queue."""
    def __init__(self, cam_id, frame_queue):
        super().__init__(daemon=True)
        self.cam_id = cam_id
        self.frame_queue = frame_queue

    def run(self):
        cap = cv2.VideoCapture(self.cam_id)
        while True:
            ret, frame = cap.read()
            if ret:
                self.frame_queue.put(frame)
            else:
                break
        cap.release()

class InferenceThread(Thread):
    """Pops frames, runs the TensorRT model, pushes (frame, detections) pairs."""
    def __init__(self, model, input_queue, output_queue):
        super().__init__(daemon=True)
        self.model = model
        self.input_queue = input_queue
        self.output_queue = output_queue

    def run(self):
        while True:
            frame = self.input_queue.get()
            if frame is None:  # None is the shutdown sentinel
                break
            detections = self.model.infer(frame)
            self.output_queue.put((frame, detections))
```
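A minimal wiring of the two threads might look like this (a sketch; the queue sizes and the display loop are placeholders to adapt):

```python
frames, results = queue.Queue(maxsize=4), queue.Queue(maxsize=4)
CameraThread(0, frames).start()
InferenceThread(model, frames, results).start()

while True:
    frame, detections = results.get()
    cv2.imshow('Output', visualize_detections(frame, detections))
    if cv2.waitKey(1) == ord('q'):
        frames.put(None)  # ask the inference thread to stop
        break
```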
When an application needs to switch between several engines at runtime, a small manager that caches loaded models avoids repeated deserialization:

```python
class ModelManager:
    def __init__(self):
        self.current_model = None
        self.model_pool = {}  # cache of already-deserialized engines

    def load_model(self, name, trt_path):
        if name in self.model_pool:
            self.current_model = self.model_pool[name]
        else:
            model = YOLOv7_TRT(trt_path)
            self.model_pool[name] = model
            self.current_model = model

    def get_model(self):
        return self.current_model

# Usage example
manager = ModelManager()
manager.load_model('yolov7-tiny', 'yolov7-tiny.trt')
model = manager.get_model()
```
After successfully deploying YOLOv7 on the Jetson Nano, my main takeaway is that optimization on edge devices has to be considered end to end: model selection and precision trade-offs on one side, system-level power and memory management on the other. In practice, yolov7-tiny reaches close to 30 FPS at 640x640, which is enough for most real-time detection scenarios. If higher accuracy is required, a quantized full yolov7 model is worth trying, but performance and accuracy need to be balanced carefully.