Design and Implementation of a SAM3-Based Intelligent Image Annotation Tool

Clark Liew

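1. Project Overview: Why We Built an Intelligent Image Annotation Tool

In computer vision, data annotation has long been a bottleneck for model development. Traditional tools such as LabelImg or CVAT require annotators to draw a box around each object or trace its outline by hand, so annotating 100 images can take a full day. This slows model iteration, especially when a dataset has to be built quickly.

SAM3 (Segment Anything with Concepts), released by Meta in November 2025, changes this workflow substantially. It brings open-vocabulary segmentation to the SAM family: given a short text phrase (such as "person", "crack", or "cell"), the model segments every matching instance in the image. Annotation moves from tracing objects one by one to naming a concept and getting all of its instances labeled at once.

SAM3 itself, however, is only a model, not a complete annotation tool: it has no user interface, no annotation management, and no data export. We therefore built this web-based annotation tool, which combines SAM3's segmentation capability with a complete annotation workflow. The tool uses a React + FastAPI stack, implements text-driven, click-based, and box-based segmentation, and exports YOLO and COCO formats that can be used directly for model training.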

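2. Technical Architecture and Core Components

2.1 Technology Choices

We chose the stack by weighing performance, development speed, and ecosystem compatibility:

Layer	Technology	Rationale
AI model	SAM3 (deployed locally)	Supports open-vocabulary and interactive segmentation, a close fit for annotation work
Back-end framework	FastAPI	Python ecosystem, so it integrates directly with SAM3; async request handling suits image inference workloads
Front-end framework	React + TypeScript	Mature component ecosystem; TypeScript adds type safety
UI component library	Ant Design	Rich set of enterprise-grade UI components that speeds up interface development
Canvas rendering	react-konva	Canvas-based 2D rendering library that supports image overlays, mouse interaction, and shape dragging
Mask handling	pycocotools	Industry-standard COCO RLE encoding, ensuring compatibility with mainstream training frameworks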

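2.2 System Architecture

The tool uses a three-column layout with a clearly separated role for each area:

```text
┌──────────────────────┬────────────────────────────┬────────────────────────────┐
│ Image list           │ Canvas area                │ Tool panel                 │
│                      │                            │                            │
│ • Batch upload       │                            │ • Single-image upload      │
│ • Batch auto-label   │ Image + mask overlay       │ • Text / click / box       │
│ • Thumbnail list     │ • Click markers            │ • Segmentation results     │
│ • Annotation status  │ • Box preview              │ • Saved annotations        │
│                      │ • Polygon vertex editing   │ • Export YOLO / COCO       │
└──────────────────────┴────────────────────────────┴────────────────────────────┘
```

The front end and back end communicate over a REST API. Because mask data is large, masks are compressed with RLE (Run-Length Encoding) for transfer. Mask visualizations (semi-transparent fill plus outline) are rendered by the back end as PNG images and sent to the front end as base64 strings.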

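3. Back-End Implementation Details

3.1 Wrapping the SAM3 Model as a Service

The core of the back end is the SAM3Service class, which handles model loading, image-feature caching, and segmentation inference. Because the SAM3 model is large (typically over 2 GB) and can take several seconds to load, we load it lazily, on first use:

```python
import threading
from collections import OrderedDict

# build_sam3_image_model and Sam3Processor are provided by the SAM3 package.


class SAM3Service:
    def __init__(self, max_cache_size=10):
        self._model = None
        self._processor = None
        self._lock = threading.Lock()
        self._state_cache = OrderedDict()  # LRU cache: image_id -> image state
        self._max_cache_size = max_cache_size

    def _ensure_model(self):
        # Fast path: the model is already loaded.
        if self._processor is not None:
            return
        with self._lock:
            # Re-check under the lock so concurrent requests load the model only once.
            if self._processor is not None:
                return
            self._model = build_sam3_image_model(
                enable_inst_interactivity=True,  # key parameter: enables click-based segmentation
            )
            self._processor = Sam3Processor(self._model, confidence_threshold=0.5)
```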

The key parameter enable_inst_interactivity=True enables the SAM1-compatible interactive predictor, which click and box segmentation both depend on.
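The lazy-loading pattern used by _ensure_model (check, take the lock, check again) can be tried in isolation. The following is a minimal sketch under stated assumptions: LazyResource and slow_factory are hypothetical names introduced for illustration, and the factory returns a placeholder dictionary rather than a real model.

```python
import threading


class LazyResource:
    """Builds an expensive object on first use, at most once across threads."""

    def __init__(self, factory):
        self._factory = factory
        self._value = None
        self._lock = threading.Lock()

    def get(self):
        # Fast path: no locking once the value exists.
        if self._value is not None:
            return self._value
        with self._lock:
            # Second check: another thread may have built the value
            # while this one was waiting for the lock.
            if self._value is None:
                self._value = self._factory()
        return self._value


calls = []


def slow_factory():
    calls.append(1)  # record every construction
    return {"model": "loaded"}  # placeholder for a real model object


resource = LazyResource(slow_factory)
threads = [threading.Thread(target=resource.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight threads request the resource concurrently, yet the factory runs only once. This is the property that keeps a multi-gigabyte model from being loaded twice under concurrent requests.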

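3.2 Image Feature Caching

set_image() runs the full vision encoder and is the slowest step in the system (typically 2-3 seconds). Subsequent segmentation calls only run the lightweight text encoder or decoder head, so caching the encoded image state matters a great deal:

```python
import torch


def load_image(self, image_id, image):
    self._ensure_model()
    # Run the vision encoder once, in bfloat16 and without autograd bookkeeping.
    with torch.autocast("cuda", dtype=torch.bfloat16), torch.inference_mode():
        state = self._processor.set_image(image)
    self._put_state(image_id, state)  # store in the LRU cache
    return {"image_id": image_id, "width": image.size[0], "height": image.size[1]}
```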

We use an LRU (Least Recently Used) policy: when the cache exceeds its limit, the least recently used state is evicted and its GPU memory is released explicitly:

```python
def _put_state(self, image_id, state):
    self._state_cache[image_id] = state
    self._state_cache.move_to_end(image_id)  # mark as most recently used
    while len(self._state_cache) > self._max_cache_size:
        _, evicted = self._state_cache.popitem(last=False)  # pop the oldest entry
        self._release_state_tensors(evicted)  # explicitly free GPU tensors
```
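To see the eviction order concretely, here is a small self-contained sketch of the same OrderedDict-based LRU idea. LRUStateCache and its on_evict hook are hypothetical names for illustration; the hook merely stands in for whatever cleanup _release_state_tensors performs. Note that a read must also call move_to_end, otherwise the policy degrades to first-in, first-out.

```python
from collections import OrderedDict


class LRUStateCache:
    """Tiny LRU cache; on_evict is a hypothetical cleanup hook."""

    def __init__(self, max_size, on_evict=None):
        self._items = OrderedDict()
        self._max_size = max_size
        self._on_evict = on_evict

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # a hit makes the entry most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self._max_size:
            old_key, old_value = self._items.popitem(last=False)  # oldest first
            if self._on_evict is not None:
                self._on_evict(old_key, old_value)

    def keys(self):
        return list(self._items)


evicted = []
cache = LRUStateCache(max_size=2, on_evict=lambda k, v: evicted.append(k))
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes most recently used
cache.put("c", 3)  # evicts "b", the least recently used entry
```

After these calls the cache holds "a" and "c", and only "b" has been passed to the eviction hook.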

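3.3 The Three Segmentation Modes

3.3.1 Text-Driven Segmentation

Text segmentation is the most direct way to annotate: the user enters a text phrase and the model returns masks for all matching instances:

```python
def text_prompt(self, image_id, text):
    state = self._get_or_load_state(image_id)  # cache hit, or load on a miss
    state = self._processor.set_text_prompt(text, state)  # returns the updated state
    return self._format_result(state)
```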

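3.3.2 Click-Based Interactive Segmentation

Click segmentation goes through model.predict_inst() (the SAM1-compatible interface) and supports accumulating positive and negative points:

```python
import numpy as np


def click_prompt(self, image_id, points, labels):
    state = self._get_or_load_state(image_id)
    # points are normalized to [0, 1]; img_w and img_h are the image size in
    # pixels (their lookup is omitted in this excerpt).
    point_coords = np.array([[p[0] * img_w, p[1] * img_h] for p in points])
    point_labels = np.array(labels)  # 1 = positive point, 0 = negative point
    # One point: request several candidates and keep the best; more points: a single mask.
    use_multimask = len(points) == 1
    masks_np, scores_np, _ = self._model.predict_inst(
        state,
        point_coords=point_coords,
        point_labels=point_labels,
        multimask_output=use_multimask,
    )
```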

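The key detail is that predict_inst reuses the backbone_out features already computed by set_image() instead of running the vision encoder again. The first load of an image is therefore slow (a few seconds), while each subsequent click is segmented in milliseconds.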
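The excerpt above stops before the result is post-processed. When a single click yields several candidate masks, the usual choice is the candidate with the highest score. The sketch below assumes only that masks and scores are parallel sequences, as the names masks_np and scores_np suggest; pick_best_mask is a hypothetical helper, and the string "masks" are placeholders for real arrays.

```python
def pick_best_mask(masks, scores):
    """Return (index, mask, score) for the highest-scoring candidate."""
    if len(scores) == 0:
        raise ValueError("no candidate masks returned")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, masks[best], scores[best]


# Placeholder candidates standing in for three mask arrays.
candidate_masks = ["whole object", "object part", "sub-part"]
candidate_scores = [0.71, 0.93, 0.42]

best_index, best_mask, best_score = pick_best_mask(candidate_masks, candidate_scores)
```

Because the helper only indexes into its arguments, it works the same way for Python lists and for NumPy arrays whose first axis enumerates the candidates.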

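3.3.3 Box-Based Segmentation

Box segmentation also uses predict_inst, this time with the box argument:

```python
def box_prompt(self, image_id, box, label):
    state = self._get_or_load_state(image_id)
    # box arrives as normalized (center x, center y, width, height);
    # img_w and img_h are the image size in pixels (lookup omitted in this excerpt).
    cx, cy, w, h = box
    # Convert to pixel-space corners (x1, y1, x2, y2).
    box_pixels = np.array([
        (cx - w/2) * img_w, (cy - h/2) * img_h,
        (cx + w/2) * img_w, (cy + h/2) * img_h,
    ])
    masks_np, scores_np, _ = self._model.predict_inst(
        state, box=box_pixels, multimask_output=False,
    )
```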

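3.4 Mask Visualization

The back end renders each mask visualization as a PNG image with a semi-transparent fill and an outline:

```python
import base64
import io

import numpy as np
from PIL import Image as PILImage
from scipy.ndimage import binary_dilation

# Fallback palette used when the caller passes no colors (illustrative values).
DEFAULT_COLORS = [(255, 64, 64), (64, 160, 255), (64, 200, 96)]


def _generate_overlay(masks, img_h, img_w, colors=None):
    colors = colors or DEFAULT_COLORS  # avoid indexing None when no palette is given
    overlay = np.zeros((img_h, img_w, 4), dtype=np.uint8)
    for i, mask in enumerate(masks):
        color = colors[i % len(colors)]
        binary = mask > 0.5
        # Semi-transparent fill
        overlay[binary, :3] = color
        overlay[binary, 3] = 80
        # Outline: mark pixels that differ from a vertical or horizontal neighbor
        edge = np.zeros_like(binary, dtype=bool)
        edge[1:, :] |= binary[1:, :] != binary[:-1, :]
        edge[:-1, :] |= binary[1:, :] != binary[:-1, :]
        edge[:, 1:] |= binary[:, 1:] != binary[:, :-1]
        edge[:, :-1] |= binary[:, 1:] != binary[:, :-1]
        thick_edge = binary_dilation(edge, iterations=1)
        overlay[thick_edge, :3] = color
        overlay[thick_edge, 3] = 255
    img = PILImage.fromarray(overlay)  # an (H, W, 4) uint8 array is read as RGBA
    buf = io.BytesIO()
    img.save(buf, format='PNG', optimize=True)
    return base64.b64encode(buf.getvalue()).decode('utf-8')
```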

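The outline works as follows: a pixel is marked as an edge when its value differs from that of a vertical or horizontal neighbor, so pixels on both sides of each foreground/background transition are marked. binary_dilation then thickens the outline by one pixel to make it easier to see.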
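The four shifted comparisons in _generate_overlay can be hard to read at first. The following pure-Python sketch expresses the same neighbor test with explicit loops on a tiny mask. It is an illustration only: four_neighbor_edges is a hypothetical helper, it operates on nested lists rather than NumPy arrays, and it leaves out the dilation step.

```python
def four_neighbor_edges(mask):
    """Mark every pixel whose value differs from an up/down/left/right neighbor."""
    h, w = len(mask), len(mask[0])
    edge = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                inside = 0 <= ny < h and 0 <= nx < w
                if inside and mask[ny][nx] != mask[y][x]:
                    edge[y][x] = True
    return edge


mask = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
edges = four_neighbor_edges(mask)
for row in edges:
    print("".join("#" if e else "." for e in row))
```

For the single foreground pixel in the middle, the marked pixels form a plus shape: the center and its four direct neighbors are edges, while the corners are not, because diagonal neighbors are never compared.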

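4. Key Front-End Techniques

4.1 Canvas Interaction

The front end builds canvas interaction on react-konva. The main challenge is rendering the original image, the mask overlay, click markers, the box preview, and polygon editing handles on the same canvas.

4.1.1 Fitting the Image to the Canvas

The canvas computes a display scale from the container size and the image's original dimensions:

```typescript
// Leave a small margin inside the container and cap the height at 85% of the viewport.
const maxWidth = containerWidth - 16;
const maxHeight = window.innerHeight * 0.85;
const scaleByWidth = imageWidth > 0 ? maxWidth / imageWidth : 1;
const scaleByHeight = imageHeight > 0 ? maxHeight / imageHeight : 1;
// Never upscale: the factor is capped at 1.
const scale = Math.min(scaleByWidth, scaleByHeight, 1);
const displayWidth = imageWidth * scale;
const displayHeight = imageHeight * scale;
```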

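4.1.2 Click Handling

Because the canvas onClick event does not respond to right-clicks, we handle both mouse buttons in onMouseUp instead:

```typescript
const handleMouseUp = useCallback((e) => {
  // nx, ny (click position) and cx, cy, nw, nh (box) are normalized coordinates;
  // their computation is omitted in this excerpt.
  const isRightClick = e.evt.button === 2;
  if (toolMode === 'click') {
    const label = isRightClick ? 0 : 1;  // right button = negative point, left button = positive point
    onClickPrompt({ x: nx, y: ny, label });
  }
  if (toolMode === 'box' && boxStart) {
    onBoxPrompt([cx, cy, nw, nh], !isRightClick);
  }
}, [...]);
```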

The default context menu also has to be suppressed:

```typescript
const handleContextMenu = useCallback((e) => {
  e.evt.preventDefault();  // keep the browser menu from opening on right-click
}, []);
```
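The mouse handler in this section passes normalized values (nx, ny for a click, and a center/size box) without showing how they are derived. A plausible derivation is sketched below in Python, for consistency with the back-end examples; the tool itself would do this in the React front end. The function names and the clamping behavior are assumptions, not the tool's actual code.

```python
def normalize_point(px, py, display_w, display_h):
    """Map a position in displayed-image pixels to [0, 1] x [0, 1], clamped."""
    nx = min(max(px / display_w, 0.0), 1.0)
    ny = min(max(py / display_h, 0.0), 1.0)
    return nx, ny


def drag_to_cxcywh(start, end, display_w, display_h):
    """Turn a drag from start to end into a normalized (cx, cy, w, h) box.

    Sorting the coordinates makes the result independent of drag direction.
    """
    x0, y0 = normalize_point(start[0], start[1], display_w, display_h)
    x1, y1 = normalize_point(end[0], end[1], display_w, display_h)
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    return [(left + right) / 2, (top + bottom) / 2, right - left, bottom - top]


# Dragging from bottom-right to top-left on an 800x400 display.
box = drag_to_cxcywh((600, 300), (200, 100), display_w=800, display_h=400)
```

Because the coordinates are normalized against the displayed size, the back end can multiply by the original image size directly, and the display scale factor never has to cross the API boundary.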

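4.2 Polygon Editing

Saved annotations can be converted to polygon outlines for fine-grained editing. The back end extracts and simplifies the contours with OpenCV:

```python
import cv2
import numpy as np


def mask_to_polygon(mask, tolerance=2.0):
    # findContours expects a single-channel 8-bit image.
    mask_u8 = np.ascontiguousarray(mask, dtype=np.uint8)
    contours, _ = cv2.findContours(mask_u8, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    polygons = []
    for contour in contours:
        # Douglas-Peucker simplification; tolerance is the maximum deviation in pixels.
        approx = cv2.approxPolyDP(contour, tolerance, True)
        if len(approx) >= 3:  # skip degenerate contours
            polygons.append(approx.reshape(-1).tolist())  # flat [x0, y0, x1, y1, ...]
    return polygons
```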

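The front end renders the polygons and their vertices with react-konva's Line and Circle components and supports the following interactions:

Drag a vertex to adjust the shape
Double-click a vertex to delete it
Click the midpoint of an edge to insert a new vertex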
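The delete and insert interactions listed above reduce to simple operations on the flat [x0, y0, x1, y1, ...] list that mask_to_polygon returns. The sketch below shows that logic in Python for consistency with the back-end examples; the real tool implements it in the React front end. Both function names and the rule of never going below three vertices are assumptions.

```python
def delete_vertex(flat, index):
    """Remove vertex `index`; keep the polygon unchanged if it is a triangle."""
    n = len(flat) // 2
    if n <= 3:
        return list(flat)
    return flat[: 2 * index] + flat[2 * index + 2:]


def insert_midpoint(flat, edge_index):
    """Insert a vertex at the midpoint of the edge from vertex i to vertex i + 1.

    The last edge wraps around to vertex 0, since the polygon is closed.
    """
    n = len(flat) // 2
    x0, y0 = flat[2 * edge_index], flat[2 * edge_index + 1]
    j = (edge_index + 1) % n
    x1, y1 = flat[2 * j], flat[2 * j + 1]
    mid = [(x0 + x1) / 2, (y0 + y1) / 2]
    pos = 2 * (edge_index + 1)
    return flat[:pos] + mid + flat[pos:]


square = [0, 0, 10, 0, 10, 10, 0, 10]
with_mid = insert_midpoint(square, 0)    # new vertex between (0, 0) and (10, 0)
restored = delete_vertex(with_mid, 1)    # deleting it gives the square back
```

Both helpers return new lists instead of mutating their input, which fits React's convention of replacing state rather than editing it in place.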

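5. Batch Processing and Data Export

5.1 Batch Auto-Labeling

Batch labeling pushes progress to the client in real time over SSE (Server-Sent Events):

```python
import json

from fastapi.responses import StreamingResponse
from PIL import Image


@app.post("/api/batch/auto_label")
async def batch_auto_label(req: dict):
    # image_ids, text, file_path, and saved are derived from req and the upload
    # store; that code is omitted in this excerpt.
    def generate():
        for idx, image_id in enumerate(image_ids):
            # Load image features on demand
            if sam3_service._get_state(image_id) is None:
                image = Image.open(file_path).convert("RGB")
                sam3_service.load_image(image_id, image)
            # Text segmentation
            result = sam3_service.text_prompt(image_id, text)
            # Save annotations
            for i in range(result["count"]):
                _annotations.append({...})
            # One SSE event per image; the blank line terminates the event.
            yield f"data: {json.dumps({'status': 'done', 'count': saved})}\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
```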

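The front end reads this event stream and updates the progress bar and the completed count in real time. Note that the browser's native EventSource API only issues GET requests, so a POST endpoint like the one above has to be consumed with fetch and a streaming reader, or exposed as a GET route.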
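The wire format behind the progress stream is simple enough to check by hand: each event is a "data:" line followed by a blank line. The sketch below formats and parses such events in Python. The helper names are hypothetical, the index and total fields are assumptions added for illustration (the endpoint shown in this section sends only status and count), and the parser handles just the data field, not the full SSE specification.

```python
import json


def sse_event(payload):
    """Serialize one payload as a Server-Sent Events message."""
    return f"data: {json.dumps(payload)}\n\n"


def parse_sse(stream_text):
    """Parse a complete SSE text stream into a list of JSON payloads."""
    events = []
    for block in stream_text.split("\n\n"):  # events are separated by blank lines
        data_lines = []
        for line in block.split("\n"):
            if line.startswith("data:"):
                value = line[len("data:"):]
                if value.startswith(" "):  # one optional space after the colon
                    value = value[1:]
                data_lines.append(value)
        if data_lines:
            events.append(json.loads("\n".join(data_lines)))
    return events


stream = (
    sse_event({"status": "done", "index": 1, "total": 2, "count": 3})
    + sse_event({"status": "done", "index": 2, "total": 2, "count": 0})
)
events = parse_sse(stream)
progress = [e["index"] / e["total"] for e in events]
```

A client reading the stream incrementally would apply the same splitting to a buffer, keeping any trailing partial event until the next chunk arrives.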

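5.2 Export Formats

5.2.1 YOLO Format

The export is a zip archive containing an image folder and the annotation files:

```python
# Convert a pixel-space box (x1, y1, x2, y2) to normalized YOLO (cx, cy, w, h)
cx = ((box[0] + box[2]) / 2) / img_w
cy = ((box[1] + box[3]) / 2) / img_h
w = (box[2] - box[0]) / img_w
h = (box[3] - box[1]) / img_h
line = f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```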
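A quick way to validate an exported label file is to convert a line back to pixel coordinates and compare it with the source box. The sketch below wraps the conversion above in a function and adds the inverse; both function names are hypothetical, and the example values are chosen so the arithmetic is easy to follow.

```python
def yolo_bbox_line(class_id, box, img_w, img_h):
    """Format a pixel-space (x1, y1, x2, y2) box as a YOLO label line."""
    x1, y1, x2, y2 = box
    cx = ((x1 + x2) / 2) / img_w
    cy = ((y1 + y2) / 2) / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def parse_yolo_bbox_line(line, img_w, img_h):
    """Inverse of yolo_bbox_line: return (class_id, [x1, y1, x2, y2]) in pixels."""
    class_id, cx, cy, w, h = line.split()
    cx, cy, w, h = float(cx), float(cy), float(w), float(h)
    box = [
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    ]
    return int(class_id), box


line = yolo_bbox_line(0, [100, 50, 300, 150], img_w=400, img_h=400)
# line == "0 0.500000 0.250000 0.500000 0.250000"
class_id, box = parse_yolo_bbox_line(line, img_w=400, img_h=400)
```

With six decimal places the round trip is accurate to well under a pixel for typical image sizes, so a larger mismatch usually points to swapped width and height or a box that is not in (x1, y1, x2, y2) order.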

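5.2.2 COCO Format

Masks are encoded with pycocotools' standard RLE, ensuring compatibility with mainstream training frameworks:

```python
import numpy as np
from pycocotools import mask as coco_mask
rle = coco_mask.encode(np.asfortranarray(mask.astype(np.uint8)))  # expects column-major uint8
rle["counts"] = rle["counts"].decode("ascii")  # bytes -> str so the result is JSON-serializable
```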
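To make the column-major convention concrete, the following pure-Python sketch computes uncompressed COCO-style run lengths for a tiny mask and decodes them again. It is an illustration of the ordering only: the function names are hypothetical, and pycocotools additionally compresses the counts into a compact string, so its encode() output will not match these lists literally. In practice the library call above remains the right tool.

```python
def rle_counts_column_major(mask):
    """Run lengths over the mask flattened column by column.

    By COCO convention the first count is a run of zeros, so a mask that
    starts with a foreground pixel gets a leading 0.
    """
    h, w = len(mask), len(mask[0])
    flat = [mask[y][x] for x in range(w) for y in range(h)]  # column-major
    counts, current, run = [], 0, 0
    for value in flat:
        if value == current:
            run += 1
        else:
            counts.append(run)
            current, run = value, 1
    counts.append(run)
    return counts


def decode_counts_column_major(counts, h, w):
    """Rebuild the h x w mask from column-major run lengths."""
    flat, value = [], 0
    for run in counts:
        flat.extend([value] * run)
        value = 1 - value
    return [[flat[x * h + y] for x in range(w)] for y in range(h)]


mask = [
    [0, 1, 1],
    [0, 1, 0],
]
counts = rle_counts_column_major(mask)
decoded = decode_counts_column_major(counts, h=2, w=3)
```

Read column by column, the mask is 0, 0, 1, 1, 1, 0, giving counts [2, 3, 1]. Read row by row it would be 0, 1, 1, 0, 1, 0, giving [1, 2, 1, 1, 1], and that mismatch is what makes hand-written codecs produce scrambled masks.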

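6. Development Notes and Troubleshooting

6.1 Key Problems and Their Solutions

The reset_all_prompts trap
Sam3Processor.reset_all_prompts() in SAM3 modifies the state in place and does not return a new object. Writing state = processor.reset_all_prompts(state) therefore sets state to None. Call the method without assigning its result.
Multimask strategy for click segmentation
The SAM recommendation is to use multimask_output=True for a single point (three candidates are returned and the best one is kept) and multimask_output=False for multiple points (one combined result is returned). Using multimask with multiple points can lead the model to pick a partial mask.
Implementing box segmentation correctly
Avoid add_geometric_prompt here, because it requires a text prompt to be set first. Standalone box selection should use the box argument of predict_inst.
Row/column order in RLE encoding and decoding
COCO RLE flattens the mask column by column (Fortran order). Always use the standard pycocotools implementation rather than encoding or decoding by hand.
Repeated triggering of the antd Upload component
In directory mode, beforeUpload fires once for every file. Use a ref to record files that have already been handled so that nothing is uploaded twice.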
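The reset_all_prompts pitfall described in this section is an instance of a general Python pattern: a method that mutates its argument and returns None, like list.sort(). The sketch below reproduces the failure with a stand-in class. FakeProcessor is hypothetical and shares only the behavior described above with the real Sam3Processor.

```python
class FakeProcessor:
    """Stand-in, not the real Sam3Processor: mutates state in place, returns None."""

    def reset_all_prompts(self, state):
        state.pop("prompts", None)  # modify the caller's object directly


processor = FakeProcessor()
state = {"image_id": "img-1", "prompts": ["person"]}

# Wrong: the return value is None, so assigning it would discard the state.
result = processor.reset_all_prompts(state)

# Right: call for the side effect and keep using the same object.
processor.reset_all_prompts(state)
```

After these calls, result is None while state still holds the image entry with its prompts cleared, which is why the assignment form loses the cached image features.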