A Complete Guide to Building Remote Sensing AI Training Datasets - AI智能范式网


Mu Tian

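1. Overview of Building Remote Sensing AI Training Datasets

In computer vision, a high-quality training dataset is the foundation of model performance. Remote sensing image analysis brings its own challenges to dataset construction: multi-modal data fusion, geospatial registration over large areas, and annotation of rotated objects, among others. This article walks through the complete production workflow, from raw remote sensing data to a dataset in standard COCO/YOLO format.

Remote sensing datasets differ from ordinary image datasets in three main ways. First, the data comes from many kinds of sensors, including optical, SAR, and LiDAR. Second, objects can appear at any orientation, so conventional horizontal bounding boxes describe them poorly. Third, the data is geospatial, which demands strict geometric accuracy. As a result, general-purpose image annotation tools and methods often fit remote sensing scenarios poorly.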

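2. Multi-Source Data Acquisition and Preprocessing

2.1 Choosing Data Sources

The first step in building a remote sensing dataset is obtaining suitable raw data. Each type of source has its trade-offs:

Commercial satellite data (e.g., the WorldView series): 0.3-0.5 m resolution, well suited to fine-grained object detection, but expensive
Open satellite data (e.g., Sentinel-2): free, with up to 10 m resolution, well suited to large-area monitoring
SAR data (e.g., Sentinel-1): unaffected by weather, well suited to frequently cloudy regions
Aerial/UAV data: very high resolution acquired on demand, but limited coverage

In practice we usually adopt a hybrid strategy: open data forms the backbone, supplemented by commercial high-resolution data in key areas. For a building detection task, for example, Sentinel-2 can be used for coarse screening over a large area, with WorldView data purchased only for the priority areas.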

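2.2 Preprocessing Pipeline

Raw remote sensing data must be rigorously preprocessed before it can be annotated. A typical pipeline includes:

Radiometric correction: convert DN values to physical quantities. A linear gain/offset like the one below yields radiance or top-of-atmosphere reflectance; true surface reflectance additionally requires atmospheric correction.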

# Simple radiometric correction with rasterio
import rasterio

# Gain and offset are assumed known; in practice read them from the sensor metadata
gain = 0.01
bias = 0.0

with rasterio.open('raw_image.tif') as src:
    image = src.read().astype('float32') * gain + bias
    profile = src.profile
    profile.update(dtype='float32')

with rasterio.open('calibrated.tif', 'w', **profile) as dst:
    dst.write(image)
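For optical sensors, the linear rescaling is often followed by a solar-angle correction to obtain top-of-atmosphere (TOA) reflectance. The sketch below assumes a Landsat-style convention in which the metadata supplies a reflectance gain and offset plus the sun elevation angle; the function name and the coefficient values are illustrative only, and other sensors publish different formulas.

```python
import numpy as np

def dn_to_toa_reflectance(dn, gain, bias, sun_elevation_deg):
    """Rescale DN linearly, then divide by sin(sun elevation).

    Assumes Landsat-style reflectance coefficients; this is TOA
    reflectance, not atmospherically corrected surface reflectance.
    """
    dn = np.asarray(dn, dtype=np.float32)
    scaled = dn * gain + bias
    return scaled / np.sin(np.deg2rad(sun_elevation_deg))

# Illustrative values only; real coefficients come from the scene metadata
dn = np.array([[5000, 10000], [15000, 20000]], dtype=np.uint16)
toa = dn_to_toa_reflectance(dn, gain=2.0e-5, bias=-0.1, sun_elevation_deg=90.0)
```

With the sun directly overhead (elevation 90 degrees) the correction factor is 1, so the result equals the plain linear rescaling; lower sun elevations increase the computed reflectance.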

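Geometric correction: remove the effects of terrain and sensor attitude

# Orthorectification with GDAL, using the image's RPCs and a DEM
# (-rpc selects the RPC transformer that RPC_DEM applies to;
#  -tps would select GCP-based thin plate splines instead)
gdalwarp -rpc -r bilinear -dstalpha -et 0.01 \
         -to "RPC_DEM=/path/to/dem.tif" \
         input.tif output_ortho.tif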

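Image fusion: an example of fusing panchromatic and multispectral data

import numpy as np
from skimage.transform import resize

# pan_img: (H, W) panchromatic band; ms_img: (h, w, bands) multispectral image.
# Upsample the lower-resolution multispectral image onto the panchromatic grid.
ms_up = resize(ms_img.astype(np.float32), pan_img.shape + (ms_img.shape[2],),
               order=1, preserve_range=True)

# Simple Brovey fusion: scale each band by pan / mean multispectral intensity
ratio = pan_img.astype(np.float32) / (ms_up.mean(axis=2) + 1e-6)
fused = ms_up * ratio[:, :, np.newaxis]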

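Take particular care to preserve geometric accuracy during preprocessing: any coordinate offset will invalidate the annotations made later. After each processing step, check the alignment between the image and a reference basemap in a tool such as QGIS.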
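A visual check can be complemented by a number. The sketch below is one possible way to quantify alignment, assuming a handful of surveyed checkpoints are available and the image carries a GDAL-style six-element geotransform; the function names and the checkpoint layout are hypothetical.

```python
import math

def pixel_to_map(transform, col, row):
    """Apply a GDAL-style geotransform (x0, dx, rx, y0, ry, dy) to a pixel position."""
    x0, dx, rx, y0, ry, dy = transform
    return x0 + col * dx + row * rx, y0 + col * ry + row * dy

def checkpoint_rmse(transform, checkpoints):
    """RMSE in map units between predicted and reference checkpoint positions.

    checkpoints: list of (col, row, ref_x, ref_y) tuples.
    """
    squared = []
    for col, row, ref_x, ref_y in checkpoints:
        x, y = pixel_to_map(transform, col, row)
        squared.append((x - ref_x) ** 2 + (y - ref_y) ** 2)
    return math.sqrt(sum(squared) / len(squared))

# Illustrative 10 m grid and two made-up checkpoints
transform = (500000.0, 10.0, 0.0, 4100000.0, 0.0, -10.0)
checkpoints = [(0, 0, 500000.0, 4100000.0), (100, 200, 501003.0, 4097996.0)]
rmse = checkpoint_rmse(transform, checkpoints)
```

A reasonable acceptance threshold depends on the task; keeping the RMSE well below the size of the smallest annotated object is a common rule of thumb rather than a fixed standard.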

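3. Annotation Guidelines and Tool Selection

3.1 Remote-Sensing-Specific Annotation Guidelines

Annotating remote sensing objects differs from annotating natural images in several key ways:

Oriented bounding boxes (OBB): ships, aircraft, and similar objects have a clear orientation and need the five-parameter representation (x, y, w, h, θ)
Multi-scale annotation: objects of the same class look very different in imagery of different resolutions
Context annotation: besides the object itself, features of the surrounding environment also need to be annotated

We extended the COCO format to support rotated boxes: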

{
  "annotations": [{
    "id": 1,
    "image_id": 1,
    "category_id": 1,
    "bbox": [x, y, w, h, theta],  // rotated-box parameters
    "area": w*h,
    "segmentation": [[x1, y1, x2, y2, ...]],  // polygon vertices
    "iscrowd": 0
  }]
}
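The five box parameters and the polygon in "segmentation" describe the same shape, so one can be derived from the other. The sketch below shows one way to do it, under assumptions the article does not state: (x, y) is taken to be the box centre, θ is in degrees, and a positive θ rotates the x axis toward the y axis. Whatever convention a project picks should be fixed in the annotation guidelines.

```python
import math

def obb_to_polygon(cx, cy, w, h, theta_deg):
    """Corner points of a rotated box (assumed: centre-based, angle in degrees)."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    half = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]

def polygon_to_segmentation(corners):
    """Flatten [(x1, y1), ...] into the COCO-style list [x1, y1, x2, y2, ...]."""
    return [coord for point in corners for coord in point]

corners = obb_to_polygon(10, 10, 4, 2, 0)
segmentation = polygon_to_segmentation(corners)
```

Because rotation preserves area, "area" stays w*h regardless of θ under this convention.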

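3.2 Annotation Tool Comparison

We evaluated how well mainstream annotation tools fit remote sensing work:

Tool	Rotated boxes	Multispectral support	Large-image performance	Collaboration
LabelImg	No	No	Poor	None
RoLabelImg	Yes	No	Fair	None
CVAT	Yes	Limited	Good	Strong
X-AnyLabeling	Yes	Yes	Excellent	Moderate

For large projects, we recommend CVAT for team annotation. Its advantages include:

Task assignment and progress tracking
A built-in quality review workflow
Scalable server-side deployment

For individual researchers, X-AnyLabeling is a good choice: it integrates AI-assisted annotation models such as SAM, which can noticeably speed up annotation.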

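4. Data Augmentation and Quality Control

4.1 Remote-Sensing-Specific Data Augmentation

Unlike natural images, remote sensing data must stay geospatially consistent under augmentation:

Geometric augmentation: rotations and flips must be applied to the annotation coordinates and any georeferencing in the same step
Radiometric augmentation: spectral signatures must be preserved
Multi-temporal augmentation: the temporal order of a time series must not be broken

An example using the Albumentations library: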

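import albumentations as A

# Argument names follow Albumentations 1.x; newer releases rename some of them
transform = A.Compose([
    A.RandomRotate90(p=0.5),
    A.HorizontalFlip(p=0.5),   # HorizontalFlip/VerticalFlip replace the deprecated A.Flip
    A.VerticalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
    A.GaussNoise(var_limit=(10, 50), p=0.1),
    # CoarseDropout supersedes the deprecated A.Cutout
    A.CoarseDropout(max_holes=8, max_height=32, max_width=32, p=0.3)
], bbox_params=A.BboxParams(format='coco', label_fields=['category_ids']))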
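To make the requirement that geometric augmentation move the annotations together with the pixels concrete, here is a minimal NumPy-only sketch of a 90-degree counter-clockwise rotation applied to an image and its COCO-style [x, y, w, h] boxes. The function name is illustrative; libraries such as Albumentations perform the equivalent bookkeeping internally.

```python
import numpy as np

def rot90_with_boxes(image, boxes):
    """Rotate an (H, W[, C]) image 90 degrees counter-clockwise with its boxes.

    boxes are COCO-style [x, y, w, h] in pixels (top-left corner).
    """
    width = image.shape[1]
    rotated = np.rot90(image, k=1)
    # A pixel at column c, row r moves to column r, row (width - 1 - c)
    new_boxes = [[y, width - (x + w), h, w] for x, y, w, h in boxes]
    return rotated, new_boxes

# Mark one object on a 4 x 6 image and check that the box still covers it
image = np.zeros((4, 6), dtype=np.uint8)
image[1:3, 2:5] = 1                      # object at x=2, y=1, w=3, h=2
rotated, (box,) = rot90_with_boxes(image, [[2, 1, 3, 2]])
x, y, w, h = box
```

After the call, `rotated[y:y + h, x:x + w]` contains exactly the marked pixels, which is a convenient sanity check to run on any custom geometric transform.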

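4.2 Quality Control System

Dataset quality directly affects model performance, so we set up a three-tier quality inspection system:

Annotation consistency check: compute the IoU between annotators' boxes and require it to exceed 0.85
Geometric accuracy verification: randomly sample 5% of the annotations for manual review
Spectral plausibility check: make sure augmented data still obeys physical constraints

An example automated quality check script: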

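def check_annotation_quality(ann_df, valid_categories, img_width, img_height):
    """Run automated quality checks on a pandas DataFrame of annotations.

    Expects columns x, y, width, height (pixels) and category_id.
    """
    issues = []

    # Check that boxes have positive width and height
    invalid_boxes = ann_df[(ann_df['width'] <= 0) | (ann_df['height'] <= 0)].index
    issues.extend(f"invalid box size in {idx}" for idx in invalid_boxes)

    # Check category labels
    unknown_cats = ann_df[~ann_df['category_id'].isin(valid_categories)].index
    issues.extend(f"unknown category in {idx}" for idx in unknown_cats)

    # Check that boxes lie within the image
    out_of_bounds = ann_df[
        (ann_df['x'] < 0) | (ann_df['y'] < 0) |
        (ann_df['x'] + ann_df['width'] > img_width) |
        (ann_df['y'] + ann_df['height'] > img_height)
    ].index
    issues.extend(f"out of bounds in {idx}" for idx in out_of_bounds)

    return issues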
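The annotator-consistency tier can be sketched in the same spirit. The example below assumes two annotators have labelled the same image with horizontal COCO-style boxes and flags every box from the first annotator whose best overlap with the second annotator's boxes does not exceed the 0.85 threshold. The function names are hypothetical, and rotated boxes would need a polygon IoU instead.

```python
def box_iou(a, b):
    """IoU of two COCO-style [x, y, w, h] boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def flag_disagreements(boxes_a, boxes_b, threshold=0.85):
    """Return (index, best_iou) for boxes in A without a match above threshold in B."""
    flagged = []
    for i, a in enumerate(boxes_a):
        best = max((box_iou(a, b) for b in boxes_b), default=0.0)
        if best <= threshold:
            flagged.append((i, best))
    return flagged

annotator_a = [[0, 0, 10, 10], [50, 50, 10, 10]]
annotator_b = [[0, 0, 10, 10], [55, 50, 10, 10]]
flagged = flag_disagreements(annotator_a, annotator_b)
```

Here the first pair of boxes agrees exactly, while the second pair overlaps by only one third and would be sent back for review.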

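5. Format Conversion and Dataset Release

5.1 Key Points for COCO Conversion

When converting custom annotations to COCO format, note that:

Category IDs should start at 1 and be numbered consecutively (the YOLO conversion in 5.2 relies on this)
Each image entry must include width and height
The annotated area (area) should be computed precisely

Core logic of the conversion script: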

import json
import os

from PIL import Image

def convert_to_coco(input_dir, output_json):
    # CATEGORIES is assumed to be a module-level list of class names
    coco_dict = {
        "images": [],
        "annotations": [],
        "categories": []
    }

    # Add categories (ids start at 1)
    for i, cat in enumerate(CATEGORIES, 1):
        coco_dict["categories"].append({
            "id": i,
            "name": cat,
            "supercategory": "object"
        })

    # Process each image (sorted so that ids are reproducible)
    ann_id = 1
    for img_file in sorted(os.listdir(os.path.join(input_dir, 'images'))):
        img_path = os.path.join(input_dir, 'images', img_file)
        img_id = len(coco_dict["images"]) + 1

        # Add image info
        with Image.open(img_path) as img:
            width, height = img.size

        coco_dict["images"].append({
            "id": img_id,
            "file_name": img_file,
            "width": width,
            "height": height
        })

        # Add the matching annotations, if an annotation file exists
        stem = os.path.splitext(img_file)[0]
        ann_file = os.path.join(input_dir, 'annotations', stem + '.txt')
        if not os.path.exists(ann_file):
            continue
        with open(ann_file) as f:
            for line in f:
                if not line.strip():
                    continue
                # Each line is assumed to be "category_id x y w h" with a
                # 1-based category id and a top-left box in pixels
                cat_id, x, y, w, h = map(float, line.split())
                area = w * h

                coco_dict["annotations"].append({
                    "id": ann_id,
                    "image_id": img_id,
                    "category_id": int(cat_id),
                    "bbox": [x, y, w, h],
                    "area": area,
                    "iscrowd": 0
                })
                ann_id += 1

    # Save the result
    with open(output_json, 'w') as f:
        json.dump(coco_dict, f)
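The point about computing area precisely matters most for polygon or rotated annotations, where the w * h of an enclosing horizontal box overstates the object's footprint. A minimal sketch using the shoelace formula on a COCO-style flat vertex list follows; the function name is illustrative, and the polygon is assumed to be simple (non-self-intersecting).

```python
def polygon_area(flat_vertices):
    """Area of a simple polygon given as [x1, y1, x2, y2, ...] (shoelace formula)."""
    xs = flat_vertices[0::2]
    ys = flat_vertices[1::2]
    n = len(xs)
    twice_area = 0.0
    for i in range(n):
        j = (i + 1) % n  # next vertex, wrapping around to the first
        twice_area += xs[i] * ys[j] - xs[j] * ys[i]
    return abs(twice_area) / 2.0

rectangle = [0, 0, 4, 0, 4, 3, 0, 3]
triangle = [0, 0, 4, 0, 0, 3]
rect_area = polygon_area(rectangle)   # 12.0
tri_area = polygon_area(triangle)     # 6.0
```

The triangle shares the rectangle's bounding box but has half its area, which is exactly the error a bounding-box-based "area" would introduce.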

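5.2 Notes on the YOLO Format

The YOLO format uses normalized coordinates, so conversion needs particular care:

Box centers: x_center = (x + w/2) / img_width, y_center = (y + h/2) / img_height
Normalization: all values should lie in the range [0, 1]
Category IDs start at 0

Conversion example: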

import os
import shutil
from collections import defaultdict

def coco_to_yolo(coco_ann, input_dir, output_dir):
    # input_dir is the dataset root whose 'images' folder holds the source files
    # Create the directory structure
    os.makedirs(os.path.join(output_dir, 'labels'), exist_ok=True)
    os.makedirs(os.path.join(output_dir, 'images'), exist_ok=True)

    # Group annotations by image once instead of rescanning for every image
    anns_by_image = defaultdict(list)
    for ann in coco_ann['annotations']:
        anns_by_image[ann['image_id']].append(ann)

    # Process each image
    for img_info in coco_ann['images']:
        img_w = img_info['width']
        img_h = img_info['height']

        # Write the YOLO label file
        stem = os.path.splitext(img_info['file_name'])[0]
        label_file = os.path.join(output_dir, 'labels', stem + '.txt')
        with open(label_file, 'w') as f:
            for ann in anns_by_image[img_info['id']]:
                # Convert a top-left pixel box to a normalized center box
                x, y, w, h = ann['bbox']
                x_center = (x + w/2) / img_w
                y_center = (y + h/2) / img_h
                width = w / img_w
                height = h / img_h

                # YOLO format: class_id x_center y_center width height
                # (category ids are assumed consecutive from 1, so subtract 1)
                line = f"{ann['category_id']-1} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}\n"
                f.write(line)

        # Copy the image file
        shutil.copy(os.path.join(input_dir, 'images', img_info['file_name']),
                    os.path.join(output_dir, 'images', img_info['file_name']))
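A quick way to catch mistakes in this conversion is a round trip: convert a box to YOLO coordinates and back, then confirm the original values return. The helper names below are hypothetical; the arithmetic is the same as in the conversion above.

```python
def coco_to_yolo_bbox(bbox, img_w, img_h):
    """[x, y, w, h] in pixels (top-left) -> (xc, yc, w, h) normalized to [0, 1]."""
    x, y, w, h = bbox
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)

def yolo_to_coco_bbox(yolo_box, img_w, img_h):
    """(xc, yc, w, h) normalized -> [x, y, w, h] in pixels (top-left)."""
    xc, yc, w, h = yolo_box
    box_w, box_h = w * img_w, h * img_h
    return [xc * img_w - box_w / 2, yc * img_h - box_h / 2, box_w, box_h]

# Worked example: a 50 x 80 box at (100, 200) in a 1000 x 800 image
yolo_box = coco_to_yolo_bbox([100, 200, 50, 80], 1000, 800)
restored = yolo_to_coco_bbox(yolo_box, 1000, 800)
```

For this box the YOLO values are approximately (0.125, 0.3, 0.05, 0.1), and the round trip returns the original box up to floating-point error. A value outside [0, 1] at this stage usually means the box extends past the image or the width and height were swapped.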

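6. Practical Lessons and Pitfalls

Across several remote sensing dataset projects, we have distilled the following key lessons:

Coordinate system consistency: make sure all data uses the same CRS (EPSG:4326 or UTM is recommended); mixing coordinate systems causes annotations to be misaligned
Multi-temporal registration: in change detection tasks, images from different dates must be strictly co-registered. We suggest SIFT feature matching combined with RANSAC: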

import cv2
import numpy as np

def align_images(img1, img2):
    # Initialize the SIFT detector
    sift = cv2.SIFT_create()

    # Find keypoints and descriptors
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # FLANN matcher (algorithm=1 selects the KD-tree index)
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
    matches = flann.knnMatch(des1, des2, k=2)

    # Keep good matches using Lowe's ratio test
    good = []
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < 0.7 * pair[1].distance:
            good.append(pair[0])

    # A homography needs at least 4 point correspondences
    if len(good) < 4:
        raise ValueError("not enough matches to estimate a homography")

    # Estimate the homography
    src_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    M, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)

    # Warp img1 into img2's frame
    aligned = cv2.warpPerspective(img1, M, (img2.shape[1], img2.shape[0]))

    return aligned, M
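Whether a registration is strict enough can be judged from the reprojection error of the matched points. The sketch below assumes the matched point arrays and the 3x3 matrix from the function above are available to the caller (the function as written returns only the aligned image and M, so the points would need to be returned as well); the helper name is hypothetical.

```python
import numpy as np

def reprojection_rmse(M, src_pts, dst_pts):
    """RMSE in pixels between homography-projected src points and dst points."""
    src = np.asarray(src_pts, dtype=np.float64).reshape(-1, 2)
    dst = np.asarray(dst_pts, dtype=np.float64).reshape(-1, 2)
    homogeneous = np.hstack([src, np.ones((len(src), 1))])
    projected = homogeneous @ np.asarray(M, dtype=np.float64).T
    projected = projected[:, :2] / projected[:, 2:3]  # back from homogeneous coords
    return float(np.sqrt(np.mean(np.sum((projected - dst) ** 2, axis=1))))

# Toy check with a pure translation by (2, -1)
M = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.0, 0.0, 1.0]])
src = [(0.0, 0.0), (10.0, 5.0)]
exact = reprojection_rmse(M, src, [(2.0, -1.0), (12.0, 4.0)])
shifted = reprojection_rmse(M, src, [(5.0, 3.0), (15.0, 8.0)])
```

Restricting the points to the RANSAC inliers (the mask returned by cv2.findHomography) gives a fairer estimate, since outliers are expected to have large errors.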

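Annotator training: annotating remote sensing objects requires domain knowledge, so annotators must be trained thoroughly, including:
- Recognizing the features of typical objects
- Rules for handling occluded and truncated objects
- Standards for annotating hard samples
Version control: manage dataset versions with DVC: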

# Initialize DVC (run inside a Git repository)
dvc init
dvc add data/raw_images
dvc add data/annotations

# Add remote storage
dvc remote add -d myremote /path/to/remote

# Push the data
dvc push

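Compute resource planning: processing large datasets calls for sensible resource configuration:
- Use parallel processing to speed up preprocessing: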

from multiprocessing import Pool

def process_image(args):
    img_path, output_dir = args
    # Run the processing steps...
    result = img_path  # placeholder: return whatever the processing produces
    return result

if __name__ == '__main__':
    img_list = [...]  # paths of all images to process
    args_list = [(img, 'output') for img in img_list]

    with Pool(8) as p:  # use 8 processes
        results = p.map(process_image, args_list)

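- Process large files in chunks:

import rasterio

with rasterio.open('large.tif') as src:
    # Iterate over the file's internal blocks instead of loading it whole
    for ji, window in src.block_windows():
        chunk = src.read(window=window)
        # Process the chunk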
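Internal blocks suit streaming computations, but training chips usually need a fixed size and some overlap so that objects cut by one tile edge appear whole in a neighbouring tile. The sketch below generates such tiles in pure Python; the function names, tile size, and overlap are illustrative assumptions, and each tuple could be passed to rasterio.windows.Window(col_off, row_off, width, height) for reading.

```python
def _tile_starts(length, tile, stride):
    """Start offsets along one axis; stops once a tile reaches the far edge."""
    starts = [0]
    while starts[-1] + tile < length:
        starts.append(starts[-1] + stride)
    return starts

def tile_windows(width, height, tile=1024, overlap=128):
    """Yield (col_off, row_off, w, h) tiles covering the image; edge tiles are clipped."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than the tile size")
    stride = tile - overlap
    for row in _tile_starts(height, tile, stride):
        for col in _tile_starts(width, tile, stride):
            yield col, row, min(tile, width - col), min(tile, height - row)

windows = list(tile_windows(2000, 1024, tile=1024, overlap=128))
```

For a 2000 x 1024 image this produces three tiles along the width, the last one clipped to 208 pixels. Padding edge tiles to full size is an alternative when the model requires a fixed input shape.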

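Building a high-quality remote sensing training dataset is a systems engineering effort that requires a rigorous workflow and strict quality control. The methods described here have been validated in several real projects and produce standard datasets that meet the requirements of mainstream deep learning frameworks.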