Building a High-Performance Tensor Library in Rust: From Memory Layout to Parallel Computation

硅谷IT胖子

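1. Project Overview

In deep learning, the tensor is the most fundamental data structure. As a generalization of arrays to multiple dimensions, tensors underpin the core computations of modern machine learning frameworks. This tutorial series walks through building a tensor library from scratch in Rust; this part focuses on implementing tensor data operations.

Rust is a systems language whose memory safety and zero-cost abstractions make it a strong choice for high-performance tensor computation. Unlike the Python ecosystem, where mature libraries such as NumPy hide these details, building tensors in Rust requires a deeper understanding of memory layout and computational optimization.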

2. Core Design

2.1 Tensor Representation

A struct encapsulates the tensor's core attributes:

pub struct Tensor<T> {
    data: Vec<T>,        // elements stored contiguously
    shape: Vec<usize>,   // size of each dimension
    strides: Vec<isize>, // step, in elements, between neighbors along each dimension
}

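The key advantages of this design are:

data uses contiguous memory, which improves cache locality
strides enables a flexible view mechanism
The generic parameter <T> supports multiple element types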
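To make the role of strides concrete, here is a minimal sketch of how a multi-dimensional index could be mapped to a position in data. The helper name flat_offset is hypothetical and not part of the library described here; it assumes strides are counted in elements rather than bytes.

```rust
/// Maps a multi-dimensional index to a flat position by summing
/// index[i] * strides[i]. Returns None for a rank mismatch, an
/// out-of-bounds index, or a negative resulting position.
fn flat_offset(shape: &[usize], strides: &[isize], index: &[usize]) -> Option<usize> {
    if index.len() != shape.len() || strides.len() != shape.len() {
        return None;
    }
    let mut offset: isize = 0;
    for ((&i, &dim), &stride) in index.iter().zip(shape).zip(strides) {
        if i >= dim {
            return None;
        }
        offset += i as isize * stride;
    }
    if offset < 0 {
        None
    } else {
        Some(offset as usize)
    }
}
```

With shape [3, 2] and strides [2, 1], index [2, 1] maps to position 2*2 + 1*1 = 5. The same function also works for a transposed view, because only the stride values change.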

2.2 Memory Layout Strategy

Row-major storage is the common choice in deep learning:

// Memory layout of a 3x2 matrix
let tensor = Tensor {
    data: vec![1, 2, 3, 4, 5, 6],
    shape: vec![3, 2],
    strides: vec![2, 1], // row stride = 2, column stride = 1
};

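This layout works with BLAS-style math libraries (CBLAS, for example, accepts a row-major flag), and in addition:

The contiguous last dimension suits SIMD vectorization
Stride arithmetic supports broadcasting
It simplifies interop with C/C++ libraries, whose multi-dimensional arrays are also row-major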
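Row-major strides follow directly from the shape: the last dimension has stride 1, and each earlier stride is the product of all later dimension sizes. A minimal sketch of that computation, with row_major_strides as a hypothetical helper name:

```rust
/// Computes element strides for a contiguous row-major layout.
fn row_major_strides(shape: &[usize]) -> Vec<isize> {
    let mut strides = vec![1isize; shape.len()];
    // Walk from the second-to-last dimension back to the first,
    // multiplying by the size of the dimension just to the right.
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1] as isize;
    }
    strides
}
```

For a shape of [2, 3, 4] this gives strides [12, 4, 1]: moving one step along the first axis skips a full 3x4 block of 12 elements.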

3. Basic Operations

3.1 Constructors

Several construction methods are provided:

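// Create from a vector
let v = Tensor::from_vec(vec![1, 2, 3], vec![3]);

// All-zero tensor (the element type must be named or inferred)
let z = Tensor::<f32>::zeros(vec![2, 2]);

// Random initialization, uniform between 0.0 and 1.0
let r = Tensor::rand_uniform(vec![3, 3], 0.0, 1.0);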
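These constructors are used without their definitions. One plausible way to write from_vec and zeros is sketched here, assuming contiguous row-major storage and T::default() as the zero value. The accessor methods and the derive list are additions for illustration only. rand_uniform is left out because it needs a random number source, such as an external crate.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Tensor<T> {
    data: Vec<T>,
    shape: Vec<usize>,
    strides: Vec<isize>,
}

impl<T> Tensor<T> {
    /// Builds a contiguous row-major tensor. Panics if the number of
    /// elements does not equal the product of `shape`.
    pub fn from_vec(data: Vec<T>, shape: Vec<usize>) -> Self {
        let expected: usize = shape.iter().product();
        assert_eq!(data.len(), expected, "data length must match shape");
        let mut strides = vec![1isize; shape.len()];
        for i in (0..shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * shape[i + 1] as isize;
        }
        Tensor { data, shape, strides }
    }

    // Read-only accessors (hypothetical, added for this sketch).
    pub fn data(&self) -> &[T] { &self.data }
    pub fn shape(&self) -> &[usize] { &self.shape }
    pub fn strides(&self) -> &[isize] { &self.strides }
}

impl<T: Clone + Default> Tensor<T> {
    /// Fills the tensor with T::default(), which is 0 for numeric types.
    pub fn zeros(shape: Vec<usize>) -> Self {
        let len: usize = shape.iter().product();
        Self::from_vec(vec![T::default(); len], shape)
    }
}
```

Validating the length in from_vec means every other method can rely on data holding exactly one element per index.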

3.2 Indexing and Slicing

Implementing the Index trait gives efficient row access:

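use std::ops::Index;

impl<T> Index<usize> for Tensor<T> {
    type Output = [T];
    // Returns row `idx`. Assumes a contiguous row-major tensor, where
    // strides[0] equals the number of elements in one row.
    fn index(&self, idx: usize) -> &[T] {
        let row_len = self.strides[0] as usize;
        let offset = idx * row_len;
        &self.data[offset..offset + row_len]
    }
}

// Usage example
let mat = Tensor::from_vec(vec![1, 2, 3, 4], vec![2, 2]);
assert_eq!(mat[1][0], 3);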

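Slicing adjusts shape and the starting offset while leaving strides unchanged. Because this Tensor owns its data in a Vec<T>, the version here copies the elements from the starting offset onward; a true zero-copy slice requires a shared buffer plus a stored offset:

fn slice(&self, dim: usize, start: usize, end: usize) -> Self
where
    T: Clone,
{
    assert!(start <= end && end <= self.shape[dim], "slice range out of bounds");
    let mut new_shape = self.shape.clone();
    new_shape[dim] = end - start;

    // The first element of the slice sits `start` steps along `dim`.
    let offset = start * self.strides[dim] as usize;
    // `data` is an owned Vec<T>, so the tail is copied rather than borrowed.
    let new_data = self.data[offset..].to_vec();

    Tensor { data: new_data, shape: new_shape, strides: self.strides.clone() }
}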
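One way to make slicing share the underlying buffer is to keep the storage behind a reference-counted pointer and record an explicit starting offset. The sketch here illustrates that idea under those assumptions. TensorView, its fields, and its methods are hypothetical names, not part of the Tensor type defined earlier.

```rust
use std::sync::Arc;

/// Hypothetical view type: shares the buffer and records where the view starts.
pub struct TensorView<T> {
    storage: Arc<Vec<T>>,
    offset: usize,
    shape: Vec<usize>,
    strides: Vec<isize>,
}

impl<T> TensorView<T> {
    pub fn new(data: Vec<T>, shape: Vec<usize>, strides: Vec<isize>) -> Self {
        TensorView { storage: Arc::new(data), offset: 0, shape, strides }
    }

    /// Restricts dimension `dim` to `start..end` without copying any element.
    pub fn slice(&self, dim: usize, start: usize, end: usize) -> Self {
        assert!(start <= end && end <= self.shape[dim], "slice out of range");
        let mut shape = self.shape.clone();
        shape[dim] = end - start;
        // Assumes the resulting offset is non-negative.
        let offset = (self.offset as isize + start as isize * self.strides[dim]) as usize;
        TensorView {
            storage: Arc::clone(&self.storage), // bumps the reference count only
            offset,
            shape,
            strides: self.strides.clone(),
        }
    }

    /// Reads one element through the view's offset and strides.
    pub fn get(&self, index: &[usize]) -> Option<&T> {
        if index.len() != self.shape.len() {
            return None;
        }
        let mut pos = self.offset as isize;
        for ((&i, &dim), &stride) in index.iter().zip(&self.shape).zip(&self.strides) {
            if i >= dim {
                return None;
            }
            pos += i as isize * stride;
        }
        self.storage.get(pos as usize)
    }

    /// True when both views point at the same allocation.
    pub fn shares_storage(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.storage, &other.storage)
    }
}
```

Slicing rows 1..3 of a 3x2 view yields a 2x2 view whose offset is 2, and both views report the same allocation through shares_storage.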

4. Mathematical Operations

4.1 Element-wise Operations

Basic arithmetic is provided by overloading the operator traits:

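use std::ops::Add;

impl<T: Add<Output = T>> Add for Tensor<T> {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        assert_eq!(self.shape, rhs.shape, "shapes must match");
        // Both operands are consumed, so elements are moved and no Copy bound is needed.
        // Flat zipping assumes both tensors are contiguous with the same layout.
        let data = self.data.into_iter().zip(rhs.data).map(|(a, b)| a + b).collect();
        Tensor { data, shape: self.shape, strides: self.strides }
    }
}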
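The same zip-and-combine pattern extends to other binary operations. As a rough sketch, a single helper over flat slices could back subtraction, multiplication, and comparison as well. zip_map is a hypothetical name, and the sketch assumes both inputs are contiguous and already have matching shapes.

```rust
/// Applies `f` to each pair of elements from two equally long slices.
fn zip_map<T, U, F>(lhs: &[T], rhs: &[T], f: F) -> Vec<U>
where
    T: Copy,
    F: Fn(T, T) -> U,
{
    assert_eq!(lhs.len(), rhs.len(), "operands must have the same length");
    lhs.iter().zip(rhs).map(|(&a, &b)| f(a, b)).collect()
}
```

Because the helper borrows its inputs, both operands remain usable afterwards, unlike the consuming Add implementation.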

4.2 Matrix Multiplication

The kernel here is a cache-blocked (tiled) GEMM:

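fn matmul(a: &Tensor<f32>, b: &Tensor<f32>) -> Tensor<f32> {
    // Both inputs must be contiguous row-major matrices.
    assert_eq!(a.shape.len(), 2);
    assert_eq!(b.shape.len(), 2);
    assert_eq!(a.shape[1], b.shape[0], "inner dimensions must match");
    let m = a.shape[0];
    let n = b.shape[1];
    let k = a.shape[1];

    let mut res = Tensor::zeros(vec![m, n]);

    // Blocking: work on BLOCK_SIZE x BLOCK_SIZE tiles so operands stay in cache
    const BLOCK_SIZE: usize = 64;
    for i in (0..m).step_by(BLOCK_SIZE) {
        for j in (0..n).step_by(BLOCK_SIZE) {
            for kk in (0..k).step_by(BLOCK_SIZE) {
                // Micro-kernel: accumulate one tile's partial products
                for ii in i..(i + BLOCK_SIZE).min(m) {
                    for jj in j..(j + BLOCK_SIZE).min(n) {
                        let mut sum = 0.0f32;
                        for kkk in kk..(kk + BLOCK_SIZE).min(k) {
                            // Flat row-major indexing; no two-index Index impl is defined
                            sum += a.data[ii * k + kkk] * b.data[kkk * n + jj];
                        }
                        res.data[ii * n + jj] += sum;
                    }
                }
            }
        }
    }
    res
}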
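A blocked kernel is easy to get subtly wrong at tile boundaries, so it helps to have an unblocked reference to compare against. The following is a minimal sketch of such a reference over flat row-major slices. matmul_naive and its parameter order are assumptions made for illustration.

```rust
/// Reference product of an m x k matrix `a` and a k x n matrix `b`,
/// both stored as flat row-major slices.
fn matmul_naive(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "a must hold m * k elements");
    assert_eq!(b.len(), k * n, "b must hold k * n elements");
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            // The innermost loop walks `b` and `out` contiguously.
            for j in 0..n {
                out[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    out
}
```

Comparing the two implementations on sizes that are not multiples of the block size, such as 65 x 70, exercises the partial tiles.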

5. Advanced Operations

5.1 Broadcasting

Broadcasting handles operations between tensors of different shapes:

fn broadcast_shapes(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let max_rank = a.len().max(b.len());
    let mut shape = vec![1; max_rank];

    // Right-align `a`; missing leading dimensions stay 1.
    for (i, &dim) in a.iter().rev().enumerate() {
        shape[max_rank - 1 - i] = dim;
    }

    // Merge `b` from the trailing dimension: sizes must match or one must be 1.
    for (i, &dim) in b.iter().rev().enumerate() {
        let idx = max_rank - 1 - i;
        if shape[idx] != 1 && dim != 1 && shape[idx] != dim {
            return None;
        }
        if shape[idx] == 1 {
            shape[idx] = dim; // taking max() would turn a size-0 axis into 1
        }
    }

    Some(shape)
}
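Once the broadcast shape is known, each operand can be read as if it had that shape by giving every broadcast dimension a stride of 0, so the same element is revisited instead of copied. A minimal sketch of that step, with broadcast_strides as a hypothetical helper:

```rust
/// Returns the strides that let a tensor of `shape`/`strides` be read
/// as if it had shape `target`. Broadcast dimensions get stride 0.
/// Returns None when the shapes are not broadcast-compatible.
fn broadcast_strides(shape: &[usize], strides: &[isize], target: &[usize]) -> Option<Vec<isize>> {
    if shape.len() != strides.len() || shape.len() > target.len() {
        return None;
    }
    // Number of leading dimensions that exist only in `target`.
    let pad = target.len() - shape.len();
    let mut out = vec![0isize; target.len()];
    for i in 0..shape.len() {
        if shape[i] == target[pad + i] {
            out[pad + i] = strides[i]; // dimension kept as-is
        } else if shape[i] != 1 {
            return None; // sizes differ and this one is not 1
        }
        // Otherwise a size-1 dimension is stretched and its stride stays 0.
    }
    Some(out)
}
```

For example, a [3, 1] tensor with strides [1, 1] viewed as [2, 3, 4] gets strides [0, 1, 0]: only the middle axis actually advances through memory.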

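5.2 Transpose

Transposition only swaps the shape and stride entries of the two dimensions; no element is reordered. With an owned Vec<T> the buffer is still cloned here, so the operation becomes zero-copy only once the storage is shared between tensors:

fn transpose(&self, dim0: usize, dim1: usize) -> Self
where
    T: Clone,
{
    let mut new_shape = self.shape.clone();
    new_shape.swap(dim0, dim1);

    let mut new_strides = self.strides.clone();
    new_strides.swap(dim0, dim1);

    // The element order in `data` is unchanged; only the metadata differs.
    Tensor { data: self.data.clone(), shape: new_shape, strides: new_strides }
}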
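A transposed tensor is no longer contiguous in row-major order, which matters for any routine that indexes data with flat arithmetic. The sketch here shows the metadata swap on plain slices together with a contiguity check. Both function names are hypothetical.

```rust
/// Swaps two axes in a (shape, strides) pair; the data buffer is untouched.
fn transpose_meta(
    shape: &[usize],
    strides: &[isize],
    d0: usize,
    d1: usize,
) -> (Vec<usize>, Vec<isize>) {
    let (mut new_shape, mut new_strides) = (shape.to_vec(), strides.to_vec());
    new_shape.swap(d0, d1);
    new_strides.swap(d0, d1);
    (new_shape, new_strides)
}

/// True when the strides describe a dense row-major layout.
/// Size-1 dimensions are ignored because their stride is never used.
fn is_row_major_contiguous(shape: &[usize], strides: &[isize]) -> bool {
    if shape.len() != strides.len() {
        return false;
    }
    let mut expected: isize = 1;
    for (&dim, &stride) in shape.iter().zip(strides).rev() {
        if dim != 1 && stride != expected {
            return false;
        }
        expected *= dim as isize;
    }
    true
}
```

A check like this lets contiguous-only kernels either reject a transposed input or copy it into dense form first.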

6. Performance Optimization Techniques

6.1 Memory Alignment

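Allocating buffers on a 64-byte boundary improves SIMD load and store efficiency. The buffer must be freed with the same layout it was allocated with, so it is wrapped in its own type instead of being handed to Vec::from_raw_parts, which would deallocate with the alignment of T:

use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};

// Uninitialized buffer for `capacity` elements, aligned to 64 bytes.
struct AlignedBuf<T> {
    ptr: *mut T,
    layout: Layout,
}

impl<T> AlignedBuf<T> {
    fn new(capacity: usize) -> Self {
        let size = capacity * std::mem::size_of::<T>();
        assert!(size > 0, "alloc does not accept zero-sized layouts");
        let layout = Layout::from_size_align(size, 64).unwrap(); // cache-line alignment
        let ptr = unsafe { alloc(layout) } as *mut T;
        if ptr.is_null() { handle_alloc_error(layout); }
        AlignedBuf { ptr, layout }
    }
}

impl<T> Drop for AlignedBuf<T> {
    fn drop(&mut self) {
        // Free with exactly the layout used for allocation.
        unsafe { dealloc(self.ptr as *mut u8, self.layout) }
    }
}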

6.2 Parallel Computation

Rayon provides data parallelism:

use rayon::prelude::*;

impl Tensor<f32> {
    // Applies `f` to every element, splitting the work across Rayon's thread pool.
    fn parallel_map(&self, f: impl Fn(f32) -> f32 + Sync + Send) -> Self {
        let new_data = self.data.par_iter().map(|&x| f(x)).collect();
        Tensor { data: new_data, shape: self.shape.clone(), strides: self.strides.clone() }
    }
}
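For readers who want to see what such a parallel map does underneath, the following is a rough standard-library-only sketch that splits the input into chunks and processes each on a scoped thread. It is not how Rayon is implemented, since Rayon uses a work-stealing pool. The function name and the num_threads parameter are assumptions for illustration.

```rust
use std::thread;

/// Applies `f` to every element, one chunk of the input per scoped thread.
fn parallel_map_chunks<F>(input: &[f32], num_threads: usize, f: F) -> Vec<f32>
where
    F: Fn(f32) -> f32 + Sync,
{
    let mut output = vec![0.0f32; input.len()];
    let threads = num_threads.max(1);
    // Ceiling division; max(1) keeps the chunk size non-zero for empty input.
    let chunk = ((input.len() + threads - 1) / threads).max(1);
    let f = &f; // share one reference to the closure across threads
    thread::scope(|s| {
        // Matching input and output chunks cover disjoint ranges,
        // so each thread gets exclusive access to its output slice.
        for (src, dst) in input.chunks(chunk).zip(output.chunks_mut(chunk)) {
            s.spawn(move || {
                for (x, y) in src.iter().zip(dst.iter_mut()) {
                    *y = f(*x);
                }
            });
        }
    });
    output
}
```

Scoped threads are joined before thread::scope returns, which is why borrowing input and output without Arc is allowed. Spawning threads per call is costly, so for small tensors a sequential map is likely faster.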

7. Testing and Validation

7.1 Unit Tests

Unit tests confirm the correctness of the core operations:

#[test]
fn test_matmul() {
    let a = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2]);
    let b = Tensor::from_vec(vec![5.0, 6.0, 7.0, 8.0], vec![2, 2]);
    let c = matmul(&a, &b);
    // Expected: [[1*5 + 2*7, 1*6 + 2*8], [3*5 + 4*7, 3*6 + 4*8]]
    assert_eq!(c.data, vec![19.0, 22.0, 43.0, 50.0]);
}
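Exact equality works in this test because every value is a small integer that f32 represents exactly. For results that involve rounding, a tolerance-based comparison is the usual alternative. A minimal sketch, where all_close and its tolerance parameters are hypothetical and loosely modeled on NumPy's allclose:

```rust
/// True when the slices have equal length and every pair satisfies
/// |x - y| <= abs_tol + rel_tol * |y|.
fn all_close(a: &[f32], b: &[f32], rel_tol: f32, abs_tol: f32) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b)
            .all(|(&x, &y)| (x - y).abs() <= abs_tol + rel_tol * y.abs())
}
```

A NaN on either side makes the comparison false, so a test using this helper fails instead of silently passing.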

7.2 Benchmarks

Criterion is used for performance analysis:

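use criterion::{criterion_group, criterion_main, Criterion};

fn bench_matmul(c: &mut Criterion) {
    let a = Tensor::rand_uniform(vec![256, 256], 0.0, 1.0);
    let b = Tensor::rand_uniform(vec![256, 256], 0.0, 1.0);

    c.bench_function("matmul 256x256", |bench| {
        bench.iter(|| matmul(&a, &b))
    });
}

// Register the benchmark and generate the harness entry point.
criterion_group!(benches, bench_matmul);
criterion_main!(benches);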

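8. Extensions and Outlook

The current implementation supports basic tensor operations. Possible extensions include:

Automatic differentiation
GPU-accelerated backends
Sparse tensor storage
Distributed computation

For memory management, consider introducing reference counting or copy-on-write to make handling large tensors cheaper. For specialized operations such as convolution, dedicated storage layouts can improve performance.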
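As an illustration of the copy-on-write idea, the standard library's Arc::make_mut clones the underlying buffer only when another owner still holds it. The sketch here assumes an f32 buffer; CowBuffer and its methods are hypothetical names and not part of the library built in this article.

```rust
use std::sync::Arc;

/// Hypothetical buffer with reference-counted, copy-on-write storage.
#[derive(Clone)]
pub struct CowBuffer {
    data: Arc<Vec<f32>>,
}

impl CowBuffer {
    pub fn new(data: Vec<f32>) -> Self {
        CowBuffer { data: Arc::new(data) }
    }

    pub fn as_slice(&self) -> &[f32] {
        &self.data
    }

    /// Writes one element. The buffer is copied first only if it is shared.
    pub fn set(&mut self, index: usize, value: f32) {
        Arc::make_mut(&mut self.data)[index] = value;
    }

    /// True when both handles point at the same allocation.
    pub fn shares_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.data, &other.data)
    }
}
```

Cloning a CowBuffer costs one reference-count increment regardless of size, and the first write through a shared handle pays for the copy.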
