Sequence Packing in Large Language Model Pretraining: An Analysis

feizai yun

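1. Sequence Packing in Large Language Model Pretraining

Handling variable-length text sequences is a common challenge in large language model (LLM) pretraining. The traditional approach pads every sequence to the same length, which spends compute on meaningless padding tokens. Packed sequences address this by reorganizing the data and using a purpose-built attention mask, which can noticeably improve training efficiency.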

1.1 Limitations of Traditional Padding

Suppose we need to batch the following three sentences:

"The cat sat on the mat"
"The dog ate my homework"
"My aunt is a teacher"

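With traditional padding, every sequence is padded to the length of the longest one. Counting roughly one token per word, the first sentence has 6 tokens and the other two have 5 each, so the second and third sentences each need 1 padding token. With a batch size of 3 and a sequence length of 6, the batch holds 18 token slots: 16 carry real tokens and 2 carry padding, which wastes about 11% of the compute. The waste grows quickly when sequence lengths vary more widely.

Worse, the Transformer's self-attention also computes over these padding tokens, consuming GPU memory and compute cycles. On large-scale pretraining data, this waste adds up to a substantial resource cost.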
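The padding overhead of a batch can be computed directly from its sequence lengths. The following is a minimal sketch; the helper name `padding_overhead` is hypothetical, and the one-token-per-word count is an assumption that ignores subword splitting.

```python
def padding_overhead(lengths):
    """Return (real, padded, wasted_fraction) for a batch padded to its longest sequence."""
    max_len = max(lengths)
    total_slots = max_len * len(lengths)
    real = sum(lengths)
    padded = total_slots - real
    return real, padded, padded / total_slots

# Assumption: one token per word, which ignores subword splitting
sentences = ["The cat sat on the mat",
             "The dog ate my homework",
             "My aunt is a teacher"]
lengths = [len(s.split()) for s in sentences]
real, padded, wasted = padding_overhead(lengths)
print(lengths, real, padded, round(wasted, 3))  # [6, 5, 5] 16 2 0.111
```

Running the same function over the length histogram of a real corpus gives a quick estimate of how much compute padding would cost before any packing is implemented.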

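1.2 The Core Idea of Sequence Packing

Sequence packing addresses this problem as follows:

- Concatenate several shorter sequences into one long sequence
- Mark the sequence boundaries with a special token (such as EOS)
- Use a custom attention mask to prevent information leaking across sequences

This removes padding almost entirely, so nearly every compute cycle is spent on real data. In typical large-scale pretraining settings, sequence packing can raise effective throughput by 15-30%, depending on the distribution of the original sequence lengths.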
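The first two steps can be sketched without any framework. In this illustration the function name `pack_sequences`, the token ids, and the EOS id of 0 are all hypothetical; the segment ids it returns are one convenient way to remember which packed position belongs to which original sequence.

```python
def pack_sequences(sequences, eos_token_id):
    """Concatenate token lists into one stream, closing each sequence with EOS.

    Returns the packed token list and a parallel list of segment ids
    (0 for the first sequence, 1 for the second, and so on).
    """
    packed, segment_ids = [], []
    for seg, seq in enumerate(sequences):
        tokens = list(seq) + [eos_token_id]
        packed.extend(tokens)
        segment_ids.extend([seg] * len(tokens))
    return packed, segment_ids

EOS = 0  # hypothetical EOS id, for illustration only
packed, segment_ids = pack_sequences([[11, 12, 13], [21, 22], [31]], EOS)
print(packed)       # [11, 12, 13, 0, 21, 22, 0, 31, 0]
print(segment_ids)  # [0, 0, 0, 0, 1, 1, 1, 2, 2]
```

The third step, the attention mask, then only has to allow attention between positions that share a segment id.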

2. A Complete Implementation of Sequence Packing

2.1 Basic Setup

First, set up the model and tokenizer. We use GPT-2 as the example:

import torch
torch.set_printoptions(linewidth=200)
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config)

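2.2 Packing a Single Sample

For a single packed sequence, the attention mask needs special care. A standard causal mask would let tokens of a later sequence attend to tokens of earlier sequences, which violates the assumption that the packed sequences are modeled independently.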

sentences = ["The cat sat on the mat",
             "The dog ate my homework",
             "My aunt is a teacher"]

# Tokenize, then append an EOS token to each sentence and flatten into one sequence
tokenized_sentences = tokenizer(sentences, return_attention_mask=False,
                              add_special_tokens=False)["input_ids"]
tokenized_sentences = [t for s in tokenized_sentences
                      for t in s + [tokenizer.eos_token_id]]
tokenized_sentences = torch.tensor(tokenized_sentences)

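A correct attention mask must satisfy two conditions:

- Causality: each token may attend only to itself and to earlier tokens
- No cross-sequence attention: each token may attend only to tokens within its own sequence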

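def get_attention_mask_for_packed_sequence(x, eos_token_id):
    # x: 1D tensor of token ids whose last token is EOS
    T = x.size(0)
    # squeeze(-1) keeps the result 1D even when x contains a single EOS
    eos_indices = (x == eos_token_id).nonzero().squeeze(-1)
    # Length of each packed sequence, including its EOS token
    reps = torch.cat([eos_indices[[0]] + 1, eos_indices[1:] - eos_indices[:-1]])

    # For every key position (column): index of the EOS that closes its sequence
    repeated_idx = torch.repeat_interleave(eos_indices, reps).view(1, -1).expand(T, -1)
    # For every query position (row): its own index
    mask_indices = torch.arange(T, device=x.device).view(-1, 1).expand(-1, T)

    # Start from a causal (lower-triangular) mask ...
    mask = torch.ones(T, T, dtype=torch.bool, device=x.device).tril()
    # ... and block queries that lie beyond the end of the key's sequence
    mask.masked_fill_(mask_indices > repeated_idx, False)
    return mask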
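A slow, loop-based version of the same rule is handy as a reference when testing a vectorized mask. The sketch below uses plain Python lists; `reference_packed_mask` and the token ids are hypothetical names and values chosen for illustration, with 0 standing in for EOS.

```python
def reference_packed_mask(tokens, eos_token_id):
    """Loop-based mask: entry [i][j] is True if query i may attend to key j."""
    # Assign each position the index of the sequence it belongs to
    segment, seg_ids = 0, []
    for tok in tokens:
        seg_ids.append(segment)
        if tok == eos_token_id:
            segment += 1
    T = len(tokens)
    # Allowed only if the key is not in the future and shares the query's sequence
    return [[j <= i and seg_ids[i] == seg_ids[j] for j in range(T)]
            for i in range(T)]

EOS = 0  # hypothetical EOS id
mask = reference_packed_mask([5, 6, EOS, 7, EOS], EOS)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
# 1....
# 11...
# 111..
# ...1.
# ...11
```

The printed pattern is block-diagonal and lower-triangular within each block, which is the shape both conditions together require.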

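2.3 Adjusting Position IDs

With packed sequences, position IDs must restart from the beginning of each sequence instead of continuing from the previous one: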

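# Recompute the boundary tensors used inside get_attention_mask_for_packed_sequence
T = tokenized_sentences.size(0)
eos_indices = (tokenized_sentences == tokenizer.eos_token_id).nonzero().squeeze(-1)
reps = torch.cat([eos_indices[[0]] + 1, eos_indices[1:] - eos_indices[:-1]])
pos_ids = torch.arange(T) - torch.repeat_interleave(
    torch.cat([torch.tensor([0]), eos_indices + 1], dim=0)[:-1], reps)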

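Restarting the positions marks the sequence boundaries explicitly and helps the model tell the contexts of different sequences apart.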
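The reset rule is easy to state as a loop: count upward, and go back to 0 right after an EOS. A minimal sketch with hypothetical token ids (0 stands in for EOS) and a hypothetical helper name:

```python
def reference_position_ids(tokens, eos_token_id):
    """Position ids that restart at 0 after every EOS token."""
    pos_ids, pos = [], 0
    for tok in tokens:
        pos_ids.append(pos)
        # The EOS token still belongs to its sequence; the reset applies to the next token
        pos = 0 if tok == eos_token_id else pos + 1
    return pos_ids

print(reference_position_ids([5, 6, 0, 7, 0, 8, 9, 0], eos_token_id=0))
# [0, 1, 2, 0, 1, 0, 1, 2]
```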

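3. Batched Implementation

Real training works on batches, which adds complexity. The key is to construct the attention masks for all samples in the batch efficiently.

3.1 Creating a Packed Batch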

sentences2 = ["Rome wasn't built in a day",
              "My hovercraft is full of eels"]
              
tokenized_sentences2 = tokenizer(sentences2, return_attention_mask=False,
                               add_special_tokens=False)["input_ids"]
tokenized_sentences2 = torch.tensor(
    [t for s in tokenized_sentences2 for t in s + [tokenizer.eos_token_id]])

batch = torch.nn.utils.rnn.pad_sequence(
    [tokenized_sentences, tokenized_sentences2],
    batch_first=True, 
    padding_value=tokenizer.eos_token_id
)
B, T = batch.shape

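3.2 Constructing the Batched Attention Mask

The central challenge in the batched case is to identify the boundaries of each sequence correctly and build the matching mask: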

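def get_batched_attention_mask(x, token_id, eos=True):
    # x: (B, T) batch of packed rows; token_id marks the sequence boundaries.
    # eos=True: token_id closes a sequence; eos=False: token_id opens one.
    B, T = x.shape
    # Flat start index of every sequence that follows a boundary token
    eos_idx = (x.view(-1) == token_id).nonzero(as_tuple=True)[0] + eos
    # Add the row boundaries 0, T, 2T, ... so that no sequence spans two rows
    row_starts = torch.arange(0, B * T + 1, T, device=x.device)
    eos_idx_expanded = torch.cat([eos_idx, row_starts]).unique().sort()[0]

    # Convert flat indices to within-row offsets; offset 0 means "end of row", i.e. T
    normalized_idx = eos_idx_expanded - (eos_idx_expanded // T) * T
    normalized_idx = torch.where(normalized_idx == 0, T, normalized_idx)

    # Sequence lengths; a non-positive difference marks the first sequence of a row
    reps = normalized_idx[1:] - normalized_idx[:-1]
    reps = torch.where(reps < 1, normalized_idx[1:], reps)

    # For every key position: exclusive end offset of its sequence
    repeated_idx = torch.repeat_interleave(normalized_idx[1:], reps).view(B, 1, T).expand(-1, T, -1)
    # For every query position: its offset within the row
    mask_indices = torch.arange(T, device=x.device).view(1, -1, 1).expand(B, -1, T)

    # Causal mask, with queries at or beyond the end of the key's sequence blocked
    mask = torch.ones(T, T, dtype=torch.bool, device=x.device).tril().expand(B, -1, -1)
    mask = mask.masked_fill(mask_indices >= repeated_idx, False)
    return mask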
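The index arithmetic in the batched function is dense, so it can help to see the per-key "exclusive end offset" spelled out with loops. The sketch below covers the case where the boundary token closes a sequence; the helper names and token ids are hypothetical, with 0 standing in for EOS.

```python
def segment_ends(row, token_id):
    """For each position in one row, the exclusive end offset of its sequence."""
    T = len(row)
    ends, start = [None] * T, 0
    for i, tok in enumerate(row):
        if tok == token_id:  # the boundary token closes the current sequence
            for k in range(start, i + 1):
                ends[k] = i + 1
            start = i + 1
    for k in range(start, T):  # trailing tokens without a closing boundary
        ends[k] = T
    return ends

def reference_batched_mask(batch, token_id):
    """Per-row mask: query i may attend to key j if j <= i < end of j's sequence."""
    masks = []
    for row in batch:
        ends = segment_ends(row, token_id)
        T = len(row)
        masks.append([[j <= i < ends[j] for j in range(T)] for i in range(T)])
    return masks

EOS = 0  # hypothetical EOS id
batch = [[5, 6, EOS, 7, EOS],
         [8, EOS, EOS, EOS, EOS]]  # second row is padded with EOS
print([segment_ends(row, EOS) for row in batch])
# [[3, 3, 3, 5, 5], [2, 2, 3, 4, 5]]
masks = reference_batched_mask(batch, EOS)
```

Because the batch is padded with EOS, every padding position becomes its own one-token sequence and attends only to itself, so padding cannot leak into real sequences.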

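3.3 Generating Batched Position IDs

Batched position IDs need similar treatment. Here eos_idx_expanded and reps are the intermediate tensors computed inside get_batched_attention_mask: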

# Subtract the flat start index of each sequence from every flat position
pos_ids = (torch.arange(B*T) - torch.repeat_interleave(
    eos_idx_expanded[:-1], reps)).view(B,T)
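Since the snippet above relies on intermediates from the mask function, a self-contained version may be easier to follow. This sketch mirrors the same flat-index arithmetic with plain lists: collect the flat start index of every sequence, then subtract the most recent start from each flat position. The function name and token ids are hypothetical, with 0 standing in for EOS.

```python
def batched_position_ids(batch, eos_token_id):
    """Position ids for a (B, T) batch, restarting after each EOS and at each row start."""
    B, T = len(batch), len(batch[0])
    flat = [tok for row in batch for tok in row]
    # Flat start index of every sequence: the position after each EOS, plus every row start
    starts = {i + 1 for i, tok in enumerate(flat) if tok == eos_token_id}
    starts |= set(range(0, B * T, T))
    starts = sorted(s for s in starts if s < B * T)
    # Walk the flat positions while tracking the most recent sequence start
    pos_ids, k = [], 0
    for i in range(B * T):
        if k + 1 < len(starts) and starts[k + 1] <= i:
            k += 1
        pos_ids.append(i - starts[k])
    return [pos_ids[r * T:(r + 1) * T] for r in range(B)]

batch = [[5, 6, 0, 7, 0],
         [8, 0, 0, 0, 0]]  # second row is padded with EOS
print(batched_position_ids(batch, eos_token_id=0))
# [[0, 1, 2, 0, 1], [0, 1, 0, 0, 0]]
```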

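4. Practical Considerations

4.1 Performance Tips

Memory layout: long packed sequences can lead to irregular memory access patterns. Suggestions:
- Sort the input sequences by length in descending order before packing
- Consider low-level operations such as torch.as_strided to improve memory access

Dynamic batching: use a dynamic batching strategy that balances sequence length against batch size: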

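def dynamic_batching(sequences, max_tokens=4096):
    # Greedily group sequences, longest first, under a per-batch token budget
    batches = []
    current_batch = []
    current_length = 0

    for seq in sorted(sequences, key=len, reverse=True):
        # Close the current batch when adding seq would exceed the budget;
        # the emptiness check avoids emitting an empty batch for an oversized seq
        if current_batch and current_length + len(seq) > max_tokens:
            batches.append(current_batch)
            current_batch = []
            current_length = 0
        current_batch.append(seq)
        current_length += len(seq)

    # Flush the last, partially filled batch
    if current_batch:
        batches.append(current_batch)
    return batches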
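The grouping above only ever looks at the batch currently being filled. A different, well-known heuristic is first-fit decreasing bin packing, which revisits earlier bins and often leaves fewer gaps. The sketch below is an alternative for comparison, not part of the approach described here; the function name is hypothetical, and the lengths are assumed to already include one EOS token per sequence.

```python
def first_fit_decreasing(lengths, max_tokens):
    """Assign sequence indices to bins of capacity max_tokens, longest sequences first."""
    bins, loads = [], []
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        n = lengths[idx]
        if n > max_tokens:
            raise ValueError(f"sequence {idx} has {n} tokens, more than max_tokens={max_tokens}")
        # Place the sequence into the first bin that still has room
        for b, load in enumerate(loads):
            if load + n <= max_tokens:
                bins[b].append(idx)
                loads[b] += n
                break
        else:
            # No existing bin fits, so open a new one
            bins.append([idx])
            loads.append(n)
    return bins, loads

bins, loads = first_fit_decreasing([7, 6, 6, 3, 2], max_tokens=12)
print(bins, loads)  # [[0, 3, 4], [1, 2]] [12, 12]
```

For these five lengths and a budget of 12, the sequential grouping produces three batches (7, then 6+6, then 3+2), while first-fit decreasing fills two bins completely.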

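4.2 Troubleshooting Common Problems

Attention leakage: if the model behaves abnormally, first check that the attention mask is correct:
- Make sure no token can attend across sequences
- Verify that EOS tokens are handled correctly
Position ID errors: wrong position IDs make the model confuse sequence boundaries:
- Check that the position IDs of every sequence start at 0
- Verify that the position IDs line up with the attention mask
Slow training: if training is slower than expected:
- Check whether the sequence length distribution is balanced
- Monitor GPU utilization and make sure there is no memory bottleneck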
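The mask and position ID checks above can be automated for a single packed row. The following is a sketch of such a checker, assuming nested lists as input (tensors could be converted with `.tolist()` first); the function name `check_packed_inputs` and the token ids are hypothetical, with 0 standing in for EOS.

```python
def check_packed_inputs(tokens, mask, pos_ids, eos_token_id):
    """Return a list of problems found in the mask and position ids of one packed row."""
    problems = []
    # Sequence index of every position; the EOS token belongs to the sequence it closes
    seg_ids, seg = [], 0
    for tok in tokens:
        seg_ids.append(seg)
        if tok == eos_token_id:
            seg += 1
    T = len(tokens)
    for i in range(T):
        for j in range(T):
            if mask[i][j] and j > i:
                problems.append(f"query {i} attends to future key {j}")
            if mask[i][j] and seg_ids[i] != seg_ids[j]:
                problems.append(f"query {i} attends across sequences to key {j}")
        # A sequence starts at position 0 and right after every EOS
        starts_sequence = i == 0 or tokens[i - 1] == eos_token_id
        expected = 0 if starts_sequence else pos_ids[i - 1] + 1
        if pos_ids[i] != expected:
            problems.append(f"position id at {i} is {pos_ids[i]}, expected {expected}")
    return problems

tokens = [5, 6, 0, 7, 0]  # hypothetical ids; 0 plays the role of EOS
causal = [[j <= i for j in range(5)] for i in range(5)]  # plain causal mask, no packing
bad = check_packed_inputs(tokens, causal, [0, 1, 2, 3, 4], eos_token_id=0)
for line in bad:
    print(line)
```

With a plain causal mask and continuous positions, the checker reports the cross-sequence attention of the second sequence and the position ID that fails to restart at 0.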

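4.3 Advanced Use Cases

Mixed-precision training: when combining sequence packing with AMP (automatic mixed precision):
- Make sure the attention mask uses the right dtype (bool is preferred; uint8 masks are deprecated in recent PyTorch versions)
- Keep position IDs as long
Distributed training: in multi-GPU setups:
- Make sure every device receives packed sequences of similar length
- Consider gradient accumulation to compensate for possible changes in batch size
Variable-length fine-tuning: when applying sequence packing to fine-tuning:
- For classification tasks, keep special tokens such as [CLS]
- For generation tasks, make sure decoding uses the same mask logic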
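For the distributed training point, one simple way to give each device a similar token load is a greedy longest-first assignment: always hand the next-longest packed sequence to the device with the smallest load so far. This is a sketch of that heuristic under assumed inputs, not a prescribed method; the function name, the lengths, and the device count are hypothetical.

```python
import heapq

def balance_across_devices(lengths, num_devices):
    """Greedy longest-first assignment of packed-sequence lengths to devices."""
    heap = [(0, d) for d in range(num_devices)]  # (current token load, device id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_devices)]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        # Pop the least-loaded device, give it the sequence, push it back
        load, d = heapq.heappop(heap)
        assignment[d].append(idx)
        heapq.heappush(heap, (load + lengths[idx], d))
    loads = [sum(lengths[i] for i in part) for part in assignment]
    return assignment, loads

assignment, loads = balance_across_devices([900, 700, 600, 500, 300], num_devices=2)
print(assignment, loads)  # [[0, 3], [1, 2, 4]] [1400, 1600]
```

The remaining imbalance (1400 versus 1600 tokens here) is the kind of gap that the gradient accumulation suggestion is meant to absorb.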