In deep learning, the Transformer architecture has become the de-facto standard for natural language processing tasks. The 2017 paper "Attention Is All You Need" from the Google Brain team fundamentally changed how sequences are modeled: it abandoned the traditional RNN and CNN structures and built the model entirely on self-attention.
The Transformer displaced RNNs as the mainstream architecture largely because of a few key properties: all positions in a sequence are processed in parallel rather than step by step, self-attention gives any two tokens a direct connection regardless of their distance, and the resulting computation maps well onto modern accelerators.
A complete Transformer consists of the following core components, each of which is implemented below: token embeddings plus positional encoding, multi-head attention, position-wise feed-forward networks, residual connections with layer normalization, and a final output projection.
Positional encoding is one of the more elegant parts of the design. It has to satisfy two conditions: each position must receive a unique, deterministic encoding, and the encoding should let the model reason about relative offsets and generalize to positions beyond those seen in training.
```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(max_len).unsqueeze(1)                    # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             (-math.log(10000.0) / d_model))             # (d_model/2,)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                     # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                     # odd dimensions
        self.register_buffer('pe', pe.unsqueeze(0))                      # (1, max_len, d_model), not trainable

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encodings for the first seq_len positions
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
```
Using a combination of sine and cosine functions here has several advantages: the values are bounded, every position receives a distinct pattern, and for any fixed offset k the encoding at position pos + k can be written as a linear function of the encoding at pos, which makes relative positions easy for the model to exploit. The formulation also extrapolates, at least in principle, beyond the training length.
In real projects, when the inference-time sequence length exceeds the maximum length used in training, you can consider linear interpolation of a learned position table or other position-encoding extension methods; a small sketch of the interpolation idea follows.
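The snippet below is a minimal sketch of that interpolation, assuming a learned position-embedding table of shape (1, old_len, d_model); the function name and shapes are illustrative rather than taken from the code above.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embedding(pos_embed: torch.Tensor, new_len: int) -> torch.Tensor:
    # F.interpolate expects (batch, channels, length), so treat d_model as channels
    x = pos_embed.transpose(1, 2)                                   # (1, d_model, old_len)
    x = F.interpolate(x, size=new_len, mode='linear', align_corners=False)
    return x.transpose(1, 2)                                        # (1, new_len, d_model)
```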
Multi-head attention is the core component of the Transformer. It lets the model attend, in parallel, to information from different representation subspaces at different positions.
```python
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, attn_mask=None):
        batch_size = query.size(0)
        # Linear projections, then split into heads: (batch, num_heads, seq_len, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if attn_mask is not None:
            # Mask convention: nonzero/True = keep, 0/False = block
            scores = scores.masked_fill(attn_mask == 0, -1e9)
        attn = self.dropout(F.softmax(scores, dim=-1))
        context = torch.matmul(attn, V)
        # Merge the heads back: (batch, seq_len, d_model)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(context), attn
```
A few key points in this implementation: `self.num_heads` and `self.d_model` must be stored in the constructor because `forward` uses them to split and merge the heads; the head split is done with `view` followed by `transpose` rather than an explicit loop; the scores are divided by √d_k before the softmax; and masked positions are filled with a large negative value so they receive essentially zero attention weight after the softmax.
The position-wise feed-forward network gives the model its non-linear transformation capacity:
```python
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # expand: d_model -> d_ff
        self.linear2 = nn.Linear(d_ff, d_model)   # project back: d_ff -> d_model
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Applied independently at every position
        return self.linear2(self.dropout(F.relu(self.linear1(x))))
```
Here d_ff is usually set to 4 × d_model, which gives the model enough expressive capacity. In practice you can also consider the GELU activation, or the more recent SwiGLU variant covered later; a minimal GELU version is sketched below.
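For reference, here is the same block with GELU swapped in for ReLU; the class name is mine, not from the original code.

```python
class GELUFeedForward(nn.Module):
    """Identical to PositionWiseFeedForward except for the activation."""
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.gelu(self.linear1(x))))
```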
An encoder layer consists of a self-attention mechanism and a feed-forward network, with each sublayer wrapped in a residual connection followed by layer normalization:
```python
class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention sublayer (Post-LN: residual add, then normalize)
        attn_output, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Feed-forward sublayer
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x
```
Residual connections serve two important purposes: they give gradients an identity path straight through the network, which keeps deep stacks trainable, and they let each sublayer learn only a correction to its input rather than a full transformation.
A decoder layer is more complex than an encoder layer. It has three sublayers: masked self-attention, encoder-decoder (cross) attention, and a feed-forward network:
```python
class DecoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        # Masked self-attention over the target sequence
        self_attn_out, _ = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(self_attn_out))
        # Encoder-decoder attention: queries from the decoder, keys/values from the encoder
        cross_attn_out, _ = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(cross_attn_out))
        # Feed-forward network
        ff_out = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_out))
        return x
```
Key characteristics of the decoder: the self-attention is masked so each position can only see itself and earlier target positions, the cross-attention lets every decoder position query the full encoder output (the "memory"), and each sublayer again uses a residual connection with layer normalization.
Putting the components together into a complete Transformer model:
```python
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, num_heads=8,
                 num_layers=6, d_ff=2048, max_seq_len=5000, dropout=0.1):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab_size, d_model)
        self.pos_enc = PositionalEncoding(d_model, max_seq_len, dropout)
        self.encoder = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.decoder = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.final_linear = nn.Linear(d_model, tgt_vocab_size)   # projection to target-vocabulary logits
        self.d_model = d_model
```
The defaults above correspond to the "base" configuration from the original paper (d_model = 512, 8 heads, 6 layers, d_ff = 2048, dropout = 0.1); the "big" variant roughly doubles the width (d_model = 1024, 16 heads, d_ff = 4096). A quick instantiation sketch is shown below.
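As a usage sketch (the vocabulary sizes here are illustrative):

```python
# Base configuration uses the constructor defaults; the "big" variant widens the model.
model = Transformer(src_vocab_size=32000, tgt_vocab_size=32000)
big_model = Transformer(src_vocab_size=32000, tgt_vocab_size=32000,
                        d_model=1024, num_heads=16, d_ff=4096)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```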
The complete forward pass (this method belongs to the Transformer class above):
```python
def forward(self, src, tgt, src_mask=None, tgt_mask=None):
    # Source-side embedding + positional encoding
    src = self.src_embed(src) * math.sqrt(self.d_model)
    src = self.pos_enc(src)
    # Target-side embedding + positional encoding
    tgt = self.tgt_embed(tgt) * math.sqrt(self.d_model)
    tgt = self.pos_enc(tgt)
    # Encoder stack
    memory = src
    for layer in self.encoder:
        memory = layer(memory, src_mask)
    # Decoder stack, attending to the encoder memory
    output = tgt
    for layer in self.decoder:
        output = layer(output, memory, src_mask, tgt_mask)
    # Output projection to vocabulary logits
    return self.final_linear(output)
```
The embeddings are multiplied by √d_model so that the token signal stays on the same scale as the positional encodings added right afterwards; without the scaling, the typically small embedding values would be drowned out by the positional signal.
The decoder needs a triangular mask to preserve the autoregressive property:
```python
def generate_square_subsequent_mask(sz: int) -> torch.Tensor:
    # Lower-triangular boolean mask: True = may attend, False = masked out.
    # This matches the masked_fill(attn_mask == 0, -1e9) convention in MultiHeadAttention
    # and combines cleanly (via &) with the boolean padding mask below.
    return torch.tril(torch.ones(sz, sz, dtype=torch.bool))
```
This mask ensures that each position can attend only to itself and earlier positions; information from the future is completely blocked.
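For a length-4 sequence the mask looks like this:

```python
print(generate_square_subsequent_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```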
For better training results, the following configuration is recommended (a combined single-step sketch follows the snippets):
```python
# Adam with the betas/eps from the original paper (which additionally used LR warmup)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
# Simple exponential decay; the original inverse-sqrt schedule with warmup also works well
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)
# Label smoothing regularizes the output distribution; consider ignore_index=pad_idx as well
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
# Gradient clipping guards against occasional exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
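Tying these together, here is a minimal single training step, assuming `src`, `tgt`, and `pad_idx` are available and using the mask helpers shown in the next subsection:

```python
model.train()
optimizer.zero_grad()

src_mask = create_padding_mask(src, pad_idx)
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]          # teacher forcing: shift the target by one
tgt_mask = generate_square_subsequent_mask(tgt_in.size(1)) & create_padding_mask(tgt_in, pad_idx)

logits = model(src, tgt_in, src_mask, tgt_mask)
loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
```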
Variable-length sequences need special care:
```python
def create_padding_mask(seq, pad_idx):
    # (batch, seq_len) -> (batch, 1, 1, seq_len); True where the token is real, False at padding
    return (seq != pad_idx).unsqueeze(1).unsqueeze(2)

# Usage: combine the causal mask and the padding mask for the decoder input
src_mask = create_padding_mask(src, pad_idx)
tgt_mask = generate_square_subsequent_mask(tgt.size(1)) & create_padding_mask(tgt, pad_idx)
```
The padding mask keeps the model from attending to padding tokens, which matters a great deal for translation quality.
The original Transformer uses Post-LN (normalize after the residual add); modern implementations tend to prefer Pre-LN (normalize before each sublayer):
```python
class EncoderLayer(nn.Module):
    # __init__ is the same as in the Post-LN version above
    def forward(self, x, mask=None):
        # Pre-LN: normalize first, apply the sublayer, then add the residual
        residual = x
        x = self.norm1(x)
        x = self.self_attn(x, x, x, mask)[0]
        x = residual + self.dropout(x)

        residual = x
        x = self.norm2(x)
        x = self.feed_forward(x)
        x = residual + self.dropout(x)
        return x
```
The advantages of Pre-LN: gradients flow through the un-normalized residual path, so deep stacks train much more stably, and the model is far less dependent on carefully tuned learning-rate warmup, whereas Post-LN models can diverge early in training.
The SwiGLU activation typically gives a clear improvement over ReLU:
```python
class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)   # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # output projection

    def forward(self, x):
        # SwiGLU: silu(x W1) acts as a gate on (x W3)
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```
What makes SwiGLU attractive: the gating lets the network modulate how much of each feature passes through, the SiLU nonlinearity is smooth, and it is the feed-forward variant used in models such as LLaMA and PaLM. Because it has three weight matrices instead of two, d_ff is usually shrunk (to roughly two-thirds of 4 × d_model) to keep the parameter count comparable.
Use PyTorch's native high-efficiency attention:
```python
# F.scaled_dot_product_attention dispatches to FlashAttention / memory-efficient kernels when it can.
# Pass either an explicit attn_mask or is_causal=True, not both.
with torch.backends.cuda.sdp_kernel(enable_flash=True):
    output = F.scaled_dot_product_attention(Q, K, V, attn_mask=attn_mask)
    # for decoder-style masking: F.scaled_dot_product_attention(Q, K, V, is_causal=True)
```
The advantages of Flash Attention: it never materializes the full seq_len × seq_len attention matrix, so memory grows linearly rather than quadratically with sequence length; it is substantially faster thanks to tiling and kernel fusion; and it computes exact attention rather than an approximation.
Rotary positional embeddings (RoPE) have become the default choice in recent large language models:
```python
def apply_rotary(x, freqs_cis):
    # x: (..., head_dim); freqs_cis: complex rotations e^{i*m*theta}, broadcastable to (..., head_dim/2)
    x_ = x.float().reshape(*x.shape[:-1], -1, 2)        # pair up adjacent dimensions
    x_complex = torch.view_as_complex(x_)                # (..., head_dim/2) complex
    rotated = x_complex * freqs_cis                      # rotate each pair by its position-dependent angle
    return torch.view_as_real(rotated).flatten(-2).type_as(x)
```
The advantages of RoPE: the relative position between query and key enters the attention score directly through the rotation, there is no learned position table to store or extend, and it extrapolates to longer contexts better than absolute encodings (further helped by tricks such as position interpolation). A sketch of how `freqs_cis` is typically precomputed follows.
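The snippet below is a minimal, LLaMA-style sketch of precomputing the complex rotations consumed by `apply_rotary`; the function name and the base of 10000.0 are conventional choices rather than part of this article's code.

```python
def precompute_freqs_cis(head_dim: int, max_len: int, base: float = 10000.0) -> torch.Tensor:
    freqs = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))  # (head_dim/2,)
    positions = torch.arange(max_len).float()                                   # m = 0..max_len-1
    angles = torch.outer(positions, freqs)                                      # (max_len, head_dim/2)
    return torch.polar(torch.ones_like(angles), angles)                         # e^{i * m * theta}
```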
The Transformer is quite sensitive to parameter initialization. A reasonable default:
```python
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)
    elif isinstance(m, nn.Embedding):
        nn.init.normal_(m.weight, mean=0, std=0.02)

model.apply(init_weights)
```
Two details deserve extra attention: the output projection of each residual branch (W_o in attention, linear2 in the feed-forward block) is often initialized with a smaller standard deviation (GPT-2, for instance, scales it by 1/√(2 · num_layers)), and the input embedding is frequently weight-tied with the output projection. A hypothetical sketch of the scaled init is shown below.
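A hypothetical sketch of that GPT-2-style scaled initialization, applied to the model defined earlier (loop and attribute names follow this article's classes, not any library API):

```python
num_layers = len(model.encoder)
for layer in list(model.encoder) + list(model.decoder):
    residual_outputs = [layer.self_attn.W_o, layer.feed_forward.linear2]
    if hasattr(layer, 'cross_attn'):
        residual_outputs.append(layer.cross_attn.W_o)
    for proj in residual_outputs:
        # shrink the init of residual-branch outputs so deep stacks start close to identity
        nn.init.normal_(proj.weight, mean=0.0, std=0.02 / math.sqrt(2 * num_layers))
```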
Use AMP (automatic mixed precision) to speed up training and reduce memory usage:
```python
scaler = torch.cuda.amp.GradScaler()

optimizer.zero_grad()
with torch.amp.autocast(device_type='cuda', dtype=torch.float16):
    outputs = model(src, tgt)
    loss = criterion(outputs.view(-1, outputs.size(-1)), tgt.view(-1))
scaler.scale(loss).backward()   # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)
scaler.update()
```
A few caveats: on hardware that supports it, bfloat16 avoids most overflow/underflow issues and does not need a GradScaler; and if you clip gradients under AMP, call `scaler.unscale_(optimizer)` first so the clipping threshold applies to the true gradient magnitudes, as sketched below.
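A minimal sketch of gradient clipping combined with the scaler:

```python
scaler.scale(loss).backward()
scaler.unscale_(optimizer)                                         # gradients back to their true scale
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # clip on unscaled gradients
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```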
Optimization tricks for production deployment:
```python
# Dynamic INT8 quantization of the linear layers for CPU inference
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```
```python
# Graph capture and kernel fusion with torch.compile (PyTorch 2.x)
compiled_model = torch.compile(model, mode='max-autotune')
```
```python
# KV caching during autoregressive generation: reuse the keys/values of previous steps.
# The interface below follows the Hugging Face convention; the minimal model in this
# article would need explicit cache support added to its attention layers.
past_key_values = None
for _ in range(max_len):
    outputs = model(input_ids, past_key_values=past_key_values)
    past_key_values = outputs.past_key_values
```
When training misbehaves, check the following: whether the causal and padding masks have the right orientation (a flipped mask silently leaks future tokens or attends to padding); whether the decoder input and the loss target are shifted by one position; whether the model can overfit a tiny batch (if not, something structural is wrong); and whether gradient norms are exploding or collapsing, which usually points to the learning rate or missing clipping.
A decoder-only variant removes the encoder-related parts:
```python
class DecoderOnlyTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, num_layers=6, d_ff=2048, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_enc = PositionalEncoding(d_model, dropout=dropout)
        # Without an encoder there is no cross-attention, so each block is just masked
        # self-attention + feed-forward -- i.e. an EncoderLayer driven by a causal mask.
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)
```
This layout suits GPT-style language modeling and autoregressive text generation, where every token conditions only on the tokens before it; a forward-pass sketch is given below.
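A minimal forward-pass sketch for the decoder-only model above (shown standalone for brevity; it would live inside DecoderOnlyTransformer):

```python
def forward(self, input_ids):
    x = self.pos_enc(self.embed(input_ids) * math.sqrt(self.embed.embedding_dim))
    causal_mask = generate_square_subsequent_mask(input_ids.size(1)).to(input_ids.device)
    for layer in self.layers:
        x = layer(x, causal_mask)
    return self.lm_head(self.norm(x))   # next-token logits for every position
```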
Applying the Transformer to images:
```python
class ViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, num_classes=1000,
                 d_model=768, num_heads=12, num_layers=12, d_ff=3072):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Each patch is embedded by a strided convolution and then treated as a token
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, d_model))
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))
        # Reuse the EncoderLayer defined earlier as the encoder stack
        self.encoder = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)
        ])
        self.head = nn.Linear(d_model, num_classes)
```
The key modifications: the image is cut into fixed-size patches that play the role of tokens, a learnable [CLS] token is prepended and used for classification, and positional information comes from a learned embedding table rather than the sinusoidal formula; the encoder stack itself is unchanged. A forward sketch follows.
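A minimal forward sketch for the ViT above (illustrative, not from the original code):

```python
def forward(self, images):                        # images: (batch, 3, 224, 224)
    x = self.patch_embed(images)                  # (batch, d_model, 14, 14)
    x = x.flatten(2).transpose(1, 2)              # (batch, num_patches, d_model)
    cls = self.cls_token.expand(x.size(0), -1, -1)
    x = torch.cat([cls, x], dim=1) + self.pos_embed
    for layer in self.encoder:
        x = layer(x)
    return self.head(x[:, 0])                     # classify from the [CLS] token
```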
Handling joint text-and-image input:
```python
class MultiModalTransformer(nn.Module):
    def __init__(self, text_vocab_size, image_feat_size,
                 d_model=512, num_heads=8, num_layers=6, d_ff=2048):
        super().__init__()
        self.text_embed = nn.Embedding(text_vocab_size, d_model)
        self.image_embed = nn.Linear(image_feat_size, d_model)   # project image features into the text space
        self.encoder = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)
        ])

    def forward(self, text, image):
        text_emb = self.text_embed(text)                          # (batch, text_len, d_model)
        image_emb = self.image_embed(image)                       # (batch, image_len, d_model)
        x = torch.cat([text_emb, image_emb], dim=1)               # one joint sequence
        for layer in self.encoder:
            x = layer(x)
        return x
```
Typical application scenarios include image captioning, visual question answering, and text-image retrieval, where the two modalities need to be reasoned about within a single sequence.
Optimizations for long sequences:
```python
from torch.nn.functional import scaled_dot_product_attention

def memory_efficient_attention(Q, K, V, mask=None):
    # Force the memory-efficient backend so the full attention matrix is never materialized
    with torch.backends.cuda.sdp_kernel(enable_mem_efficient=True):
        return scaled_dot_product_attention(Q, K, V, attn_mask=mask)
```
The benefit: attention memory no longer grows with the square of the sequence length, so much longer contexts fit on a single GPU, usually with little or no loss of speed.
Saving memory when training deep models:
```python
from torch.utils.checkpoint import checkpoint

def custom_forward(*inputs):
    # The forward computation to re-run during backward (here: a single layer)
    return layer(*inputs)

# use_reentrant=False selects the recommended non-reentrant implementation
output = checkpoint(custom_forward, hidden_states, use_reentrant=False)
```
How it works: activations inside the checkpointed region are discarded during the forward pass and recomputed on the fly during backward, trading roughly one extra forward pass of compute for a large reduction in stored activations. A sketch of checkpointing every encoder layer follows.
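A minimal sketch, assuming the Transformer defined earlier, that wraps each encoder layer in a checkpoint (the helper name is mine):

```python
from torch.utils.checkpoint import checkpoint

def encode_with_checkpointing(model, src_emb, src_mask=None):
    x = src_emb
    for layer in model.encoder:
        # Each layer's activations are recomputed during backward instead of stored
        x = checkpoint(layer, x, src_mask, use_reentrant=False)
    return x
```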
Multi-GPU distributed training:
```python
from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel

model = Transformer(...)
# Sharding plan: map individual nn.Linear submodules (paths follow the ModuleList layout above)
parallel_plan = {
    "encoder.0.self_attn.W_q": ColwiseParallel(),
    "encoder.0.self_attn.W_o": RowwiseParallel(),
    "encoder.0.feed_forward.linear1": ColwiseParallel(),
    "encoder.0.feed_forward.linear2": RowwiseParallel(),
}
# device_mesh is assumed to come from torch.distributed.device_mesh.init_device_mesh
model = parallelize_module(model, device_mesh, parallel_plan)
```
The supported styles include ColwiseParallel and RowwiseParallel sharding of individual linear layers (classic tensor parallelism), optionally combined with SequenceParallel for the normalization and dropout parts; data-parallel approaches such as DDP or FSDP can be layered on top for further scaling.
Implementing the Transformer from scratch left me with a few lasting impressions:
The essence of attention: it is really a learnable memory-retrieval mechanism; through the query-key-value triple, the model decides flexibly which parts of the history to pull information from.
The importance of residual connections: in deep networks they are not just a training trick; they create multiple information highways and let the model choose how information flows.
The subtlety of positional encoding: in theory almost any position-aware scheme could work, but a well-designed one (such as RoPE) noticeably improves length extrapolation and long-range dependency modeling.
The necessity of the scaling factor: dividing the dot-product scores by √d_k looks trivial, yet it is essential for stable training, especially in deep networks, because without it the softmax saturates and gradients become tiny.
The benefit of modular design: decomposing the Transformer into reusable components (MultiHeadAttention, FeedForward, and so on) not only keeps the code clear but also makes it easy to swap out or improve individual pieces later.
For real projects, I suggest starting with a small model (say 4 layers and d_model = 256), verifying that it learns and converges, and only then scaling up. Along the way, pay close attention to the training dynamics: gradient norms, the magnitude of parameter updates, attention distributions, and similar metrics all provide valuable debugging signal.
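As a tiny example of that kind of monitoring, here is an illustrative helper (not from the article) that computes the global gradient norm to log each step:

```python
def global_grad_norm(model: nn.Module) -> float:
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5

# e.g. after loss.backward():  print(f"grad norm: {global_grad_norm(model):.3f}")
```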