In 2017, the Google Brain team introduced the Transformer architecture in the paper "Attention Is All You Need", fundamentally changing the trajectory of deep learning. As an engineer who has long worked in NLP and computer vision, I have watched the Transformer evolve from a single academic paper into the most fundamental and important architectural paradigm in AI today.
Before the Transformer, sequence modeling relied mainly on RNNs and their variants, LSTM and GRU. These models suffer from two fatal flaws:

**The curse of sequential computation**: an RNN must process a sequence one time step at a time, so it cannot fully exploit the parallelism of modern GPUs. On long documents (say, 1000+ tokens), training becomes extremely slow.

**The long-range dependency dilemma**: although the LSTM's gating mechanism mitigates vanishing gradients, models still struggle to capture relationships between distant tokens once sequences exceed roughly 100 steps.
In a machine translation project I worked on in 2016, training a medium-sized English-German model with bidirectional LSTMs took nearly two weeks on 8 P100 GPUs. After switching to the Transformer, with the same training data and hardware, convergence time dropped to 3 days and the BLEU score improved by 2.3 points.
The Transformer's breakthrough was to abandon recurrence entirely, replacing it with a combination of **self-attention** and **positional encoding**. This design brings three key advantages: every position in the sequence can be processed in parallel; the path between any two tokens has constant length, making long-range dependencies far easier to learn; and the attention weights themselves offer a degree of interpretability.
Below is a simplified self-attention implementation that shows the core computation:
```python
import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
        self.embed_size = embed_size
        # Linear projections for Q, K, V
        self.values = nn.Linear(embed_size, embed_size, bias=False)
        self.keys = nn.Linear(embed_size, embed_size, bias=False)
        self.queries = nn.Linear(embed_size, embed_size, bias=False)

    def forward(self, x):
        # x shape: (batch_size, seq_len, embed_size)
        batch_size, seq_len, _ = x.shape
        # Compute Q, K, V
        Q = self.queries(x)  # (batch_size, seq_len, embed_size)
        K = self.keys(x)     # (batch_size, seq_len, embed_size)
        V = self.values(x)   # (batch_size, seq_len, embed_size)
        # Attention scores, scaled by sqrt(d)
        attention_scores = torch.matmul(Q, K.transpose(1, 2)) / math.sqrt(self.embed_size)
        attention_weights = torch.softmax(attention_scores, dim=-1)
        # Weighted sum of the values
        out = torch.matmul(attention_weights, V)
        return out
```
A standard Transformer consists of an encoder and a decoder, each a stack of identical layers. Using the base configuration from the original paper:

**Encoder**: 6 identical layers, each containing a multi-head self-attention sublayer and a position-wise feed-forward network, with a residual connection and layer normalization around each sublayer.

**Decoder**: 6 identical layers; each adds a third, encoder-decoder attention sublayer on top of the encoder layer's components, and its self-attention is masked so a position cannot attend to later positions.
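Wiring one encoder layer together looks roughly like this. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` for brevity and follows the original post-norm layout; the hyperparameters are the paper's base configuration:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    # One encoder layer: self-attention + FFN, each wrapped in
    # residual connection + LayerNorm (post-norm, as in the original paper)
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

layer = EncoderLayer()
x = torch.randn(2, 16, 512)
print(layer(x).shape)  # torch.Size([2, 16, 512])
```

A decoder layer is the same pattern with a masked self-attention sublayer plus a cross-attention sublayer inserted between the two shown here.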
Multi-head attention is the Transformer's most distinctive design: it lets the model learn information in several different representation subspaces at once. Here is a complete implementation:
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Linear projection layers
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """
        Q, K, V: [batch_size, num_heads, seq_len, d_k]
        """
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attention_weights = torch.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        output = torch.matmul(attention_weights, V)
        return output, attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Project, then split into heads
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        # Attention per head
        attention_output, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate the heads back together
        attention_output = attention_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )
        # Final output projection
        output = self.W_o(attention_output)
        return output, attention_weights
```
Because the Transformer has no recurrence, the order of the sequence must be injected through positional encoding. The original paper uses a combination of sine and cosine functions:
```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # Geometric progression of wavelengths from 2π up to 10000·2π
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # [1, max_len, d_model]
        # A buffer moves with the module (.to/.cuda) but is not trained
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: [batch_size, seq_len, d_model]
        x = x + self.pe[:, :x.size(1)]
        return x
```
In practice, for sequences of up to 512 tokens, a learned positional embedding is usually both simpler to implement and at least as effective.
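A learned positional embedding is just a trainable lookup table added to the token embeddings; a minimal sketch (the 512/768 sizes are illustrative):

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    # BERT-style trainable position table, added to token embeddings
    def __init__(self, max_len=512, d_model=768):
        super().__init__()
        self.pos_embed = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: [batch_size, seq_len, d_model]
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_embed(positions)

pe = LearnedPositionalEmbedding()
x = torch.randn(2, 128, 768)
print(pe(x).shape)  # torch.Size([2, 128, 768])
```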
In NLP, the Transformer gave rise to the two most important model families:

**BERT** (Bidirectional Encoder Representations from Transformers): an encoder-only stack pre-trained with masked language modeling, suited to understanding tasks.

**GPT** (Generative Pre-trained Transformer): a decoder-only stack pre-trained autoregressively, suited to generation.
Here is an example of BERT-style text classification:
```python
class TransformerForClassification(nn.Module):
    def __init__(self, vocab_size, num_classes, d_model=768,
                 num_layers=12, num_heads=12, dropout=0.1):
        super().__init__()
        # Assumes a TransformerEncoder class defined elsewhere that exposes an
        # .embedding layer and accepts precomputed embeddings in forward()
        self.transformer = TransformerEncoder(
            vocab_size, d_model, num_layers, num_heads,
            d_ff=d_model * 4, dropout=dropout
        )
        # Learnable [CLS] token used for classification
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))
        self.classifier = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Dropout(dropout),
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_model, num_classes)
        )

    def forward(self, input_ids, attention_mask=None):
        # attention_mask is accepted but not wired through in this simplified example
        batch_size = input_ids.size(0)
        # Prepend the [CLS] token to the embedded input
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        input_embeddings = self.transformer.embedding(input_ids)
        input_embeddings = torch.cat([cls_tokens, input_embeddings], dim=1)
        # Run the Transformer encoder
        transformer_output = self.transformer(input_embeddings)
        # Take the output at the [CLS] position
        cls_output = transformer_output[:, 0, :]
        # Classify
        logits = self.classifier(cls_output)
        return logits
```
A traditional CNN builds up a global representation gradually through local receptive fields; ViT instead splits the image into a sequence of patches and processes it with a pure Transformer:
```python
class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3,
                 num_classes=1000, embed_dim=768, depth=12,
                 num_heads=12, mlp_ratio=4., dropout=0.1):
        super().__init__()
        # Patch embedding: a strided convolution cuts the image into patches
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size,
                                     stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        # [CLS] token and learned positional embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.pos_drop = nn.Dropout(dropout)
        # Transformer encoder (assumes a TransformerBlock class defined elsewhere)
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, int(embed_dim * mlp_ratio), dropout)
            for _ in range(depth)
        ])
        # Classification head
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        batch_size = x.shape[0]
        # Patch embedding: [B, C, H, W] -> [B, num_patches, embed_dim]
        x = self.patch_embed(x).flatten(2).transpose(1, 2)
        # Prepend the [CLS] token
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        # Add the positional embedding
        x = x + self.pos_embed
        x = self.pos_drop(x)
        # Pass through the Transformer blocks
        for block in self.blocks:
            x = block(x)
        # Classify from the [CLS] token
        x = self.norm(x)
        logits = self.head(x[:, 0])
        return logits
```
In real projects, for small and medium-sized datasets it is common to put a lightweight CNN in front of the ViT as a feature extractor; this hybrid architecture often outperforms a pure ViT.
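To illustrate the hybrid idea, here is a hypothetical convolutional stem that could stand in for the single-convolution patch embedding above. The channel sizes are illustrative, not taken from any particular paper:

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    # Hypothetical lightweight CNN stem: downsamples a 224x224 image
    # to a 14x14 grid of patch tokens of dimension embed_dim
    def __init__(self, in_channels=3, embed_dim=768):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1),   # 224 -> 112
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),           # 112 -> 56
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, embed_dim, 3, stride=4, padding=1),    # 56 -> 14
        )

    def forward(self, x):
        x = self.stem(x)                     # [B, embed_dim, 14, 14]
        return x.flatten(2).transpose(1, 2)  # [B, 196, embed_dim]

stem = ConvStem()
x = torch.randn(1, 3, 224, 224)
print(stem(x).shape)  # torch.Size([1, 196, 768])
```

The output sequence can then be fed into the Transformer blocks exactly as the patch embedding's output is.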
The original attention mechanism has O(n²) time and space complexity in sequence length, which becomes very unfriendly for long sequences (1000+ tokens). Several mainstream optimization directions exist:

**Sparse attention**: restrict each token to a structured subset of positions (e.g. Longformer, BigBird).

**Low-rank approximation**: project keys and values into a lower dimension (e.g. Linformer).

**Kernel approximation**: replace softmax attention with kernel feature maps to make it linear (e.g. Performer).

Below is a simplified implementation of sparse attention:
```python
class SparseAttention(nn.Module):
    def __init__(self, d_model, num_heads, block_size=64, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.block_size = block_size
        self.d_k = d_model // num_heads
        # Linear projection layers
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def block_sparse_attention(self, Q, K, V):
        batch_size, num_heads, seq_len, d_k = Q.shape
        # Split the sequence into blocks (seq_len must be divisible by block_size)
        num_blocks = seq_len // self.block_size
        Q_blocks = Q.view(batch_size, num_heads, num_blocks, self.block_size, d_k)
        K_blocks = K.view(batch_size, num_heads, num_blocks, self.block_size, d_k)
        V_blocks = V.view(batch_size, num_heads, num_blocks, self.block_size, d_k)
        # Each block only attends to its neighbors
        attention_outputs = []
        for i in range(num_blocks):
            # Attend to the previous, current, and next block
            start_idx = max(0, i - 1)
            end_idx = min(num_blocks, i + 2)
            Q_block = Q_blocks[:, :, i]
            # reshape (not view): the slices are non-contiguous
            K_neighbors = K_blocks[:, :, start_idx:end_idx].reshape(
                batch_size, num_heads, -1, d_k)
            V_neighbors = V_blocks[:, :, start_idx:end_idx].reshape(
                batch_size, num_heads, -1, d_k)
            # Standard scaled dot-product attention within the window
            scores = torch.matmul(Q_block, K_neighbors.transpose(-2, -1))
            scores = scores / math.sqrt(self.d_k)
            attention_weights = torch.softmax(scores, dim=-1)
            block_output = torch.matmul(attention_weights, V_neighbors)
            attention_outputs.append(block_output)
        # Concatenate the per-block outputs back into a full sequence
        return torch.cat(attention_outputs, dim=2)

    def forward(self, x):
        batch_size, seq_len, _ = x.shape
        # Project and split into heads
        Q = self.W_q(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        out = self.block_sparse_attention(Q, K, V)
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        return self.W_o(self.dropout(out))
```
When training Transformer models in practice, the following techniques can significantly improve results:

**Learning rate warmup**

**Gradient clipping**

**Label smoothing**

**Mixed precision training**

Here is an implementation of learning rate warmup:
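The last three items can all be applied inside a single training step. The sketch below uses a placeholder linear model and random data purely for illustration; mixed precision activates only when CUDA is available:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Linear(512, 10).to(device)                  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)    # mixed precision

x = torch.randn(8, 512, device=device)                 # random batch
y = torch.randint(0, 10, (8,), device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, enabled=use_amp):
    loss = criterion(model(x), y)
scaler.scale(loss).backward()
scaler.unscale_(optimizer)                             # unscale before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
scaler.step(optimizer)
scaler.update()
```

With `enabled=False`, `GradScaler` and `autocast` become no-ops, so the same loop runs unchanged on CPU.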
```python
class WarmupScheduler:
    def __init__(self, optimizer, d_model, warmup_steps=4000):
        self.optimizer = optimizer
        self.d_model = d_model
        self.warmup_steps = warmup_steps
        self.current_step = 0

    def step(self):
        # Noam schedule from the original paper:
        # linear warmup, then inverse-square-root decay
        self.current_step += 1
        lr = (self.d_model ** -0.5) * min(
            self.current_step ** -0.5,
            self.current_step * (self.warmup_steps ** -1.5)
        )
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr
```
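A quick standalone check of the formula confirms the schedule's shape: a linear ramp up to a peak at `warmup_steps`, then decay proportional to step^-0.5:

```python
d_model, warmup_steps = 512, 4000

def lr_at(step):
    # Same formula as WarmupScheduler.step
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 40000):
    print(s, lr_at(s))
# lr rises until warmup_steps, then decays
```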
**Problem 1: the loss does not decrease early in training**

**Problem 2: large fluctuations on the validation set**

**Problem 3: OOM (out of GPU memory) on long sequences**
**Efficient implementation tips:**

- Compile hot modules with torch.jit.script
- Use torch.einsum for tensor contractions
- Prefer matmul over bmm

**Memory optimization tips:**

- Implement gradient checkpointing with torch.utils.checkpoint
- Call torch.cuda.empty_cache() when appropriate
- Use pin_memory=True to speed up data loading

**Distributed training tips:**
Here is a Transformer layer that uses gradient checkpointing:
```python
from torch.utils.checkpoint import checkpoint

class CheckpointedTransformerLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Wrap the attention computation in a gradient checkpoint:
        # activations are recomputed during backward instead of stored
        def run_attn(x):
            attn_out, _ = self.attention(x, x, x, mask)
            return attn_out
        attn_out = checkpoint(run_attn, x, use_reentrant=False)
        x = self.norm1(x + self.dropout(attn_out))
        # Checkpoint the feed-forward network as well
        ffn_out = checkpoint(self.ffn, x, use_reentrant=False)
        x = self.norm2(x + self.dropout(ffn_out))
        return x
```
When deploying Transformer models in production, the following factors need to be weighed:

**Latency vs. throughput trade-offs:**

**Hardware adaptation:**

**Quantization strategy:**
Here is a simple model quantization example:
```python
# Dynamic quantization: int8 weights, activations quantized on the fly
# (assumes a TransformerEncoder class defined elsewhere)
model = TransformerEncoder(vocab_size=10000, d_model=512)
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Static quantization (requires calibration)
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
quantized_model = torch.quantization.prepare(model, inplace=False)
# ... run calibration data through quantized_model ...
quantized_model = torch.quantization.convert(quantized_model)
```
Although the Transformer has been enormously successful, many directions remain worth exploring:

**More efficient architecture designs:**

**Unified multimodal architectures:**

**New self-supervised learning paradigms:**

**Model interpretability:**
In my own research projects, I have found that combining Transformers with graph neural networks (GNNs) works very well on structured data. In molecular property prediction, for example, converting the molecular graph into a sequence and adding a dedicated encoding of edge information lets the model exploit both the Transformer's global modeling capacity and the GNN's structural awareness.