The Transformer is a revolutionary neural network architecture that transformed the field of natural language processing. It was introduced in Google's 2017 paper "Attention Is All You Need", whose core innovation is an architecture based entirely on attention mechanisms, discarding the recurrent (RNN) and convolutional (CNN) structures of earlier sequence models.
This architecture matters because it solves several key problems in sequence modeling: RNNs must process tokens one at a time and therefore cannot be parallelized within a sequence, and both RNNs and CNNs struggle to relate tokens that are far apart. Self-attention processes all positions in parallel and connects any two positions in a single step.
The Transformer follows the classic encoder-decoder framework:
The encoder converts the input sequence (e.g., a source-language sentence) into a semantically rich intermediate representation. It is a stack of N identical encoder layers (N = 6 in the original paper), each containing two main sublayers: a multi-head self-attention mechanism and a position-wise feed-forward network.
The decoder predicts the next token step by step, conditioned on the encoder output and the partial output sequence generated so far. It is likewise a stack of N decoder layers, but each layer contains three sublayers: masked multi-head self-attention, encoder-decoder (cross) attention, and a position-wise feed-forward network.
The Transformer uses a standard embedding layer to map discrete tokens to continuous vector representations. With vocabulary size V and model dimension d_model (typically 512), the embedding matrix has shape V × d_model.
Two details matter in practice: the paper multiplies the embeddings by √d_model, and the padding token should be pinned to a zero vector via padding_idx. (The imports at the top of this snippet are shared by all later snippets in this article.)
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class Embeddings(nn.Module):
    def __init__(self, vocab_size, d_model, pad_id):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)  # pad rows stay zero

    def forward(self, x):
        # x shape: (batch_size, seq_len); scale by sqrt(d_model) as in the paper
        return self.embedding(x) * math.sqrt(self.embedding.embedding_dim)
```
Because the Transformer contains no recurrence or convolution, positional information must be injected explicitly. The original paper uses a fixed pattern of sine and cosine functions:
PE(pos,2i) = sin(pos/10000^(2i/d_model))
PE(pos,2i+1) = cos(pos/10000^(2i/d_model))
where pos is the position index and i is the dimension index. This encoding has several advantages: it requires no learned parameters, it extends to sequence lengths not seen during training, and for any fixed offset k, PE(pos+k) is a linear function of PE(pos), which makes relative positions easy for the model to attend to.
```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # register_buffer: saved with the model but not a trainable parameter
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        # pe[:seq_len] broadcasts across the batch dimension
        return x + self.pe[:x.size(1)]
```
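Putting the two together, the input pipeline is simply embedding followed by positional encoding. A minimal usage sketch; the vocabulary size, pad id, and batch shapes below are made-up values for illustration, and the dropout the paper applies to this sum is omitted for brevity:

```python
# Hypothetical sizes, chosen only for this example
vocab_size, d_model, pad_id = 10000, 512, 0

embed = Embeddings(vocab_size, d_model, pad_id)
pos_enc = PositionalEncoding(d_model)

tokens = torch.randint(1, vocab_size, (2, 7))  # (batch_size=2, seq_len=7)
x = pos_enc(embed(tokens))                     # (2, 7, 512)
print(x.shape)  # torch.Size([2, 7, 512])
```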
The core idea of attention is to compute a weighted sum of the values (V), with weights derived from the similarity between queries (Q) and keys (K). The computation proceeds in four steps: (1) compute the dot products of the query with all keys, (2) scale by √d_k to keep the dot products from growing too large, (3) apply softmax to obtain the attention weights, and (4) use those weights to take a weighted sum of the values.
In mathematical form:
Attention(Q,K,V) = softmax(QK^T/√d_k)V
```python
def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v shapes: (..., seq_len, d_k)
    matmul_qk = torch.matmul(q, k.transpose(-2, -1))
    dk = q.size(-1)
    scaled_attention_logits = matmul_qk / math.sqrt(dk)
    if mask is not None:
        # Positions where mask == 1 are pushed toward -inf before the softmax
        scaled_attention_logits += (mask * -1e9)
    attention_weights = F.softmax(scaled_attention_logits, dim=-1)
    output = torch.matmul(attention_weights, v)
    return output, attention_weights
```
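A quick sanity check, with arbitrary test shapes chosen for illustration: each row of the attention weights should sum to 1, and the output has the same shape as the values.

```python
q = torch.randn(2, 5, 64)  # (batch, seq_len, d_k) -- arbitrary test shapes
k = torch.randn(2, 5, 64)
v = torch.randn(2, 5, 64)
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)            # torch.Size([2, 5, 64])
print(weights.sum(dim=-1))  # all ones: each row is a probability distribution
```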
A single attention head can only capture one pattern of relationships. Multi-head attention projects Q, K, and V into h different subspaces, computes h attention heads in parallel, and concatenates the results:
MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
Advantages: each head can attend to different positions and capture different kinds of relationships, and the model can jointly attend to information from several representation subspaces at once.
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.depth = d_model // num_heads  # dimension per head
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.dense = nn.Linear(d_model, d_model)  # output projection W^O

    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = x.view(batch_size, -1, self.num_heads, self.depth)
        return x.transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)
        q = self.split_heads(self.wq(q), batch_size)
        k = self.split_heads(self.wk(k), batch_size)
        v = self.split_heads(self.wv(v), batch_size)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)
        # (batch, num_heads, seq_len, depth) -> (batch, seq_len, d_model)
        scaled_attention = scaled_attention.transpose(1, 2).contiguous()
        concat_attention = scaled_attention.view(batch_size, -1, self.d_model)
        output = self.dense(concat_attention)
        return output, attention_weights
```
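For example, with d_model = 512 and 8 heads (the original paper's configuration), each head works in a 64-dimensional subspace; the input shapes below are illustrative:

```python
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)  # (batch, seq_len, d_model)
out, attn = mha(x, x, x)     # self-attention: q = k = v
print(out.shape)   # torch.Size([2, 10, 512])
print(attn.shape)  # torch.Size([2, 8, 10, 10]) -- one weight matrix per head
```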
Every encoder and decoder layer also contains a fully connected feed-forward network, applied to each position independently, consisting of two linear transformations with a ReLU activation in between:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
Typical configuration: d_model = 512 with an inner dimension d_ff = 2048, i.e., the hidden layer is four times wider than the model dimension.
```python
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)  # expand: d_model -> d_ff
        self.linear2 = nn.Linear(d_ff, d_model)  # project back: d_ff -> d_model
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))
```
The output of every sublayer is wrapped in a residual connection followed by layer normalization:
LayerNorm(x + Sublayer(x))
This design brings three benefits: the residual connections give gradients a direct path through the stack, mitigating vanishing gradients in deep networks; layer normalization stabilizes activation distributions and speeds up convergence; and together they make it practical to stack many layers.
Note that, following the widely used Annotated Transformer implementation, the code below applies LayerNorm before the sublayer (pre-norm) rather than after it as in the formula above; pre-norm is a common variant that tends to train more stably.

```python
class SublayerConnection(nn.Module):
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Pre-norm residual: x + Dropout(Sublayer(LayerNorm(x)))
        return x + self.dropout(sublayer(self.norm(x)))
```
```python
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.sublayer = nn.ModuleList([
            SublayerConnection(d_model, dropout) for _ in range(2)
        ])

    def forward(self, x, mask):
        # [0] selects the attention output, discarding the attention weights
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask)[0])
        return self.sublayer[1](x, self.feed_forward)
```
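Stacking N of these layers gives the full encoder. A minimal sketch, assuming the pre-norm SublayerConnection above (which is why a final LayerNorm is applied after the stack):

```python
import copy

class Encoder(nn.Module):
    # A stack of N identical encoder layers (minimal sketch)
    def __init__(self, layer, N=6):
        super().__init__()
        self.layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(N)])
        self.norm = nn.LayerNorm(layer.self_attn.d_model)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)
```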
The decoder requires two special pieces of handling: its self-attention is masked with a causal (look-ahead) mask so that position i cannot attend to later positions, preserving the autoregressive property; and it adds an encoder-decoder attention sublayer in which the queries come from the decoder while the keys and values come from the encoder output (memory). A sketch of the causal mask construction follows the layer code below.
```python
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)  # masked self-attention
        self.src_attn = MultiHeadAttention(d_model, num_heads)   # encoder-decoder attention
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.sublayer = nn.ModuleList([
            SublayerConnection(d_model, dropout) for _ in range(3)
        ])

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory  # encoder output: keys and values for cross-attention
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask)[0])
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask)[0])
        return self.sublayer[2](x, self.feed_forward)
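The tgt_mask above is never constructed in the snippets so far; here is one way to build it, matching this article's convention that masked positions are marked with 1 (they receive the -1e9 penalty in scaled_dot_product_attention):

```python
def subsequent_mask(seq_len):
    # Upper-triangular 1s above the diagonal: 1 = "may not attend"
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    return mask  # shape (seq_len, seq_len), broadcasts over batch and heads

print(subsequent_mask(4))
# tensor([[0., 1., 1., 1.],
#         [0., 0., 1., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 0.]])
```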
Learning-rate schedule: the paper trains with the Adam optimizer and a warmup schedule, in which the learning rate increases linearly for the first warmup_steps steps and then decays in proportion to the inverse square root of the step number: lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).
```python
class TransformerScheduler:
    def __init__(self, optimizer, d_model, warmup_steps):
        self.optimizer = optimizer
        self.d_model = d_model
        self.warmup_steps = warmup_steps
        self.step_num = 0

    def step(self):
        self.step_num += 1
        # Linear warmup, then inverse-square-root decay
        lr = self.d_model**-0.5 * min(
            self.step_num**-0.5,
            self.step_num * self.warmup_steps**-1.5)
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr
```
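Typical usage, with the Adam hyperparameters reported in the paper (β1 = 0.9, β2 = 0.98, ε = 1e-9, 4000 warmup steps); model stands for the assembled Transformer, and the initial lr value is irrelevant because the scheduler overwrites it on every step:

```python
optimizer = torch.optim.Adam(model.parameters(), lr=0.0,
                             betas=(0.9, 0.98), eps=1e-9)
scheduler = TransformerScheduler(optimizer, d_model=512, warmup_steps=4000)
```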
Label smoothing: prevents the model from becoming overconfident in its predictions. The paper uses a smoothing value of 0.1, redistributing that probability mass uniformly over the other classes:
```python
# one_hot: (batch, num_classes) target distribution; label_smoothing is typically 0.1
criterion = nn.KLDivLoss(reduction='batchmean')
smoothed_labels = (1.0 - label_smoothing) * one_hot + label_smoothing / num_classes
loss = criterion(F.log_softmax(logits, dim=-1), smoothed_labels)  # expects log-probs
```
Gradient clipping: guards against exploding gradients by rescaling the gradient norm:
```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
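Putting the three techniques together, one training step might look like the following sketch; model, batch.src, batch.tgt, and the mask variables are placeholders for whatever data pipeline surrounds this:

```python
# One training step (sketch; model, batch, and masks are assumed to exist)
optimizer.zero_grad()
logits = model(batch.src, batch.tgt, src_mask, tgt_mask)
loss = criterion(F.log_softmax(logits, dim=-1), smoothed_labels)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scheduler.step()   # set this step's learning rate before the optimizer step
optimizer.step()
```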
The Transformer architecture has spawned numerous variants and applications: encoder-only models such as BERT, decoder-only models such as the GPT family, encoder-decoder models such as T5 and BART, and extensions beyond text such as the Vision Transformer (ViT).
Practical deployment raises its own concerns: attention scales quadratically with sequence length in both time and memory; autoregressive decoding benefits greatly from caching the keys and values of previous steps (KV caching); and techniques such as quantization and distillation are often needed to meet latency and memory budgets.
The success of the Transformer demonstrates the power of attention mechanisms in sequence modeling. Understanding its core principles not only helps in using existing models, but also provides a solid foundation for developing new architectures.