Learning-rate scheduling is a critical part of training deep learning models, and warmup, a technique for controlling the learning rate during the initial phase, has become ubiquitous in recent SOTA models. I first encountered it in 2018 while training BERT: with a standard decay schedule alone, the gradients became unstable within the first few hundred steps, and only after adding warmup did training stabilize.
The core idea of warmup is simple: start training with a low learning rate to "warm up" the model, then hand off to the planned decay schedule (cosine annealing, step decay, etc.). It is particularly well suited to the following scenarios:
Model weights are usually randomly initialized, so the starting point in parameter space may be far from any good solution. Jumping straight to a large learning rate causes two main problems:
In my experiments, without warmup the gradient norm of BERT over the first 100 steps is typically 3-5x larger than with warmup. These violent fluctuations have two consequences:
As GPU memory has grown, training with large batch sizes has become the norm, but large batches bring new challenges:
Warmup resolves this tension neatly. Taking the Transformer as an example, when the batch size grows from 256 to 2048, the optimal number of warmup steps typically needs to grow from about 1000 to about 8000.
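This batch-size scaling can be captured by a simple heuristic (a sketch, not a universal rule; the reference values are taken from the Transformer example above):

```python
def scaled_warmup_steps(batch_size, ref_batch_size=256, ref_warmup_steps=1000):
    """Scale warmup steps linearly with batch size (rule of thumb)."""
    return int(ref_warmup_steps * batch_size / ref_batch_size)

print(scaled_warmup_steps(2048))  # 8000, matching the Transformer example
```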
The low-learning-rate warmup phase gives the model a chance to:
This matters especially in transfer learning. When the pretrained model and the downstream task differ substantially, warmup gives the parameters room to "change direction".
Linear warmup is the simplest and most intuitive variant; the formula is:
```
current_lr = base_lr * min(current_step / warmup_steps, 1.0)
```
Key implementation points:
A PyTorch implementation:
```python
def linear_warmup(current_step, warmup_steps, base_lr):
    if current_step < warmup_steps:
        return base_lr * (current_step / warmup_steps)
    return base_lr
```
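A quick sanity check of the values this schedule produces (the function is repeated here so the snippet runs standalone):

```python
def linear_warmup(current_step, warmup_steps, base_lr):
    if current_step < warmup_steps:
        return base_lr * (current_step / warmup_steps)
    return base_lr

# The learning rate ramps linearly up to base_lr, then stays flat
for step in (0, 250, 500, 1000, 2000):
    print(step, linear_warmup(step, warmup_steps=1000, base_lr=1e-3))
```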
Exponential warmup has a steeper growth curve; the formula is:
```
current_lr = base_lr * (1 - exp(-current_step / warmup_steps))
```
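A runnable sketch of the formula above; since 1 - e^(-1) ≈ 0.63, the rate reaches about 63% of `base_lr` when `current_step == warmup_steps`:

```python
import math

def exponential_warmup(current_step, warmup_steps, base_lr):
    # Rises steeply at first, then asymptotically approaches base_lr
    return base_lr * (1 - math.exp(-current_step / warmup_steps))
```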
Characteristics:
The simplest strategy of all keeps a fixed small learning rate throughout the warmup phase:
```
current_lr = warmup_lr if current_step < warmup_steps else base_lr
```
Advantages:
Disadvantages:
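Constant warmup drops into the same `LambdaLR` pattern used elsewhere in this post; a minimal sketch, where the model, `warmup_factor`, and step counts are illustrative placeholders:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(10, 2)  # placeholder model
base_lr, warmup_factor, warmup_steps = 1e-3, 0.1, 500

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr)
# LambdaLR multiplies base_lr by the returned factor at every step
scheduler = LambdaLR(
    optimizer,
    lambda step: warmup_factor if step < warmup_steps else 1.0,
)
```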
This is currently the most popular combination; the schedule has two phases:
```
lr_t = base_lr * (t / T_warmup)
```
```
lr_t = lr_min + 0.5*(base_lr-lr_min)*(1+cos(π*(t-T_warmup)/(T_total-T_warmup)))
```
A complete PyTorch implementation:
```python
import math
from torch.optim.lr_scheduler import LambdaLR

def cosine_with_warmup(optimizer, warmup_steps, total_steps, num_cycles=0.5):
    def lr_lambda(current_step):
        if current_step < warmup_steps:
            return float(current_step) / float(max(1, warmup_steps))
        progress = float(current_step - warmup_steps) / float(max(1, total_steps - warmup_steps))
        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress)))
    return LambdaLR(optimizer, lr_lambda)
```
Another common combination, suited to scenarios that need clearly delineated learning-rate stages:
```python
from torch.optim.lr_scheduler import LambdaLR

def step_with_warmup(optimizer, warmup_steps, decay_steps, decay_rate=0.1):
    def lr_lambda(current_step):
        if current_step < warmup_steps:
            return float(current_step) / float(max(1, warmup_steps))
        return decay_rate ** (current_step // decay_steps)
    return LambdaLR(optimizer, lr_lambda)
```
A simple and effective combination, and the standard implementation in the HuggingFace Transformers library:
```python
from transformers import get_linear_schedule_with_warmup

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=10000
)
```
Based on years of practice, I have distilled the following rules of thumb:
| Training type | Suggested warmup ratio | Typical example |
|---|---|---|
| Large-scale pretraining | 1-2% | 10k steps (1M total) |
| Medium-scale training | 5-10% | 5k steps (50k total) |
| Fine-tuning | 10-20% | 1k steps (10k total) |
| Small datasets | 20-30% | 500 steps (2k total) |
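The table can be encoded as a small helper; the ratio values below pick one point inside each range, and the category names are my own shorthand, not standard identifiers:

```python
# Warmup ratios distilled from the table (heuristics, not hard rules)
WARMUP_RATIO = {
    "large_pretrain": 0.01,   # 1-2%
    "medium": 0.10,           # 5-10%
    "finetune": 0.10,         # 10-20%
    "small_dataset": 0.25,    # 20-30%
}

def warmup_steps_for(total_steps, training_type):
    return int(total_steps * WARMUP_RATIO[training_type])

print(warmup_steps_for(1_000_000, "large_pretrain"))  # 10000
```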
Notes:
The warmup starting learning rate is usually set in one of two ways:
Starting from 0:

```
lr = base_lr * (t / warmup_steps)
```

Starting from a small value (e.g. 10% of base_lr):

```
lr = 0.1*base_lr + 0.9*base_lr*(t/warmup_steps)
```

In multi-task learning, warmup needs special care:
Shared warmup:
Independent warmup:
An implementation sketch:
```python
class MultiTaskWarmupScheduler:
    def __init__(self, optimizer, tasks, warmup_steps):
        self.task_step = {task: 0 for task in tasks}
        self.warmup_steps = warmup_steps
        self.optimizer = optimizer

    def step(self, task):
        # Each task advances its own warmup counter
        self.task_step[task] += 1
        progress = min(self.task_step[task] / self.warmup_steps, 1.0)
        for param_group in self.optimizer.param_groups:
            # Assumes 'initial_lr' was stored in each param group beforehand
            param_group['lr'] = param_group['initial_lr'] * progress
```
Symptoms:
Solutions:
Symptoms:
Solutions:
In distributed training, pay attention to:
Best practice:
```python
import torch
import torch.distributed as dist

# Ensure all ranks agree on the step count in DDP training
def get_global_step(current_step):
    if dist.is_available() and dist.is_initialized():
        # Synchronize the step count across all processes
        step_tensor = torch.tensor(current_step, dtype=torch.long, device='cuda')
        dist.all_reduce(step_tensor, op=dist.ReduceOp.MAX)
        return step_tensor.item()
    return current_step
```
A complete training-loop example:
```python
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=5e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=20000,
    num_cycles=0.5
)

for epoch in range(epochs):
    for step, batch in enumerate(train_loader):
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        if step % 100 == 0:
            current_lr = optimizer.param_groups[0]['lr']
            print(f"Step {step}, LR: {current_lr:.2e}, Loss: {loss.item():.4f}")
```
Using a Keras LearningRateSchedule:
```python
import numpy as np
import tensorflow as tf

class WarmupCosineDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, base_lr, warmup_steps, total_steps):
        super().__init__()
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        # Use tensor ops so the schedule also works in graph mode
        step = tf.cast(step, tf.float32)
        warmup_lr = self.base_lr * (step / self.warmup_steps)
        progress = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
        cosine_lr = 0.5 * self.base_lr * (1 + tf.cos(np.pi * progress))
        return tf.where(step < self.warmup_steps, warmup_lr, cosine_lr)

# Usage example
lr_schedule = WarmupCosineDecay(1e-3, 1000, 20000)
optimizer = tf.keras.optimizers.Adam(lr_schedule)
```
The Transformers library provides out-of-the-box support:
```python
from transformers import AdamW, get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=5e-5, correct_bias=False)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=len(train_dataloader) * epochs
)

# Training loop
for batch in train_dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```
Automatic adjustment based on training metrics:
```python
class AdaptiveWarmup:
    def __init__(self, optimizer, max_warmup=2000, patience=100):
        self.optimizer = optimizer
        self.max_warmup = max_warmup
        self.patience = patience
        self.best_loss = float('inf')
        self.no_improve = 0
        self.current_steps = 0

    def step(self, current_loss):
        self.current_steps += 1
        if current_loss < self.best_loss:
            self.best_loss = current_loss
            self.no_improve = 0
        else:
            self.no_improve += 1
        if self.no_improve >= self.patience and self.current_steps < self.max_warmup:
            # End warmup early if the loss has plateaued
            self.current_steps = self.max_warmup
        progress = min(self.current_steps / self.max_warmup, 1.0)
        for param_group in self.optimizer.param_groups:
            # Assumes 'initial_lr' was stored in each param group beforehand
            param_group['lr'] = param_group['initial_lr'] * progress
```
Using different warmup schedules for different network layers:
```python
from torch.optim.lr_scheduler import LambdaLR

def layer_specific_warmup(optimizer, warmup_steps, layer_multipliers):
    # LambdaLR accepts one lambda per param group; bind each multiplier
    # explicitly to avoid the late-binding closure pitfall
    def make_lambda(multiplier):
        def lr_lambda(current_step):
            warmup = min(current_step / warmup_steps, 1.0)
            return warmup * multiplier
        return lr_lambda
    return LambdaLR(optimizer, [make_lambda(m) for m in layer_multipliers])
```
The best way to combine them:
An implementation sketch:
```python
def adaptive_clip(step, warmup_steps, max_norm=1.0):
    # Clip more tightly during warmup, relaxing to max_norm afterwards
    # (max(step, 1) avoids clipping all gradients to zero at step 0)
    if step < warmup_steps:
        return max_norm * (max(step, 1) / warmup_steps)
    return max_norm

# In the training loop
torch.nn.utils.clip_grad_norm_(
    model.parameters(),
    adaptive_clip(current_step, warmup_steps)
)
```
Plotting the learning-rate curve with Matplotlib:
```python
import matplotlib.pyplot as plt

def plot_lr_schedule(scheduler, total_steps):
    # Note: this advances the scheduler, so pass a throwaway copy
    lrs = []
    for step in range(total_steps):
        scheduler.step()
        lrs.append(scheduler.get_last_lr()[0])
    plt.figure(figsize=(10, 5))
    plt.plot(lrs)
    plt.xlabel('Training Steps')
    plt.ylabel('Learning Rate')
    plt.title('Learning Rate Schedule')
    plt.grid()
    plt.show()
```
Monitoring gradient statistics during the warmup phase:
```python
def log_gradient_stats(model, step):
    # Global L2 norm across all parameter gradients
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            param_norm = p.grad.data.norm(2)
            total_norm += param_norm.item() ** 2
    total_norm = total_norm ** 0.5
    if step % 100 == 0:
        print(f"Step {step}: Grad Norm {total_norm:.4f}")
```
Understanding the effect of warmup through visualization:
```python
import numpy as np
import matplotlib.pyplot as plt

def visualize_loss_landscape(model, dataloader, directions, steps=50):
    # directions: two random parameter-space directions, one (d1, d2) pair per parameter
    original_params = {name: p.data.clone() for name, p in model.named_parameters()}
    alphas = np.linspace(-1, 1, steps)
    betas = np.linspace(-1, 1, steps)
    losses = np.zeros((len(alphas), len(betas)))
    for i, alpha in enumerate(alphas):
        for j, beta in enumerate(betas):
            # Perturb the parameters along both directions
            for (name, param), (d1, d2) in zip(model.named_parameters(), directions):
                param.data = original_params[name] + alpha * d1 + beta * d2
            # Evaluate the loss at the perturbed point
            losses[i, j] = evaluate(model, dataloader)
    # Restore the original parameters
    for name, param in model.named_parameters():
        param.data = original_params[name]
    # Plot the 3D surface
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    X, Y = np.meshgrid(alphas, betas)
    ax.plot_surface(X, Y, losses, cmap='viridis')
    ax.set_xlabel('Direction 1')
    ax.set_ylabel('Direction 2')
    ax.set_zlabel('Loss')
```
Standard configuration:
```python
# HuggingFace implementation
from transformers import AdamW, get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10000,
    num_training_steps=1000000
)
```
Recommended configuration:
```python
# PyTorch implementation
import math
from torch.optim.lr_scheduler import LambdaLR

def vit_scheduler(optimizer, warmup_epochs, total_epochs, steps_per_epoch):
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    def lr_lambda(current_step):
        if current_step < warmup_steps:
            return current_step / warmup_steps
        progress = (current_step - warmup_steps) / (total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)
```
Special considerations:
An example configuration:
```python
# Warmup with restarts
from torch.optim.lr_scheduler import LambdaLR

def get_restart_warmup_scheduler(optimizer, warmup_steps, total_steps, num_restarts=3):
    restart_interval = total_steps // num_restarts
    def lr_lambda(current_step):
        phase = current_step % restart_interval
        warmup = min(warmup_steps, restart_interval // 4)
        if phase < warmup:
            return phase / warmup
        # Linear decay to zero within each restart cycle
        return 1.0 - (phase - warmup) / (restart_interval - warmup)
    return LambdaLR(optimizer, lr_lambda)
```