Gradient descent is one of the most fundamental and important optimization algorithms in machine learning. Imagine being blindfolded on a rugged mountain, trying to find the lowest valley. Your only guide is the slope you feel underfoot, which tells you where to step next: this is the intuitive picture behind gradient descent.
In machine learning, the "mountain" is our loss function, the "slope" is the gradient, and the "lowest point" corresponds to the optimal model parameters. The core idea is simple: compute the gradient at the current position, then update the parameters in the opposite direction of the gradient (the direction of steepest descent), gradually approaching the minimum.
The mathematical form of gradient descent is very simple:
θ = θ - η·∇J(θ)
where:
- θ is the vector of model parameters
- η is the learning rate, which controls the step size
- ∇J(θ) is the gradient of the loss function J with respect to θ
The formula tells us that parameters are updated in the direction opposite to the gradient (because the gradient points in the direction of fastest increase), while the size of each update is controlled by the learning rate.
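To make the update rule concrete, here is a minimal sketch (function and values chosen purely for illustration) that applies θ = θ - η·∇J(θ) to the one-dimensional function J(θ) = (θ - 3)²:

```python
# J(θ) = (θ - 3)^2 has its minimum at θ = 3, with gradient ∇J(θ) = 2(θ - 3)
theta = 0.0   # starting point
eta = 0.1     # learning rate
for _ in range(100):
    grad = 2 * (theta - 3)
    theta -= eta * grad  # step against the gradient
# theta is now very close to the minimum at 3
```

Each step shrinks the distance to the minimum by a constant factor (1 - 2η here), which is also why a learning rate that is too large can overshoot and diverge.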
In practice, the gradient is usually computed with the backpropagation algorithm. For a simple linear regression model, the gradient of the mean squared error loss with respect to the weight vector w is:
∇J(w) = (2/n)Xᵀ(Xw - y)
where X is the feature matrix, y is the vector of target values, and n is the number of samples. This formula can be implemented efficiently with a single matrix operation.
Note: in an actual implementation we usually avoid forming large intermediate products such as XᵀX explicitly; instead, especially when the dataset is large, the gradient is computed over batches of the data.
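One way to sanity-check the formula above is to compare the analytic gradient with a numerical finite-difference estimate on small random data (a sketch; the data sizes and tolerances are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
n = X.shape[0]

def loss(w):
    # mean squared error J(w)
    return np.mean((X.dot(w) - y) ** 2)

# analytic gradient: (2/n) X^T (Xw - y)
analytic = (2 / n) * X.T.dot(X.dot(w) - y)

# central-difference numerical gradient, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (loss(w + eps * np.eye(3)[j]) - loss(w - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```

This kind of gradient check is a standard debugging tool whenever a gradient is derived by hand.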
Batch gradient descent is the original form of the algorithm: every iteration uses the entire training set to compute the gradient:
```python
import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.01, epochs=1000):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    for _ in range(epochs):
        # gradient of the mean squared error over the full training set
        gradient = (2 / n_samples) * X.T.dot(X.dot(weights) - y)
        weights -= learning_rate * gradient
    return weights
```
Advantages: each update moves in the exact descent direction, so convergence is stable.
Disadvantages: every step touches the whole dataset, which is expensive for large datasets.
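Since linear least squares has a closed-form solution, batch gradient descent can be verified against it: after enough iterations the two should agree (a sketch with synthetic data; the learning rate and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X.dot(np.array([2.0, -1.0])) + 0.01 * rng.normal(size=100)

# batch gradient descent on the mean squared error
w = np.zeros(2)
for _ in range(2000):
    gradient = (2 / len(y)) * X.T.dot(X.dot(w) - y)
    w -= 0.05 * gradient

# closed-form least-squares solution for comparison
w_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_exact, atol=1e-4))  # True
```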
Stochastic gradient descent (SGD) computes the gradient from a single randomly chosen sample at each update:
```python
import numpy as np

def stochastic_gradient_descent(X, y, learning_rate=0.01, epochs=100):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    for _ in range(epochs):
        for _ in range(n_samples):
            # draw one sample (with replacement) and step on its gradient
            random_idx = np.random.randint(n_samples)
            xi = X[random_idx:random_idx + 1]
            yi = y[random_idx:random_idx + 1]
            gradient = 2 * xi.T.dot(xi.dot(weights) - yi)
            weights -= learning_rate * gradient
    return weights
```
Advantages: updates are cheap, and the noise can help escape shallow local minima.
Disadvantages: the update direction fluctuates heavily, so convergence is unstable.
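The noise in SGD comes from the fact that a single-sample gradient is an unbiased but high-variance estimate of the full-batch gradient. A small sketch (random data, for illustration only) makes this concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = rng.normal(size=50)
w = np.zeros(2)
n = X.shape[0]

# full-batch gradient
full_grad = (2 / n) * X.T.dot(X.dot(w) - y)

# per-sample gradients: 2 * x_i * (x_i·w - y_i)
per_sample = np.array([2 * X[i] * (X[i].dot(w) - y[i]) for i in range(n)])

# averaging the per-sample gradients recovers the full gradient exactly...
print(np.allclose(per_sample.mean(axis=0), full_grad))  # True
# ...but each individual estimate scatters widely around it
print(per_sample.std(axis=0))
```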
Mini-batch gradient descent is a compromise between the two: each update uses a small batch of samples:
```python
import numpy as np

def mini_batch_gradient_descent(X, y, learning_rate=0.01, batch_size=32, epochs=100):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    for _ in range(epochs):
        # reshuffle the data at the start of every epoch
        indices = np.random.permutation(n_samples)
        X_shuffled, y_shuffled = X[indices], y[indices]
        for i in range(0, n_samples, batch_size):
            X_batch = X_shuffled[i:i + batch_size]
            y_batch = y_shuffled[i:i + batch_size]
            # use len(y_batch) so the last, possibly smaller batch is scaled correctly
            gradient = (2 / len(y_batch)) * X_batch.T.dot(X_batch.dot(weights) - y_batch)
            weights -= learning_rate * gradient
    return weights
```
Advantages: combines computational efficiency with reasonably stable updates.
Disadvantages: introduces batch_size as an extra hyperparameter to tune.
Momentum accelerates convergence and damps oscillation by accumulating a "velocity" across updates:
```python
import numpy as np

def momentum_gradient_descent(X, y, learning_rate=0.01, momentum=0.9, epochs=1000):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    velocity = np.zeros(n_features)
    for _ in range(epochs):
        gradient = (2 / n_samples) * X.T.dot(X.dot(weights) - y)
        # exponentially decaying moving average of past gradients
        velocity = momentum * velocity - learning_rate * gradient
        weights += velocity
    return weights
```
The momentum coefficient is usually set around 0.9; the accumulated velocity helps the algorithm coast through small local minima.
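The effect is easiest to see on an ill-conditioned quadratic "ravine", where plain gradient descent crawls along the shallow direction while momentum accelerates (a toy sketch; the curvatures and hyperparameters are chosen only for illustration):

```python
import numpy as np

# J(w) = 0.5 * (1*w0^2 + 100*w1^2): minimum at the origin, condition number 100
curvature = np.array([1.0, 100.0])

def grad(w):
    return curvature * w

w_gd = np.array([1.0, 1.0])    # plain gradient descent
w_mom = np.array([1.0, 1.0])   # gradient descent with momentum
velocity = np.zeros(2)
lr, beta = 0.01, 0.9

for _ in range(200):
    w_gd = w_gd - lr * grad(w_gd)
    velocity = beta * velocity - lr * grad(w_mom)
    w_mom = w_mom + velocity

# momentum ends up much closer to the minimum after the same number of steps
print(np.linalg.norm(w_gd), np.linalg.norm(w_mom))
```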
Adam combines momentum with per-parameter adaptive learning rates, and is currently the most widely used optimizer:
```python
import numpy as np

def adam_optimizer(X, y, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, epochs=1000):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    m = np.zeros(n_features)  # first moment estimate
    v = np.zeros(n_features)  # second moment estimate
    for t in range(1, epochs + 1):
        gradient = (2 / n_samples) * X.T.dot(X.dot(weights) - y)
        # update biased first moment estimate
        m = beta1 * m + (1 - beta1) * gradient
        # update biased second moment estimate
        v = beta2 * v + (1 - beta2) * (gradient ** 2)
        # bias-corrected first moment estimate
        m_hat = m / (1 - beta1 ** t)
        # bias-corrected second moment estimate
        v_hat = v / (1 - beta2 ** t)
        # parameter update
        weights -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return weights
```
Because Adam adapts the learning rate per parameter, it is not very sensitive to the choice of initial learning rate; 0.001 is a common default.
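The bias-correction terms matter most at the very first steps. At t = 1, m = (1 - β₁)·g would badly underestimate the gradient, and dividing by (1 - β₁ᵗ) restores it exactly (a tiny numeric check; the gradient values are made up):

```python
import numpy as np

g = np.array([0.5, -2.0])  # a hypothetical first-step gradient
beta1, beta2 = 0.9, 0.999

# moment estimates after one update (both start at zero)
m = (1 - beta1) * g          # = 0.1 * g, far smaller than g itself
v = (1 - beta2) * g ** 2

# bias correction at t = 1 recovers the raw gradient statistics exactly
m_hat = m / (1 - beta1 ** 1)
v_hat = v / (1 - beta2 ** 1)
print(np.allclose(m_hat, g), np.allclose(v_hat, g ** 2))  # True True
```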
A fixed learning rate can cause oscillation late in training or slow convergence; a learning rate decay schedule helps:
```python
def learning_rate_schedule(initial_lr, epoch, decay_rate=0.1, decay_steps=100):
    # step decay: shrink the learning rate by decay_rate every decay_steps epochs
    return initial_lr * (decay_rate ** (epoch // decay_steps))
```
Common decay strategies include:
- Step decay: drop the learning rate by a fixed factor every few epochs (as above)
- Exponential decay: multiply the learning rate by a constant factor each epoch
- Cosine annealing: follow a cosine curve from the initial rate down toward zero
Warmup uses a small learning rate at the start of training and gradually ramps it up to the target value:
```python
def warmup_schedule(initial_lr, epoch, warmup_epochs=10):
    # linearly ramp the learning rate up over the first warmup_epochs
    # (epoch + 1 so that epoch 0 does not get a learning rate of zero)
    if epoch < warmup_epochs:
        return initial_lr * ((epoch + 1) / warmup_epochs)
    return initial_lr
```
Warmup avoids the instability that overly large parameter updates can cause in the first few epochs.
When a network is very deep, gradients can become extremely small (vanishing) or extremely large (exploding) as they propagate backward through the layers. Common remedies include gradient clipping, careful weight initialization, normalization layers such as batch normalization, and residual connections:
```python
import numpy as np

# gradient clipping: rescale the gradient if its norm exceeds a threshold
max_grad_norm = 1.0
gradient = np.array([3.0, -4.0])  # example gradient with norm 5
grad_norm = np.linalg.norm(gradient)
if grad_norm > max_grad_norm:
    gradient = gradient * (max_grad_norm / grad_norm)
```
In high-dimensional spaces, true local minima are rare; saddle points are far more common. Strategies that help include the noise inherent in SGD, momentum (which carries the iterate through flat regions), and adaptive optimizers such as Adam.
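A toy example shows why saddle points stall plain gradient descent and why a little noise helps. At the saddle of f(x, y) = x² - y² the gradient is exactly zero, but any tiny perturbation in y (such as SGD noise) grows under repeated updates (function and constants chosen purely for illustration):

```python
import numpy as np

def grad(p):
    # f(x, y) = x^2 - y^2: saddle point at the origin
    x, y = p
    return np.array([2 * x, -2 * y])

p = np.array([0.0, 0.0])
print(np.linalg.norm(grad(p)))  # 0.0: plain gradient descent is stuck here

# a tiny perturbation, mimicking SGD noise, escapes along the y direction
p = np.array([0.0, 1e-6])
for _ in range(200):
    p = p - 0.05 * grad(p)
print(abs(p[1]))  # has grown by many orders of magnitude
```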
Features with very different scales make convergence difficult, so always standardize the features:
```python
import numpy as np

# standardize each feature to zero mean and unit variance
X_mean = np.mean(X, axis=0)
X_std = np.std(X, axis=0)
X_normalized = (X - X_mean) / X_std
```
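The impact of scaling is easy to demonstrate: with two features whose scales differ by a factor of 100, the largest stable learning rate is dictated by the large-scale feature, so gradient descent on the raw data crawls, while the same budget of steps on standardized features converges (a sketch with synthetic data; all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([1.0, 100.0])  # wildly different scales
y = X.dot(np.array([1.5, 0.02])) + 0.1 * rng.normal(size=200)

def final_loss(X, y, lr, steps=500):
    # plain batch gradient descent, returning the final mean squared error
    w = np.zeros(2)
    n = len(y)
    for _ in range(steps):
        w -= lr * (2 / n) * X.T.dot(X.dot(w) - y)
    return np.mean((X.dot(w) - y) ** 2)

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
loss_raw = final_loss(X, y, lr=1e-5)       # any larger lr diverges on raw data
loss_scaled = final_loss(X_scaled, y, lr=0.1)
print(loss_raw, loss_scaled)  # the scaled run reaches a far lower loss
```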
Below is a complete linear regression implementation that compares several optimization algorithms:
```python
import numpy as np
import matplotlib.pyplot as plt

class LinearRegression:
    def __init__(self, optimizer='sgd', learning_rate=0.01, momentum=0.9):
        self.optimizer = optimizer
        self.lr = learning_rate
        self.momentum = momentum
        self.weights = None
        self.loss_history = []

    def fit(self, X, y, epochs=100, batch_size=32):
        y = np.asarray(y).ravel()  # ensure y is 1-D so the shapes broadcast correctly
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        velocity = np.zeros(n_features)
        for epoch in range(epochs):
            # mini-batch gradient descent with per-epoch shuffling
            indices = np.random.permutation(n_samples)
            X_shuffled = X[indices]
            y_shuffled = y[indices]
            for i in range(0, n_samples, batch_size):
                X_batch = X_shuffled[i:i + batch_size]
                y_batch = y_shuffled[i:i + batch_size]
                # compute the gradient on this batch
                predictions = X_batch.dot(self.weights)
                gradient = (2 / len(y_batch)) * X_batch.T.dot(predictions - y_batch)
                # apply the chosen optimizer's update rule
                if self.optimizer == 'sgd':
                    self.weights -= self.lr * gradient
                elif self.optimizer == 'momentum':
                    velocity = self.momentum * velocity - self.lr * gradient
                    self.weights += velocity
                elif self.optimizer == 'adam':
                    # simplified Adam with beta1=0.9, beta2=0.999
                    if not hasattr(self, 'm'):
                        self.m = np.zeros(n_features)
                        self.v = np.zeros(n_features)
                        self.t = 0
                    self.t += 1
                    self.m = 0.9 * self.m + 0.1 * gradient
                    self.v = 0.999 * self.v + 0.001 * (gradient ** 2)
                    m_hat = self.m / (1 - 0.9 ** self.t)
                    v_hat = self.v / (1 - 0.999 ** self.t)
                    self.weights -= self.lr * m_hat / (np.sqrt(v_hat) + 1e-8)
            # record the full-data loss once per epoch
            predictions = X.dot(self.weights)
            loss = np.mean((predictions - y) ** 2)
            self.loss_history.append(loss)
            if epoch % 100 == 0:
                print(f'Epoch {epoch}, Loss: {loss:.4f}')

    def predict(self, X):
        return X.dot(self.weights)

    def plot_loss(self):
        plt.plot(self.loss_history)
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.title(f'Training Loss ({self.optimizer.upper()})')
        plt.show()

# generate test data: y = 4 + 3x + noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# add a bias column
X_b = np.c_[np.ones((100, 1)), X]

# compare the different optimizers
optimizers = ['sgd', 'momentum', 'adam']
models = {}
for opt in optimizers:
    print(f"\nTraining with {opt.upper()}...")
    model = LinearRegression(optimizer=opt, learning_rate=0.01)
    model.fit(X_b, y, epochs=500)
    models[opt] = model
    model.plot_loss()

# visualize the fitted lines
plt.scatter(X, y)
x_plot = np.linspace(0, 2, 100).reshape(-1, 1)
x_plot_b = np.c_[np.ones((100, 1)), x_plot]
for opt, model in models.items():
    y_plot = model.predict(x_plot_b)
    plt.plot(x_plot, y_plot, label=opt.upper())
plt.legend()
plt.show()
```
In deep neural networks, gradient descent is implemented via backpropagation. Modern deep learning frameworks such as TensorFlow and PyTorch have built-in automatic differentiation, which makes gradient computation transparent:
```python
import torch
import torch.nn as nn
import torch.optim as optim

# define a simple feed-forward network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(10, 50)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# initialize the model and optimizer
model = SimpleNN()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# simulated training loop
for epoch in range(100):
    # simulated inputs and labels
    inputs = torch.randn(32, 10)   # batch_size=32, input_dim=10
    targets = torch.randn(32, 1)
    # forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    # backward pass and parameter update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item():.4f}')
```
In real applications there are further practical concerns: choosing and scheduling the learning rate, regularization to prevent overfitting, early stopping based on validation loss, and monitoring gradient norms to catch vanishing or exploding gradients early.
The idea of gradient descent dates back to 1847, when the French mathematician Augustin-Louis Cauchy proposed it. But it was not until computers appeared in the mid-20th century that the algorithm found wide application. In recent years, driven by the rise of deep learning, a steady stream of improved variants has emerged: momentum and Nesterov accelerated gradient, AdaGrad, RMSProp, Adam, and refinements such as AdamW.
Each optimizer has scenarios where it shines, but Adam, with its strong adaptivity and robustness, has become the default choice for most deep learning tasks.