In computer vision, image classification is one of the most fundamental and central tasks. PyTorch, currently the most popular deep learning framework, offers flexible tensor operations and automatic differentiation that make classification tasks straightforward to implement. In this post we take a close look at how to use PyTorch's cross-entropy loss functions to implement both multi-class classification (e.g., MNIST handwritten digit recognition) and binary classification (e.g., cat vs. dog).
Cross-entropy loss is the most common loss function for classification: it measures the discrepancy between the predicted probability distribution and the true distribution. For multi-class tasks we use nn.CrossEntropyLoss, while binary tasks can use nn.BCEWithLogitsLoss. The two implementations differ in subtle but important ways in PyTorch, which is a frequent source of confusion for beginners.
Cross-entropy originates in information theory, where it measures the difference between two probability distributions. Given a true distribution p and a predicted distribution q, it is defined as:
H(p, q) = -Σ p(x) log q(x)
In classification, p is the one-hot encoded ground-truth label and q is the probability distribution output by the model. Minimizing the cross-entropy is equivalent to maximizing the log-likelihood of the correct class.
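As a quick sanity check (my addition, not part of the original walkthrough), the formula can be computed by hand for a single sample and compared against F.cross_entropy:

```python
import torch
import torch.nn.functional as F

# One sample, three classes; the true class is index 0.
logits = torch.tensor([[2.0, 0.5, 0.1]])
target = torch.tensor([0])

# Manual cross-entropy: with one-hot p, H(p, q) reduces to -log q(true class).
q = torch.softmax(logits, dim=1)
manual = -torch.log(q[0, target[0]])

# PyTorch's fused implementation operates directly on the raw logits.
builtin = F.cross_entropy(logits, target)
print(manual.item(), builtin.item())  # both ≈ 0.3167
```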
PyTorch's implementations make two important optimizations:
- nn.CrossEntropyLoss fuses LogSoftmax and NLLLoss, so it takes raw logits directly.
- nn.BCEWithLogitsLoss has the Sigmoid built in, fusing it with BCELoss.

Which one to use:
- Multi-class tasks (e.g., 10-class MNIST classification): nn.CrossEntropyLoss
- Binary tasks (e.g., cat vs. dog classification): nn.BCEWithLogitsLoss
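Both fusion claims are easy to verify on toy tensors; a minimal sketch (my addition):

```python
import torch
import torch.nn as nn

# Multi-class: CrossEntropyLoss == LogSoftmax + NLLLoss
logits = torch.randn(4, 10)           # batch of 4, 10 classes
targets = torch.randint(0, 10, (4,))  # integer class indices
fused = nn.CrossEntropyLoss()(logits, targets)
manual = nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)
print(torch.allclose(fused, manual))  # True

# Binary: BCEWithLogitsLoss == Sigmoid + BCELoss (but more numerically stable)
bin_logits = torch.randn(4)
bin_targets = torch.randint(0, 2, (4,)).float()
fused_b = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
manual_b = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)
print(torch.allclose(fused_b, manual_b))  # True
```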
Take the CIFAR-10 dataset as an example:
```python
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F  # needed for F.relu in forward()
import torch.optim as optim

# Data loading
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
trainset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=32, shuffle=True)

# A simple CNN
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)  # 10 output classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)  # note: no Softmax here
        return x
```
```python
model = Net()
criterion = nn.CrossEntropyLoss()  # multi-class loss
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)  # Softmax handled internally
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 100 == 99:
            print(f'[{epoch+1}, {i+1}] loss: {running_loss/100:.3f}')
            running_loss = 0.0
```
Important: PyTorch's CrossEntropyLoss already incorporates the Softmax, so the model's last layer does not need, and must not add, another Softmax activation; doing so hurts numerical stability.
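To see the effect concretely, here is a small illustration (my addition): feeding already-softmaxed outputs into nn.CrossEntropyLoss squashes confident logits and distorts the loss:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[5.0, -5.0]])  # very confident raw scores
target = torch.tensor([0])

loss_ok = nn.CrossEntropyLoss()(logits, target)
loss_double = nn.CrossEntropyLoss()(torch.softmax(logits, dim=1), target)
print(loss_ok.item())      # ≈ 0.0000454 — near-zero, as expected
print(loss_double.item())  # ≈ 0.3133 — the extra Softmax flattened the logits
```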
Taking binary cat-vs-dog classification as an example, note the following:
```python
import os
from PIL import Image

# Example of a custom Dataset
class CatDogDataset(torch.utils.data.Dataset):
    def __init__(self, img_dir, transform=None):
        self.img_dir = img_dir
        self.transform = transform
        self.img_names = os.listdir(img_dir)

    def __len__(self):
        return len(self.img_names)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_names[idx])
        image = Image.open(img_path).convert('RGB')
        # Assumes file names contain 'cat' or 'dog'
        label = 0.0 if 'cat' in self.img_names[idx] else 1.0
        if self.transform:
            image = self.transform(image)
        return image, torch.tensor(label, dtype=torch.float32)
```
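A hypothetical way to wire this dataset up (the './train' directory and the 32x32 resize are my assumptions, chosen so the inputs match the CNN below):

```python
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((32, 32)),  # the model below expects 32x32 inputs
    transforms.ToTensor(),
])
trainloader = torch.utils.data.DataLoader(
    CatDogDataset('./train', transform=transform),
    batch_size=32, shuffle=True)
```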
```python
class BinaryClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 1)  # binary classification needs only 1 output

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)  # no Sigmoid
        return x

model = BinaryClassifier()
criterion = nn.BCEWithLogitsLoss()  # Sigmoid built in
optimizer = optim.Adam(model.parameters(), lr=0.0001)

# Training loop
for epoch in range(10):
    model.train()
    for images, labels in trainloader:
        optimizer.zero_grad()
        outputs = model(images).squeeze(1)  # [N, 1] -> [N] to match the labels
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
```
Common mistakes to watch for:
- Wrong label format: nn.CrossEntropyLoss expects integer class indices (dtype long), while nn.BCEWithLogitsLoss expects float targets shaped like the logits.
- Adding Softmax (or Sigmoid) to the output layer: both losses operate on raw logits, as noted above.
- Class imbalance:
```python
# One fix: per-class weights
weights = torch.tensor([1.0, 2.0, 1.0])  # illustrative 3-class case where class 2 is rarer
criterion = nn.CrossEntropyLoss(weight=weights)
```
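A hedged sketch (my addition) of deriving such weights from assumed per-class sample counts via inverse frequency:

```python
counts = torch.tensor([500.0, 250.0, 500.0])     # hypothetical per-class counts
weights = counts.sum() / (len(counts) * counts)  # rarer classes get larger weights
criterion = nn.CrossEntropyLoss(weight=weights)
```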
- Output dimension handling: call .squeeze() so the model's [N, 1] output matches the [N] labels.
- Probability threshold selection:
```python
# At prediction time, apply the Sigmoid and threshold the probabilities
with torch.no_grad():
    outputs = model(inputs)
    probs = torch.sigmoid(outputs)
    preds = (probs > 0.5).float()  # 0.5 is the default threshold
```
- Numerical stability: prefer nn.BCEWithLogitsLoss over a separate Sigmoid followed by nn.BCELoss; the fused version computes the loss in a numerically stable way.
- Use its pos_weight parameter to handle sample imbalance, as sketched below.
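A minimal sketch (the 3:1 ratio is an assumption): with three times as many negative as positive samples, pos_weight up-weights the positive class in the loss:

```python
# Assumed ratio: 3 negatives per positive in the training set
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))
```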
- Label smoothing:
```python
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```
- Combining losses in a custom criterion:
```python
def custom_loss(outputs, targets):
    ce_loss = F.cross_entropy(outputs, targets)
    reg_loss = torch.norm(model.fc3.weight, p=2)  # L2 penalty on the last layer
    return ce_loss + 0.01 * reg_loss
```
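For plain L2 regularization over all parameters, the optimizer's built-in weight_decay argument is the more idiomatic route; a one-line sketch:

```python
optimizer = optim.Adam(model.parameters(), lr=0.0001, weight_decay=0.01)  # L2 on all params
```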
- Mixed-precision training:
```python
scaler = torch.cuda.amp.GradScaler()
# Inside the training loop:
optimizer.zero_grad()
with torch.cuda.amp.autocast():
    outputs = model(inputs)
    loss = criterion(outputs, labels)
scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)
scaler.update()
```
```python
# Multi-class evaluation: overall accuracy
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)  # index of the highest-scoring class
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f'Accuracy: {100 * correct / total}%')
```
```python
# Binary evaluation: precision and recall from confusion counts
model.eval()
TP, FP, TN, FN = 0, 0, 0, 0
with torch.no_grad():
    for images, labels in testloader:
        outputs = model(images).squeeze(1)  # [N, 1] -> [N] to match the labels
        preds = (torch.sigmoid(outputs) > 0.5).float()
        TP += ((preds == 1) & (labels == 1)).sum().item()
        FP += ((preds == 1) & (labels == 0)).sum().item()
        TN += ((preds == 0) & (labels == 0)).sum().item()
        FN += ((preds == 0) & (labels == 1)).sum().item()
precision = TP / (TP + FP + 1e-8)
recall = TP / (TP + FN + 1e-8)
print(f'Precision: {precision:.4f}, Recall: {recall:.4f}')
```
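If a single summary number is needed, the F1 score (the harmonic mean of the two; my addition) follows directly:

```python
f1 = 2 * precision * recall / (precision + recall + 1e-8)
print(f'F1: {f1:.4f}')
```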
```python
# Cosine-annealing learning-rate schedule
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=10, eta_min=1e-6)
for epoch in range(100):
    train(...)  # one epoch of training
    scheduler.step()
```
In real projects, using cross-entropy loss correctly can often decide a model's final performance. In my experience, a few points deserve special attention:
- Always switch to model.eval() and wrap evaluation in torch.no_grad().
- For imbalanced data, prefer class weights (the weight / pos_weight arguments) over oversampling.