Today, in 2025, large language models (LLMs) are already capable of solving complex mathematical problems. This ability is not innate; it comes from dedicated post-training techniques. This article walks through how to fine-tune the Qwen3-1.7B-Base model with the Unsloth library and GRPO (Group Relative Policy Optimization) so that it acquires mathematical reasoning ability.
Tip: GRPO is an advanced reinforcement learning technique. Compared with traditional PPO, it removes the need for a value model, which significantly lowers compute and memory requirements.
A base model is the raw LLM trained on massive amounts of text; its core skill is predicting the next token. It is knowledgeable but cannot hold a conversation or follow instructions, much like an untamed knowledge engine.
A chat/instruct model is a base model that has gone through a second training stage, typically supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), so that it can follow user instructions in dialogue form and exhibit a specific "personality" and response style.
GRPO is an improvement on PPO (Proximal Policy Optimization). Its core innovations are:
Generate a group of outputs: the model produces several response variants for each prompt.
Compute rewards: each response is scored by the reward functions.
Estimate advantages from group statistics: the relative advantage of each response is computed as:
Â_{i,t} = (r_i - mean(r)) / std(r)
This removes the need for a value model entirely, which greatly reduces memory usage and computational complexity.
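To make the group-relative advantage concrete, here is a minimal sketch (illustrative only, not Unsloth/TRL internals) that normalizes a group of rewards into advantages exactly as in the formula above:
import numpy as np

def group_relative_advantages(rewards):
    # rewards: scores for the group of responses sampled from one prompt
    rewards = np.asarray(rewards, dtype=np.float32)
    # Normalize within the group: subtract the group mean and divide by the
    # group std (a small epsilon guards against all rewards being identical).
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: four responses to the same prompt, scored by a reward function
print(group_relative_advantages([3.0, -1.0, 5.0, 0.5]))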
For a standard Python environment:
pip install unsloth vllm
For Google Colab:
pip install --no-deps unsloth vllm==0.8.5.post1
pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
Load the base model with Unsloth's FastLanguageModel and configure LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
lora_rank = 32
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Qwen3-1.7B-Base",
max_seq_length = max_seq_length,
load_in_4bit = False, # False for LoRA 16bit
fast_inference = True, # Enable vLLM fast inference
max_lora_rank = lora_rank,
gpu_memory_utilization = 0.7, # Reduce if out of memory
)
model = FastLanguageModel.get_peft_model(
model,
r = lora_rank,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha = lora_rank * 2,
use_gradient_checkpointing = "unsloth",
random_state = 3407,
)
Note: setting lora_alpha to twice lora_rank speeds up training; this is an optimization trick recommended by Unsloth.
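As a quick sanity check on the LoRA setup, you can print how many parameters are actually trainable; get_peft_model returns a PEFT-wrapped model, so the standard print_trainable_parameters() helper should be available (a small check, assuming the usual PEFT interface):
# Only the LoRA adapter weights should be trainable; the 1.7B base weights stay frozen.
model.print_trainable_parameters()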
To guide the model to show its reasoning, we define special markers that structure its output:
reasoning_start = "<think>"
reasoning_end = "</think>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"
The expected output format:
<think>...reasoning process...</think><SOLUTION>...final answer...</SOLUTION>
system_prompt = f"""
You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start} and {solution_end}
"""
chat_template = """
{% if messages[0]['role'] == 'system' %}
{{ messages[0]['content'] + eos_token }}
{% set loop_messages = messages[1:] %}
{% else %}
{{ '{system_prompt}' + eos_token }}
{% set loop_messages = messages %}
{% endif %}
{% for message in loop_messages %}
{% if message['role'] == 'user' %}
{{ message['content'] }}
{% elif message['role'] == 'assistant' %}
{{ message['content'] + eos_token }}
{% endif %}
{% endfor %}
{% if add_generation_prompt %}{{ '{reasoning_start}' }}
{% endif %}
"""
chat_template = chat_template.replace("'{system_prompt}'", f"'{system_prompt}'")
chat_template = chat_template.replace("'{reasoning_start}'", f"'{reasoning_start}'")
tokenizer.chat_template = chat_template
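To verify what the template produces, you can render a toy conversation; the exact output depends on the template above, but it should end with the opening <think> tag so that generation starts inside the reasoning block (a quick check, not part of the training code):
sample_messages = [
    {"role": "user", "content": "What is 2 + 2?"},
]
# With add_generation_prompt=True, the rendered string is:
# system prompt + EOS + question + reasoning_start ("<think>")
print(tokenizer.apply_chat_template(
    sample_messages, add_generation_prompt=True, tokenize=False))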
For supervised fine-tuning we use a small subset of NVIDIA's OpenMathReasoning dataset:
from datasets import load_dataset
import pandas as pd
import numpy as np
dataset = load_dataset("unsloth/OpenMathReasoning-mini", split="cot")
dataset = dataset.to_pandas()[["expected_answer", "problem", "generated_solution"]]
is_number = pd.to_numeric(pd.Series(dataset["expected_answer"]), errors="coerce").notnull()
dataset = dataset.iloc[np.where(is_number)[0]]
def format_sft_dataset(x):
thoughts = x["generated_solution"].replace("<think>", "").replace("</think>", "").strip()
final_prompt = reasoning_start + thoughts + reasoning_end + solution_start + x["expected_answer"] + solution_end
return [
{"role": "system", "content": system_prompt},
{"role": "user", "content": x["problem"]},
{"role": "assistant", "content": final_prompt},
]
dataset["Messages"] = dataset.apply(format_sft_dataset, axis=1)
from datasets import Dataset
dataset["text"] = tokenizer.apply_chat_template(
dataset["Messages"].values.tolist(), tokenize=False)
dataset = Dataset.from_pandas(dataset)
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
args = SFTConfig(
dataset_text_field = "text",
per_device_train_batch_size = 1,
gradient_accumulation_steps = 1,
warmup_steps = 5,
num_train_epochs = 2,
learning_rate = 2e-4,
logging_steps = 5,
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 3407,
report_to = "none",
),
)
trainer.train()
The core of GRPO training is designing sensible reward functions. We use four key functions:
import re
solution_end_regex = r"</SOLUTION>[\s]{0,}" + "(?:" + re.escape(tokenizer.eos_token) + ")?"
match_format = re.compile(rf"{reasoning_end}.*?{solution_start}(.+?){solution_end_regex}",
flags=re.MULTILINE | re.DOTALL)
def match_format_exactly(completions, **kwargs):
scores = []
for completion in completions:
score = 0
response = completion[0]["content"]
if match_format.search(response) is not None:
score += 3.0
scores.append(score)
return scores
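You can sanity-check the format reward by calling it on a hand-written completion in the same nested structure that GRPOTrainer passes to reward functions (a small illustrative test, not part of the original notebook):
sample_completion = [[{
    "role": "assistant",
    "content": reasoning_start + "10^2 = 100, so sqrt(101) is slightly above 10."
               + reasoning_end + solution_start + "10.05" + solution_end,
}]]
# A well-formed response earns the full 3.0 format reward.
print(match_format_exactly(completions=sample_completion))  # expected: [3.0]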
def match_format_approximately(completions, **kwargs):
scores = []
for completion in completions:
score = 0
response = completion[0]["content"]
score += 0.5 if response.count(reasoning_end) == 1 else -1.0
score += 0.5 if response.count(solution_start) == 1 else -1.0
score += 0.5 if response.count(solution_end) == 1 else -1.0
scores.append(score)
return scores
def check_answer(prompts, completions, answer, **kwargs):
question = prompts[0][-1]["content"]
responses = [completion[0]["content"] for completion in completions]
extracted_responses = [
guess.group(1) if (guess := match_format.search(r)) is not None else None
for r in responses
]
scores = []
for guess, true_answer in zip(extracted_responses, answer):
score = 0
if guess is None:
scores.append(-2.0)
continue
if guess == true_answer:
score += 5.0
elif guess.strip() == true_answer.strip():
score += 3.5
else:
try:
ratio = float(guess) / float(true_answer)
if ratio >= 0.9 and ratio <= 1.1:
score += 2.0
elif ratio >= 0.8 and ratio <= 1.2:
score += 1.5
else:
score -= 2.5
except:
score -= 4.5
scores.append(score)
return scores
match_numbers = re.compile(
solution_start + r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
flags=re.MULTILINE | re.DOTALL
)
def check_numbers(prompts, completions, answer, **kwargs):
question = prompts[0][-1]["content"]
responses = [completion[0]["content"] for completion in completions]
extracted_responses = [
guess.group(1) if (guess := match_numbers.search(r)) is not None else None
for r in responses
]
scores = []
for guess, true_answer in zip(extracted_responses, answer):
if guess is None:
scores.append(-2.5)
continue
try:
true_num = float(true_answer.strip())
guess_num = float(guess.strip().replace(",", ""))
scores.append(3.5 if guess_num == true_num else -1.5)
except:
scores.append(0)
return scores
from trl import GRPOConfig, GRPOTrainer
from vllm import SamplingParams
vllm_sampling_params = SamplingParams(
min_p = 0.1,
top_p = 1.0,
top_k = -1,
seed = 3407,
stop = [tokenizer.eos_token],
include_stop_str_in_output = True,
)
training_args = GRPOConfig(
vllm_sampling_params = vllm_sampling_params,
temperature = 1.0,
learning_rate = 5e-6,
weight_decay = 0.01,
warmup_ratio = 0.1,
lr_scheduler_type = "linear",
optim = "adamw_8bit",
logging_steps = 1,
per_device_train_batch_size = 1,
gradient_accumulation_steps = 1,
num_generations = 4,
max_prompt_length = max_seq_length,
max_completion_length = max_seq_length,
max_steps = 100,
save_steps = 100,
report_to = "none",
output_dir = "outputs",
)
trainer = GRPOTrainer(
model = model,
processing_class = tokenizer,
reward_funcs = [
match_format_exactly,
match_format_approximately,
check_answer,
check_numbers,
],
args = training_args,
train_dataset = dataset,
)
trainer.train()
# Save the LoRA adapter
model.save_lora("grpo_saved_lora")
# Prepare the inference input
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": "What is the sqrt of 101?"},
]
text = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True,
tokenize = False,
)
# Sampling settings for inference (SamplingParams was imported from vllm above; these values are illustrative)
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 1024,
)
# Generate a response with the trained LoRA adapter loaded
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text
print(output)
Unsloth supports several save formats:
# Merge to 16-bit
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
# Merge to 4-bit
model.save_pretrained_merged("model", tokenizer, save_method="merged_4bit")
# Save only the LoRA adapters (standard PEFT-style save)
model.save_pretrained("lora_adapters")
tokenizer.save_pretrained("lora_adapters")
# Save as 8-bit Q8_0 GGUF
model.save_pretrained_gguf("model", tokenizer)
# Save as q4_k_m GGUF
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")
In practice, I have found the following lessons most important:
Memory management: GRPO generates multiple responses per prompt, so VRAM usage is high. If you hit OOM errors, try lowering num_generations, max_prompt_length / max_completion_length, or gpu_memory_utilization (see the sketch after this list).
Reward function design: the reward functions must be carefully balanced, weighing format rewards (did the response follow the <think>/<SOLUTION> structure?) against correctness rewards (is the extracted answer right?), so the model cannot game one at the expense of the other.
Learning rate: GRPO typically needs a lower learning rate than SFT (e.g. 5e-6); too high a learning rate can make training unstable.
Temperature: early in training, a higher temperature (1.0) encourages exploration; later it can be lowered (e.g. 0.7) to focus on high-quality responses.
Dataset quality: make sure each problem is paired with the correct answer; mislabeled examples will mislead the model during training.
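To illustrate the memory-management tip above, here is what a lower-memory GRPOConfig might look like; the specific values are assumptions to adapt to your GPU, not settings from the original notebook:
# Illustrative low-memory settings (adjust to your hardware)
training_args = GRPOConfig(
    num_generations = 2,              # fewer responses per prompt than the 4 used above
    max_prompt_length = 512,          # shorter prompts
    max_completion_length = 1024,     # shorter completions
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 4,  # recover effective batch size without extra VRAM
    output_dir = "outputs",
    report_to = "none",
)
# When loading the model, lowering gpu_memory_utilization (e.g. from 0.7 to 0.5) also helps.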
A model trained this way shows a clear improvement on mathematical reasoning tasks. A typical successful response looks like this:
<think>
To find the square root of 101:
1. We know that 10^2 = 100
2. And 10.05^2 ≈ 101
3. More precisely, √101 ≈ 10.0498756
</think>
<SOLUTION>10.0498756</SOLUTION>
This structured reasoning process not only improves answer accuracy, it also makes the model's thought process more transparent and interpretable.