Hands-On: Building a StarCraft II Agent with the PPO Algorithm

DR阿福

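1. Project Overview: A PPO-Based StarCraft II Agent

In reinforcement learning, StarCraft II has long been regarded as one of the most challenging test environments. This project shows how to use the PPO (Proximal Policy Optimization) algorithm to train a StarCraft II agent that makes its own decisions. Unlike a traditional scripted AI, the goal is for the agent to learn game strategy by interacting with the environment.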

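The project is split into two processes, an upper machine (the controller) and a lower machine (the game client):

The upper machine handles decision logic (the PPO implementation)
The lower machine handles interaction with the game environment
The two processes communicate through the file transaction.pkl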
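
As a rough illustration of the file-based exchange, the sketch below writes and reads a transaction dictionary with pickle. The helper names write_transaction and read_transaction are hypothetical, and writing to a temporary file before renaming it is an added safeguard, not something the project is known to do; it is there so that the reading process never sees a half-written file.

```python
import os
import pickle
import tempfile


def write_transaction(path, transaction):
    """Write the dict to a temporary file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(transaction, f)
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_transaction(path):
    """Load the most recently written transaction dict."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

On Windows, os.replace can raise PermissionError while the other process still has the file open, so a short retry around the call may be needed there.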

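This design has three main advantages:

Observable training: you can watch the agent learn in real time
Algorithm and environment are decoupled: swapping in a different reinforcement learning algorithm is easy
Efficient resource use: because the learner and the game are separate processes, several CPU cores can be used for parallel training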

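2. Environment Setup and Core Components

2.1 The Gymnasium Environment Interface

Gymnasium, the maintained successor to OpenAI Gym, provides a more complete interface for reinforcement learning environments. Its core API can be summarized as "four functions plus two attributes":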

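Core interaction functions

reset(): initializes the environment state
- Key parameter: seed, used to make experiments reproducible
- Returns: the initial observation and an auxiliary info dictionary
step(action): executes an action and returns the environment's feedback
- Returns a five-tuple: (observation, reward, terminated, truncated, info)
- terminated means the episode reached a terminal state of the task itself (for example a win or a loss)
- truncated means the episode was cut off from outside (for example by a time limit)
render(): visualizes the environment
- Supports several modes, such as human (on-screen window) and rgb_array (returns image frames); the mode is chosen through render_mode when the environment is created
close(): releases resources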

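Core space attributes

action_space: defines the action space
- Discrete actions: spaces.Discrete(n)
- Continuous actions: spaces.Box(low, high, shape)
observation_space: defines the observation space
- Image observations: spaces.Box(0, 255, (h,w,c), np.uint8)
- Vector observations: spaces.Box(-inf, inf, (n,))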
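
To make the five-tuple concrete, here is a minimal sketch of an episode loop. CountdownEnv is a hypothetical stand-in written in plain Python that only mimics the reset/step signatures described above; it does not import Gymnasium, and a real environment would subclass gymnasium.Env and declare its spaces.

```python
class CountdownEnv:
    """Hypothetical toy environment that mimics Gymnasium's reset/step shape."""

    def __init__(self, start=3, max_steps=10):
        self.start = start
        self.max_steps = max_steps

    def reset(self, seed=None, options=None):
        # seed and options are accepted only to match the Gymnasium signature.
        self.value = self.start
        self.steps = 0
        return self.value, {}  # (observation, info)

    def step(self, action):
        self.steps += 1
        self.value -= action          # action 1 counts down, action 0 waits
        terminated = self.value <= 0  # the task itself is finished
        # A time limit cuts the episode off without the task being finished.
        truncated = not terminated and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.value, reward, terminated, truncated, {}


def run_episode(env, policy, seed=None):
    """Roll out one episode and report the return and how the episode ended."""
    obs, info = env.reset(seed=seed)
    total_reward = 0.0
    while True:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total_reward += reward
        if terminated or truncated:
            return total_reward, terminated, truncated
```

Checking terminated and truncated separately matters for learning: only a true terminal state should stop value bootstrapping, while a truncated episode simply ran out of time.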

2.2 Wrapping StarCraft II as an Environment

Our custom StarCraft2Env class inherits from gym.Env. The key parts are:

import pickle

import gymnasium as gym
import numpy as np
from gymnasium import spaces


class StarCraft2Env(gym.Env):
    def __init__(self):
        super().__init__()
        # Observation: a 224x224 RGB image (must match the array built in reset)
        self.observation_space = spaces.Box(low=0, high=255,
                                            shape=(224, 224, 3),
                                            dtype=np.uint8)
        # Six discrete actions
        self.action_space = spaces.Discrete(6)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # Start from an empty map image
        initial_map = np.zeros((224, 224, 3), dtype=np.uint8)
        transaction = {
            'observation': initial_map,
            'reward': 0,
            'action': None,
            'terminated': False,
            'truncated': False
        }
        # Save the initial state with pickle so the lower machine can read it
        with open('transaction.pkl', 'wb') as f:
            pickle.dump(transaction, f)
        return initial_map, {}
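
The class above does not show step(). One way it could exchange data with the lower machine is sketched below as standalone functions. The handshake is an assumption, not taken from the project: the upper machine writes an action, and the bot sets 'action' back to None once it has executed it and stored the new observation and reward. All function names are hypothetical, and fake_bot is only a thread standing in for the game.

```python
import os
import pickle
import tempfile
import threading
import time


def load_transaction(path, retries=100, delay=0.01):
    """Read the transaction dict, retrying while the other side is mid-write."""
    for _ in range(retries):
        try:
            with open(path, "rb") as f:
                return pickle.load(f)
        except (EOFError, pickle.UnpicklingError):
            time.sleep(delay)
    raise TimeoutError(f"could not read {path}")


def save_transaction(path, transaction):
    with open(path, "wb") as f:
        pickle.dump(transaction, f)


def exchange_step(path, action, poll=0.01, timeout=5.0):
    """Publish an action, then wait until the bot answers with a new state."""
    transaction = load_transaction(path)
    transaction["action"] = action
    save_transaction(path, transaction)

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        transaction = load_transaction(path)
        if transaction["action"] is None:  # assumed signal: action consumed
            return (transaction["observation"], transaction["reward"],
                    transaction["terminated"], transaction["truncated"], {})
        time.sleep(poll)
    raise TimeoutError("bot did not respond in time")


def demo():
    """Run one exchange against a fake bot thread standing in for the game."""
    path = os.path.join(tempfile.mkdtemp(), "transaction.pkl")
    save_transaction(path, {"observation": 0, "reward": 0, "action": None,
                            "terminated": False, "truncated": False})

    def fake_bot():
        while True:
            transaction = load_transaction(path)
            if transaction["action"] is not None:
                transaction.update(observation=transaction["action"] * 10,
                                   reward=1, action=None)
                save_transaction(path, transaction)
                return
            time.sleep(0.005)

    worker = threading.Thread(target=fake_bot, daemon=True)
    worker.start()
    result = exchange_step(path, 4)
    worker.join()
    return result
```

A step() method on StarCraft2Env could then simply return exchange_step('transaction.pkl', action).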

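3. Lower-Machine Implementation Details

3.1 Basic Construction Logic

The core of the lower machine is the WorkerRushBot class, which inherits from sc2.BotAI. Construction follows a fixed priority order (a chain of checks, not an actual queue data structure):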

async def on_step(self, iteration: int):
    # Read the command written by the upper machine
    with open('transaction.pkl', 'rb') as f:
        action = pickle.load(f)['action']

    if action == 0:  # basic construction
        have_built = False

        # 1. First priority: add a Pylon before supply runs out
        if self.supply_left < 4:
            if self.can_afford(UnitTypeId.PYLON):
                await self.build(UnitTypeId.PYLON, near=self.townhalls.first)
                have_built = True

        # 2. Otherwise train Probes (the economic base)
        if not have_built:
            for nexus in self.townhalls:
                if len(self.workers.closer_than(10, nexus)) < 22:
                    # Only queue a Probe if the Nexus is free and we can pay for it
                    if nexus.is_idle and self.can_afford(UnitTypeId.PROBE):
                        nexus.train(UnitTypeId.PROBE)
                        have_built = True
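
The priority order can be restated as a plain function that is easy to test outside the game. This is only a sketch: choose_build and its parameters are hypothetical names, the thresholds 4 and 22 come from the code above, and the mineral costs are the standard in-game prices for a Pylon and a Probe.

```python
PYLON_COST = 100  # minerals
PROBE_COST = 50   # minerals


def choose_build(supply_left, minerals, workers_nearby, nexus_idle,
                 min_supply=4, max_workers=22):
    """Return 'pylon', 'probe', or None, following the priority order."""
    # 1. Supply comes first: a supply block stops all production.
    if supply_left < min_supply and minerals >= PYLON_COST:
        return "pylon"
    # 2. Otherwise keep the economy growing until the base is saturated.
    if workers_nearby < max_workers and nexus_idle and minerals >= PROBE_COST:
        return "probe"
    return None
```

Note that when supply is low but a Pylon is not affordable, the function falls through to the Probe check, which matches the behavior of have_built in the bot code.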

3.2 Technology Strategy

The tech tree is built in stages:

Gateway: produces basic military units
Cybernetics Core: unlocks advanced units
Stargate: produces Void Rays

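if action == 1:  # tech development
    # Cap the total number of Stargates at 4
    max_stargates = 4

    for nexus in self.townhalls:
        gateways = self.structures(UnitTypeId.GATEWAY)
        cores = self.structures(UnitTypeId.CYBERNETICSCORE)

        # One Gateway per base
        if not gateways.closer_than(10, nexus):
            if self.can_afford(UnitTypeId.GATEWAY):
                await self.build(UnitTypeId.GATEWAY, near=nexus)

        # One Cybernetics Core per base (requires a finished Gateway)
        if not cores.closer_than(10, nexus):
            if gateways.ready and self.can_afford(UnitTypeId.CYBERNETICSCORE):
                await self.build(UnitTypeId.CYBERNETICSCORE, near=nexus)

        # At most 4 Stargates overall (requires a finished Cybernetics Core)
        if self.structures(UnitTypeId.STARGATE).amount < max_stargates:
            if cores.ready and self.can_afford(UnitTypeId.STARGATE):
                await self.build(UnitTypeId.STARGATE, near=nexus)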
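
The staged order follows from the Protoss prerequisite chain: a Gateway needs a Pylon, a Cybernetics Core needs a Gateway, and a Stargate needs a Cybernetics Core. The sketch below captures that chain in plain Python. It is a simplification under stated assumptions: building counts are global rather than per base, and next_tech_building is a hypothetical helper, not part of the bot.

```python
# Prerequisite for each building used in this section (None = no prerequisite).
REQUIRES = {
    "PYLON": None,
    "GATEWAY": "PYLON",
    "CYBERNETICSCORE": "GATEWAY",
    "STARGATE": "CYBERNETICSCORE",
}
BUILD_ORDER = ["PYLON", "GATEWAY", "CYBERNETICSCORE", "STARGATE"]


def next_tech_building(ready_counts, max_stargates=4):
    """Pick the first building still needed whose prerequisite is finished.

    ready_counts maps a building name to the number of finished buildings.
    """
    for name in BUILD_ORDER:
        limit = max_stargates if name == "STARGATE" else 1
        prerequisite = REQUIRES[name]
        has_prerequisite = (prerequisite is None
                            or ready_counts.get(prerequisite, 0) > 0)
        if ready_counts.get(name, 0) < limit and has_prerequisite:
            return name
    return None
```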

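3.3 Producing Military Units

The Void Ray is the main combat unit because it:

Can attack both air and ground targets
Is highly mobile
Fights well in groups

Production logic: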

if action == 2:  # train Void Rays
    for sg in self.structures(UnitTypeId.STARGATE).ready.idle:
        if self.can_afford(UnitTypeId.VOIDRAY):
            sg.train(UnitTypeId.VOIDRAY)

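4. Combat System

4.1 Scouting

Scouting is a key way to gather information in StarCraft. Our implementation:

Limits scouting frequency (at most once every 100 on_step iterations)
Prefers idle Probes
Sends the scout to the enemy start location automatically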

if action == 3:  # scout
    # self.last_sent must be initialized (for example to 0) before the first call
    if (iteration - self.last_sent) > 100:
        try:
            # Prefer an idle Probe
            if self.units(UnitTypeId.PROBE).idle.exists:
                probe = random.choice(self.units(UnitTypeId.PROBE).idle)
            else:
                probe = random.choice(self.units(UnitTypeId.PROBE))

            probe.attack(self.enemy_start_locations[0])
            self.last_sent = iteration
        except Exception:
            pass  # tolerate failure, e.g. no Probe left to send

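4.2 Attack Strategy

Targets are chosen in priority order:

Enemy units nearby
Enemy structures nearby
Any known enemy unit, chosen at random
Any known enemy structure, chosen at random
The enemy start location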

if action == 4:  # attack
    for voidray in self.units(UnitTypeId.VOIDRAY).idle:
        if self.enemy_units.closer_than(10, voidray):
            voidray.attack(random.choice(self.enemy_units.closer_than(10, voidray)))
        elif self.enemy_structures.closer_than(10, voidray):
            voidray.attack(random.choice(self.enemy_structures.closer_than(10, voidray)))
        elif self.enemy_units:
            voidray.attack(random.choice(self.enemy_units))
        elif self.enemy_structures:
            voidray.attack(random.choice(self.enemy_structures))
        else:
            voidray.attack(self.enemy_start_locations[0])
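
The same fallback chain can be written as a standalone function over plain coordinates, which makes the priority order easy to check without running the game. This is a sketch: pick_target is a hypothetical helper, positions are simple (x, y) tuples instead of python-sc2 unit objects, and the radius of 10 is taken from the code above.

```python
import math
import random


def pick_target(attacker, enemy_units, enemy_structures, enemy_start,
                radius=10.0, rng=random):
    """Return the position to attack, trying each priority level in turn."""

    def nearby(positions):
        # Keep only positions strictly closer than radius to the attacker.
        return [p for p in positions if math.dist(attacker, p) < radius]

    for candidates in (nearby(enemy_units), nearby(enemy_structures),
                       enemy_units, enemy_structures):
        if candidates:
            return rng.choice(candidates)
    # Nothing is known about the enemy: head for their start location.
    return enemy_start
```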

4.3 Defense

Defensive structures are built in this order:

Build a Forge
Build Photon Cannons near the main base

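# Defense extension; nexus is assumed to come from an enclosing
# `for nexus in self.townhalls` loop
if not self.structures(UnitTypeId.FORGE):
    if self.can_afford(UnitTypeId.FORGE):
        await self.build(UnitTypeId.FORGE,
                         near=self.structures(UnitTypeId.PYLON).closest_to(nexus))

# Photon Cannons require a finished Forge
elif (self.structures(UnitTypeId.FORGE).ready
        and self.structures(UnitTypeId.PHOTONCANNON).amount < 3):
    if self.can_afford(UnitTypeId.PHOTONCANNON):
        await self.build(UnitTypeId.PHOTONCANNON, near=nexus)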

5. System Integration and Testing

5.1 Startup Procedure

Start the lower machine:

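from sc2 import maps
from sc2.data import Difficulty, Race
from sc2.main import run_game
from sc2.player import Bot, Computer

run_game(maps.get("2000AtmospheresAIE"),
         [Bot(Race.Protoss, WorkerRushBot()), Computer(Race.Zerg, Difficulty.Hard)],
         realtime=False)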

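Control loop on the upper machine:

env = StarCraft2Env()

for episode in range(100):
    obs, _ = env.reset()  # start every episode from a fresh state
    for step in range(200):
        # PPO decision; predict() is assumed to return (action, state)
        action, _ = model.predict(obs)
        obs, reward, terminated, truncated, _ = env.step(action)

        if terminated or truncated:
            break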
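
The loop treats the PPO model as a black box. For orientation, the quantity PPO maximizes for each sample is the clipped surrogate min(r·A, clip(r, 1-ε, 1+ε)·A), where r is the probability ratio between the new and the old policy and A is the advantage estimate. The sketch below computes it for a single sample; the function name is hypothetical and ε = 0.2 is only a commonly used default, not a value taken from this project.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO-Clip objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to push the ratio outside
    # the [1 - eps, 1 + eps] band in the direction that improves the objective.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage the gain is capped once the ratio exceeds 1 + ε; with a negative advantage the penalty is not capped when the ratio grows, which is what keeps policy updates conservative.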

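5.2 Performance Tips

I/O: reduce how often the pickle file is read and written
- Use a file lock so that reads and writes do not collide
- Add a short sleep between polls (0.05 to 0.2 seconds)
Training speed:
- Set realtime=False so the game runs faster than real time
- Collect experience from several environments in parallel
Memory:
- Call env.close() to release resources when an environment is no longer needed
- Monitor memory usage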
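
One portable way to get the file lock mentioned above is a lock file created with O_EXCL, which fails if the file already exists. The sketch below is a minimal version under stated assumptions: file_lock and the lock path are hypothetical, both processes must agree to use the same lock path, and a crashed process can leave a stale lock file behind that has to be removed by hand.

```python
import contextlib
import os
import time


@contextlib.contextmanager
def file_lock(lock_path, poll=0.05, timeout=5.0):
    """Hold an exclusive lock by creating lock_path; remove it on exit."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)  # someone else holds the lock; wait and retry
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

Each side would wrap its pickle access in `with file_lock('transaction.lock'):` so that only one process touches transaction.pkl at a time.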

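6. Common Problems and Solutions

6.1 Actions Are Not Executed

Symptom: the action is written to transaction.pkl, but nothing happens in the game.

Troubleshooting steps:

Check file permissions
Confirm that both processes use compatible pickle protocol versions
Verify that both processes run in the same virtual environment

Solution: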

try:
    with open('transaction.pkl', 'wb') as f:
        pickle.dump(transaction, f, protocol=pickle.HIGHEST_PROTOCOL)
except Exception as e:
    print(f"Write failed: {e}")

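6.2 Game Out of Sync

Symptom: the upper and lower machines disagree about the current state.

Solution:

Add state validation
Implement heartbeat detection
Add timeout and retry logic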

def safe_step(self, action, max_retry=3):
    # Requires `import time` at module level
    for _ in range(max_retry):
        try:
            return self.step(action)
        except Exception as e:
            print(f"Retrying... {e}")
            time.sleep(0.1)
    raise RuntimeError("Exceeded the maximum number of retries")
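
Of the three measures listed above, heartbeat detection could be sketched as follows. This is an assumption-laden example, not the project's implementation: the bot touches a separate heartbeat file on every step, and the upper machine treats the bot as dead if that file has not been updated recently. The function names, the heartbeat file, and the 2-second limit are all hypothetical.

```python
import os
import time


def touch_heartbeat(path):
    """Called by the bot on every step to signal that it is still running."""
    with open(path, "a"):
        os.utime(path, None)  # set the modification time to now


def bot_is_alive(path, max_age=2.0, now=None):
    """True if the heartbeat file was updated within the last max_age seconds."""
    try:
        age = (time.time() if now is None else now) - os.path.getmtime(path)
    except FileNotFoundError:
        return False  # the bot has never written a heartbeat
    return age <= max_age
```

The upper machine could call bot_is_alive before each retry in safe_step and stop early instead of waiting for every retry to time out.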