PyTorch 2.2 Official Tutorial in Chinese (Part 7) (3) - Alibaba Cloud Developer Community

PyTorch 2.2 Official Tutorial in Chinese (Part 7) (2): https://developer.aliyun.com/article/1482517

Reinforcement Learning

Reinforcement Learning (DQN) Tutorial

Original: pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

Translator: 飞龙

License: CC BY-NC-SA 4.0

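Authors: Adam Paszke, Mark Towers

This tutorial shows how to use PyTorch to train a Deep Q-Learning (DQN) agent on the CartPole-v1 task from Gymnasium.

Task

The agent has to decide between two actions, moving the cart left or right, so that the pole attached to it stays upright. You can find more information about this environment, and about other more challenging ones, on Gymnasium's website.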

CartPole

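As the agent observes the current state of the environment and chooses an action, the environment transitions to a new state and returns a reward that indicates the consequences of the action. In this task, the reward is +1 for every incremental timestep, and the environment terminates if the pole falls over too far or the cart moves more than 2.4 units away from the center. This means that better-performing scenarios run for longer and accumulate a larger return.

The CartPole task is designed so that the inputs to the agent are 4 real values representing the environment state (position, velocity, etc.). We take these 4 inputs without any scaling and pass them through a small fully connected network with 2 outputs, one for each action. The network is trained to predict the expected value of each action given the input state. The action with the highest expected value is then chosen.

Packages

First, let's import the needed packages. We need gymnasium for the environment, installed with pip. It is a fork of the original OpenAI Gym project and has been maintained by the same team since Gym v0.19. If you are running this in Google Colab, run: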

%%bash
pip3 install gymnasium[classic_control]

We will also use the following from PyTorch:

neural networks (torch.nn)
optimization (torch.optim)
automatic differentiation (torch.autograd)

import gymnasium as gym
import math
import random
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple, deque
from itertools import count
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
env = gym.make("CartPole-v1")
# set up matplotlib
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display
plt.ion()
# if GPU is to be used
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

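Replay Memory

We will use experience replay memory to train our DQN. It stores the transitions that the agent observes, allowing us to reuse this data later. Because batches are sampled from it at random, the transitions that make up a batch are decorrelated. This has been shown to greatly stabilize and improve the DQN training procedure.

For this, we need two classes:

Transition - a named tuple representing a single transition in our environment. It essentially maps (state, action) pairs to their (next_state, reward) result, where the state is the 4-value observation vector described above.
ReplayMemory - a cyclic buffer of bounded size that holds the most recently observed transitions. It also implements a .sample() method for selecting a random batch of transitions for training.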

Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))
class ReplayMemory(object):
    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)
    def push(self, *args):
        """Save a transition"""
        self.memory.append(Transition(*args))
    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)
    def __len__(self):
        return len(self.memory)
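
To make the buffer's behaviour concrete, here is a minimal standalone sketch that uses only the standard library. Plain integers stand in for state tensors, and the capacity of 3 is an arbitrary value chosen for illustration; neither comes from the tutorial.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ("state", "action", "next_state", "reward"))

# Same idea as ReplayMemory above: a deque with maxlen acts as a cyclic buffer.
memory = deque([], maxlen=3)

# Push five toy transitions; plain integers stand in for state tensors.
for step in range(5):
    memory.append(Transition(step, 0, step + 1, 1.0))

# Only the three most recent transitions survive.
stored_states = [t.state for t in memory]
print(stored_states)  # [2, 3, 4]

# random.sample draws without replacement; the deque is copied into a list
# here for clarity (the class above passes the deque directly).
random.seed(0)
batch = random.sample(list(memory), 2)

# Transpose a list of Transitions into one Transition of tuples,
# the same zip(*...) trick that optimize_model relies on.
columns = Transition(*zip(*batch))
print(columns.reward)  # (1.0, 1.0)
```

Old transitions are evicted automatically once the capacity is reached, which is why the class needs no explicit index bookkeeping.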

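Now, let's define our model. But first, let's quickly recap what a DQN is.

DQN Algorithm

Our environment is deterministic, so for the sake of simplicity all equations presented here are also formulated deterministically. In the reinforcement learning literature, they would also contain expectations over stochastic transitions in the environment.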
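
As a reference point for the code further down, the quantities that optimize_model computes can be written as follows. This is a sketch in standard Q-learning notation: the symbols $Q_{\theta}$ (for policy_net), $Q_{\theta'}$ (for target_net), $\delta$ and $B$ are notation chosen here rather than taken from the text, and the Huber threshold of 1 assumes the default settings of nn.SmoothL1Loss.

```latex
% TD target; the max term is taken as 0 when s' is terminal
y = r + \gamma \max_{a'} Q_{\theta'}(s', a')

% TD error
\delta = Q_{\theta}(s, a) - y

% Huber loss averaged over a sampled batch B
\mathcal{L} = \frac{1}{|B|} \sum_{(s, a, s', r) \in B} \ell(\delta),
\qquad
\ell(\delta) =
\begin{cases}
\tfrac{1}{2}\delta^{2} & \text{if } |\delta| < 1 \\
|\delta| - \tfrac{1}{2} & \text{otherwise}
\end{cases}
```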

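select_action - selects an action according to an epsilon-greedy policy. Simply put, we sometimes use our model to choose the action, and sometimes we just sample one uniformly at random. The probability of choosing a random action starts at EPS_START and decays exponentially towards EPS_END. EPS_DECAY controls the rate of the decay.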
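
The decay schedule can be checked in isolation. The sketch below reuses the constants and the formula that appear in this tutorial's code; the helper name epsilon_at is hypothetical and exists only for illustration.

```python
import math

# Values copied from the hyperparameters used in this tutorial.
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 1000


def epsilon_at(step):
    """Probability of taking a random action after `step` calls (hypothetical helper)."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-1.0 * step / EPS_DECAY)


for step in (0, 1000, 5000):
    print(step, round(epsilon_at(step), 3))
# 0 0.9
# 1000 0.363
# 5000 0.056
```

After EPS_DECAY steps the exploration probability has covered roughly 63% of the distance from EPS_START to EPS_END, so a larger EPS_DECAY keeps the agent exploring for longer.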

class DQN(nn.Module):
    def __init__(self, n_observations, n_actions):
        super(DQN, self).__init__()
        self.layer1 = nn.Linear(n_observations, 128)
        self.layer2 = nn.Linear(128, 128)
        self.layer3 = nn.Linear(128, n_actions)
    # Called with either one element to determine next action, or a batch
    # during optimization. Returns tensor([[left0exp,right0exp]...]).
    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return self.layer3(x)

# BATCH_SIZE is the number of transitions sampled from the replay buffer
# GAMMA is the discount factor as mentioned in the previous section
# EPS_START is the starting value of epsilon
# EPS_END is the final value of epsilon
# EPS_DECAY controls the rate of exponential decay of epsilon, higher means a slower decay
# TAU is the update rate of the target network
# LR is the learning rate of the ``AdamW`` optimizer
BATCH_SIZE = 128
GAMMA = 0.99
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 1000
TAU = 0.005
LR = 1e-4
# Get number of actions from gym action space
n_actions = env.action_space.n
# Get the number of state observations
state, info = env.reset()
n_observations = len(state)
policy_net = DQN(n_observations, n_actions).to(device)
target_net = DQN(n_observations, n_actions).to(device)
target_net.load_state_dict(policy_net.state_dict())
optimizer = optim.AdamW(policy_net.parameters(), lr=LR, amsgrad=True)
memory = ReplayMemory(10000)
steps_done = 0
def select_action(state):
    global steps_done
    sample = random.random()
    eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        with torch.no_grad():
            # t.max(1) will return the largest column value of each row.
            # second column on max result is index of where max element was
            # found, so we pick action with the larger expected reward.
            return policy_net(state).max(1).indices.view(1, 1)
    else:
        return torch.tensor([[env.action_space.sample()]], device=device, dtype=torch.long)
episode_durations = []
def plot_durations(show_result=False):
    plt.figure(1)
    durations_t = torch.tensor(episode_durations, dtype=torch.float)
    if show_result:
        plt.title('Result')
    else:
        plt.clf()
        plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Duration')
    plt.plot(durations_t.numpy())
    # Take 100 episode averages and plot them too
    if len(durations_t) >= 100:
        means = durations_t.unfold(0, 100, 1).mean(1).view(-1)
        means = torch.cat((torch.zeros(99), means))
        plt.plot(means.numpy())
    plt.pause(0.001)  # pause a bit so that plots are updated
    if is_ipython:
        if not show_result:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        else:
            display.display(plt.gcf())

def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for
    # detailed explanation). This converts batch-array of Transitions
    # to Transition of batch-arrays.
    batch = Transition(*zip(*transitions))
    # Compute a mask of non-final states and concatenate the batch elements
    # (a final state would've been the one after which simulation ended)
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                          batch.next_state)), device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state
                                                if s is not None])
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)
    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken. These are the actions which would've been taken
    # for each batch state according to policy_net
    state_action_values = policy_net(state_batch).gather(1, action_batch)
    # Compute V(s_{t+1}) for all next states.
    # Expected values of actions for non_final_next_states are computed based
    # on the "older" target_net; selecting their best reward with max(1).values
    # This is merged based on the mask, such that we'll have either the expected
    # state value or 0 in case the state was final.
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    with torch.no_grad():
        next_state_values[non_final_mask] = target_net(non_final_next_states).max(1).values
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch
    # Compute Huber loss
    criterion = nn.SmoothL1Loss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))
    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    # In-place gradient clipping
    torch.nn.utils.clip_grad_value_(policy_net.parameters(), 100)
    optimizer.step()
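
The target and loss arithmetic inside optimize_model can be traced by hand on a toy batch. The following is a plain-Python sketch, not the tutorial's tensor code: the Q-values are made-up numbers, None marks a terminal transition, and the huber helper is a hypothetical stand-in for nn.SmoothL1Loss with its default settings.

```python
GAMMA = 0.99

# Toy batch: rewards and the best target-network Q-value of each next state.
# None marks a terminal transition (no next state), like the mask above.
rewards = [1.0, 1.0, 1.0]
next_state_best_q = [10.0, None, 2.5]  # made-up numbers for illustration
predicted_q = [10.4, 3.0, 3.475]       # made-up Q(s, a) from the policy network

targets = []
for reward, best_q in zip(rewards, next_state_best_q):
    next_value = 0.0 if best_q is None else best_q  # terminal states contribute 0
    targets.append(reward + GAMMA * next_value)


def huber(delta):
    """Huber loss with threshold 1: quadratic near zero, linear in the tails."""
    return 0.5 * delta ** 2 if abs(delta) < 1.0 else abs(delta) - 0.5


losses = [huber(q - y) for q, y in zip(predicted_q, targets)]
mean_loss = sum(losses) / len(losses)
print(targets, mean_loss)
```

The terminal transition gets a target equal to its reward alone, and the large error on it is penalised linearly rather than quadratically, which is what makes the Huber loss robust to outliers.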

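Below, you can find the main training loop. At the beginning, we reset the environment and obtain the initial state tensor. Then we sample an action, execute it, observe the next state and the reward (always 1), and optimize our model once. When the episode ends (our model fails), we restart the loop.

num_episodes is set to 600 if a GPU is available; otherwise 50 episodes are scheduled so that training does not take too long. However, 50 episodes are not enough to observe good performance on CartPole. You should see the model consistently reach 500 steps within 600 training episodes. Training RL agents can be a noisy process, so restarting training can produce better results if convergence is not observed.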

if torch.cuda.is_available():
    num_episodes = 600
else:
    num_episodes = 50
for i_episode in range(num_episodes):
    # Initialize the environment and get its state
    state, info = env.reset()
    state = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
    for t in count():
        action = select_action(state)
        observation, reward, terminated, truncated, _ = env.step(action.item())
        reward = torch.tensor([reward], device=device)
        done = terminated or truncated
        if terminated:
            next_state = None
        else:
            next_state = torch.tensor(observation, dtype=torch.float32, device=device).unsqueeze(0)
        # Store the transition in memory
        memory.push(state, action, next_state, reward)
        # Move to the next state
        state = next_state
        # Perform one step of the optimization (on the policy network)
        optimize_model()
        # Soft update of the target network's weights
        # θ′ ← τ θ + (1 −τ )θ′
        target_net_state_dict = target_net.state_dict()
        policy_net_state_dict = policy_net.state_dict()
        for key in policy_net_state_dict:
            target_net_state_dict[key] = policy_net_state_dict[key]*TAU + target_net_state_dict[key]*(1-TAU)
        target_net.load_state_dict(target_net_state_dict)
        if done:
            episode_durations.append(t + 1)
            plot_durations()
            break
print('Complete')
plot_durations(show_result=True)
plt.ioff()
plt.show()

/opt/conda/envs/py_3.10/lib/python3.10/site-packages/gymnasium/utils/passive_env_checker.py:249: DeprecationWarning:
`np.bool8` is a deprecated alias for `np.bool_`.  (Deprecated NumPy 1.24)
Complete

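Here is the diagram that illustrates the overall resulting data flow.

Actions are chosen either at random or according to the policy, and each one yields the next step sample from the gym environment. We record the results in the replay memory and run an optimization step on every iteration. Optimization picks a random batch from the replay memory to train the new policy. The "older" target_net is also used in optimization to compute the expected Q-values. A soft update of its weights is performed at every step.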
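
To get a feel for how slowly the target network follows the policy network, here is a scalar sketch of the soft update rule. It assumes a single weight per network and a policy weight that stays fixed, which is a simplification for illustration only; in real training the policy weights change at every step.

```python
TAU = 0.005  # same value as in the hyperparameters above

# One scalar "weight" per network keeps the arithmetic easy to follow.
policy_weight = 1.0  # held fixed here (assumption for illustration)
target_weight = 0.0

history = []
for step in range(1000):
    # Soft update: theta' <- tau * theta + (1 - tau) * theta'
    target_weight = TAU * policy_weight + (1.0 - TAU) * target_weight
    history.append(target_weight)

# With a fixed policy weight, the remaining gap shrinks by (1 - TAU) per step.
closed_form = 1.0 - (1.0 - TAU) ** 1000
print(history[0], history[-1], closed_form)
```

A single update moves the target by only 0.5% of the gap, yet after 1000 updates more than 99% of the gap is closed, so the target stays stable from step to step while still tracking the policy over time.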

Total running time of the script: (12 minutes 45.506 seconds)

Reinforcement Learning (PPO) with TorchRL Tutorial

Original: pytorch.org/tutorials/intermediate/reinforcement_ppo.html

Translator: 飞龙

License: CC BY-NC-SA 4.0

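Author: Vincent Moens

This tutorial demonstrates how to use PyTorch and torchrl to train a parametric policy network to solve the Inverted Pendulum task from the OpenAI-Gym/Farama-Gymnasium control library.

Inverted pendulum

Key learnings:

How to create an environment in TorchRL, transform its outputs, and collect data from this environment;
How to make your classes talk to each other using TensorDict;
The basics of building your training loop with TorchRL:

How to compute the advantage signal for policy gradient methods;
How to create a stochastic policy using a probabilistic neural network;
How to create a dynamic replay buffer and sample from it without repetition.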

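We will cover six crucial components of TorchRL: environments, transforms, models (policy and value function), loss modules, data collectors, and replay buffers.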

If you are running this in Google Colab, make sure you install the following dependencies:

!pip3 install torchrl
!pip3 install gym[mujoco]
!pip3 install tqdm

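Proximal Policy Optimization (PPO) is a policy-gradient algorithm in which a batch of data is collected and consumed directly to train the policy to maximize the expected return given some proximality constraints. You can think of it as a sophisticated version of REINFORCE, the foundational policy-optimization algorithm. For more information, see the Proximal Policy Optimization Algorithms paper.

PPO is usually regarded as a fast and efficient method for online, on-policy reinforcement learning. TorchRL provides a loss module that does all the work for you, so that you can rely on this implementation and focus on solving your problem rather than reinventing the wheel every time you want to train a policy.

For completeness, here is a brief overview of what the loss computes, even though this is taken care of by our ClipPPOLoss module. The algorithm works as follows:

1. Sample a batch of data by playing the policy in the environment for a given number of steps.
2. Perform a given number of optimization steps with random sub-samples of this batch, using a clipped version of the REINFORCE loss.
3. The clipping puts a pessimistic bound on the loss: lower return estimates are favored over higher ones.

The precise formula of the loss is: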
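
For reference, the clipped surrogate objective is commonly written as below. This is a sketch in the notation of the PPO paper: the symbols $\theta_k$ (parameters of the policy used for data collection), $A^{\pi_{\theta_k}}$ (the advantage estimate) and $\epsilon$ (the clipping threshold) are notation assumed here, and the expression is the quantity to maximize, so the loss is its negative.

```latex
L(s, a, \theta_k, \theta) = \min\left(
  \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_k}(a \mid s)} \, A^{\pi_{\theta_k}}(s, a),\;
  \operatorname{clip}\!\left(
    \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_k}(a \mid s)},\, 1 - \epsilon,\, 1 + \epsilon
  \right) A^{\pi_{\theta_k}}(s, a)
\right)
```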

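There are two components in that loss. In the first part of the minimum operator, we simply compute an importance-weighted version of the REINFORCE loss (that is, a REINFORCE loss corrected for the fact that the current policy configuration lags behind the one that was used for data collection). The second part of the minimum operator is a similar loss in which the ratio is clipped when it exceeds or falls below a given pair of thresholds.

This loss ensures that, whether the advantage is positive or negative, policy updates that would produce significant shifts from the previous configuration are discouraged.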
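
The effect of the minimum operator is easiest to see on single samples. The function below is a plain-Python sketch of the per-sample clipped objective described above. It is not TorchRL's ClipPPOLoss implementation; the name clipped_surrogate and its arguments are hypothetical.

```python
import math


def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip_epsilon=0.2):
    """Per-sample clipped objective (to be maximized); hypothetical helper."""
    ratio = math.exp(log_prob_new - log_prob_old)  # importance weight
    clipped_ratio = min(max(ratio, 1.0 - clip_epsilon), 1.0 + clip_epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * advantage.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=2.0)

# Ratio 0.5 with a negative advantage: the clipped branch (0.8 * advantage) is lower.
pessimistic = clipped_surrogate(math.log(0.5), 0.0, advantage=-2.0)

# Ratio 0.5 with a positive advantage: the raw term is already the lower one.
unclipped = clipped_surrogate(math.log(0.5), 0.0, advantage=2.0)

print(capped, pessimistic, unclipped)
```

In the first two cases the result no longer depends on the ratio, so the gradient with respect to the policy vanishes and the update gets no incentive to move further from the data-collecting policy.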

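This tutorial is structured as follows:

First, we will define a set of hyperparameters to use for training.
Next, we will focus on creating our environment, or simulator, using TorchRL's wrappers and transforms.
Next, we will design the policy network and the value model, which is indispensable to the loss function. These modules will be used to configure our loss module.
Next, we will create the replay buffer and data loader.
Finally, we will run our training loop and analyze the results.

Throughout this tutorial, we will be using the tensordict library. TensorDict is the lingua franca of TorchRL: it helps us abstract what a module reads and writes, caring less about the specific data description and more about the algorithm itself.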

from collections import defaultdict
import matplotlib.pyplot as plt
import torch
from tensordict.nn import TensorDictModule
from tensordict.nn.distributions import NormalParamExtractor
from torch import nn
from torchrl.collectors import SyncDataCollector
from torchrl.data.replay_buffers import ReplayBuffer
from torchrl.data.replay_buffers.samplers import SamplerWithoutReplacement
from torchrl.data.replay_buffers.storages import LazyTensorStorage
from torchrl.envs import (Compose, DoubleToFloat, ObservationNorm, StepCounter,
                          TransformedEnv)
from torchrl.envs.libs.gym import GymEnv
from torchrl.envs.utils import check_env_specs, set_exploration_mode
from torchrl.modules import ProbabilisticActor, TanhNormal, ValueOperator
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE
from tqdm import tqdm

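Define Hyperparameters

We set the hyperparameters for our algorithm. Depending on the resources available, one may choose to execute the policy on GPU or on another device. frame_skip controls for how many frames a single action is executed. The rest of the arguments that count frames must be corrected for this value (since one environment step actually returns frame_skip frames).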

device = "cpu" if not torch.cuda.is_available() else "cuda:0"
num_cells = 256  # number of cells in each layer i.e. output dim.
lr = 3e-4
max_grad_norm = 1.0

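Data collection parameters

When collecting data, we can choose how big each batch should be by defining a frames_per_batch parameter. We also define how many frames (that is, interactions with the simulator) we allow ourselves to use. In general, the goal of an RL algorithm is to learn to solve the task as fast as it can in terms of environment interactions: the lower the total_frames, the better. We also define a frame_skip: in some contexts, repeating the same action multiple times over the course of a trajectory can be beneficial, as it makes the behavior more consistent and less erratic. However, "skipping" too many frames hampers training by reducing the reactivity of the actor to observation changes.

When using frame_skip, it is good practice to correct the other frame counts by the number of frames we are grouping together. If we configure a total count of X frames for training but use a frame_skip of Y, we will actually collect X × Y frames in total, which exceeds our predefined budget.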

frame_skip = 1
frames_per_batch = 1000 // frame_skip
# For a complete training, bring the number of frames up to 1M
total_frames = 50_000 // frame_skip
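
The budget correction can be checked with a few lines of arithmetic. This is a small sketch; the helper corrected_counts and the frame_skip value of 4 are hypothetical and used only for illustration.

```python
def corrected_counts(frame_budget, batch_frames, frame_skip):
    """Divide frame counts by frame_skip so the budget is respected (hypothetical helper)."""
    return frame_budget // frame_skip, batch_frames // frame_skip


budget = 50_000
for frame_skip in (1, 4):
    total_frames, frames_per_batch = corrected_counts(budget, 1000, frame_skip)
    # Each environment step advances the simulator by frame_skip frames.
    simulated_with_correction = total_frames * frame_skip
    simulated_without_correction = budget * frame_skip
    print(frame_skip, total_frames, frames_per_batch,
          simulated_with_correction, simulated_without_correction)
# 1 50000 1000 50000 50000
# 4 12500 250 50000 200000
```

With frame_skip set to 4 and no correction, the simulator would run for 200,000 frames, four times the intended budget.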

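PPO parameters

At each data collection (or batch collection), we run the optimization over a certain number of epochs, each time consuming the entire data we just acquired in a nested training loop. Here, sub_batch_size is different from frames_per_batch above: recall that we are working with a "batch of data" coming from our collector, whose size is defined by frames_per_batch, and that we further split it into smaller sub-batches during the inner training loop. The size of these sub-batches is controlled by sub_batch_size.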

sub_batch_size = 64  # cardinality of the sub-samples gathered from the current data in the inner loop
num_epochs = 10  # optimization steps per batch of data collected
clip_epsilon = 0.2  # clip value for PPO loss: see the equation in the intro for more context.
gamma = 0.99
lmbda = 0.95
entropy_eps = 1e-4
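
To see how these sizes interact, the sketch below counts the gradient updates implied by the values above. It assumes the inner loop iterates frames_per_batch // sub_batch_size times per epoch, which drops the incomplete trailing sub-batch; the training loop itself is not shown in this part of the tutorial.

```python
frames_per_batch = 1000
total_frames = 50_000
sub_batch_size = 64
num_epochs = 10

# Outer loop: one iteration per batch delivered by the collector.
num_batches = total_frames // frames_per_batch

# Inner loop: each epoch walks over the batch in sub-batches. Integer division
# is an assumption here; it drops the incomplete trailing sub-batch.
sub_batches_per_epoch = frames_per_batch // sub_batch_size

updates_per_batch = num_epochs * sub_batches_per_epoch
total_updates = num_batches * updates_per_batch

print(num_batches, sub_batches_per_epoch, updates_per_batch, total_updates)
# 50 15 150 7500
```

Under that assumption, every collected batch of 1000 frames is reused for 150 optimizer steps before it is discarded.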

PyTorch 2.2 Official Tutorial in Chinese (Part 7) (4): https://developer.aliyun.com/article/1482524
