Direct Policy Optimization
Reinforcement learning algorithms that optimize the policy directly can be divided into two classes, depending on whether the policy used is deterministic or stochastic: deterministic policy search and stochastic policy search. Stochastic policy search algorithms include the policy gradient method and TRPO; deterministic policy search algorithms include DDPG.
A policy can be viewed as a mapping from the state space S to the action space A: S → A.
Direct policy optimization can follow the same idea as value function approximation: first fix the structure used to approximate the policy, then optimize that structure's parameters.
Below, a single neuron and a neural network are used in turn as the structure that approximates the policy, in order to control the cart in the inverted pendulum (CartPole) experiment; their coefficients are optimized by hill climbing and by gradient descent, respectively.
Controlling the inverted pendulum by directly optimizing the policy with a heuristic algorithm
The policy is parameterized by a single neuron whose connection weights and threshold are initialized randomly.
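As a rough idea of what this looks like in code, the sketch below shows one plausible form of the single-neuron policy together with an episode-evaluation helper named rewards_by_paras, which the hill-climbing snippet further down relies on. The function name matches that snippet, but the decision rule (weighted sum of the state compared against the threshold), the CartPole-v0 environment id, and the classic 4-tuple gym step API are assumptions rather than the author's original source.

import gym
import numpy as np

def rewards_by_paras(env, paras, max_steps=200):
    # Run one CartPole episode with a single-neuron policy and return the total reward.
    # paras[:4] are the connection weights for the 4 state variables, paras[4] is the threshold.
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        # The neuron "fires" (push right) when the weighted sum of the state exceeds the threshold
        action = 1 if np.dot(paras[:4], state) > paras[4] else 0
        state, reward, done, _ = env.step(action)
        total += reward
        if done:
            break
    return total

env = gym.make('CartPole-v0')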
The results are shown below.
Optimizing the neuron's connection weights and threshold by hill climbing
The idea of hill climbing is to probe the neighborhood of the current position and keep moving in the direction that improves the most; its drawback is that it easily gets stuck in a local optimum.
Parameterizing the policy with a neural network
The neural network has a single hidden layer of 10 neurons with the ReLU activation function; the output layer uses the Sigmoid activation function.
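A minimal sketch of such a policy network is given below, assuming PyTorch and a 4-dimensional CartPole state as input; the framework and the class name are illustrative choices, since the source does not specify them here.

import torch.nn as nn

class PolicyNet(nn.Module):
    # One hidden layer of 10 ReLU units; the Sigmoid output is the probability of pushing right
    def __init__(self, state_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 10),
            nn.ReLU(),
            nn.Linear(10, 1),
            nn.Sigmoid(),
        )

    def forward(self, state):
        return self.net(state)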
The training results are shown below.
Although the result is good, this approach of filtering a small number of high-quality samples out of a large number of trial trajectories is, in practice, not easy to apply in real-world problems.
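For concreteness, the sketch below shows one way such a scheme could be written: run many episodes with the current network, keep only the trajectories whose total reward exceeds a threshold, and fit the network to their state-action pairs by gradient descent. All specifics here (the reward threshold, the Bernoulli action sampling, the Adam optimizer and BCE loss) are assumptions for illustration, not the author's code.

import numpy as np
import torch

def train_on_elite_episodes(env, policy, episodes=100, elite_reward=150, epochs=5, lr=0.01):
    # Collect episodes, keep the high-reward ones, and fit the network to their (state, action) pairs
    states, actions = [], []
    for _ in range(episodes):
        ep_states, ep_actions, total = [], [], 0.0
        state, done = env.reset(), False
        while not done:
            prob = policy(torch.tensor(state, dtype=torch.float32)).item()
            action = 1 if np.random.rand() < prob else 0   # sample the action to keep trajectories diverse
            ep_states.append(state)
            ep_actions.append(action)
            state, reward, done, _ = env.step(action)
            total += reward
        if total >= elite_reward:                          # keep only the high-quality trajectories
            states.extend(ep_states)
            actions.extend(ep_actions)
    if not states:
        return
    x = torch.tensor(np.array(states), dtype=torch.float32)
    y = torch.tensor(actions, dtype=torch.float32).unsqueeze(1)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = torch.nn.BCELoss()
    for _ in range(epochs):                                # gradient descent on the elite samples
        optimizer.zero_grad()
        loss = loss_fn(policy(x), y)
        loss.backward()
        optimizer.step()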
Part of the code is shown below:
import numpy as np

delta = 0.01                      # probing step size used by hill climbing
top_rewards = 0
top_paras = None
for _ in range(100):              # restart hill climbing many times and keep the best run
    paras = np.random.rand(5)     # random initial connection weights and threshold of the neuron
    most_rewards = rewards_by_paras(env, paras)
    for i in range(200):
        best_paras = paras
        cur_rewards = most_rewards
        # Probe each of the 5 parameters in both directions and keep the best neighbour
        for j in range(5):
            for d in (delta, -delta):
                step = np.zeros(5)
                step[j] = d
                rewards = rewards_by_paras(env, paras + step)
                if rewards > most_rewards:
                    most_rewards = rewards
                    best_paras = paras + step
        if (cur_rewards == most_rewards) or (most_rewards >= 200):
            # No neighbour improves (a local peak), or the 200-step goal has been reached
            break
        else:
            paras = best_paras
    # print(most_rewards, paras)
    if most_rewards > top_rewards:
        top_rewards = most_rewards
        top_paras = paras
print(top_rewards, top_paras)