[5-Minute Paper] Prioritized Experience Replay

  • Paper title: Prioritized Experience Replay

What problem does it solve?

  Experience replay [1] lets a reinforcement learning agent reuse past experience. Before this paper, transitions were usually sampled uniformly at random from the replay memory. Some transitions, however, are more important than others, so uniform sampling is not ideal. The authors propose prioritizing experience replay so that important transitions are replayed more often, which improves learning efficiency.

  • Reference [1]: Lin, Long-Ji. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

Background

  Earlier approaches such as DQN sample some experience data from the replay memory, use it for a single model update, and then effectively discard it. A better sampling scheme should, first, break the correlations between consecutive data and, second, sample highly similar (redundant) transitions less often while replaying rare or surprising transitions more times.

  The authors drew inspiration from the following observation:

  Experiences with high magnitude TD error also appear to be replayed more often(Singer & Frank, 2009; McNamara et al., 2014).

  • Reference 1: Singer, Annabelle C and Frank, Loren M. Rewarded outcomes enhance reactivation of experience in the hippocampus. Neuron, 64(6):910–921, 2009.
  • Reference 2: McNamara, Colin G, Tejero-Cantero, Álvaro, Trouche, Stéphanie, Campo-Urriza, Natalia, and Dupret, David. Dopaminergic neurons promote hippocampal reactivation and spatial memory persistence. Nature Neuroscience, 2014.

  The TD error provides one way to measure these priorities (van Seijen & Sutton, 2013). Whereas that work concerns model-based prioritized sweeping, the authors apply the idea to model-free reinforcement learning.

  • van Seijen, Harm and Sutton, Richard. Planning by prioritized sweeping with small backups. In Proceedings of The 30th International Conference on Machine Learning, pp. 361–369, 2013.

  Before designing a replay memory we need to distinguish two design choices: which experiences to store, and which experiences to replay (and how to do so). This paper addresses only the latter.

What method is used?

   A core question in prioritized replay is how to select which transitions (s, a, r, s') to replay. The authors use the TD error to measure the importance of a transition, i.e. how far the value is from its next-step bootstrap estimate (Andre et al., 1998).

  • Reference: Andre, David, Friedman, Nir, and Parr, Ronald. Generalized prioritized sweeping. In Advances in Neural Information Processing Systems. Citeseer, 1998.
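  For reference (the text assumes it but never writes it out), a standard definition of the TD error for a DQN-style agent is:

$$\delta_i = r_i + \gamma \max_{a} Q_{\text{target}}(s_{i+1}, a) - Q(s_i, a_i)$$

  where $\gamma$ is the discount factor and $Q_{\text{target}}$ is the target network; the magnitude $|\delta_i|$ is what the priority is based on.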

  Greedy TD-error prioritization causes several problems: 1. transitions whose TD error is small when first seen may never be replayed; 2. the method is very sensitive to noise (e.g. from stochastic rewards or approximation error); 3. because the same high-TD-error transitions are replayed over and over, the network easily overfits to them.

  • Stochastic sampling method

  To address these issues, the authors propose a stochastic sampling method that interpolates between pure greedy prioritization and uniform random sampling, defining the probability of sampling transition $i$ as:

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$$

 where $p_i > 0$ is the priority of transition $i$, and the exponent $\alpha$ determines how much prioritization is used; $\alpha = 0$ recovers the uniform case.

  For the priority $p_i$ used in $P(i)$ above, the authors propose two variants:

$$p_i = |\delta_i| + \epsilon \quad \text{(proportional prioritization)}, \qquad p_i = \frac{1}{\text{rank}(i)} \quad \text{(rank-based prioritization)}$$

  where $\epsilon$ is a small positive constant and $\text{rank}(i)$ is the rank of transition $i$ when the replay memory is sorted according to $|\delta_i|$.
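  As a concrete illustration, here is a minimal NumPy sketch (not the authors' code) of how the two priority definitions turn TD errors into the sampling distribution $P(i)$; the names td_errors, ALPHA, and EPS are illustrative assumptions:

```python
import numpy as np

ALPHA = 0.6   # prioritization exponent; alpha = 0 recovers uniform sampling
EPS = 1e-6    # small constant so transitions with zero TD error can still be sampled

def proportional_priorities(td_errors):
    """Proportional variant: p_i = |delta_i| + eps."""
    return np.abs(td_errors) + EPS

def rank_based_priorities(td_errors):
    """Rank-based variant: p_i = 1 / rank(i), where rank 1 is the largest |delta_i|."""
    order = np.argsort(-np.abs(td_errors))            # indices sorted by decreasing |delta|
    ranks = np.empty(len(td_errors), dtype=np.float64)
    ranks[order] = np.arange(1, len(td_errors) + 1)   # rank of each transition
    return 1.0 / ranks

def sampling_probabilities(priorities, alpha=ALPHA):
    """P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = priorities ** alpha
    return scaled / scaled.sum()

# Example: draw a minibatch of 4 indices under the proportional scheme.
td_errors = np.array([0.1, 2.0, 0.0, 0.5, 1.2])
probs = sampling_probabilities(proportional_priorities(td_errors))
batch_idx = np.random.choice(len(td_errors), size=4, p=probs)
```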

  Implementation details for the two variants:

  • For the rank-based variant: the cumulative density function can be approximated with a piecewise linear function with k segments of equal probability. The segment boundaries can be precomputed (they depend only on N and α). At run time, a segment is chosen and then a transition is sampled uniformly from within it. Choosing k equal to the minibatch size and picking exactly one transition from each segment gives a form of stratified sampling that balances the minibatch. In other words: first partition the memory into segments, then sample uniformly within each one.

  • For the proportional variant: the paper implements it efficiently with a 'sum-tree' data structure, in which each leaf stores a transition's priority and each internal node stores the sum of its children, so both sampling and priority updates cost O(log N).

  • Annealing the bias (to reduce the bias introduced by prioritized sampling)

  Estimating the expected value with stochastic updates relies on those updates being drawn from the same distribution as the expectation. Prioritized replay introduces bias because it changes this distribution in an uncontrolled fashion, and therefore changes the solution the estimates converge to (even when the policy and state distribution are fixed). This bias can be corrected with importance-sampling (IS) weights:

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta}$$

where $\frac{1}{N}$ corresponds to the original uniform sampling distribution and $\frac{1}{P(i)}$ corrects for the prioritized distribution. Since the prioritization itself is a deliberate bias, the exponent $\beta$ controls how much of it is compensated: $\beta = 1$ fully corrects the non-uniform probabilities. Because an uncorrected bias can harm convergence, $\beta$ is annealed from its initial value towards 1 as training proceeds, so the updates become unbiased near convergence.
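  As a hedged sketch of this correction (again with illustrative names, not code from the paper), the IS weights and a linear β schedule could look like this; following the paper, the weights are normalized by the maximum weight so they only scale updates downward:

```python
import numpy as np

def is_weights(probs, batch_idx, beta):
    """w_i = (N * P(i))^(-beta), normalized by the largest possible weight in the buffer."""
    N = len(probs)
    w = (N * probs[batch_idx]) ** (-beta)
    w_max = (N * probs.min()) ** (-beta)   # the smallest P(i) yields the largest weight
    return w / w_max                       # normalized weights in (0, 1]

def anneal_beta(step, total_steps, beta_start=0.4):
    """Linearly anneal beta from beta_start up to 1 over the course of training."""
    frac = min(1.0, step / total_steps)
    return beta_start + frac * (1.0 - beta_start)
```

  In the Q-learning update, each transition's TD error $\delta_i$ is then multiplied by its weight, i.e. $w_i \delta_i$ is used in place of $\delta_i$.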

  The paper summarizes the full procedure as pseudocode (Algorithm 1, Double DQN with proportional prioritization).
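  The original pseudocode figure is not reproduced here, so the following is a simplified sketch in the spirit of that algorithm. The buffer below uses a flat priority array instead of the paper's sum-tree, and the training-loop comments use placeholder names (q_net, target_net, env) that are assumptions, not the authors' code:

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Minimal proportional prioritized replay buffer.
    Uses a flat priority array (O(N) sampling); the paper's sum-tree makes this O(log N)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data = [None] * capacity
        self.priorities = np.zeros(capacity)
        self.pos, self.size = 0, 0

    def add(self, transition):
        # New transitions get the current maximum priority so they are replayed at least once.
        max_p = self.priorities[:self.size].max() if self.size > 0 else 1.0
        self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, beta):
        p = self.priorities[:self.size] ** self.alpha
        probs = p / p.sum()
        idx = np.random.choice(self.size, batch_size, p=probs)
        weights = (self.size * probs[idx]) ** (-beta)
        weights /= weights.max()   # normalize by the batch maximum (a common simplification)
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        self.priorities[idx] = np.abs(td_errors) + self.eps

# Training-loop outline (placeholder names):
#   1. act in env, store (s, a, r, s_next, done) in the buffer;
#   2. every K steps, sample a minibatch with the current (annealed) beta;
#   3. compute TD errors delta_i with q_net / target_net and minimize the weighted loss
#      sum_i w_i * delta_i**2;
#   4. write the new priorities back with update_priorities(idx, delta);
#   5. periodically copy q_net's parameters into target_net.
```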

What results were obtained?

  The experiments show that both the rank-based and the proportional variants speed up convergence compared with uniform replay.

Publication and author information

  The paper was published at ICLR 2016. The first author, Tom Schaul, is a Senior Research Scientist at Google DeepMind; he was a postdoc at New York University from 2011 to 2013 and a PhD student at IDSIA from 2007 to 2011.


Further reading

  1. Some transitions may not be immediately useful to the agent, but might become so when the agent's competence increases (Schmidhuber, 1991).
  • Reference: Schmidhuber, Jürgen. Curious model-building control systems. In Neural Networks, 1991. 1991 IEEE International Joint Conference on, pp. 1458–1463. IEEE, 1991.
  2. TD errors have also been used to determine where to explore (White et al., 2014) or which features to select (Geramifard et al., 2011; Sun et al., 2011).
  • Reference 1: White, Adam, Modayil, Joseph, and Sutton, Richard S. Surprise and curiosity for big data robotics. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
  • Reference 2: Geramifard, Alborz, Doshi, Finale, Redding, Joshua, Roy, Nicholas, and How, Jonathan. Online discovery of feature dependencies. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 881–888, 2011.
  • Reference 3: Sun, Yi, Ring, Mark, Schmidhuber, Jürgen, and Gomez, Faustino J. Incremental basis construction from temporal difference error. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 481–488, 2011.


