[5-Minute Paper] Prioritized Experience Replay

  • Paper title: Prioritized Experience Replay

What problem does it solve?

  Experience replay [1] lets a reinforcement learning agent reuse past experience. Before this paper, transitions were usually sampled uniformly at random from the replay memory. Some transitions, however, are more important than others, so uniform sampling is not ideal. The authors propose prioritizing experience replay so that important transitions are replayed more often, which improves learning efficiency.

  • Reference [1]: Lin, Long-Ji. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

Background

  Earlier approaches such as DQN sample some experience data from the replay memory, use it for a single model update, and then effectively discard it. A better sampling scheme should, first, break the correlations between consecutive data and, second, sample highly similar (redundant) transitions less often while replaying rare or surprising transitions more times.

  The authors drew inspiration from the following observation:

  Experiences with high magnitude TD error also appear to be replayed more often(Singer & Frank, 2009; McNamara et al., 2014).

  • Reference 1: Singer, Annabelle C and Frank, Loren M. Rewarded outcomes enhance reactivation of experience in the hippocampus. Neuron, 64(6):910–921, 2009.
  • Reference 2: McNamara, Colin G, Tejero-Cantero, Álvaro, Trouche, Stéphanie, Campo-Urriza, Natalia, and Dupret, David. Dopaminergic neurons promote hippocampal reactivation and spatial memory persistence. Nature Neuroscience, 2014.

  The TD error provides one way to measure these priorities (van Seijen & Sutton, 2013). Whereas that work concerns model-based prioritized sweeping, the authors apply the idea to model-free reinforcement learning.

  • van Seijen, Harm and Sutton, Richard. Planning by prioritized sweeping with small backups. In Proceedings of The 30th International Conference on Machine Learning, pp. 361–369, 2013.

  Before designing a replay memory we need to distinguish two design choices: which experiences to store, and which experiences to replay (and how to do so). This paper addresses only the latter.

What method is used?

   A core question in prioritized replay is how to select which transitions (s, a, r, s') to replay. The authors use the TD error to measure the importance of a transition, i.e. how far the value is from its next-step bootstrap estimate (Andre et al., 1998).

  • Reference: Andre, David, Friedman, Nir, and Parr, Ronald. Generalized prioritized sweeping. In Advances in Neural Information Processing Systems. Citeseer, 1998.
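  For reference (the text assumes it but never writes it out), a standard definition of the TD error for a DQN-style agent is:

$$\delta_i = r_i + \gamma \max_{a} Q_{\text{target}}(s_{i+1}, a) - Q(s_i, a_i)$$

  where $\gamma$ is the discount factor and $Q_{\text{target}}$ is the target network; the magnitude $|\delta_i|$ is what the priority is based on.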

  Greedy TD-error prioritization causes several problems: 1. transitions whose TD error is small when first seen may never be replayed; 2. the method is very sensitive to noise (e.g. from stochastic rewards or approximation error); 3. because the same high-TD-error transitions are replayed over and over, the network easily overfits to them.

  • Stochastic sampling method

  To address these issues, the authors propose a stochastic sampling method that interpolates between pure greedy prioritization and uniform random sampling, defining the probability of sampling transition $i$ as:

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$$

 where $p_i > 0$ is the priority of transition $i$, and the exponent $\alpha$ determines how much prioritization is used; $\alpha = 0$ recovers the uniform case.

  For the priority $p_i$ used in $P(i)$ above, the authors propose two variants:

$$p_i = |\delta_i| + \epsilon \quad \text{(proportional prioritization)}, \qquad p_i = \frac{1}{\text{rank}(i)} \quad \text{(rank-based prioritization)}$$

  where $\epsilon$ is a small positive constant and $\text{rank}(i)$ is the rank of transition $i$ when the replay memory is sorted according to $|\delta_i|$.
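  As a concrete illustration, here is a minimal NumPy sketch (not the authors' code) of how the two priority definitions turn TD errors into the sampling distribution $P(i)$; the names td_errors, ALPHA, and EPS are illustrative assumptions:

```python
import numpy as np

ALPHA = 0.6   # prioritization exponent; alpha = 0 recovers uniform sampling
EPS = 1e-6    # small constant so transitions with zero TD error can still be sampled

def proportional_priorities(td_errors):
    """Proportional variant: p_i = |delta_i| + eps."""
    return np.abs(td_errors) + EPS

def rank_based_priorities(td_errors):
    """Rank-based variant: p_i = 1 / rank(i), where rank 1 is the largest |delta_i|."""
    order = np.argsort(-np.abs(td_errors))            # indices sorted by decreasing |delta|
    ranks = np.empty(len(td_errors), dtype=np.float64)
    ranks[order] = np.arange(1, len(td_errors) + 1)   # rank of each transition
    return 1.0 / ranks

def sampling_probabilities(priorities, alpha=ALPHA):
    """P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = priorities ** alpha
    return scaled / scaled.sum()

# Example: draw a minibatch of 4 indices under the proportional scheme.
td_errors = np.array([0.1, 2.0, 0.0, 0.5, 1.2])
probs = sampling_probabilities(proportional_priorities(td_errors))
batch_idx = np.random.choice(len(td_errors), size=4, p=probs)
```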

  Implementation details for the two variants:

  • For the rank-based variant: the cumulative density function can be approximated with a piecewise linear function with k segments of equal probability. The segment boundaries can be precomputed (they depend only on N and α). At run time, a segment is chosen and then a transition is sampled uniformly from within it. Choosing k equal to the minibatch size and picking exactly one transition from each segment gives a form of stratified sampling that balances the minibatch. In other words: first partition the memory into segments, then sample uniformly within each one.

  • For the proportional variant: the paper implements it efficiently with a 'sum-tree' data structure, in which each leaf stores a transition's priority and each internal node stores the sum of its children, so both sampling and priority updates cost O(log N).

  • Annealing the bias (to reduce the bias introduced by prioritized sampling)

  Estimating the expected value with stochastic updates relies on those updates being drawn from the same distribution as the expectation. Prioritized replay introduces bias because it changes this distribution in an uncontrolled fashion, and therefore changes the solution the estimates converge to (even when the policy and state distribution are fixed). This bias can be corrected with importance-sampling (IS) weights:

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta}$$

where $\frac{1}{N}$ corresponds to the original uniform sampling distribution and $\frac{1}{P(i)}$ corrects for the prioritized distribution. Since the prioritization itself is a deliberate bias, the exponent $\beta$ controls how much of it is compensated: $\beta = 1$ fully corrects the non-uniform probabilities. Because an uncorrected bias can harm convergence, $\beta$ is annealed from its initial value towards 1 as training proceeds, so the updates become unbiased near convergence.
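  As a hedged sketch of this correction (again with illustrative names, not code from the paper), the IS weights and a linear β schedule could look like this; following the paper, the weights are normalized by the maximum weight so they only scale updates downward:

```python
import numpy as np

def is_weights(probs, batch_idx, beta):
    """w_i = (N * P(i))^(-beta), normalized by the largest possible weight in the buffer."""
    N = len(probs)
    w = (N * probs[batch_idx]) ** (-beta)
    w_max = (N * probs.min()) ** (-beta)   # the smallest P(i) yields the largest weight
    return w / w_max                       # normalized weights in (0, 1]

def anneal_beta(step, total_steps, beta_start=0.4):
    """Linearly anneal beta from beta_start up to 1 over the course of training."""
    frac = min(1.0, step / total_steps)
    return beta_start + frac * (1.0 - beta_start)
```

  In the Q-learning update, each transition's TD error $\delta_i$ is then multiplied by its weight, i.e. $w_i \delta_i$ is used in place of $\delta_i$.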

  The paper summarizes the full procedure as pseudocode (Algorithm 1, Double DQN with proportional prioritization).
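  The original pseudocode figure is not reproduced here, so the following is a simplified sketch in the spirit of that algorithm. The buffer below uses a flat priority array instead of the paper's sum-tree, and the training-loop comments use placeholder names (q_net, target_net, env) that are assumptions, not the authors' code:

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Minimal proportional prioritized replay buffer.
    Uses a flat priority array (O(N) sampling); the paper's sum-tree makes this O(log N)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data = [None] * capacity
        self.priorities = np.zeros(capacity)
        self.pos, self.size = 0, 0

    def add(self, transition):
        # New transitions get the current maximum priority so they are replayed at least once.
        max_p = self.priorities[:self.size].max() if self.size > 0 else 1.0
        self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, beta):
        p = self.priorities[:self.size] ** self.alpha
        probs = p / p.sum()
        idx = np.random.choice(self.size, batch_size, p=probs)
        weights = (self.size * probs[idx]) ** (-beta)
        weights /= weights.max()   # normalize by the batch maximum (a common simplification)
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        self.priorities[idx] = np.abs(td_errors) + self.eps

# Training-loop outline (placeholder names):
#   1. act in env, store (s, a, r, s_next, done) in the buffer;
#   2. every K steps, sample a minibatch with the current (annealed) beta;
#   3. compute TD errors delta_i with q_net / target_net and minimize the weighted loss
#      sum_i w_i * delta_i**2;
#   4. write the new priorities back with update_priorities(idx, delta);
#   5. periodically copy q_net's parameters into target_net.
```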

What results were obtained?

  The experiments show that both the rank-based and the proportional variants speed up convergence compared with uniform replay.

Publication and author information

  The paper was published at ICLR 2016. The first author, Tom Schaul, is a Senior Research Scientist at Google DeepMind; he was a postdoc at New York University from 2011 to 2013 and a PhD student at IDSIA from 2007 to 2011.


Further reading

  1. Some transitions may not be immediately useful to the agent, but might become so when the agent's competence increases (Schmidhuber, 1991).
  • Reference: Schmidhuber, Jürgen. Curious model-building control systems. In Neural Networks, 1991. 1991 IEEE International Joint Conference on, pp. 1458–1463. IEEE, 1991.
  2. TD errors have also been used to determine where to explore (White et al., 2014) or which features to select (Geramifard et al., 2011; Sun et al., 2011).
  • Reference 1: White, Adam, Modayil, Joseph, and Sutton, Richard S. Surprise and curiosity for big data robotics. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
  • Reference 2: Geramifard, Alborz, Doshi, Finale, Redding, Joshua, Roy, Nicholas, and How, Jonathan. Online discovery of feature dependencies. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 881–888, 2011.
  • Reference 3: Sun, Yi, Ring, Mark, Schmidhuber, Jürgen, and Gomez, Faustino J. Incremental basis construction from temporal difference error. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 481–488, 2011.


