Continuous Multi-Step TD, Eligibility Traces and TD(λ): A brief note


Thanks to Richard S. Sutton and Andrew G. Barto for their great work, Reinforcement Learning: An Introduction.

We focus on the episodic case only and deal with continuous state and action spaces. We assume you already have basic knowledge of TD(0) methods in reinforcement learning.

Stochastic-gradient and Semi-gradient Methods

Stochastic-gradient methods are among the most widely used of all function approximation methods and are particularly well suited to online reinforcement learning.

Let $\hat{v}(s,\boldsymbol{\theta})$ be the approximation of the true value function $v_\pi(s)$, parameterized by a weight vector $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_n)^\top$, for all $s \in \mathcal{S}$. We assume that states appear in examples with a fixed distribution $d$, over which we are trying to minimize the Mean Squared Value Error,

$$\text{MSVE}(\boldsymbol{\theta}) = \sum_{s \in \mathcal{S}} d(s) \big[ v_\pi(s) - \hat{v}(s,\boldsymbol{\theta}) \big]^2.$$

A straightforward strategy is to try to minimize the error on the observed examples. Stochastic-gradient methods do this by adjusting the weight vector after each example by a small amount in the direction that would most reduce the error on that example:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \tfrac{1}{2}\alpha \nabla_{\boldsymbol{\theta}} \big[ v_\pi(S_t) - \hat{v}(S_t,\boldsymbol{\theta}_t) \big]^2 = \boldsymbol{\theta}_t + \alpha \big[ v_\pi(S_t) - \hat{v}(S_t,\boldsymbol{\theta}_t) \big] \nabla \hat{v}(S_t,\boldsymbol{\theta}_t)$$

SGD methods are gradient-descent methods because the overall step in $\boldsymbol{\theta}_t$ is proportional to the negative gradient of the example's squared error. This is the direction in which the error falls most rapidly. Meanwhile, stochastic means that the update is done on only a single example, which might have been selected stochastically.

Why do we take only a small step in the direction of the gradient? We do not seek a value function that has zero error for all states, but only an approximation that balances the errors in different states. In fact, the convergence results for SGD methods assume that $\alpha$ decreases over time; under such a schedule, the SGD method is guaranteed to converge to a local optimum.

However, we do not know the true value function $v_\pi(s)$ in practice, so it is necessary to substitute an estimate $U_t$ in place of $v_\pi(S_t)$. If $U_t$ is an unbiased estimate, that is, if $\mathbb{E}[U_t] = v_\pi(S_t)$ for each $t$, then $\boldsymbol{\theta}_t$ is guaranteed to converge to a local optimum. For example, the Monte Carlo target $U_t = G_t$ is by definition an unbiased estimate of $v_\pi(S_t)$.
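As a concrete sketch, the Monte Carlo form of this update with a linear approximator $\hat{v}(s,\boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s)$ might look as follows; the feature vectors and episode format here are illustrative assumptions, not from the text:

```python
def gradient_mc_episode(theta, alpha, episode):
    """One pass of gradient Monte Carlo prediction for a linear
    approximator v_hat(s, theta) = theta . x(s).

    `episode` is a list of (x, G) pairs: each visited state's feature
    vector x(S_t) together with the observed return G_t, which is an
    unbiased sample of v_pi(S_t)."""
    for x, G in episode:
        v_hat = sum(w * xi for w, xi in zip(theta, x))
        # SGD step: theta += alpha * [G - v_hat] * grad v_hat,
        # and for a linear approximator grad v_hat(s) = x(s).
        for i in range(len(theta)):
            theta[i] += alpha * (G - v_hat) * x[i]
    return theta
```

Because $U_t = G_t$ is unbiased, repeated passes with a decreasing $\alpha$ converge to a local optimum of the MSVE.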

If a bootstrapping estimate of $v_\pi(S_t)$ is used as the target $U_t$ (such as $U_t = R_{t+1} + \gamma \hat{v}(S_{t+1},\boldsymbol{\theta}_t)$), one does not obtain the same guarantees, because the target itself depends on $\boldsymbol{\theta}_t$. Bootstrapping methods take into account the effect of changing the weight vector $\boldsymbol{\theta}_t$ on the estimate, but ignore its effect on the target. They include only a part of the gradient and, accordingly, we call them semi-gradient methods.
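A minimal sketch of one such semi-gradient TD(0) step, again assuming a hypothetical linear approximator, makes the partial gradient explicit: the bootstrapping target is treated as a constant, and only $\hat{v}(S_t,\boldsymbol{\theta})$ is differentiated:

```python
def semi_gradient_td0_step(theta, alpha, gamma, x_t, r_next, x_next, terminal):
    """One semi-gradient TD(0) update for a linear v_hat(s) = theta . x(s).

    The bootstrapping target R + gamma * v_hat(S', theta) depends on theta,
    but we differentiate only v_hat(S_t, theta): a *semi*-gradient step."""
    v_t = sum(w * xi for w, xi in zip(theta, x_t))
    v_next = 0.0 if terminal else sum(w * xi for w, xi in zip(theta, x_next))
    target = r_next + gamma * v_next             # treated as a constant
    for i in range(len(theta)):
        theta[i] += alpha * (target - v_t) * x_t[i]  # grad v_hat = x_t
    return theta
```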

Although semi-gradient methods do not converge as robustly as gradient methods, they do converge reliably in important cases such as the linear case.

Episodic Semi-gradient Control

From the perspective of control rather than prediction, the action-value function $\hat{q}(S_t,A_t,\boldsymbol{\theta})$ matters more in applications than the state-value function $\hat{v}(S_t,\boldsymbol{\theta})$. The general gradient-descent update for action-value prediction is

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \big[ U_t - \hat{q}(S_t,A_t,\boldsymbol{\theta}_t) \big] \nabla \hat{q}(S_t,A_t,\boldsymbol{\theta}_t)$$

For example, the update for the one-step SARSA method is
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \big[ R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1},\boldsymbol{\theta}_t) - \hat{q}(S_t,A_t,\boldsymbol{\theta}_t) \big] \nabla \hat{q}(S_t,A_t,\boldsymbol{\theta}_t)$$

We call this method episodic semi-gradient one-step SARSA. For a constant policy, this method converges in the same way that TD(0) does, with the same kind of error bound:

$$\text{MSVE}(\boldsymbol{\theta}_{TD}) \le \frac{1}{1-\gamma} \min_{\boldsymbol{\theta}} \text{MSVE}(\boldsymbol{\theta})$$
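A sketch of this update for a hypothetical linear action-value approximator $\hat{q}(s,a,\boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s,a)$, together with an $\varepsilon$-greedy action-selection helper; both the feature encoding and the helper are our own illustrative assumptions:

```python
import random

def epsilon_greedy(theta, features_by_action, epsilon):
    """Pick an action epsilon-greedily from linear q_hat values;
    features_by_action[a] is the state-action feature vector x(s, a)."""
    if random.random() < epsilon:
        return random.randrange(len(features_by_action))
    q_values = [sum(w * xi for w, xi in zip(theta, x))
                for x in features_by_action]
    return max(range(len(q_values)), key=q_values.__getitem__)

def sarsa_update(theta, alpha, gamma, x_sa, r, x_sa_next, terminal):
    """One episodic semi-gradient one-step SARSA update for a linear
    q_hat(s, a, theta) = theta . x(s, a)."""
    q = sum(w * xi for w, xi in zip(theta, x_sa))
    q_next = 0.0 if terminal else sum(w * xi for w, xi in zip(theta, x_sa_next))
    delta = r + gamma * q_next - q            # one-step SARSA TD error
    for i in range(len(theta)):
        theta[i] += alpha * delta * x_sa[i]   # grad q_hat = x(s, a)
    return theta
```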

n-Step Semi-gradient SARSA

We can obtain an n-step version of episodic semi-gradient SARSA by using an n-step return as the update target.

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n},A_{t+n},\boldsymbol{\theta}_{t+n-1}), \qquad n \ge 1,\ 0 \le t < T-n$$

The n-step update equation is

$$\boldsymbol{\theta}_{t+n} = \boldsymbol{\theta}_{t+n-1} + \alpha \big[ G_t^{(n)} - \hat{q}(S_t,A_t,\boldsymbol{\theta}_{t+n-1}) \big] \nabla \hat{q}(S_t,A_t,\boldsymbol{\theta}_{t+n-1})$$
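Under the same illustrative assumptions (rewards stored so that `rewards[t]` holds $R_{t+1}$, and the bootstrap value $\hat{q}(S_{t+n},A_{t+n},\boldsymbol{\theta}_{t+n-1})$ passed in as a number), the n-step return can be sketched as:

```python
def n_step_return(rewards, gamma, n, t, q_tail):
    """The n-step return G_t^(n): the first n rewards after time t,
    discounted, plus gamma^n times the bootstrap value
    q_hat(S_{t+n}, A_{t+n}) supplied as `q_tail`.

    `rewards[k]` holds R_{k+1}, so rewards[t] is R_{t+1}."""
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    return G + gamma ** n * q_tail
```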

Performance is typically best when an intermediate level of bootstrapping is used, corresponding to an $n$ larger than 1.

The λ-Return

To begin with, we note that a valid update can be done toward not just any n-step return, but toward any average of n-step returns. The composite return possesses an error-reduction property similar to that of individual n-step returns and thus can be used to construct backups with guaranteed convergence properties. A backup that averages simpler component backups is called a compound backup.

The λ-return is defined by

$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

which can be understood as one particular way of averaging n-step backups. This average contains all the n-step backups, each weighted proportionally to $\lambda^{n-1}$, where $\lambda \in [0,1]$, and normalized by a factor of $1-\lambda$ to ensure that the weights sum to 1. After a terminal state (if one exists) has been reached, all subsequent n-step returns are equal to $G_t$. If we want, we can separate these post-termination terms from the main sum, yielding

$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$$
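The episodic form above translates directly into code. In this sketch, `n_step_returns[n-1]` is assumed to hold $G_t^{(n)}$ for $n = 1, \ldots, T-t-1$, and `G_final` the full return $G_t$:

```python
def lambda_return(n_step_returns, lam, G_final):
    """Episodic lambda-return:
    (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) * G_t^(n) plus the
    post-termination tail lam^(T-t-1) * G_t."""
    m = len(n_step_returns)  # m = T - t - 1
    G = (1 - lam) * sum(lam ** (n - 1) * n_step_returns[n - 1]
                        for n in range(1, m + 1))
    return G + lam ** m * G_final
```

Setting $\lambda = 0$ recovers the one-step return and $\lambda = 1$ the Monte Carlo return $G_t$, the two extremes discussed later as TD(0) and TD(1).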

Now it is time to define our first learning algorithm based on the λ-return: the off-line λ-return algorithm. As an off-line algorithm, it makes no changes to the weight vector during the episode. Then, at the end of the episode, a whole sequence of off-line updates is made according to the usual semi-gradient rule:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \big[ G_t^\lambda - \hat{v}(S_t,\boldsymbol{\theta}_t) \big] \nabla \hat{v}(S_t,\boldsymbol{\theta}_t)$$

This approach is what we call the theoretical, or forward, view of a learning algorithm: for each state visited, we look forward in time to all the future rewards and decide how best to combine them.

Eligibility Traces

Eligibility traces are one of the basic mechanisms of reinforcement learning. Almost any temporal-difference method, such as Q-Learning or SARSA, can be combined with eligibility traces to obtain a more general method that may learn more efficiently. What eligibility traces offer is an elegant algorithmic mechanism with significant computational advantages: only a single trace vector is required rather than a store of the last n feature vectors, and learning occurs continually and uniformly in time rather than being delayed and then catching up at the end of the episode.

The mechanism is a short-term memory vector, the eligibility trace $\mathbf{e}_t \in \mathbb{R}^n$, that parallels the long-term memory weight vector $\boldsymbol{\theta}_t \in \mathbb{R}^n$. The rough idea is that when a component of $\boldsymbol{\theta}_t$ participates in producing an estimated value, the corresponding component of $\mathbf{e}_t$ is bumped up and then begins to fade away. Learning will then occur in that component of $\boldsymbol{\theta}_t$ if a nonzero TD error occurs before the trace falls back to zero. The trace-decay parameter $\lambda \in [0,1]$ determines the rate at which the trace falls.

The idea behind eligibility traces is very simple. Each time a state is visited, it initializes a short-term memory process, a trace, which then decays gradually over time. This trace marks the state as eligible for learning. If an unexpectedly good or bad event occurs while the trace is non-zero, then the state is assigned credit accordingly. There are two kinds of traces: the accumulating trace and the replacing trace. In an accumulating trace, the trace builds up each time the state is entered. In a replacing trace, on the other hand, each time the state is visited the trace is reset to 1 regardless of the presence of a prior trace. Typically, eligibility traces decay exponentially according to the product of a decay parameter $\lambda$, $0 \le \lambda \le 1$, and a discount-rate parameter $\gamma$, $0 \le \gamma \le 1$. The accumulating trace is defined by:

$$e_{t+1}(s) = \begin{cases} \gamma\lambda e_t(s), & s \ne s_t \\ \gamma\lambda e_t(s) + 1, & s = s_t \end{cases}$$

where $e_t(s)$ represents the trace for state $s$ at time $t$, and $s_t$ is the actual state at time $t$. The corresponding replacing trace is defined by:

$$e_{t+1}(s) = \begin{cases} \gamma\lambda e_t(s), & s \ne s_t \\ 1, & s = s_t \end{cases}$$
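For tabular states, both trace updates can be sketched in a few lines; the dictionary representation of the trace is an illustrative choice:

```python
def decay_and_bump(trace, gamma, lam, visited, replacing=False):
    """One step of a tabular eligibility-trace update: decay every
    state's trace by gamma * lam, then bump the visited state,
    incrementing it (accumulating) or resetting it to 1 (replacing)."""
    for s in trace:
        trace[s] *= gamma * lam
    if replacing:
        trace[visited] = 1.0   # replacing trace: reset to 1
    else:
        trace[visited] += 1.0  # accumulating trace: build up
    return trace
```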

In the gradient-descent case, the (accumulating) eligibility trace is initialized to zero at the beginning of the episode, is incremented on each time step by the value gradient, and then fades away by $\gamma\lambda$:

$$\mathbf{e}_0 = \mathbf{0}, \qquad \mathbf{e}_t = \nabla \hat{v}(S_t,\boldsymbol{\theta}_t) + \gamma\lambda \mathbf{e}_{t-1}$$

The eligibility trace keeps track of which components of the weight vector have contributed, positively or negatively, to recent state valuations, where recent is defined in terms of $\gamma\lambda$.

TD(λ) Methods

TD(λ) improves over the off-line λ-return algorithm in three ways:

  1. It updates the weight vector on every step of an episode rather than only at the end.
  2. Its computations are distributed equally in time rather than all at the end of the episode.
  3. It can be applied to continuing problems rather than just episodic problems.

To begin with, the TD error for state-value prediction is:

$$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1},\boldsymbol{\theta}_t) - \hat{v}(S_t,\boldsymbol{\theta}_t)$$

In TD(λ), the weight vector is updated on each step in proportion to the scalar TD error and the vector eligibility trace:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \delta_t \mathbf{e}_t$$

TD(λ) is oriented backward in time. At each moment we look at the current TD error and assign it backward to each prior state according to how much that state contributed to the current eligibility trace. The complete update rule of TD(λ) (with so-called accumulating traces) is as follows:
  1. $\mathbf{e} = \mathbf{0}$ (at the start of the episode)
  2. Choose $A \sim \pi(\cdot \mid S)$
  3. Take action $A$; observe $R$, $S'$
  4. $\mathbf{e} = \gamma\lambda\mathbf{e} + \nabla\hat{v}(S,\boldsymbol{\theta})$
  5. $\delta = R + \gamma\hat{v}(S',\boldsymbol{\theta}) - \hat{v}(S,\boldsymbol{\theta})$
  6. $\boldsymbol{\theta} = \boldsymbol{\theta} + \alpha\delta\mathbf{e}$
  7. $S = S'$ (repeat from step 2 until $S'$ is terminal)
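Putting the steps above together for a hypothetical linear approximator, one episode of TD(λ) with accumulating traces can be sketched as follows; feature vectors stand in for states, so $\nabla\hat{v}(S,\boldsymbol{\theta}) = \mathbf{x}(S)$:

```python
def td_lambda_episode(theta, episode, alpha, gamma, lam):
    """One episode of semi-gradient TD(lambda) with accumulating traces
    for a linear v_hat(s) = theta . x(s).

    `episode` is a list of (x, r, x_next, terminal) transitions,
    assumed to have been generated by the behavior policy."""
    e = [0.0] * len(theta)                        # step 1: e = 0
    for x, r, x_next, terminal in episode:
        v = sum(w * xi for w, xi in zip(theta, x))
        v_next = 0.0 if terminal else sum(w * xi
                                          for w, xi in zip(theta, x_next))
        for i in range(len(theta)):
            e[i] = gamma * lam * e[i] + x[i]      # step 4: decay + gradient
        delta = r + gamma * v_next - v            # step 5: TD error
        for i in range(len(theta)):
            theta[i] += alpha * delta * e[i]      # step 6: weight update
    return theta
```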

In another variant of the algorithm, the eligibility traces are updated according to:

$$\mathbf{e} = \max\big(\gamma\lambda\mathbf{e},\ \nabla\hat{v}(S,\boldsymbol{\theta})\big)$$

where the maximum is taken componentwise.

This is called the replacing traces update.

TD(0) is the case in which only the one state preceding the current one is changed by the TD error. For larger values of λ, but still λ<1, more of the preceding states are changed but each more temporally distant state is changed less because the corresponding eligibility trace is smaller. We say that the earlier states are given less credit for the TD error.

TD(1) is a way of implementing Monte Carlo algorithms. On-line TD(1) learns in an n-step TD way from the incomplete ongoing episode, where the n steps are all the way up to the current step. If something unusually good or bad happens during an episode, control methods based on TD(1) can learn immediately and alter their behavior on that same episode.
