Sarsa(λ) and Q(λ) in Tabular Case


Thanks to R. S. Sutton and A. G. Barto for their great work in Reinforcement Learning: An Introduction.

Eligibility Traces in Prediction Problems

In the backward view of TD(λ), there is a memory variable associated with each state: its eligibility trace. The eligibility trace for state $s$ at time $t$ is a random variable denoted $Z_t(s) \in \mathbb{R}^+$. On each step, the eligibility traces for all states decay by $\gamma\lambda$, and the eligibility trace for the one state visited on the step is incremented by 1:

$$Z_t(s) = \begin{cases} \gamma\lambda Z_{t-1}(s) & \text{if } s \neq S_t \\ \gamma\lambda Z_{t-1}(s) + 1 & \text{if } s = S_t \end{cases}$$

for all nonterminal states $s$, where $\gamma$ is the discount rate and $\lambda$ is the λ-return[1] parameter, or trace-decay parameter. This kind of eligibility trace is called an accumulating trace. The global TD error signal triggers proportional updates to all recently visited states, as signaled by their nonzero traces:
$$\Delta V_t(s) = \alpha \delta_t Z_t(s), \quad \text{for all } s \in \mathcal{S}$$

where
$$\delta_t = R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)$$

A complete prediction algorithm for on-line TD(λ) is given as follows:
  1. Initialize V(s) arbitrarily.
  2. Repeat for each episode:
    1. Initialize Z(s) = 0 for all s ∈ S.
    2. Initialize S.
    3. Repeat for each step of episode:
      1. A ← action given by π for S.
      2. Take action A; observe reward R and the next state S′.
      3. δ ← R + γV(S′) − V(S)
      4. Z(S) ← Z(S) + 1
      5. For all s ∈ S:
        1. V(s) ← V(s) + αδZ(s)
        2. Z(s) ← γλZ(s)
      6. S ← S′
    4. Until S is terminal.
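The loop above can be sketched in a few lines of Python. Everything about the environment here is an illustrative assumption, not part of the algorithm: a hypothetical 5-state random walk (terminal at both ends, reward 1 only for exiting on the right) under a uniform-random policy, with arbitrarily chosen step size and trace parameters.

```python
import random

# A minimal sketch of on-line TD(lambda) with accumulating traces on a
# hypothetical 5-state random-walk chain; the environment, the random
# policy, and all hyperparameters are illustrative assumptions.
def td_lambda(num_episodes=200, alpha=0.1, gamma=1.0, lam=0.8, seed=0):
    rng = random.Random(seed)
    n_states = 5                      # nonterminal states 0..4
    V = [0.0] * n_states
    for _ in range(num_episodes):
        Z = [0.0] * n_states          # eligibility traces, reset per episode
        s = 2                         # start in the middle
        while True:
            s_next = s + rng.choice([-1, 1])    # uniform-random policy
            if s_next == n_states:              # exit on the right
                r, v_next, done = 1.0, 0.0, True
            elif s_next == -1:                  # exit on the left
                r, v_next, done = 0.0, 0.0, True
            else:
                r, v_next, done = 0.0, V[s_next], False
            delta = r + gamma * v_next - V[s]   # TD error
            Z[s] += 1.0                         # accumulating trace
            for i in range(n_states):
                V[i] += alpha * delta * Z[i]    # update all traced states
                Z[i] *= gamma * lam             # decay all traces
            if done:
                break
            s = s_next
    return V

V = td_lambda()
```

For this chain the true values are 1/6, 2/6, …, 5/6, so after a few hundred episodes the learned values should increase from left to right.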

Sarsa(λ)

The main idea of control is simply to learn action values rather than state values. The idea in Sarsa(λ) is to apply the TD(λ) prediction method to state-action pairs. Let $Z_t(s,a)$ denote the trace for state-action pair $(s,a)$. Substitute state-action variables for state variables, i.e.

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t Z_t(s,a), \quad \text{for all } s, a$$

where
$$\delta_t = R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_t(S_t, A_t)$$

and
$$Z_t(s,a) = \begin{cases} \gamma\lambda Z_{t-1}(s,a) + 1 & \text{if } s = S_t \text{ and } a = A_t \\ \gamma\lambda Z_{t-1}(s,a) & \text{otherwise} \end{cases} \quad \text{for all } s, a$$

The complete Sarsa(λ) algorithm is given as follows:
  1. Initialize Q(s,a) arbitrarily for all s ∈ S, a ∈ A(s).
  2. Repeat for each episode:
    1. Initialize Z(s,a) = 0 for all s ∈ S, a ∈ A(s).
    2. Initialize S, A.
    3. Repeat for each step of episode:
      1. Take action A; observe R, S′.
      2. Choose A′ from S′ using policy derived from Q (e.g., ε-greedy).
      3. δ ← R + γQ(S′, A′) − Q(S, A)
      4. Z(S, A) ← Z(S, A) + 1
      5. For all s ∈ S, a ∈ A(s):
        1. Q(s, a) ← Q(s, a) + αδZ(s, a)
        2. Z(s, a) ← γλZ(s, a)
      6. S ← S′; A ← A′
    4. Until S is terminal.
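The steps above can be sketched as follows. The environment is an illustrative assumption: a hypothetical 4-state corridor where exiting on the left gives reward 0 and exiting on the right gives reward 1; the ε-greedy policy with random tie-breaking and the hyperparameter values are likewise assumptions made for the sake of the example.

```python
import random

# A minimal sketch of tabular Sarsa(lambda) with accumulating traces.
# The 4-state corridor environment and the policy details are
# illustrative assumptions, not part of the algorithm itself.
def sarsa_lambda(num_episodes=300, alpha=0.1, gamma=0.9, lam=0.9,
                 eps=0.1, seed=0):
    rng = random.Random(seed)
    n_states, n_actions = 4, 2        # actions: 0 = left, 1 = right
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def policy(s):                    # epsilon-greedy, random tie-breaking
        if rng.random() < eps or Q[s][0] == Q[s][1]:
            return rng.randrange(n_actions)
        return 0 if Q[s][0] > Q[s][1] else 1

    def env_step(s, a):               # -> (reward, next_state, done)
        s2 = s - 1 if a == 0 else s + 1
        if s2 < 0:
            return 0.0, None, True    # fell off the left end
        if s2 >= n_states:
            return 1.0, None, True    # reached the right goal
        return 0.0, s2, False

    for _ in range(num_episodes):
        Z = [[0.0] * n_actions for _ in range(n_states)]
        s, a = 1, policy(1)           # initialize S, A
        while True:
            r, s2, done = env_step(s, a)
            a2 = None if done else policy(s2)
            target = r if done else r + gamma * Q[s2][a2]
            delta = target - Q[s][a]  # Sarsa TD error
            Z[s][a] += 1.0            # accumulating trace for (S, A)
            for i in range(n_states):
                for j in range(n_actions):
                    Q[i][j] += alpha * delta * Z[i][j]
                    Z[i][j] *= gamma * lam
            if done:
                break
            s, a = s2, a2
    return Q

Q = sarsa_lambda()
```

After a few hundred episodes the action value for "right" in the state next to the goal should approach 1.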

Q(λ)

Two different methods have been proposed that combine eligibility traces and Q-learning: Watkins's Q(λ) and Peng's Q(λ). Here we focus on Watkins's Q(λ) and give only a brief description of Peng's Q(λ).

Unlike TD(λ) and Sarsa(λ), Watkins's Q(λ) does not look ahead all the way to the end of the episode in its backup. It only looks ahead as far as the next exploratory action. For example, suppose the first action, $A_{t+1}$, is exploratory. Watkins's Q(λ) would still do the one-step update of $Q_t(S_t, A_t)$ toward $R_{t+1} + \gamma \max_a Q_t(S_{t+1}, a)$. In general, if $A_{t+n}$ is the first exploratory action, then the longest backup is toward

$$R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \max_a Q(S_{t+n}, a)$$

where we assume off-line updating.

The mechanistic or backward view of Watkins's Q(λ) is also very simple. Eligibility traces are used just as in Sarsa(λ), except that they are set to zero whenever an exploratory action is taken. That is, first, the traces for all state-action pairs are either decayed by γλ or, if an exploratory action was taken, set to 0; second, the trace corresponding to the current state and action is incremented by 1, i.e.

$$Z_t(s,a) = \mathbb{I}_{sS_t} \cdot \mathbb{I}_{aA_t} + \begin{cases} \gamma\lambda Z_{t-1}(s,a) & \text{if } Q_{t-1}(S_t, A_t) = \max_a Q_{t-1}(S_t, a) \\ 0 & \text{otherwise} \end{cases}$$

where $\mathbb{I}_{xy}$ is an identity indicator function, equal to 1 if $x = y$ and 0 otherwise. The rest of the algorithm is defined by
$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t Z_t(s,a), \quad \text{for all } s \in \mathcal{S}, a \in \mathcal{A}(s)$$

where
$$\delta_t = R_{t+1} + \gamma \max_{a'} Q_t(S_{t+1}, a') - Q_t(S_t, A_t)$$

The complete algorithm is given below:
  1. Initialize Q(s,a) arbitrarily for all s ∈ S, a ∈ A(s).
  2. Repeat for each episode:
    1. Initialize Z(s,a) = 0 for all s ∈ S, a ∈ A(s).
    2. Initialize S, A.
    3. Repeat for each step of episode:
      1. Take action A; observe R, S′.
      2. Choose A′ from S′ using policy derived from Q (e.g., ε-greedy).
      3. A* ← argmax_a Q(S′, a) (if A′ ties for the max, then A* ← A′).
      4. δ ← R + γQ(S′, A*) − Q(S, A)
      5. Z(S, A) ← Z(S, A) + 1
      6. For all s ∈ S, a ∈ A(s):
        1. Q(s, a) ← Q(s, a) + αδZ(s, a)
        2. If A′ = A*, then Z(s, a) ← γλZ(s, a); else Z(s, a) ← 0.
      7. S ← S′; A ← A′
    4. Until S is terminal.
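The algorithm above can be sketched in the same style as the Sarsa(λ) example, the only substantive differences being the max-based TD error and the trace cut after an exploratory action. As before, the 4-state corridor environment (left exit: reward 0, right exit: reward 1), the tie-breaking scheme, and the hyperparameters are illustrative assumptions.

```python
import random

# A minimal sketch of Watkins's Q(lambda) on a hypothetical 4-state
# corridor; environment and hyperparameters are illustrative assumptions.
def watkins_q_lambda(num_episodes=300, alpha=0.1, gamma=0.9, lam=0.9,
                     eps=0.1, seed=0):
    rng = random.Random(seed)
    n_states, n_actions = 4, 2        # actions: 0 = left, 1 = right
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def policy(s):                    # epsilon-greedy, random tie-breaking
        if rng.random() < eps or Q[s][0] == Q[s][1]:
            return rng.randrange(n_actions)
        return 0 if Q[s][0] > Q[s][1] else 1

    def env_step(s, a):               # -> (reward, next_state, done)
        s2 = s - 1 if a == 0 else s + 1
        if s2 < 0:
            return 0.0, None, True
        if s2 >= n_states:
            return 1.0, None, True
        return 0.0, s2, False

    for _ in range(num_episodes):
        Z = [[0.0] * n_actions for _ in range(n_states)]
        s, a = 1, policy(1)           # initialize S, A
        while True:
            r, s2, done = env_step(s, a)
            if done:
                delta, a2, greedy_next = r - Q[s][a], None, True
            else:
                a2 = policy(s2)
                q_max = max(Q[s2])
                # if A' ties for the max, take A* to be A'
                a_star = a2 if Q[s2][a2] == q_max else Q[s2].index(q_max)
                delta = r + gamma * Q[s2][a_star] - Q[s][a]
                greedy_next = (a2 == a_star)
            Z[s][a] += 1.0
            for i in range(n_states):
                for j in range(n_actions):
                    Q[i][j] += alpha * delta * Z[i][j]
                    # decay traces, or cut them after an exploratory action
                    Z[i][j] = gamma * lam * Z[i][j] if greedy_next else 0.0
            if done:
                break
            s, a = s2, a2
    return Q

Q = watkins_q_lambda()
```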

Unfortunately, cutting off traces every time an exploratory action is taken loses much of the advantage of using eligibility traces. Peng's Q(λ) is an alternative version of Q(λ) meant to remedy this, which can be thought of as a hybrid of Sarsa(λ) and Watkins's Q(λ). In Peng's Q(λ), there is no distinction between exploratory and greedy actions. Each component backup is over many steps of actual experience, and all but the last are capped by a final maximization over actions. For a fixed nongreedy policy, $Q_t$ converges to neither $q_\pi$ nor $q_*$, but to some hybrid of the two. However, if the policy is gradually made more greedy, then the method may still converge to $q_*$.

Replacing Traces

In some cases significantly better performance can be obtained by using a slightly modified kind of trace known as a replacing trace:

$$Z_t(s) = \begin{cases} \gamma\lambda Z_{t-1}(s) & \text{if } s \neq S_t \\ 1 & \text{if } s = S_t \end{cases}$$

There are several possible ways to generalize replacing eligibility traces for use in control methods. In some cases, the state-action traces are updated by the following:
$$Z_t(s,a) = \begin{cases} 1 & \text{if } s = S_t \text{ and } a = A_t \\ 0 & \text{if } s = S_t \text{ and } a \neq A_t \\ \gamma\lambda Z_{t-1}(s,a) & \text{if } s \neq S_t \end{cases}$$
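The difference between accumulating and replacing state traces comes down to one line of the per-step update. The helper below is a toy illustration (the function name, the trace vector, and the numbers are all made up for the example): an accumulating trace adds 1 on top of the decayed value, while a replacing trace resets it to exactly 1.

```python
# A toy illustration of one step of trace maintenance; `visited` is the
# index of the state visited on this step. Accumulating traces add 1 to
# the decayed trace; replacing traces reset it to 1.
def decay_and_bump(Z, visited, gamma, lam, replacing=False):
    """Return the updated trace vector after one time step."""
    Z = [gamma * lam * z for z in Z]              # decay all traces
    Z[visited] = 1.0 if replacing else Z[visited] + 1.0
    return Z

Z = decay_and_bump([0.0, 0.0, 0.0], 1, gamma=1.0, lam=0.5)   # Z[1] = 1.0
Z = decay_and_bump(Z, 1, gamma=1.0, lam=0.5)                 # Z[1] = 1.5
Zr = decay_and_bump(Z, 1, gamma=1.0, lam=0.5, replacing=True)  # Zr[1] = 1.0
```

Revisiting a state repeatedly can push an accumulating trace above 1 (here 1.5), whereas a replacing trace is capped at 1, which is exactly what the modified definition above achieves.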

  1. λ-return: $G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$