Actor-Critic Algorithms: A Brief Note


Actor-Critic

Actor-critic methods are a class of algorithms that precede Q-learning and SARSA. Refer to

V. Konda and J. Tsitsiklis: Actor-critic algorithms. SIAM Journal on Control and Optimization 42(4), 1143-1166 (2003).

for more details. We first give some basics:

Policy Gradient Methods & Policy Gradient Theorem

We consider methods for learning the policy weights based on the gradient of some performance measure $\eta(\theta)$ with respect to the policy weights. These methods maximize performance, so their updates approximate gradient ascent in $\eta$:

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla \eta(\theta_t)}$$

All methods that follow this general schema are known as policy gradient methods, whether or not they also learn an approximate value function. A value function may still be used to learn the policy weights $\theta \in \mathbb{R}^n$, but may not be required for action selection.

We consider the episodic case only, in which performance is defined as the value of the start state under the parameterized policy, $\eta(\theta) = v_{\pi_\theta}(s_0)$. (In the continuing case, performance is defined as the average reward rate.) In policy gradient methods, the policy can be parameterized in any way, as long as $\pi(a|s,\theta)$ is differentiable with respect to its weights, i.e. $\nabla_\theta \pi(a|s,\theta)$ always exists and is finite. To ensure exploration, we generally require that the policy never becomes deterministic.
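As a concrete illustration, here is a minimal sketch of such a differentiable parameterization: a tabular softmax policy, one preference per state-action pair. All names and sizes below are hypothetical. For any finite weights this policy is differentiable and never exactly deterministic.

```python
import numpy as np

# Hypothetical tabular softmax policy: one preference theta[s, a] per
# state-action pair. pi(a|s, theta) is then differentiable in theta and,
# for finite theta, never assigns probability 0 or 1 to any action.
n_states, n_actions = 4, 3
rng = np.random.default_rng(0)
theta = rng.normal(size=(n_states, n_actions))

def pi(s, theta):
    """Action probabilities pi(.|s, theta) via a softmax over preferences."""
    prefs = theta[s] - theta[s].max()      # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def grad_log_pi(s, a, theta):
    """Gradient of log pi(a|s, theta) w.r.t. theta (same shape as theta)."""
    g = np.zeros_like(theta)
    g[s] = -pi(s, theta)                   # d log-softmax / d preferences
    g[s, a] += 1.0                         # ... equals 1{a=a'} - pi(.|s)
    return g
```

A quick sanity check is that $\sum_a \pi(a|s,\theta)\,\nabla_\theta \log\pi(a|s,\theta) = 0$, since the probabilities always sum to one.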

Without losing any meaningful generality, we assume that every episode starts in some particular state $s_0$. Then we define performance as:

$$\eta(\theta) = v_{\pi_\theta}(s_0)$$

The policy gradient theorem states that

$$\nabla \eta(\theta) = \sum_s d_\pi(s) \sum_a q_\pi(s,a)\, \nabla_\theta \pi(a|s,\theta)$$

where the gradients in all cases are the column vectors of partial derivatives with respect to the components of $\theta$, $\pi$ denotes the policy corresponding to the weight vector $\theta$, and the distribution $d_\pi(s)$ here is the expected number of time steps $t$ on which $S_t = s$ in a randomly generated episode starting in $s_0$ and following $\pi$ and the dynamics of the MDP.

REINFORCE

Now we have an exact expression for the gradient. We need some way of sampling whose expectation equals or approximates this expression. Notice that the right-hand side is a sum over states, weighted by how often the states occur under the target policy $\pi$, weighted again by $\gamma^t$ where $t$ is the number of steps it takes to reach those states. If we just follow $\pi$, we will encounter states in these proportions, which we can then weight by $\gamma^t$ to preserve the expected value:

$$\nabla \eta(\theta) = \mathbb{E}_\pi\!\left[\gamma^t \sum_a q_\pi(S_t,a)\, \nabla_\theta \pi(a|S_t,\theta)\right]$$

Then we replace the sum over actions by a sample, multiplying and dividing by $\pi(a|S_t,\theta)$ and using $\mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)$ in the last step:

$$\begin{aligned}
\nabla \eta(\theta) &= \mathbb{E}_\pi\!\left[\gamma^t \sum_a q_\pi(S_t,a)\, \pi(a|S_t,\theta)\, \frac{\nabla_\theta \pi(a|S_t,\theta)}{\pi(a|S_t,\theta)}\right] \\
&= \mathbb{E}_\pi\!\left[\gamma^t\, q_\pi(S_t,A_t)\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right] \\
&= \mathbb{E}_\pi\!\left[\gamma^t\, G_t\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right]
\end{aligned}$$
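The substitution of a sample for the sum over actions can be checked numerically. In this sketch (a hypothetical single-state example with made-up softmax preferences and action values), drawing $A \sim \pi$ and averaging $q(A)\,\nabla_\theta \log\pi(A|\theta)$ converges to the exact sum $\sum_a q(a)\,\nabla_\theta \pi(a|\theta)$:

```python
import numpy as np

# Numerical check of the sampling step, on a hypothetical single state with
# 3 actions; the preferences and q-values below are made up for illustration.
rng = np.random.default_rng(1)
theta = np.array([0.2, -0.5, 1.0])   # softmax preferences, one per action
q = np.array([1.0, 3.0, -2.0])       # assumed action values q(a)

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(a, theta):
    g = -pi(theta)
    g[a] += 1.0                      # gradient of log softmax: 1{a} - pi
    return g

probs = pi(theta)

# Exact sum over actions: sum_a q(a) grad pi(a) = sum_a pi(a) q(a) grad log pi(a)
exact = sum(probs[a] * q[a] * grad_log_pi(a, theta) for a in range(3))

# Sampled version: draw A ~ pi and average q(A) grad log pi(A)
actions = rng.choice(3, size=100_000, p=probs)
estimate = np.mean([q[a] * grad_log_pi(a, theta) for a in actions], axis=0)
```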

Using the sample to instantiate the generic stochastic gradient ascent algorithm, we obtain the update:
$$\theta_{t+1} = \theta_t + \alpha\, \gamma^t\, G_t\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}$$

It is evident that REINFORCE is a Monte Carlo policy gradient method. The update increases the weight vector in this direction in proportion to the return, and in inverse proportion to the action probability. The former makes sense because it causes the weights to move most in the directions that favor actions yielding the highest return. The latter makes sense because otherwise actions that are selected frequently would be at an advantage and might win out even if they do not yield the highest return.

The update law can also be written as:

$$\theta_{t+1} = \theta_t + \alpha\, \gamma^t\, G_t\, \nabla_\theta \log \pi(A_t|S_t,\theta)$$
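This update can be sketched on a hypothetical 2-armed bandit treated as a one-step episode, so $G_t$ is the single reward and $\gamma^t = 1$. The reward means, step size, and episode count below are all assumptions for illustration:

```python
import numpy as np

# Minimal REINFORCE sketch on a hypothetical 2-armed bandit treated as a
# one-step episode, so G_t is the single reward and gamma^t = 1.
rng = np.random.default_rng(2)
theta = np.zeros(2)                  # softmax preferences over two actions
alpha = 0.1
true_means = [0.0, 1.0]              # action 1 is better in expectation

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

for episode in range(3000):
    probs = pi(theta)
    a = rng.choice(2, p=probs)
    G = rng.normal(true_means[a], 1.0)   # return of the one-step episode
    grad_log = -probs
    grad_log[a] += 1.0                   # gradient of log softmax
    theta += alpha * G * grad_log        # the REINFORCE update (gamma^t = 1)
```

After training, the policy should strongly favour the action with the higher expected return.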

The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$:

$$\nabla \eta(\theta) = \sum_s d_\pi(s) \sum_a \bigl(q_\pi(s,a) - b(s)\bigr)\, \nabla_\theta \pi(a|s,\theta)$$

The baseline can be any function, even a random variable, as long as it does not vary with $a$. Then the update law becomes:

$$\theta_{t+1} = \theta_t + \alpha\, \gamma^t \bigl(G_t - b(S_t)\bigr)\, \nabla_\theta \log \pi(A_t|S_t,\theta)$$

The baseline leaves the expected value of the update unchanged, but it can have a large effect on its variance. One natural choice is an estimate of the state value, $\hat{v}(S_t,\mathbf{w})$.
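Both properties can be demonstrated numerically. In this sketch (a hypothetical single state with two actions and made-up return distributions that share a large common offset), the baselined and un-baselined updates have the same mean, but the baselined one has much lower variance:

```python
import numpy as np

# Illustration that a baseline leaves the expected update unchanged while
# reducing its variance. Hypothetical single state, two actions, and returns
# with a large common offset (which is what a good baseline subtracts away).
rng = np.random.default_rng(3)
probs = np.array([0.3, 0.7])
return_means = np.array([10.0, 12.0])
baseline = probs @ return_means          # v(s): expected return under pi

def grad_log(a):
    g = -probs
    g[a] += 1.0
    return g

no_base, with_base = [], []
for _ in range(50_000):
    a = rng.choice(2, p=probs)
    G = rng.normal(return_means[a], 1.0)
    no_base.append(G * grad_log(a))                  # plain REINFORCE sample
    with_base.append((G - baseline) * grad_log(a))   # baselined sample

no_base, with_base = np.array(no_base), np.array(with_base)
```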

Actor-Critic

Methods that learn approximations to both policy and value functions are called actor-critic methods. REINFORCE-with-baseline methods use the value function only as a baseline, not as a critic, i.e. not for bootstrapping. This is a useful distinction, for only through bootstrapping do we introduce bias and an asymptotic dependence on the quality of the function approximation.

Consider one-step actor-critic methods. The main appeal of one-step methods is that they are fully online and incremental, like TD(0), SARSA(0), and Q-learning. One-step actor-critic methods replace the full return of REINFORCE with the one-step return:

$$\theta_{t+1} = \theta_t + \alpha \bigl(R_{t+1} + \gamma \hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})\bigr)\, \nabla_\theta \log \pi(A_t|S_t,\theta)$$

The full algorithm can be formulated as follows:

  1. Initialize a differentiable policy parameterization $\pi(a|s,\theta)$.
  2. Initialize a differentiable state-value parameterization $\hat{v}(s,\mathbf{w})$.
  3. Set step sizes $\alpha > 0$, $\beta > 0$.
  4. Initialize policy weights $\theta$ and state-value weights $\mathbf{w}$.
  5. Repeat:
    1. Initialize $S$ (first state of the episode)
    2. $I = 1$
    3. While $S$ is not terminal:
      1. $A \sim \pi(\cdot|S,\theta)$
      2. Take action $A$, observe $S'$, $R$
      3. $\delta = R + \gamma \hat{v}(S',\mathbf{w}) - \hat{v}(S,\mathbf{w})$ (if $S'$ is terminal, $\hat{v}(S',\mathbf{w}) = 0$)
      4. $\mathbf{w} = \mathbf{w} + \beta\, \delta\, \nabla_{\mathbf{w}} \hat{v}(S,\mathbf{w})$
      5. $\theta = \theta + \alpha\, I\, \delta\, \nabla_\theta \log \pi(A|S,\theta)$
      6. $I = \gamma I$
      7. $S = S'$
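The steps above can be sketched end to end on a small environment. Here is a minimal version on a hypothetical 5-state chain (start in state 0, reward +1 on reaching the terminal state 4 by moving right); the tabular weights, step sizes, and episode count are all assumptions:

```python
import numpy as np

# One-step actor-critic following the pseudocode above, on a hypothetical
# 5-state chain: actions {0: left, 1: right}, reward +1 on reaching the
# terminal state 4. Tabular weights play the roles of theta and w.
rng = np.random.default_rng(4)
n_states, n_actions = 5, 2
theta = np.zeros((n_states, n_actions))   # actor: softmax preferences
w = np.zeros(n_states)                    # critic: tabular state values
alpha, beta, gamma = 0.2, 0.2, 0.95

def pi(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def step(s, a):
    s2 = max(s - 1, 0) if a == 0 else s + 1
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for episode in range(1000):
    s, I = 0, 1.0
    while s != n_states - 1:
        probs = pi(s)
        a = rng.choice(n_actions, p=probs)
        s2, r = step(s, a)
        v_next = 0.0 if s2 == n_states - 1 else w[s2]  # terminal value is 0
        delta = r + gamma * v_next - w[s]              # TD error (the critic)
        w[s] += beta * delta                           # critic update
        grad_log = -probs
        grad_log[a] += 1.0
        theta[s] += alpha * I * delta * grad_log       # actor update
        I *= gamma
        s = s2
```

With these (assumed) settings, the learned policy moves right from every non-terminal state.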

Parameterization for Continuous Actions

Policy-based methods offer practical ways of dealing with large action spaces, even continuous ones. The notes above deal with discrete actions; from here on we consider continuous actions.

Instead of computing learned probabilities for each of many actions, we learn the statistics of the probability distribution. Take the Gaussian distribution for example. The action set might be the real numbers, with actions chosen from a Gaussian distribution:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

The value $p(x)$ is the density of the probability at $x$.

To produce a policy parameterization, we can define the policy as the normal probability density over a real-valued scalar action, with mean and standard deviation given by parametric function approximators:

$$\pi(a|s,\theta) = \frac{1}{\sigma(s,\theta)\sqrt{2\pi}} \exp\!\left(-\frac{(a-\mu(s,\theta))^2}{2\sigma(s,\theta)^2}\right)$$

Then we divide the policy weight vector into two parts: $\theta = [\theta_\mu, \theta_\sigma]^T$. The mean can be approximated as a linear function:

$$\mu(s,\theta) = \theta_\mu^T \phi(s)$$

The standard deviation must always be positive and is better approximated as the exponential of a linear function:

$$\sigma(s,\theta) = \exp\!\left(\theta_\sigma^T \phi(s)\right)$$

where $\phi(s)$ is a basis function of some type. With these definitions, all the algorithms described above can be applied to learn to select real-valued actions.
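The parameterization above can be sketched directly. The feature map `phi` below is an assumption standing in for "a basis function of some type", and all sizes are hypothetical:

```python
import numpy as np

# Sketch of the Gaussian policy parameterization above. The cosine feature
# map phi and all sizes are assumptions made for illustration.
rng = np.random.default_rng(5)
n_features = 8
theta_mu = rng.normal(scale=0.1, size=n_features)     # mean weights
theta_sigma = rng.normal(scale=0.1, size=n_features)  # log std-dev weights

def phi(s):
    return np.cos(np.arange(n_features) * s)          # assumed basis function

def policy_params(s):
    mu = theta_mu @ phi(s)                 # linear mean
    sigma = np.exp(theta_sigma @ phi(s))   # exponential keeps sigma > 0
    return mu, sigma

def sample_action(s):
    mu, sigma = policy_params(s)
    return rng.normal(mu, sigma)           # a ~ pi(.|s, theta)

def log_pi(a, s):
    mu, sigma = policy_params(s)
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
```

The exponential in `policy_params` guarantees a positive standard deviation for any weights, which is the reason for parameterizing $\sigma$ this way.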

Summary

Actor-critic, a branch of TD methods, keeps a separate policy representation independent of the value function. The policy is called the actor and the value function is called the critic. One advantage of a separate policy representation is that when there are many actions, or the action space is continuous, there is no need to consider every action's Q-value in order to select one. A second advantage is that such methods can learn stochastic policies naturally. Furthermore, prior knowledge about policy constraints can be incorporated.
