Actor-Critic Algorithms: A Brief Note

Actor-Critic

A class of algorithms that precedes Q-Learning and SARSA is the family of actor-critic methods. Refer to

V. Konda and J. Tsitsiklis: Actor-critic algorithms. SIAM Journal on Control and Optimization 42(4), 1143-1166 (2003).

for more details. We first give some basics:

Policy Gradient Methods & Policy Gradient Theorem

We consider methods that learn the policy weights based on the gradient of some performance measure $\eta(\theta)$ with respect to those weights. These methods seek to maximize performance, so their updates approximate gradient ascent in $\eta$:

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla \eta(\theta_t)}$$

All methods that follow this general schema are known as policy gradient methods, whether or not they also learn an approximate value function. A value function may still be used to learn the policy weights $\theta \in \mathbb{R}^n$, but it may not be required for action selection.

We consider the episodic case only, in which performance is defined as the value of the start state under the parameterized policy, $\eta(\theta) = v_{\pi_\theta}(s_0)$. (In the continuing case, performance is defined as the average reward rate.) In policy gradient methods, the policy can be parameterized in any way, as long as $\pi(a|s,\theta)$ is differentiable with respect to its weights, i.e. $\nabla_\theta \pi(a|s,\theta)$ always exists and is finite. To ensure exploration, we generally require that the policy never become deterministic.
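
For discrete action sets, one standard parameterization that meets both requirements (a textbook example, not specific to the reference above) is a softmax over action preferences, e.g. preferences linear in assumed state-action features $\phi(s,a)$:

$$\pi(a \mid s,\theta) = \frac{\exp\big(h(s,a,\theta)\big)}{\sum_b \exp\big(h(s,b,\theta)\big)}, \qquad h(s,a,\theta) = \theta^\top \phi(s,a)$$

This policy is differentiable everywhere in $\theta$ and assigns non-zero probability to every action.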

Without losing any meaningful generality, we assume that every episode starts in some particular state $s_0$. Then we define performance as:

$$\eta(\theta) = v_{\pi_\theta}(s_0)$$

The policy gradient theorem states that
$$\nabla \eta(\theta) = \sum_s d_\pi(s) \sum_a q_\pi(s,a)\, \nabla_\theta \pi(a|s,\theta)$$

where the gradients are column vectors of partial derivatives with respect to the components of $\theta$, $\pi$ denotes the policy corresponding to the weight vector $\theta$, and the distribution $d_\pi(s)$ is the expected number of time steps $t$ on which $S_t = s$ in a randomly generated episode starting in $s_0$ and following $\pi$ and the dynamics of the MDP.

REINFORCE

Now we have an exact expression for the gradient. We need some way of sampling whose expectation equals or approximates this expression. Notice that the right-hand side is a sum over states, weighted by how often the states occur under the target policy $\pi$ and weighted again by $\gamma$ raised to the number of steps it takes to reach those states. If we just follow $\pi$, we will encounter states in these proportions, which we can then weight by $\gamma^t$ to preserve the expected value:

$$\nabla \eta(\theta) = \mathbb{E}_\pi\!\left[\gamma^t \sum_a q_\pi(S_t,a)\, \nabla_\theta \pi(a|S_t,\theta)\right]$$

Then we replace the sum over actions with an expectation under $\pi$, which we can sample:
$$\begin{aligned}
\nabla \eta(\theta) &= \mathbb{E}_\pi\!\left[\gamma^t \sum_a q_\pi(S_t,a)\, \pi(a|S_t,\theta)\, \frac{\nabla_\theta \pi(a|S_t,\theta)}{\pi(a|S_t,\theta)}\right] \\
&= \mathbb{E}_\pi\!\left[\gamma^t\, q_\pi(S_t,A_t)\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right] \\
&= \mathbb{E}_\pi\!\left[\gamma^t\, G_t\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right]
\end{aligned}$$
where the last step uses the fact that $\mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)$.

Using the sample to instantiate the generic stochastic gradient ascent algorithm, we obtain the update:
$$\theta_{t+1} = \theta_t + \alpha\, \gamma^t\, G_t\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}$$

It is evident that REINFORCE is a Monte Carlo policy gradient method. The update moves the weight vector in this direction by an amount proportional to the return and inversely proportional to the action probability. The former makes sense because it causes the weights to move most in the directions that favor actions yielding the highest return. The latter makes sense because otherwise actions that are selected frequently would be at an advantage and might win out even if they do not yield the highest return.

The update rule can also be written as:

$$\theta_{t+1} = \theta_t + \alpha\, \gamma^t\, G_t\, \nabla_\theta \log \pi(A_t|S_t,\theta)$$
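
As a minimal sketch of this update (not code from the reference above), consider a discrete-action softmax policy with linear preferences $h(s,a,\theta) = \theta^\top \phi(s,a)$; the feature map `phi_sa` and the episode format used here are assumptions for illustration:

```python
import numpy as np

def softmax_policy(theta, phi_sa):
    """Action probabilities for a softmax-in-preferences policy.

    phi_sa: array of shape (num_actions, d), the state-action features for one state.
    """
    prefs = phi_sa @ theta
    prefs -= prefs.max()                      # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def grad_log_pi(theta, phi_sa, a):
    """Score function: grad_theta log pi(a|s, theta) for the softmax policy."""
    probs = softmax_policy(theta, phi_sa)
    return phi_sa[a] - probs @ phi_sa

def reinforce_update(theta, episode, alpha, gamma):
    """One REINFORCE update over a completed episode.

    episode: list of (phi_sa, action, reward) tuples collected by following pi.
    """
    rewards = [r for (_, _, r) in episode]
    for t, (phi_sa, a, _) in enumerate(episode):
        G = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))  # Monte Carlo return G_t
        theta = theta + alpha * (gamma ** t) * G * grad_log_pi(theta, phi_sa, a)
    return theta
```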

The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$:

$$\nabla \eta(\theta) = \sum_s d_\pi(s) \sum_a \big(q_\pi(s,a) - b(s)\big)\, \nabla_\theta \pi(a|s,\theta)$$

The baseline can be any function, even a random variable, as long as it does not vary with $a$. We then rewrite the update rule as:
$$\theta_{t+1} = \theta_t + \alpha\, \big(G_t - b(S_t)\big)\, \nabla_\theta \log \pi(A_t|S_t,\theta)$$

The baseline leaves the expected value of the update unchanged, but it can have a large effect on its variance. One natural choice for the baseline is an estimate of the state value, $\hat{v}(S_t,\mathbf{w})$.
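
A hedged sketch of how this baseline fits into the update above, reusing the hypothetical `grad_log_pi` helper from the previous sketch and assuming a linear state-value estimate $\hat{v}(s,\mathbf{w}) = \mathbf{w}^\top \phi(s)$ with state features `phi_s` (again assumptions for illustration):

```python
def reinforce_with_baseline_update(theta, w, episode, alpha, beta, gamma):
    """REINFORCE with a learned state-value baseline v_hat(s, w) = w . phi_s.

    episode: list of (phi_s, phi_sa, action, reward) tuples collected by following pi.
    """
    rewards = [r for (_, _, _, r) in episode]
    for t, (phi_s, phi_sa, a, _) in enumerate(episode):
        G = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))  # Monte Carlo return G_t
        delta = G - w @ phi_s                       # G_t - b(S_t), the centered return
        w = w + beta * delta * phi_s                # move the baseline toward the observed return
        theta = theta + alpha * delta * grad_log_pi(theta, phi_sa, a)
    return theta, w
```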

Actor-Critic

Methods that learn approximations to both the policy and the value function are called actor-critic methods. REINFORCE with baseline uses the value function only as a baseline, not as a critic, i.e. not for bootstrapping. This is a useful distinction, for only through bootstrapping do we introduce bias and an asymptotic dependence on the quality of the function approximation.

Consider one-step actor-critic methods. The main appeal of one-step methods is that they are fully online and incremental, in the same way as TD(0), SARSA(0), and Q-Learning. One-step actor-critic methods replace the full return of REINFORCE with the one-step return:

$$\theta_{t+1} = \theta_t + \alpha\, \big(R_{t+1} + \gamma \hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})\big)\, \nabla_\theta \log \pi(A_t|S_t,\theta)$$

The full algorithm can be formulated as follows (a minimal code sketch is given after the pseudocode):

  1. Initialize a differentiable policy parameterization $\pi(a|s,\theta)$.
  2. Initialize a differentiable state-value parameterization $\hat{v}(s,\mathbf{w})$.
  3. Set step sizes $\alpha > 0,\ \beta > 0$.
  4. Initialize policy weights $\theta$ and state-value weights $\mathbf{w}$.
  5. Repeat (for each episode):
    1. Initialize $S$, the first state of the episode
    2. $I = 1$
    3. While $S$ is not terminal:
      1. $A \sim \pi(\cdot|S,\theta)$
      2. Take action $A$, observe $S'$, $R$
      3. $\delta = R + \gamma \hat{v}(S',\mathbf{w}) - \hat{v}(S,\mathbf{w})$ (with $\hat{v}(S',\mathbf{w}) = 0$ if $S'$ is terminal)
      4. $\mathbf{w} \leftarrow \mathbf{w} + \beta\, \delta\, \nabla_{\mathbf{w}} \hat{v}(S,\mathbf{w})$
      5. $\theta \leftarrow \theta + \alpha\, I\, \delta\, \nabla_\theta \log \pi(A|S,\theta)$
      6. $I \leftarrow \gamma I$
      7. $S \leftarrow S'$
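
Below is a minimal, hedged sketch of this loop with linear function approximation for both actor and critic. The environment object `env` is hypothetical (its `reset()` returns an initial state and `step(a)` returns `(next_state, reward, done)`), and the feature maps `phi_s`/`phi_sa` as well as the `softmax_policy` and `grad_log_pi` helpers from the earlier sketch are assumptions for illustration:

```python
import numpy as np

def one_step_actor_critic(env, phi_s, phi_sa, num_actions, d_theta, d_w,
                          alpha=0.01, beta=0.05, gamma=0.99, num_episodes=500):
    """One-step actor-critic with a linear critic and a softmax-in-preferences actor.

    phi_s(s)  -> critic feature vector of shape (d_w,)
    phi_sa(s) -> actor feature matrix of shape (num_actions, d_theta)
    """
    theta = np.zeros(d_theta)   # actor (policy) weights
    w = np.zeros(d_w)           # critic (state-value) weights
    for _ in range(num_episodes):
        s = env.reset()
        I = 1.0                 # accumulates gamma^t
        done = False
        while not done:
            probs = softmax_policy(theta, phi_sa(s))            # pi(.|S, theta)
            a = np.random.choice(num_actions, p=probs)
            s_next, r, done = env.step(a)
            v_next = 0.0 if done else w @ phi_s(s_next)         # v_hat(S') = 0 at terminal states
            delta = r + gamma * v_next - w @ phi_s(s)           # one-step TD error
            w = w + beta * delta * phi_s(s)                     # critic update (grad_w v_hat = phi_s)
            theta = theta + alpha * I * delta * grad_log_pi(theta, phi_sa(s), a)  # actor update
            I *= gamma
            s = s_next
    return theta, w
```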

Parameterization for Continuous Actions

Policy-based methods offer practical ways of dealing with large action spaces, even continuous ones. The notes above deal with discrete actions; from here on we consider continuous actions.

Instead of computing learned probabilities for each of many actions, we learn the statistics of a probability distribution over actions. Take the Gaussian distribution as an example. The action set might be the real numbers, with actions chosen from a Gaussian distribution:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

The value $p(x)$ is the probability density at $x$.

To produce a policy parameterization, we can define the policy as the normal probability density over a real-valued scalar action, with mean and standard deviation given by parametric function approximators:

$$\pi(a|s,\theta) = \frac{1}{\sigma(s,\theta)\sqrt{2\pi}} \exp\!\left(-\frac{(a-\mu(s,\theta))^2}{2\sigma(s,\theta)^2}\right)$$

Then we divide the policy weight vector into two parts, $\theta = [\theta_\mu, \theta_\sigma]^\top$. The mean can be approximated as a linear function:
$$\mu(s,\theta) = \theta_\mu^\top \phi(s)$$

The standard deviation must always be positive and is better approximated as the exponential of a linear function:
$$\sigma(s,\theta) = \exp\!\big(\theta_\sigma^\top \phi(s)\big)$$
where $\phi(s)$ is a vector of basis functions (features) of some type. With these definitions, all of the algorithms described above can be applied to learn to select real-valued actions.
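
As a hedged illustration (the helper names and the feature vector `phi` are assumptions made here), the Gaussian policy can be sampled and its score function $\nabla_\theta \log \pi(a|s,\theta)$ computed in closed form, which is exactly what the REINFORCE and actor-critic updates above require:

```python
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, phi):
    """Sample a real-valued action from the Gaussian policy pi(a|s, theta)."""
    mu = theta_mu @ phi                      # mu(s, theta) = theta_mu^T phi(s)
    sigma = np.exp(theta_sigma @ phi)        # sigma(s, theta) = exp(theta_sigma^T phi(s))
    a = np.random.normal(mu, sigma)
    return a, mu, sigma

def gaussian_grad_log_pi(a, mu, sigma, phi):
    """Gradients of log pi(a|s, theta) with respect to theta_mu and theta_sigma."""
    grad_mu = ((a - mu) / sigma**2) * phi               # d log pi / d mu, times d mu / d theta_mu
    grad_sigma = (((a - mu) / sigma)**2 - 1.0) * phi    # chain rule through sigma = exp(theta_sigma^T phi)
    return grad_mu, grad_sigma
```

The two gradient pieces play the role of $\nabla_\theta \log \pi(A_t|S_t,\theta)$: $\theta_\mu$ and $\theta_\sigma$ are updated with the same scalar factor as in the updates above.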

Summary

Actor-Critic, a branch of TD methods, keeps a separate policy representation independent of the value function. The policy is called the actor and the value function is called the critic. One advantage of having a separate policy representation is that when there are many actions, or when the action space is continuous, there is no need to consider every action's Q-value in order to select one of them. A second advantage is that such methods can learn stochastic policies naturally. Furthermore, prior knowledge about policy constraints can be incorporated.
