Actor-Critic Algorithms: A Brief Note


Actor-Critic

Actor-critic methods are a class of algorithms that precede Q-Learning and SARSA. Refer to

V. Konda and J. Tsitsiklis: Actor-Critic Algorithms. SIAM Journal on Control and Optimization 42(4), 1143-1166 (2003).

for more details. We first give some basics:

Policy Gradient Methods & Policy Gradient Theorem

We consider methods for learning the policy weights based on the gradient of some performance measure $\eta(\theta)$ with respect to the policy weights. These methods maximize performance, so their updates approximate gradient ascent in $\eta$:

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla \eta(\theta_t)}$$

All methods that follow this general schema are known as policy gradient methods, whether or not they also learn an approximate value function. A value function may still be used to learn the policy weights $\theta \in \mathbb{R}^n$, but it may not be required for action selection.

We consider the episodic case only, in which performance is defined as the value of the start state under the parameterized policy, $\eta(\theta) = v_{\pi_\theta}(s_0)$. (In the continuing case, performance is defined as the average reward rate.) In policy gradient methods, the policy can be parameterized in any way as long as $\pi(a|s,\theta)$ is differentiable with respect to its weights, i.e. $\nabla_\theta \pi(a|s,\theta)$ always exists and is finite. To ensure exploration, we generally require that the policy never become deterministic.

Without losing any meaningful generality, we assume that every episode starts in some particular state $s_0$. Then we define performance as:

$$\eta(\theta) = v_{\pi_\theta}(s_0)$$

The policy gradient theorem states that

$$\nabla \eta(\theta) = \sum_s d_\pi(s) \sum_a q_\pi(s,a)\, \nabla_\theta \pi(a|s,\theta)$$

where the gradients in all cases are the column vectors of partial derivatives with respect to the components of $\theta$, $\pi$ denotes the policy corresponding to the weight vector $\theta$, and the distribution $d_\pi(s)$ here is the expected number of time steps $t$ on which $S_t = s$ in a randomly generated episode starting in $s_0$ and following $\pi$ and the dynamics of the MDP.
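
To make the quantity $\nabla_\theta \pi(a|s,\theta)$ concrete, here is a minimal sketch (not from the original note) of a linear-softmax policy parameterization and its score function in Python with NumPy; the feature matrix `phi_s` and the helper names are illustrative assumptions, and they are reused in the sketches below:

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi(a|s,theta) for a linear-softmax parameterization.
    phi_s is an (n_actions, n_features) matrix of action features for state s (assumed)."""
    prefs = phi_s @ theta            # action preferences h(s, a, theta)
    prefs = prefs - prefs.max()      # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def grad_log_pi(theta, phi_s, a):
    """Score function: grad_theta log pi(a|s,theta) = phi(s,a) - sum_b pi(b|s) phi(s,b)."""
    pi = softmax_policy(theta, phi_s)
    return phi_s[a] - pi @ phi_s
```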

REINFORCE

Now we have an exact expression for the gradient. We need some way of sampling whose expectation equals or approximates this expression. Notice that the right-hand side is a sum over states weighted by how often the states occur under the target policy $\pi$, weighted again by $\gamma^t$, i.e. by the discount for the number of steps it takes to reach those states. If we just follow $\pi$ we will encounter states in these proportions, which we can then weight by $\gamma^t$ to preserve the expected value:

$$\nabla \eta(\theta) = \mathbb{E}_\pi\!\left[\gamma^t \sum_a q_\pi(S_t,a)\, \nabla_\theta \pi(a|S_t,\theta)\right]$$

Then we replace the sum over actions with an expectation under $\pi$ and sample it:

$$\nabla \eta(\theta) = \mathbb{E}_\pi\!\left[\gamma^t \sum_a q_\pi(S_t,a)\, \pi(a|S_t,\theta)\, \frac{\nabla_\theta \pi(a|S_t,\theta)}{\pi(a|S_t,\theta)}\right] = \mathbb{E}_\pi\!\left[\gamma^t\, q_\pi(S_t,A_t)\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right] = \mathbb{E}_\pi\!\left[\gamma^t\, G_t\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right]$$

where the last step uses $\mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)$.

Using the sample to instantiate the generic stochastic gradient ascent algorithm, we obtain the update:
$$\theta_{t+1} = \theta_t + \alpha\, \gamma^t\, G_t\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}$$

It is evident that REINFORCE is a Monte Carlo policy gradient method. The update moves the weight vector in this direction by an amount proportional to the return and inversely proportional to the action probability. The former makes sense because it causes the weights to move most in the directions that favor actions yielding the highest return. The latter makes sense because otherwise actions that are selected frequently would be at an advantage and might win out even if they do not yield the highest return.

The update law can also be written as:

$$\theta_{t+1} = \theta_t + \alpha\, \gamma^t\, G_t\, \nabla_\theta \log \pi(A_t|S_t,\theta)$$
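
As a minimal sketch of this update (assuming an episode has already been collected by following $\pi$, and reusing the hypothetical `grad_log_pi` and `phi` helpers from the softmax sketch above):

```python
def reinforce_update(theta, episode, alpha, gamma, phi):
    """One REINFORCE pass over a finished episode.
    episode: list of (state, action, reward) tuples, where the reward stored at
    index t is R_{t+1}; phi(s) returns the (n_actions, n_features) feature matrix."""
    T = len(episode)
    for t, (s, a, _) in enumerate(episode):
        # Monte Carlo return G_t = R_{t+1} + gamma*R_{t+2} + ...
        G = sum(gamma ** (k - t) * episode[k][2] for k in range(t, T))
        theta = theta + alpha * (gamma ** t) * G * grad_log_pi(theta, phi(s), a)
    return theta
```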

The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$:

$$\nabla \eta(\theta) = \sum_s d_\pi(s) \sum_a \big(q_\pi(s,a) - b(s)\big)\, \nabla_\theta \pi(a|s,\theta)$$

The baseline can be any function, even a random variable, as long as it does not vary with $a$. The update law then becomes:

$$\theta_{t+1} = \theta_t + \alpha \big(G_t - b(S_t)\big)\, \nabla_\theta \log \pi(A_t|S_t,\theta)$$

The baseline leaves the expected value of the update unchanged, but it can have a large effect on its variance. One natural choice is an estimate of the state value, $\hat{v}(S_t,\mathbf{w})$.
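
A minimal sketch of REINFORCE with a learned state-value baseline $\hat{v}(s,\mathbf{w}) = \mathbf{w}^T x(s)$, again reusing the hypothetical helpers above; the value-feature function `x(s)` is an assumption:

```python
def reinforce_with_baseline(theta, w, episode, alpha, beta, gamma, phi, x):
    """REINFORCE with a linear state-value baseline v_hat(s, w) = w . x(s).
    (The gamma^t factor of the earlier update is omitted to match the update law above.)"""
    T = len(episode)
    for t, (s, a, _) in enumerate(episode):
        G = sum(gamma ** (k - t) * episode[k][2] for k in range(t, T))
        delta = G - w @ x(s)                      # G_t - b(S_t) with b = v_hat
        w = w + beta * delta * x(s)               # improve the baseline estimate
        theta = theta + alpha * delta * grad_log_pi(theta, phi(s), a)
    return theta, w
```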

Actor-Critic

Methods that learn approximations to both policy and value functions are called actor-critic methods. REINFORCE with baseline uses the value function only as a baseline, not as a critic, i.e. not for bootstrapping. This is a useful distinction, for only through bootstrapping do we introduce bias and an asymptotic dependence on the quality of the function approximation.

Consider one-step actor-critic methods. The main appeal of one-step methods is that they are fully online and incremental, like TD(0), SARSA(0) and Q-Learning. One-step actor-critic methods replace the full return of REINFORCE with the one-step return:

$$\theta_{t+1} = \theta_t + \alpha \big(R_{t+1} + \gamma \hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})\big)\, \nabla_\theta \log \pi(A_t|S_t,\theta)$$

The full algorithm can be formulated as follows; a minimal code sketch is given after the pseudocode:

  1. Initialize a differentiable policy parameterization $\pi(a|s,\theta)$.
  2. Initialize a differentiable state-value parameterization $\hat{v}(s,\mathbf{w})$.
  3. Set step sizes $\alpha > 0$, $\beta > 0$.
  4. Initialize policy weights $\theta$ and state-value weights $\mathbf{w}$.
  5. Repeat:
    1. Initialize $S$ (first state of the episode)
    2. $I = 1$
    3. While $S$ is not terminal:
      1. $A \sim \pi(\cdot|S,\theta)$
      2. Take action $A$, observe $S'$, $R$
      3. $\delta = R + \gamma \hat{v}(S',\mathbf{w}) - \hat{v}(S,\mathbf{w})$ (with $\hat{v}(S',\mathbf{w}) = 0$ if $S'$ is terminal)
      4. $\mathbf{w} \leftarrow \mathbf{w} + \beta\, \delta\, \nabla_{\mathbf{w}} \hat{v}(S,\mathbf{w})$
      5. $\theta \leftarrow \theta + \alpha\, I\, \delta\, \nabla_\theta \log \pi(A|S,\theta)$
      6. $I \leftarrow \gamma I$
      7. $S \leftarrow S'$
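
A minimal sketch of this loop, assuming an environment with `reset()`/`step()` in the Gymnasium style, linear value features `x(s)`, and the hypothetical softmax-policy helpers sketched earlier:

```python
import numpy as np

def one_step_actor_critic(env, theta, w, phi, x, alpha, beta, gamma, n_episodes=500):
    """One-step actor-critic with a linear critic v_hat(s, w) = w . x(s)."""
    for _ in range(n_episodes):
        s, _ = env.reset()
        I = 1.0
        done = False
        while not done:
            pi = softmax_policy(theta, phi(s))
            a = np.random.choice(len(pi), p=pi)              # A ~ pi(.|S, theta)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            v_next = 0.0 if terminated else w @ x(s_next)    # v_hat of a terminal state is 0
            delta = r + gamma * v_next - w @ x(s)            # one-step TD error
            w = w + beta * delta * x(s)                      # critic update
            theta = theta + alpha * I * delta * grad_log_pi(theta, phi(s), a)
            I = gamma * I
            s = s_next
    return theta, w
```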

Parameterization for Continuous Actions

Policy-based methods offer practical ways of dealing with large action spaces, even continuous ones. All of the notes above deal with discrete actions; from now on we consider continuous actions.

Instead of computing learned probabilities for each of many actions, we learn the statistics of a probability distribution. Take the Gaussian distribution for example. The action set might be the real numbers, with actions chosen from a Gaussian distribution:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

The value $p(x)$ is the density of the probability at $x$.

To produce a policy parameterization, we can define the policy as the normal probability density over a real-valued scalar action, with mean and standard deviation given by parametric function approximators:

$$\pi(a|s,\theta) = \frac{1}{\sigma(s,\theta)\sqrt{2\pi}}\, \exp\!\left(-\frac{(a-\mu(s,\theta))^2}{2\sigma(s,\theta)^2}\right)$$

Then we divide the policy weight vector into two parts: $\theta = [\theta_\mu, \theta_\sigma]^T$. The mean can be approximated as a linear function:

$$\mu(s,\theta) = \theta_\mu^T \phi(s)$$

The standard deviation must always be positive and is better approximated as the exponential of a linear function:
$$\sigma(s,\theta) = \exp\!\big(\theta_\sigma^T \phi(s)\big)$$

where $\phi(s)$ is a basis-function feature vector of some type. With these definitions, all the algorithms described above can be applied to learn to select real-valued actions.
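
A minimal sketch of this Gaussian parameterization and the corresponding score function, obtained by differentiating $\log \pi(a|s,\theta)$ with respect to $\theta_\mu$ and $\theta_\sigma$; the feature vector `phi_s` is an assumption:

```python
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, phi_s, rng=None):
    """Sample a real-valued action from pi(a|s,theta) = N(mu(s,theta), sigma(s,theta)^2)."""
    rng = rng if rng is not None else np.random.default_rng()
    mu = theta_mu @ phi_s
    sigma = np.exp(theta_sigma @ phi_s)     # exponential keeps the std. deviation positive
    return rng.normal(mu, sigma), mu, sigma

def gaussian_grad_log_pi(a, mu, sigma, phi_s):
    """Gradients of log pi(a|s,theta) with respect to theta_mu and theta_sigma."""
    grad_mu = (a - mu) / sigma ** 2 * phi_s
    grad_sigma = ((a - mu) ** 2 / sigma ** 2 - 1.0) * phi_s
    return grad_mu, grad_sigma
```

These two gradients can be stacked into a single vector and plugged into any of the updates above in place of $\nabla_\theta \log \pi(A_t|S_t,\theta)$.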

Summary

Actor-Critic, which is a branch of TD methods, keeps a separate policy representation independent of the value function. The policy is called the actor and the value function is called the critic. One advantage of having a separate policy representation is that when there are many actions, or when the action space is continuous, there is no need to consider all actions' Q-values in order to select one of them. A second advantage is that such methods can learn stochastic policies naturally. Furthermore, prior knowledge about policy constraints can be incorporated.
