Actor-Critic Algorithms: A Brief Note


Actor-Critic

Actor-critic methods are a class of algorithms that precede Q-learning and SARSA. Refer to

V. Konda and J. Tsitsiklis: Actor-critic algorithms. SIAM Journal on Control and Optimization 42(4), 1143-1166 (2003).

for more details. We first give some basics:

Policy Gradient Methods & Policy Gradient Theorem

We consider methods for learning the policy weights based on the gradient of some performance measure $\eta(\theta)$ with respect to the policy weights. These methods maximize performance, so their updates approximate gradient ascent in $\eta$:

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla \eta(\theta_t)}$$

All methods that follow this general schema are known as policy gradient methods, whether or not they also learn an approximate value function. A value function may still be used to learn the policy weights $\theta \in \mathbb{R}^n$, but may not be required for action selection.

We consider the episodic case only, in which performance is defined as the value of the start state under the parameterized policy, $\eta(\theta) = v_{\pi_\theta}(s_0)$. (In the continuing case, performance is defined as the average reward rate.) In policy gradient methods, the policy can be parameterized in any way, as long as $\pi(a|s,\theta)$ is differentiable with respect to its weights, i.e. $\nabla_\theta \pi(a|s,\theta)$ always exists and is finite. To ensure exploration, we generally require that the policy never becomes deterministic.
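As a concrete illustration, here is a minimal sketch of such a differentiable parameterization: a tabular softmax policy, one preference per state-action pair. All names and sizes below are hypothetical. For any finite weights this policy is differentiable and never exactly deterministic.

```python
import numpy as np

# Hypothetical tabular softmax policy: one preference theta[s, a] per
# state-action pair. pi(a|s, theta) is then differentiable in theta and,
# for finite theta, never assigns probability 0 or 1 to any action.
n_states, n_actions = 4, 3
rng = np.random.default_rng(0)
theta = rng.normal(size=(n_states, n_actions))

def pi(s, theta):
    """Action probabilities pi(.|s, theta) via a softmax over preferences."""
    prefs = theta[s] - theta[s].max()      # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def grad_log_pi(s, a, theta):
    """Gradient of log pi(a|s, theta) w.r.t. theta (same shape as theta)."""
    g = np.zeros_like(theta)
    g[s] = -pi(s, theta)                   # d log-softmax / d preferences
    g[s, a] += 1.0                         # ... equals 1{a=a'} - pi(.|s)
    return g
```

A quick sanity check is that $\sum_a \pi(a|s,\theta)\,\nabla_\theta \log\pi(a|s,\theta) = 0$, since the probabilities always sum to one.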

Without losing any meaningful generality, we assume that every episode starts in some particular state $s_0$. Then we define performance as:

$$\eta(\theta) = v_{\pi_\theta}(s_0)$$

The policy gradient theorem states that

$$\nabla \eta(\theta) = \sum_s d_\pi(s) \sum_a q_\pi(s,a)\, \nabla_\theta \pi(a|s,\theta)$$

where the gradients in all cases are the column vectors of partial derivatives with respect to the components of $\theta$, $\pi$ denotes the policy corresponding to the weight vector $\theta$, and the distribution $d_\pi(s)$ here is the expected number of time steps $t$ on which $S_t = s$ in a randomly generated episode starting in $s_0$ and following $\pi$ and the dynamics of the MDP.

REINFORCE

Now we have an exact expression for the gradient. We need some way of sampling whose expectation equals or approximates this expression. Notice that the right-hand side is a sum over states, weighted by how often the states occur under the target policy $\pi$, weighted again by $\gamma^t$ where $t$ is the number of steps it takes to reach those states. If we just follow $\pi$, we will encounter states in these proportions, which we can then weight by $\gamma^t$ to preserve the expected value:

$$\nabla \eta(\theta) = \mathbb{E}_\pi\!\left[\gamma^t \sum_a q_\pi(S_t,a)\, \nabla_\theta \pi(a|S_t,\theta)\right]$$

Then we replace the sum over actions by a sample, multiplying and dividing by $\pi(a|S_t,\theta)$ and using $\mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)$ in the last step:

$$\begin{aligned}
\nabla \eta(\theta) &= \mathbb{E}_\pi\!\left[\gamma^t \sum_a q_\pi(S_t,a)\, \pi(a|S_t,\theta)\, \frac{\nabla_\theta \pi(a|S_t,\theta)}{\pi(a|S_t,\theta)}\right] \\
&= \mathbb{E}_\pi\!\left[\gamma^t\, q_\pi(S_t,A_t)\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right] \\
&= \mathbb{E}_\pi\!\left[\gamma^t\, G_t\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right]
\end{aligned}$$
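The substitution of a sample for the sum over actions can be checked numerically. In this sketch (a hypothetical single-state example with made-up softmax preferences and action values), drawing $A \sim \pi$ and averaging $q(A)\,\nabla_\theta \log\pi(A|\theta)$ converges to the exact sum $\sum_a q(a)\,\nabla_\theta \pi(a|\theta)$:

```python
import numpy as np

# Numerical check of the sampling step, on a hypothetical single state with
# 3 actions; the preferences and q-values below are made up for illustration.
rng = np.random.default_rng(1)
theta = np.array([0.2, -0.5, 1.0])   # softmax preferences, one per action
q = np.array([1.0, 3.0, -2.0])       # assumed action values q(a)

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(a, theta):
    g = -pi(theta)
    g[a] += 1.0                      # gradient of log softmax: 1{a} - pi
    return g

probs = pi(theta)

# Exact sum over actions: sum_a q(a) grad pi(a) = sum_a pi(a) q(a) grad log pi(a)
exact = sum(probs[a] * q[a] * grad_log_pi(a, theta) for a in range(3))

# Sampled version: draw A ~ pi and average q(A) grad log pi(A)
actions = rng.choice(3, size=100_000, p=probs)
estimate = np.mean([q[a] * grad_log_pi(a, theta) for a in actions], axis=0)
```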

Using the sample to instantiate the generic stochastic gradient ascent algorithm, we obtain the update:
$$\theta_{t+1} = \theta_t + \alpha\, \gamma^t\, G_t\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}$$

It is evident that REINFORCE is a Monte Carlo policy gradient method. The update increases the weight vector in this direction in proportion to the return, and in inverse proportion to the action probability. The former makes sense because it causes the weights to move most in the directions that favor actions yielding the highest return. The latter makes sense because otherwise actions that are selected frequently would be at an advantage and might win out even if they do not yield the highest return.

The update law can also be written as:

$$\theta_{t+1} = \theta_t + \alpha\, \gamma^t\, G_t\, \nabla_\theta \log \pi(A_t|S_t,\theta)$$
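This update can be sketched on a hypothetical 2-armed bandit treated as a one-step episode, so $G_t$ is the single reward and $\gamma^t = 1$. The reward means, step size, and episode count below are all assumptions for illustration:

```python
import numpy as np

# Minimal REINFORCE sketch on a hypothetical 2-armed bandit treated as a
# one-step episode, so G_t is the single reward and gamma^t = 1.
rng = np.random.default_rng(2)
theta = np.zeros(2)                  # softmax preferences over two actions
alpha = 0.1
true_means = [0.0, 1.0]              # action 1 is better in expectation

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

for episode in range(3000):
    probs = pi(theta)
    a = rng.choice(2, p=probs)
    G = rng.normal(true_means[a], 1.0)   # return of the one-step episode
    grad_log = -probs
    grad_log[a] += 1.0                   # gradient of log softmax
    theta += alpha * G * grad_log        # the REINFORCE update (gamma^t = 1)
```

After training, the policy should strongly favour the action with the higher expected return.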

The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$:

$$\nabla \eta(\theta) = \sum_s d_\pi(s) \sum_a \bigl(q_\pi(s,a) - b(s)\bigr)\, \nabla_\theta \pi(a|s,\theta)$$

The baseline can be any function, even a random variable, as long as it does not vary with $a$. Then the update law becomes:

$$\theta_{t+1} = \theta_t + \alpha\, \gamma^t \bigl(G_t - b(S_t)\bigr)\, \nabla_\theta \log \pi(A_t|S_t,\theta)$$

The baseline leaves the expected value of the update unchanged, but it can have a large effect on its variance. One natural choice is an estimate of the state value, $\hat{v}(S_t,\mathbf{w})$.
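Both properties can be demonstrated numerically. In this sketch (a hypothetical single state with two actions and made-up return distributions that share a large common offset), the baselined and un-baselined updates have the same mean, but the baselined one has much lower variance:

```python
import numpy as np

# Illustration that a baseline leaves the expected update unchanged while
# reducing its variance. Hypothetical single state, two actions, and returns
# with a large common offset (which is what a good baseline subtracts away).
rng = np.random.default_rng(3)
probs = np.array([0.3, 0.7])
return_means = np.array([10.0, 12.0])
baseline = probs @ return_means          # v(s): expected return under pi

def grad_log(a):
    g = -probs
    g[a] += 1.0
    return g

no_base, with_base = [], []
for _ in range(50_000):
    a = rng.choice(2, p=probs)
    G = rng.normal(return_means[a], 1.0)
    no_base.append(G * grad_log(a))                  # plain REINFORCE sample
    with_base.append((G - baseline) * grad_log(a))   # baselined sample

no_base, with_base = np.array(no_base), np.array(with_base)
```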

Actor-Critic

Methods that learn approximations to both policy and value functions are called actor-critic methods. REINFORCE-with-baseline methods use the value function only as a baseline, not as a critic, i.e. not for bootstrapping. This is a useful distinction, for only through bootstrapping do we introduce bias and an asymptotic dependence on the quality of the function approximation.

Consider one-step actor-critic methods. The main appeal of one-step methods is that they are fully online and incremental, like TD(0), SARSA(0), and Q-learning. One-step actor-critic methods replace the full return of REINFORCE with the one-step return:

$$\theta_{t+1} = \theta_t + \alpha \bigl(R_{t+1} + \gamma \hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})\bigr)\, \nabla_\theta \log \pi(A_t|S_t,\theta)$$

The full algorithm can be formulated as follows:

  1. Initialize a differentiable policy parameterization $\pi(a|s,\theta)$.
  2. Initialize a differentiable state-value parameterization $\hat{v}(s,\mathbf{w})$.
  3. Set step sizes $\alpha > 0$, $\beta > 0$.
  4. Initialize policy weights $\theta$ and state-value weights $\mathbf{w}$.
  5. Repeat:
    1. Initialize $S$ (first state of the episode)
    2. $I = 1$
    3. While $S$ is not terminal:
      1. $A \sim \pi(\cdot|S,\theta)$
      2. Take action $A$, observe $S'$, $R$
      3. $\delta = R + \gamma \hat{v}(S',\mathbf{w}) - \hat{v}(S,\mathbf{w})$ (if $S'$ is terminal, $\hat{v}(S',\mathbf{w}) = 0$)
      4. $\mathbf{w} = \mathbf{w} + \beta\, \delta\, \nabla_{\mathbf{w}} \hat{v}(S,\mathbf{w})$
      5. $\theta = \theta + \alpha\, I\, \delta\, \nabla_\theta \log \pi(A|S,\theta)$
      6. $I = \gamma I$
      7. $S = S'$
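The steps above can be sketched end to end on a small environment. Here is a minimal version on a hypothetical 5-state chain (start in state 0, reward +1 on reaching the terminal state 4 by moving right); the tabular weights, step sizes, and episode count are all assumptions:

```python
import numpy as np

# One-step actor-critic following the pseudocode above, on a hypothetical
# 5-state chain: actions {0: left, 1: right}, reward +1 on reaching the
# terminal state 4. Tabular weights play the roles of theta and w.
rng = np.random.default_rng(4)
n_states, n_actions = 5, 2
theta = np.zeros((n_states, n_actions))   # actor: softmax preferences
w = np.zeros(n_states)                    # critic: tabular state values
alpha, beta, gamma = 0.2, 0.2, 0.95

def pi(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def step(s, a):
    s2 = max(s - 1, 0) if a == 0 else s + 1
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for episode in range(1000):
    s, I = 0, 1.0
    while s != n_states - 1:
        probs = pi(s)
        a = rng.choice(n_actions, p=probs)
        s2, r = step(s, a)
        v_next = 0.0 if s2 == n_states - 1 else w[s2]  # terminal value is 0
        delta = r + gamma * v_next - w[s]              # TD error (the critic)
        w[s] += beta * delta                           # critic update
        grad_log = -probs
        grad_log[a] += 1.0
        theta[s] += alpha * I * delta * grad_log       # actor update
        I *= gamma
        s = s2
```

With these (assumed) settings, the learned policy moves right from every non-terminal state.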

Parameterization for Continuous Actions

Policy-based methods offer practical ways of dealing with large action spaces, even continuous ones. The notes above deal with discrete actions; from here on we consider continuous actions.

Instead of computing learned probabilities for each of many actions, we learn the statistics of the probability distribution. Take the Gaussian distribution for example. The action set might be the real numbers, with actions chosen from a Gaussian distribution:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

The value $p(x)$ is the density of the probability at $x$.

To produce a policy parameterization, we can define the policy as the normal probability density over a real-valued scalar action, with mean and standard deviation given by parametric function approximators:

$$\pi(a|s,\theta) = \frac{1}{\sigma(s,\theta)\sqrt{2\pi}} \exp\!\left(-\frac{(a-\mu(s,\theta))^2}{2\sigma(s,\theta)^2}\right)$$

Then we divide the policy weight vector into two parts: $\theta = [\theta_\mu, \theta_\sigma]^T$. The mean can be approximated as a linear function:

$$\mu(s,\theta) = \theta_\mu^T \phi(s)$$

The standard deviation must always be positive and is better approximated as the exponential of a linear function:

$$\sigma(s,\theta) = \exp\!\left(\theta_\sigma^T \phi(s)\right)$$

where $\phi(s)$ is a basis function of some type. With these definitions, all the algorithms described above can be applied to learn to select real-valued actions.
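The parameterization above can be sketched directly. The feature map `phi` below is an assumption standing in for "a basis function of some type", and all sizes are hypothetical:

```python
import numpy as np

# Sketch of the Gaussian policy parameterization above. The cosine feature
# map phi and all sizes are assumptions made for illustration.
rng = np.random.default_rng(5)
n_features = 8
theta_mu = rng.normal(scale=0.1, size=n_features)     # mean weights
theta_sigma = rng.normal(scale=0.1, size=n_features)  # log std-dev weights

def phi(s):
    return np.cos(np.arange(n_features) * s)          # assumed basis function

def policy_params(s):
    mu = theta_mu @ phi(s)                 # linear mean
    sigma = np.exp(theta_sigma @ phi(s))   # exponential keeps sigma > 0
    return mu, sigma

def sample_action(s):
    mu, sigma = policy_params(s)
    return rng.normal(mu, sigma)           # a ~ pi(.|s, theta)

def log_pi(a, s):
    mu, sigma = policy_params(s)
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
```

The exponential in `policy_params` guarantees a positive standard deviation for any weights, which is the reason for parameterizing $\sigma$ this way.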

Summary

Actor-critic, a branch of TD methods, keeps a separate policy representation independent of the value function. The policy is called the actor and the value function is called the critic. One advantage of a separate policy representation is that when there are many actions, or the action space is continuous, there is no need to consider every action's Q-value in order to select one. A second advantage is that such methods can learn stochastic policies naturally. Furthermore, prior knowledge about policy constraints can be incorporated.
