Thanks R. S. Sutton and A. G. Barto for their great work in Reinforcement Learning: An Introduction.

Eligibility Traces in Prediction Problems

In the backward view of TD(λ), there is a memory variable associated with each state: its eligibility trace. The eligibility trace for state s at time t is a random variable denoted Z_t(s) ∈ R+. On each step, the eligibility traces for all states decay by γλ, and the eligibility trace for the one state visited on the step is incremented by 1:

$$Z_t(s)=\begin{cases}\gamma\lambda Z_{t-1}(s) & s\neq S_t\\ \gamma\lambda Z_{t-1}(s)+1 & s=S_t\end{cases}$$

for all nonterminal states s, where γ is the discount rate and λ is the λ-return¹ parameter or trace-decay parameter. This kind of eligibility trace is called an accumulating trace. The global TD error signal triggers proportional updates to all recently visited states, as signaled by their nonzero traces:

$$\Delta V_t(s)=\alpha\delta_t Z_t(s),\quad\forall s\in\mathcal{S}$$

where

$$\delta_t=R_{t+1}+\gamma V_t(S_{t+1})-V_t(S_t)$$

A complete prediction algorithm for on-line TD(λ) is given as follows:

Initialize V(s) arbitrarily.
Repeat (for each episode):
    Initialize Z(s)=0, for all s∈S.
    Initialize S.
    Repeat (for each step of episode):
        A ← action given by π for S
        Take action A; observe reward R and the next state S′
        δ ← R + γV(S′) − V(S)
        Z(S) ← Z(S) + 1
        For all s∈S:
            V(s) ← V(s) + αδZ(s)
            Z(s) ← γλZ(s)
        S ← S′
    until S is terminal

Sarsa(λ)

The main idea of control is simply to learn action values rather than state values. The idea in Sarsa(λ) is to apply the TD(λ) prediction method to state-action pairs. Let Z_t(s,a) denote the trace for the state-action pair (s,a). Substituting state-action variables for state variables gives

$$Q_{t+1}(s,a)=Q_t(s,a)+\alpha\delta_t Z_t(s,a),\quad\forall s,a$$

where

$$\delta_t=R_{t+1}+\gamma Q_t(S_{t+1},A_{t+1})-Q_t(S_t,A_t)$$

and

$$Z_t(s,a)=\begin{cases}\gamma\lambda Z_{t-1}(s,a)+1 & s=S_t,\ a=A_t\\ \gamma\lambda Z_{t-1}(s,a) & \text{otherwise}\end{cases}\quad\text{for all } s,a$$

The complete Sarsa(λ) algorithm is given as follows:

Initialize Q(s,a) arbitrarily, for all s∈S, a∈A(s).
Repeat (for each episode):
    Z(s,a)=0, for all s∈S, a∈A(s).
    Initialize S, A.
    Repeat (for each step of episode):
        Take action A; observe R, S′.
        Choose A′ from S′ using the policy derived from Q.
        δ ← R + γQ(S′,A′) − Q(S,A)
        Z(S,A) ← Z(S,A) + 1
        For all s∈S, a∈A(s):
            Q(s,a) ← Q(s,a) + αδZ(s,a)
            Z(s,a) ← γλZ(s,a)
        S ← S′; A ← A′
    until S is terminal

Q(λ)

Two different methods have been proposed that combine eligibility traces and Q-learning: Watkins's Q(λ) and Peng's Q(λ). Here we focus on Watkins's Q(λ) and give only a brief description of Peng's Q(λ). Unlike TD(λ) and Sarsa(λ), Watkins's Q(λ) does not look ahead all the way to the end of the episode in its backup. It only looks ahead as far as the next exploratory action. For example, suppose the very next action, A_{t+1}, is exploratory. Watkins's Q(λ) would still do the one-step update of Q_t(S_t,A_t) toward R_{t+1} + γ max_a Q(S_{t+1},a). In general, if A_{t+n} is the first exploratory action, then the longest backup is toward

$$R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{n-1}R_{t+n}+\gamma^n\max_a Q(S_{t+n},a)$$

where we assume off-line updating. The mechanistic or backward view of Watkins's Q(λ) is also very simple. Eligibility traces are used just as in Sarsa(λ), except that they are set to zero whenever an exploratory action is taken. That is to say: first, the traces for all state-action pairs are either decayed by γλ or, if an exploratory action was taken, set to 0; second, the trace corresponding to the current state and action is incremented by 1, i.e.

$$Z_t(s,a)=I_{sS_t}\cdot I_{aA_t}+\begin{cases}\gamma\lambda Z_{t-1}(s,a) & \text{if } Q_{t-1}(S_t,A_t)=\max_a Q_{t-1}(S_t,a)\\ 0 & \text{otherwise}\end{cases}$$

where I_{xy} is an indicator function, equal to 1 if x=y and 0 otherwise. The rest of the algorithm is defined by

$$Q_{t+1}(s,a)=Q_t(s,a)+\alpha\delta_t Z_t(s,a),\quad\forall s\in\mathcal{S},\ a\in\mathcal{A}(s)$$

where

$$\delta_t=R_{t+1}+\gamma\max_{a'}Q_t(S_{t+1},a')-Q_t(S_t,A_t)$$

The complete algorithm is given below:

Initialize Q(s,a) arbitrarily, for all s∈S, a∈A(s).
Repeat (for each episode):
    Z(s,a)=0, for all s∈S, a∈A(s).
    Initialize S, A.
    Repeat (for each step of episode):
        Take action A; observe R, S′.
        Choose A′ from S′ using the policy derived from Q.
        A* ← argmax_a Q(S′,a) (if A′ ties for the max, then A* ← A′)
        δ ← R + γQ(S′,A*) − Q(S,A)
        Z(S,A) ← Z(S,A) + 1
        For all s∈S, a∈A(s):
            Q(s,a) ← Q(s,a) + αδZ(s,a)
            If A′ = A*, then Z(s,a) ← γλZ(s,a); else Z(s,a) ← 0.
        S ← S′; A ← A′
    until S is terminal

Unfortunately, cutting off traces every time an exploratory action is taken loses much of the advantage of using eligibility traces. Peng's Q(λ) is an alternative version of Q(λ) meant to remedy this, which can be thought of as a hybrid of Sarsa(λ) and Watkins's Q(λ). In Peng's Q(λ), there is no distinction between exploratory and greedy actions. Each component backup is over many steps of actual experience, and all but the last are capped by a final maximization over actions. For a fixed nongreedy policy, Q_t converges to neither q_π nor q_*, but to some hybrid of the two. However, if the policy is gradually made more greedy, then the method may still converge to q_*.

Replacing Traces

In some cases significantly better performance can be obtained by using a slightly modified kind of trace known as the replacing trace:

$$Z_t(s)=\begin{cases}\gamma\lambda Z_{t-1}(s) & s\neq S_t\\ 1 & s=S_t\end{cases}$$

There are several possible ways to generalize replacing eligibility traces for use in control methods. In some cases, the state-action traces are updated by the following:

$$Z_t(s,a)=\begin{cases}1 & s=S_t,\ a=A_t\\ 0 & s=S_t,\ a\neq A_t\\ \gamma\lambda Z_{t-1}(s,a) & s\neq S_t\end{cases}$$

¹ λ-return: $G_t^\lambda=(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}$
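The on-line TD(λ) prediction algorithm above can be sketched in a few lines. This is a minimal tabular version; the random-walk chain, its rewards, and the random policy are assumed toy choices, not from the text:

```python
import random

def td_lambda(episodes, n_states=5, alpha=0.1, gamma=1.0, lam=0.9, seed=0):
    """Tabular on-line TD(lambda) with accumulating traces on a random-walk
    chain: states 0..n_states-1, terminal off both ends, reward +1 only
    when stepping off the right end (an assumed toy task)."""
    rng = random.Random(seed)
    V = [0.0] * n_states
    for _ in range(episodes):
        Z = [0.0] * n_states              # traces reset at episode start
        s = n_states // 2                 # start in the middle
        while True:
            s2 = s + rng.choice([-1, 1])  # equiprobable random policy
            done = s2 < 0 or s2 >= n_states
            r = 1.0 if s2 >= n_states else 0.0
            delta = r + (0.0 if done else gamma * V[s2]) - V[s]
            Z[s] += 1.0                   # accumulating trace: increment visited state
            for i in range(n_states):     # proportional update of all states
                V[i] += alpha * delta * Z[i]
                Z[i] *= gamma * lam       # then decay every trace by gamma*lambda
            if done:
                break
            s = s2
    return V

values = td_lambda(2000)
```

For this walk the true values are (i+1)/6 for states i = 0..4, so the learned values should increase from left to right and the middle state should sit near 0.5, up to step-size noise.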
Thanks Richard S. Sutton and Andrew G. Barto for their great work in Reinforcement Learning: An Introduction. We focus on the episodic case only and deal with continuous state and action spaces. We suppose that you already have basic knowledge of TD(0) methods in reinforcement learning.

Stochastic-gradient and Semi-gradient Methods

Stochastic-gradient methods are among the most widely used of all function approximation methods and are particularly well suited to online reinforcement learning. Let v̂(s,θ) be the approximation of the true value function v(s), with real-valued components θ=(θ₁,θ₂,…,θ_n)^T, for all s∈S. We assume that states appear in examples with the same distribution d, over which we are trying to minimize the Mean Squared Value Error:

$$\text{MSVE}(\theta)=\sum_{s\in\mathcal{S}}d(s)\left[v_\pi(s)-\hat{v}(s,\theta)\right]^2$$

A straightforward strategy is to try to minimize the error on the observed examples. Stochastic-gradient methods do this by adjusting the weight vector after each example by a small amount in the direction that would most reduce the error on that example:

$$\theta_{t+1}=\theta_t-\tfrac{1}{2}\alpha\nabla_\theta\left[v_\pi(S_t)-\hat{v}(S_t,\theta_t)\right]^2=\theta_t+\alpha\left[v_\pi(S_t)-\hat{v}(S_t,\theta_t)\right]\nabla_\theta\hat{v}(S_t,\theta_t)$$

SGD methods are gradient-descent methods because the overall step in θ_t is proportional to the negative gradient of the example's squared error, which is the direction in which the error falls most rapidly. Meanwhile, stochastic means that the update is done on only a single example, which might have been selected stochastically. Why do we take only a small step in the direction of the gradient? Because we do not seek a value function that has zero error for all states, but only an approximation that balances the errors in different states. In fact, the convergence results for SGD methods assume that α decreases over time; in that case, the SGD method is guaranteed to converge to a local optimum. However, we do not know the true value function vπ(s) in practice, so it is necessary to substitute an estimate U_t in place of vπ(S_t).
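With linear features φ(s), the SGD update above can be sketched as follows, using a sampled target (such as the Monte Carlo return G_t) in place of vπ(S_t); the feature vector and target value are assumed toy data:

```python
import numpy as np

def sgd_value_update(theta, phi_s, target, alpha=0.1):
    """One SGD step for a linear value function v_hat(s) = theta . phi(s),
    toward a sampled target standing in for v_pi(s); for a linear
    approximator the gradient of v_hat is simply phi(s)."""
    v_hat = theta @ phi_s
    return theta + alpha * (target - v_hat) * phi_s

theta = np.zeros(3)
phi = np.array([1.0, 0.0, 1.0])   # assumed binary feature vector of one state
for _ in range(200):               # repeatedly regress toward target 2.0
    theta = sgd_value_update(theta, phi, target=2.0)
# theta @ phi converges toward the target 2.0
```

Each step shrinks the error by a constant factor here, so after a few hundred updates the approximate value at this state matches the target to high precision.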
If U_t is an unbiased estimate, that is, if E[U_t]=vπ(S_t) for each t, then θ_t is guaranteed to converge to a local optimum. For example, the Monte Carlo target U_t=G_t is by definition an unbiased estimate of vπ(S_t). If a bootstrapping estimate of vπ(S_t) is used as the target U_t (such as U_t=R_{t+1}+γv̂(S_{t+1},θ_t)), the same guarantees do not apply, because the target itself depends on θ_t. Bootstrapping methods take into account the effect of changing the weight vector θ_t on the estimate, but ignore its effect on the target. They include only a part of the gradient and, accordingly, we call them semi-gradient methods. Although semi-gradient methods do not converge as robustly as gradient methods, they do converge reliably in important cases such as the linear case.

Episodic Semi-gradient Control

For control rather than prediction, it is the action-value function q(S_t,A_t) rather than the state-value function v(S_t) that matters. The general gradient-descent update for action-value prediction is

$$\theta_{t+1}=\theta_t+\alpha\left[U_t-\hat{q}(S_t,A_t,\theta_t)\right]\nabla\hat{q}(S_t,A_t,\theta_t)$$

For example, the update for the one-step SARSA method is

$$\theta_{t+1}=\theta_t+\alpha\left[R_{t+1}+\gamma\hat{q}(S_{t+1},A_{t+1},\theta_t)-\hat{q}(S_t,A_t,\theta_t)\right]\nabla\hat{q}(S_t,A_t,\theta_t)$$

We call this method episodic semi-gradient one-step SARSA. For a constant policy, this method converges in the same way that TD(0) does, with the same kind of error bound:

$$\text{MSVE}(\theta_{TD})\leq\frac{1}{1-\gamma}\min_\theta\text{MSVE}(\theta)$$

n-Step Semi-gradient SARSA

We can obtain an n-step version of episodic semi-gradient SARSA by using an n-step return as the update target:

$$G_t^{(n)}=R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{n-1}R_{t+n}+\gamma^n\hat{q}(S_{t+n},A_{t+n},\theta_{t+n-1}),\quad n\geq 1,\ 0\leq t<T-n$$

The n-step update equation is

$$\theta_{t+n}=\theta_{t+n-1}+\alpha\left[G_t^{(n)}-\hat{q}(S_t,A_t,\theta_{t+n-1})\right]\nabla\hat{q}(S_t,A_t,\theta_{t+n-1})$$

Performance is typically best at an intermediate level of bootstrapping, corresponding to an n larger than 1.

The λ-Return

To begin with, we note that a valid update can be done not just toward any n-step return, but toward any average of n-step returns.
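An n-step return of this kind can be assembled from the stored rewards and a bootstrap value by folding in rewards from the far end; a minimal sketch (the rewards and bootstrap value are toy data):

```python
def n_step_return(rewards, bootstrap_value, gamma=0.9):
    """Compute R_{t+1} + gamma R_{t+2} + ... + gamma^{n-1} R_{t+n}
    + gamma^n * q_hat(S_{t+n}, A_{t+n}), with n = len(rewards)."""
    G = bootstrap_value
    for r in reversed(rewards):   # fold rewards in from the far end
        G = r + gamma * G
    return G

G3 = n_step_return([1.0, 0.0, 2.0], bootstrap_value=5.0, gamma=0.5)
# 1.0 + 0.5*0.0 + 0.25*2.0 + 0.125*5.0 = 2.125
```

Any set of such returns may then be combined with nonnegative weights summing to one and still form a valid update target.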
The composite return possesses an error reduction property similar to that of individual n-step returns and thus can be used to construct backups with guaranteed convergence properties. A backup that averages simpler component backups is called a compound backup. The λ-return is defined by

$$G_t^\lambda=(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}$$

which can be understood as one particular way of averaging n-step backups. This average contains all the n-step backups, each weighted proportionally to λ^{n−1}, where λ∈[0,1], and normalized by a factor of 1−λ to ensure that the weights sum to 1. After a terminal state (if one exists) has been reached, all subsequent n-step returns are equal to G_t. If we want, we can separate these post-termination terms from the main sum, yielding

$$G_t^\lambda=(1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{n-1}G_t^{(n)}+\lambda^{T-t-1}G_t$$

Now it is time to define our first learning algorithm based on the λ-return: the off-line λ-return algorithm. As an off-line algorithm, it makes no changes to the weight vector during the episode. Then, at the end of the episode, a whole sequence of off-line updates is made according to the usual semi-gradient rule:

$$\theta_{t+1}=\theta_t+\alpha\left[G_t^\lambda-\hat{v}(S_t,\theta_t)\right]\nabla\hat{v}(S_t,\theta_t)$$

This approach is what we call the theoretical, or forward, view of a learning algorithm: for each state visited, we look forward in time to all the future rewards and decide how best to combine them.

Eligibility Traces

Eligibility traces are one of the basic mechanisms of reinforcement learning. Almost any temporal-difference method, such as Q-learning or SARSA, can be combined with eligibility traces to obtain a more general method that may learn more efficiently. What eligibility traces offer beyond the forward view is an elegant algorithmic mechanism with significant computational advantages: only a single trace vector is required rather than a store of the last n feature vectors, and learning occurs continually and uniformly in time rather than being delayed and then catching up at the end of the episode.
The mechanism is a short-term memory vector, the eligibility trace e_t∈R^n, that parallels the long-term memory weight vector θ_t∈R^n. The rough idea is that when a component of θ_t participates in producing an estimated value, the corresponding component of e_t is bumped up and then begins to fade away. Learning will then occur in that component of θ_t if a nonzero TD error occurs before the trace falls back to zero. The trace-decay parameter λ∈[0,1] determines the rate at which the trace falls.

The idea behind eligibility traces is very simple. Each time a state is visited it initializes a short-term memory process, a trace, which then decays gradually over time. This trace marks the state as eligible for learning. If an unexpectedly good or bad event occurs while the trace is non-zero, then the state is assigned credit accordingly. There are two kinds of traces: the accumulating trace and the replacing trace. In an accumulating trace, the trace builds up each time the state is entered. In a replacing trace, on the other hand, each time the state is visited the trace is reset to 1 regardless of the presence of a prior trace. Typically, eligibility traces decay exponentially according to the product of a decay parameter λ, 0≤λ≤1, and a discount-rate parameter γ, 0≤γ≤1. The accumulating trace is defined by

$$e_{t+1}(s)=\begin{cases}\gamma\lambda e_t(s), & s\neq s_t\\ \gamma\lambda e_t(s)+1, & s=s_t\end{cases}$$

where e_t(s) represents the trace for state s at time t and s_t is the actual state at time t. The corresponding replacing trace is defined by

$$e_{t+1}(s)=\begin{cases}\gamma\lambda e_t(s), & s\neq s_t\\ 1, & s=s_t\end{cases}$$

In the gradient-descent case, the (accumulating) eligibility trace is initialized to zero at the beginning of the episode, is incremented on each time step by the value gradient, and then fades away by γλ:

$$e_0=\mathbf{0},\qquad e_t=\nabla\hat{v}(S_t,\theta_t)+\gamma\lambda e_{t-1}$$

The eligibility trace keeps track of which components of the weight vector have contributed, positively or negatively, to recent state valuations, where "recent" is defined in terms of γλ.
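The two tabular trace updates above can be illustrated side by side; a minimal sketch with two states, repeatedly visiting state 0:

```python
def accumulating_trace(e, s_t, gamma=0.9, lam=0.8):
    """Decay all traces by gamma*lambda, then add 1 for the visited state."""
    e = [gamma * lam * x for x in e]
    e[s_t] += 1.0
    return e

def replacing_trace(e, s_t, gamma=0.9, lam=0.8):
    """Decay all traces, but reset the visited state's trace to exactly 1."""
    e = [gamma * lam * x for x in e]
    e[s_t] = 1.0
    return e

e_acc = [0.0, 0.0]
e_rep = [0.0, 0.0]
for _ in range(3):                  # visit state 0 three times in a row
    e_acc = accumulating_trace(e_acc, 0)
    e_rep = replacing_trace(e_rep, 0)
# e_acc[0] builds up past 1 (1.0 -> 1.72 -> 2.2384), e_rep[0] stays exactly 1
```

On repeated visits the accumulating trace can exceed 1, giving a frequently visited state extra credit, while the replacing trace caps it at 1.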
TD(λ) Methods

TD(λ) improves over the off-line λ-return algorithm in three ways: it updates the weight vector on every step of an episode rather than only at the end; its computations are equally distributed in time rather than all at the end of the episode; and it can be applied to continuing problems rather than just episodic problems. To begin with, the TD error for state-value prediction is

$$\delta_t=R_{t+1}+\gamma\hat{v}(S_{t+1},\theta_t)-\hat{v}(S_t,\theta_t)$$

In TD(λ), the weight vector is updated on each step proportionally to the scalar TD error and the vector eligibility trace:

$$\theta_{t+1}=\theta_t+\alpha\delta_t e_t$$

TD(λ) is oriented backward in time. At each moment we look at the current TD error and assign it backward to each prior state according to how much that state contributed to the current eligibility trace at that time. The complete update rule of TD(λ) (with so-called accumulating traces) is as follows:

For each episode:
    e = 0; initialize S.
    For each step of the episode:
        Choose A ∼ π(·|S); take A, observe R, S′
        e ← γλe + ∇v̂(S,θ)
        δ ← R + γv̂(S′,θ) − v̂(S,θ)
        θ ← θ + αδe
        S ← S′
    until S is terminal

In another variant of the algorithm, the eligibility traces are updated according to

$$e=\max\left(\gamma\lambda e,\ \nabla\hat{v}(S,\theta)\right)$$

This is called the replacing-traces update. TD(0) is the case in which only the one state preceding the current one is changed by the TD error. For larger values of λ, but still λ<1, more of the preceding states are changed, but each more temporally distant state is changed less because the corresponding eligibility trace is smaller: we say that the earlier states are given less credit for the TD error. TD(1) is a way of implementing Monte Carlo algorithms. On-line TD(1) learns in an n-step TD way from the incomplete ongoing episode, where the n steps are all the way up to the current step. If something unusually good or bad happens during an episode, control methods based on TD(1) can learn immediately and alter their behavior on that same episode.
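With linear features, one step of the TD(λ) loop above can be sketched as follows; the two-state episode and its rewards are assumed toy data:

```python
import numpy as np

def td_lambda_step(theta, e, phi_s, phi_s2, r, done, alpha=0.1, gamma=0.9, lam=0.8):
    """One semi-gradient TD(lambda) update with linear v_hat(s) = theta . phi(s);
    for a linear approximator the gradient of v_hat is the feature vector."""
    e = gamma * lam * e + phi_s                 # accumulating eligibility trace
    v_s2 = 0.0 if done else theta @ phi_s2      # terminal states have value 0
    delta = r + gamma * v_s2 - theta @ phi_s
    theta = theta + alpha * delta * e
    return theta, e

theta = np.zeros(2)
phi_a, phi_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# a two-step episode a -> b -> terminal with rewards 0 and 1 (assumed toy data)
for _ in range(500):
    e = np.zeros(2)                             # traces reset each episode
    theta, e = td_lambda_step(theta, e, phi_a, phi_b, r=0.0, done=False)
    theta, e = td_lambda_step(theta, e, phi_b, phi_b, r=1.0, done=True)
# theta converges toward the true values: v(a) = 0.9, v(b) = 1.0
```

Because the features here are one-hot, this reduces to the tabular TD(λ) of the first note, which makes the fixed point easy to verify by hand.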
Thanks Hado van Hasselt for the great work.

Introduction

In problems of sequential decision making in continuous domains with delayed reward signals, the main purpose of the algorithms is to learn how to choose actions from an infinitely large action space to optimize a noisy, delayed, cumulative reward signal in an infinitely large state space, where even the outcome of a single action can be stochastic. Here we assume that a model of the environment is not known. Since analytically computing a good policy from a continuous model can be infeasible, we focus on methods that explicitly update a representation of a value function, a policy, or both. In addition, we focus mainly on the problem of control, which means we want to find action-selection policies that yield high returns, as opposed to the problem of prediction, which aims to estimate the value of a given policy. Some valuable books are given below (we suppose that you are familiar with Sutton and Barto's book):

Szepesvári, C.: Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 4(1), 1-103 (2010).
Busoniu, L., Babuska, R., De Schutter, B., Ernst, D.: Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, Boca Raton (2010).

Function Approximation

In this section, we mostly limit ourselves to the general functional form of the approximators and general methods to update the parameters. Non-parametric approaches, such as kernel-based methods, are not included. Turn to the following reference for more details if you are interested in kernel-based methods:

Ormoneit, D., Sen, S.: Kernel-based reinforcement learning. Machine Learning 49(2), 161-178 (2002).

Linear Function Approximation

A linear function is a simple parametric function that depends on the feature vector.
For instance, consider a value-approximation algorithm where the value function is approximated by

$$V_t(s)=\theta_t^T\phi(s)$$

A common method to find features for a linear function approximator divides the continuous state space into separate segments and attaches one feature to each segment, e.g. tile coding. However, one potential problem with discretizing methods such as tile coding is that the resulting function that maps states into features is not injective, i.e. ϕ(s)=ϕ(s′) does not imply that s=s′. Another issue is related to the step-size parameter that many algorithms use: in general, it is often a good idea to make sure that

$$\|\phi(s)\|=\sum_{k=1}^{D_\phi}\phi_k(s)\leq 1$$

for all s. A final issue is that discretization introduces discontinuities in the approximated function.

Some of the issues mentioned above can be tackled by using so-called fuzzy sets. A fuzzy set is a generalization of a normal set to fuzzy membership: a state may belong partially to the set defined by feature ϕ_i and partially to the set defined by ϕ_j. An advantage of this view is that it is quite natural to assume that ∑_k ϕ_k(s)≤1, since each part of an element can belong to only one set. It is possible to define the sets such that each combination of feature activations corresponds precisely to one single state, thereby avoiding the partial-observability problem sketched earlier. A common choice is to use triangular functions that are equal to one at the center of the corresponding feature and decay linearly to zero for states further from the center. A drawback of fuzzy sets is that these sets still need to be defined beforehand, which may be difficult.

Non-linear Function Approximation

In a parametric non-linear function approximator, the function that should be optimized is represented by some predetermined parameterized function.
For instance, for value-based algorithms we may have

$$V_t(s)=V(\phi(s),\theta_t)$$

In general, a non-linear function approximator may approximate an unknown function with better accuracy than a linear function approximator that uses the same input features. A drawback of non-linear function approximation is that fewer convergence guarantees can be given.

Updating Parameters

Gradient Descent

A gradient-descent update follows the direction of the negative gradient of some parameterized function that we want to minimize. The gradient of a parameterized function is a vector in parameter space that points in the direction in which the function increases most rapidly, so stepping along the negative gradient decreases the function. Because the gradient only describes the local shape of the function, this algorithm can end up in a local minimum. For instance, using an error measure such as a temporal-difference or prediction error, e.g.

$$E(s_t,a_t,\theta_t)=\left(R(s_t,a_t,\theta_t)-r_{t+1}\right)^2$$

we update the parameters with

$$\theta_{t+1}=\theta_t-\alpha_t\nabla_\theta E(x,\theta_t)$$

If the parameter space is a curved space, it is more appropriate to use the weighted distance metric dθ^T G dθ, where G is a P×P positive semi-definite matrix. With this weighted distance metric, the direction of steepest descent becomes

$$\tilde{\nabla}_\theta E(x,\theta)=G^{-1}\nabla_\theta E(x,\theta)$$

which is known as the natural gradient. In general, the best choice for the matrix G depends on the functional form of E.

Gradient-Free Optimization

Gradient-free methods are useful when the function that is optimized is not differentiable or when it is expected that many local optima exist. Many general global optimization methods exist, including evolutionary algorithms; details about evolutionary algorithms are beyond the scope of this note.

Approximate Reinforcement Learning

Value Approximation

In value-approximation algorithms, experience samples are used to update a value function that gives an approximation of the value of the current or the optimal policy. Many reinforcement learning algorithms fall into this category.
Important differences between algorithms within this category are whether they are on-policy or off-policy and whether they update online or offline. Online algorithms are sometimes more sample-efficient in control problems. In order to update a value with gradient descent, we must choose some measure of error that we can minimize; this measure is often referred to as the objective function. We generalize standard temporal-difference learning to a gradient update on the parameters of a function approximator. The tabular TD-learning update is

$$V_{t+1}(s_t)=V_t(s_t)+\alpha_t(s_t)\delta_t$$

with

$$\delta_t=r_{t+1}+\gamma V_t(s_{t+1})-V_t(s_t)$$

where α_t(s)∈[0,1] is a step-size parameter. When the state values are stored in a table, TD-learning can be interpreted as a stochastic gradient-descent update on the one-step temporal-difference error:

$$E(s_t)=\frac{1}{2}\left(r_{t+1}+\gamma V_t(s_{t+1})-V_t(s_t)\right)^2=\frac{1}{2}\delta_t^2$$

However, if V_t is a parametrized function such that V_t(s)=V(s,θ_t), the negative gradient with respect to the parameters is given by

$$-\nabla_\theta E(s_t,\theta)=-\left(r_{t+1}+\gamma V_t(s_{t+1})-V_t(s_t)\right)\nabla_\theta\left(r_{t+1}+\gamma V_t(s_{t+1})-V_t(s_t)\right)$$

If we instead interpret r_{t+1}+γV_t(s_{t+1}) as a stochastic approximation of Vπ(s_t) that does not depend on θ, then

$$-\nabla_\theta E(s_t,\theta)=\delta_t\nabla_\theta V_t(s_t)$$

This implies the parameters can be updated as

$$\theta_{t+1}=\theta_t+\alpha_t\delta_t\nabla_\theta V_t(s_t)$$

Similarly, the updates for action-value algorithms use

$$\delta_t=r_{t+1}+\gamma Q_t(s_{t+1},a_{t+1})-Q_t(s_t,a_t)\quad\text{or}\quad\delta_t=r_{t+1}+\gamma\max_a Q_t(s_{t+1},a)-Q_t(s_t,a_t)$$

for SARSA and Q-learning, respectively. We can also incorporate accumulating eligibility traces with trace parameter λ with the following two equations:

$$e_{t+1}=\lambda\gamma e_t+\nabla_\theta V_t(s_t),\qquad\theta_{t+1}=\theta_t+\alpha_t\delta_t e_{t+1}$$

Parameters updated with the equations above may diverge when off-policy updates are used. This holds for any temporal-difference method with λ<1. In other words, if we sample transitions from a distribution that does not comply completely with the state-visit probabilities that would occur under the estimation policy, the parameters of the function may diverge.
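A sketch of the Q-learning variant with a linear approximator; the one-hot features and the one-state toy data are assumptions:

```python
import numpy as np

def q_learning_step(theta, phi, s, a, r, s2, done, n_actions, alpha=0.1, gamma=0.9):
    """Semi-gradient Q-learning with linear Q(s, a) = theta . phi(s, a):
    delta uses the max over next actions, the update follows grad Q = phi."""
    q_sa = theta @ phi(s, a)
    q_next = 0.0 if done else max(theta @ phi(s2, b) for b in range(n_actions))
    delta = r + gamma * q_next - q_sa
    return theta + alpha * delta * phi(s, a)

phi = lambda s, a: np.eye(2)[a]   # assumed one-hot state-action features
theta = np.zeros(2)
# toy data: one state, episode ends immediately; action 0 pays 1, action 1 pays 0
for _ in range(300):
    theta = q_learning_step(theta, phi, 0, 0, 1.0, 0, True, n_actions=2)
    theta = q_learning_step(theta, phi, 0, 1, 0.0, 0, True, n_actions=2)
# theta approaches [1.0, 0.0]
```

As stated above, such semi-gradient updates are only known to be stable when transitions are sampled on-policy; with off-policy sampling and λ<1 the parameters may diverge.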
This is unfortunate, because in the control setting we ultimately want to learn about the unknown optimal policy. Recently, a class of algorithms has been proposed to deal with this issue:

Maei, H.R., Sutton, R.S.: GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In: Proceedings of the Third Conference on Artificial General Intelligence (AGI-2010), pp. 91-96. Atlantis Press, Lugano (2010).
Sutton, R.S., Maei, H.R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., Wiewiora, E.: Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), pp. 993-1000. ACM (2009).

It is nontrivial to extend the standard online temporal-difference algorithms to continuous action spaces. The algorithms in the next section are usually much better suited for use in problems with continuous actions.

Policy Approximation

If the action space is continuous, finding the greedy action in each state can be nontrivial and time-consuming.

Policy Gradient Algorithms

The idea of policy-gradient algorithms is to update the policy with gradient ascent on the expected cumulative value Vπ. If the gradient is known, we can update the policy parameters with

$$\psi_{k+1}=\psi_k+\beta_k\nabla_\psi E\left[V^\pi(s_t)\right]=\psi_k+\beta_k\nabla_\psi\int_{s\in S}P(s_t=s)V^\pi(s)\,ds$$

As a practical alternative, we can use stochastic gradient descent:

$$\psi_{t+1}=\psi_t+\beta_t(s_t)\nabla_\psi V^\pi(s_t)$$

Such procedures can at best hope to find a local optimum, because they use the gradient of a value function that is usually not convex with respect to the policy parameters. Define a trajectory T as a sequence of states and actions. Then

$$\nabla_\psi V^\pi(s)=\int_T\nabla_\psi P(T|s,\psi)\,E\left\{\sum_{t=0}^{\infty}\gamma^t r_{t+1}\,\middle|\,T\right\}dT=\int_T P(T|s,\psi)\,\nabla_\psi\log P(T|s,\psi)\,E\left\{\sum_{t=0}^{\infty}\gamma^t r_{t+1}\,\middle|\,T\right\}dT=E\left\{\nabla_\psi\log P(T|s,\psi)\,E\left\{\sum_{t=0}^{\infty}\gamma^t r_{t+1}\,\middle|\,T\right\}\,\middle|\,s,\psi\right\}$$

Moreover, since only the policy terms depend on ψ,

$$\nabla_\psi\log P(T|s,\psi)=\sum_{t=0}^{\infty}\nabla_\psi\log\pi(s_t,a_t,\psi)$$

This only holds if the policy is stochastic.
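As a toy illustration of the log-likelihood-ratio estimator above, here is a minimal Monte Carlo policy-gradient sketch with a Gaussian policy over a one-dimensional action; the one-step task and its reward function are assumptions, not from the text:

```python
import random

def mc_policy_gradient(episodes=5000, beta=0.01, sigma=1.0, seed=0):
    """Monte Carlo policy gradient for a one-step task with a Gaussian
    policy a ~ N(psi, sigma^2), so grad_psi log pi(a) = (a - psi) / sigma^2.
    The reward r = -(a - 3)^2 is an assumed toy objective, maximized at a = 3."""
    rng = random.Random(seed)
    psi = 0.0
    for _ in range(episodes):
        a = rng.gauss(psi, sigma)                  # sample from the stochastic policy
        r = -(a - 3.0) ** 2                        # observed one-step return
        psi += beta * r * (a - psi) / sigma ** 2   # r * grad log pi, ascent step
    return psi

psi = mc_policy_gradient()
# psi drifts toward the reward-maximising action near 3
```

Note the requirement σ > 0 here: the log-gradient estimator is only defined for a stochastic policy.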
In most cases, this is not a big problem, for stochastic policies are needed anyway to ensure sufficient exploration. Two examples of stochastic policies for policy-gradient algorithms are Boltzmann exploration and Gaussian exploration. We need to sample the expected cumulative discounted reward to get the gradient. For instance, if the task is episodic, we can take a Monte Carlo sample that gives the cumulative (possibly discounted) reward for each episode:

$$\nabla_\psi V^\pi(s)=E\left\{R_k(s_t)\left(\sum_{j=t}^{T_k-1}\nabla_\psi\log\pi(s_j,a_j,\psi)\right)\right\}$$

where $R_k(s_t)=\sum_{j=t}^{T_k-1}\gamma^{j-t}r_{j+1}$ is the total discounted return obtained after reaching state s_t in episode k.

Actor-Critic Algorithms

The variance of the estimate of ∇ψVπ(s_t) can be very high if Monte Carlo roll-outs are used, which can severely slow convergence. A potential solution to this problem is to use an explicit approximation of Vπ. Such an approximate value function is called a critic, and the combined algorithm is called an actor-critic algorithm. Actor-critic algorithms typically use a temporal-difference algorithm to update V_t, an estimate of Vπ. Using b(s_t)=Vπ(s_t) as a baseline, δ_t∇ψ log π(s_t,a_t,ψ_t) is an unbiased estimate of the policy gradient. A typical actor-critic update would update the policy parameters with

$$\psi_{t+1}=\psi_t+\beta_t(s_t)\delta_t\nabla_\psi\log\pi(s_t,a_t,\psi_t)$$

where δ_t=r_{t+1}+γV_t(s_{t+1})−V_t(s_t) is an unbiased estimate of Qπ(s_t,a_t)−Vπ(s_t).

Cacla (continuous actor-critic learning automaton) is a special actor-critic algorithm that uses an error in action space rather than in parameter or policy space, and it uses the sign of the temporal-difference error rather than its size. During learning, it is assumed that there is exploration. As in many other actor-critic algorithms, if the temporal-difference error δ_t is positive, we judge a_t to be a good choice and we reinforce it. In Cacla, this is done by updating the output of the actor towards a_t.
This is why exploration is necessary: without exploration the actor output is already equal to the action, and the parameters cannot be updated. An update to the actor only occurs when the temporal-difference error is positive. As an extreme case, consider an actor that already outputs the optimal action in each state for some deterministic MDP. For most exploring actions, the temporal-difference error is then negative. If the actor were updated away from such actions, its output would almost certainly no longer be optimal. This is an important difference between Cacla and policy-gradient methods: Cacla only updates its actor when actual improvements have been observed. This avoids slow learning when there are plateaus in the value space and the temporal-difference errors are small. Intuitively, it makes sense that the distance to a promising action a_t is more important than the size of the improvement in value. A simple Cacla algorithm therefore updates the critic with a temporal-difference rule and, whenever δ_t>0, moves the actor's output for s_t towards the action a_t that was actually taken. For more details about actor-critic algorithms, refer to:

van Hasselt, H.P., Wiering, M.A.: Using continuous action spaces to solve discrete problems. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN 2009), pp. 1149-1156 (2009).

http://blog.csdn.net/philthinker/article/details/71104095
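A minimal sketch of the Cacla updates described above, with a linear critic and actor and a single state; the toy reward and the Gaussian exploration scheme are assumptions:

```python
import numpy as np

def cacla_update(theta_v, theta_a, phi_s, phi_s2, a, r, done,
                 alpha=0.1, beta=0.1, gamma=0.9):
    """One Cacla step with linear critic V(s) = theta_v . phi(s) and linear
    actor Ac(s) = theta_a . phi(s) for a one-dimensional action."""
    v_s2 = 0.0 if done else theta_v @ phi_s2
    delta = r + gamma * v_s2 - theta_v @ phi_s
    theta_v = theta_v + alpha * delta * phi_s            # critic: TD(0) update
    if delta > 0:                                        # actor moves only on improvement,
        theta_a = theta_a + beta * (a - theta_a @ phi_s) * phi_s   # toward a_t
    return theta_v, theta_a

rng = np.random.default_rng(0)
phi = np.array([1.0])                       # single state, single feature
th_v, th_a = np.zeros(1), np.zeros(1)
for _ in range(2000):
    a = th_a @ phi + rng.normal(0.0, 0.5)   # Gaussian exploration around actor output
    r = -(a - 2.0) ** 2                     # assumed toy reward, best action is 2
    th_v, th_a = cacla_update(th_v, th_a, phi, phi, a, r, done=True)
# the actor output th_a @ phi climbs toward the optimal action 2
```

The sign test `delta > 0` is the defining Cacla choice: the actor is pulled toward actions that turned out better than the critic expected, regardless of how much better.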
Supervised Dimension Reduction

Greater dimensionality always brings about more difficult learning tasks. Here we introduce a supervised dimension reduction method based on the linear dimension reduction introduced in http://blog.csdn.net/philthinker/article/details/70212147, which can be summarized as

$$z=Tx,\quad x\in\mathbb{R}^d,\ z\in\mathbb{R}^m,\ m<d$$

Of course, centering the data first is necessary:

$$x_i\leftarrow x_i-\frac{1}{n}\sum_{i'=1}^{n}x_{i'}$$

Fisher discriminant analysis is one of the most basic supervised linear dimension reduction methods, in which we seek a T that makes samples with the same label as close as possible and samples with different labels as far apart as possible. To begin with, define the within-class scatter matrix S^(w) and the between-class scatter matrix S^(b) as

$$S^{(w)}=\sum_{y=1}^{c}\sum_{i:y_i=y}(x_i-\mu_y)(x_i-\mu_y)^T\in\mathbb{R}^{d\times d}$$

$$S^{(b)}=\sum_{y=1}^{c}n_y\mu_y\mu_y^T\in\mathbb{R}^{d\times d}$$

where

$$\mu_y=\frac{1}{n_y}\sum_{i:y_i=y}x_i$$

Here $\sum_{i:y_i=y}$ stands for the sum over samples i satisfying y_i=y, and n_y is the number of samples belonging to class y. Then we can define the projection matrix T through

$$\max_{T\in\mathbb{R}^{m\times d}}\operatorname{tr}\left((TS^{(w)}T^T)^{-1}TS^{(b)}T^T\right)$$

The goal of this optimization is to maximize the between-class scatter TS^(b)T^T while minimizing the within-class scatter TS^(w)T^T. The problem is solvable with an approach similar to the one used in Unsupervised Dimension Reduction, i.e. by solving the generalized eigenvalue problem

$$S^{(b)}\xi=\lambda S^{(w)}\xi$$

with eigenvalues λ₁≥⋯≥λ_d≥0 and corresponding eigenvectors ξ₁,…,ξ_d. Taking the eigenvectors of the largest m eigenvalues, we get the solution

$$\hat{T}=(\xi_1,\dots,\xi_m)^T$$

A MATLAB demo:

```matlab
n=100; x=randn(n,2);
x(1:n/2,1)=x(1:n/2,1)-4;
x(n/2+1:end,1)=x(n/2+1:end,1)+4;
x=x-repmat(mean(x),[n,1]);
y=[ones(n/2,1); 2*ones(n/2,1)];

m1=mean(x(y==1,:)); x1=x(y==1,:)-repmat(m1,[n/2,1]);
m2=mean(x(y==2,:)); x2=x(y==2,:)-repmat(m2,[n/2,1]);
% leading generalized eigenvector: between-class vs. within-class scatter
[t,v]=eigs(n/2*(m1')*m1+n/2*(m2')*m2, x1'*x1+x2'*x2, 1);

figure(1); clf; hold on; axis([-8 8 -6 6]);
plot(x(y==1,1),x(y==1,2),'bo');
plot(x(y==2,1),x(y==2,2),'rx');
plot(99*[-t(1) t(1)],99*[-t(2) t(2)],'k-');
```

Attention please: when the samples have several peaks per class, the output fails to be ideal. Local Fisher Discriminant Analysis may still work in that case.
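The same construction can be sketched in plain NumPy; the two-cluster data generation mirrors the MATLAB demo above but is an assumed toy setup (and the inversion of S^(w) assumes it is nonsingular):

```python
import numpy as np

def fisher_lda(x, y, m=1):
    """Fisher discriminant analysis: solve S_b xi = lambda S_w xi and keep
    the m leading eigenvectors as the rows of T-hat."""
    x = x - x.mean(axis=0)                   # centre the data first
    d = x.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for label in np.unique(y):
        xc = x[y == label]
        mu = xc.mean(axis=0)
        S_w += (xc - mu).T @ (xc - mu)       # within-class scatter
        S_b += len(xc) * np.outer(mu, mu)    # between-class scatter
    # generalized eigenproblem via inv(S_w) @ S_b (assumes S_w invertible)
    vals, vecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:m]].T         # T-hat = (xi_1, ..., xi_m)^T

rng = np.random.default_rng(0)
x = np.vstack([rng.normal([-4, 0], 1, (50, 2)), rng.normal([4, 0], 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)
T = fisher_lda(x, y)
# the single projection direction lies close to the first axis,
# along which the two classes separate
```

Since the classes differ only in their first coordinate, the leading generalized eigenvector is dominated by that coordinate, which is easy to check numerically.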
Laplacian Regularization

In least squares learning methods, we calculate the Euclidean distance between sample points to find a classifier plane. Here, however, we calculate the minimum distance along the manifold of the points, and based on that distance we find a classifier plane. In semi-supervised learning applications, we assume that the inputs x lie on some manifold and that the outputs y vary smoothly on that manifold. In the case of classification, inputs on the same manifold are supposed to have the same label; in the case of regression, the map from inputs to outputs is supposed to vary smoothly on the manifold.

Take the Gaussian kernel function for example:

$$f_\theta(x)=\sum_{j=1}^{n}\theta_j K(x,x_j),\qquad K(x,c)=\exp\left(-\frac{\|x-c\|^2}{2h^2}\right)$$

There are unlabeled samples $\{x_i\}_{i=n+1}^{n+n'}$ that can also be utilized:

$$f_\theta(x)=\sum_{j=1}^{n+n'}\theta_j K(x,x_j)$$

In order to make all of the samples (labeled and unlabeled) have local similarity, it is necessary to add a constraint condition:

$$\min_\theta\left[\frac{1}{2}\sum_{i=1}^{n}\left(f_\theta(x_i)-y_i\right)^2+\frac{\lambda}{2}\|\theta\|^2+\frac{\nu}{4}\sum_{i,i'=1}^{n+n'}W_{i,i'}\left(f_\theta(x_i)-f_\theta(x_{i'})\right)^2\right]$$

The first two terms are the ℓ2-regularized least squares objective, and the last term is the regularizer for semi-supervised learning (Laplacian regularization). ν≥0 is a parameter that tunes the smoothness on the manifold, and W_{i,i'}≥0 is the similarity between x_i and x_{i'}. Not familiar with similarity? Refer to: http://blog.csdn.net/philthinker/article/details/70212147

Then how do we solve the optimization problem? Using the diagonal matrix D, whose elements are the row sums of W, and the Laplacian matrix L=D−W, it is possible to transform the optimization problem above into a general ℓ2-constrained least squares problem. For simplicity, we omit the details here. A MATLAB demo:
```matlab
n=200; a=linspace(0,pi,n/2);
u=-10*[cos(a)+0.5 cos(a)-0.5]'+randn(n,1);
v=10*[sin(a) -sin(a)]'+randn(n,1);
x=[u v]; y=zeros(n,1); y(1)=1; y(n)=-1;

x2=sum(x.^2,2); hh=2*1^2;
k=exp(-(repmat(x2,1,n)+repmat(x2',n,1)-2*x*(x'))/hh); w=k;
% closed-form solution of the Laplacian-regularized least squares problem
t=(k^2+1*eye(n)+10*k*(diag(sum(w))-w)*k)\(k*y);

m=100; X=linspace(-20,20,m)'; X2=X.^2;
U=exp(-(repmat(u.^2,1,m)+repmat(X2',n,1)-2*u*(X'))/hh);
V=exp(-(repmat(v.^2,1,m)+repmat(X2',n,1)-2*v*(X'))/hh);

figure(1); clf; hold on; axis([-20 20 -20 20]);
colormap([1 0.7 1; 0.7 1 1]);
contourf(X,X,sign(V'*(U.*repmat(t,1,m))));
plot(x(y==1,1),x(y==1,2),'bo');
plot(x(y==-1,1),x(y==-1,2),'rx');
plot(x(y==0,1),x(y==0,2),'k.');
```
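The closed form used in the demo, θ = (K² + λI + νKLK)⁻¹Ky with zero targets for the unlabeled points, can be sketched in NumPy as well; the two-cluster toy data with one labeled point per cluster is an assumed setup:

```python
import numpy as np

def laplacian_rls(x, y, labeled, h=1.0, lam=1.0, nu=10.0):
    """Laplacian-regularized least squares with the Gaussian kernel model
    f(x) = sum_j theta_j K(x, x_j); solves the closed form
    theta = (K^2 + lam*I + nu*K L K)^{-1} K y, unlabeled targets set to 0."""
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * h ** 2))          # kernel doubles as similarity W = K
    L = np.diag(K.sum(axis=1)) - K          # graph Laplacian L = D - W
    y_full = np.where(labeled, y, 0.0)
    A = K @ K + lam * np.eye(len(x)) + nu * K @ L @ K
    return np.linalg.solve(A, K @ y_full)

# toy data: two tight clusters, one labeled point in each (assumed setup)
x = np.array([[0.0, 0], [0.2, 0], [0.4, 0], [5.0, 0], [5.2, 0], [5.4, 0]])
y = np.array([1.0, 0, 0, 0, 0, -1.0])
labeled = np.array([True, False, False, False, False, True])
theta = laplacian_rls(x, y, labeled)
f = np.exp(-np.sum((x[:, None] - x[None, :]) ** 2, -1) / 2) @ theta
# the Laplacian term spreads the labels: the first cluster gets positive
# scores, the second negative
```

Because the clusters are far apart in kernel distance, the Laplacian term forces f to be nearly constant within each cluster, so the single label in each cluster propagates to its unlabeled neighbours.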