Policy Gradient Algorithms

 2019-10-02 17:37:47

This blog is from: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html 

 

Abstract: In this post, we are going to look deep into policy gradient, why it works, and many new policy gradient algorithms proposed in recent years: vanilla policy gradient, actor-critic, off-policy actor-critic, A3C, A2C, DPG, DDPG, D4PG, MADDPG, TRPO, PPO, ACER, ACTKR, SAC, TD3 & SVPG. 

 

What is Policy Gradient

Policy gradient is an approach to solve reinforcement learning problems. If you haven’t looked into the field of reinforcement learning, please first read the section “A (Long) Peek into Reinforcement Learning » Key Concepts” for the problem definition and key concepts.

Notations

Here is a list of notations to help you read through equations in the post easily.

SymbolMeaning
sSs∈SStates.
aAa∈AActions.
rRr∈RRewards.
St,At,RtSt,At,RtState, action, and reward at time step t of one trajectory. I may occasionally use st,at,rtst,at,rt as well.
γγDiscount factor; penalty to uncertainty of future rewards; 0<γ10<γ≤1.
GtGtReturn; or discounted future reward; Gt=k=0γkRt+k+1Gt=∑k=0∞γkRt+k+1.
P(s,r|s,a)P(s′,r|s,a)Transition probability of getting to the next state s’ from the current state s with action a and reward r.
π(a|s)π(a|s)Stochastic policy (agent behavior strategy); πθ(.)πθ(.) is a policy parameterized by θ.
μ(s)μ(s)Deterministic policy; we can also label this as π(s)π(s), but using a different letter gives better distinction so that we can easily tell when the policy is stochastic or deterministic without further explanation. Either ππ or μμ is what a reinforcement learning algorithm aims to learn.
V(s)V(s)State-value function measures the expected return of state s; Vw(.)Vw(.) is a value function parameterized by w.
Vπ(s)Vπ(s)The value of state s when we follow a policy π; Vπ(s)=Eaπ[Gt|St=s]Vπ(s)=Ea∼π[Gt|St=s].
Q(s,a)Q(s,a)Action-value function is similar to V(s)V(s), but it assesses the expected return of a pair of state and action (s, a); Qw(.)Qw(.) is a action value function parameterized by w.
Qπ(s,a)Qπ(s,a)Similar to Vπ(.)Vπ(.), the value of (state, action) pair when we follow a policy π; Qπ(s,a)=Eaπ[Gt|St=s,At=a]Qπ(s,a)=Ea∼π[Gt|St=s,At=a].
A(s,a)A(s,a)Advantage function, A(s,a)=Q(s,a)V(s)A(s,a)=Q(s,a)−V(s); it can be considered as another version of Q-value with lower variance by taking the state-value off as the baseline.

Policy Gradient

The goal of reinforcement learning is to find an optimal behavior strategy for the agent to obtain optimal rewards. The policy gradient methods target at modeling and optimizing the policy directly. The policy is usually modeled with a parameterized function respect to θ, πθ(a|s)πθ(a|s). The value of the reward (objective) function depends on this policy and then various algorithms can be applied to optimize θ for the best reward.

The reward function is defined as:

J(θ)=sSdπ(s)Vπ(s)=sSdπ(s)aAπθ(a|s)Qπ(s,a)J(θ)=∑s∈Sdπ(s)Vπ(s)=∑s∈Sdπ(s)∑a∈Aπθ(a|s)Qπ(s,a)

where dπ(s)dπ(s) is the stationary distribution of Markov chain for πθπθ (on-policy state distribution under π). For simplicity, the θ parameter would be omitted for the policy πθπθ when the policy is present in the subscript of other functions; for example, dπdπ and QπQπ should be dπθdπθ and QπθQπθ if written in full. Imagine that you can travel along the Markov chain’s states forever, and eventually, as the time progresses, the probability of you ending up with one state becomes unchanged — this is the stationary probability for πθπθ. dπ(s)=limtP(st=s|s0,πθ)dπ(s)=limt→∞P(st=s|s0,πθ) is the probability that st=sst=s when starting from s0s0 and following policy πθπθ for t steps. Actually, the existence of the stationary distribution of Markov chain is one main reason for why PageRank algorithm works. If you want to read more, check this.

It is natural to expect policy-based methods are more useful in the continuous space. Because there is an infinite number of actions and (or) states to estimate the values for and hence value-based approaches are way too expensive computationally in the continuous space. For example, in generalized policy iteration, the policy improvement step argmaxaAQπ(s,a)arg⁡maxa∈AQπ(s,a) requires a full scan of the action space, suffering from the curse of dimensionality.

Using gradient ascent, we can move θ toward the direction suggested by the gradient θJ(θ)∇θJ(θ) to find the best θ for πθπθ that produces the highest return.

Policy Gradient Theorem

Computing the gradient θJ(θ)∇θJ(θ) is tricky because it depends on both the action selection (directly determined by πθπθ) and the stationary distribution of states following the target selection behavior (indirectly determined by πθπθ). Given that the environment is generally unknown, it is difficult to estimate the effect on the state distribution by a policy update.

Luckily, the policy gradient theorem comes to save the world! Woohoo! It provides a nice reformation of the derivative of the objective function to not involve the derivative of the state distribution dπ(.)dπ(.) and simplify the gradient computation θJ(θ)∇θJ(θ) a lot.

θJ(θ)=θsSdπ(s)aAQπ(s,a)πθ(a|s)sSdπ(s)aAQπ(s,a)θπθ(a|s)∇θJ(θ)=∇θ∑s∈Sdπ(s)∑a∈AQπ(s,a)πθ(a|s)∝∑s∈Sdπ(s)∑a∈AQπ(s,a)∇θπθ(a|s)

Proof of Policy Gradient Theorem

This session is pretty dense, as it is the time for us to go through the proof (Sutton & Barto, 2017; Sec. 13.1) and figure out why the policy gradient theorem is correct.

We first start with the derivative of the state value function:

=====θVπ(s)θ(aAπθ(a|s)Qπ(s,a))aA(θπθ(a|s)Qπ(s,a)+πθ(a|s)θQπ(s,a))aA(θπθ(a|s)Qπ(s,a)+πθ(a|s)θs,rP(s,r|s,a)(r+Vπ(s)))aA(θπθ(a|s)Qπ(s,a)

转载于:https://www.cnblogs.com/wangxiaocvpr/p/11617854.html

Logo

华为开发者空间,是为全球开发者打造的专属开发空间,汇聚了华为优质开发资源及工具,致力于让每一位开发者拥有一台云主机,基于华为根生态开发、创新。

更多推荐