Policy Gradient Algorithms
2019-10-02 17:37:47
This blog is from: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html
Abstract: In this post, we are going to look deep into policy gradient, why it works, and many new policy gradient algorithms proposed in recent years: vanilla policy gradient, actor-critic, off-policy actor-critic, A3C, A2C, DPG, DDPG, D4PG, MADDPG, TRPO, PPO, ACER, ACTKR, SAC, TD3 & SVPG.
What is Policy Gradient
Policy gradient is an approach to solve reinforcement learning problems. If you haven’t looked into the field of reinforcement learning, please first read the section “A (Long) Peek into Reinforcement Learning » Key Concepts” for the problem definition and key concepts.
Notations
Here is a list of notations to help you read through equations in the post easily.
Symbol | Meaning |
---|---|
s∈Ss∈S | States. |
a∈Aa∈A | Actions. |
r∈Rr∈R | Rewards. |
St,At,RtSt,At,Rt | State, action, and reward at time step t of one trajectory. I may occasionally use st,at,rtst,at,rt as well. |
γγ | Discount factor; penalty to uncertainty of future rewards; 0<γ≤10<γ≤1. |
GtGt | Return; or discounted future reward; Gt=∑∞k=0γkRt+k+1Gt=∑k=0∞γkRt+k+1. |
P(s′,r|s,a)P(s′,r|s,a) | Transition probability of getting to the next state s’ from the current state s with action a and reward r. |
π(a|s)π(a|s) | Stochastic policy (agent behavior strategy); πθ(.)πθ(.) is a policy parameterized by θ. |
μ(s)μ(s) | Deterministic policy; we can also label this as π(s)π(s), but using a different letter gives better distinction so that we can easily tell when the policy is stochastic or deterministic without further explanation. Either ππ or μμ is what a reinforcement learning algorithm aims to learn. |
V(s)V(s) | State-value function measures the expected return of state s; Vw(.)Vw(.) is a value function parameterized by w. |
Vπ(s)Vπ(s) | The value of state s when we follow a policy π; Vπ(s)=Ea∼π[Gt|St=s]Vπ(s)=Ea∼π[Gt|St=s]. |
Q(s,a)Q(s,a) | Action-value function is similar to V(s)V(s), but it assesses the expected return of a pair of state and action (s, a); Qw(.)Qw(.) is a action value function parameterized by w. |
Qπ(s,a)Qπ(s,a) | Similar to Vπ(.)Vπ(.), the value of (state, action) pair when we follow a policy π; Qπ(s,a)=Ea∼π[Gt|St=s,At=a]Qπ(s,a)=Ea∼π[Gt|St=s,At=a]. |
A(s,a)A(s,a) | Advantage function, A(s,a)=Q(s,a)−V(s)A(s,a)=Q(s,a)−V(s); it can be considered as another version of Q-value with lower variance by taking the state-value off as the baseline. |
Policy Gradient
The goal of reinforcement learning is to find an optimal behavior strategy for the agent to obtain optimal rewards. The policy gradient methods target at modeling and optimizing the policy directly. The policy is usually modeled with a parameterized function respect to θ, πθ(a|s)πθ(a|s). The value of the reward (objective) function depends on this policy and then various algorithms can be applied to optimize θ for the best reward.
The reward function is defined as:
where dπ(s)dπ(s) is the stationary distribution of Markov chain for πθπθ (on-policy state distribution under π). For simplicity, the θ parameter would be omitted for the policy πθπθ when the policy is present in the subscript of other functions; for example, dπdπ and QπQπ should be dπθdπθ and QπθQπθ if written in full. Imagine that you can travel along the Markov chain’s states forever, and eventually, as the time progresses, the probability of you ending up with one state becomes unchanged — this is the stationary probability for πθπθ. dπ(s)=limt→∞P(st=s|s0,πθ)dπ(s)=limt→∞P(st=s|s0,πθ) is the probability that st=sst=s when starting from s0s0 and following policy πθπθ for t steps. Actually, the existence of the stationary distribution of Markov chain is one main reason for why PageRank algorithm works. If you want to read more, check this.
It is natural to expect policy-based methods are more useful in the continuous space. Because there is an infinite number of actions and (or) states to estimate the values for and hence value-based approaches are way too expensive computationally in the continuous space. For example, in generalized policy iteration, the policy improvement step argmaxa∈AQπ(s,a)argmaxa∈AQπ(s,a) requires a full scan of the action space, suffering from the curse of dimensionality.
Using gradient ascent, we can move θ toward the direction suggested by the gradient ∇θJ(θ)∇θJ(θ) to find the best θ for πθπθ that produces the highest return.
Policy Gradient Theorem
Computing the gradient ∇θJ(θ)∇θJ(θ) is tricky because it depends on both the action selection (directly determined by πθπθ) and the stationary distribution of states following the target selection behavior (indirectly determined by πθπθ). Given that the environment is generally unknown, it is difficult to estimate the effect on the state distribution by a policy update.
Luckily, the policy gradient theorem comes to save the world! Woohoo! It provides a nice reformation of the derivative of the objective function to not involve the derivative of the state distribution dπ(.)dπ(.) and simplify the gradient computation ∇θJ(θ)∇θJ(θ) a lot.
Proof of Policy Gradient Theorem
This session is pretty dense, as it is the time for us to go through the proof (Sutton & Barto, 2017; Sec. 13.1) and figure out why the policy gradient theorem is correct.
We first start with the derivative of the state value function: