Nomenclature

| Symbol | Description |
| --- | --- |
| $\theta$ | Policy network parameters |
| $\pi_\theta(a \mid s)$ | Policy: probability of taking action $a$ given state $s$ |
| $\tau$ | Trajectory: sequence $(s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ |
| $s_t$ | State at time step $t$ |
| $a_t$ | Action at time step $t$ |
| $r_t$ | Reward at time step $t$ |
| $\gamma$ | Discount factor |
| $R(\tau)$ | Discounted cumulative return: $\sum_{t=0}^T \gamma^t r_t$ |
| $G_t$ | Reward-to-go from step $t$: $\sum_{t'=t}^T \gamma^{t'-t} r_{t'}$ |
| $J(\theta)$ | Expected return objective function |
| $V^\pi(s)$ | Value function: expected return starting from state $s$ under policy $\pi$ |
| $Q^\pi(s, a)$ | Action-value function: expected return for taking action $a$ in state $s$ under $\pi$ |
| $A^\pi(s, a)$ | Advantage function: $Q^\pi(s, a) - V^\pi(s)$ |
| $\hat{V}_\phi(s)$ | Parameterized value network with parameters $\phi$ |
| $\hat{A}(s, a)$ | Estimated advantage |
| $\delta_t$ | TD error at step $t$: $r_t + \gamma V(s_{t+1}) - V(s_t)$ |
| $\lambda$ | GAE weighting parameter |
| $d^{\pi_\theta}$ | Occupancy measure (discounted state visitation frequency) |
| $r_t(\theta)$ | Importance sampling ratio: $\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ |
| $D_{KL}(p \Vert q)$ | KL divergence from distribution $q$ to $p$ |
| $\epsilon$ | PPO clipping threshold |
| $\alpha$ | Temperature parameter (e.g., in SAC) |
| $\pi^*$ | Optimal policy |

REINFORCE

In reinforcement learning, we'd like to maximize the expected cumulative return.

\[J(\theta) = \mathbb{E}[R(\tau)]\]

in which $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ is a trajectory and $R(\tau) = \sum_{t=0}^T \gamma^t r_t$ is its discounted cumulative return.

If we expand the expectation over the trajectory probability:

\[J(\theta) = \int p_\theta (\tau) R(\tau) d\tau\]

in which $p_\theta(\tau) = p(s_0) \prod_{t=0}^T \pi_\theta(a_t \mid s_t) \cdot p(s_{t+1} \mid s_{t}, a_{t})$. Taking the logarithm, $\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^T \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^T \log p(s_{t+1} \mid s_{t}, a_{t})$. Only the second term depends on the parameter $\theta$.

Then we compute the gradient with respect to the parameter $\theta$:

\[\begin{align*} \nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta (\tau) R(\tau) d\tau \\ &= \int \nabla_\theta \log (p_\theta (\tau)) \cdot p_\theta(\tau) R(\tau) d\tau \\ &= \int \sum_{t=0}^T \nabla_\theta\log(\pi_\theta(a_t \mid s_t)) \cdot p_\theta(\tau) R(\tau) d\tau \\ &= \sum_{t=0}^T \int \nabla_\theta\log(\pi_\theta(a_t \mid s_t)) \cdot p_\theta(\tau) R(\tau) d\tau \\ &= \sum_{t=0}^T \mathbb{E}_{\tau\sim\pi_\theta}[\nabla_\theta\log(\pi_\theta(a_t \mid s_t)) R(\tau)] \end{align*}\]

The result shows that the gradient of the objective is a sum over the trajectory of the expected gradient of the log-probability of each action, weighted by the cumulative return. This estimator is called REINFORCE (Sutton and Barto, Chapter 13).
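
Below is a minimal sketch of this estimator in PyTorch. The environment interaction is omitted, and the network sizes, action space, and learning rate are illustrative assumptions rather than part of the derivation.

```python
import torch
import torch.nn as nn

# Hypothetical policy network for a small discrete action space (sizes are arbitrary).
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_loss(states, actions, returns):
    """Surrogate loss whose gradient is the REINFORCE estimator (negated for minimization).

    states:  (T, obs_dim) float tensor, actions: (T,) long tensor,
    returns: (T,) float tensor holding R(tau) (or the reward-to-go G_t introduced below).
    """
    logits = policy(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Minimizing the negative mean ascends the policy-gradient direction.
    return -(log_probs * returns).mean()

# Usage: loss = reinforce_loss(states, actions, returns); loss.backward(); optimizer.step()
```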

Introduce causality ("reward-to-go")

Note that this gradient pushes up the log-probability of action $a_t$ weighted by every reward along the trajectory, including rewards received before $a_t$ was taken, which $a_t$ cannot influence. An improved form therefore drops the terms before step $t$: for each step $t$, $R(\tau) = \sum_{t=0}^T \gamma^t r_t$ can be replaced by the reward-to-go $G_t = \sum_{t'=t}^T\gamma^{t'-t}r_{t'}$.
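
A small helper for the reward-to-go computation, assuming a plain Python list of per-step rewards; the backward recursion $G_t = r_t + \gamma G_{t+1}$ avoids recomputing the sum at every step.

```python
def reward_to_go(rewards, gamma=0.99):
    """Compute G_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for every step t."""
    G = [0.0] * len(rewards)
    running = 0.0
    # Backward recursion: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Example: reward_to_go([1.0, 0.0, 2.0], gamma=0.5) == [1.5, 1.0, 2.0]
```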

Introduce the baseline

Lemma: for any function $b(s)$ that does not depend on the action $a$, we have \(\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta (a|s) \cdot b(s)] = 0\)

Proof:

\[\begin{align*} & \mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta (a|s) \cdot b(s)] \\ &= b(s) \mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta (a|s)] \\ &= b(s) \int \nabla_\theta \log \pi_\theta (a|s) \pi_\theta(a|s) da \\ &= b(s) \int \nabla_\theta \pi_\theta(a|s) da \\ &= b(s) \nabla_\theta \int \pi_\theta(a|s) da \\ &= b(s) \nabla_\theta 1 \\ &= 0 \end{align*}\]

This is also called the EGLP (Expected Grad-Log-Prob) lemma.
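
As a sanity check, the lemma can be verified numerically for a small softmax policy; the logits below are arbitrary and chosen only for illustration.

```python
import torch

logits = torch.tensor([0.3, -1.2, 0.8], requires_grad=True)  # arbitrary "policy parameters"
pi = torch.softmax(logits, dim=0)                             # pi_theta(a|s) over 3 actions

# E_{a~pi}[ grad_theta log pi(a|s) ], computed exactly by summing over all actions.
expected_grad = torch.zeros_like(logits)
for a in range(3):
    grad_a = torch.autograd.grad(torch.log(pi[a]), logits, retain_graph=True)[0]
    expected_grad += pi[a].detach() * grad_a

print(expected_grad)  # approximately the zero vector, as the lemma predicts
```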

Thus we can subtract a baseline function $b(s)$ that is independent of $a$ without affecting the expectation in the REINFORCE gradient, namely

\[\begin{align*} \nabla_\theta J(\theta) &= \sum_{t=0}^T \mathbb{E}_{\tau\sim\pi_\theta}[\nabla_\theta\log(\pi_\theta(a_t|s_t))\, G_t] \\ &= \mathbb{E}_{\tau\sim\pi_\theta}\Big[ \sum_{t=0}^T \nabla_\theta\log(\pi_\theta(a_t|s_t))\, (G_t - b(s_t))\Big] \end{align*}\]

How to choose a baseline function

A common baseline function is the value function $V^\pi(s_t)$ (see the Dueling DQN paper by Ziyu Wang et al.). Then $G_t - b(s_t) = G_t - V^\pi(s_t)$, which is an unbiased estimator of the advantage function.

Proof:

Given a policy $\pi$, the value function (expected return from state $s$ under $\pi$) and the action-value function (expected return of taking action $a$ at state $s$ under $\pi$) are defined as (see Sutton and Barto, Chapter 3):

\[\begin{align*} V^\pi(s) &= \mathbb{E}_{\pi}\Big[\sum_{t'=t}^T\gamma^{t'-t} r_{t'} \,\Big|\, s_t = s\Big] \\ Q^\pi(s, a) &= \mathbb{E}_{\pi}\Big[\sum_{t'=t}^T\gamma^{t'-t} r_{t'} \,\Big|\, s_t = s, a_t=a\Big] \end{align*}\]

The advantage function is defined as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, which measures how much better or worse an action $a$ is than average under a given state $s$.

Then, looking at $G_t = \sum_{t'=t}^T\gamma^{t'-t}r_{t'}$, we find that $\mathbb{E}_{\pi}[G_t \mid s_t=s, a_t=a] = Q^\pi(s, a)$.

Subtracting the value function, we see that $\mathbb{E}_{\pi}[G_t - V^\pi(s) \mid s_t=s, a_t=a] = Q^\pi(s, a) - V^\pi(s) = A^\pi(s, a)$. In other words, $G_t - V^\pi(s_t)$ is an unbiased estimator of $A^\pi(s_t, a_t)$.

Finally, picking the value function as the baseline, the gradient on the target becomes

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\Big[ \sum_{t=0}^T \nabla_\theta\log(\pi_\theta(a_t \mid s_t))\, A^\pi(s_t, a_t)\Big]\]

This equation also raises the practical question of how to estimate the advantage function, which we turn to next.

Practical methods for estimating the advantage function

  • Method 1: Monte Carlo Sampling

Sample trajectories $\tau$ from the policy $\pi_\theta$, compute $G_t = \sum_{k=0}^{T-t}\gamma^{k}r_{t+k}$, and query the value network for $\hat{V}_{\phi}(s_t)$. The advantage estimate is then $\hat{A}(s_t, a_t) = G_t - \hat{V}_\phi(s_t)$. As shown above, this estimator is unbiased, but it often has high variance because it accumulates rewards over the entire remaining trajectory.

  • Method 2: Temporal Difference (TD)

Only sample one step ahead in the trajectory and compute the advantage using the following formula.

\[\hat{A}(s_t, a_t) = r_t + \gamma \hat{V}_{\phi}(s_{t+1}) - \hat{V}_{\phi}(s_t)\]

This formula is derived from the Bellman equation for $Q$ in terms of $V$: \(Q^{\pi}(s_t, a_t) = \mathbb{E}[r_{t} + \gamma V^{\pi}(s_{t+1})]\)

Then we get \(A^{\pi}(s_t, a_t) = \mathbb{E}[r_{t} + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)]\)

The term inside the expectation is exactly the TD error, which means the TD error is an estimator of the advantage function. The n-step version is the following:

\[\hat{A}^{(n)}(s_t, a_t) = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \hat{V}_{\phi}(s_{t+n}) - \hat{V}_{\phi}(s_t)\]

Taking $n=1$ recovers the one-step TD estimate above.

  • Notes:
    • There are two value functions here: $V^\pi$, the true value function, which is unknown to us, and $\hat{V}_{\phi}$, the parameterized value function (often a neural network) that approximates it. The TD estimator is unbiased only when the true value function is used, and the true value function is almost never available in practice.

    • We often need two separate networks for the policy and the value function to avoid interference between them. Practically, you may also share a backbone with two output heads to save parameters.

  • Method 3: Generalized Advantage Estimation (GAE)

By taking an exponentially weighted sum of the TD errors along the rest of the trajectory, we generalize the advantage into the GAE formulation:

\[\hat{A}^{GAE}_t = \sum_{l=0}^{T-t}(\gamma\lambda)^l\delta_{t+l}\]

in which $\delta_t = r_t + \gamma \hat{V}_\phi(s_{t+1}) - \hat{V}_\phi(s_t)$ is the TD(0) error at step $t$.

When $\lambda = 0$, it reduces to the TD(0) error; when $\lambda = 1$, it reduces to the Monte Carlo estimate $G_t - \hat{V}_\phi(s_t)$. A code sketch covering all three methods follows below.
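
A sketch of the GAE computation over one sampled trajectory. It assumes arrays of rewards and value estimates plus a bootstrap value $\hat{V}_\phi(s_{T+1})$ (taken to be 0 at a terminal state); setting `lam=0` or `lam=1` recovers the TD(0) and Monte Carlo cases above.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards:    (T,) array of r_t
    values:     (T,) array of V_phi(s_t) estimates
    last_value: scalar bootstrap V_phi(s_{T+1}) (use 0.0 for a terminal state)
    """
    T = len(rewards)
    values_ext = np.append(values, last_value)
    advantages = np.zeros(T)
    gae = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# lam=0.0 gives the one-step TD errors; lam=1.0 gives G_t - V_phi(s_t) (Monte Carlo).
```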

Policy update and importance sampling

We'd like to compute an expectation under the new policy, but the sampled data comes from the old policy. In this case, we need importance sampling to correct for the mismatch between the sampling distribution and the new policy's distribution.

Importance sampling

For any two distributions $p$ and $q$ (with $q(x) > 0$ wherever $p(x) > 0$),

\[\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}[\frac{p(x)}{q(x)}f(x)]\]

the ratio $\frac{p(x)}{q(x)}$ is called the importance weight.
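
A toy numerical illustration with two Gaussians; the particular densities and test function here are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2  # arbitrary test function

# Target p = N(1, 1), sampling distribution q = N(0, 2).
x_q = rng.normal(0.0, 2.0, size=200_000)
weights = normal_pdf(x_q, 1.0, 1.0) / normal_pdf(x_q, 0.0, 2.0)   # importance weights p(x)/q(x)
is_estimate = np.mean(weights * f(x_q))                            # E_{x~q}[(p/q) f(x)]

direct = np.mean(f(rng.normal(1.0, 1.0, size=200_000)))            # E_{x~p}[f(x)]
print(is_estimate, direct)  # both close to E_p[x^2] = 1^2 + 1 = 2
```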

Rewrite the policy gradient as below:

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}[ \sum_{t=0}^T \nabla_\theta\log(\pi_\theta(a_t|s_t)) A(s_t, a_t)]\]

Note that this gradient is equivalent to the gradient of the following surrogate form, in which the state distribution and the advantage are held fixed at the old policy and only the action distribution depends on $\theta$: \(\nabla_\theta J(\theta) \approx \nabla_\theta \mathbb{E}_{s \sim d^{\pi_{\theta_{old}}},\, a \sim \pi_\theta}[A^{\pi_{\theta_{old}}}(s, a)]\)

This approximation is exact only when \(\theta = \theta_{old}\): at the old policy, the surrogate gradient equals the true gradient (the two objectives are tangent to each other).

In this formula, $d^{\pi}$ is the occupancy measure, representing the discounted frequency with which the agent visits state $s$ under policy $\pi$.

Then the importance-sampled gradient is formulated as:

\[\nabla_\theta L(\theta) = \mathbb{E}_t[ \nabla_\theta r_t(\theta) A_t]\]

The reason is that $\nabla_\theta r_t(\theta) = \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$, so when $\theta = \theta_{old}$ this reduces to $\nabla_\theta \log\pi_\theta(a_t \mid s_t)$, recovering the original gradient.
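
In code, the ratio is typically formed from stored log-probabilities; the sketch below assumes the per-step tensors `new_log_probs`, `old_log_probs`, and `advantages` have already been computed.

```python
import torch

def surrogate_loss(new_log_probs, old_log_probs, advantages):
    """Unclipped importance-sampled objective E_t[r_t(theta) * A_t], negated for minimization.

    old_log_probs are the values recorded when the data was collected and are detached,
    so gradients flow only through the new policy's log-probabilities.
    """
    ratio = torch.exp(new_log_probs - old_log_probs.detach())  # r_t(theta)
    return -(ratio * advantages).mean()

# At theta = theta_old the ratio equals 1 and its gradient equals grad log pi_theta(a_t|s_t),
# so this recovers the original policy-gradient estimator.
```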

TRPO

Optimize the following problem: \(\begin{align*} &\max_\theta \mathbb{E}_t[r_t(\theta) A_t] \\ &\text{s.t. } \mathbb{E}_t[D_{KL}(\pi_{\theta_{old}}\|{\pi_\theta})] \le \delta \\ \end{align*}\)

PPO

Use clipping to replace the constraint:

\[L_{clipping}(\theta) = \mathbb{E}_t[\min(r_t(\theta)A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A_t)]\]

Practically, there are four cases:

  1. $r_t > 1+\epsilon, A_t > 0$: good action whose probability has already been pushed up enough; the clipped term is active, so the gradient is zero (clipped).
  2. $r_t < 1-\epsilon, A_t > 0$: good action; the unclipped term is active, so the gradient passes and the probability is pushed up.
  3. $r_t > 1+\epsilon, A_t < 0$: bad action; the unclipped term is the smaller one and stays active, so the gradient passes and the probability is pushed down.
  4. $r_t < 1-\epsilon, A_t < 0$: bad action whose probability has already been pushed down enough; the clipped term is active, so the gradient is zero (clipped).

Note that TRPO requires solving a second-order constrained optimization problem, while clipping only requires first-order optimization.
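
A sketch of the clipped loss; `epsilon` and the tensor names are assumptions of this sketch, and practical implementations usually add a value loss and an entropy bonus on top.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """PPO clipped surrogate loss, negated so that minimizing it maximizes the objective."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())               # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # The element-wise minimum implements the four cases listed above.
    return -torch.min(unclipped, clipped).mean()
```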

KL and reverse KL

Two common additional loss terms are the forward KL divergence and the reverse KL divergence.

Assume that we have (sampled) data distribution $p$ and a distribution we’d like to fit $q$:

  • The forward KL is $D_{KL}(p\|q) = \mathbb{E}_{x \sim p(x)}[\log\frac{p(x)}{q(x)}]$, which is evaluated with samples from $p$.
  • The backward (reverse) KL is $D_{KL}(q\|p) = \mathbb{E}_{x \sim q(x)}[\log\frac{q(x)}{p(x)}]$, which is evaluated with samples from the fitted distribution $q$.

Because the KL divergence is asymmetric, the forward and backward terms lead to mean-seeking and mode-seeking behaviors, respectively.

In policy gradient methods, the reverse KL term is more often used, since we want the parameterized policy to fit an improved target distribution. A quantitative form of this target is the following:

\[\pi^{*}(a|s) \propto \pi_{old}(a|s) \cdot \exp\!\Big(\frac{1}{\alpha} A(s, a)\Big)\]

With the reverse KL term $D_{KL}(\pi_{\theta}\|\pi^*)$, we can sample from the existing parameterized policy network and let its behavior concentrate on a few good action modes instead of trying to cover the mean of all good actions.

At a high level, the reverse KL term can be estimated by Monte Carlo:

\[\begin{align*} D_{KL}(\pi_\theta\|\pi^*) & = \mathbb{E}_{a \sim \pi_\theta}[\log \frac{\pi_{\theta}(a|s)}{\pi^*(a|s)}] \\ & = \mathbb{E}_{a \sim \pi_\theta}[\log \pi_{\theta}(a \mid s) - \frac{Q(s,a)}{\alpha} ]\\ \end{align*}\]

in which $\pi^*(a \mid s) = \frac{\exp(Q(s,a)/\alpha)}{Z(s)}$; the normalizer $Z(s)$ is dropped because it does not depend on $a$ (it only contributes the constant $\log Z(s)$).

Here, $Q$ is the output of the action-value network. For a small, fixed action space you might be able to compute the optimal $\pi$ in closed form, but this is often infeasible when the action space is large or continuous.

Example 1: Soft Actor-Critic (SAC)

In the Soft Actor-Critic (SAC) method, the actor is the current policy $\pi_\theta$ and the critic is the current action-value network $Q_\phi(s,a)$ that evaluates the actor's actions. The goal is to learn the optimal policy $\pi^*$.
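
Below is a sketch of the Monte Carlo reverse-KL estimate above used as a SAC-style actor loss for a continuous action space. The network shapes, the omission of tanh squashing, and the assumption that `critic(states, actions)` returns $Q_\phi(s,a)$ are simplifications of this sketch.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Diagonal Gaussian policy pi_theta(a|s); tanh squashing is omitted for brevity."""
    def __init__(self, obs_dim=8, act_dim=2):
        super().__init__()
        self.body = nn.Linear(obs_dim, 64)
        self.mu = nn.Linear(64, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, s):
        h = torch.tanh(self.body(s))
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

def actor_loss(actor, critic, states, alpha=0.2):
    """alpha * D_KL(pi_theta || pi*) up to the constant log Z(s):
    E_{a ~ pi_theta}[ alpha * log pi_theta(a|s) - Q(s, a) ]."""
    dist = actor(states)
    actions = dist.rsample()                    # reparameterized sample so gradients flow
    log_prob = dist.log_prob(actions).sum(-1)   # sum over action dimensions
    q = critic(states, actions)                 # assumed to return Q_phi(s, a), shape (batch,)
    return (alpha * log_prob - q).mean()
```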

Summary

  • TRPO/PPO: $D_{KL}(\pi_{old}\|\pi_{new})$, restricts the new policy from drifting too far from the current one.
  • SAC/MPO: $D_{KL}(\pi_\theta\|\pi^*)$, samples from the current policy and encourages mode-seeking behavior.
  • Behavior cloning: $D_{KL}(\pi_{\text{expert}}\|\pi_\theta)$, covers all expert action modes.