
In reinforcement learning, how do you prove the following relation, which expresses the difference between the value functions of two policies?

The value function $V^\pi(s)$ gives the expected cumulative (discounted) reward obtained by following policy $\pi$ from state $s \in \mathcal{S}$, while the action-value function $Q^\pi(s, a)$ (a.k.a. the $Q$-function) gives the expected cumulative reward obtained by taking action $a \in \mathcal{A}$ in state $s$ and following $\pi$ thereafter. The advantage function, defined as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, measures how much better taking action $a$ in state $s$ is than acting according to $\pi$ in that state.
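
For concreteness, assuming an infinite-horizon discounted MDP with reward $r(s_t, a_t)$ and transition kernel $P$ (the question does not fix this notation), these quantities can be written as

$$V^\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s\right], \qquad Q^\pi(s, a) = \mathbb{E}\!\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right],$$

where actions are sampled from $\pi$ and transitions from $P$ except where conditioned on.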

The following relationship expresses the difference between the value functions of two policies $\pi$ and $\pi'$ in terms of the advantage function under $\pi'$, the action distribution of $\pi$, and the discounted state distribution induced by $\pi$ (a small numerical sanity check of this identity is sketched after the definitions below):

$$ \def\expect{\mathbb{E}} V^{\pi}(s_0) - V^{\pi'}(s_0) = \sum_{t=0}^\infty \gamma^t \, \expect_{s\sim P_t(\cdot|s_0,\pi)} \expect_{a\sim\pi(\cdot|s)} \left[A^{\pi'}(s,a)\right] $$

where

  • $\pi, \pi'$ are two policies;
  • $\gamma$ is the discount factor;
  • $P_t(\cdot|s_0,\pi)$ is the probability distribution over states reached at time $t$ starting from state $s_0$ and following policy $\pi$.
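
Here is a minimal numerical sanity check of the identity (not a proof) on a small random MDP, computing both sides exactly; the state/action counts, the random kernel and policies, and the truncation horizon `T` are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Random transition kernel P[s, a, s'] and reward r[s, a] (values are illustrative).
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))

def random_policy():
    p = rng.random((nS, nA))
    return p / p.sum(axis=1, keepdims=True)

pi, pi_prime = random_policy(), random_policy()

def value_functions(policy):
    """Solve V = r_pi + gamma * P_pi V exactly, then form Q and the advantage A."""
    P_pi = np.einsum('sa,sat->st', policy, P)   # state-to-state kernel under the policy
    r_pi = np.einsum('sa,sa->s', policy, r)     # expected one-step reward
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V                       # Q[s, a]
    A = Q - V[:, None]                          # advantage A[s, a]
    return V, Q, A

V_pi, _, _ = value_functions(pi)
V_pp, _, A_pp = value_functions(pi_prime)

# Right-hand side: sum_t gamma^t E_{s ~ P_t(.|s0, pi)} E_{a ~ pi(.|s)}[A^{pi'}(s, a)],
# truncated at a large horizon T.
s0, T = 0, 2000
P_pi = np.einsum('sa,sat->st', pi, P)
d = np.zeros(nS)
d[s0] = 1.0                                     # state distribution at time t (starts at s0)
rhs = 0.0
for t in range(T):
    rhs += gamma**t * (d @ np.einsum('sa,sa->s', pi, A_pp))
    d = d @ P_pi                                # advance the state distribution one step

print(V_pi[s0] - V_pp[s0], rhs)                 # the two values should agree closely
```

The two printed numbers matching (up to the truncation error of the horizon) is consistent with the identity, though of course it does not replace a derivation.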