
In reinforcement learning, how do you prove the following relation, which expresses the difference between the value functions of two policies?

The value function $V^\pi(s)$ gives the expected cumulative (discounted) reward obtained by following policy $\pi$ from state $s \in \mathcal{S}$, while the action-value function $Q^\pi(s, a)$ (a.k.a. the $Q$-function) gives the expected cumulative reward obtained by taking action $a \in \mathcal{A}$ in state $s$ and following $\pi$ thereafter. The advantage function, defined as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, measures how much better taking action $a$ in state $s$ is than acting according to $\pi$ in that state.
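
For concreteness, assuming an infinite-horizon discounted MDP with reward $r(s_t, a_t)$ and transition kernel $P$ (the question does not fix this notation), these quantities can be written as

$$V^\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s\right], \qquad Q^\pi(s, a) = \mathbb{E}\!\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right],$$

where actions are sampled from $\pi$ and transitions from $P$ except where conditioned on.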

The following relationship expresses the difference between the value functions of two policies $\pi$ and $\pi'$ in terms of the advantage function under $\pi'$, the action distribution of $\pi$, and the discounted state distribution induced by $\pi$ (a small numerical sanity check of this identity is sketched after the definitions below):

$$ \def\expect{\mathbb{E}} V^{\pi}(s_0) - V^{\pi'}(s_0) = \sum_{t=0}^\infty \gamma^t \, \expect_{s\sim P_t(\cdot|s_0,\pi)} \expect_{a\sim\pi(\cdot|s)} \left[A^{\pi'}(s,a)\right] $$

where

  • $\pi, \pi'$ are two policies;
  • $\gamma$ is the discount factor;
  • $P_t(\cdot|s_0,\pi)$ is the probability distribution over states reached at time $t$ starting from state $s_0$ and following policy $\pi$.
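
Here is a minimal numerical sanity check of the identity (not a proof) on a small random MDP, computing both sides exactly; the state/action counts, the random kernel and policies, and the truncation horizon `T` are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Random transition kernel P[s, a, s'] and reward r[s, a] (values are illustrative).
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))

def random_policy():
    p = rng.random((nS, nA))
    return p / p.sum(axis=1, keepdims=True)

pi, pi_prime = random_policy(), random_policy()

def value_functions(policy):
    """Solve V = r_pi + gamma * P_pi V exactly, then form Q and the advantage A."""
    P_pi = np.einsum('sa,sat->st', policy, P)   # state-to-state kernel under the policy
    r_pi = np.einsum('sa,sa->s', policy, r)     # expected one-step reward
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V                       # Q[s, a]
    A = Q - V[:, None]                          # advantage A[s, a]
    return V, Q, A

V_pi, _, _ = value_functions(pi)
V_pp, _, A_pp = value_functions(pi_prime)

# Right-hand side: sum_t gamma^t E_{s ~ P_t(.|s0, pi)} E_{a ~ pi(.|s)}[A^{pi'}(s, a)],
# truncated at a large horizon T.
s0, T = 0, 2000
P_pi = np.einsum('sa,sat->st', pi, P)
d = np.zeros(nS)
d[s0] = 1.0                                     # state distribution at time t (starts at s0)
rhs = 0.0
for t in range(T):
    rhs += gamma**t * (d @ np.einsum('sa,sa->s', pi, A_pp))
    d = d @ P_pi                                # advance the state distribution one step

print(V_pi[s0] - V_pp[s0], rhs)                 # the two values should agree closely
```

The two printed numbers matching (up to the truncation error of the horizon) is consistent with the identity, though of course it does not replace a derivation.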