
\begin{equation} \mathcal{W}_\epsilon(\alpha, \beta) = \min_{\pi\in \Pi(\alpha,\beta)} \int c(x,y)\, \mathrm{d}\pi(x,y) + \epsilon H(\pi \| \alpha \otimes \beta) \end{equation} Cuturi (2013) introduced the entropy-regularized Wasserstein distance, or Sinkhorn distance, shown above, where $\epsilon$ is the regularization parameter and $H(\pi \| \alpha \otimes \beta)$ is the relative entropy, or KL divergence, between the transport plan and the product of the marginals.
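For concreteness, here is a minimal sketch of the discrete problem I have in mind, written as the standard Sinkhorn fixed-point iteration in NumPy (the cost matrix `C`, the marginals `a`, `b`, and the function name are just placeholders for illustration, not Cuturi's reference implementation):

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=500):
    """Minimal sketch of Sinkhorn iterations for discrete
    entropy-regularized OT; a, b are marginal weights, C is a cost matrix."""
    K = np.exp(-C / eps)                # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                 # enforce the first marginal
        v = b / (K.T @ u)               # enforce the second marginal
    P = u[:, None] * K * v[None, :]     # transport plan diag(u) K diag(v)
    return P, np.sum(P * C)             # plan and transport cost <C, P>

# toy example: two empirical measures on the line
x = np.linspace(0, 1, 5)
y = np.linspace(0, 1, 5) + 0.3
C = (x[:, None] - y[None, :]) ** 2      # squared-distance cost
a = b = np.full(5, 1 / 5)
P, cost = sinkhorn(a, b, C, eps=0.05)
print(P.sum(axis=1), P.sum(axis=0))     # both close to the uniform marginals
```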

But I have seen the $H(\cdot)$ term shown in two different ways, one with entropy and the other with relative entropy:

\begin{align} H(\pi) &= \int \pi(x,y) \ln \pi(x,y)\, \mathrm{d}x\, \mathrm{d}y \\ H(\pi \| \alpha \otimes \beta) &= \int \ln \left(\frac{\mathrm{d}\pi (x,y)}{\mathrm{d}\alpha(x)\, \mathrm{d}\beta(y) } \right) \mathrm{d}\pi (x,y) \end{align}

How are these two expressions equal or connected to each other? Obviously they're not the same, so why are there two different versions floating around?

  • Ok, my bad! I didn't understand the question. Commented Nov 18, 2020 at 11:34
  • I'm starting to think that the notation $H(\pi \| \alpha \otimes \beta)$ is completely wrong, since other articles use $D_{KL}(\pi \| \alpha \otimes \beta)$, which is the KL divergence, not the entropy. If it's correct to say that $H \neq D_{KL}$ in general, for any inputs, then the current best answer skillfully found a way to make them equal each other when in fact they were never intended to be in practice. I guess this leaves the question of which of the two versions of the "entropy" term better complies with the underlying Sinkhorn algorithm intended to solve the Sinkhorn distance.
    – develarist
    Commented Dec 3, 2020 at 23:18

2 Answers


These two are actually equivalent up to an additive constant when $\pi$ is a coupling of $\alpha$ and $\beta$. I'll assume that $\pi, \alpha, \beta$ all have densities. We can then write:

$$ H(\pi \,\|\, \alpha\otimes \beta) = \int\ln\left(\frac{\mathrm{d}\pi}{\mathrm{d}\alpha\, \mathrm{d}\beta} \right)\mathrm{d}\pi = \int \pi(x,y) \ln\left(\frac{\pi(x,y)}{\alpha(x)\beta(y)} \right) \mathrm{d}x\, \mathrm{d}y $$

Note that $\pi(x,y)$ is the density with respect to the Lebesgue measure, and the same can be said for $\alpha(x)$ and $\beta(y)$. Therefore:

$$ \begin{aligned} H(\pi \,\|\, \alpha\otimes \beta) &= \int\pi(x,y)\ln \pi(x,y)\, \mathrm{d}x\, \mathrm{d}y - \int\pi(x,y)\ln\alpha(x)\, \mathrm{d}x\, \mathrm{d}y - \int\pi(x,y)\ln\beta(y)\, \mathrm{d}x\, \mathrm{d}y \\ &= \int \pi(x,y) \ln\pi(x,y)\, \mathrm{d}x\, \mathrm{d}y - \int\alpha(x)\ln\alpha(x)\, \mathrm{d}x - \int \beta(y) \ln \beta(y)\, \mathrm{d}y \\ &= H(\pi) - H(\alpha) - H(\beta), \end{aligned} $$

where the second equality uses that $\pi$ has marginals $\alpha$ and $\beta$: integrating $\pi(x,y)\ln\alpha(x)$ over $y$ leaves $\alpha(x)\ln\alpha(x)$, and likewise for $\beta$.

Since $\alpha$ and $\beta$ are fixed, $H(\alpha)$ and $H(\beta)$ are constants, so the relative entropy equals $H(\pi) + C$ for a constant $C$ that does not depend on $\pi$.
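Here is a quick numerical sanity check of the identity in the discrete case (the marginals and the coupling below are arbitrary placeholders; the coupling is produced by a Sinkhorn-style rescaling of a random kernel just to get something with the right marginals):

```python
import numpy as np

rng = np.random.default_rng(0)

# arbitrary discrete marginals
alpha = rng.random(4); alpha /= alpha.sum()
beta = rng.random(6); beta /= beta.sum()

# an (essentially arbitrary) coupling with marginals alpha and beta,
# obtained by rescaling a random positive kernel
K = rng.random((4, 6))
u, v = np.ones(4), np.ones(6)
for _ in range(2000):
    u = alpha / (K @ v)
    v = beta / (K.T @ u)
pi = u[:, None] * K * v[None, :]

H = lambda p: np.sum(p * np.log(p))                              # H(p) = sum p log p
kl = np.sum(pi * np.log(pi / (alpha[:, None] * beta[None, :])))  # H(pi || alpha x beta)

print(kl, H(pi) - H(alpha) - H(beta))   # the two numbers agree
```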

  • I wonder why Cuturi or anyone else doesn't derive this equality. Does any source give it? And is there a reason why half of the users use entropy while the other half uses relative entropy? How would they assign a value to $C$ and decide which of the two to use? Or which of the two extremes is better for probability distributions?
    – develarist
    Commented Nov 18, 2020 at 19:03
  • Yeah... no idea. I asked a similar question in another post, because in the book by Peyré and Cuturi they define entropy differently than usual by adding a constant, but I cannot understand why. Commented Nov 18, 2020 at 19:11
  • You seem interested in OT; I saw some of your other posts. I've started a study group on a platform called Zulip. Let me know if you are interested in joining. We are going through the Computational Optimal Transport book, but I'm also reading Santambrogio on the side. Commented Nov 18, 2020 at 19:13
  • At least if $\alpha$ and $\beta$ have full support, it doesn't matter which one you choose. The minimizers are the same.
    – Tobsn
    Commented Nov 18, 2020 at 21:16
  • Because the relative entropy is infinite by definition if the two measures are not absolutely continuous with respect to each other.
    – Tobsn
    Commented Jan 5, 2021 at 21:21

I would like to add a couple of points here which I think shouldn't be overlooked.

Neither of the choices is "wrong". In the 2013 Cuturi paper you reference, he chooses to regularise with the "entropy" (notice this is actually the negative Boltzmann entropy):

$$ H(\pi)= \begin{cases} \int \pi \log \pi~~&\text{when}~\pi~\text{has a density,} \\ +\infty & \text{otherwise.} \end{cases} $$

  1. The reason this is a natural choice for regularisation is that it does the "smoothing" or "relaxing" job that regularisation is intended to do. Adding $H$ to the optimal transport problem gives the mass "freedom to spread out". This can be seen in the following example (checked numerically in the sketch after this list): let $\mu$ be concentrated on two points $x_1, x_2 \in \mathbb{R}$ with $\mu(x_1)=\mu(x_2)=\frac{1}{2}$, and let $\nu$ be concentrated on two points $y_1, y_2 \in \mathbb{R}$ with $\nu(y_1)=\nu(y_2)=\frac{1}{2}$. Then the coupling $\pi$ that minimises $H$ (i.e. maximises the Boltzmann entropy), and which the regularisation therefore favours, is

$$ \pi(x_i,y_j)=\frac{1}{4},\forall~i,j.$$

  2. Since we have a minimisation problem, it's beneficial to add a uniformly convex term, and again $H$ ticks that box!

  3. The choice of adding $H(\pi~\|~\alpha\otimes\beta)$, the relative entropy with respect to the product measure, has its advantages, as outlined in https://audeg.github.io/publications/these_aude.pdf. As far as I understand, it lets you rephrase the dual problem in a neat way.

  4. Now to comparing the two choices: I don't think it matters too much either way; both do the same job. As the other answer points out, the minimiser is the same, and the two objectives differ by a constant $C$ (see the numerical sketch after this list). Lastly, remember that when "doing regularisation" you usually have a small parameter $\epsilon \ll 1$ multiplying the regularisation term, hence

$$ \epsilon \Big(H(\pi~\|~\alpha\otimes\beta)-H(\pi)\Big)=\epsilon C \ll 1. $$
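To make the two-point example and the "differ by a constant" remark concrete, here is a small numerical sketch (the parametrisation below covers every coupling of those two marginals; it is only meant as an illustration):

```python
import numpy as np

# every coupling of mu = (1/2, 1/2) and nu = (1/2, 1/2) has the form
# pi(t) = [[t, 1/2 - t], [1/2 - t, t]]  for t in (0, 1/2)
ts = np.linspace(0.01, 0.49, 49)

def H(pi):                       # H(pi) = sum pi log pi (negative Boltzmann entropy)
    return np.sum(pi * np.log(pi))

prod = np.full((2, 2), 0.25)     # the product measure mu x nu

Hs, KLs = [], []
for t in ts:
    pi = np.array([[t, 0.5 - t], [0.5 - t, t]])
    Hs.append(H(pi))
    KLs.append(np.sum(pi * np.log(pi / prod)))   # H(pi || mu x nu)
Hs, KLs = np.array(Hs), np.array(KLs)

print(ts[np.argmin(Hs)])    # 0.25: the uniform coupling minimises H
print(np.ptp(KLs - Hs))     # ~0: the two regularisers differ by the same constant
print((KLs - Hs)[0])        # that constant is -H(mu) - H(nu) = log 4 here
```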

  • Edit: I have some changes to make about which choice to take.
  • Can you recommend articles that explain and demonstrate why entropy-regularized learning, whether for the Sinkhorn distance or not, enhances out-of-sample performance, besides Cuturi and Aude? They say it but aren't convincing, even though in practice we know entropy somehow improves out-of-sample performance in learning algorithms as if by magic.
    – develarist
    Commented Jan 8, 2021 at 4:44
  • I don't really study machine learning. I use this stuff for Wasserstein Gradient Flows. Maybe post that as a question, or check papers which cite them? Commented Jan 8, 2021 at 9:43
