
I've studied convex optimization pretty carefully, but don't feel that I have yet "grokked" the dual problem. Here are some questions I would like to understand more deeply/clearly/simply:

  1. How would somebody think of the dual problem? What thought process would lead someone to consider the dual problem and to recognize that it's valuable/interesting?
  2. In the case of a convex optimization problem, is there any obvious reason to expect that strong duality should (usually) hold?
  3. It often happens that the dual of the dual problem is the primal problem. However, this seems like a complete surprise to me. Is there any intuitive reason to expect that this should happen?
  4. Does the use of the word "dual" or "duality" in optimization have anything to do with the dual space in linear algebra? Or are they just different concepts that go by the same name? What about the use of the word "dual" in projective geometry — is there a connection there?
  5. You can define the dual problem and prove theorems about strong duality without ever mentioning the Fenchel conjugate. For example, Boyd and Vandenberghe prove a strong duality theorem without mentioning the Fenchel conjugate in their proof. And yet, people often talk as if the Fenchel conjugate is somehow the "essence" of duality, and make it sound as if the whole theory of duality is based on the Fenchel conjugate. Why is the Fenchel conjugate considered to have such fundamental importance?

Note: I will now describe my current level of understanding of the intuition behind the dual problem. Please tell me if you think I might be missing any basic insights.

I have read the excellent notes about convex optimization by Guilherme Freitas, and in particular the part about "penalty intuition". When we are trying to solve

\begin{align*} \text{minimize} &\quad f(x) \\ \text{such that} & \quad h(x) \leq 0 \end{align*}

one might try to eliminate the constraints by introducing a penalty when constraints are violated. This gives us the new unconstrained problem

\begin{equation} \text{minimize} \quad f(x) + \langle \lambda ,h(x) \rangle \end{equation}

where $\lambda \geq 0$. It's not hard to see that for a given $\lambda \geq 0$, the optimal value of this unconstrained problem is less than or equal to the optimal value for the constrained problem. This gives us a new problem — find $\lambda$ so that the optimal value for the unconstrained problem is as large as possible. That is one way to imagine how somebody might have thought of the dual problem. Is this the best intuition for where the dual problem comes from?
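To see this numerically, here is a minimal sketch (the toy data $f(x) = (x-3)^2$, $h(x) = x - 1$ are my own choice, not from the notes): every $\lambda \geq 0$ produces a lower bound on the constrained optimum, and searching over $\lambda$ tightens it.

```python
import numpy as np

# Toy problem: minimize f(x) = (x - 3)^2 subject to h(x) = x - 1 <= 0.
# Constrained optimum: x = 1, with value f(1) = 4.
f = lambda x: (x - 3) ** 2
h = lambda x: x - 1

x = np.linspace(-10, 10, 100001)        # grid standing in for "all x"
primal = f(x)[h(x) <= 0].min()          # optimal value of the constrained problem

# For each lambda >= 0, the unconstrained optimum of f + lambda*h is a lower bound.
for lam in [0.0, 1.0, 2.0, 4.0, 6.0]:
    g = (f(x) + lam * h(x)).min()
    print(f"lambda = {lam}: g(lambda) = {g:.3f} <= {primal:.3f}")
# The bound is tight at lambda = 4, where g(4) = 4 equals the primal optimum.
```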

Another viewpoint: the KKT conditions can be derived using what Freitas calls the "geometric intuition". Then, if we knew the value of the multipliers $\lambda$, it would often be much easier to find $x$. So, a new problem is to find $\lambda$. And if we can somehow recognize that $\lambda$ is a maximizer for the dual problem, then this suggests that we might try solving the dual problem.

Please explain or give references to any intuition that you think I might find interesting, even if it's not directly related to what I asked.

Comments:

  • This is pretty late, but in case anyone'd like a different perspective, you might like to have a look at the (my) answer to this question. Also, I've always thought that these slides introduce duality very well (and concisely). – jkn, Dec 30, 2013
  • Link to slides mentioned in previous comment is broken; try the Wayback Machine link. – Oct 16, 2020

8 Answers

---

Here's what's really going on with the dual problem. (This is my attempt to answer my own question, over a year after originally asking it.)

(A very nice presentation of this material is given in Ekeland and Temam. These ideas are also in Rockafellar.)

Let $V$ be a finite dimensional normed vector space over $\mathbb R$. (Working in an inner product space or just in $\mathbb R^N$ risks concealing the fundamental role that the dual space plays in duality in convex optimization.)

The basic idea behind duality in convex analysis is to think of a convex set in terms of its supporting hyperplanes. (A closed convex set $\Omega$ can be "recovered" from its supporting hyperplanes by taking the intersection of all closed half spaces containing $\Omega$. The set of all supporting hyperplanes to $\Omega$ is sort of a "dual representation" of $\Omega$.)

For a convex function $f$ (whose epigraph is a convex set), this strategy leads us to think about $f$ in terms of affine functions $\langle m^*, x \rangle - \alpha$ which are majorized by $f$. (Here $m^* \in V^*$ and we are using the notation $\langle m^*, x \rangle = m^*(x)$.)

For a given slope $m^* \in V^*$, we only need to consider the "best" choice of $\alpha$ -- the other affine minorants with slope $m^*$ can be disregarded. \begin{align*} & f(x) \geq \langle m^*,x \rangle - \alpha \quad \forall x \in V \\ \iff & \alpha \geq \langle m^*, x \rangle - f(x) \quad \forall x \in V \\ \iff & \alpha \geq \sup_{x \in V} \quad \langle m^*, x \rangle - f(x) \end{align*} so the best choice of $\alpha$ is \begin{equation} f^*(m^*) = \sup_{x \in V} \quad \langle m^*, x \rangle - f(x). \end{equation} If this supremum is finite, then $\langle m^*,x \rangle - f^*(m^*)$ is the best affine minorant of $f$ with slope $m^*$. If $f^*(m^*) = \infty$, then there is no affine minorant of $f$ with slope $m^*$.

The function $f^*$ is called the "conjugate" of $f$. The definition and basic facts about $f^*$ are all highly intuitive. For example, if $f$ is a proper closed convex function then $f$ can be recovered from $f^*$, because any closed convex set (in this case the epigraph of $f$) is the intersection of all the closed half spaces containing it. (I still think the fact that the "inversion formula" $f = f^{**}$ is so simple is a surprising and mathematically beautiful fact, but not hard to derive or prove with this intuition.)
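To make this concrete, here is a standard example I'm adding for illustration. Take $V = \mathbb R$ and $f(x) = \tfrac12 x^2$. Then $$f^*(m) = \sup_x \big( mx - \tfrac12 x^2 \big) = \tfrac12 m^2,$$ attained at $x = m$, and $$f^{**}(x) = \sup_m \big( xm - \tfrac12 m^2 \big) = \tfrac12 x^2 = f(x),$$ so intersecting the half spaces above the affine minorants $x \mapsto mx - \tfrac12 m^2$ really does recover the epigraph of $f$. Similarly, for $f(x) = |x|$ one finds $f^*(m) = 0$ if $|m| \leq 1$ and $+\infty$ otherwise, and $f^{**}(x) = \sup_{|m| \leq 1} mx = |x|$.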

Because $f^*$ is defined on the dual space, we see already the fundamental role played by the dual space in duality in convex optimization.

Given an optimization problem, we don't obtain a dual problem until we specify how to perturb the optimization problem. This is why equivalent formulations of an optimization problem can lead to different dual problems. By reformulating it we have in fact specified a different way to perturb it.

As is typical in math, the ideas become clear when we work at an appropriate level of generality. Assume that our optimization problem is \begin{equation*} \operatorname*{minimize}_{x} \quad \phi(x,0). \end{equation*} Here $\phi:X \times Y \to \bar{\mathbb R}$ is convex. Standard convex optimization problems can be written in this form with an appropriate choice of $\phi$. The perturbed problems are \begin{equation*} \operatorname*{minimize}_{x} \quad \phi(x,y) \end{equation*} for nonzero values of $y \in Y$.

Let $h(y) = \inf_x \phi(x,y)$. Our optimization problem is simply to evaluate $h(0)$.

From our knowledge of conjugate functions, we know that \begin{equation*} h(0) \geq h^{**}(0) \end{equation*} and that typically we have equality. For example, if $h$ is subdifferentiable at $0$ (which is typical for a convex function) then $h(0) = h^{**}(0)$.

The dual problem is simply to evaluate $h^{**}(0)$.

In other words, the dual problem is: \begin{equation*} \operatorname*{maximize}_{y^* \in Y^*} \quad - h^*(y^*). \end{equation*} We see again the fundamental role that the dual space plays here.

It is enlightening to express the dual problem in terms of $\phi$. It's easy to show that the dual problem is \begin{equation*} \operatorname*{maximize}_{y^* \in Y^*} \quad - \phi^*(0,y^*). \end{equation*}

So the primal problem is \begin{equation*} \operatorname*{minimize}_{x \in X} \quad \phi(x,0) \end{equation*} and the dual problem (slightly restated) is \begin{equation*} \operatorname*{minimize}_{y^* \in Y^*} \quad \phi^*(0,y^*). \end{equation*} The similarity between these two problems is mathematically beautiful, and we can see that if we perturb the dual problem in the obvious way, then the dual of the dual problem will be the primal problem (assuming $\phi = \phi^{**}$). The natural isomorphism between $V$ and $V^{**}$ is of fundamental importance here.

The key facts about the dual problem -- strong duality, the optimality conditions, and the sensitivity interpretation of the optimal dual variables -- all become intuitively clear and even "obvious" from this viewpoint.

An optimization problem in the form \begin{align*} \operatorname*{minimize}_x & \quad f(x) \\ \text{subject to} & \quad g(x) \leq 0, \end{align*} can be perturbed as follows: \begin{align*} \operatorname*{minimize}_x & \quad f(x) \\ \text{subject to} & \quad g(x) + y \leq 0. \end{align*}

This perturbed problem has the form given above with \begin{equation*} \phi(x,y) = \begin{cases} f(x) \quad \text{if } g(x) + y \leq 0 \\ \infty \quad \text{otherwise}. \end{cases} \end{equation*} To find the dual problem, we need to evaluate $-\phi^*(0,y^*)$, which is a relatively straightforward calculation. \begin{align*} -\phi^*(0,y^*) &= -\sup_{g(x) + y \leq 0} \quad \langle y^*,y \rangle - f(x) \\ &= -\sup_{\substack{ x \\ q \geq 0 }} \quad \langle y^*, -g(x) - q \rangle - f(x) \\ &= \inf_{\substack{ x \\ q \geq 0 }} \quad f(x) + \langle y^*, g(x) \rangle + \langle y^*, q \rangle. \end{align*} (In the second step we substituted $y = -g(x) - q$ with $q \geq 0$.) We can minimize first with respect to $q$, and we will get $-\infty$ unless $\langle y^*, q \rangle \geq 0$ for all $q \geq 0$. In other words, we will get $-\infty$ unless $y^* \geq 0$.

The dual function is \begin{equation*} -\phi^*(0,y^*) = \begin{cases} \inf_x \quad f(x) + \langle y^*, g(x) \rangle \quad \text{if } y^* \geq 0 \\ -\infty \quad \text{otherwise}. \end{cases} \end{equation*}

This is the expected result.
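As a numerical sanity check of this calculation (my own toy sketch, not part of the original derivation): take $f(x) = x^2$ and $g(x) = x - 1$, compute $-\phi^*(0, y^*)$ directly from the definition of $\phi$ by brute force over a grid of $(x, y)$, and compare it with the Lagrangian dual function $\inf_x f(x) + \langle y^*, g(x) \rangle$.

```python
import numpy as np

# Toy data: f(x) = x^2, g(x) = x - 1, so phi(x, y) = x^2 if g(x) + y <= 0, else +inf.
f = lambda x: x ** 2
g = lambda x: x - 1

x = np.linspace(-20, 20, 1001)
y = np.linspace(-5, 25, 1501)
X, Y = np.meshgrid(x, y)
feasible = g(X) + Y <= 0                    # the region where phi is finite

for ystar in [0.5, 1.0, 2.0]:
    # Path A: -phi*(0, y*) = -sup over feasible (x, y) of <y*, y> - f(x).
    vals = np.where(feasible, ystar * Y - f(X), -np.inf)
    path_a = -vals.max()
    # Path B: the classical Lagrangian dual function inf_x f(x) + y* g(x).
    path_b = (f(x) + ystar * g(x)).min()
    print(f"y* = {ystar}: perturbation formula {path_a:.3f}, Lagrangian {path_b:.3f}")
# Both paths agree (here both equal -y*^2/4 - y*), as the calculation above predicts.
```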

Comments:

  • ohhh this was really helpful! If you ever get the time, it would be great if you could also expand on "The key facts about the dual problem all become intuitively clear". – air, Jul 15, 2015
  • Thank you very much for the clarifications. Could you (or somebody else) please also explain the intuition behind the perturbation function $\phi$, and the variable $y$ in particular? Is there a connection to the "penalty intuition" you mentioned in your original question, or even to the Lagrangian? – kostrykin, May 13, 2017
  • @theV0ID When faced with an optimization problem, it's natural to ask what would happen to the optimal value if you were to perturb the problem a bit, for example by tightening or loosening one of the constraints. For example, if we are going to loosen just one constraint, which constraint should we pick to get the most bang for our buck? This thought process suggests introducing the perturbed problems and also the "value function" $h(y) = \inf_x \, \phi(x,y)$ as objects for us to think about. – littleO, May 14, 2017
  • This is great. Can you tell a little more why "The dual problem is simply to evaluate $h^{**}(0)$"? And from that, why this dual problem means "$\operatorname*{maximize}_{y^* \in Y^*} [- h^*(y^*)]$"? – X Leo, May 24, 2019
  • Btw, your answer was cited in arxiv.org/pdf/1609.06349.pdf. – Jul 29, 2019

---

I'll take a crack at a couple of these questions (some of them are hard and would require more thought).

1) Here's a nice economic interpretation of duality (being a bit "fast and loose" with the details). Let's say $x$ represents the design of a widget, and $f(x)$ is the cost you will incur producing it. $x$ must satisfy "design constraints" given by $h(x)\leq 0$ (suppose for simplicity that we have only 1 constraint function). Out of curiosity, you decide to try farming out production: another company agrees to produce your widget $x$ and "charge" you $f(x)$ for it. Their goal is ultimately to maximize profit, but they can't charge you more than you would spend doing it yourself. So given a fixed $\lambda$, they need to find the design $x$ that minimizes $f(x)+\lambda h(x)$. If they don't, you'll be able to find a feasible design $y$ with a lower cost $f(y)<f(x)$, and you'll make the widget yourself. Now that this company can do at least as well as you, they'll set about maximizing their profit by varying $\lambda$ (maybe $\lambda$ corresponds to a different process or something).

This interpretation is similar to that in Boyd and Vandenberghe, 5.4.4. Also I find 'game theoretic' interpretations helpful - the primal problem is your strategy while the dual corresponds to an opponent's strategy.

2) I don't have an intuitive answer for this other than to say that strong duality is equivalent to the existence of a saddle point of the Lagrangian.

3) For convex problems this makes sense because applying the convex conjugate (Fenchel transform) twice returns the 'convexification' of the original objective function, which is the same as the original function in most 'nice' situations.

4) Yes, but you definitely have to be careful here. The dual of a vector space is formally defined as the space of all continuous linear functionals on that space, and this concept lives 100% independently of optimization theory. However, you're correct to notice that the dual of a vector space does arise in the statement of the dual of an optimization problem. Let me explain: define $X=\mathbb{R}^n$ and $Y=\mathbb{R}^m$ and consider the problem:

$$ \begin{aligned} \text{minimize}\quad & f(x)\\ \text{such that} \quad & h(x)\leq 0, \end{aligned} $$ where $x\in X$, $f:X\rightarrow \mathbb{R}$, and $h:X\rightarrow Y$. Then this problem has the following dual:

$$ \max_\lambda\quad \inf_{x\in X} \{f(x)+\langle \lambda,h(x)\rangle\}\\ \text{such that}\quad \lambda_i\geq 0,~~ i = 1,\dots,m $$ Now, "formally", $\lambda$ is an element of the dual space of $Y$, since we're considering an inner product of the form $\langle\lambda, y\rangle$ where $y\in Y$, and hence we can think of $\lambda$ as a continuous linear functional on $Y$. But here $Y$ is finite dimensional, so (by the Riesz representation theorem) $Y^*$ is actually isomorphic to $Y$, so the distinction isn't really necessary and you haven't gained anything by thinking of $\lambda$ as being an element of the dual space. It's possible that in infinite dimensional problems where $Y^*$ is not isomorphic to $Y$ you get something out of this, but I can't think of a good example.

To my knowledge, there is zero connection between duality in projective geometry and duality in optimization. Duality in projective geometry is more a statement about the bijection between points and rays that defines projective space. "Duality" in math really just means having 2 ways to think about a problem - it's as overused as words like "fundamental" and "canonical". Another classical example of duality comes from fluid dynamics and PDE - "Eulerian" coordinates vs. 'Lagrangian' coordinates.

5) For convex problems, I would agree that the convex conjugate is key, though in general I'm not sure it helps a lot with the 'general' idea of duality since (a) not all interesting problems are convex and (b) the convex conjugate is legendarily hard to gain intuition for. Historically this 'fundamental importance' of the convex conjugate might be traced to physics, where it turns out that a lot of interesting physical systems have a convex 'Lagrangian' (for many mechanical systems, kinetic minus potential energy). The equations of motion can then be phrased essentially as a minimization problem with respect to this Lagrangian. The convex conjugate of the Lagrangian (in the velocity variable), it then turns out, is what is called the 'Hamiltonian' (typically the total energy), which leads to an entirely different (one might say 'dual') formulation of the equations of physics. Thus we have yet again two ways to approach the same problem!
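For the simplest instance of this correspondence (a standard example, added here just for concreteness): a particle of mass $m$ in a potential $V(q)$ has Lagrangian $L(q,v) = \tfrac12 m v^2 - V(q)$, which is convex in the velocity $v$. Its convex conjugate in $v$ is $$ H(q,p) = \sup_v \big( pv - L(q,v) \big) = \frac{p^2}{2m} + V(q), $$ the supremum being attained at $v = p/m$. This is exactly the Hamiltonian, here the total (kinetic plus potential) energy.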

Hope this was somewhat useful to you.

Comments:

  • I wonder why we need to solve the dual problem; why not just solve the primal one directly? Because the dual one is often more tractable? – avocado, Dec 17, 2013
  • @loganecolss yes, or more commonly various saddle point formulations are more computationally tractable. The dual can also offer a useful theoretical perspective, especially in problems of economic interest, and finally the dual (actually the Lagrangian/saddle point problem) is crucial for proving things like existence and characterizing solutions. – icurays1, Dec 17, 2013
  • @loganecolss In my opinion, saying that solving the dual problem is easier is ridiculous! Even if we restrict ourselves to problems where constructing the dual does not require much effort, like LPs, we have the symmetry $$\text{dual of the dual} = \text{primal},$$ which is enough to reject this claim, widely used though it is in the optimization community! – Red shoes, Jul 22, 2017
  • @Redshoes why would it be ridiculous? You have two problems to choose from: primal and dual. It's possible one is simpler to solve than the other. Nobody is saying that the dual is /always/ easier. – May 31, 2021
  • @TSF I'm not talking about an abstract, specific example cooked up so that its dual seems easier to solve; consider a class of problems. I have heard a lot of people say "the dual problem is usually easier to solve"! Plus, what do you mean by solving: solving by definition, or by an algorithm? There are many algorithms that essentially solve the dual problem without obtaining the dual problem explicitly. On top of that, your original problem is the primal problem, not its dual, so when you solve a dual problem you still have to recover the primal variables. Is this also easy? – Red shoes, Jun 1, 2021

---

Consider the problem $$ \begin{aligned} \mbox{min} \quad& f(x) \\ \mbox{subject to} \quad& x\le a\\ \end{aligned} $$ where $f$ is convex, illustrated below:

[figure: a convex function $f$ with the constraint region $x \le a$]

The essential features are there in this problem despite it being the most basic problem with an inequality. Now consider the optimum given as a function of $a$. So $f_+(a)=\inf_{x\le a}f(x)$. Its graph looks like this:

[figure: the graph of $f_+$, which follows the descending parts of $f$ and then stays flat]

It's basically like the original $f$ except that we only have the descending parts and none of the ascending parts.

We're going to construct $f_+$ by a slightly roundabout way, not by considering the function itself, but its epigraph, the set of points above the graph. Here's the epigraph for the original $f$:

[figure: the epigraph of $f$, with several subtangent lines touching it]

Notice I've drawn a bunch of lines just touching the epigraph, ie. its subtangents. In general, any convex shape can be expressed as the intersection of half-planes. Those subtangents indicate a handful of the boundaries of those half-planes. We can construct those subtangents by considering the lines $y=\lambda x$ and then pushing those lines up or down so each is as high as possible without crossing the graph of $f$. We can find out how much pushing we need by solving the optimisation (over $x$): $$ \begin{aligned} \mbox{maximise} \quad& \lambda x-f(x) \end{aligned} $$ and we call the optimal value $f^\ast(\lambda)$. $f^\ast$ is also known as the Legendre transform or Fenchel dual. It's basically telling us which half-planes define our epigraph. Interestingly, reconstructing the function $f$ from $f^\ast$ requires exactly the same expression. So the Fenchel dual also takes a collection of half-planes and gives you back the original function.

So now we want to construct the epigraph of $f_+$. We just want the descending parts of $f$. It's easy to construct the epigraph: we take the intersection of just the half-planes whose boundaries are decreasing. Here's a picture:

[figure: the epigraph of $f_+$: the intersection of only the half-planes with decreasing boundary lines]

So here's a process for constructing $f_+$: we take the Fenchel dual of $f$, we throw away the "half" where $\lambda>0$, and now take the inverse Fenchel dual (which is in fact just the Fenchel dual).

Now let's construct the dual to our original optimisation problem as described in numerous textbooks. We form the Lagrangian $$L(a,x,\lambda)=f(x)+\lambda (x-a)$$ We then form the function $$g(a,\lambda)=\min_x L(a,x,\lambda)$$ This is basically (modulo a sign) the Fenchel dual of $f$ evaluated at $-\lambda$, with an extra $-\lambda a$ term: $g(a,\lambda) = -f^\ast(-\lambda)-\lambda a$.

The dual problem, as in the textbooks, is $$ \begin{aligned} \mbox{maximise} \quad& g(a,\lambda)\\ \mbox{such that} \quad & \lambda\ge0\\ \end{aligned} $$ Remembering that $-\lambda a$ term, that's almost the inverse Fenchel dual except we've maximised over just "half" of the $\lambda$s, those for which $\lambda\ge0$. When we compute the optimum of the dual problem we get $h(a)=\max_{\lambda\ge 0} g(a,\lambda)$. If the problem is convex, we've just computed $f_+$ by another path.

So this gives what I think is a clear picture of what the dual problem is. Forming the dual problem is (modulo a vertical shift) converting the original problem into finding the subtangents. Solving the dual problem is converting that back to a function again, but using only the downward sloping lines.
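Here is a small numerical sketch of exactly that process (my own illustration, with $f(x) = (x-1)^2$ standing in for the function in the pictures): compute the Fenchel dual on a grid, keep only the non-positive slopes, and transform back; the result is the "descending parts only" function $f_+$.

```python
import numpy as np

# Stand-in for the f in the pictures: f(x) = (x - 1)^2, minimised at x = 1.
xs = np.linspace(-6, 6, 2401)
f = (xs - 1) ** 2

# Fenchel dual f*(lam) = max_x (lam*x - f(x)), kept only for slopes lam <= 0
# (the downward-sloping subtangents).
lams = np.linspace(-12, 0, 1201)
fstar = np.array([(lam * xs - f).max() for lam in lams])

# Transforming back with the same formula rebuilds the intersection of the
# kept half-planes, i.e. the "descending parts only" function f_+.
a_grid = np.array([-4.0, -1.0, 0.0, 1.0, 2.0, 4.0])
f_plus = np.array([(lams * a - fstar).max() for a in a_grid])
exact = np.where(a_grid <= 1, (a_grid - 1) ** 2, 0.0)   # inf over {x <= a} of f
print(np.round(f_plus, 3))   # matches `exact` up to grid error
print(exact)                 # [25.  4.  1.  0.  0.  0.]
```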

I'm hoping this also answers some of the other questions.

By the way, this is connected with duals in linear algebra. We typically define a half-space using $h(x)\le\mbox{constant}$ where $h$ is linear in $x$. So $h$ is an example of what is meant by a dual vector in linear algebra. When we work from the dual viewpoint we're looking at convex objects from the point of view of the dual vectors defining the hyperplanes that touch it.

While I'm here I may as well add that this is similar to the Hilbert transform in signal processing. The Hilbert transform involves throwing away "half" of the Fourier transform analogously to throwing away "half" of the Legendre transform. There's a deep connection.

Comments:

  • Qs. 1) Not sure why you chose $y=\lambda x$ and not $y=\lambda x+\alpha$, because a subtangent need not pass through the origin. 2) Why maximize $\lambda x-f(x)$ and not minimize? – CKM, Jul 10, 2021
  • If you maximize you get another convex function which can be nice when you're doing further work. – Dan Piponi, Jan 20

---

For #4 and #5, see "The concept of duality in convex analysis, and the characterization of the Legendre transform" by Artstein-Avidan and Milman.

---

After reading and learning from this excellent discussion, I summarized an explanation based on my own background. If I've made any mistakes, please correct me.

I understood the dual problem, and the intuition behind it, through the following steps.

  1. First of all, the ultimate goal is to find a good lower bound for the original minimization problem.

  2. The Lagrangian can be viewed as adding a perturbation to the original objective. Minimizing it over $x$ gives the dual function $g(\lambda)=\inf_x \mathcal{L}(x,\lambda)$, which for every $\lambda \geq 0$ is a lower bound on the optimal value of the original problem.

  3. The dual function $g(\lambda)=\inf_x \mathcal{L}(x,\lambda)$ is fortunately easy to optimize, because it is a concave function of $\lambda$ (a pointwise infimum of functions affine in $\lambda$).

  4. For these reasons, we solve the dual problem to obtain a good lower bound on the original problem.

  5. Additionally, if the constraints in the original problem are linear, the dual problem can be written in terms of the conjugate function, which gives a clearer geometric intuition.


Here are some examples that helped me understand these points.

  1. The Lagrangian can be considered as adding perturbations. Consider an example: $$ \begin{aligned} \min \quad & x^2 \\ \text{s.t.} \quad & x-1 \le 0 \end{aligned} $$

[figure: the Lagrangian $\mathcal{L}(x,\lambda)$ for several values of $\lambda$, together with the dual function $g(\lambda)$]

From the figure above, it is clear that $g(\lambda)=\inf_x \mathcal{L}(x,\lambda)$ is a lower bound on the optimal value of the original problem.

(Note: $f(x)$ and $g(\lambda)$ have different independent variables; strictly speaking, the red curve should be plotted in a separate figure.)

  2. If the constraints are linear, there is a connection between the dual problem and the conjugate function.

The conjugate of a convex function can be thought of in terms of supporting hyperplanes; it is another view of the original function.

If you are not familiar with the conjugate function, I wrote a blog post giving a step-by-step explanation, or you can read the answer from Dan Piponi above.

When the constraints are linear, our optimization problem looks like $$ \min_x f(x) \quad \text{s.t.} \quad Ax \preceq b. $$

The Lagrange dual function is $$ \begin{aligned} g(\lambda) & = \inf_x \left(f(x)+\lambda^T(Ax-b)\right) && \text{with } \lambda \succeq 0\\ & = -\sup_x \left((-A^T\lambda)^T x - f(x)\right) - \lambda^T b && \\ & = -f^{\star}(-A^T\lambda) - \lambda^T b && \text{by the definition of the conjugate function.} \end{aligned} $$

Intuitively, $-\lambda^T b$ is just a linear term, so if we focus on the $-f^{\star}(-A^T\lambda)$ term, maximizing $g(\lambda)$ is closely tied to minimizing $f^{\star}(-A^T\lambda)$.

What is the minimum of a conjugate function? Again, let's look at an example.

If the optimization problem is $$ \min x^2 \quad \text{s.t.} \quad x \leq 1, $$

we have (with $A=1$, $b=1$) $$ g(\lambda) = -f^{\star}(-A^T\lambda) - \lambda^T b = -f^{\star}(-\lambda) - \lambda, $$

where $$ f^{\star}(\lambda)=\sup_x(\lambda x - f(x)) = \frac{1}{4} \lambda^2. $$

We want to solve $$ \max_{\lambda} g(\lambda), \qquad \text{dom } g = \{\lambda \geq 0 : -\lambda \in \text{dom } f^{\star}\}. $$

Maximizing $-f^{\star}(-\lambda)$ is equivalent to minimizing $f^{\star}(-\lambda)$.

(There should be a figure here, but I don't have enough reputation to post more than two links; you can find the figures in my blog post.)

Again, from the figure it is clear that the conjugate term $f^{\star}(-\lambda)$ is minimized at $\lambda = 0$, which here gives $g(0)=0$, the optimal value of the original problem.
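As a quick numerical check of this example (a sketch I'm adding; plain numpy): with $f(x) = x^2$, $A = 1$, $b = 1$, the dual function is $g(\lambda) = -\lambda^2/4 - \lambda$, which over $\lambda \geq 0$ peaks at $\lambda = 0$ with value $0$, the optimal value of $\min x^2$ s.t. $x \leq 1$.

```python
import numpy as np

# Cross-check f*(y) = y^2/4 numerically from its definition sup_x (y*x - f(x)).
x = np.linspace(-10, 10, 4001)
y = -2.0
print((y * x - x ** 2).max(), y ** 2 / 4)   # -> 1.0  1.0

# The dual function g(lambda) = -f*(-lambda) - lambda, maximized over lambda >= 0.
lam = np.linspace(0, 5, 501)
g = -(lam ** 2) / 4 - lam
print(lam[np.argmax(g)], g.max())           # -> 0.0  0.0, the primal optimal value
```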

---

A lot of great explanations already. The easiest way to get in-depth knowledge, I think, is to study chapter 5 of this book, Convex Optimization by Stephen Boyd and Lieven Vandenberghe.

You will realize that the dual problem is nothing more than replacing the hard constraints (of the form $f_i(x) \le 0$ and $h_i(x) = 0$) by soft constraints: linear penalties $\lambda_i f_i(x)$ and $\nu_i h_i(x)$ on the constraint values.

In the book, he uses the variable $u$, and he formalizes these hard constraints with indicator functions $$I_-{}(u) ,$$ defined to be zero for $u\le0$ and $\infty$ for $u>0$, and $$I_0{}(u) ,$$ defined to be $\infty$ everywhere except at $u=0$.

Note that the linear functions $\lambda_i u$ (for $\lambda_i \ge 0$) and $\nu_i u$ are always $\le$ the respective indicator functions. Therefore, quoting the book:

Clearly the approximation of the indicator function $I_-{}(u)$ with a linear function $\lambda_i u$ is rather poor. But the linear function is at least an underestimator of the indicator function. Since $\lambda_i u \le I_-{}(u)$ and $\nu_i u \le I_0 (u)$ for all $u$, we see immediately that the dual function yields a lower bound on the optimal value of the original problem.

This means that for $\lambda_i \ge 0$ and every feasible point $\tilde{x}$ (i.e. one that satisfies the constraints $f_i(\tilde{x}) \le 0$ and $h_i(\tilde{x}) = 0$), the resulting values of $$L(\tilde{x},\lambda,\nu) = f_0(\tilde{x}) + \sum_{i} \lambda_i f_i(\tilde{x}) + \sum_{i} \nu_i h_i(\tilde{x})$$ always satisfy $$L(\tilde{x},\lambda,\nu)\le f_0(\tilde{x}),$$ where the minimization of $f_0(x)$ was the original optimization problem.


Example: linear programming

For linear programming (LP), the function to be minimized is $$f_0(x) = c^T x$$ and the constraints are of the form $$\left\{\begin{aligned}A x &= b \\ x &\ge 0\end{aligned}\right. ,$$

which leads to the Lagrangian $$L(x,\lambda,\nu) = c^T x + \nu^T (Ax-b) - \lambda^T x .$$ (Note the minus sign in front of $\lambda$: this is because we chose the conditions of type $f_i(x) \le 0$ the other way around.)

The terms involving $x$ are all linear, so the infimum over $x$ is $-\infty$ unless their coefficients vanish: $$c^T + \nu^T A - \lambda^T = 0,$$ in which case the dual function $g(\lambda, \nu)$ becomes $$g(\lambda, \nu) = \inf_{x} L(x,\lambda,\nu) = -\nu^T b.$$

This is the dual problem: you have to maximize $g(\lambda, \nu)$, the lower bound for the original minimization problem, subject to the equation above, which can be transposed to $$c + A^T \nu - \lambda = 0.$$

Finally, note that in order for this to be a true lower bound, we need $\lambda \ge 0$, as discussed previously: You need to take the correct half-plane for the soft constraints to be always softer than the hard constraints.

Combining this with the equation above then leads to the equations you will see in any textbook on LP:

Maximize in $\nu$ $$-b^T \nu$$ subject to $$c+A^T\nu \ge 0.$$
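Here is a small numerical check of this primal/dual pair (my own example data, using scipy's `linprog` for both solves; with the sign convention above, the dual is: maximize $-b^T\nu$ subject to $c + A^T\nu \ge 0$, with $\nu$ free):

```python
import numpy as np
from scipy.optimize import linprog

# Arbitrary small instance.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

# Primal: minimize c^T x  subject to  A x = b, x >= 0.
primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 2)

# Dual: maximize -b^T nu  subject to  c + A^T nu >= 0, nu free.
# linprog minimizes, so minimize b^T nu subject to -A^T nu <= c instead.
dual = linprog(b, A_ub=-A.T, b_ub=c, bounds=[(None, None)])

print(primal.fun, -dual.fun)   # both 1.0: the optimal values coincide
```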

---

There are a lot of great answers here, but they seem to require some understanding of the problem already, so below I write a very quick and basic deduction of how one might arrive at duality: something that indicates why there could be some sort of duality and convinces you that it is true. I wish I had seen something like this as the very first thing before trying to understand the topic, perhaps as some sort of introduction to the problem, but with the huge amount of great in-depth answers and texts online, I didn't stumble upon such an explanation.


The summary is this: let's start with the problem of minimizing a function with an equality constraint. Assuming $f$ is convex, the problem of

$\begin{align*} \text{minimize} &\quad f(x) \\ \text{such that} & \quad g(x) =0 \end{align*}$

is equivalent to

$\text{min}_x \, \text{max}_\lambda \,\,f(x) + \lambda g(x)$

and this is equivalent to

$ \text{max}_\lambda \, \text{min}_x \,\,f(x) + \lambda g(x) = \text{max}_\lambda\, h(\lambda)$.


In more depth:

How might one come to think about this whole topic and approach? First, we try to convert the problem of minimizing a function with an equality constraint to simply a problem of minimizing a function. This is a natural goal that one might try to achieve, as it is conceptually easier to think of the problem as minimization without restrictions, even if it might be across more dimensions etc.

Using Lagrange multipliers, we know that any local extremum of $f$ under the constraint $g$ is necessarily a stationary point of the Lagrangian $L(x,\lambda)=f-\lambda g$. This means, for example, that if $f$ with constraint $g$ achieves a minimum at $x^*$, then there exists a $\lambda^*$ such that $\nabla L(x^*,\lambda^*)=0$.

This might bring one to think about whether we could do something with the Lagrangian - maybe we could minimize it with respect to $\lambda$ and $x$. But that won't help us find the solution, since clearly if we fix some $x$ where $g(x) \neq 0$, the minimum with respect to $\lambda$ is $-\infty$. But based on this and some pondering, it might occur to us that we could pose the problem as $\text{min}_x \, \text{max}_\lambda \,\,f(x) + \lambda g(x)$, and we check it and see that yes, this does in fact work: $\text{max}_\lambda \,\,f(x) + \lambda g(x) = \infty$ for $x$ such that $g(x)\neq 0$, and $\text{max}_\lambda \,\,f(x) + \lambda g(x) = f(x)$ for $x$ such that $g(x) = 0$.

The final equality $\text{min}_x \, \text{max}_\lambda \,\,f(x) + \lambda g(x)=\text{max}_\lambda \, \text{min}_x \,\,f(x) + \lambda g(x)$ is "just" an application of the min-max inequality. (Strictly speaking, the min-max inequality only gives "$\geq$" for free; the equality is strong duality, which needs additional hypotheses such as convexity plus a constraint qualification.)
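To see the first equivalence numerically (a toy check I'm adding, with $f(x) = x^2$ and $g(x) = x - 1$): on a grid, $\text{max}_\lambda\, f(x) + \lambda g(x)$ blows up wherever $g(x) \neq 0$, so the outer minimum is forced onto the constraint set; and in this example, swapping min and max gives the same value.

```python
import numpy as np

# f(x) = x^2 with equality constraint g(x) = x - 1 = 0; constrained optimum f(1) = 1.
x = np.linspace(-5, 5, 1001)
lam = np.linspace(-100, 100, 2001)       # a large but finite range standing in for R
X, L = np.meshgrid(x, lam)
phi = X ** 2 + L * (X - 1)               # f(x) + lambda * g(x)

inner_max = phi.max(axis=0)              # max over lambda, for each x
print(x[np.argmin(inner_max)], inner_max.min())   # -> 1.0 1.0  (min_x max_lambda)

inner_min = phi.min(axis=1)              # min over x, for each lambda: h(lambda)
print(inner_min.max())                   # -> 1.0  (max_lambda min_x), same value
```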


This whole approach can be used for an inequality constraint too, with the difference that we will not maximize over $\lambda \in \mathbb{R}$, but only over $\lambda \geq 0$ (or $\lambda \leq 0$, depending on whether the function is $f-\lambda g$ or $f + \lambda g$).

Lastly, while $h$ in $\text{max}_\lambda\, h(\lambda)$ could be a "complicated"/hard-to-calculate function, under certain conditions on $f$ and the constraint $g$ (here we are already assuming that $g$ is an inequality constraint), we might realize that, similarly to how we got rid of the constraint $g$, we can go the other way and formulate the problem $ \text{max}_\lambda \, \text{min}_x \,\,f(x) + \lambda g(x) = \text{max}_\lambda\, h(\lambda)$ as

$\begin{align*} \text{maximize} &\quad u(\lambda) \\ \text{such that} & \quad v(\lambda) \leq 0 \end{align*}$

where $u,v$ are reasonably complicated functions.


Obviously the min-max inequality is not trivial or intuitive, but I still feel like it's a valid step, although an ingenious one. I'm not quite sure how duality was discovered, since I'm still a beginner - it seems from the little research I've done that it might be inspired by a similar duality observation between the Lagrangian and Hamiltonian approaches to classical mechanics, where duality was shown via the Legendre transform, so it definitely seems that showing duality in optimization via the Legendre transform is the classical way to show it and probably how it was discovered first too. But at this point I'm talking about things I definitely do not understand. Still, I like the simplicity of the approach I wrote in this post. (And maybe it could be used to show the duality between Lagrangian and Hamiltonian mechanics too?)


I've done most of the observations in this post playing around with the topic by myself, so maybe it's all wrong - please let me know if it is and make sure to downvote.

---

Here are some counter-examples to help you understand KKT conditions and strong duality. The answer is from my other post: https://math.stackexchange.com/a/4154563/273731

${\bf counter-example 1}$ If one drops the convexity condition on the objective function, then strong duality can fail even when the relative interior condition holds.

The counter-example is the same as the following one.

${\bf counter-example 2}$ For a non-convex problem where strong duality does not hold, primal-dual optimal pairs may not satisfy the KKT conditions.

Consider the optimization problem \begin{align} \operatorname{minimize} & \quad e^{-x_1x_2} \\ \text{subject to} & \quad x_1\le 0. \end{align} The domain for the problem is $D = \{ (x_1,x_2) : x_1, x_2 \ge 0 \}$. The problem is not convex, as one sees by calculating the Hessian of the objective. Clearly, any point with $x_1 = 0$, $x_2 \in\mathbb R_+$ is a primal optimal solution, with primal optimal value $1$.

The Lagrangian is $$ L(x_1,x_2,\lambda) = e^{-x_1x_2} + \lambda x_1. $$ The dual function is \begin{align} G(\lambda) &= \inf_{(x_1,x_2)\in D} L(x_1,x_2,\lambda) = \begin{cases} 0& \lambda\ge 0\\ -\infty& \lambda < 0 \end{cases} \end{align}

Thus, any $\lambda \geq 0$ is a dual optimal solution, with dual optimal value $0$; the duality gap is $1$, and strong duality fails. As for the KKT conditions (remember the domain is $D = \{ (x_1,x_2) : x_1, x_2 \ge 0 \}$):

\begin{align*} &\lambda-x_2e^{-x_1x_2}=0\\ &-x_1e^{-x_1x_2}=0\\ &x_1\le 0\\ &\lambda\ge 0\\ &\lambda x_1=0\\ \end{align*}

(The second equation, stationarity in $x_2$, holds automatically when $x_1=0$.) Pick any primal-dual optimal pair satisfying $x_1 = 0$, $x_2\ge 0$, $\lambda\ge0$, $\lambda\ne x_2$: the first equation fails, so the KKT conditions fail.
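To see concretely why $G(\lambda) = 0$ for $\lambda \geq 0$ (a small numeric sketch I'm adding): send $x_1 \to 0$ while $x_1 x_2 \to \infty$, so both terms of the Lagrangian vanish and the duality gap of $1$ appears.

```python
import numpy as np

# Lagrangian L(x1, x2, lam) = exp(-x1*x2) + lam*x1 on the domain x1, x2 >= 0.
lam = 1.0
for t in [1e2, 1e4, 1e6, 1e8]:
    x1, x2 = 1 / np.sqrt(t), t           # x1 -> 0 while x1*x2 = sqrt(t) -> infinity
    print(np.exp(-x1 * x2) + lam * x1)   # -> 0, so G(1) = 0 while the primal value is 1
```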

${\bf counter-example 3}$ For a non-convex problem, even if strong duality holds, solutions of the KKT conditions may fail to be primal-dual optimal solutions.

Consider the optimization problem on $\mathbb R$ \begin{align} \operatorname{minimize} & \quad x^3 \\ \text{subject to} & \quad -x^3-1\le 0. \end{align} The objective function is not convex (its second derivative $6x$ changes sign). Clearly, $x=-1$ is the unique primal optimal solution, with primal optimal value $-1$.

The Lagrangian is $$ L(x,\lambda) = x^3 - \lambda (x^3+1). $$

The dual function is \begin{align} G(\lambda) &= \inf_x L(x,\lambda) = \begin{cases} -1& \lambda=1\\ -\infty& \text{otherwise} \end{cases} \end{align}

Thus, $\lambda = 1$ is the dual optimal solution with dual optimal value $-1$; the duality gap is $0$, and strong duality holds. The KKT conditions are

\begin{align*} &3x^2(1-\lambda)=0\\ &-x^3-1\le 0\\ &\lambda\ge 0\\ &\lambda (-x^3-1)=0\\ \end{align*}

Solutions of the KKT conditions are $x=-1, \lambda=1$ and $x=0,\lambda=0$. Notice that $x=0,\lambda=0$ satisfies the KKT conditions but has nothing to do with the primal-dual optimal solutions.

The discussion indicates that for non-convex problems, the KKT conditions may be neither necessary nor sufficient for primal-dual optimality.

${\bf counter-example 4}$ For a convex problem, even if strong duality holds, the KKT conditions may have no solution, and thus there may be no Lagrange multipliers.

Consider the optimization problem on domain $\mathbb R$ \begin{align} \operatorname{minimize} & \quad x \\ \text{subject to} & \quad x^2\le 0. \end{align}

Obviously, the problem is convex, with unique primal optimal solution $x=0$ and optimal value $0$. The feasible set is $\{0\}$, so Slater's condition fails.

The Lagrangian is $$ L(x,\lambda) = x + \lambda x^2. $$

The dual function is \begin{align} G(\lambda) &= \inf_x L(x,\lambda) = \begin{cases} -\infty& \lambda\le 0\\ -\frac{1}{4\lambda} &\lambda >0 \end{cases} \end{align}

Thus, the dual optimal value is $0$ and the duality gap is $0$, so strong duality holds. However, the dual optimum is not attained: there is no dual optimal solution, because the value $0$ is approached only as $\lambda\rightarrow \infty$. As for the KKT conditions:

\begin{align*} &1+2\lambda x=0\\ &x^2\le 0\\ &\lambda\ge 0\\ &\lambda x^2=0\\ \end{align*} The constraint $x^2 \le 0$ forces $x=0$, which turns the stationarity condition $1+2\lambda x=0$ into $1=0$; so the KKT conditions have no solution.

This is a convex problem where the dual optimum is not attained and the KKT conditions have no solution, yet the primal problem is simple to solve.
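A short numeric check of this (my addition): the dual value $-1/(4\lambda)$ climbs toward $0$ as $\lambda \to \infty$ but never reaches it, so the dual optimum is not attained.

```python
import numpy as np

x = np.linspace(-1, 0, 100001)           # the minimizer -1/(2*lam) lies in here
for lam in [1.0, 10.0, 100.0, 1000.0]:
    G = (x + lam * x ** 2).min()         # dual function G(lam) by grid search
    print(lam, G, -1 / (4 * lam))        # G(lam) = -1/(4*lam) -> 0, never attained
```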

${\bf counter-example 5}$ For a differentiable convex problem, the KKT conditions may have no solution even when primal and dual optimal solutions both exist. In that case, strong duality fails.

Consider the optimization problem on domain $D:=\{(x,y):y>0\}$ \begin{align} \operatorname{minimize} & \quad e^{-x} \\ \text{subject to} & \quad \frac{x^2}{y}\le 0. \end{align}

The problem can be shown to be convex, with primal optimal solutions $x=0,\ y>0$ and optimal value $1$. The feasible set is $\{(0,y): y > 0\}$, so Slater's condition fails.

The Lagrangian is $$ L(x,y,\lambda) = e^{-x} + \lambda \frac{x^2}{y}. $$

After some careful calculation, the dual function is \begin{align} G(\lambda) &= \inf_{(x,y)\in D} L(x,y,\lambda) = \begin{cases} 0& \lambda\ge 0\\ -\infty &\lambda <0 \end{cases} \end{align}

Thus, the dual optimal value is $0$; the duality gap is $1$, and strong duality fails. We can pick a primal-dual optimal pair, say $x=0$, $y=1$, $\lambda=2$ (any $\lambda\ge 0$ is dual optimal). As for the KKT conditions:

\begin{align*} &-e^{-x}+\frac{2\lambda x}{y}=0\\ &\frac{x^2}{y}\le0\\ &\lambda\ge 0\\ &\lambda \frac{x^2}{y}=0\\ \end{align*} Since $y>0$, the constraint forces $x=0$, and then the stationarity condition reads $-1=0$; thus the KKT conditions have no solution.

This counter-example warns us that we have to be careful about the strong duality condition even for differentiable convex problems.

Comments:

  • This is a very helpful collection of examples, thank you. – littleO, May 29, 2021