I have a question about a property of the support vectors of the SVM that is stated in subsection "12.2.1 Computing the Support Vector Classifier" of "The Elements of Statistical Learning". My question is very simple; but below I provide a summary of those pages, so that you have a clear understanding of the context:
The optimization problem of finding the decision hyperplane can be expressed as:

$$\min_{\beta,\beta_0}\; \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_{i=1}^N \xi_i \quad \text{subject to } \xi_i \ge 0,\;\; y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i \;\;\forall i. \tag{12.8}$$
The Lagrange (primal) function is:

$$L_P = \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i\bigl[y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)\bigr] - \sum_{i=1}^N \mu_i \xi_i. \tag{12.9}$$
Setting the respective derivatives to zero, we get:

$$\beta = \sum_{i=1}^N \alpha_i y_i x_i, \tag{12.10}$$
$$0 = \sum_{i=1}^N \alpha_i y_i, \tag{12.11}$$
$$\alpha_i = C - \mu_i, \quad \forall i, \tag{12.12}$$

as well as the positivity constraints

$$\alpha_i,\; \mu_i,\; \xi_i \ge 0 \quad \forall i. \tag{12.13}$$
The Karush–Kuhn–Tucker conditions include the constraints:

$$\alpha_i\bigl[y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)\bigr] = 0, \tag{12.14}$$
$$\mu_i \xi_i = 0, \tag{12.15}$$
$$y_i(x_i^T\beta + \beta_0) - (1 - \xi_i) \ge 0, \tag{12.16}$$

for $i = 1, \ldots, N$.
From (12.10) we see that the solution for $\beta$ has the form

$$\hat\beta = \sum_{i=1}^N \hat\alpha_i y_i x_i, \tag{12.17}$$
with nonzero coefficients $\hat\alpha_i$ only for those observations $i$ for which the constraints in (12.16) are exactly met (due to (12.14)). These observations are called the support vectors, since $\hat\beta$ is represented in terms of them alone. Among these support points, some will lie on the edge of the margin ($\hat\xi_i = 0$), and hence from (12.15) and (12.12) will be characterized by $0 < \hat\alpha_i < C$.
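To make the step I am questioning explicit, the intended chain of reasoning for a support point on the margin seems to be (my reading, not the book's wording):

$$\hat\xi_i = 0 \;\overset{?}{\Longrightarrow}\; \hat\mu_i > 0 \;\overset{(12.12)}{\Longrightarrow}\; \hat\alpha_i = C - \hat\mu_i < C.$$

The second implication is immediate; my doubt is about the first one.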
What I want to know is: how is $\hat\alpha_i < C$ deduced? Looking at $\alpha_i + \mu_i = C$ (a rearrangement of (12.12)), we need $\hat\mu_i > 0$ to conclude $\hat\alpha_i < C$, and that is not guaranteed in all cases: (12.15) only forces the product $\hat\mu_i \hat\xi_i$ to vanish, so there is no constraint preventing both $\hat\xi_i = 0$ and $\hat\mu_i = 0$ from holding at the same time.
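For concreteness, here is a small numerical sanity check of the distinction the book draws; the toy data, the solver, and the tolerance are my own choices (a sketch using scikit-learn's `SVC`, not anything from the book). It fits a linear soft-margin SVM, recovers $\hat\alpha_i$ from `dual_coef_`, computes the slacks $\hat\xi_i = \max(0, 1 - y_i f(x_i))$, and splits the support vectors into those with $\hat\alpha_i < C$ and those with $\hat\alpha_i = C$:

```python
# Sketch: inspect the dual variables of a linear soft-margin SVM.
# The toy data and the tolerance below are my own choices, not from the book.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)
C = 1.0
svm = SVC(kernel="linear", C=C).fit(X, y)

# For a binary problem, dual_coef_[0] holds y_i * alpha_i for the
# support vectors, so alpha_i = |dual_coef_[0]|.
alpha = np.abs(svm.dual_coef_[0])

# Slack variables: xi_i = max(0, 1 - y_i f(x_i)), with y_i in {-1, +1}
# encoded consistently with the sign convention of decision_function.
y_pm = np.where(y == svm.classes_[1], 1.0, -1.0)
xi = np.maximum(0.0, 1.0 - y_pm * svm.decision_function(X))[svm.support_]

on_margin = alpha < C - 1e-8  # support vectors with alpha_i strictly below C
print("xi_i of support vectors with alpha_i < C:", xi[on_margin])
print("xi_i of support vectors with alpha_i = C:", xi[~on_margin])
```

This does not resolve the deduction, of course; it just shows how the two kinds of support vectors can be inspected in practice.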