
I am studying the feasibility of learning from the book *Learning from Data*. The author uses a bin analogy to discuss the feasibility of learning in a probabilistic sense. I have a few questions.


First, I will try to summarize what the author is trying to do:

Consider a bin containing red and green marbles, possibly infinitely many. The proportion of red and green marbles is such that if we pick a marble at random, the probability that it is red is $\mu$. We assume that $\mu$ is unknown to us. We pick a random sample of $N$ independent marbles (with replacement). Let $X_i$ be the indicator that the $i$th marble in the sample is red, i.e., $X_i=1$ if the $i$th marble in the sample is red and $X_i=0$ otherwise. Define $\nu:=\frac{1}{N}\sum_{i=1}^N X_i$. By Hoeffding's inequality, $$\mathbb{P}(|\nu-\mu|>\varepsilon)\le 2\mathrm{e}^{-2\varepsilon ^2N}.$$ In a learning problem, there is an unknown target function $f:\mathcal{X}\to\mathcal{Y}$ to be learned. The learning algorithm picks a hypothesis $g:\mathcal{X}\to\mathcal{Y}$ from a hypothesis set $\mathcal{H}$.
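As an aside, here is a quick Monte Carlo sanity check of the bin inequality that I put together; the values of $\mu$, $N$, and $\varepsilon$ are arbitrary choices of mine, not from the book:

```python
# Monte Carlo check of Hoeffding's inequality for the marble bin.
# mu, N, eps are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
mu, N, eps, trials = 0.4, 100, 0.1, 100_000

# Each row is one sample of N marbles; X_i = 1 means "red".
X = rng.random((trials, N)) < mu
nu = X.mean(axis=1)                      # sample frequency nu for each trial

empirical = np.mean(np.abs(nu - mu) > eps)
bound = 2 * np.exp(-2 * eps**2 * N)      # = 2e^{-2} ~ 0.271
print(f"empirical P(|nu - mu| > eps) = {empirical:.4f}")
print(f"Hoeffding bound              = {bound:.4f}")
```

The empirical frequency comes out well below the bound, as it must.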

We can connect the bin problem to the learning problem as follows. Take a single hypothesis $h\in\mathcal{H}$ and compare it to $f$ on each point $\mathbf{x}\in\mathcal{X}$. If $h(\mathbf{x})=f(\mathbf{x})$, color the point $\mathbf{x}$ green, else color it red. The color each point gets is unknown to us, since $f$ is unknown. However, if we pick $\mathbf{x}$ at random according to some probability distribution $P$ over $\mathcal{X}$, we know that $\mathbf{x}$ will be red with some probability $\mu$. Regardless of the value of $\mu$, the space $\mathcal{X}$ now behaves like a bin. The training examples play the role of a sample from the bin. If the inputs $\mathbf{x}_1,\cdots,\mathbf{x}_N$ in the data set $\mathcal{D}$ are picked independently according to $P$, we will get a random sample of red and green points, each point being red with probability $\mu$. The colors of $\mathbf{x}_1,\cdots,\mathbf{x}_N$ are known to us, since we know the values of $f$ on the data set $\mathcal{D}$. We define the in-sample error $E_{in}(h)$ to be the fraction of $\mathcal{D}$ where $f$ and $h$ disagree, and the out-of-sample error $E_{out}(h)$ to be the probability of disagreement on a point drawn from $P$. Thus $$E_{in}(h)=\dfrac{1}{N}\sum_{i=1}^N[[h(\mathbf{x}_i)\neq f(\mathbf{x}_i)]],\qquad E_{out}(h)=\mathbb{P}(h(\mathbf{x})\neq f(\mathbf{x})).$$

Using Hoeffding's inequality, we have $$\mathbb{P}(|E_{in}(h)-E_{out}(h)|>\varepsilon)\le 2\mathrm{e}^{-2\varepsilon ^2N}.\tag{1}$$
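To make the correspondence concrete, here is a toy computation of $E_{in}$ and $E_{out}$ for one fixed hypothesis; the specific $\mathcal{X}$, $f$, $h$, and the uniform choice of $P$ are my own illustrative assumptions:

```python
# Toy illustration of E_in vs E_out for one fixed hypothesis h.
import numpy as np

rng = np.random.default_rng(1)
X_space = np.arange(100)                    # finite input space X
f = lambda x: x >= 50                       # unknown target (known here for the demo)
h = lambda x: x >= 60                       # one fixed hypothesis

# E_out(h): probability of disagreement under the uniform P on X.
E_out = np.mean(f(X_space) != h(X_space))   # = 0.10 here

# E_in(h): disagreement frequency on a random sample of N points.
N = 25
sample = rng.choice(X_space, size=N)        # i.i.d. draws from P
E_in = np.mean(f(sample) != h(sample))

print(f"E_out(h) = {E_out:.2f}, E_in(h) = {E_in:.2f}")
# Hoeffding says E_in concentrates around E_out as N grows.
```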

Let us consider an entire hypothesis set $\mathcal{H}$ instead of just one hypothesis $h$, and assume that $\mathcal{H}$ has a finite number of hypotheses, $\mathcal{H}=\{h_1,\cdots,h_M\}$. We can construct a bin equivalent in this case by having $M$ bins. Each bin still represents the input space $\mathcal{X}$, with the red marbles in the $m$th bin corresponding to the points $\mathbf{x}\in\mathcal{X}$ where $h_m(\mathbf{x})\neq f(\mathbf{x})$. The probability of a red marble in the $m$th bin is $E_{out}(h_m)$, and the fraction of red marbles in the $m$th sample is $E_{in}(h_m)$, for $m=1,2,\cdots,M$. Hoeffding's inequality applies to each bin separately.

Hoeffding's inequality $(1)$ assumes that the hypothesis $h$ is fixed before you generate the data set, and the probability is with respect to random data sets $\mathcal{D}$. The learning algorithm picks a final hypothesis $g$ based on $\mathcal{D}$, i.e., after generating the data set. Thus we cannot simply plug in $g$ for $h$ in Hoeffding's inequality. A way to get around this is to bound $\mathbb{P}(|E_{in}(g)-E_{out}(g)|>\varepsilon)$ in a way that does not depend on which $g$ the learning algorithm picks.

$$\left\{|E_{in}(g)-E_{out}(g)|>\varepsilon\right\}\subseteq\bigcup_{i=1}^M \left\{|E_{in}(h_i)-E_{out}(h_i)|>\varepsilon\right\}$$ and hence $$\mathbb{P}\left(|E_{in}(g)-E_{out}(g)|>\varepsilon\right)\le\sum_{i=1}^M\mathbb{P}\left(|E_{in}(h_i)-E_{out}(h_i)|>\varepsilon\right).$$

Applying Hoeffding's inequality to the $M$ terms one at a time, we get $$\mathbb{P}\left(|E_{in}(g)-E_{out}(g)|>\varepsilon\right)\le 2M\mathrm{e}^{-2\varepsilon ^2N}.$$
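To see the factor $M$ at work, I wrote the following small simulation (my own construction, not from the book): each of $M$ hypotheses has $E_{out}=0.5$, their errors are independent across hypotheses (a worst case), and the "learning algorithm" picks the $g$ with smallest $E_{in}$:

```python
# Why the union bound is needed: M "coin" hypotheses, each with E_out = 0.5,
# with error indicators independent across hypotheses (my own assumption).
import numpy as np

rng = np.random.default_rng(2)
M, N, eps, trials = 50, 100, 0.1, 20_000
single_bound = 2 * np.exp(-2 * eps**2 * N)

fail_fixed, fail_posthoc = 0, 0
for _ in range(trials):
    # errors[m, i] = 1 iff h_m disagrees with f on sample point i
    errors = rng.random((M, N)) < 0.5
    E_in = errors.mean(axis=1)                   # E_in(h_m); E_out(h_m) = 0.5
    fail_fixed   += abs(E_in[0] - 0.5) > eps     # h_1 fixed a priori
    fail_posthoc += abs(E_in.min() - 0.5) > eps  # g chosen after seeing the data

print(f"fixed h_1 : {fail_fixed/trials:.3f}  (single bound {single_bound:.3f})")
print(f"chosen g  : {fail_posthoc/trials:.3f}  (union bound {M*single_bound:.3f})")
# The data-dependent choice g violates the single-hypothesis bound
# but respects the union bound 2M e^{-2 eps^2 N}.
```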


I have the following questions.

  1. Why does Hoeffding's inequality require that $h$ be fixed before generating the data set $\mathcal{D}$? Is it because the $X_i$'s, the indicators of the $i$th point of $\mathcal{D}$ being red, are no longer independent after generating the data set, while independence is an assumption required for Hoeffding's inequality? For instance, after generating $\mathcal{D}$, knowing $X_i$ for some $1\le i\le N$ would give information about the other $X_j$'s. Is this correct?
  2. In the last step, for each term in the summation $\sum_{i=1}^M\mathbb{P}\left(|E_{in}(h_i)-E_{out}(h_i)|>\varepsilon\right)$, the author states that Hoeffding's inequality applies to each bin separately, so $\mathbb{P}\left(|E_{in}(h_i)-E_{out}(h_i)|>\varepsilon\right)\le 2\mathrm{e}^{-2\varepsilon ^2N}$ for all $1\le i\le M$. How is this possible? According to the previous question, the hypothesis $h$ must be fixed before generating the data set, but here we have a single data set $\mathcal{D}$ and we are afterwards considering the hypotheses $h_1,h_2,\cdots,h_M$. How is applying Hoeffding's inequality to each term in the summation justified when the data set is generated beforehand, i.e., before choosing a hypothesis?
  3. If we can apply Hoeffding's inequality to each term in the summation separately, why don't we say that $g$ is one of the hypotheses $h_1,h_2,\cdots,h_M$ and hence $\mathbb{P}\left(|E_{in}(g)-E_{out}(g)|>\varepsilon\right)\le 2\mathrm{e}^{-2\varepsilon ^2N}$?

  4. Is the probability distribution $P$ on $\mathcal{X}$ independent of the hypothesis $h$? That is, is $P$ chosen without regard to the colors of the points in $\mathcal{X}$, which in turn are dictated by $h$ (together with $f$)?

  • I found the description difficult to understand until I added the following assumptions that I think are correct for this situation: We have an unknown function $f:\mathcal{X}\rightarrow\mathbb{R}$, but we are sure it is one of $M$ possible functions $h_1, \dots, h_M$. At every step $i \in \{1, 2, 3, \dots, N\}$ we independently choose an argument $X_i \in \mathcal{X}$, uniformly over all (presumably finite) elements of $\mathcal{X}$. The output is the actual function value $f(X_i)$, with no noise.
    – Michael
    Commented Jan 24, 2017 at 3:58

1 Answer


Intuition needed for this question

Some intuition on the difference between a-priori and a-posteriori: Let $\{Y_1, ..., Y_4\}$ be i.i.d. uniform random variables over $[0,1]$. So for each $m \in \{1, ..., 4\}$ we have $$P[Y_m>15/16]=1/16 = 0.0625$$

A-priori:

Let's a-priori pick an index $K \in \{1, 2, 3, 4\}$, independent of the $Y_i$ variables and before observing these variables. We can use any mass function $P[K=m]$ for $m \in \{1, 2, 3, 4\}$. What is $P[Y_K>15/16]$? It is still $1/16$ (regardless of the mass function we use) because, by the law of total probability: \begin{align} P[Y_K>15/16] &= \sum_{m=1}^4P[Y_K>15/16|K=m]P[K=m]\\ &=\sum_{m=1}^4\underbrace{P[Y_m>15/16|K=m]}_{P[Y_m>15/16]}P[K=m] \\ &=\sum_{m=1}^4 (1/16)P[K=m]\\ &=1/16 \end{align} where the equality $P[Y_m>15/16|K=m]=P[Y_m>15/16]$ holds because $Y_m$ is independent of $K$.
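A quick numerical check of this (a sketch of mine; the mass function on $K$ is an arbitrary example):

```python
# An a-priori (data-independent) random index K leaves the
# probability at 1/16, for any mass function on K.
import numpy as np

rng = np.random.default_rng(6)
trials = 1_000_000
Y = rng.random((trials, 4))                             # Y_1..Y_4 i.i.d. Uniform[0,1]
K = rng.choice(4, size=trials, p=[0.1, 0.2, 0.3, 0.4])  # arbitrary mass function
print(np.mean(Y[np.arange(trials), K] > 15/16))         # ~ 1/16 = 0.0625
```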

A-posteriori:

Now let's a-posteriori pick the index $K$ associated with the largest $Y_m$ value, so $Y_K = \max[Y_1, Y_2, Y_3, Y_4]$. This choice of index $K$ depends on the observed data. Then: \begin{align} P[Y_K>15/16] &= 1-P[Y_1\leq 15/16, Y_2\leq 15/16, Y_3\leq 15/16,Y_4\leq 15/16]\\ &=1-(15/16)^4 \\ &\approx 0.2275 \end{align} and so this probability is much larger than 1/16, even though $Y_K$ is just one of the $Y_m$ values and we know $P[Y_m>15/16]=1/16$ for all $m \in \{1, ..., 4\}$.

On the other hand, we know by the union bound that $$\{Y_K>15/16\} \subseteq \cup_{m=1}^4 \{Y_m>15/16\} \implies P[Y_K>15/16]\leq \sum_{m=1}^4P[Y_m>15/16]=1/4 $$ and indeed $0.2275 \leq 1/4$.
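And the corresponding check for the a-posteriori pick (again my own sketch):

```python
# The a-posteriori pick K = argmax changes the probability.
import numpy as np

rng = np.random.default_rng(7)
Y = rng.random((1_000_000, 4))
print(np.mean(Y.max(axis=1) > 15/16))   # ~ 1 - (15/16)^4 ≈ 0.2275, below 1/4
```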


Your specific setup

As in my above comment, I believe we need the following extra assumptions: We have a finite set $\mathcal{X}$ and an unknown function $f:\mathcal{X}\rightarrow\mathbb{R}$. We are certain that $f$ is one of the $M$ functions in the known set $\{h_1, ..., h_M\}$. We have the ability to exactly evaluate the function $f$ one step at a time. However, the set $\mathcal{X}$ is too large, so we want a probabilistic estimate. At every time step $i$ we independently choose $X_i \in \mathcal{X}$, uniformly over all points in $\mathcal{X}$. We then observe the value of $f(X_i)$ (with no noise).

So for a given function $h_m$ we define $\phi_m$ by: $$ \phi_m = P[f(X_i) \neq h_m(X_i)] = \frac{\mbox{number of points $x \in \mathcal{X}$ such that $f(x)\neq h_m(x)$}}{|\mathcal{X}|}$$ For each $m \in \{1, ..., M\}$ define the sequence of indicator functions $\{I^{(m)}_i\}_{i=1}^{\infty}$ by $$ I^{(m)}_i = \left\{ \begin{array}{ll} 1 &\mbox{ if $h_m(X_i) \neq f(X_i)$} \\ 0 & \mbox{ otherwise} \end{array} \right.$$ For any fixed $m \in \{1, ..., M\}$ the $\{I^{(m)}_i\}_{i=1}^{\infty}$ indicators are i.i.d. so we can apply Hoeffding. Note that for each individual $m$, Hoeffding is a bound on the following a-priori probability: $$P\left[\left|\frac{1}{N}\sum_{i=1}^NI_i^{(m)} - \phi_m\right|>\epsilon\right] \leq 2e^{-2\epsilon^2 N} \quad (Eq. 1)$$ and it is nice that the bound on the right-hand-side does not depend on the index $m$.
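Here is a small sketch of this setup in code; the particular $\mathcal{X}$, the threshold candidates $h_m$, and which of them equals $f$ are illustrative choices of mine:

```python
# Sketch of the setup: finite X, target f equal to one of M candidates.
import numpy as np

rng = np.random.default_rng(4)
X_space = np.arange(1000)                   # the finite input space X
M = 10
# Candidate functions h_m: threshold classifiers at different cut points.
thresholds = np.linspace(100, 900, M)
h = [lambda x, t=t: x >= t for t in thresholds]
f = h[3]                                    # f is secretly one of the candidates

# phi_m = P[f(X) != h_m(X)] under the uniform distribution on X.
phi = np.array([np.mean(f(X_space) != hm(X_space)) for hm in h])

# One sample of N i.i.d. points and the indicators I_i^{(m)}.
N = 200
xs = rng.choice(X_space, size=N)
I = np.array([hm(xs) != f(xs) for hm in h]) # shape (M, N)
nu = I.mean(axis=1)                         # empirical disagreement rate per m

for m in range(M):
    print(f"m={m}: phi_m={phi[m]:.3f}, sample rate={nu[m]:.3f}")
# For each fixed m, repeating this over fresh samples, the frequency of
# |nu_m - phi_m| > eps obeys the Hoeffding bound in (Eq. 1).
```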


Your specific questions

1) Your first question asks why Hoeffding requires a fixed function $h$ rather than a random function $H$. It is because Hoeffding applies to i.i.d. random variables. If we fix an index $m \in \{1, ..., M\}$ then the indicators $\{I^{(m)}_1, I^{(m)}_2, I^{(m)}_3, ...\}$ are i.i.d. over the indices $i \in \{1, 2, 3, ...\}$. If we have a random index $K$ then the indicators $\{I^{(K)}_1, I^{(K)}_2, I^{(K)}_3, ...\}$ are not necessarily independent because they share a common random index $K$.

2-4) Your remaining questions boil down to the difference between a-priori probability and a-posteriori probability. The Hoeffding bounds in (Eq. 1) are a-priori bounds that hold for all $m \in \{1, ..., M\}$. They are bounds on the probability the data behaves in a certain way. That probability (and its bound) is determined without observing the actual data outcome (in the same way that the probability of a fair coin flip being heads is 1/2, and this probability is determined without looking at the outcome of the flip).

If we a-priori pick a random index $K \in \{1, ..., M\}$ (before observing the data and independent of the data), then we do not need the union bound: \begin{align} P\left[\left|\frac{1}{N}\sum_{i=1}^NI_i^{(K)} - \phi_K\right|>\epsilon\right] &= \sum_{m=1}^M P\left[\left|\frac{1}{N}\sum_{i=1}^NI_i^{(K)} - \phi_K\right|>\epsilon | K=m\right]P[K=m] \\ &= \sum_{m=1}^M P\left[\left|\frac{1}{N}\sum_{i=1}^NI_i^{(m)} - \phi_m\right|>\epsilon | K=m \right]P[K=m] \\ &\overset{(a)}{=} \sum_{m=1}^M P\left[\left|\frac{1}{N}\sum_{i=1}^NI_i^{(m)} - \phi_m\right|>\epsilon \right]P[K=m] \\ &\leq \sum_{m=1}^M [2e^{-2\epsilon^2 N}]P[K=m]\\ &= 2e^{-2\epsilon^2 N} \end{align} where equality (a) holds because the random index $K$ is independent of the data $\{I^{(m)}_i\}_{i=1}^{\infty}$. So, if we were to a-priori pick a guess function $g$, independent of the data, by just picking a random index, the bound would indeed be $2e^{-2\epsilon^2 N}$ rather than $2M e^{-2\epsilon^2 N}$.

However, if we observe the results of the data and then a-posteriori pick a random index $K \in \{1, ..., M\}$, the way we choose might lead us to pick an index that exhibits "atypical" a-posteriori sample paths. So the equality (a) in the above chain of equalities does not necessarily hold for such picks. We need to be more careful by using the union bound.
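The contrast can be simulated directly; below is a sketch of mine, under the worst-case assumption that every $\phi_m = 1/2$ and the indicator sequences are independent across $m$:

```python
# Contrast the two ways of picking K: a-priori (independent of the data)
# versus a-posteriori (the worst index after seeing the data).
import numpy as np

rng = np.random.default_rng(5)
M, N, eps, trials = 50, 100, 0.1, 20_000
phi = 0.5                                   # phi_m = 1/2 for every m (assumption)

fail_apriori, fail_apost = 0, 0
for _ in range(trials):
    I = rng.random((M, N)) < phi            # I[m, i] plays the role of I_i^{(m)}
    dev = np.abs(I.mean(axis=1) - phi)      # |sample rate - phi_m| per m
    K = rng.integers(M)                     # a-priori: independent of the data
    fail_apriori += dev[K] > eps
    fail_apost   += dev.max() > eps         # a-posteriori: worst index

print(f"a-priori K      : {fail_apriori/trials:.3f} (bound {2*np.exp(-2*eps**2*N):.3f})")
print(f"a-posteriori K  : {fail_apost/trials:.3f} (bound {2*M*np.exp(-2*eps**2*N):.3f})")
# The a-priori pick obeys the single-m Hoeffding bound; the a-posteriori
# pick exceeds it and needs the union bound with the factor M.
```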

  • Why can't we take the max. over all $m$ of $P\left[\left|\frac{1}{N}\sum_{i=1}^NI_i^{(m)} - \phi_m\right|>\epsilon | K=m \right]$ instead of summing over all $m$? – Commented Feb 6, 2019 at 16:55
  • @KavitaJuneja Your comment motivated me to fix a typo ("$m$" to "$K$") in the left-hand-side of the last chain of inequalities. But in your question "why can't we take the max...," who says we cannot? I assume you mean the point just before equation (a). Yet, taking the max does not seem to lead to any better conclusion, and summing over all $m$ seems already simple and direct. That is, going from (a) to the inequality after is already done by replacing each term in the sum by a common upper bound (equivalently, the max of those terms is bounded by the same upper bound).
    – Michael
    Commented Feb 7, 2019 at 18:03
  • Of course if $K$ is chosen in the set $\{1, ..., M\}$ with knowledge of the data then we cannot use equation (a) and we get a looser bound by the union bound: $$\left\{\left|\frac{1}{N}\sum_{i=1}^N I_i^{(K)} - \phi_K\right|>\epsilon\right\} \subseteq \cup_{m=1}^M\left\{\left|\frac{1}{N}\sum_{i=1}^NI_i^{(m)}-\phi_m\right|>\epsilon\right\}$$ and so $$ P\left[\left|\frac{1}{N}\sum_{i=1}^N I_i^{(K)} - \phi_K\right|>\epsilon\right] \leq \sum_{m=1}^MP\left[\left|\frac{1}{N}\sum_{i=1}^NI_i^{(m)}-\phi_m\right|>\epsilon\right] \leq 2Me^{-2\epsilon^2 N}$$
    – Michael
    Commented Feb 7, 2019 at 18:15
