I am studying the feasibility of learning from the book Learning from Data. The author uses a bin analogy to discuss the feasibility of learning in a probabilistic sense. I have certain questions to ask.
First, I will try to summarize what the author is trying to do:
Consider a bin containing red and green marbles, possibly infinitely many. The proportion of red and green marbles is such that if we pick a marble at random, the probability that it is red is $\mu$. We assume that $\mu$ is unknown to us. We pick a random sample of $N$ independent marbles (with replacement). Let $X_i$ be the indicator that the $i$th marble in the sample is red, i.e., $X_i=1$ if the $i$th marble in the sample is red and $X_i=0$ otherwise. Define $\nu:=\frac{1}{N}\sum_{i=1}^N X_i$. By Hoeffding's inequality, $$\mathbb{P}(|\nu-\mu|>\varepsilon)\le 2\mathrm{e}^{-2\varepsilon ^2N}.$$ In a learning problem, there is an unknown target function $f:\mathcal{X}\to\mathcal{Y}$ to be learned. The learning algorithm picks a hypothesis $g:\mathcal{X}\to\mathcal{Y}$ from a hypothesis set $\mathcal{H}$. We can connect the bin problem to the learning problem as follows.
Take a single hypothesis $h\in\mathcal{H}$ and compare it to $f$ on each point $\mathbf{x}\in\mathcal{X}$. If $h(\mathbf{x})=f(\mathbf{x})$, color the point $\mathbf{x}$ green; otherwise, color it red. The color each point gets is unknown to us, since $f$ is unknown. However, if we pick $\mathbf{x}$ at random according to some probability distribution $P$ over $\mathcal{X}$, we know that $\mathbf{x}$ will be red with some probability $\mu$. Regardless of the value of $\mu$, the space $\mathcal{X}$ now behaves like a bin. The training examples play the role of a sample from the bin. If the inputs $\mathbf{x}_1,\cdots,\mathbf{x}_N$ in the data set $\mathcal{D}$ are picked independently according to $P$, we will get a random sample of red and green points, each point being red with probability $\mu$. The colors of the points $\mathbf{x}_1,\cdots,\mathbf{x}_N$ are known to us, since we know the values of $f$ on the data set $\mathcal{D}$. We define the in-sample error $E_{in}(h)$ to be the fraction of $\mathcal{D}$ where $f$ and $h$ disagree, and the out-of-sample error $E_{out}(h)$ analogously. Thus $$E_{in}(h)=\dfrac{1}{N}\sum_{i=1}^N[[h(\mathbf{x}_i)\neq f(\mathbf{x}_i)]],\qquad E_{out}(h)=\mathbb{P}(h(\mathbf{x})\neq f(\mathbf{x})).$$
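As a quick illustration of these definitions (my own toy example, not from the book), take $f$ and $h$ to be arbitrary threshold functions and compute $E_{in}(h)$ as the fraction of disagreements on a small sample:

```python
# Toy sketch: E_in(h) is the fraction of sample points where h and f disagree.
# f and h below are arbitrary stand-ins chosen for illustration only.
def f(x):  # the "unknown" target (known here only so we can color points)
    return 1 if x >= 0.5 else 0

def h(x):  # a candidate hypothesis
    return 1 if x >= 0.3 else 0

sample = [0.1, 0.35, 0.4, 0.6, 0.9]
E_in = sum(h(x) != f(x) for x in sample) / len(sample)
print(E_in)  # h and f disagree on 0.35 and 0.4, so E_in = 2/5 = 0.4
```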
Using Hoeffding's inequality, we have $$\mathbb{P}(|E_{in}(h)-E_{out}(h)|>\varepsilon)\le 2\mathrm{e}^{-2\varepsilon ^2N}.\tag{1}$$
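To make the bound $(1)$ concrete, here is a small simulation for a single fixed hypothesis, which behaves exactly like the marble bin; the values of $\mu$, $N$, $\varepsilon$, and the trial count are arbitrary choices of mine:

```python
import math
import random

# Sketch (mine, not from the book): estimate P(|nu - mu| > eps) for one
# fixed bin and compare it with the Hoeffding bound 2 e^{-2 eps^2 N}.
random.seed(0)
mu, N, eps, trials = 0.4, 100, 0.1, 10_000

violations = 0
for _ in range(trials):
    nu = sum(random.random() < mu for _ in range(N)) / N  # sample frequency
    violations += abs(nu - mu) > eps

empirical = violations / trials
bound = 2 * math.exp(-2 * eps**2 * N)
print(empirical, bound)  # the empirical violation rate sits below the bound
```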
Let us now consider an entire hypothesis set $\mathcal{H}$ instead of just one hypothesis $h$, and assume that $\mathcal{H}$ has a finite number of hypotheses, $\mathcal{H}=\{h_1,\cdots,h_M\}$. We can construct a bin equivalent in this case by having $M$ bins. Each bin still represents the input space $\mathcal{X}$, with the red marbles in the $m$th bin corresponding to the points $\mathbf{x}\in\mathcal{X}$ where $h_m(\mathbf{x})\neq f(\mathbf{x})$. The probability of red marbles in the $m$th bin is $E_{out}(h_m)$, and the fraction of red marbles in the $m$th sample is $E_{in}(h_m)$, for $m=1,2,\cdots,M$. Hoeffding's inequality applies to each bin separately.
Hoeffding's inequality $(1)$ assumes that the hypothesis $h$ is fixed before the data set is generated, and the probability is with respect to random data sets $\mathcal{D}$. The learning algorithm picks the final hypothesis $g$ based on $\mathcal{D}$, i.e., after generating the data set. Thus we cannot simply plug in $g$ for $h$ in Hoeffding's inequality. A way to get around this is to bound $\mathbb{P}(|E_{in}(g)-E_{out}(g)|>\varepsilon)$ in a way that does not depend on which $g$ the learning algorithm picks.
Since $g$ is always one of the hypotheses in $\mathcal{H}$, $$\left\{|E_{in}(g)-E_{out}(g)|>\varepsilon\right\}\subseteq\bigcup_{i=1}^M \left\{|E_{in}(h_i)-E_{out}(h_i)|>\varepsilon\right\}$$ and hence, by the union bound, $$\mathbb{P}\left(|E_{in}(g)-E_{out}(g)|>\varepsilon\right)\le\sum_{i=1}^M\mathbb{P}\left(|E_{in}(h_i)-E_{out}(h_i)|>\varepsilon\right).$$
Applying Hoeffding's inequality to each of the $M$ terms, we get $$\mathbb{P}\left(|E_{in}(g)-E_{out}(g)|>\varepsilon\right)\le 2M\mathrm{e}^{-2\varepsilon ^2N}.$$
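The gap between the single-hypothesis bound and the factor-$M$ bound can be seen in a simulation (my own sketch, with arbitrary parameters): give all $M$ bins the same $\mu$, and track the worst deviation over the $M$ bins, which upper-bounds the deviation of any data-dependent pick $g$:

```python
import math
import random

# Sketch (mine, not from the book): M independent bins with the same
# out-of-sample error mu. Track (a) the deviation of one fixed hypothesis
# and (b) the worst deviation over all M bins; the latter bounds the
# deviation of any hypothesis chosen after seeing the data.
random.seed(1)
mu, N, eps, M, trials = 0.5, 50, 0.15, 20, 5_000

single, worst = 0, 0
for _ in range(trials):
    devs = [abs(sum(random.random() < mu for _ in range(N)) / N - mu)
            for _ in range(M)]
    single += devs[0] > eps   # fixed h: Hoeffding applies directly
    worst += max(devs) > eps  # data-dependent pick: needs the union bound

p_single, p_worst = single / trials, worst / trials
bound_single = 2 * math.exp(-2 * eps**2 * N)
print(p_single, p_worst, bound_single, M * bound_single)
```

With these parameters $p_{\mathrm{worst}}$ typically exceeds $2\mathrm{e}^{-2\varepsilon^2 N}$, so the single-hypothesis bound genuinely fails for a hypothesis chosen after seeing the data, while the looser factor-$M$ bound still holds.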
I have the following questions.
- Why does Hoeffding's inequality require that $h$ be fixed before generating the data set $\mathcal{D}$? Is it because the $X_i$'s, the indicators of the $i$th point in $\mathcal{D}$ being red, are no longer independent after the data set is generated, and independence is an assumption required for Hoeffding's inequality? For instance, after generating $\mathcal{D}$, knowing $X_i$ for some $1\le i\le N$ would give information about the other $X_j$'s. Is this correct?
- In the last step, for each term in the summation $\sum_{i=1}^M\mathbb{P}\left(|E_{in}(h_i)-E_{out}(h_i)|>\varepsilon\right)$, the author states that Hoeffding's inequality applies to each bin separately, so that $\mathbb{P}\left(|E_{in}(h_i)-E_{out}(h_i)|>\varepsilon\right)\le 2\mathrm{e}^{-2\varepsilon ^2N}$ for all $1\le i\le M$. How is this possible? According to the previous question, the hypothesis $h$ must be fixed before generating the data set, but in this case we have a single data set $\mathcal{D}$ and we are later considering the hypotheses $h_1,h_2,\cdots,h_M$. How is the application of Hoeffding's inequality to each term in the summation justified, since the data set is generated beforehand, i.e., before choosing a hypothesis?
- If we can apply Hoeffding's inequality to each term in the summation separately, why don't we say that $g$ is one of the hypotheses $h_1,h_2,\cdots,h_M$ and hence $\mathbb{P}\left(|E_{in}(g)-E_{out}(g)|>\varepsilon\right)\le 2\mathrm{e}^{-2\varepsilon ^2N}$?
- Is the probability distribution $P$ on $\mathcal{X}$ independent of the hypothesis $h$? That is, is $P$ chosen without regard to the colors of the points in $\mathcal{X}$, which are in turn dictated by $h$ and $f$?