
I'm looking at the following proof:

$$E_{ote}(L_a \mid X, f) = \sum_{h}\sum_{x \in \mathcal{X}-X} P(x)\,\mathbb{I}\bigl(h(x) \neq f(x)\bigr)\,P(h \mid X, L_a)$$

Where:

Note: I'm new to this, but I think I now understand all the variables below correctly. Mistakes are possible, though.

  • $f$ is an ideal function that gives perfect results with respect to the sample set
  • $\sum_{h}$ is a sum over the hypothesis space
  • $\sum_{x \in \mathcal{X}-X}$ is a sum over the entire sample space excluding the training data $X$
  • $P(x)$ is the probability distribution over the sample space
  • The indicator function $\mathbb{I}(h(x) \neq f(x))$ in the middle is 1 for any instance where the selected hypothesis disagrees with the ideal function
  • $P(h \mid X, L_a)$ is the probability of selecting hypothesis $h$ given the learning algorithm $L_a$ used on training data $X$

The author says that, for a learning problem with functions $\mathcal{X} \to \{0,1\}$, the function space would be $\{0,1\}^{|\mathcal{X}|}$. He furthermore says:

if $f$ is uniformly distributed, half of the predictions of $f$ on $X$ are not consistent with $h(X)$

I do not understand what is meant by this nor how he arrived at that conclusion. Why would it be the case that if $f$ is uniformly distributed that half of its predictions are not consistent with $h(X)$?

After showing this proof, he then goes on to say that the test error, $E_{ote}$, is independent of the chosen algorithm ($L_a$).

  • But to answer your question, the idea is that the set of $f$s can be organized into pairs that completely disagree with each other outside the training set. So on average each pair differs from $h$ on 1/2 of the points outside the training set.
    – Joe
    Commented Apr 26, 2021 at 0:32
  • For example, if $f_1$ predicts true on every point outside the training set, $f_2$ predicts false on every point outside the training set, and there are $n$ points outside the training set, then if $h$ disagrees with $f_1$ on $k$ points, $h$ must disagree with $f_2$ on $n-k$ points, and on average it disagrees with the pair on $n/2$ points.
    – Joe
    Commented Apr 26, 2021 at 0:38
  • I took this course: cs.rpi.edu/~magdon/courses/learn.php
    – Joe
    Commented Apr 27, 2021 at 11:12
  • The book is $28 here: amazon.com/gp/product/1600490069
    – Joe
    Commented Apr 27, 2021 at 11:12
  • The lectures are currently on YouTube; they can be found by searching “learning from data Malik”, or at this link: cs.rpi.edu/~magdon/courses/ONLINElearn.php
    – Joe
    Commented Apr 27, 2021 at 11:17

1 Answer


Thanks to @Joe for suggesting the resources to answer this. The two pictures below, from "Learning from Data - A Short Course" (which I would recommend), are what made this problem clear to me.

Deriving the 1/2

In the picture there are three points outside the training set, and $f_1, \dots, f_8$ are the eight possible labelings of those points; $g$ is the selected hypothesis. Let's say that $g$ matches $f_1$. Now we calculate how much disagreement there is between $g$ and each of the $f$s (the number of off-training-set points on which they differ):

  • $f_1$ = 0 (since this is the one we picked)
  • $f_2$ = 1
  • $f_3$ = 1
  • $f_4$ = 2
  • $f_5$ = 1
  • $f_6$ = 2
  • $f_7$ = 2
  • $f_8$ = 3

The average disagreement is then $(0+1+1+2+1+2+2+3)/8 = 1.5$. There are 3 points outside of the training set, so on average our hypothesis disagrees with the true function on $1.5/3 = 1/2$ of the points outside the training set. Now to my original question: why does it matter that $f$ is uniformly distributed? Notice that the only reason the above works is that every possible labeling of the points outside the training data appears exactly once and is equally likely, so the $f$s can be grouped into pairs that completely disagree with each other out there (as @Joe pointed out in the comments). If some labelings were missing or weighted more heavily than others, the math would no longer hold up.
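If it helps, here is a small Python sketch (my own, not from the book) that enumerates all $2^3 = 8$ labelings of the three off-training-set points and checks that the average disagreement with a fixed hypothesis $g$ comes out to $1.5 = n/2$:

```python
from itertools import product

# Toy check of the averaging argument (my own sketch, not from the book):
# n = 3 points lie outside the training set, and the candidate target
# functions f_1..f_8 are all 2^3 = 8 possible labelings of those points.
n = 3
all_f = list(product([0, 1], repeat=n))

# g is the selected hypothesis' labels on those 3 points (here it matches f_1).
g = (0, 0, 0)

# For each f, count on how many of the 3 points it disagrees with g.
disagreements = [sum(gi != fi for gi, fi in zip(g, f)) for f in all_f]

print(disagreements)                    # [0, 1, 1, 2, 1, 2, 2, 3]
print(sum(disagreements) / len(all_f))  # 1.5, i.e. n/2
```

Changing `g` to any other labeling leaves the average at $n/2$, which is exactly the pairing argument from the comments: for every $f$ that disagrees with $g$ on $k$ points there is a mirror-image $f$ that disagrees on $n-k$.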

Deriving the $2^n$

This is really less of a derivation and more just requires some logical thought about the nature of binary labelings. Look at the chart of the various $f$s mapped out. Notice again that, since no two $f$s give the same labeling outside the training data, there are exactly $2^n$ of them that agree with the training data, where $n$ is the number of points outside the training set. Think about it logically: if all the $f$s must be distinct and you only have 3 bits to play with, then there are only $2^3 = 8$ possibilities. It does not matter what the training data is; the points outside the training data only admit 8 labelings before they would start to repeat. This scales to any number of points: $n$ could be a billion and this would still be true.
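To tie this back to the claim in the question that $E_{ote}$ is independent of $L_a$: the following is my own rendering of the standard summation step (assuming the $E_{ote}$ sum at the top of the question and a uniform distribution over all $2^{|\mathcal{X}|}$ target functions). The key fact is that for any fixed $h$ and $x$, exactly half of the $f$s disagree with $h$ at $x$:

$$
\begin{aligned}
\sum_{f} E_{ote}(L_a \mid X, f)
  &= \sum_{f}\sum_{h}\sum_{x \in \mathcal{X}-X} P(x)\,\mathbb{I}\bigl(h(x)\neq f(x)\bigr)\,P(h \mid X, L_a) \\
  &= \sum_{x \in \mathcal{X}-X} P(x)\sum_{h} P(h \mid X, L_a)\sum_{f}\mathbb{I}\bigl(h(x)\neq f(x)\bigr) \\
  &= \sum_{x \in \mathcal{X}-X} P(x)\sum_{h} P(h \mid X, L_a)\cdot\tfrac{1}{2}\,2^{|\mathcal{X}|} \\
  &= 2^{|\mathcal{X}|-1}\sum_{x \in \mathcal{X}-X} P(x)\cdot\underbrace{\sum_{h} P(h \mid X, L_a)}_{=\,1}
   \;=\; 2^{|\mathcal{X}|-1}\sum_{x \in \mathcal{X}-X} P(x)
\end{aligned}
$$

Since $L_a$ has dropped out entirely, every learning algorithm has the same total (and therefore the same average) off-training-set error under the uniform prior on $f$. Counting only the $2^n$ functions consistent with the training data, as above, changes the constant but not this conclusion.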

[Images: page 1 and page 2 from "Learning from Data - A Short Course"]
