This is a question about statistical learning theory. Consider a hypothesis class $\mathcal{F}$ parameterized by real vectors $w \in \mathbb{R}^p$, so each hypothesis $f_w$ corresponds to some $w \in \mathbb{R}^p$. Suppose I have a dataset $D \sim \mu$ and a (possibly randomized) learning algorithm $P_{\mathcal{F}|D}$. We can assume, if it's helpful, that $\mu$ concentrates isoperimetrically/has sub-Gaussian tails (i.e. for Lipschitz functions $g$, $g(X)$ and $E[g(X)]$ are close with high probability under $\mu$). I can combine them to get a distribution over my hypothesis class, $$P_{\mathcal{F}}(f) = E_{\mu}\big[P_{\mathcal{F}|D}(f)\big].$$ Then I can define the set $$\mathcal{F}_m = \{ f \in \mathcal{F} : P_{\mathcal{F}}(f) \ge 1 - \delta \}$$ for some small $\delta$, or equivalently, because of our parameterization, $$\mathcal{W}_m = \{ w \in \mathbb{R}^p : P_{\mathcal{F}}(f_w) \ge 1 - \delta \}.$$ This is the set of hypotheses that are probable under the distribution $P_{\mathcal{F}}$. I want to know how the covering number of this set can be related to the mutual information between the data $D \sim \mu$ and the output of the learning algorithm $P_{\mathcal{F}|D}$, i.e. $I(D; f) = E_{\mu}\big[D_{\mathrm{KL}}(P_{\mathcal{F}|D} \,\|\, P_{\mathcal{F}})\big]$. Are there any known relations between the covering number of $\mathcal{W}_m$ (and hence of $\mathcal{F}_m$) and this mutual information, or if not, what would be a good place to get started on deriving such a relation?
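
To make the objects concrete, here is the kind of toy experiment I have in mind (only a numerical sketch, not part of the formal question: the choice $\mu = N(0, I_p)$, the particular learner, and the use of an empirical high-mass region as a stand-in for $\mathcal{W}_m$ are all assumptions made up purely for illustration):

```python
# Toy sketch: estimate the epsilon-covering number of (a surrogate for) W_m.
# Assumptions for illustration only: mu = N(0, I_p), the "learning algorithm"
# returns the sample mean plus Gaussian noise, and W_m is approximated by the
# (1 - delta) fraction of sampled hypotheses closest to the empirical mean of P_F.
import numpy as np

rng = np.random.default_rng(0)
p, n, n_datasets, delta, eps = 2, 50, 2000, 0.05, 0.1

# Sample w ~ P_F by drawing a dataset D ~ mu^n and then running the (noisy) learner.
ws = np.empty((n_datasets, p))
for i in range(n_datasets):
    D = rng.normal(size=(n, p))                         # dataset D ~ mu^n
    ws[i] = D.mean(axis=0) + 0.05 * rng.normal(size=p)  # learner P_{F|D}

# Crude surrogate for W_m: a high-probability region of P_F (a density level set
# would be the more faithful object; this is just the simplest stand-in).
center = ws.mean(axis=0)
dists = np.linalg.norm(ws - center, axis=1)
W_m = ws[dists <= np.quantile(dists, 1 - delta)]

def greedy_cover(points, eps):
    """Greedy eps-net over the sampled points: a Monte Carlo proxy for N(eps, W_m)."""
    remaining = points
    centers = 0
    while len(remaining) > 0:
        c = remaining[0]
        centers += 1
        remaining = remaining[np.linalg.norm(remaining - c, axis=1) > eps]
    return centers

print("estimated N(eps, W_m) ~", greedy_cover(W_m, eps))
```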

I have two ideas so far:

  • Somehow applying Fano's inequality, treating the learning algorithm as a noisy channel whose variance is (roughly) purely due to sampling $D \sim \mu$, since that's the only bound I know of relating the size of a set to mutual information (see the sketch after this list).
  • Using concentration of $\mu$ to argue that everything in $\mathcal{F}_m$ is "close" to $E_{P_{\mathcal{F}}}[f]$, together with the known bound that $E_{P_{\mathcal{F}}}[f] < I(D; f)$. The notion of closeness comes from the Euclidean metric on $\mathbb{R}^p$ that parameterizes our functions. We can assume anything and everything is Lipschitz if this makes life any easier (see the change-of-measure note after this list).
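
For the first idea, the shape of the argument I have in mind (only a sketch, and it assumes a coupling I do not yet know how to construct in this setting) is the usual packing-plus-Fano route. Let $\{w_1, \dots, w_M\}$ be a maximal $\varepsilon$-packing of $\mathcal{W}_m$, so $M \ge N(\varepsilon, \mathcal{W}_m)$, the $\varepsilon$-covering number. Let $J$ be uniform on $\{1, \dots, M\}$, suppose the data can be coupled to $J$ so that the Markov chain $J \to D \to w$ holds with $w \sim P_{\mathcal{F}|D}$, and decode $\hat{J}$ as the index of the packing point nearest to $w$. Fano's inequality together with data processing then gives $$(1 - P_e)\log M - \log 2 \;\le\; I(J; \hat{J}) \;\le\; I(J; w) \;\le\; I(D; w),$$ hence $$\log N(\varepsilon, \mathcal{W}_m) \;\le\; \frac{I(D; w) + \log 2}{1 - P_e}.$$ The two gaps are (a) the decoding error $P_e$ has to be small, i.e. for each packing point there must be data under which the learner outputs something within $\varepsilon/2$ of it with high probability (presumably where the concentration assumption on $\mu$ would enter), and (b) $I(D; w)$ here is computed under the coupled distribution, whose marginal on $D$ need not be $\mu$.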

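For the second idea, a possibly useful known relation (stated here as I understand it, so it should be double-checked) is a change-of-measure bound rather than a bound on $E_{P_{\mathcal{F}}}[f]$ itself: for any measurable $A \subseteq \mathbb{R}^p$, applying the data-processing inequality for KL divergence to the indicator of $A$ gives $$P_{\mathcal{F}|D}(A)\,\log\frac{1}{P_{\mathcal{F}}(A)} \;\le\; D_{\mathrm{KL}}\!\left(P_{\mathcal{F}|D} \,\middle\|\, P_{\mathcal{F}}\right) + \log 2,$$ and taking expectation over $D \sim \mu$ turns the right-hand side into $I(D; f) + \log 2$. So if the trained hypothesis typically lands in a set of small mass under $P_{\mathcal{F}}$, the mutual information must be large; this is the sort of "small set versus information" statement I am hoping can be turned into a covering-number bound for $\mathcal{W}_m$.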
I'd even be happy to hear about relations between mutual information and other measures of complexity/dimension of the above set of functions. Cross-posted from MSE because I wasn't sure whether there is much known about this.
