This is a question about statistical learning theory. Consider a hypothesis class $\mathcal{F}$ parameterized by real vectors $w \in \mathbb{R}^p$, so each hypothesis $f_w$ corresponds to some $w \in \mathbb{R}^p$. Suppose I have a dataset $D \sim \mu$ and a (possibly randomized) learning algorithm $P_{\mathcal{F}|D}$. We can assume, if it's helpful, that $\mu$ concentrates isoperimetrically/has sub-Gaussian tails (i.e. for Lipschitz functions $g$, $g(X)$ and $E[g(X)]$ are close with high probability under $\mu$). I can combine them to get a distribution over my hypothesis class, $$P_{\mathcal{F}}(f) = E_{\mu}\big[P_{\mathcal{F}|D}(f)\big].$$ Then I can define the set $$\mathcal{F}_m = \{ f \in \mathcal{F} : P_{\mathcal{F}}(f) \ge 1 - \delta \}$$ for some small $\delta$, or equivalently, because of our parameterization, $$\mathcal{W}_m = \{ w \in \mathbb{R}^p : P_{\mathcal{F}}(f_w) \ge 1 - \delta \}.$$ This is the set of hypotheses that are probable under the distribution $P_{\mathcal{F}}$. I want to know how the covering number of this set can be related to the mutual information between the data $D \sim \mu$ and the output of the learning algorithm $P_{\mathcal{F}|D}$, i.e. $I(D; f) = E_{\mu}\big[D_{\mathrm{KL}}(P_{\mathcal{F}|D} \,\|\, P_{\mathcal{F}})\big]$. Are there any known relations between the covering number of $\mathcal{W}_m$ (and hence of $\mathcal{F}_m$) and this mutual information, or if not, what would be a good place to get started on deriving such a relation?
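
To make the objects concrete, here is the kind of toy experiment I have in mind (only a numerical sketch, not part of the formal question: the choice $\mu = N(0, I_p)$, the particular learner, and the use of an empirical high-mass region as a stand-in for $\mathcal{W}_m$ are all assumptions made up purely for illustration):

```python
# Toy sketch: estimate the epsilon-covering number of (a surrogate for) W_m.
# Assumptions for illustration only: mu = N(0, I_p), the "learning algorithm"
# returns the sample mean plus Gaussian noise, and W_m is approximated by the
# (1 - delta) fraction of sampled hypotheses closest to the empirical mean of P_F.
import numpy as np

rng = np.random.default_rng(0)
p, n, n_datasets, delta, eps = 2, 50, 2000, 0.05, 0.1

# Sample w ~ P_F by drawing a dataset D ~ mu^n and then running the (noisy) learner.
ws = np.empty((n_datasets, p))
for i in range(n_datasets):
    D = rng.normal(size=(n, p))                         # dataset D ~ mu^n
    ws[i] = D.mean(axis=0) + 0.05 * rng.normal(size=p)  # learner P_{F|D}

# Crude surrogate for W_m: a high-probability region of P_F (a density level set
# would be the more faithful object; this is just the simplest stand-in).
center = ws.mean(axis=0)
dists = np.linalg.norm(ws - center, axis=1)
W_m = ws[dists <= np.quantile(dists, 1 - delta)]

def greedy_cover(points, eps):
    """Greedy eps-net over the sampled points: a Monte Carlo proxy for N(eps, W_m)."""
    remaining = points
    centers = 0
    while len(remaining) > 0:
        c = remaining[0]
        centers += 1
        remaining = remaining[np.linalg.norm(remaining - c, axis=1) > eps]
    return centers

print("estimated N(eps, W_m) ~", greedy_cover(W_m, eps))
```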

I have two ideas so far:

  • Somehow applying Fano's inequality, treating the learning algorithm as a noisy channel whose variance is (roughly) purely due to sampling $D \sim \mu$, since that's the only bound I know of relating the size of a set to mutual information (see the sketch after this list).
  • Using concentration of $\mu$ to argue that everything in $\mathcal{F}_m$ is "close" to $E_{P_{\mathcal{F}}}[f]$, together with the known bound that $E_{P_{\mathcal{F}}}[f] < I(D; f)$. The notion of closeness comes from the Euclidean metric on $\mathbb{R}^p$ that parameterizes our functions. We can assume anything and everything is Lipschitz if this makes life any easier (see the change-of-measure note after this list).
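
For the first idea, the shape of the argument I have in mind (only a sketch, and it assumes a coupling I do not yet know how to construct in this setting) is the usual packing-plus-Fano route. Let $\{w_1, \dots, w_M\}$ be a maximal $\varepsilon$-packing of $\mathcal{W}_m$, so $M \ge N(\varepsilon, \mathcal{W}_m)$, the $\varepsilon$-covering number. Let $J$ be uniform on $\{1, \dots, M\}$, suppose the data can be coupled to $J$ so that the Markov chain $J \to D \to w$ holds with $w \sim P_{\mathcal{F}|D}$, and decode $\hat{J}$ as the index of the packing point nearest to $w$. Fano's inequality together with data processing then gives $$(1 - P_e)\log M - \log 2 \;\le\; I(J; \hat{J}) \;\le\; I(J; w) \;\le\; I(D; w),$$ hence $$\log N(\varepsilon, \mathcal{W}_m) \;\le\; \frac{I(D; w) + \log 2}{1 - P_e}.$$ The two gaps are (a) the decoding error $P_e$ has to be small, i.e. for each packing point there must be data under which the learner outputs something within $\varepsilon/2$ of it with high probability (presumably where the concentration assumption on $\mu$ would enter), and (b) $I(D; w)$ here is computed under the coupled distribution, whose marginal on $D$ need not be $\mu$.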

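For the second idea, a possibly useful known relation (stated here as I understand it, so it should be double-checked) is a change-of-measure bound rather than a bound on $E_{P_{\mathcal{F}}}[f]$ itself: for any measurable $A \subseteq \mathbb{R}^p$, applying the data-processing inequality for KL divergence to the indicator of $A$ gives $$P_{\mathcal{F}|D}(A)\,\log\frac{1}{P_{\mathcal{F}}(A)} \;\le\; D_{\mathrm{KL}}\!\left(P_{\mathcal{F}|D} \,\middle\|\, P_{\mathcal{F}}\right) + \log 2,$$ and taking expectation over $D \sim \mu$ turns the right-hand side into $I(D; f) + \log 2$. So if the trained hypothesis typically lands in a set of small mass under $P_{\mathcal{F}}$, the mutual information must be large; this is the sort of "small set versus information" statement I am hoping can be turned into a covering-number bound for $\mathcal{W}_m$.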
I'd even be happy to hear about relations between mutual information and other measures of complexity/dimension of the above set of functions. Cross-posted from MSE because I wasn't sure whether there is much known about this.
