Show bagging helps under squared-error loss

This question is about Section 8.7, Bagging, of the Elements of Statistical Learning (ESL) textbook. Assume our training observations $\left(x_i, y_i\right), i=1, \ldots, N$ are independently drawn from a ...
When is my lightgbm going to find cut points in random variables that reduce entropy more than a naturally correlated variable with the target? [closed]

In machine learning we sometimes build models using hundreds of variables/features without knowing (at least at first) whether they are related to the target. Usually we find that some of ...
Empirical distribution learns w.r.t total variation distance

I am trying to prove or disprove that the empirical distribution can learn any continuous distribution w.r.t. the total variation distance. The context is statistical learning. I am quite ...
Optimizing function defined by integral

Let the two functions $q: \mathbb{R}^d \rightarrow\mathbb{R}^{+}$ and $s: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}^{+}$, $d \in \mathbb{N}$, where both are assumed to be continuous and ...
VC dimension of indicator functions is equal to pseudo dimension

I am reading "Foundations of Machine Learning" by Mehryar Mohri. In the proof of Theorem 11.8, it makes the following statement, which I cannot ...
Expectation of linear form multiplied by quadratic form for MVN distribution

Assume that $\bf{x}$ is a random vector that is distributed multivariate normal with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Let $\bf{A}$ be a matrix of constants. I'm ...
How do I obtain the primal and dual for the regression estimator $\min _\beta[\|\beta\|^2+\sum_{i=1}^n \xi_i^2]$ s.t. $\xi_i=y_i-h(x_i)^\top \beta$?

I am working on a statistical learning exercise that requires some knowledge of convex optimization which I am unfortunately lacking. Consider the linear regression model $$y_i=h(x_i)^\top\beta+\...
$\min$-entropy for the uniform distribution on $[n]$

The min-entropy of a distribution $\nu$ on $[n]$ is given as: $$H_{\infty}(\nu)=\min_{i} \log(\frac{1}{\nu(i)})$$ Now we will prove that for every distribution $\nu$ on $[n]$ and for $U$ being ...
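
The definition above can be evaluated directly. This is a minimal sketch, assuming natural logarithms and a distribution passed as a list of probabilities; the function name `min_entropy` is illustrative and not part of the question.

```python
import math

def min_entropy(nu):
    """Min-entropy H_inf(nu) = min_i log(1 / nu(i)).

    Outcomes with nu(i) = 0 contribute log(1/0) = +inf and never attain
    the minimum, so the value equals -log(max_i nu(i)).
    """
    if abs(sum(nu) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -math.log(max(nu))

n = 8
uniform = [1.0 / n] * n           # uniform distribution U on [n]
skewed = [0.5, 0.25, 0.25]        # assumed example distribution

print(min_entropy(uniform))       # equals log(n) for the uniform distribution
print(min_entropy(skewed))        # equals log(2), set by the largest mass 0.5
```

Because only the largest probability matters, any distribution on $[n]$ that puts more than $1/n$ on a single outcome has a smaller value than the uniform one.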
Reason behind objective function in Linear Discriminant Analysis

I don't really understand the objective function to be optimized in Linear Discriminant Analysis (LDA). My question is centered around the same concepts mentioned in this other one. The analysis ...
Define 'accuracy' for numerical data?

Normally, people use 'accuracy' to describe the output quality (from a model or methodology) for categorical data. However, I am wondering could the ...
Error propagation and Gradient Descent

I was looking at error propagation (or propagation of uncertainty on Wikipedia). My primary concern is getting an estimate of error of ...
Unknown normalization for probability distribution in EM algorithm

I am exploring the use of the Expectation Maximization (EM) algorithm in machine learning where the exact distribution of the data is unknown as all the observed sample pairs do not form the complete data. ...
Pseudo-determinant of rank deficient matrix times a constant.

I have a question regarding the pseudo-determinant of a rank-deficient matrix times a constant. Let's say matrix $K$ has dimension $n \times n$, however $\text{rank}(K)<n$. Does the following rule ...
Verify whether it's a Bregman loss function, maybe by solving a differential equation

I have a function $f(x, y;\mu) = \frac{\mu}{x}(x-y)^2$, where $\mu > 0$ is a parameter. I want to see whether it's a Bregman loss function. A Bregman loss function is defined as: $D_\phi(x,y) = \phi(...
Empirical Rademacher complexity bound

Consider the hypothesis class $$\mathcal{H} = \{\mathcal{X}\ni x\to r^2-\|\Phi(x)-\mathbf{c}\|^2:\|\mathbf{c}\|\leq \Lambda, 0<r\leq R\}$$ where $\Phi: \mathcal{X} \to \mathbb{H}$ is a feature map ...
Rademacher complexity of Binary classification

I am trying to show the inequality below, please note that in this case I am considering the labels to be $Y_i \in \{0, 1\}$, I state this since I have seen results but for labels that are in $\{0,1\}$...
Fitting a non-linear curve in symmetric positive definite matrix manifold

I have a variable $x \in \mathbb{R}$. For some values of $x$, $\{x_1, ..., x_n\}$, I have measured a covariance matrix of the variable $y \in \mathbb{R}^n$, conditional on these values of the variable ...
Bound on inverse covariance from covariance in regularized covariance estimation problem

In this paper by Bickel and Levina, I am confused about result (A15) which claims that since $$ (A14) \qquad \| \text{Var}(\mathbf{X}) - \widehat{\text{Var}}(\mathbf{X})\|_{\max} = O_P(n^{-1/2} \log^{...
What's the significance of Mean Squared Error? Why not something else?

Background: Masters in CS/Math. I'm brushing up on statistics, and I see Mean Squared Error (MSE) everywhere. As a student I took it for granted, but now when I tried to find the reasons why it's so ...
Bound on the difference between the true risk of an empirical risk minimizer $\hat{h}$ and an oracle $\bar{h}$.

In Lecture 3 on Concentration Inequalities in Philippe Rigollet's Mathematics of Machine Learning course on MIT OCW, there is the following theorem, which is the Theorem on the first page of this ...
Are these two inequalities equivalent?

Let's assume that $I_j \in \mathcal{J}$, where $\mathcal{J}$ is a set of images that are correctly classified and $p(I)$ is the output probability distribution of the used underlying model. Out of $\...
Why is the KL divergence between contracted $Bin(N,\theta)$ and $Bernoulli(\lambda)$ a convex function in $\lambda$?

I need to prove that the KL divergence $D(\bar{\mu}(\theta)||Y(\lambda))$ between the following variables is convex w.r.t. $\lambda$. The variables are defined as: $ \bar{\mu}(\theta) =\frac{1}{N}\...
$UCB-\alpha$ policy for multi-armed bandit - conditions on UCB indices for picking suboptimal arm

While reading the optimality proof for the $UCB-\alpha$ policy for the multi-armed bandit problem, I came across a claim whose logic I couldn't understand. Notations: $I_{i}(t) = \hat{\mu}_{i}(...
Relationship between eigenspectrum of gram matrix and kernel operator

A kernel operator is a function $k: \mathbb{R}^2 \to \mathbb{R}$, for instance $k(x, x') = \exp(|x-x'|/2)$ or $k(x, x') = x \cdot x'$. There are many kernels common in statistics and machine learning, ...
Application of Hoeffding's inequality to the Stochastic Multi-Armed Bandit Problem

I'm following this note to learn about deriving an upper bound of the UCB algorithm on the Stochastic Multi-Armed Bandit Problem. In particular, the proof of Lemma 15.6 there implies that we can ...
(Reference Request, I guess?) No-Free-Lunch Theorem for Unsupervised Learning

I am familiar with David H. Wolpert's No-Free-Lunch Theorem for Supervised Learning. Now I am wondering: Is there some sort of such a theorem for unsupervised learning? E.g. for Clustering? And what ...
Role of variance in consistent estimators

By definition, a consistent estimator, or rather a weakly consistent estimator, is one that causes data points to converge to their true value as the number of data points increases. So naturally, bias ...
Taylor Series Approximation of the Variance Function

The problem setting is offline learning from bandit feedback data. Given a context vector $x$, a policy chooses an action $a$, with policy defined as $h_w(y \vert x)$, where $w$ is the 'learnable' ...
Understanding the gradient contribution of each point in a linear regression line

Given a data set $D=\{(-1,0),(1,-2),(2,-1),(3,1)\}$ which consists of $(x, y)$ pairs. Consider a linear regression model of the form $y = {\theta}^Tx + {\theta}_0$ where $\theta = 0.5$ and $\...
Statistical test for comparing number of clusters in data

I am performing $K$-means clustering on a dataset consisting of $n$ observations and $d$ variables, and I'm trying to determine the optimal number of clusters. Is there a test that can determine the ...
Why are errors random in linear regression?

In data science when we study linear regression from a mathematical point of view we often have the following hypothesis: We have points $y_i = x_i\beta + \epsilon_i$ with $\epsilon_i$ being a random ...
Can SVM be a special case of PCA?

Let $X$ and $Y$ be two linearly separable finite subsets of a $K$-dimensional real vector space $V$ with orthonormal basis $A = \{a_1,\ldots, a_K\}$. The covariance matrix $\Sigma_A$ of the set $X \cup Y$...
Convergence of the Expectation-Maximization algorithm

Studying the Expectation-Maximization algorithm, I noticed that I couldn't find any proof that the parameters actually converge, nor that the limit is a local extremum of the likelihood (or even just ...
What does it mean for a function to be differentiable/continuous when the input is a function?

I have the loss function $$L(h) = \sum_{i=1}^{n}(h(x_i) - y_i)^2$$ $h$ is a function that spits out the predicted value when fed in a vector $x_i$. The domain then for $h$ is $\mathbb R^d$ and the ...
Error bounds for semi-online learning problem

I want to solve the following problem: Consider the noise-free classification setup. Let $\mathcal{F}$ denote an infinite class of (binary) classifiers with finite VC dimension $d$. Let $f* ∈ \...
Difference of sampling and calculating probability distribution function.

I have a basic question about sampling from a probability distribution. For instance, in Importance Sampling, it's hard to sample from $p$ directly, so we sample from the proposal distribution $q$ then ...
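
The distinction between sampling from a distribution and evaluating its density can be sketched as follows. This is an illustrative example under assumed choices that are not in the question: target $p = N(0, 1)$, proposal $q = N(0, 2^2)$, and $f(x) = x^2$. Samples are drawn only from $q$, while $p$ is only evaluated pointwise to form the weights $p(x)/q(x)$.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x (no sampling involved)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def importance_sampling_mean(f, p_pdf, q_pdf, q_sampler, n):
    """Estimate E_p[f(X)] from draws of q, weighting each draw by p(x)/q(x)."""
    total = 0.0
    for _ in range(n):
        x = q_sampler()                      # sample from the proposal q
        total += f(x) * p_pdf(x) / q_pdf(x)  # evaluate both densities at x
    return total / n

rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = importance_sampling_mean(
    f=lambda x: x * x,
    p_pdf=lambda x: normal_pdf(x, 0.0, 1.0),
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),
    q_sampler=lambda: rng.gauss(0.0, 2.0),
    n=100_000,
)
print(estimate)  # close to 1, the second moment of N(0, 1)
```

The proposal here is wider than the target, which keeps the weights bounded; a proposal with lighter tails than the target can make the estimate unstable.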
Example of tightness of Sauer-Shelah lemma

I was given the task of finding an example of a family of events (or hypotheses, concepts, etc.) such that the Sauer-Shelah lemma is tight. The lemma states: Assume that the Vapnik-Chervonenkis ...
Is the minus absolute value of the difference a kernel?

Precisely, is $k(x_i- x_j) = -\|x_i-x_j\| \quad x_i, x_j \in \mathbb R$ a valid kernel? I know that the absolute value of kernel formulation is not a valid kernel since it is not positive semi-...
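
One quick numerical check of positive semi-definiteness is to evaluate the quadratic form $\sum_{i,j} c_i c_j k(x_i, x_j)$ on a small point set. This is a minimal sketch; the points, coefficients, and function names are assumptions chosen for illustration.

```python
def neg_abs_kernel(x, y):
    """Candidate kernel k(x, y) = -|x - y| on the real line."""
    return -abs(x - y)

def quadratic_form(kernel, points, coeffs):
    """Compute sum_{i,j} c_i * c_j * kernel(x_i, x_j)."""
    return sum(
        ci * cj * kernel(xi, xj)
        for ci, xi in zip(coeffs, points)
        for cj, xj in zip(coeffs, points)
    )

points = [0.0, 1.0]   # assumed test points
coeffs = [1.0, 1.0]   # assumed coefficients
value = quadratic_form(neg_abs_kernel, points, coeffs)
print(value)  # -2.0
```

A positive semi-definite kernel requires this quantity to be non-negative for every choice of points and coefficients, so a single negative value is enough to rule it out.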
Simple statement in the elementary proof of the Johnson-Lindenstrauss lemma (random projections)

In the simple proof of the Johnson-Lindenstrauss lemma by Sanjoy Dasgupta and Anupam Gupta, which can be found here, they state the following (p.$62$): Repeating this projection $O(n)$ times can ...
How is it possible to have a maximum of a quantity, but then no argument that attains this maximum?

I am having difficulty with an intuitive understanding of a definition in an exposition of the proof of the Vapnik-Chervonenkis inequality in the notes of Robert Nowak (2009). The proof strategy is taken ...
Intuition for Local vs. Global notions of Metric Entropy in Statistics

I am looking for intuition regarding the following statement on page 4 of this paper by Gassiat and Van Handel: However, in finite dimensional settings, global entropy bounds are known to yield sub-...
Conditional Second Moments of Multivariate Normal Variable on Binary Vectors

Suppose we observe a binary table $Y \in \mathbb R^{N \times G}$, corresponding to $N$ observations of $G$-dimensional binary vectors $Y_1, \cdots, Y_N$. We imagine each vector $Y_i$ is generated from ...
Variational Autoencoders and the inequality $\mathbb{E}_{z \sim q(z|x)} \log (p_{model}(x | z)) - D_{KL}(q(z | x)||p_{model}(z)) \leq \log (p_{model}(x))$

I am reading section $20.10.3$ of the book Deep Learning on Variational Autoencoders, where the authors write: To generate a sample from the model, the VAE first draws a sample $z$ from the code ...
What is the meaning of "with probability at least 1-x over the drawing of the m training patterns"?

I'm reading a book about support vector machines and I encountered this. What is the meaning of $\text{with probability at least } 1-\delta \text{ over the drawing of the } m \text{ training patterns}$?
Proving this upper bound involving VC dimension

Let $S_n = \{ (x_i, y_i)\}^n_{i=1}$ be a data set. Let $\mathcal H$ and $\mathcal H' $ be hypothesis classes, such that $\mathcal H' = \{ h \in \mathcal H: \hat {er}_{S_n}(h) \leq \beta\}$, where $\...
Deriving the solution for ridge polynomial regression.

We have the following loss function: $$\operatorname{Err}(x)=\frac{1}{n}\sum_{i=1}^n(h_w(x_i)-y_i)^2+ \lambda\|w\|^2$$ I need to derive the solution for a polynomial of degree $0$ ($h_w(x)=w_0$) and ...
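
For the degree-$0$ case, the loss as written reduces to $\frac{1}{n}\sum_i (w_0 - y_i)^2 + \lambda w_0^2$. Setting its derivative $2(w_0 - \bar{y}) + 2\lambda w_0$ to zero gives $w_0 = \bar{y}/(1+\lambda)$, where $\bar{y}$ is the mean of the $y_i$. The sketch below checks this numerically on assumed toy data; the function names are illustrative.

```python
def ridge_loss_degree0(w0, ys, lam):
    """Loss (1/n) * sum_i (w0 - y_i)^2 + lam * w0^2 for the constant model h_w(x) = w0."""
    n = len(ys)
    return sum((w0 - y) ** 2 for y in ys) / n + lam * w0 ** 2

def ridge_solution_degree0(ys, lam):
    """Closed-form minimizer: the mean of ys shrunk by the factor 1 / (1 + lam)."""
    return (sum(ys) / len(ys)) / (1.0 + lam)

ys = [1.0, 2.0, 3.0]   # assumed toy targets, mean 2.0
lam = 1.0              # assumed regularization strength
w0 = ridge_solution_degree0(ys, lam)
print(w0)  # 1.0

# Sanity check: nudging w0 in either direction should not lower the loss.
best = ridge_loss_degree0(w0, ys, lam)
print(best <= ridge_loss_degree0(w0 + 0.01, ys, lam))
print(best <= ridge_loss_degree0(w0 - 0.01, ys, lam))
```

With $\lambda = 0$ the solution is the plain sample mean, and larger $\lambda$ shrinks it toward zero.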
For a Bayes classifier, can we prove that adding noise to data does not increase its accuracy?

In terms of a Bayes classifier, it's intuitive to consider that adding noise to data CANNOT increase the accuracy. Take a binary classification problem as an example. The data distribution is $(x,y)\...
How to interpret $P(z|x, y; k)$ with examples?

Following this answer and this, I am trying to understand what can be meant by $P(z|x, y; k)$ notation. How to interpret this with examples in terms of machine learning? As I understand, $P(z|x, y)$ ...
How many data points must be in a subset of a dataset before the subset is representative of that parent dataset?

It makes sense to me that a randomly sampled subset of a dataset should still be theoretically representative of its parent. When you take data and split into training and test sets, you assume that ...
How to give a high-probability uniform estimation of a potential having access to noisy pointwise estimates of the associated vector field?

As in the title, our goal is to estimate uniformly and with high probability (and up to a constant) a potential having access to noisy pointwise estimates of the associated vector field (i.e., the ...
