Generalization in Neural Networks: Can one Impose Conditions on the Data?

There is a well-developed theory on generalization bounds for deep neural networks, using VC dimensions and Rademacher Complexities. They work for any underlying "true" distribution $\...
The meaning of "with probability at least $1-\delta$"

In the theoretical analysis of an algorithm for stochastic optimization, we often need to prove that something like $$\text{error}\leq\epsilon \quad (\text{under some conditions})$$ holds with probability at least $...
What statistics books would you recommend for an undergraduate student who wants to be a machine learning engineer?

I'm an undergraduate software engineering student and I will be taking a statistics course this semester. I was thinking of buying a statistics textbook such as Probability and Statistics for ...
Probabilistic interpretation of linear regression: implication step

I am reading Andrew Ng's notes on linear regression, and in this section, he attempts to derive the formula for the least squares using a probability approach:
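For reference, the standard argument can be sketched as follows. This is a sketch only, assuming the usual model $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$ with i.i.d. noise $\epsilon^{(i)} \sim N(0,\sigma^2)$ and $n$ training examples; these symbols are assumptions, since the excerpt does not show the notation or the step in question. $$\begin{aligned} \ell(\theta) &= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\ &= n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2 \end{aligned}$$ The first term does not depend on $\theta$ and $\sigma^2 > 0$, so under these assumptions maximizing $\ell(\theta)$ is equivalent to minimizing the least-squares cost $\frac{1}{2}\sum_{i=1}^{n} (y^{(i)} - \theta^T x^{(i)})^2$.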
Minimizing the variance in a variant of bagging: Weighted Aggregation (Wagging)

In our machine learning course we learned bagging, and a variant of bagging called Weighted Aggregation was introduced, where the result is a weighted sum of all the estimators instead of ...
Phased version of Upper-Confidence Bound Algorithm (UCB)

I am interested in an exercise (specifically Exercise 7.4) from Bandit Algorithms by Tor Lattimore and Csaba Szepesvari. It studies a phased version of the popular UCB algorithm. The algorithm takes ...
Matrix division - Bias Variance Tradeoff

I am currently trying to prove that the ordinary least squares estimate has no bias on a given dataset, with the bias given as ... Why does this identity hold in the following calculation: $$(X^...
Theoretical Machine Learning: How to calculate the expected risk of a model with unknown distribution $\hat{h}$?

Suppose we have fixed, deterministic feature vectors $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ with an unknown model parameter $\theta^*$ and error $z \sim N(0,\sigma^2)$. For the feature vector $x_i$ ...
Linear Regression: Correlation between predictors and residuals

I am reading Chapter 3 from Elements of Statistical Learning. In the explanation for Forward Stagewise Regression and Least Angle Regression, the authors explain that reducing the correlation between ...
What are some functions $\mathbb{R}^+ \to \mathbb{R}$ other than $\log$?

I am interested in functions $f: \mathbb{R}^+ \to \mathbb{R}$, for the purpose of mapping non-negative statistical features of objects (such as lengths) to the whole real line. Then, I intend to use ...
Why can I move the summation sign down?

Hi, I'm trying to teach myself machine learning by going through the book "An Introduction to Statistical Learning", and I got stuck on one of the exercise questions. In the attached image ...
The relation between Bregman divergence and KL divergence

I see that the Bregman divergence is defined as $d_\phi(x,y)=\phi(x)-\phi(y)-\langle x-y,\nabla\phi(y)\rangle$, where $x,y\in \mathbb{R}^d$ and $\phi$ is a strictly convex function. KL divergence is an instance of ...
When is it true that $\sup g - \inf g \le 2\sup g$?

I am currently reading the first version of the paper titled "On the Margin Theory of Feedforward Neural Networks" by Colin Wei, Jason D. Lee, Qiang Liu and Tengyu Ma. In Lemma C.4 of the ...
Choosing a loss function to minimize the total sum

I have the following regression problem. I'm forecasting a random variable $X$ for every day of a month, represented as $X_{ij}$ where $i$ is the day of the month and $j$ is the month number. I care if ...
Request for reference: uniform convergence for non 0-1 loss functions

In the book "Understanding Machine Learning", there is Theorem 6.11 with the following statement: Let ${\cal H}$ be a class and let $\tau_{\cal H}$ be its growth function. Then, for every $\...
Data Groupings and Bayesian Analysis for a Generative Model

I have a simple question that I think will have potentially many solutions, depending on the level of complexity with which one wants to approach it. I've built a generative model with two variables ...
How do I compute the derivative of the cross-entropy loss $H(P,Q)$ with respect to the weights $W$?

I'm trying to understand the cross-entropy loss with the Iris dataset for binary classification, where y=1 denotes that the plant belongs to Setosa and y=0 denotes that it belongs to Non-Setosa. Consider ...
Optimization: max to softmax for convexity?

Assume we have the following optimization problem: for a family of $m$ vectors $\{x_i\}\in \mathbb{R}^n$, a family of $l$ vectors $\{c_i\}\in \mathbb{R}^n$ with $l\ll m$ and for a family of $l$ ...
Perplexities about Bayesian inference and model averaging (BMA)

Reading about the Bayesian approach to model selection, I was wondering about the more mathematical meaning of Bayesian model averaging. Say, for example, that we are given a dataset $\mathcal{D} = \{\...
How to verify whether a metric is of negative type or not?

A metric $d(\cdot,\cdot)$ on a space $S$ is said to be of negative type if, for all $n \geq 2$, $z_{1}, \ldots, z_{n} \in S$, and $\alpha_{1}, \ldots, \alpha_{n} \in \mathbb{R}$ with $\sum\limits_{i=...
Measurability issues in the symmetrization step in the proof of $\varepsilon$-sample theorem

Let $(\mathcal{X},d)$ be a metric space and $\mu$ be a Borel probability measure on $(\mathcal{X},d)$. Let $m \in \mathbb{N}$ and define the two probability product measures $\mu^{m} := \otimes_{k=1}^{...
Empirical Fisher Information but with unknown true parameters and distribution?

I am not sure if I am asking this correctly. I am working on using Fisher Information to examine the information in a model (say neural networks for simplicity). What I know is that the definition of Fisher ...
Deriving the regularization term in Bayesian lasso regression

The title is probably not very good; I thought hard about how to phrase this correctly. I'd be grateful if someone tells me it's wrong and how to correct it. I am practicing for my exams and I have ...
How to obtain the parameter update for the multiclass classification (general loss and activation function)?

Consider the feature space $\mathcal{X}=\mathbb R^{d}$ and $\mathcal{Y}=\{1,\ldots,c\}$ such that $c > 2$. We consider some activation function $\alpha: \mathbb R^{c} \to \mathbb R^{c}$ and our weight ...
BFGS Formula from Kullback-Leibler Divergence

On page 411 in this book, the authors give the following BFGS formula $$ \boxed{\boldsymbol C_{\textrm{BFGS}} = \boldsymbol C + \underbrace{\frac{\boldsymbol g^\top\boldsymbol\delta+\boldsymbol g^\top\...
Generative model evaluation metric: Precision & Recall

In this paper, a new metric was proposed to evaluate generative models. Equation (1) decomposes the generative distribution and the real distribution into two parts w.r.t. their intersection of the ...
Bayes classifier: handling conditional expectation / probability

I am learning about the Bayes optimal classifier, and there is a step in a proof I struggle with. One can find this proof also on the Wikipedia page:
Which standard deviation for model averaging?

I hope Math Stack Exchange is the right place for this question, even though it comes from an AI point of view. Say I have a machine learning model and, for robustness of results, I initialize it with ...
Definition of Ergodicity in Theodoridis' Machine Learning

This is related to, but not the same as, ... Note that I am not assuming stationarity. Theodoridis, in his Machine ...
Understanding the $\alpha$-regularity assumption for trees

In this paper, definition 4 claims that a tree grown by recursive partitioning is $\alpha$-regular for some $\alpha>0$ if each split leaves at least a fraction $\alpha$ of the available training ...
EM algorithm for the maximum of two normal distributions

Let $X_i \sim N(\mu_1,\sigma^2)$, $Y_i \sim N(\mu_2,\sigma^2)$, and $O_i = \max(X_i,Y_i)$. I need to find $\mu_1, \mu_2$ using EM. My attempt: first I defined $Z_i = \left\{\begin{matrix} 1, ~ X_i \ge Y_i \\ 0,...
Optimizing recursive functions with time series data

I have a recursive function: $f(0,a)$ is known, and $f(t+1;a,b)=f(t;a,b)+g(t,b)$, where $a,b$ are constants and $g$ is a function. I also have a sequence of data, $D(t)$. I am trying to optimize $f$ with ...
Derivation of the bias-variance tradeoff

I'm having trouble understanding the derivation of the bias-variance tradeoff which is also given in the article on the mean squared error. Let some data be represented by the random variable $X$ with ...
How to prove total test error is independent of the selected learning algorithm

I'm looking at the following proof: Where: Note: I'm new to this but I think I understand all the below variables correctly now. Mistakes are possible though. $f$ is an ideal function with perfect ...
Is a Chi-Squared goodness of fit test appropriate for Neural Network regression?

So I always have wished that regression of neural networks gave more interpretable results and I'm pretty hopeful that chi-squared tests anchor these MSE values in the same way that accuracy anchors ...
Support Vector Machine Optimization Problem

The formulation of the SVM optimization problem is: \begin{equation} \begin{aligned} & \max_{w,b} \frac{1}{||w||} \\ & \text{ subject to } \\ & y_i(w^{T}x_i+b) \geq 1 \end{aligned} \end{...
Distance weighted uniform sampling - sampling procedure

In the excellent paper Wu, Chao-Yuan; Manmatha, R.; Smola, Alexander J.; Krähenbühl, Philipp (2017): Sampling Matters in Deep Embedding Learning. Available online at ...
Conditional Bias Variance Decomposition

The standard bias variance decomposition says that: $$ E |f(X) - Y|^2 = \int_{\mathbb{R}^d} |f(x) - m(x)|^2 \mu(dx) + E|m(X) - Y|^2, $$ where $\mu$ is some distribution over $X$. I am trying to ...
Reconstruction error for PCA not equal to zero?

I am working on an assignment for school where they ask us to perform PCA on a data set consisting of 500 data points, where each data point is of dimension $p=256$. You usually project your ...
Find solution for optimal regression coefficients

Consider the cost function $E(\mathbf{w}) = \displaystyle\frac{1}{2} \sum_{i=1}^{N}{(\mathbf{d}_i - \mathbf{x}_i^T\mathbf{w})^2} + \frac{\lambda}{2} \left\lVert \mathbf{w}\right\rVert^2$ where $(\...
Two-way ANOVA for machine learning model analysis.

I have different machine learning models and losses, each trained using 5-fold cross-validation. Would it make sense to run a two-way ANOVA to evaluate which perform statistically better? ...
Does the law of large numbers hold for covering numbers?

I am self-studying empirical process theory. I have encountered the covering number $N(\delta,\mathcal{G},P)$, as well as the empirical version $N(\delta,\mathcal{G},P_n)$. It seems intuitive to ...
Proof of unbiased estimator

Assume: $$ \phi = \int f(x)p(x)dx = E_p(f)$$ Let $x_s \sim p$, $s=1,\ldots,S$, be i.i.d. ($p(x_s=x)=p(x)$ and $p(x_1,x_2) = p(x_1) p(x_2)$). \begin{align} \hat{\phi} &= \frac{1}{S}\sum_{s=1}^{S}f(x_s) \\ E[\...
Linear Regression Prediction Errors

Suppose that we perform linear regression on data $\mathbf{X}$ (an $N \times {(D+1)}$ matrix) and predictions $\mathbf{y}$ (an $N \times 1$ vector). Let $\mathbf{w}$ ($(D+1) \times 1$ vector) be the ...
Is the logistic regression cost function in scikit-learn different from standard derivations?

I am trying to understand the math behind logistic regression. Going through a couple of websites, lectures and books, I tried to derive the cost function by thinking of it as the negative of the ...
In Fisher’s discriminant for multiple classes, how do you manage when $S_w$ is a singular matrix (so you can't get $S_w^{-1}$)?

I am trying to use Fisher’s discriminant for multiple classes to reduce the dimension of the MNIST data set, similar to this post:
What is the algorithm for the kNN density estimator?

I am reading Pattern Recognition and Machine Learning by Christopher Bishop. In chapter two he talks about using kNN for density estimation. I want to replicate a plot using Python/R/MATLAB. He is ...
Need a probabilistic approach to determine if a data-set A includes all the elements of data-set B

My job is to identify whether two given datasets are the same. This is to be done on computers using some programming language (C++). Since the data could be huge, I don't want to read all the elements of ...
Covariance of a Vector-Valued Random Variable

I'm reading through Andrew Ng's lecture notes for CS229 and he makes the statement that, for a random variable $Z \in \mathbb{R}^{n}$, \begin{align} \mathrm{Cov}(Z) &= E[(Z - E[Z])(Z - E[Z])^{T}]\\ &...
Intuition behind the exponential loss function

I'm reading about AdaBoost from the book The Elements of Statistical Learning. The book mentions that, to train the model, the exponential loss function is used: $$L(y, f(x)) = e^{-y f(x)},$$ where $...
