All Questions
Tagged with statistics machine-learning
659
questions
1
vote
0
answers
25
views
Bayesian classifier and linear regression on dummy variables
EDIT: I am finally not so sure about one thing: they say regression but not linear regression. I may have misunderstood the whole paragraph.
In the book Elements of Statistical Learning, (Hastie-...
3
votes
0
answers
98
views
Normalizing Flow Penalization
I am looking to fit a normalizing flow, specifically a Masked Autoregressive Flow model. However, this model leads to high variance on lower dimensional, less complex data. I am using a neural network ...
1
vote
0
answers
13
views
How is this R squared calculated in the context of clustering?
I was reading the paper "Consistent Individualized Feature Attribution for Tree Ensembles" by Scott Lundberg et al and cannot understand how the calculation for the $R^2$ works here - see ...
0
votes
2
answers
74
views
How can I perform polynomial regression with this dataset?
I have the training set $\{(0,0), (1,1), (2,1), (1,2)\}$. I want to find the best quadratic polynomial of the form $f(x) = a + bx + cx^2$ that minimizes the sum of squared error between $y$ and the ...
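A minimal sketch of one way to set this up, assuming the truncated objective is the usual sum of squared residuals $\sum_i (y_i - f(x_i))^2$. It builds the normal equations for the design matrix with rows $(1, x, x^2)$ and solves them exactly. The helper name `fit_quadratic` is hypothetical and not from the question, and the solver assumes at least three distinct $x$ values so the system is non-singular.

```python
from fractions import Fraction

def fit_quadratic(points):
    """Least-squares fit of f(x) = a + b*x + c*x**2 via the normal equations."""
    # Each design-matrix row is (1, x, x^2); Fractions keep the arithmetic exact.
    rows = [[Fraction(1), Fraction(x), Fraction(x) ** 2] for x, _ in points]
    ys = [Fraction(y) for _, y in points]
    # Augmented system [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination on the 3x4 augmented matrix.
    for col in range(3):
        pivot = next(r for r in range(col, 3) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(3):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return tuple(aug[i][3] for i in range(3))

points = [(0, 0), (1, 1), (2, 1), (1, 2)]
a, b, c = fit_quadratic(points)
sse = sum((y - (a + b * x + c * x * x)) ** 2 for x, y in points)
print(a, b, c, sse)  # prints: 0 5/2 -1 1/2
```

Under that assumption the fit is $f(x) = 2.5x - x^2$. Because $x = 1$ appears twice with different $y$ values, the curve passes through their mean $1.5$ there, which leaves a residual sum of squares of $0.5$.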
1
vote
1
answer
125
views
Binary classification problem
This problem is in the context of binary classification. Let $f_\omega (t) = \mathcal I[\sin (\omega t) \geq 0]$, and $\mathcal F = \{ f_{\omega} : \omega \in \mathbb R\}$, $t\in\mathbb R$. For any given $m = ...
0
votes
1
answer
721
views
What does w.p. mean in formulas?
I'm reading Facebook's paper on the Prophet algorithm. I don't understand one part of a formula, "w.p.", and it's hard to search for on Google. Could anyone help me understand this?
https://peerj.com/...
0
votes
0
answers
70
views
Question about KL divergence and this formula.
I was reading papers : https://arxiv.org/pdf/1503.03585.pdf for understanding https://arxiv.org/pdf/2006.11239.pdf (this paper is about denoising diffusion model)
But I can't figure how the author ...
1
vote
0
answers
33
views
How to calculate the uniform entropy or VC dimension of the following class of functions?
When dealing with a U-process, I need to calculate the following uniform entropy.
For any $\eta>0$, function class $\mathcal{F}$ containing functions $f=\left(f_{i, j}\right)_{1 \leq i \neq j \leq n}: \...
2
votes
1
answer
81
views
What does this math notation mean? $\min_{i} \|x_{i}\|$ [closed]
$\qquad\min_{i} \|x_{i}\|$
I'm doing some machine learning problems and I ran into this notation.
I don't understand what "mini" means in this case.
Is it the smallest element in the norm of ...
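For reference, the notation reads as "compute the norm of each $x_i$, then take the smallest of those numbers", with the subscript $i$ under $\min$ being the index that is minimized over. A small sketch, assuming the Euclidean norm (the excerpt does not say which norm is meant) and using made-up vectors:

```python
import math

# Hypothetical vectors x_1, x_2, x_3; they are not from the question.
vectors = [(3.0, 4.0), (1.0, 0.0), (0.0, 2.0)]

# Euclidean norm of each vector (assumed; other norms work the same way).
norms = [math.hypot(*v) for v in vectors]

smallest_norm = min(norms)                 # this is min_i ||x_i||
argmin_index = norms.index(smallest_norm)  # position of the minimizing vector

print(norms, smallest_norm, argmin_index)  # prints: [5.0, 1.0, 2.0] 1.0 1
```

The result is a single number (here $1$), not a vector. The vector that attains it would be written $\arg\min_i \|x_i\|$.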
0
votes
0
answers
34
views
How do I stop my normals from spreading out under perturbation?
In many AI generative models you start with a so-called latent vector $\mathbf{z}$ of high dimension $d$ such that each $z_i \sim \mathcal{N}(0,1)$. I'd like to randomly perturb this distribution in ...
0
votes
1
answer
244
views
Clarification about the KL divergence between a continuous and a discrete distribution
I was reading this blog post on Bayesian neural networks, where the author shows that if we use as a variational distribution a product of delta functions, then minimizing the loss function of a BNN is ...
4
votes
0
answers
390
views
Multiclass Linear Discriminant Analysis
This question is based on the Multiclass Linear Discriminant Analysis (MLDA) described in lecture slides by Olga Veksler, which is a generalization of Fisher's Linear Discriminant. My use in MLDA is ...
0
votes
0
answers
213
views
Training, validation, and test dataset and i.i.d. assumption
I wonder whether we should distinguish the validation and test dataset based on i.i.d. assumption.
According to the statistical learning theory, the i.i.d. assumption is required to affect ...
0
votes
0
answers
54
views
Likelihood in MAP and MLE for linear regression
In MAP estimation for a linear regression task, the posterior of the weight given the data is written as $p(w|X,Y)=\frac{p(Y|X,w)p(w)}{p(Y|X)}$. Why is the likelihood not $p(X,Y|w)$?
From my ...
0
votes
0
answers
45
views
Find variance of random vector
I have a question about random vectors
A random vector (X, Y ) has a continuous distribution with a density function
$$f(x, y) = \begin{cases}c \cdot x &\mbox{for}& 0 \leq x \leq 2, \max(0, 1 - x) \leq y \leq ...
2
votes
2
answers
173
views
Mathematical notation in a machine learning problem, majority rule
(I apologise that the title may be a bit confusing and I don't know if this is the right community to ask my question.)
This is a mathematical notation problem in the field of machine learning.
A ...
1
vote
1
answer
57
views
Understanding a Simple Proof with Integrals
In this machine learning paper, the following lemma is stated (and proven in the Appendix A, cf. page 11):
Lemma A.1 For random variables $X$, $Y$ and function $f(x, y)$ under suitable regularity ...
0
votes
1
answer
71
views
Can someone clarify the use of the width of the ellipsoid regarding Mahalanobis distance?
My knowledge of Math is limited. I was looking up Mahalanobis distance out of curiosity, after seeing a reference.
From Wikipedia: Mahalanobis distance - Intuitive explanation
Putting this on a ...
0
votes
0
answers
58
views
Using Chi-squared Tests for Feature Selection with Big Data?
When dealing with non-linear data, since the reliability of chi-squared tests diminishes with the number of samples, is it reasonable to divide a large dataset into sample spaces of, say, 20 or 40, ...
1
vote
0
answers
21
views
Determining Viability of Chi-squared for Feature Selection
How does one determine the likelihood chi-squared is accurately determining features during feature selection of categorical data? To summarize the rest of this post, the chi-squared test doesn't ...
2
votes
1
answer
186
views
Closure of balls in Reproducing Kernel Hilbert Space (RKHS)
Let $X \subset \mathbb{R}^m$ be compact, and $k: X\times X \rightarrow \mathbb{R}$ be a universal kernel function, in the sense that the corresponding RKHS $\mathcal{H}_k$ is dense in $C(X)$ under the ...
0
votes
1
answer
124
views
How to combine various measures into a single measure?
So I'm trying to understand the intuition behind the accepted answer here which is used to combine several scores into a single score.
Namely, this part:
...
1
vote
0
answers
70
views
Preventing rare extreme values in linear regression prediction
I am trying to train a model with a lot of input variables using linear regression. For technical reasons, my training data is obtained from a simulation that closely but not perfectly mirrors the ...
1
vote
0
answers
36
views
Area Under Precision-Recall and Area Under ROC curve for different numbers of observations
I am doing research comparing several algorithms for binary classification. It is worth mentioning that the data set is highly imbalanced, i.e., the minority class is only 0.2%.
Notation:
Area Under ...
4
votes
1
answer
183
views
Can Machine Learning models be considered as "Approximate Dynamic Programming"?
In the context of certain statistical/machine learning models, such as models that are trying to estimate "optimal policies" (e.g. reinforcement learning) - can we consider these models as "...
0
votes
0
answers
48
views
Complexity of Lebesgue measurable spaces
Consider a discrete finite set $\Omega=X\times Y \in \mathbb{R}^{m\times n}$ for finite $m,n$. Let $(\Omega,\Sigma,\mu)$ be the measure space. ($\Sigma$ is the power set and $\mu$ is $\sigma$-finite ...
0
votes
0
answers
21
views
"Manipulating" Normal Distributions
I am reading the following book https://algorithmsbook.com/optimization/files/optimization.pdf at page 281:
I am trying to understand how to manipulate the matrix terms to verify the following 2 ...
2
votes
2
answers
1k
views
Relationship Between Bayesian Optimization and Gaussian Process
In Bayesian Optimization, the function (i.e. objective function) that we are trying to optimize is modelled using some surrogate function - this surrogate function usually turns out to be a Gaussian ...
0
votes
1
answer
151
views
Box-Muller Transformation: Polar Coordinates Interpretation
I am aware that the Box-Muller transform leverages polar coordinates to arrive at the final transformations by plotting two uniform random variables, $(u, v)$ in the Cartesian plane. I have not seen ...
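For readers who want the polar picture in concrete form, here is a minimal sketch of the standard Box-Muller transform. It is the textbook version and not necessarily the formulation the question goes on to describe. One uniform draw $u$ sets the radius through $r^2 = -2\ln u$ and the other, $v$, sets the angle $\theta = 2\pi v$. The function name `box_muller` and the seed are illustrative choices.

```python
import math
import random

def box_muller(u, v):
    """Map u in (0, 1] and v in [0, 1) to two independent standard normal draws."""
    r = math.sqrt(-2.0 * math.log(u))  # radius: r^2 = -2 ln u
    theta = 2.0 * math.pi * v          # angle, uniform on [0, 2*pi)
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(0)  # fixed seed only so the sketch is reproducible
samples = []
for _ in range(50_000):
    u = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u) is defined
    v = rng.random()
    samples.extend(box_muller(u, v))

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean ~ {mean:.3f}, variance ~ {var:.3f}")
```

The sample mean and variance should land near $0$ and $1$. In polar terms, $r^2$ is exponentially distributed with mean $2$ and $\theta$ is uniform, which is exactly the polar decomposition of a pair of independent standard normals.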
3
votes
1
answer
182
views
Convergence of $M$-estimators when the argmin is not unique
Let $(X_i)_{1\le i\le n}$ be i.i.d. random variables taking values in a compact set $\mathcal X\subseteq \mathbb R^d$, and let $\mathcal P_n = \mathcal P_n(\cdot\mid X_1,\ldots,X_n)$ and $\mathcal P$ ...
1
vote
0
answers
43
views
Reducing variance in linear regression
While reading The Elements of Statistical Learning, I saw the authors state that by shrinking the coefficients of a linear regression you raise the bias while lowering the variance and thus, sometimes, ...
1
vote
1
answer
73
views
Parameter estimation in linear model - why standard deviation of parameter increases as X matrix gets wider?
Intro
Let $Y = X\beta + \epsilon$ where $X$ is randomly generated data from normal distribution fitted into $n \times m$ matrix and $\epsilon$ is a vector of normal random errors. Say that first 5 ...
2
votes
1
answer
209
views
Are Machine Learning Optimization Problems ever Categorized as "P" or "NP"?
In the context of Computer Science and Optimization, I have heard that different problems can be classified using the "P vs NP" framework. Essentially, there is a hierarchy of problems based ...
0
votes
1
answer
99
views
Deriving the Bayes Optimal Classifier (Mitchell, Machine Learning)
I am trying to recreate the Bayes Optimal Classifier result given in the Machine Learning textbook by Mitchell. Below, I've added the desired result from the text and my work.
I think I've taken the right ...
0
votes
1
answer
81
views
Mean squared error minimization
I'm studying machine learning right now and I have found the following exercise:
We define the mean squared error of a number $x \in \mathbb{R}$, where $a_{1},\ldots,a_{n} \in \mathbb{R}$
$$f(x)= \frac{1}{...
0
votes
0
answers
130
views
Textbook recommendation on rigorous machine learning results
I am looking for textbook(s) in machine learning theory that satisfies the following:
The text should be graduate level. It assumes all undergraduate level mathematics and early graduate level of ...
1
vote
0
answers
35
views
How to tell geometrically/graphically the statistical properties of a ring distribution?
I pulled this distribution of a 2D random variable from page 5: https://arxiv.org/pdf/1606.05908.pdf
I want to know:
How can we infer the covariance matrix from the plot?
How can we infer the ...
1
vote
0
answers
23
views
On the bounds of estimated conditional correlations and a follow-up question on the inferred properties of underlying structural parameters.
Framework
It is assumed that the data is Gaussian and follows the following structural equation model/additive noise model
$$ Y = \sum_{j=1}^{p} X_j \theta_j + \epsilon$$
$$ ||\theta||_0=s<<n $$...
1
vote
1
answer
113
views
Can't we just use PCA to solve the problem of Linear Regression?
From my intuitive understanding so far:
If I have, say, a set of 2D points, then performing PCA will give me the direction that drastically reduces the variance along one direction, right? But ...
1
vote
0
answers
35
views
Why are latent spaces able to learn representations - autoencoder?
As the title states, why are latent spaces even able to intelligently learn representations? There's no guarantee that we learn the most important features since it's all done automatically in ...
1
vote
0
answers
51
views
An efficient stopping rule to determine the sign of the mean of an i.i.d. sequence of random variables.
Does there exist a family of measurable functions $(f_t^\delta)_{t \in \mathbb{N}, \delta \in (0,1)}$ and constants $C,c>0$ such that, for each $t \in \mathbb{N}$ and $\delta \in (0,1)$ we have that $...
0
votes
1
answer
28
views
Problem with conditional expected value
In the book Elements of Statistical Learning, the authors say that the Expected Prediction Error, for an arbitrary test point $x_0$ is:
$$EPE(x_0) = E_{y_0 | x_0}E_\mathcal{T}(y_0 -\hat{y}_0)^2$$
where ...
2
votes
3
answers
397
views
Implementing multiclass logistic regression from scratch
This is a sequel to a previous question about implementing binary logistic regression from scratch.
Background knowledge:
To train a logistic regression model for a classification problem with $K$ ...
1
vote
1
answer
356
views
How does the bias weight $w_0$ get computed during ridge regression?
I am given a full-rank feature matrix $\mathbf{X}$ to which I am supposed to provide a closed form solution for the weights $\hat{\mathbf{w}}_{ridge}$ of a ridge-regression optimization problem. The ...
3
votes
1
answer
196
views
Implementing binary logistic regression from scratch
Background knowledge:
To train a logistic regression model for a classification problem with two classes (called class $0$ and class $1$), we are given a training dataset consisting of feature vectors ...
6
votes
1
answer
251
views
Proof of $\frac{1}{n}\mathrm{E} \left[ \| \mathbf{X}\mathbf{\hat{w}} - \mathbf{X}\mathbf{w}^{*} \|^{2}_{2} \right] = \sigma^{2}\frac{d}{n}$
I am trying to find a proof for the MSE of a linear regression:
\begin{gather}
\frac{1}{n}\mathrm{E} \left[ \| \mathbf{X}\mathbf{\hat{w}} - \mathbf{X}\mathbf{w}^{*} \|^{2}_{2} \right] = \sigma^{2}\...
4
votes
1
answer
123
views
How to compute the dual of an optimization problem defined on a function space?
I am interested in one result in the first version of the paper titled "On the Margin Theory of Feedforward Neural Networks" by Colin Wei, Jason D. Lee, Qiang Liu and Tengyu Ma.
In Equation ...
2
votes
1
answer
1k
views
Difference between EPE and MSE
In the book ESL (Elements of Statistical Learning), the authors introduce the EPE (Expected Prediction Error) and the MSE (Mean Squared Error).
I know that the EPE is defined as:
$$EPE(f)=E(Y-f(X))^2$$
...
2
votes
1
answer
173
views
Growth function $\tau_{\mathcal{H}}(m)$ lower bound
I have been working on this problem for a long time and I would like some help. They ask me to find for each $ n $ a hypothesis class $ \mathcal {H} \subset \{\pm 1 \}^{\mathbb {N}} $ with $ n $ ...
1
vote
1
answer
103
views
Generalization in Neural Networks: Can one Impose Conditions on the Data?
There is a well-developed theory on generalization bounds for deep neural networks, using VC dimensions and Rademacher Complexities. They work for any underlying "true" distribution
$\...