All Questions

1 vote
0 answers
25 views

Bayesian classifier and linear regression on dummy variables

EDIT: I am actually not so sure about one thing: they say regression, but not linear regression. I may have misunderstood the whole paragraph. In the book Elements of Statistical Learning (Hastie-...
Plop • 2,719
3 votes
0 answers
98 views

Normalizing Flow Penalization

I am looking to fit a normalizing flow, specifically a Masked Autoregressive Flow model. However, this model leads to high variance on lower-dimensional, less complex data. I am using a neural network ...
user2793618
1 vote
0 answers
13 views

How is this R squared calculated in the context of clustering?

I was reading the paper "Consistent Individualized Feature Attribution for Tree Ensembles" by Scott Lundberg et al. and cannot understand how the calculation of the $R^2$ works here - see ...
Penguines
0 votes
2 answers
74 views

How can I perform polynomial regression with this dataset?

I have the training set $\{(0,0), (1,1), (2,1), (1,2)\}$. I want to find the best quadratic polynomial of the form $f(x) = a + bx + cx^2$ that minimizes the sum of squared errors between $y$ and the ...
jem do • 185
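Since the excerpt sets up an ordinary least-squares problem, here is a minimal sketch of the fit in Python/NumPy, assuming the four points above are the whole training set (note the two different targets at $x = 1$, so no quadratic can interpolate all four points exactly):

```python
import numpy as np

# Training set from the question.
x = np.array([0.0, 1.0, 2.0, 1.0])
y = np.array([0.0, 1.0, 1.0, 2.0])

# Design matrix for f(x) = a + b*x + c*x^2.
A = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares solution minimizes the sum of squared errors ||y - A w||^2.
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print("a, b, c =", w)
```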
1 vote
1 answer
125 views

Binary classification problem

This problem is in the context of binary classification. Let $f_\omega(t) = \mathcal{I}[\sin(\omega t) \geq 0]$ for $t \in \mathbb{R}$, and $\mathcal{F} = \{ f_{\omega} : \omega \in \mathbb{R}\}$. For any given $m = ...
Keio203 • 561
0 votes
1 answer
721 views

What does w.p. mean in formulas?

I'm checking Facebook's paper about the Prophet algorithm. I don't understand part of a formula: "w.p.". It's hard to search on Google. Could anyone help me understand this? https://peerj.com/...
dmjy • 103
0 votes
0 answers
70 views

Question about KL divergence and this formula.

I was reading the paper https://arxiv.org/pdf/1503.03585.pdf to understand https://arxiv.org/pdf/2006.11239.pdf (this paper is about the denoising diffusion model), but I can't figure out how the author ...
NeverneverNever
1 vote
0 answers
33 views

How to calculate the uniform entropy or VC dimension of the following class of functions?

When dealing with U-processes I encounter the following uniform entropy calculation. For any $\eta>0$, function class $\mathcal{F}$ containing functions $f=\left(f_{i, j}\right)_{1 \leq i \neq j \leq n}: \...
leslie zhang
2 votes
1 answer
81 views

What does this math notation mean? $\min_{i} \|x_{i}\|$ [closed]

$\qquad\min_{i} \|x_{i}\|$ I'm doing some machine learning problems and I ran into this notation. I don't understand what "mini" means in this case. Is it the smallest element in the norm of ...
hagendatz1113
0 votes
0 answers
34 views

How do I stop my normals from spreading out under perturbation?

In many AI generative models you start with a so-called latent vector $\mathbf{z}$ of high dimension $d$ such that each $z_i \sim \mathcal{N}(0,1)$. I'd like to randomly perturb this distribution in ...
Hooked • 6,697
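One standard construction that keeps the $\mathcal{N}(0,1)$ marginals intact under perturbation is to mix in independent noise with weights on the unit circle; a minimal sketch, assuming a perturbation strength `eps` (not necessarily what the asker has in mind):

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps = 512, 0.1

z = rng.standard_normal(d)      # latent vector, z_i ~ N(0, 1)
noise = rng.standard_normal(d)  # independent perturbation

# sqrt(1 - eps^2) * N(0,1) + eps * N(0,1) is again N(0,1), so the
# marginals do not spread out, unlike the naive z + eps * noise.
z_new = np.sqrt(1 - eps**2) * z + eps * noise
print(z_new.std())  # remains close to 1
```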
0 votes
1 answer
244 views

Clarification about the KL divergence between a continuous and a discrete distribution

I was reading this blog post on Bayesian neural networks, where the author shows that if we use as a variational distribution a product of delta functions, then minimizing the loss function of a BNN is ...
Alucard • 284
4 votes
0 answers
390 views

Multiclass Linear Discriminant Analysis

This question is based on the Multiclass Linear Discriminant Analysis (MLDA) described in lecture slides by Olga Veksler, which is a generalization of Fisher's Linear Discriminant. My use of MLDA is ...
Triceratops
0 votes
0 answers
213 views

Training, validation, and test datasets and the i.i.d. assumption

I wonder whether we should distinguish the validation and test datasets based on the i.i.d. assumption. According to statistical learning theory, the i.i.d. assumption is required to affect ...
Minsik Seo
0 votes
0 answers
54 views

Likelihood in MAP and MLE for linear regression

In MAP estimation for a linear regression task, the posterior of the weights given the data is written as $p(w|X,Y)=\frac{p(Y|X,w)p(w)}{p(Y|X)}$; why is the likelihood not $p(X,Y|w)$? From my ...
William Lin
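A one-line factorization answers this, under the standard assumption that the input distribution does not depend on $w$:

$$p(X, Y \mid w) = p(Y \mid X, w)\, p(X \mid w) = p(Y \mid X, w)\, p(X),$$

so $p(X)$ is constant in $w$ and cancels between the numerator and the denominator of the posterior, leaving $p(Y \mid X, w)$ as the effective likelihood.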
0 votes
0 answers
45 views

Find the variance of a random vector

I have a question about random vectors. A random vector $(X, Y)$ has a continuous distribution with density function $$f(x, y) = \begin{cases}c \cdot x &\text{for}& 0 \leq x \leq 2,\ \max(0, 1 - x) \leq y \leq ...
user1074987
2 votes
2 answers
173 views

Mathematical notation in a machine learning problem, majority rule

(I apologise that the title may be a bit confusing and I don't know if this is the right community to ask my question.) This is a mathematical notation problem in the field of machine learning. A ...
rrchtr • 23
1 vote
1 answer
57 views

Understanding a Simple Proof with Integrals

In this machine learning paper, the following lemma is stated (and proven in Appendix A, cf. page 11): Lemma A.1 For random variables $X$, $Y$ and function $f(x, y)$ under suitable regularity ...
Hermi • 702
0 votes
1 answer
71 views

Can someone clarify the use of the width of the ellipsoid regarding Mahalanobis distance?

My knowledge of Math is limited. I was looking up Mahalanobis distance out of curiosity, after seeing a reference. From Wikipedia: Mahalanobis distance - Intuitive explanation Putting this on a ...
0 votes
0 answers
58 views

Using Chi-squared Tests for Feature Selection with Big Data?

When dealing with non-linear data, since the reliability of chi-squared tests diminishes with the number of samples, is it reasonable to divide a large dataset into sample spaces of, say, 20 or 40, ...
midmath • 63
1 vote
0 answers
21 views

Determining Viability of Chi-squared for Feature Selection

How does one determine the likelihood that the chi-squared test is accurately selecting features during feature selection on categorical data? To summarize the rest of this post, the chi-squared test doesn't ...
midmath • 63
2 votes
1 answer
186 views

Closure of balls in Reproducing Kernel Hilbert Space (RKHS)

Let $X \subset \mathbb{R}^m$ be compact, and $k: X\times X \rightarrow \mathbb{R}$ be a universal kernel function, in the sense that the corresponding RKHS $\mathcal{H}_k$ is dense in $C(X)$ under the ...
masala • 23
0 votes
1 answer
124 views

How to combine various measures into a single measure?

So I'm trying to understand the intuition behind the accepted answer here, which is used to combine several scores into a single score. Namely, this part: ...
aqibjr1 • 237
1 vote
0 answers
70 views

Preventing rare extreme values in linear regression prediction

I am trying to train a model with a lot of input variables using linear regression. For technical reasons, my training data is obtained from a simulation that closely but not perfectly mirrors the ...
poisonDartFrog
1 vote
0 answers
36 views

Area Under Precision-Recall and Area Under ROC curve for different numbers of observations

I am doing research comparing some algorithms for binary classification. It is worth mentioning that the data set is highly imbalanced, i.e., the minority class is only 0.2%. Notation: Area Under ...
Gaussen • 81
4 votes
1 answer
183 views

Can Machine Learning models be considered as "Approximate Dynamic Programming"?

In the context of certain statistical/machine learning models, such as models that are trying to estimate "optimal policies" (e.g. reinforcement learning) - can we consider these models as "...
stats_noob • 3,268
0 votes
0 answers
48 views

Complexity of Lebesgue measurable spaces

Consider a discrete finite set $\Omega = X \times Y \subset \mathbb{R}^{m\times n}$ for finite $m,n$. Let $(\Omega,\Sigma,\mu)$ be the measure space. ($\Sigma$ is the power set and $\mu$ is $\sigma$-finite ...
rookie • 1,728
0 votes
0 answers
21 views

"Manipulating" Normal Distributions

I am reading the following book https://algorithmsbook.com/optimization/files/optimization.pdf on page 281: I am trying to understand how to manipulate the matrix terms to verify the following 2 ...
stats_noob • 3,268
2 votes
2 answers
1k views

Relationship Between Bayesian Optimization and Gaussian Process

In Bayesian Optimization, the function (i.e. objective function) that we are trying to optimize is modelled using some surrogate function - this surrogate function usually turns out to be a Gaussian ...
stats_noob • 3,268
0 votes
1 answer
151 views

Box-Muller Transformation: Polar Coordinates Interpretation

I am aware that the Box-Muller transform leverages polar coordinates to arrive at the final transformations by plotting two uniform random variables, $(u, v)$ in the Cartesian plane. I have not seen ...
TipsyMath
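For readers who want to see the polar picture concretely, here is a minimal sketch of the Box-Muller transform in Python/NumPy (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = 1.0 - rng.uniform(size=n)  # shift to (0, 1] so log(u) is finite
v = rng.uniform(size=n)

# Polar reading: -2 ln(u) is a squared radius (exponentially distributed)
# and 2*pi*v is a uniform angle; converting the polar pair (r, theta)
# back to Cartesian coordinates yields two independent N(0,1) variables.
r = np.sqrt(-2.0 * np.log(u))
theta = 2.0 * np.pi * v
z0, z1 = r * np.cos(theta), r * np.sin(theta)
print(z0.mean(), z0.std())  # approximately 0 and 1
```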
3 votes
1 answer
182 views

Convergence of $M$-estimators when the argmin is not unique

Let $(X_i)_{1\le i\le n}$ be i.i.d. random variables taking values in a compact set $\mathcal X\subseteq \mathbb R^d$, and let $\mathcal P_n = \mathcal P_n(\cdot\mid X_1,\ldots,X_n)$ and $\mathcal P$ ...
Stratos supports the strike
1 vote
0 answers
43 views

Reducing variance in linear regression

In The Elements of Statistical Learning the author states that by shrinking the coefficients of a linear regression you raise the bias while lowering the variance and thus, sometimes, ...
Guilherme Takata
1 vote
1 answer
73 views

Parameter estimation in a linear model: why does the standard deviation of a parameter increase as the $X$ matrix gets wider?

Intro: Let $Y = X\beta + \epsilon$, where $X$ is randomly generated data from a normal distribution arranged into an $n \times m$ matrix and $\epsilon$ is a vector of normal random errors. Say that the first 5 ...
Brzoskwinia
2 votes
1 answer
209 views

Are Machine Learning Optimization Problems ever Categorized as "P" or "NP"?

In the context of Computer Science and Optimization, I have heard that different problems can be classified using the "P vs NP" framework. Essentially, there is a hierarchy of problems based ...
stats_noob • 3,268
0 votes
1 answer
99 views

Deriving the Bayes Optimal Classifier (Mitchell, Machine Learning)

I am trying to recreate the Bayes Optimal Classifier result given in the Machine Learning textbook by Mitchell. Below, I've added the desired result from the text and my work. I think I've taken the right ...
takeaseat123
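For reference, the result being reconstructed (Mitchell's Bayes optimal classifier, Chapter 6) is the posterior-weighted vote over hypotheses:

$$v_{OB} = \operatorname*{arg\,max}_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D).$$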
0 votes
1 answer
81 views

Mean squared error minimization

I'm studying machine learning right now and I have found the following exercise: We define the mean squared error of a number $x \in \mathbb{R}$, where $a_{1}, \ldots, a_{n} \in \mathbb{R}$: $$f(x)= \frac{1}{...
Herrpeter • 1,324
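Assuming the truncated definition is the usual $f(x) = \frac{1}{n}\sum_{i=1}^{n}(x - a_{i})^{2}$, the minimizer follows from a single derivative:

$$f'(x) = \frac{2}{n}\sum_{i=1}^{n}(x - a_{i}) = 0 \iff x = \frac{1}{n}\sum_{i=1}^{n} a_{i},$$

i.e., the mean squared error is minimized by the sample mean of $a_{1}, \ldots, a_{n}$ (and $f''(x) = 2 > 0$ confirms it is a minimum).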
0 votes
0 answers
130 views

Textbook recommendation on rigorous machine learning results

I am looking for textbook(s) in machine learning theory that satisfy the following: The text should be graduate level. It assumes all undergraduate-level mathematics and early-graduate-level ...
温泽海 • 2,497
1 vote
0 answers
35 views

How to tell geometrically/graphically the statistical properties of a ring distribution?

I pulled this distribution of a 2D random variable from page 5: https://arxiv.org/pdf/1606.05908.pdf I want to know: How can we infer the covariance matrix from the plot? How can we infer the ...
user3180 • 729
1 vote
0 answers
23 views

On the bounds of estimated conditional correlations and a follow-up question on the inferred properties of underlying structural parameters.

Framework: It is assumed that the data is Gaussian and follows the structural equation model/additive noise model $$ Y = \sum_{j=1}^{p} X_j \theta_j + \epsilon$$ $$ \|\theta\|_0 = s \ll n $$...
Jorge de la Cal
1 vote
1 answer
113 views

Can't we just use PCA to solve the problem of linear regression?

From my intuitive understanding so far: if I have, let's say, a set of 2D points, then performing PCA will give me the direction that drastically reduces the variance along one direction, right? But ...
Abhishek Mittal
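A quick numerical contrast between the two objectives (vertical versus perpendicular errors); a sketch on synthetic data, with all names illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)  # noisy linear relation

# OLS slope: minimizes vertical errors sum((y - b*x)^2).
b_ols = (x @ y) / (x @ x)

# First principal component: maximizes variance, equivalently minimizes
# perpendicular distances. A different objective gives a different slope.
Z = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
b_pca = Vt[0, 1] / Vt[0, 0]

print(b_ols, b_pca)  # generally not equal
```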
1 vote
0 answers
35 views

Why are latent spaces able to learn representations - autoencoder?

As the title states, why are latent spaces even able to intelligently learn representations? There's no guarantee that we learn the most important features since it's all done automatically in ...
user2793618
1 vote
0 answers
51 views

An efficient stopping rule to determine the sign of the mean of an i.i.d. sequence of random variables.

Does there exist a family of measurable functions $(f_t^\delta)_{t \in \mathbb{N}, \delta \in (0,1)}$ and constants $C,c>0$ such that, for each $t \in \mathbb{N}$ and $\delta \in (0,1)$, we have that $...
Bob • 5,783
0 votes
1 answer
28 views

Problem with conditional expected value

In the book The Elements of Statistical Learning, the author says that the Expected Prediction Error for an arbitrary test point $x_0$ is: $$EPE(x_0) = E_{y_0 | x_0}E_\mathcal{T}(y_0 -\hat{y}_0)^2$$ where ...
Federico Mondaini
2 votes
3 answers
397 views

Implementing multiclass logistic regression from scratch

This is a sequel to a previous question about implementing binary logistic regression from scratch. Background knowledge: To train a logistic regression model for a classification problem with $K$ ...
littleO • 52.5k
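For orientation, a minimal from-scratch sketch of softmax (multiclass logistic) regression trained by gradient descent on the cross-entropy loss; function names and the synthetic data are illustrative, not taken from the question:

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax with the usual max-subtraction for numerical stability.
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_multiclass_logreg(X, y, K, lr=0.1, iters=2000):
    """Gradient descent on the mean cross-entropy; y holds labels 0..K-1."""
    n, d = X.shape
    W = np.zeros((d, K))
    Y = np.eye(K)[y]                   # one-hot targets, shape (n, K)
    for _ in range(iters):
        P = softmax(X @ W)             # predicted class probabilities
        W -= lr * X.T @ (P - Y) / n    # gradient of the mean cross-entropy
    return W

# Usage on synthetic data; the bias is handled via a ones column.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(150), rng.normal(size=(150, 2))])
W_true = rng.normal(size=(3, 3))
y = (X @ W_true).argmax(axis=1)
W = fit_multiclass_logreg(X, y, K=3)
print((softmax(X @ W).argmax(axis=1) == y).mean())  # high training accuracy
```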
1 vote
1 answer
356 views

How does the bias weight $w_0$ get computed during ridge regression?

I am given a full-rank feature matrix $\mathbf{X}$ for which I am supposed to provide a closed-form solution for the weights $\hat{\mathbf{w}}_{\text{ridge}}$ of a ridge-regression optimization problem. The ...
Nero • 73
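A common convention, and one plausible reading of the question, is to leave the bias unpenalized; a minimal closed-form sketch (the helper name and data are hypothetical):

```python
import numpy as np

def ridge_with_bias(X, y, lam):
    """Closed-form ridge where the bias column is not penalized.

    Prepends a ones column to X and zeroes out the penalty on w_0,
    solving (A^T A + lam * D) w = A^T y with D = diag(0, 1, ..., 1).
    """
    n, d = X.shape
    A = np.column_stack([np.ones(n), X])  # bias column first
    D = np.eye(d + 1)
    D[0, 0] = 0.0                         # do not shrink the bias w_0
    return np.linalg.solve(A.T @ A + lam * D, A.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=50)
print(ridge_with_bias(X, y, lam=1.0))  # first entry approximates the bias 3.0
```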
3 votes
1 answer
196 views

Implementing binary logistic regression from scratch

Background knowledge: To train a logistic regression model for a classification problem with two classes (called class $0$ and class $1$), we are given a training dataset consisting of feature vectors ...
littleO • 52.5k
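A minimal from-scratch sketch of one standard approach to this, Newton's method (IRLS) on the mean negative log-likelihood; the function names and data are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_newton(X, y, iters=25, ridge=1e-8):
    """Newton/IRLS for binary logistic regression.

    X has shape (n, d) with a ones column for the bias; y holds 0/1 labels.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ w)                  # P(y_i = 1 | x_i, w)
        g = X.T @ (p - y) / n               # gradient of the mean log-loss
        h = p * (1.0 - p)                   # per-sample Hessian weights
        H = X.T @ (X * h[:, None]) / n + ridge * np.eye(d)
        w -= np.linalg.solve(H, g)          # Newton step
    return w

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
w_true = np.array([-0.5, 2.0, -1.0])
y = (sigmoid(X @ w_true) > rng.uniform(size=200)).astype(float)
print(fit_logreg_newton(X, y))  # roughly recovers w_true
```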
6 votes
1 answer
251 views

Proof of $\frac{1}{n}\mathrm{E} \left[ \| \mathbf{X}\mathbf{\hat{w}} - \mathbf{X}\mathbf{w}^{*} \|^{2}_{2} \right] = \sigma^{2}\frac{d}{n}$

I am trying to find a proof for the MSE of a linear regression: \begin{gather} \frac{1}{n}\mathrm{E} \left[ \| \mathbf{X}\mathbf{\hat{w}} - \mathbf{X}\mathbf{w}^{*} \|^{2}_{2} \right] = \sigma^{2}\...
Nero • 73
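A standard route, assuming the usual fixed-design setup $\mathbf{y} = \mathbf{X}\mathbf{w}^{*} + \boldsymbol{\varepsilon}$ with $\mathrm{E}[\boldsymbol{\varepsilon}] = 0$, $\mathrm{Cov}(\boldsymbol{\varepsilon}) = \sigma^{2}\mathbf{I}$, $\mathbf{X} \in \mathbb{R}^{n \times d}$ of full rank, and $\hat{\mathbf{w}}$ the least-squares estimator, goes through the hat matrix: from $\hat{\mathbf{w}} = \mathbf{w}^{*} + (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\boldsymbol{\varepsilon}$,

$$\mathbf{X}\hat{\mathbf{w}} - \mathbf{X}\mathbf{w}^{*} = \mathbf{H}\boldsymbol{\varepsilon}, \qquad \mathbf{H} = \mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top},$$

and since $\mathbf{H}$ is symmetric and idempotent with $\operatorname{tr}(\mathbf{H}) = d$,

$$\mathrm{E}\left[\|\mathbf{H}\boldsymbol{\varepsilon}\|_{2}^{2}\right] = \mathrm{E}\left[\boldsymbol{\varepsilon}^{\top}\mathbf{H}\boldsymbol{\varepsilon}\right] = \sigma^{2}\operatorname{tr}(\mathbf{H}) = \sigma^{2} d.$$

Dividing by $n$ gives the claim.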
4 votes
1 answer
123 views

How to compute the dual of an optimization problem defined on a function space?

I am interested in one result in the first version of the paper titled "On the Margin Theory of Feedforward Neural Networks" by Colin Wei, Jason D. Lee, Qiang Liu and Tengyu Ma. In Equation ...
Stratos supports the strike
2 votes
1 answer
1k views

Difference between EPE and MSE

In the book ESL (The Elements of Statistical Learning), the author introduces the EPE (Expected Prediction Error) and the MSE (Mean Squared Error). I know that the EPE is defined as: $$EPE(f)=E(Y-f(X))^2$$ ...
Federico Mondaini
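On one common reading of ESL's definitions, the contrast is a population average versus a sample average:

$$EPE(f) = E(Y - f(X))^{2} = \int (y - f(x))^{2}\, p(x, y)\, dx\, dy,$$

an expectation over the joint distribution of $(X, Y)$, whereas the training MSE $\frac{1}{N}\sum_{i=1}^{N}(y_i - f(x_i))^{2}$ averages over a finite sample and is only an estimate of that population quantity.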
2 votes
1 answer
173 views

Growth function $\tau_{\mathcal{H}}(m)$ lower bound

I have been working on this problem for a long time and I would like some help. They ask me to find, for each $n$, a hypothesis class $\mathcal{H} \subset \{\pm 1\}^{\mathbb{N}}$ with $n$ ...
bravoralph
1 vote
1 answer
103 views

Generalization in Neural Networks: Can one Impose Conditions on the Data?

There is a well-developed theory on generalization bounds for deep neural networks, using VC dimensions and Rademacher Complexities. They work for any underlying "true" distribution $\...
Claudio Moneo
