
All Questions

0 votes
0 answers
36 views

Show bagging helps under squared-error loss

This question is about Section 8.7 (Bagging) of The Elements of Statistical Learning (ESL). Assume our training observations $\left(x_i, y_i\right), i=1, \ldots, N$ are independently drawn from a ...
maskeran
  • 573
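
The core of the ESL Section 8.7 argument is a variance decomposition; a sketch, assuming (as in ESL) that the training sample is independent of the test pair $(x, Y)$ and writing $f_{\mathrm{ag}}(x) = \mathbb{E}[\hat{f}(x)]$ for the aggregated predictor:
$$\mathbb{E}\big[(Y - \hat{f}(x))^2\big] = \mathbb{E}\big[(Y - f_{\mathrm{ag}}(x))^2\big] + \mathbb{E}\big[(\hat{f}(x) - f_{\mathrm{ag}}(x))^2\big] \;\ge\; \mathbb{E}\big[(Y - f_{\mathrm{ag}}(x))^2\big],$$
where the cross term vanishes because $\mathbb{E}[\hat{f}(x) \mid x, Y] = f_{\mathrm{ag}}(x)$.
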
-2 votes
1 answer
29 views

When will my LightGBM model find cut points in random variables that reduce entropy more than a variable naturally correlated with the target? [closed]

In machine learning we sometimes build models using hundreds of variables/features without knowing (at least at first) whether they have any relation with the target. Usually we find that some of ...
Alejandro Gómez
3 votes
1 answer
236 views

Empirical distribution learns w.r.t. total variation distance

I am trying to prove or disprove that the empirical distribution can learn any continuous distribution w.r.t. the total variation distance. The context is that of statistical learning. I am quite ...
pppp0l
  • 51
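
One observation that may frame this teaser: the empirical measure $\hat{P}_n$ is purely atomic, so against any atomless (continuous) $P$ its total variation distance is maximal,
$$d_{TV}(\hat{P}_n, P) = \sup_{A} \big|\hat{P}_n(A) - P(A)\big| = 1 \quad \text{a.s.},$$
witnessed by $A = \{X_1, \ldots, X_n\}$, for which $\hat{P}_n(A) = 1$ while $P(A) = 0$.
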
0 votes
0 answers
24 views

Optimizing a function defined by an integral

Consider two functions $q: \mathbb{R}^d \rightarrow \mathbb{R}^{+}$ and $s: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}^{+}$, $d \in \mathbb{N}$, where both are assumed to be continuous and ...
blockchain187
0 votes
0 answers
114 views

VC dimension of indicator functions is equal to the pseudo-dimension

I am reading "Foundations of Machine Learning" by Mehryar Mohri (https://cs.nyu.edu/~mohri/mlbook/). The proof of Theorem 11.8 makes the following statement, which I cannot ...
Harry
  • 699
0 votes
0 answers
27 views

Expectation of linear form multiplied by quadratic form for MVN distribution

Assume that $\mathbf{x}$ is a random vector following a multivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Let $\mathbf{A}$ be a matrix of constants. I'm ...
max
  • 194
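
For reference, one standard identity of this shape, for $\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and constant $\mathbf{A}$, is
$$\mathbb{E}\big[\mathbf{x}\,\mathbf{x}^\top \mathbf{A}\mathbf{x}\big] = \boldsymbol{\Sigma}(\mathbf{A} + \mathbf{A}^\top)\boldsymbol{\mu} + \big(\operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma}) + \boldsymbol{\mu}^\top \mathbf{A} \boldsymbol{\mu}\big)\boldsymbol{\mu},$$
so a linear form times a quadratic form, $\mathbb{E}[(\mathbf{b}^\top \mathbf{x})(\mathbf{x}^\top \mathbf{A}\mathbf{x})]$, is obtained by left-multiplying the right-hand side by $\mathbf{b}^\top$. (It can be checked against the scalar case $\mathbb{E}[a X^3] = a(\mu^3 + 3\mu\sigma^2)$.)
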
0 votes
0 answers
30 views

How do I obtain the primal and dual for the regression estimator $\min _\beta[\|\beta\|^2+\sum_{i=1}^n \xi_i^2]$ s.t. $\xi_i=y_i-h(x_i)^\top \beta$?

I am working on a statistical learning exercise that requires some knowledge of convex optimization which I am unfortunately lacking. Consider the linear regression model $$y_i=h(x_i)^\top\beta+\...
Leon
  • 127
1 vote
1 answer
126 views

Min-entropy for the uniform distribution on $[n]$

The min-entropy of a distribution $\nu$ on $[n]$ is given as: $$H_{\infty}(\nu)=\min_{i} \log\left(\frac{1}{\nu(i)}\right)$$ Now we will prove that for every distribution $\nu$ on $[n]$ and for $U$ being ...
Lifeni
  • 558
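
The claim in this teaser admits a two-line resolution: since $H_{\infty}(\nu) = \log\big(1/\max_i \nu(i)\big)$ and, by pigeonhole, $\max_i \nu(i) \ge 1/n$,
$$H_{\infty}(\nu) \le \log n = H_{\infty}(U),$$
so the uniform distribution maximizes min-entropy on $[n]$.
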
0 votes
0 answers
20 views

Reason behind objective function in Linear Discriminant Analysis

I don't really understand the objective function to be optimized in Linear Discriminant Analysis (LDA). My question is centered around the same concepts mentioned in this other one. The analysis ...
Alberto
  • 503
0 votes
1 answer
62 views

Define 'accuracy' for numerical data?

Normally, people use 'accuracy' to describe output quality (from a model or methodology; https://en.wikipedia.org/wiki/Precision_and_recall) for categorical data. However, I am wondering whether the ...
Edamame
  • 113
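
For numeric outputs, the usual counterparts of classification accuracy are MAE, RMSE, and $R^2$; a minimal sketch using scikit-learn, with hypothetical placeholder arrays:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical placeholder values; substitute real model outputs.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)           # mean absolute deviation
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
r2 = r2_score(y_true, y_pred)                       # fraction of variance explained
print(mae, rmse, r2)
```
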
1 vote
0 answers
61 views

Error propagation and Gradient Descent

I was looking at error propagation (or propagation of uncertainty on Wikipedia: https://en.wikipedia.org/wiki/Propagation_of_uncertainty). My primary concern is getting an estimate of the error of ...
ponir
  • 204
1 vote
0 answers
35 views

Unknown normalization for probability distribution in EM algorithm

I am exploring the use of the Expectation-Maximization (EM) algorithm in machine learning where the exact distribution of the data is unknown, as the observed sample pairs do not form the complete data. ...
amb
  • 11
1 vote
1 answer
80 views

Pseudo-determinant of a rank-deficient matrix times a constant

I have a question regarding the pseudo-determinant of a rank-deficient matrix times a constant. Let's say matrix $K$ has dimension $n \times n$ but $\text{rank}(K)<n$. Does the following rule ...
Seb L
  • 25
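
The rule in question, $\operatorname{pdet}(cK) = c^{\operatorname{rank}(K)}\operatorname{pdet}(K)$, holds because scaling by $c$ multiplies each nonzero eigenvalue by $c$. A numerical sanity check for a symmetric PSD $K$ (a sketch, not a proof):

```python
import numpy as np

def pdet(M, tol=1e-10):
    """Pseudo-determinant: product of the eigenvalues with |eig| > tol."""
    eig = np.linalg.eigvalsh(M)  # M is symmetric here
    return np.prod(eig[np.abs(eig) > tol])

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))
K = B @ B.T                      # symmetric PSD, rank 3, size 5 x 5
c = 2.5
r = np.linalg.matrix_rank(K)

# The two printed values should agree up to floating-point error.
print(pdet(c * K), c**r * pdet(K))
```
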
-1 votes
1 answer
39 views

Verify whether it's a Bregman loss function, maybe by solving a differential equation

I have a function $f(x, y;\mu) = \frac{\mu}{x}(x-y)^2$, where $\mu > 0$ is a parameter. I want to see whether it's a Bregman loss function. A Bregman loss function is defined as: $D_\phi(x,y) = \phi(...
Jimmy Gao
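
For reference, the standard form of the definition the excerpt presumably continues with, for a strictly convex, differentiable generator $\phi$, is
$$D_\phi(x, y) = \phi(x) - \phi(y) - \phi'(y)\,(x - y),$$
so the task is to decide whether some admissible $\phi$ yields $D_\phi(x,y) = \frac{\mu}{x}(x-y)^2$.
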
0 votes
1 answer
120 views

Empirical Rademacher complexity bound

Consider the hypothesis class $$\mathcal{H} = \{\mathcal{X}\ni x\to r^2-\|\Phi(x)-\mathbf{c}\|^2:\|\mathbf{c}\|\leq \Lambda, 0<r\leq R\}$$ where $\Phi: \mathcal{X} \to \mathbb{H}$ is a feature map ...
Giorgos Giapitzakis
2 votes
0 answers
116 views

Rademacher complexity of binary classification

I am trying to show the inequality below. Please note that in this case I am considering the labels to be $Y_i \in \{0, 1\}$; I state this since I have seen results, but for labels that are in $\{0,1\}$...
vendrick17
0 votes
1 answer
157 views

Fitting a non-linear curve in the symmetric positive definite matrix manifold

I have a variable $x \in \mathbb{R}$. For some values of $x$, $\{x_1, ..., x_n\}$, I have measured a covariance matrix of the variable $y \in \mathbb{R}^n$, conditional on these values of the variable ...
dherrera
  • 160
2 votes
0 answers
68 views

Bound on inverse covariance from covariance in regularized covariance estimation problem

In this paper by Bickel and Levina, I am confused about result (A15), which claims that since $$ (A14) \qquad \| \text{Var}(\mathbf{X}) - \widehat{\text{Var}}(\mathbf{X})\|_{\max} = O_P(n^{-1/2} \log^{...
WeakLearner
  • 6,106
4 votes
2 answers
331 views

What's the significance of Mean Squared Error? Why not something else?

Background: Master's in CS/Math. I'm brushing up on statistics, and I see Mean Squared Error (MSE) everywhere. As a student I took it for granted, but now when I try to find the reasons why it's so ...
Teddy K
  • 49
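
One classical reason, sketched: under i.i.d. Gaussian noise, minimizing MSE is exactly maximum likelihood. If $y_i = f(x_i) + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$, the log-likelihood is
$$\log L(f) = -\frac{1}{2\sigma^2}\sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \text{const},$$
so maximizing it over $f$ is the same as minimizing the squared error.
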
1 vote
1 answer
124 views

Bound on the difference between the true risk of an empirical risk minimizer $\hat{h}$ and an oracle $\bar{h}$.

In Lecture 3 on Concentration Inequalities in Philippe Rigollet's Mathematics of Machine Learning course on MIT OCW, there is the following theorem, which is the Theorem on the first page of this ...
person
  • 107
1 vote
0 answers
112 views

Are these two inequalities equivalent?

Let's assume that $I_j \in \mathcal{J}$, where $\mathcal{J}$ is a set of images that are correctly classified and $p(I)$ is the output probability distribution of the underlying model. Out of $\...
Shadow_of_the_darks
0 votes
0 answers
72 views

Why is the KL divergence between a contracted $\mathrm{Bin}(N,\theta)$ and a $\mathrm{Bernoulli}(\lambda)$ a convex function in $\lambda$?

I need to prove that the KL divergence $D(\bar{\mu}(\theta)||Y(\lambda))$ between the following variables is convex w.r.t. $\lambda$. The variables are defined as: $ \bar{\mu}(\theta) =\frac{1}{N}\...
IdanC1s2
1 vote
0 answers
56 views

$UCB$-$\alpha$ policy for the multi-armed bandit: conditions on UCB indices for picking a suboptimal arm

While reading the optimality proof for the $UCB$-$\alpha$ policy for the multi-armed bandit problem, I came across a claim whose logic I couldn't follow. Notation: $I_{i}(t) = \hat{\mu}_{i}(...
IdanC1s2
1 vote
1 answer
341 views

Relationship between eigenspectrum of gram matrix and kernel operator

A kernel operator is a function $k: \mathbb{R}^2 \to \mathbb{R}$, for instance $k(x, x') = \exp(|x-x'|/2)$ or $k(x, x') = x \cdot x'$. There are many kernels common in statistics and machine learning, ...
Tanishq Kumar
1 vote
2 answers
355 views

Application of Hoeffding's inequality to the Stochastic Multi-Armed Bandit Problem

I'm following this note to learn about deriving an upper bound for the UCB algorithm on the Stochastic Multi-Armed Bandit Problem. In particular, the proof of Lemma 15.6 there suggests that we can ...
NXWang
  • 167
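
For context, the form of Hoeffding's inequality such UCB analyses typically invoke, for i.i.d. rewards bounded in $[0,1]$ with mean $\mu$ and empirical mean $\hat{\mu}_n$ after $n$ pulls, is
$$\Pr\big(|\hat{\mu}_n - \mu| \ge \varepsilon\big) \le 2\exp(-2n\varepsilon^2),$$
which, inverted at confidence level $\delta$, gives the familiar $\sqrt{\log(2/\delta)/(2n)}$ confidence radius behind UCB indices.
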
0 votes
0 answers
101 views

(Reference Request, I guess?) No-Free-Lunch Theorem for Unsupervised Learning

I am familiar with David H. Wolpert's No-Free-Lunch Theorem for Supervised Learning. Now I am wondering: is there an analogous theorem for unsupervised learning, e.g. for clustering? And what ...
Joseph Expo
2 votes
1 answer
323 views

Role of variance in consistent estimators

By definition, a consistent estimator (or, more precisely, a weakly consistent estimator) is one that converges to the true value as the number of data points increases. So naturally, bias ...
HalfTea
  • 150
0 votes
0 answers
51 views

Taylor Series Approximation of the Variance Function

The problem setting is offline learning from bandit feedback data. Given a context vector $x$, a policy chooses an action $a$, with policy defined as $h_w(y \vert x)$, where $w$ is the 'learnable' ...
Shashank
0 votes
0 answers
522 views

Understanding the gradient contribution of each point in a linear regression line

Given a data set $D=\{(-1,0),(1,-2),(2,-1),(3,1)\}$ consisting of $(x, y)$ pairs, consider a linear regression model of the form $y = \theta^T x + \theta_0$, where $\theta = 0.5$ and $\...
Ben Harris
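
A minimal sketch of the per-point gradient computation for the squared loss $\sum_i (\theta x_i + \theta_0 - y_i)^2$; note the excerpt truncates before giving $\theta_0$, so the value below is a hypothetical placeholder:

```python
import numpy as np

# The (x, y) pairs from the question.
X = np.array([-1.0, 1.0, 2.0, 3.0])
Y = np.array([0.0, -2.0, -1.0, 1.0])

theta = 0.5
theta0 = 0.0  # hypothetical placeholder: the excerpt cuts off before stating theta_0

residual = theta * X + theta0 - Y        # per-point prediction error
grad_theta_per_point = 2 * residual * X  # each point's contribution to dL/d(theta)
grad_theta0_per_point = 2 * residual     # each point's contribution to dL/d(theta_0)
print(grad_theta_per_point, grad_theta0_per_point)
```
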
0 votes
0 answers
28 views

Statistical test for comparing number of clusters in data

I am performing $K$-means clustering on a dataset consisting of $n$ observations and $d$ variables, and I'm trying to determine the optimal number of clusters. Is there a test that can determine the ...
RyRy the Fly Guy
1 vote
1 answer
97 views

Why are errors random in linear regression?

In data science, when we study linear regression from a mathematical point of view, we often make the following assumption: we have points $y_i = x_i\beta + \epsilon_i$, with $\epsilon_i$ being a random ...
ConfusionMatrix
1 vote
1 answer
81 views

Can SVM be a special case of PCA?

Let $X$ and $Y$ be two linearly separable finite subsets of a $K$-dimensional real vector space $V$ with orthonormal basis $A = \{a_1,\ldots, a_K\}$. The covariance matrix $\Sigma_A$ of the set $X \cup Y$...
Alberto Carraro
2 votes
0 answers
35 views

Convergence of the Expectation-Maximization algorithm

Studying the Expectation-Maximization algorithm, I noticed that I couldn't find any proof that the parameters actually converge, nor that the limit is a local extremum of the likelihood (or even just ...
user25640
  • 1,594
0 votes
1 answer
73 views

What does it mean for a function to be differentiable/continuous when the input is a function?

I have the loss function $$L(h) = \sum_{i=1}^{n}(h(x_i) - y_i)^2,$$ where $h$ is a function that outputs the predicted value when fed a vector $x_i$. The domain of $h$ is then $\mathbb{R}^d$, and the ...
beginner
  • 1,774
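
A worked equation that usually resolves this: the Gâteaux (directional) derivative of $L$ at $h$ in a direction $g$ is
$$\frac{d}{dt}\,L(h + t g)\Big|_{t=0} = 2\sum_{i=1}^{n}\big(h(x_i) - y_i\big)\,g(x_i),$$
which also shows $L$ depends on $h$ only through the finite vector $(h(x_1),\dots,h(x_n)) \in \mathbb{R}^n$.
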
1 vote
0 answers
17 views

Error bounds for a semi-online learning problem

I want to solve the following problem: Consider the noise-free classification setup. Let $\mathcal{F}$ denote an infinite class of (binary) classifiers with finite VC dimension $d$. Let $f^* \in \...
y4nik
  • 193
0 votes
0 answers
24 views

Difference between sampling from and calculating a probability distribution function

I have a basic question about sampling from a probability distribution. For instance, in importance sampling it's hard to sample from $p$ directly, so we sample from a proposal distribution $q$ and then ...
tworiver
2 votes
1 answer
366 views

Example of tightness of the Sauer-Shelah lemma

I was given the task of finding an example of a family of events (or hypotheses, concepts, etc.) for which the Sauer-Shelah lemma is tight. The lemma states: Assume that the Vapnik-Chervonenkis ...
y4nik
  • 193
0 votes
1 answer
120 views

Is the minus absolute value of the difference a kernel?

Precisely, is $k(x_i, x_j) = -\|x_i-x_j\|$, $x_i, x_j \in \mathbb{R}$, a valid kernel? I know that the plain absolute value $\|x_i - x_j\|$ is not a valid kernel since it is not positive semi-...
Rajmadan Lakshmanan
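
A numerical check can disprove, though never prove, positive semidefiniteness: build the Gram matrix on sampled points and inspect its eigenvalues. A sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(50)

# Gram matrix of the candidate kernel k(x_i, x_j) = -|x_i - x_j|.
K = -np.abs(x[:, None] - x[None, :])

# K has zero trace and is nonzero, so it must have a negative eigenvalue;
# a negative minimum certifies K is not positive semidefinite on this sample.
print(np.linalg.eigvalsh(K).min())
```
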
1 vote
1 answer
196 views

Simple statement in the elementary proof of the Johnson-Lindenstrauss lemma (random projections)

In the simple proof of the Johnson-Lindenstrauss lemma by Sanjoy Dasgupta and Anupam Gupta that can be found here, they state the following (p. 62): Repeating this projection $O(n)$ times can ...
jakobhellander
1 vote
0 answers
65 views

How is it possible to have a maximum of a quantity, but then no argument that attains this maximum?

I am having difficulty intuitively understanding a definition in an exposition of the proof of the Vapnik-Chervonenkis inequality in the notes of Robert Nowak (2009). The proof strategy is taken ...
microhaus
  • 934
2 votes
0 answers
112 views

Intuition for Local vs. Global notions of Metric Entropy in Statistics

I am looking for intuition regarding the following statement on page 4 of this paper by Gassiat and Van Handel: However, in finite dimensional settings, global entropy bounds are known to yield sub-...
WeakLearner
  • 6,106
2 votes
0 answers
129 views

Conditional Second Moments of Multivariate Normal Variable on Binary Vectors

Suppose we observe a binary table $Y \in \mathbb R^{N \times G}$, corresponding to $N$ observations of $G$-dimensional binary vectors $Y_1, \ldots, Y_N$. We imagine each vector $Y_i$ is generated from ...
md19jli
  • 15
2 votes
0 answers
40 views

Variational Autoencoders and the inequality $\mathbb{E}_{z\sim q(z|x)}\log p_{\text{model}}(x|z) - D_{KL}(q(z|x)\,\|\,p_{\text{model}}(z)) \leq \log p_{\text{model}}(x)$

I am reading Section 20.10.3 of the book Deep Learning, on variational autoencoders, where the authors write: To generate a sample from the model, the VAE first draws a sample $z$ from the code ...
IntegrateThis
0 votes
1 answer
46 views

What is the meaning of "with probability at least 1-x over the drawing of the m training patterns"?

I'm reading a book about support vector machines and I encountered this, so what is the meaning of "with probability at least $1-\delta$ over the drawing of the $m$ training patterns"?
吴yuer
  • 321
0 votes
0 answers
60 views

Proving this upper bound involving VC dimension

Let $S_n = \{ (x_i, y_i)\}^n_{i=1}$ be a data set. Let $\mathcal H$ and $\mathcal H' $ be hypothesis classes, such that $\mathcal H' = \{ h \in \mathcal H: \hat {er}_{S_n}(h) \leq \beta\}$, where $\...
Keio203
  • 561
0 votes
0 answers
56 views

Deriving the solution for ridge polynomial regression

We have the following loss function: $$\operatorname{Err}(x)=\frac{1}{n}\sum_{i=1}^n(h_w(x_i)-y_i)^2+ \lambda\|w\|^2$$ I need to derive the solution for a polynomial of degree $0$ ($h_w(x)=w_0$) and ...
dan
  • 1
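
The degree-$0$ case is a one-line calculus exercise; a sketch: setting the derivative of $\frac{1}{n}\sum_{i=1}^n (w_0 - y_i)^2 + \lambda w_0^2$ with respect to $w_0$ to zero gives
$$\frac{2}{n}\sum_{i=1}^n (w_0 - y_i) + 2\lambda w_0 = 0 \quad\Longrightarrow\quad w_0 = \frac{\bar{y}}{1+\lambda}, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i.$$
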
1 vote
0 answers
40 views

For a Bayes classifier, can we prove that adding noise to data does not increase its accuracy?

In terms of a Bayes classifier, it's intuitive that adding noise to data CANNOT increase the accuracy. Take a binary classification problem as an example: the data distribution is $(x,y)\...
Autumnii
0 votes
0 answers
58 views

How to interpret $P(z|x, y; k)$ with examples?

Following this answer and this, I am trying to understand what is meant by the $P(z|x, y; k)$ notation. How should this be interpreted, with examples, in terms of machine learning? As I understand it, $P(z|x, y)$ ...
B200011011
0 votes
0 answers
55 views

How many data points must be in a subset of a dataset before the subset is representative of the parent dataset?

It makes sense to me that a randomly sampled subset of a dataset should still be theoretically representative of its parent. When you take data and split it into training and test sets, you assume that ...
Sanger Steel
2 votes
0 answers
24 views

How to give a high-probability uniform estimation of a potential having access to noisy pointwise estimates of the associated vector field?

As in the title, our goal is to estimate uniformly and with high probability (and up to a constant) a potential, having access to noisy pointwise estimates of the associated vector field (i.e., the ...
Bob
  • 5,783
