
All Questions

0 votes
0 answers
36 views

Show bagging helps under squared-error loss

This question is about Section 8.7 (Bagging) of The Elements of Statistical Learning (ESL). Assume our training observations $\left(x_i, y_i\right), i=1, \ldots, N$ are independently drawn from a ...
maskeran
  • 573
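
The core of the ESL Section 8.7 argument is a variance decomposition; a sketch, assuming (as in ESL) that the training sample is independent of the test pair $(x, Y)$ and writing $f_{\mathrm{ag}}(x) = \mathbb{E}[\hat{f}(x)]$ for the aggregated predictor:
$$\mathbb{E}\big[(Y - \hat{f}(x))^2\big] = \mathbb{E}\big[(Y - f_{\mathrm{ag}}(x))^2\big] + \mathbb{E}\big[(\hat{f}(x) - f_{\mathrm{ag}}(x))^2\big] \;\ge\; \mathbb{E}\big[(Y - f_{\mathrm{ag}}(x))^2\big],$$
where the cross term vanishes because $\mathbb{E}[\hat{f}(x) \mid x, Y] = f_{\mathrm{ag}}(x)$.
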
-2 votes
1 answer
29 views

When will my LightGBM model find cut points in random variables that reduce entropy more than a variable naturally correlated with the target? [closed]

In machine learning we sometimes build models using hundreds of variables/features without knowing (at least at first) whether they have any relation with the target. Usually we find that some of ...
Alejandro Gómez
3 votes
1 answer
236 views

Empirical distribution learns w.r.t. total variation distance

I am trying to prove or disprove that the empirical distribution can learn any continuous distribution w.r.t. the total variation distance. The context is that of statistical learning. I am quite ...
pppp0l
  • 51
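
One observation that may frame this teaser: the empirical measure $\hat{P}_n$ is purely atomic, so against any atomless (continuous) $P$ its total variation distance is maximal,
$$d_{TV}(\hat{P}_n, P) = \sup_{A} \big|\hat{P}_n(A) - P(A)\big| = 1 \quad \text{a.s.},$$
witnessed by $A = \{X_1, \ldots, X_n\}$, for which $\hat{P}_n(A) = 1$ while $P(A) = 0$.
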
0 votes
0 answers
24 views

Optimizing a function defined by an integral

Consider two functions $q: \mathbb{R}^d \rightarrow \mathbb{R}^{+}$ and $s: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}^{+}$, $d \in \mathbb{N}$, where both are assumed to be continuous and ...
blockchain187
0 votes
0 answers
114 views

VC dimension of indicator functions is equal to the pseudo-dimension

I am reading "Foundations of Machine Learning" by Mehryar Mohri (https://cs.nyu.edu/~mohri/mlbook/). The proof of Theorem 11.8 makes the following statement, which I cannot ...
Harry
  • 699
0 votes
0 answers
27 views

Expectation of linear form multiplied by quadratic form for MVN distribution

Assume that $\mathbf{x}$ is a random vector following a multivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Let $\mathbf{A}$ be a matrix of constants. I'm ...
max
  • 194
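
For reference, one standard identity of this shape, for $\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and constant $\mathbf{A}$, is
$$\mathbb{E}\big[\mathbf{x}\,\mathbf{x}^\top \mathbf{A}\mathbf{x}\big] = \boldsymbol{\Sigma}(\mathbf{A} + \mathbf{A}^\top)\boldsymbol{\mu} + \big(\operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma}) + \boldsymbol{\mu}^\top \mathbf{A} \boldsymbol{\mu}\big)\boldsymbol{\mu},$$
so a linear form times a quadratic form, $\mathbb{E}[(\mathbf{b}^\top \mathbf{x})(\mathbf{x}^\top \mathbf{A}\mathbf{x})]$, is obtained by left-multiplying the right-hand side by $\mathbf{b}^\top$. (It can be checked against the scalar case $\mathbb{E}[a X^3] = a(\mu^3 + 3\mu\sigma^2)$.)
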
0 votes
0 answers
30 views

How do I obtain the primal and dual for the regression estimator $\min _\beta[\|\beta\|^2+\sum_{i=1}^n \xi_i^2]$ s.t. $\xi_i=y_i-h(x_i)^\top \beta$?

I am working on a statistical learning exercise that requires some knowledge of convex optimization which I am unfortunately lacking. Consider the linear regression model $$y_i=h(x_i)^\top\beta+\...
Leon
  • 127
1 vote
1 answer
126 views

Min-entropy for the uniform distribution on $[n]$

The min-entropy of a distribution $\nu$ on $[n]$ is given as: $$H_{\infty}(\nu)=\min_{i} \log\left(\frac{1}{\nu(i)}\right)$$ Now we will prove that for every distribution $\nu$ on $[n]$ and for $U$ being ...
Lifeni
  • 558
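
The claim in this teaser admits a two-line resolution: since $H_{\infty}(\nu) = \log\big(1/\max_i \nu(i)\big)$ and, by pigeonhole, $\max_i \nu(i) \ge 1/n$,
$$H_{\infty}(\nu) \le \log n = H_{\infty}(U),$$
so the uniform distribution maximizes min-entropy on $[n]$.
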
0 votes
0 answers
20 views

Reason behind objective function in Linear Discriminant Analysis

I don't really understand the objective function to be optimized in Linear Discriminant Analysis (LDA). My question is centered around the same concepts mentioned in this other one. The analysis ...
Alberto
  • 503
0 votes
1 answer
62 views

Define 'accuracy' for numerical data?

Normally, people use 'accuracy' to describe output quality (from a model or methodology; https://en.wikipedia.org/wiki/Precision_and_recall) for categorical data. However, I am wondering whether the ...
Edamame
  • 113
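
For numeric outputs, the usual counterparts of classification accuracy are MAE, RMSE, and $R^2$; a minimal sketch using scikit-learn, with hypothetical placeholder arrays:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical placeholder values; substitute real model outputs.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)           # mean absolute deviation
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
r2 = r2_score(y_true, y_pred)                       # fraction of variance explained
print(mae, rmse, r2)
```
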
1 vote
0 answers
61 views

Error propagation and Gradient Descent

I was looking at error propagation (or propagation of uncertainty on Wikipedia: https://en.wikipedia.org/wiki/Propagation_of_uncertainty). My primary concern is getting an estimate of the error of ...
ponir
  • 204
1 vote
0 answers
35 views

Unknown normalization for probability distribution in EM algorithm

I am exploring the use of the Expectation-Maximization (EM) algorithm in machine learning where the exact distribution of the data is unknown, as the observed sample pairs do not form the complete data. ...
amb
  • 11
1 vote
1 answer
80 views

Pseudo-determinant of a rank-deficient matrix times a constant

I have a question regarding the pseudo-determinant of a rank-deficient matrix times a constant. Let's say matrix $K$ has dimension $n \times n$ but $\text{rank}(K)<n$. Does the following rule ...
Seb L
  • 25
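
The rule in question, $\operatorname{pdet}(cK) = c^{\operatorname{rank}(K)}\operatorname{pdet}(K)$, holds because scaling by $c$ multiplies each nonzero eigenvalue by $c$. A numerical sanity check for a symmetric PSD $K$ (a sketch, not a proof):

```python
import numpy as np

def pdet(M, tol=1e-10):
    """Pseudo-determinant: product of the eigenvalues with |eig| > tol."""
    eig = np.linalg.eigvalsh(M)  # M is symmetric here
    return np.prod(eig[np.abs(eig) > tol])

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))
K = B @ B.T                      # symmetric PSD, rank 3, size 5 x 5
c = 2.5
r = np.linalg.matrix_rank(K)

# The two printed values should agree up to floating-point error.
print(pdet(c * K), c**r * pdet(K))
```
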
-1 votes
1 answer
39 views

Verify whether it's a Bregman loss function, maybe by solving a differential equation

I have a function $f(x, y;\mu) = \frac{\mu}{x}(x-y)^2$, where $\mu > 0$ is a parameter. I want to see whether it's a Bregman loss function. A Bregman loss function is defined as: $D_\phi(x,y) = \phi(...
Jimmy Gao
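
For reference, the standard form of the definition the excerpt presumably continues with, for a strictly convex, differentiable generator $\phi$, is
$$D_\phi(x, y) = \phi(x) - \phi(y) - \phi'(y)\,(x - y),$$
so the task is to decide whether some admissible $\phi$ yields $D_\phi(x,y) = \frac{\mu}{x}(x-y)^2$.
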
0 votes
1 answer
120 views

Empirical Rademacher complexity bound

Consider the hypothesis class $$\mathcal{H} = \{\mathcal{X}\ni x\to r^2-\|\Phi(x)-\mathbf{c}\|^2:\|\mathbf{c}\|\leq \Lambda, 0<r\leq R\}$$ where $\Phi: \mathcal{X} \to \mathbb{H}$ is a feature map ...
Giorgos Giapitzakis
2 votes
0 answers
116 views

Rademacher complexity of binary classification

I am trying to show the inequality below. Please note that in this case I am considering the labels to be $Y_i \in \{0, 1\}$; I state this since I have seen results, but for labels that are in $\{0,1\}$...
vendrick17
0 votes
1 answer
157 views

Fitting a non-linear curve in the symmetric positive definite matrix manifold

I have a variable $x \in \mathbb{R}$. For some values of $x$, $\{x_1, ..., x_n\}$, I have measured a covariance matrix of the variable $y \in \mathbb{R}^n$, conditional on these values of the variable ...
dherrera
  • 160
2 votes
0 answers
68 views

Bound on inverse covariance from covariance in regularized covariance estimation problem

In this paper by Bickel and Levina, I am confused about result (A15), which claims that since $$ (A14) \qquad \| \text{Var}(\mathbf{X}) - \widehat{\text{Var}}(\mathbf{X})\|_{\max} = O_P(n^{-1/2} \log^{...
WeakLearner
  • 6,106
4 votes
2 answers
331 views

What's the significance of Mean Squared Error? Why not something else?

Background: Master's in CS/Math. I'm brushing up on statistics, and I see Mean Squared Error (MSE) everywhere. As a student I took it for granted, but now when I try to find the reasons why it's so ...
Teddy K
  • 49
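
One classical reason, sketched: under i.i.d. Gaussian noise, minimizing MSE is exactly maximum likelihood. If $y_i = f(x_i) + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$, the log-likelihood is
$$\log L(f) = -\frac{1}{2\sigma^2}\sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \text{const},$$
so maximizing it over $f$ is the same as minimizing the squared error.
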
1 vote
1 answer
124 views

Bound on the difference between the true risk of an empirical risk minimizer $\hat{h}$ and an oracle $\bar{h}$.

In Lecture 3 on Concentration Inequalities in Philippe Rigollet's Mathematics of Machine Learning course on MIT OCW, there is the following theorem, which is the Theorem on the first page of this ...
person
  • 107
1 vote
0 answers
112 views

Are these two inequalities equivalent?

Let's assume that $I_j \in \mathcal{J}$, where $\mathcal{J}$ is a set of images that are correctly classified and $p(I)$ is the output probability distribution of the underlying model. Out of $\...
Shadow_of_the_darks
0 votes
0 answers
72 views

Why is the KL divergence between a contracted $\mathrm{Bin}(N,\theta)$ and a $\mathrm{Bernoulli}(\lambda)$ a convex function in $\lambda$?

I need to prove that the KL divergence $D(\bar{\mu}(\theta)||Y(\lambda))$ between the following variables is convex w.r.t. $\lambda$. The variables are defined as: $ \bar{\mu}(\theta) =\frac{1}{N}\...
IdanC1s2
1 vote
0 answers
56 views

$UCB$-$\alpha$ policy for the multi-armed bandit: conditions on UCB indices for picking a suboptimal arm

While reading the optimality proof for the $UCB$-$\alpha$ policy for the multi-armed bandit problem, I came across a claim whose logic I couldn't follow. Notation: $I_{i}(t) = \hat{\mu}_{i}(...
IdanC1s2
1 vote
1 answer
341 views

Relationship between eigenspectrum of gram matrix and kernel operator

A kernel operator is a function $k: \mathbb{R}^2 \to \mathbb{R}$, for instance $k(x, x') = \exp(|x-x'|/2)$ or $k(x, x') = x \cdot x'$. There are many kernels common in statistics and machine learning, ...
Tanishq Kumar
1 vote
2 answers
355 views

Application of Hoeffding's inequality to the Stochastic Multi-Armed Bandit Problem

I'm following this note to learn about deriving an upper bound for the UCB algorithm on the Stochastic Multi-Armed Bandit Problem. In particular, the proof of Lemma 15.6 there suggests that we can ...
NXWang
  • 167
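
For context, the form of Hoeffding's inequality such UCB analyses typically invoke, for i.i.d. rewards bounded in $[0,1]$ with mean $\mu$ and empirical mean $\hat{\mu}_n$ after $n$ pulls, is
$$\Pr\big(|\hat{\mu}_n - \mu| \ge \varepsilon\big) \le 2\exp(-2n\varepsilon^2),$$
which, inverted at confidence level $\delta$, gives the familiar $\sqrt{\log(2/\delta)/(2n)}$ confidence radius behind UCB indices.
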
0 votes
0 answers
101 views

(Reference Request, I guess?) No-Free-Lunch Theorem for Unsupervised Learning

I am familiar with David H. Wolpert's No-Free-Lunch Theorem for Supervised Learning. Now I am wondering: is there an analogous theorem for unsupervised learning, e.g. for clustering? And what ...
Joseph Expo
2 votes
1 answer
323 views

Role of variance in consistent estimators

By definition, a consistent estimator (or, more precisely, a weakly consistent estimator) is one that converges to the true value as the number of data points increases. So naturally, bias ...
HalfTea
  • 150
0 votes
0 answers
51 views

Taylor Series Approximation of the Variance Function

The problem setting is offline learning from bandit feedback data. Given a context vector $x$, a policy chooses an action $a$, with policy defined as $h_w(y \vert x)$, where $w$ is the 'learnable' ...
Shashank
0 votes
0 answers
522 views

Understanding the gradient contribution of each point in a linear regression line

Given a data set $D=\{(-1,0),(1,-2),(2,-1),(3,1)\}$ consisting of $(x, y)$ pairs, consider a linear regression model of the form $y = \theta^T x + \theta_0$, where $\theta = 0.5$ and $\...
Ben Harris
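
A minimal sketch of the per-point gradient computation for the squared loss $\sum_i (\theta x_i + \theta_0 - y_i)^2$; note the excerpt truncates before giving $\theta_0$, so the value below is a hypothetical placeholder:

```python
import numpy as np

# The (x, y) pairs from the question.
X = np.array([-1.0, 1.0, 2.0, 3.0])
Y = np.array([0.0, -2.0, -1.0, 1.0])

theta = 0.5
theta0 = 0.0  # hypothetical placeholder: the excerpt cuts off before stating theta_0

residual = theta * X + theta0 - Y        # per-point prediction error
grad_theta_per_point = 2 * residual * X  # each point's contribution to dL/d(theta)
grad_theta0_per_point = 2 * residual     # each point's contribution to dL/d(theta_0)
print(grad_theta_per_point, grad_theta0_per_point)
```
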
0 votes
0 answers
28 views

Statistical test for comparing number of clusters in data

I am performing $K$-means clustering on a dataset consisting of $n$ observations and $d$ variables, and I'm trying to determine the optimal number of clusters. Is there a test that can determine the ...
RyRy the Fly Guy
1 vote
1 answer
97 views

Why are errors random in linear regression?

In data science, when we study linear regression from a mathematical point of view, we often make the following assumption: we have points $y_i = x_i\beta + \epsilon_i$, with $\epsilon_i$ being a random ...
ConfusionMatrix
1 vote
1 answer
81 views

Can SVM be a special case of PCA?

Let $X$ and $Y$ be two linearly separable finite subsets of a $K$-dimensional real vector space $V$ with orthonormal basis $A = \{a_1,\ldots, a_K\}$. The covariance matrix $\Sigma_A$ of the set $X \cup Y$...
Alberto Carraro
2 votes
0 answers
35 views

Convergence of the Expectation-Maximization algorithm

Studying the Expectation-Maximization algorithm, I noticed that I couldn't find any proof that the parameters actually converge, nor that the limit is a local extremum of the likelihood (or even just ...
user25640
  • 1,594
0 votes
1 answer
73 views

What does it mean for a function to be differentiable/continuous when the input is a function?

I have the loss function $$L(h) = \sum_{i=1}^{n}(h(x_i) - y_i)^2,$$ where $h$ is a function that outputs the predicted value when fed a vector $x_i$. The domain of $h$ is then $\mathbb{R}^d$, and the ...
beginner
  • 1,774
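
A worked equation that usually resolves this: the Gâteaux (directional) derivative of $L$ at $h$ in a direction $g$ is
$$\frac{d}{dt}\,L(h + t g)\Big|_{t=0} = 2\sum_{i=1}^{n}\big(h(x_i) - y_i\big)\,g(x_i),$$
which also shows $L$ depends on $h$ only through the finite vector $(h(x_1),\dots,h(x_n)) \in \mathbb{R}^n$.
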
1 vote
0 answers
17 views

Error bounds for a semi-online learning problem

I want to solve the following problem: Consider the noise-free classification setup. Let $\mathcal{F}$ denote an infinite class of (binary) classifiers with finite VC dimension $d$. Let $f^* \in \...
y4nik
  • 193
0 votes
0 answers
24 views

Difference between sampling from and calculating a probability distribution function

I have a basic question about sampling from a probability distribution. For instance, in importance sampling it's hard to sample from $p$ directly, so we sample from a proposal distribution $q$ and then ...
tworiver
2 votes
1 answer
366 views

Example of tightness of the Sauer-Shelah lemma

I was given the task of finding an example of a family of events (or hypotheses, concepts, etc.) for which the Sauer-Shelah lemma is tight. The lemma states: Assume that the Vapnik-Chervonenkis ...
y4nik
  • 193
0 votes
1 answer
120 views

Is the minus absolute value of the difference a kernel?

Precisely, is $k(x_i, x_j) = -\|x_i-x_j\|$, $x_i, x_j \in \mathbb{R}$, a valid kernel? I know that the plain absolute value $\|x_i - x_j\|$ is not a valid kernel since it is not positive semi-...
Rajmadan Lakshmanan
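
A numerical check can disprove, though never prove, positive semidefiniteness: build the Gram matrix on sampled points and inspect its eigenvalues. A sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(50)

# Gram matrix of the candidate kernel k(x_i, x_j) = -|x_i - x_j|.
K = -np.abs(x[:, None] - x[None, :])

# K has zero trace and is nonzero, so it must have a negative eigenvalue;
# a negative minimum certifies K is not positive semidefinite on this sample.
print(np.linalg.eigvalsh(K).min())
```
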
1 vote
1 answer
196 views

Simple statement in the elementary proof of the Johnson-Lindenstrauss lemma (random projections)

In the simple proof of the Johnson-Lindenstrauss lemma by Sanjoy Dasgupta and Anupam Gupta that can be found here, they state the following (p. 62): Repeating this projection $O(n)$ times can ...
jakobhellander
1 vote
0 answers
65 views

How is it possible to have a maximum of a quantity, but then no argument that attains this maximum?

I am having difficulty intuitively understanding a definition in an exposition of the proof of the Vapnik-Chervonenkis inequality in the notes of Robert Nowak (2009). The proof strategy is taken ...
microhaus
  • 934
2 votes
0 answers
112 views

Intuition for Local vs. Global notions of Metric Entropy in Statistics

I am looking for intuition regarding the following statement on page 4 of this paper by Gassiat and Van Handel: However, in finite dimensional settings, global entropy bounds are known to yield sub-...
WeakLearner
  • 6,106
2 votes
0 answers
129 views

Conditional Second Moments of Multivariate Normal Variable on Binary Vectors

Suppose we observe a binary table $Y \in \mathbb R^{N \times G}$, corresponding to $N$ observations of $G$-dimensional binary vectors $Y_1, \ldots, Y_N$. We imagine each vector $Y_i$ is generated from ...
md19jli
  • 15
2 votes
0 answers
40 views

Variational Autoencoders and the inequality $\mathbb{E}_{z\sim q(z|x)}\log p_{\text{model}}(x|z) - D_{KL}(q(z|x)\,\|\,p_{\text{model}}(z)) \leq \log p_{\text{model}}(x)$

I am reading Section 20.10.3 of the book Deep Learning, on variational autoencoders, where the authors write: To generate a sample from the model, the VAE first draws a sample $z$ from the code ...
IntegrateThis
0 votes
1 answer
46 views

What is the meaning of "with probability at least 1-x over the drawing of the m training patterns"?

I'm reading a book about support vector machines and I encountered this, so what is the meaning of "with probability at least $1-\delta$ over the drawing of the $m$ training patterns"?
吴yuer
  • 321
0 votes
0 answers
60 views

Proving this upper bound involving VC dimension

Let $S_n = \{ (x_i, y_i)\}^n_{i=1}$ be a data set. Let $\mathcal H$ and $\mathcal H' $ be hypothesis classes, such that $\mathcal H' = \{ h \in \mathcal H: \hat {er}_{S_n}(h) \leq \beta\}$, where $\...
Keio203
  • 561
0 votes
0 answers
56 views

Deriving the solution for ridge polynomial regression

We have the following loss function: $$\operatorname{Err}(x)=\frac{1}{n}\sum_{i=1}^n(h_w(x_i)-y_i)^2+ \lambda\|w\|^2$$ I need to derive the solution for a polynomial of degree $0$ ($h_w(x)=w_0$) and ...
dan
  • 1
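
The degree-$0$ case is a one-line calculus exercise; a sketch: setting the derivative of $\frac{1}{n}\sum_{i=1}^n (w_0 - y_i)^2 + \lambda w_0^2$ with respect to $w_0$ to zero gives
$$\frac{2}{n}\sum_{i=1}^n (w_0 - y_i) + 2\lambda w_0 = 0 \quad\Longrightarrow\quad w_0 = \frac{\bar{y}}{1+\lambda}, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i.$$
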
1 vote
0 answers
40 views

For a Bayes classifier, can we prove that adding noise to data does not increase its accuracy?

In terms of a Bayes classifier, it's intuitive that adding noise to data CANNOT increase the accuracy. Take a binary classification problem as an example: the data distribution is $(x,y)\...
Autumnii
0 votes
0 answers
58 views

How to interpret $P(z|x, y; k)$ with examples?

Following this answer and this, I am trying to understand what is meant by the $P(z|x, y; k)$ notation. How should this be interpreted, with examples, in terms of machine learning? As I understand it, $P(z|x, y)$ ...
B200011011
0 votes
0 answers
55 views

How many data points must be in a subset of a dataset before the subset is representative of the parent dataset?

It makes sense to me that a randomly sampled subset of a dataset should still be theoretically representative of its parent. When you take data and split it into training and test sets, you assume that ...
Sanger Steel
2 votes
0 answers
24 views

How to give a high-probability uniform estimation of a potential having access to noisy pointwise estimates of the associated vector field?

As in the title, our goal is to estimate uniformly and with high probability (and up to a constant) a potential, having access to noisy pointwise estimates of the associated vector field (i.e., the ...
Bob
  • 5,783
