Newest 'statistics+machine-learning' Questions - Page 4

1 vote

1 answer

103 views

Generalization in Neural Networks: Can one Impose Conditions on the Data?

There is a well-developed theory on generalization bounds for deep neural networks, using VC dimensions and Rademacher Complexities. They work for any underlying "true" distribution $\...

Claudio Moneo

2,188

asked Oct 15, 2021 at 19:58

1 vote

2 answers

441 views

The meaning of "with probability at least $1-\delta$"

In the theoretical analysis of some algorithm for stochastic optimization, we often need to prove that something like $$\text{error}\leq\epsilon \quad (\text{under some conditions})$$ holds with probability at least $...

lazyleo

73

asked Oct 12, 2021 at 3:13

0 votes

1 answer

314 views

What statistics books would you recommend for an undergraduate student who wants to be a machine learning engineer?

I'm an undergraduate software engineering student and I will be taking a statistics course this semester. I was thinking of buying a statistics textbook such as Probability and Statistics for ...

anıl ateşsaçan

29

asked Oct 6, 2021 at 12:16

2 votes

1 answer

63 views

Probabilistic interpretation linear regression implication step

I am reading Andrew Ng's notes on linear regression, and in this section, he attempts to derive the formula for the least squares using a probability approach: http://cs229.stanford.edu/summer2020/...

K Split X

6,575

asked Sep 29, 2021 at 17:43

1 vote

0 answers

68 views

Minimizing the variance in a variant of bagging: Weighted Aggregation (Wagging)

In our machine learning course we have learned Bagging, wherein a variant of bagging called Weighted Aggregation is introduced, where the result is a weighted sum of all the estimators instead of ...

Sheen

11

asked Sep 22, 2021 at 2:38

1 vote

0 answers

207 views

Phased version of Upper-Confidence Bound Algorithm (UCB)

I am interested in an exercise (specifically Exercise 7.4) from Bandit Algorithms by Tor Lattimore and Csaba Szepesvari. It studies a phased version of the popular UCB algorithm. The algorithm takes ...

Brian

113

asked Sep 21, 2021 at 22:00

0 votes

1 answer

43 views

Matrix division - Bias-Variance Tradeoff

I am currently trying to prove that the ordinary least squares estimate doesn't have a bias with a given dataset with the bias given as Why does this identity hold in the following calculation $$(X^...

christheliz

23

asked Sep 18, 2021 at 14:08

1 vote

0 answers

41 views

Theoretical Machine Learning: How to calculate the expected risk of a model with unknown distribution $\hat{h}$?

If we have fixed, deterministic feature vectors $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ with an unknown model parameter $\theta^*$ and the error $z$ with $N(0,\sigma^2)$. For the feature vector $x_i$ ...

christheliz

23

asked Sep 13, 2021 at 15:56

0 votes

2 answers

350 views

Linear Regression: Correlation between predictors and residuals

I am reading Chapter 3 from Elements of Statistical Learning. In the explanation for Forward Stagewise Regression and Least Angle Regression, the authors explain that reducing the correlation between ...

temp_user

35

asked Sep 8, 2021 at 4:54

1 vote

0 answers

82 views

What are some functions $\mathbb{R}^+ \to \mathbb{R}$ other than $\log$?

I am interested in functions $f: \mathbb{R}^+ \to \mathbb{R}$, for the purpose of mapping non-negative statistical features of objects (such as lengths) to the whole real line. Then, I intend to use ...

tapphughesn

63

asked Aug 31, 2021 at 1:49

0 votes

0 answers

168 views

Why can I move the summation sign down?

Hi, I'm trying to teach myself machine learning by going through the book "An Introduction to Statistical Learning", and got stuck on one of the exercise questions. In the attached image ...

Kenneth .J

541

asked Aug 19, 2021 at 10:32

3 votes

1 answer

396 views

The relation between Bregman divergence and KL divergence

I see that Bregman divergence is defined as $d_\phi(x,y)=\phi(x)-\phi(y)-\langle x-y,\nabla\phi(y)\rangle$, where $x,y\in \mathbb{R}^d$ and $\phi$ is a strictly convex function. KL divergence is an instance of ...

user1388672

71

asked Aug 18, 2021 at 14:10

2 votes

1 answer

105 views

When is it true that $\sup g - \inf g \le 2\sup g$?

I am currently reading the first version of the paper titled "On the Margin Theory of Feedforward Neural Networks" by Colin Wei, Jason D. Lee, Qiang Liu and Tengyu Ma. In Lemma C.4 of the ...

Stratos supports the strike

4,730

asked Aug 8, 2021 at 8:04

0 votes

0 answers

60 views

Choosing a loss function to minimize total sum

I have the following regression problem. I'm forecasting a random variable $X$ for every day of a month, represented as $X_{ij}$ where $i$ is the day of the month and $j$ is the month number. I care if ...

broccoli

463

asked Aug 3, 2021 at 3:46

1 vote

0 answers

69 views

Request for reference: uniform convergence for non 0-1 loss functions

In the book "Understanding Machine Learning", there is Theorem 6.11 with the following statement: Let ${\cal H}$ be a class and let $\tau_{\cal H}$ be its growth function. Then, for every $\...

Elnur

352

asked Jul 23, 2021 at 12:18

1 vote

0 answers

13 views

Data Groupings and Bayesian Analysis for a Generative Model

I have a simple question that I think will have potentially many solutions, depending on the level of complexity with which one wants to approach it. I've built a generative model with two variables ...

JKM

449

asked Jul 22, 2021 at 15:21

1 vote

1 answer

1k views

How do I compute the derivative of the cross-entropy loss $H(P,Q)$ with respect to the weights $W$?

I'm trying to understand the cross-entropy loss with iris dataset for binary classification where y=1 denotes the plant belongs to Setosa and y=0 denotes the example belongs to Non-Setosa. Consider ...

JakeMZ

283

asked Jul 21, 2021 at 4:27

1 vote

0 answers

212 views

Optimization: max to softmax for convexity?

Assume we have the following optimization problem: for a family of $m$ vectors $\{x_i\}\in \mathbb{R}^n$, a family of $l$ vectors $\{c_i\}\in \mathbb{R}^n$ with $l\ll m$ and for a family of $l$ ...

Marion

2,239

asked Jul 20, 2021 at 9:05

1 vote

0 answers

22 views

Perplexities about Bayesian inference and model averaging (BMA)

Reading about the Bayesian approach to model selection, I was just wondering about the more mathematical meaning of Bayesian model averaging. Say for example that we are given a dataset $\mathcal{D} = \{\...

James Arten

1,953

asked Jul 19, 2021 at 9:12

1 vote

1 answer

479 views

How to verify whether a metric is of negative type or not?

A metric $d(\cdot,\cdot)$ on a space $S$ is said to be of negative type if, for all $n \geq 2$, $z_{1}, \ldots, z_{n} \in S$, and $\alpha_{1}, \ldots, \alpha_{n} \in \mathbb{R}$ with $\sum\limits_{i=...

Zhao Zhao

73

asked Jul 17, 2021 at 12:49

2 votes

0 answers

47 views

Measurability issues in the symmetrization step in the proof of $\varepsilon$-sample theorem

Let $(\mathcal{X},d)$ be a metric space and $\mu$ be a Borel probability measure on $(\mathcal{X},d)$. Let $m \in \mathbb{N}$ and define the two probability product measures $\mu^{m} := \otimes_{k=1}^{...

Bob

5,783

asked Jul 16, 2021 at 7:49

1 vote

0 answers

132 views

Empirical Fisher Information but with unknown true parameters and distribution?

I am not sure if I ask it correctly. I am working on using Fisher Information to examine the information in a model (say neural networks for simplicity). What I know is that the definition of Fisher ...

Daniel H. Leung

49

asked Jul 11, 2021 at 3:44

1 vote

0 answers

51 views

Deriving the regularization term in Bayesian lasso regression

The title is probably not very good; I thought hard about how to phrase this correctly. I'd be grateful if someone tells me it's wrong and how to correct it. I am practicing for my exams and I have ...

oliver

675

asked Jul 4, 2021 at 11:12

1 vote

0 answers

19 views

How to obtain the parameter update for the multiclass classification (general loss and activation function)?

Consider the feature space $\mathcal{X}=\mathbb R^{d}$ and $\mathcal{Y}=\{1,...,c\}$ such that $c > 2$. We consider some activation function $\alpha: \mathbb R^{c} \to \mathbb R^{c}$ and out weight ...

MinaThuma

998

asked Jul 3, 2021 at 10:38

6 votes

2 answers

685 views

BFGS Formula from Kullback-Leibler Divergence

On page 411 in this book, the authors give the following BFGS formula $$ \boxed{\boldsymbol C_{\textrm{BFGS}} = \boldsymbol C + \underbrace{\frac{\boldsymbol g^\top\boldsymbol\delta+\boldsymbol g^\top\...

LaguerreGroup

94

asked Jun 30, 2021 at 17:05

2 votes

0 answers

75 views

Generative model evaluation metric: Precision & Recall

In this paper, a new metric was proposed to evaluate generative models. The equation (1) decomposes generative distribution and real distribution into two parts w.r.t their intersection of the ...

Code mx

31

asked Jun 9, 2021 at 1:00

0 votes

2 answers

190 views

Bayes classifier: handling conditional expectation / probability

I am learning about the Bayes optimal classifier, and there is a step in a proof I struggle with. One can find this proof also on the Wikipedia page: https://en.wikipedia.org/wiki/Bayes_classifier#...

noam.szyfer

1,600

asked Jun 2, 2021 at 15:41

0 votes

0 answers

36 views

Which standard deviation for model averaging?

I hope Math Stack Exchange is the right place for this question, even though it comes from an AI point of view. Say I have a machine learning model and for robustness of results, I initialize it with ...

frederik

1

asked May 28, 2021 at 10:44

2 votes

1 answer

158 views

Definition of Ergodicity in Theodoridis' Machine Learning

This is related, but is not the same as https://stats.stackexchange.com/questions/319190/wide-sense-stationary-but-not-ergodic. Note that I am not assuming stationarity. Theodoridis, in his Machine ...

Clarinetist

19.6k

asked May 27, 2021 at 21:02

2 votes

0 answers

88 views

Understanding the $\alpha$-regularity assumption for trees

In this paper, definition 4 claims that a tree grown by recursive partitioning is $\alpha$-regular for some $\alpha>0$ if each split leaves at least a fraction $\alpha$ of the available training ...

WeakLearner

6,106

asked May 26, 2021 at 1:56

2 votes

0 answers

57 views

EM algorithm for maximum of 2 normal distributions

Let $X_i \sim N(\mu_1,\sigma^2)$, $Y_i \sim N(\mu_2,\sigma^2)$, and $O_i = \max(X_i,Y_i)$. I need to find $\mu_1, \mu_2$ using EM. My attempt: first I defined $Z_i = \left\{\begin{matrix} 1, ~ X_i \ge Y_i \\ 0,...

Roi Hezkiyahu

425

asked May 11, 2021 at 10:28

0 votes

1 answer

36 views

Optimizing recursive functions with time series data

I have a recursive function, $f(0,a)$ is known, $f(t+1;a,b)=f(t;a,b)+g(t,b)$ where $a,b$ are constants and $g$ is a function. I also have a sequence of data, $D(t)$. I am trying to optimize $f$ with ...

Xia

542

asked May 8, 2021 at 10:31

2 votes

1 answer

532 views

Derivation of the bias-variance tradeoff

I'm having trouble understanding the derivation of the bias-variance tradeoff which is also given in the article on the mean squared error. Let some data be represented by the random variable $X$ with ...

20_Limes

93

asked May 4, 2021 at 15:22

0 votes

1 answer

66 views

How to prove total test error is independent of the selected learning algorithm

I'm looking at the following proof: Where: Note: I'm new to this but I think I understand all the below variables correctly now. Mistakes are possible though. $f$ is an ideal function with perfect ...

Grant Curell

199

asked Apr 26, 2021 at 0:09

1 vote

0 answers

140 views

Is a Chi-Squared goodness of fit test appropriate for Neural Network regression?

So I always have wished that regression of neural networks gave more interpretable results and I'm pretty hopeful that chi-squared tests anchor these MSE values in the same way that accuracy anchors ...

profPlum

337

asked Apr 24, 2021 at 15:34

0 votes

1 answer

305 views

Support Vector Machine Optimization Problem

The formulation of the SVM optimization problem is: \begin{equation} \begin{aligned} & \max_{w,b} \frac{1}{\|w\|} \\ & \text{ subject to } \\ & y_i(w^{T}x_i+b) \geq 1 \end{aligned} \end{...

wizz

519

asked Apr 20, 2021 at 20:19

1 vote

0 answers

156 views

Distance weighted uniform sampling - sampling procedure

In the excellent paper Wu, Chao-Yuan; Manmatha, R.; Smola, Alexander J.; Krähenbühl, Philipp (2017): Sampling Matters in Deep Embedding Learning. Available online at https://arxiv.org/pdf/1706.07567. ...

2Obe

185

asked Apr 10, 2021 at 18:31

5 votes

0 answers

79 views

Conditional Bias Variance Decomposition

The standard bias variance decomposition says that: $$ E |f(X) - Y|^2 = \int_{\mathbb{R}^d} |f(x) - m(x)|^2 \mu(dx) + E|m(X) - Y|^2, $$ where $\mu$ is some distribution over $X$. I am trying to ...

WeakLearner

6,106

asked Apr 8, 2021 at 17:09

0 votes

1 answer

69 views

Reconstruction error for PCA not equal to zero?

I am working on an assignment for school where they ask us to perform PCA-analysis on a data set consisting of 500 data points where each data point is of dimension $p=256$. You usually project your ...

JBosmans

1

asked Apr 4, 2021 at 23:11

0 votes

1 answer

85 views

Find solution for optimal regression coefficients

Consider the cost function $E(\mathbf{w}) = \displaystyle\frac{1}{2} \sum_{i=1}^{N}{(\mathbf{d}_i - \mathbf{x}_i^T\mathbf{w})^2} + \frac{\lambda}{2} \left\lVert \mathbf{w}\right\rVert^2$ where $(\...

Monya Feldman

87

asked Mar 30, 2021 at 0:30

1 vote

0 answers

102 views

Two-way ANOVA for machine learning model analysis.

I have different machine learning models and losses, each trained using 5-fold cross-validation. Would it make sense to run a two-way ANOVA to evaluate which perform statistically better? ...

Ramon

123

asked Mar 11, 2021 at 17:12

6 votes

1 answer

183 views

Does the law of large numbers hold for covering numbers?

I am self-studying empirical process theory. I have encountered the covering number $N(\delta,\mathcal{G},P)$, as well as the empirical version $N(\delta,\mathcal{G},P_n)$. It seems intuitive to ...

Idontgetit

1,391

asked Feb 28, 2021 at 10:02

0 votes

0 answers

28 views

Proof of unbiased estimator

Assume: $$ \phi = \int f(x)p(x)dx = E_p(f)$$ Let $x_s \sim p$, $s=1,\ldots,S$, iid ($p(x_s=x)=p(x)$ and $p(x_1,x_2) = p(x_1) p(x_2)$). \begin{align} \hat{\phi} &= \frac{1}{S}\sum_{s=1}^{S}f(x_s) \\ E[\...

Swakshar Deb

281

asked Feb 22, 2021 at 17:43

0 votes

1 answer

144 views

Linear Regression Prediction Errors

Suppose that we perform linear regression on data $\mathbf{X}$ (an $N \times {(D+1)}$ matrix) and predictions $\mathbf{y}$ (an $N \times 1$ vector). Let $\mathbf{w}$ ($(D+1) \times 1$ vector) be the ...

Bobo

409

asked Feb 20, 2021 at 22:39

0 votes

1 answer

100 views

Is the logistic regression cost function in scikit-learn different from standard derivations?

I am trying to understand the math behind logistic regression. Going through a couple of websites, lectures and books, I tried to derive the cost function by thinking of it as the negative of the ...

Anu

311

asked Feb 20, 2021 at 3:49

2 votes

1 answer

247 views

In Fisher’s discriminant for multiple classes, how do you manage when $S_w$ is a singular matrix (so you can't get $S_w^{-1}$)?

I am trying to use Fisher’s discriminant for multiple classes to reduce the dimension of the MNIST data set, similar to this post: https://towardsdatascience.com/an-illustrative-introduction-to-...

Nicolas Pacheco

133

asked Feb 20, 2021 at 1:12

4 votes

1 answer

2k views

What is the algorithm for the kNN density estimator?

I am reading Pattern Recognition and Machine Learning by Christopher Bishop. In chapter two he talks about using kNN for density estimation. I want to replicate a plot using Python/R/MATLAB. He is ...

Nicolas Pacheco

133

asked Feb 17, 2021 at 1:53

0 votes

0 answers

28 views

Need a probabilistic approach to determine if a data-set A includes all the elements of data-set B

My job is to identify if the two given datasets are the same. This is to be done on computers using some programming language (C++). Since the data could be huge, I don't want to read all the elements of ...

ultimate cause

163

asked Feb 4, 2021 at 3:25

1 vote

2 answers

91 views

Covariance of a Vector-Valued Random Variable

I'm reading through Andrew Ng Lecture Notes for CS229 and he makes the statement that, for a random variable $Z \in \mathbb{R}^{n}$, \begin{align} Cov(Z) &= E[(Z - E[Z])(Z - E[Z])^{T}]\\ &...

keggythekeg

53

asked Jan 22, 2021 at 15:08

2 votes

1 answer

3k views

Intuition behind the exponential loss function

I'm reading about AdaBoost from the book The Elements of Statistical Learning. The book mentions that, to train the model, the exponential loss function is used: $$L(y, f(x)) = e^{-y f(x)},$$ where $...

user3889486

237

asked Jan 21, 2021 at 1:25
