
All Questions

3 votes
0 answers
110 views

Generating a sequence of i.i.d. permutations by a single uniform random variable

I am learning how the mini-batch gradient descent (MBGD) algorithm works and came across something that I find a bit odd and don't know how to show. In the MBGD algorithm we have a loop of $N$...
Gilligans • 115
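
For context, a minimal sketch of the per-epoch shuffling loop such questions refer to, assuming NumPy; `grad` is a hypothetical per-batch gradient function, not something from the question itself:

```python
import numpy as np

def mbgd(X, y, grad, theta, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch gradient descent with a fresh permutation of the
    data drawn at every epoch (the i.i.d. permutations in question)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        perm = rng.permutation(n)  # new random permutation each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            theta = theta - lr * grad(X[idx], y[idx], theta)
    return theta
```
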
2 votes
1 answer
100 views

hint on a solved old exam question on a probabilistic methods calculation

In my notes I have a solved question from a previous exam, as follows, in the probabilistic methods section: Example: We have $k$ classes $C_1, C_2,\dots,C_k$ where each $C_i$ has a uniform distribution over $-(2^{...
S. Christin
2 votes
0 answers
71 views

MAP estimation for discriminative models

I am having some trouble understanding MAP estimation for discriminative models. I will use the notation from the first two pages of this paper https://www.microsoft.com/en-us/research/wp-...
francesco bertolotti
1 vote
0 answers
145 views

Logistic Regression | Exercise

I am trying to solve questions 2 to 7 of this exercise sheet. At the moment, I don't know how to answer these questions, but I have some ideas for the second question: Do I have to study ...
Bsh • 11
1 vote
1 answer
2k views

Why does multicollinearity cause the standard errors of the coefficients to go up?

I understand that multicollinearity is a problem because the stronger the correlation, the more difficult it is to change one predictor without changing another, and it becomes difficult for the model ...
Eisen • 223
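
For reference, the standard OLS identity that makes this precise: writing $R_j^2$ for the $R^2$ from regressing predictor $x_j$ on the remaining predictors,

$$\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(1 - R_j^2)\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2},$$

so as collinearity drives $R_j^2 \to 1$, the variance inflation factor $1/(1-R_j^2)$ blows up the standard error of $\hat{\beta}_j$.
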
1 vote
1 answer
153 views

k-NN average distance bound

I need to show the following inequality concerning the k-NN algorithm: $$ $$ The data $S = \{X_1, X_2, \dots, X_n\}$ is split into $\leq k$ parts: $$S_j = \{ X_i \mid i - (j-1) \lfloor \frac{n}{k} \...
wklm • 93
1 vote
1 answer
457 views

Variance of a Random Forest

I'm reading through Introduction to Statistical Learning (ISL) right now and I'm having trouble understanding the variance of a random forest. Does anyone know how this variance is derived? What ...
Eisen • 223
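
The identity in question (given in the random-forest chapter of ESL): for $B$ identically distributed trees $T_b(x)$, each with variance $\sigma^2$ and positive pairwise correlation $\rho$, the variance of their average is

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \frac{1}{B^2}\left(B\sigma^2 + B(B-1)\rho\sigma^2\right) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2,$$

so averaging removes the second term as $B \to \infty$, but the correlated part $\rho\sigma^2$ remains — which is why random forests decorrelate the trees by subsampling features at each split.
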
1 vote
1 answer
341 views

Question about VC-dimension [closed]

Because I'm not used to the theory of VC-dimension, I don't know why these statements hold. Why can an axis-parallel square shatter a set of three points, but not shatter any set of four points?
Ronald • 103
1 vote
0 answers
95 views

Empirical Risk minimization, symmetrization lemma

I have a question related to obtaining uniformly good estimates of error for the class of hypothesis function. The following images are taken from the paper: "The Complexity of Learning According ...
Gantavya Bhatt
1 vote
0 answers
49 views

Getting the coefficients from Partial Least Squares

I am currently reading The Elements of Statistical Learning. In section 3.5.2 (Partial Least Squares), it describes the algorithm: 1. Standardize each $x_j$ to have mean zero and variance one. ...
just_asking123
2 votes
1 answer
5k views

Understanding how EM algorithm actually works for missing data

I am currently studying the EM algorithm for handling missing data in a data set. I understand that the final goal of the EM algorithm is not to impute data, but to calculate the parameters of interest. ...
Albi Toro
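
For reference, the iteration that formalizes this: with observed data $X_{\text{obs}}$ and missing part $Z$, the E-step forms the expected complete-data log-likelihood and the M-step maximizes it,

$$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X_{\text{obs}},\, \theta^{(t)}}\left[\log L(\theta; X_{\text{obs}}, Z)\right], \qquad \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}),$$

so any imputed values are only an intermediate device; the output of interest is the parameter sequence $\theta^{(t)}$.
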
1 vote
1 answer
99 views

Optimal Rule for Rank Loss

In a binary classification setting, the classification risk of a classifier $h$ is defined by $$ R(h) = \mathbb{P}(Y \neq h(X)), $$ where $(X,Y) \sim P$. It is well known that the classifier that ...
WeakLearner • 6,106
0 votes
1 answer
114 views

MAE (mean absolute error)

For objects $x_1,\dots,x_n$ with correct answers $y_1,\dots,y_n$ from $\mathbb{R}$, construct a constant model $a(x)=c$ for the loss function $$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}|y_i-c|.$$ As I understand, I need to ...
GIFT • 321
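
The missing step is a subgradient argument: for $c$ not equal to any $y_i$,

$$\frac{\partial}{\partial c}\sum_{i=1}^{n}|y_i - c| = \#\{i : y_i < c\} - \#\{i : y_i > c\},$$

which is zero exactly when as many $y_i$ lie below $c$ as above it, so the optimal constant is $c = \operatorname{median}(y_1,\dots,y_n)$.
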
1 vote
0 answers
45 views

Different formulations of within-class scatter matrix

If we have a dataset $X= \{x_1,x_2,\dots,x_n\}$ where all the data points are in a $d$-dimensional feature space and there are $2$ classes $c_1$ and $c_2$, for which $n_1$ points from $X$ are for class $c_1$ ...
ankit • 353
1 vote
0 answers
229 views

L^2 Norm and Chi-Squared distributions for a Gaussian Process

I am new to the computer science and ML community. I learn best by doing, which is why I have created a project for myself to help me get to know Gaussian processes and ML in general. I was pointed to ...
Trevor Haas
0 votes
1 answer
76 views

Error for model building

After building a model from historical data to make predictions, for example $\hat{Y}=\beta_0+\beta_1 X$, if I now want to calculate the MSE or $R^2$ value, do I use the same data that was used to create the model to ...
Basil Abu Mallooh
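
A minimal sketch of the usual answer, assuming scikit-learn and synthetic data for illustration: the training-set MSE is optimistic, so one evaluates on a held-out test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.random((200, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Fit on one portion of the data, evaluate on the held-out portion.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
print("train MSE:", mean_squared_error(y_tr, model.predict(X_tr)))
print("test  MSE:", mean_squared_error(y_te, model.predict(X_te)))
print("test  R^2:", r2_score(y_te, model.predict(X_te)))
```
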
2 votes
1 answer
81 views

Finding the conditional probabilities of a latent dirichlet allocation model

Let's say I'm defining a LDA as the following: For each doc $m$: Sample topic probabilities $\theta_m \sim \text{Dirichlet}(\alpha)$ For each word $n$: Sample a topic $z_{mn} \sim \text{Multinomial}(\theta_m)$ ...
Jonathan • 736
1 vote
0 answers
26 views

Questions about polynomial algebras

If $x$ is a vector of dimension 3, say $(x_1,x_2,x_3)^T$, then $S^d$ is an operator such that $$S^d(x)=\left(\begin{matrix} x_2 & x_1 & 0 & 0 & 0 & \dots & 0 & 0 & 0 \\ ...
Phat Cao
0 votes
1 answer
184 views

Maximum Entropy Continuous Distribution

In Pattern Recognition and Machine Learning Ch 1.6, the author derives the distribution which maximises the differential entropy: $$H(\mathbf{x}) = -\int p(\mathbf{x}) \ln p(\mathbf{x})\, d\mathbf{x}$$ ...
tail_recursion
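
A sketch of the derivation in that section: maximizing $-\int p \ln p \, dx$ subject to normalization, a fixed mean $\mu$, and a fixed variance $\sigma^2$ gives the Lagrangian stationarity condition

$$p(x) = \exp\left(-1 + \lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2\right),$$

and solving for the multipliers from the three constraints yields the Gaussian $p(x) = \mathcal{N}(x \mid \mu, \sigma^2)$ — the maximum-entropy continuous distribution for fixed mean and variance.
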
0 votes
1 answer
101 views

How to minimize the KL divergence with respect to fixed parameters?

I read the LDA paper multiple times but I'm having trouble with the following. Let's say I define an LDA model as: For each doc $m$: Sample topic probabilities $\theta_m \sim \text{Dirichlet}(\alpha)$ For ...
Jonathan • 736
1 vote
0 answers
42 views

Visualizing data using vectors

Say there are 10 houses and we have three pieces of information for each of them: area, nbedrooms, and price. I can view this as 10 different vectors in a space with 3 axes, basically 10 arrows ...
randomness312
6 votes
3 answers
758 views

Application of the chain rule to a $3$-layer neural network

Consider the differentiable functions $L^1(x,\theta^1),L^2(x^2,\theta^2),L^3(x^3,\theta^3)$, where every $x^k,\theta^k$ is a real vector, for $k=1,2,3$. Also define $\theta=(\theta^1,\theta^2,\theta^3)...
Lilla • 2,109
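
Assuming the layers are composed as $x^2 = L^1(x,\theta^1)$ and $x^3 = L^2(x^2,\theta^2)$ (the convention the question appears to use), the chain rule for the deepest parameters reads

$$\frac{\partial L^3}{\partial \theta^1} = \frac{\partial L^3}{\partial x^3}\,\frac{\partial L^2}{\partial x^2}\,\frac{\partial L^1}{\partial \theta^1},$$

a product of Jacobians accumulated from the output layer backwards — exactly the backpropagation recursion.
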
1 vote
3 answers
429 views

Application of the chain rule, and some recursion

Consider the differentiable functions $L^1(x,\theta^1),L^2(x^2,\theta^2),...,L^l(x^l,\theta^l)$, where every $x^k,\theta^k$ is a real vector, for $k=1,...,l$. Also define $\theta=(\theta^1,...,\theta^...
Lilla • 2,109
1 vote
0 answers
119 views

Expected Risk in Machine Learning

I am currently working through some Statistical Learning Theory and the following is confusing me. For a fixed learning algorithm $A$ that maps training data $S$ to a function ("prediction") ...
Claudio Moneo
1 vote
1 answer
374 views

How to compute evidence lower bound (ELBO) when the complete log-likelihood is intractable?

As an example, assume that I have data $\mathbf{X}$, unobserved variables $\mathbf{Z}$ and model parameters $\pmb{\alpha}$, $\pmb{\beta}$. I am omitting mention of any variational parameters. ...
JKB • 21
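
For reference, the standard decomposition behind this question: for any variational distribution $q(\mathbf{Z})$,

$$\mathcal{L}(q) = \mathbb{E}_q\!\left[\log p(\mathbf{X}, \mathbf{Z} \mid \pmb{\alpha}, \pmb{\beta})\right] - \mathbb{E}_q\!\left[\log q(\mathbf{Z})\right] = \log p(\mathbf{X} \mid \pmb{\alpha}, \pmb{\beta}) - \mathrm{KL}\!\left(q(\mathbf{Z}) \,\|\, p(\mathbf{Z} \mid \mathbf{X}, \pmb{\alpha}, \pmb{\beta})\right),$$

and when the first expectation is itself intractable, it is typically bounded further (e.g. via Jensen's inequality) or estimated with Monte Carlo samples from $q$.
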
0 votes
1 answer
65 views

Conditional Probability vs Joint Probability

I have a model that predicts the color of clothing items (red, blue, green, etc.). I have another model that predicts the category of the item (shirt, pants, dress, hat, etc.). Given an image, if I run ...
Hellboy • 135
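
The distinction the question turns on: the two model outputs multiply into a joint only under conditional independence given the image,

$$P(\text{color}, \text{category} \mid \text{image}) = P(\text{color} \mid \text{image})\, P(\text{category} \mid \text{image}),$$

which fails when color and category remain dependent even after seeing the image; in that case one needs a factor such as $P(\text{color} \mid \text{category}, \text{image})$ instead.
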
1 vote
0 answers
89 views

Tree Graphs as mappings

By tree we mean a graph $G$ in which any two vertices $v, y$ are connected by a single path. Equivalently, an undirected, connected, acyclic graph. So, there's this thing in statistics called a ...
Pedro Cavalcante
0 votes
0 answers
35 views

How are random forests used to estimate missing data? Literature on the area?

I am exploring imputation techniques to deal with my missing data and I have come across using a random forest to deal with these values. Does anybody have any literature on how this could be ...
123123 • 49
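
One widely used variant of this idea (the missForest approach) is available through scikit-learn's experimental `IterativeImputer` with a random-forest base estimator; a minimal sketch on toy data:

```python
import numpy as np
# IterativeImputer is experimental in scikit-learn, hence the enable import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# Each feature with missing entries is regressed on the others with a
# random forest, and the features are imputed round-robin until convergence.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    random_state=0,
)
print(imputer.fit_transform(X))
```
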
1 vote
0 answers
33 views

How to Explain the Basis Functions in Regression Splines

I am going through the Elements of Statistical Learning and am currently working through a chapter on using splines in regression. I have a question about deriving the basis functions for the cubic ...
glawley • 87
2 votes
1 answer
255 views

Is There a Connection Between the Minimum $L_1$ Norm Solution and LASSO?

I am reading a book about sparsity, Statistical Learning with Sparsity: The Lasso and Generalizations. I want to know the relationship between the following two optimization problems: $$\min_{\beta} \| \...
XIONG ZENG
1 vote
1 answer
212 views

Stochastic Gradient Descent for iterated expectation?

Normally, SGD comes up in a context like $$\min_\theta \; \mathbb{E}[f(X, \theta)],$$ where $\theta$ is some parameter, $f$ is a function like $f(X, \theta) = (X-\theta)^2$ (to find the mean), and ...
chausies • 2,230
1 vote
0 answers
39 views

Bound on error probability of 1-D ideal Stoller split

Consider the binary decision rule $g_c: \mathcal{X} \rightarrow \{0, 1\}$ given by: $$g_c(x) = \begin{cases} 1& \text{if } x \geq c\\ 0 & \text{otherwise} \end{cases}$$ Show that the minimum ...
dmh • 3,012
2 votes
1 answer
388 views

Bayes classifier for binary decision problem with Reject option

Consider the decision problem where three decisions are valid: $0, 1$ and a third option $reject$. An optimal rule has the lowest probability of error at a fixed "reject" probability. More ...
dmh • 3,012
1 vote
1 answer
82 views

Soft-EM: E-step for fitting mixed linear regression model

I want to derive the formulas for the soft EM algorithm for the following model $P[y_i | x_i, \pi_{1,\dots,m}, a_{1,\dots,m}] = \sum_{j=1}^m \pi_j \frac{1}{\sqrt{2\pi}\sigma} \exp(-\frac{(a_j^T x_i - ...
qwipo • 49
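
For this mixture of linear regressions, the E-step is the usual responsibility computation: given the current parameters, the posterior probability that observation $i$ belongs to component $j$ is

$$\gamma_{ij} = \frac{\pi_j \exp\!\left(-\frac{(a_j^T x_i - y_i)^2}{2\sigma^2}\right)}{\sum_{k=1}^{m} \pi_k \exp\!\left(-\frac{(a_k^T x_i - y_i)^2}{2\sigma^2}\right)},$$

and the M-step then maximizes the $\gamma$-weighted complete-data log-likelihood (weighted least squares for each $a_j$, averages of the $\gamma_{ij}$ for the $\pi_j$).
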
1 vote
0 answers
142 views

Clarification on Likelihood and Maximum Likelihood Estimation (MLE) Notation; PLUS a solution for taking into account the uncertainty of data points

Having read a significant number of papers related to probabilistic methods of machine learning, some of the notation around MLE is still vague to me. So I decided to ask this question once and for all ...
sorooshi
1 vote
1 answer
65 views

Tightness of bound on true risk in the simplest optimistic case

Vapnik (Statistical Learning Theory) describes the "simplest optimistic" case of learning with empirical risk minimization as the case where at least one of the functions we are selecting ...
dmh • 3,012
0 votes
1 answer
144 views

Bayes interpretation of regularization in linear regression

I am deriving L2 regularization by considering Bayes' theorem. In doing so I came across the following article, which stated that the parameter $\theta$ has a probability distribution that ...
cojoye • 47
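
The derivation in one line, assuming a Gaussian likelihood with noise variance $\sigma^2$ and a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\left[\log p(y \mid X, \theta) + \log p(\theta)\right] = \arg\min_{\theta}\left[\|y - X\theta\|_2^2 + \lambda\|\theta\|_2^2\right], \qquad \lambda = \frac{\sigma^2}{\tau^2},$$

so L2 regularization is exactly MAP estimation under that Gaussian prior on the parameters.
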
1 vote
0 answers
58 views

High-probability bounds using pseudo-dimension or Rademacher complexity

Let $F$ be a set of functions mapping $\mathbb{R}^n$ to $[0,1]$ with pseudo-dimension $d$ and let $D$ be a distribution over $\mathbb{R}^n \times [0,1]$. We know that for any $\epsilon, \delta \in (0,...
EMV • 139
0 votes
0 answers
37 views

Why do we use parameters in linear regression for multiple variables problems?

I am new to machine learning and I am confused by the use of parameters in linear regression for multiple variables. I do understand the hypothesis function's parameters (I know that $\theta_0$ is the ...
ParPari
1 vote
0 answers
64 views

Derivation of the M-step in the Gaussian EM algorithm (maximizing a given function over 2 parameters)

Question: The expectation-maximization algorithm is an alternating algorithm, meaning that it alternates between the M-step (maximization step) and the E-step (expectation step). My question is how ...
user4933 • 317
1 vote
0 answers
55 views

Active vs Passive statistical learning: How do we say which one is better?

In statistical learning theory, to pose a regression/classification problem, one starts by selecting a set of points $\{X_1, \dots, X_n\}$; then one labels these to get a dataset $S=\{(X_1,Y_1), \dots, ...
Saleh • 649
0 votes
1 answer
39 views

Linear regression ML notation

Could somebody explain what this notation means: $$\min_w \|Xw - y\|^2_2$$ P.S. Linear regression is described this way in Machine Learning.
Adolf Miszka
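
A minimal sketch of what the notation asks for, assuming NumPy: $\min_w \|Xw - y\|_2^2$ is the ordinary least-squares problem, which `np.linalg.lstsq` solves directly.

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # first column of ones acts as the intercept
y = np.array([1.0, 2.0, 2.5])

# Returns the w minimizing the squared Euclidean norm ||Xw - y||_2^2.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w)
```
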
1 vote
2 answers
218 views

Solution for $\beta$ in ridge regression

The RSS of the ridge regression in matrix form is: $$\mathrm{RSS}(\lambda) = (y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$$ the ridge regression solutions are easily seen to be $$\beta_{\mathrm{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$$ See ...
Trajan • 5,244
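
The missing step between the two displays: setting the gradient of $\mathrm{RSS}(\lambda)$ with respect to $\beta$ to zero,

$$\frac{\partial\, \mathrm{RSS}(\lambda)}{\partial \beta} = -2X^T(y - X\beta) + 2\lambda\beta = 0 \;\Longrightarrow\; (X^TX + \lambda I)\beta = X^T y \;\Longrightarrow\; \beta_{\mathrm{ridge}} = (X^TX + \lambda I)^{-1}X^T y,$$

where $X^TX + \lambda I$ is invertible for any $\lambda > 0$ because it is positive definite.
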
3 votes
2 answers
397 views

What is the expected cost of using LDA?

Suppose that you observe $(X_1,Y_1),\dots,(X_{100},Y_{100})$, which you assume to be i.i.d. copies of a random pair $(X,Y)$ taking values in $\mathbb{R}^2 \times \{1,2\}$. I have that the cost of ...
user
2 votes
0 answers
34 views

Deriving hyperparameter updates in Online Interactive Collaborative Filtering

I've been going through "Online Interactive Collaborative Filtering Using Multi-Armed Bandit with Dependent Arms" by Wang et al. and am unable to understand how the update equations for the ...
Shashank Gupta
0 votes
0 answers
486 views

Maximum KL-divergence between two discrete distributions with non-zero mass on each point of support.

Suppose we are given a discrete probability distribution $p$ defined over a finite set $\mathcal{S}$. We have $p(s) > 0, \forall s\in\mathcal{S}$. Suppose we now want to find the distribution $q$ ...
Brian • 113
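
A minimal sketch, assuming NumPy, of the quantity being maximized: $\mathrm{KL}(p \,\|\, q) = \sum_{s} p(s)\log\frac{p(s)}{q(s)}$, which grows without bound as any $q(s) \to 0$ — which is why the question restricts $q$ to strictly positive mass.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with full support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.98, 0.01, 0.01])
print(kl_divergence(p, q))  # increases as q concentrates its mass
```
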
1 vote
0 answers
123 views

Why (multi marginal) optimal transport?

I recently learned about optimal transport (OT) and its generalization to comparing multiple distributions jointly, called multi-marginal optimal transport (MMOT). In a nutshell, OT does $ \...
SaganTheSag
0 votes
1 answer
723 views

how to find an equation representing a decision boundary in logistic regression

I'm new to machine learning and currently working on logistic regression, but I don't know how to deal with this problem. Let us consider logistic regression for a dataset $(x_n,y_n)\ (x_i \in \mathbb ...
J.Maisel
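
For reference, the standard fact this question needs: with $p(y = 1 \mid x) = \sigma(w^T x + b)$ and $\sigma$ the logistic sigmoid, the decision boundary is the set where the two classes are equally probable,

$$\sigma(w^T x + b) = \tfrac{1}{2} \iff w^T x + b = 0,$$

i.e. a hyperplane in feature space (a straight line when $x \in \mathbb{R}^2$).
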
1 vote
1 answer
69 views

Marginalizing the product of $p(x|z,\mu)$ and $p(z|\pi)$

Consider now $n$ i.i.d. observations of the vector data $(x_1,\dots,x_n)$. Using the pdf, we can write the log-likelihood expression: $$\ell(\boldsymbol x)=\sum_{i=1}^n \ln\!\left(\sum_{k=1}^K \pi_k\, p(x_i|\mu_k)\right)$$ ...
Eilysh Mucha
1 vote
0 answers
24 views

Consistency of regression function estimate

This question is in the context of regression with squared error loss. Let $(X,Y) \in \mathbb{R}^p\times\mathbb{R}$ be random variables with joint distribution ${F}$. We randomly sample a training ...
tygaking
