
All Questions

3 votes
0 answers
110 views

Generating a sequence of i.i.d. permutations by a single uniform random variable

I am learning how the mini-batch gradient descent (MBGD) algorithm works and came across something that I find a bit odd and don't know how to show. In the MBGD algorithm we have a loop of $N$...
Gilligans • 115
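
For context, a minimal sketch of the per-epoch shuffling loop such questions refer to, assuming NumPy; `grad` is a hypothetical per-batch gradient function, not something from the question itself:

```python
import numpy as np

def mbgd(X, y, grad, theta, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch gradient descent with a fresh permutation of the
    data drawn at every epoch (the i.i.d. permutations in question)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        perm = rng.permutation(n)  # new random permutation each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            theta = theta - lr * grad(X[idx], y[idx], theta)
    return theta
```
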
2 votes
1 answer
100 views

hint on a solved old exam question on a probabilistic methods calculation

In my notes I have a solved question from a previous exam, as follows, in the probabilistic methods section: Example: We have $k$ classes $C_1, C_2,\dots,C_k$ where each $C_i$ has a uniform distribution over $-(2^{...
S. Christin
2 votes
0 answers
71 views

MAP estimation for discriminative models

I am having some trouble understanding MAP estimation for discriminative models. I will use the notation from the first two pages of this paper https://www.microsoft.com/en-us/research/wp-...
francesco bertolotti
1 vote
0 answers
145 views

Logistic Regression | Exercise

I am trying to solve questions 2 to 7 of this exercise sheet. At the moment, I don't know how to answer these questions, but I have some ideas for the second question: Do I have to study ...
Bsh • 11
1 vote
1 answer
2k views

Why does multicollinearity cause the standard errors of the coefficients to go up?

I understand that multicollinearity is a problem because the stronger the correlation, the more difficult it is to change one predictor without changing another, and it becomes difficult for the model ...
Eisen • 223
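
For reference, the standard OLS identity that makes this precise: writing $R_j^2$ for the $R^2$ from regressing predictor $x_j$ on the remaining predictors,

$$\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(1 - R_j^2)\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2},$$

so as collinearity drives $R_j^2 \to 1$, the variance inflation factor $1/(1-R_j^2)$ blows up the standard error of $\hat{\beta}_j$.
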
1 vote
1 answer
153 views

k-NN average distance bound

I need to show the following inequality concerning the k-NN algorithm: $$ $$ The data $S = \{X_1, X_2, \dots, X_n\}$ is split into $\leq k$ parts: $$S_j = \{ X_i \mid i - (j-1) \lfloor \frac{n}{k} \...
wklm • 93
1 vote
1 answer
457 views

Variance of a Random Forest

I'm reading through Introduction to Statistical Learning (ISL) right now and I'm having trouble understanding the variance of a random forest. Does anyone know how this variance is derived? What ...
Eisen • 223
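
The identity in question (given in the random-forest chapter of ESL): for $B$ identically distributed trees $T_b(x)$, each with variance $\sigma^2$ and positive pairwise correlation $\rho$, the variance of their average is

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \frac{1}{B^2}\left(B\sigma^2 + B(B-1)\rho\sigma^2\right) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2,$$

so averaging removes the second term as $B \to \infty$, but the correlated part $\rho\sigma^2$ remains — which is why random forests decorrelate the trees by subsampling features at each split.
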
1 vote
1 answer
341 views

Question about VC-dimension [closed]

Because I'm not used to the theory of VC-dimension, I don't know why these statements hold. Why can an axis-parallel square shatter a set of three points, but not shatter any set of four points?
Ronald • 103
1 vote
0 answers
95 views

Empirical Risk minimization, symmetrization lemma

I have a question related to obtaining uniformly good estimates of error for the class of hypothesis function. The following images are taken from the paper: "The Complexity of Learning According ...
Gantavya Bhatt
1 vote
0 answers
49 views

Getting the coefficients from Partial Least Squares

I am currently reading The Elements of Statistical Learning. In section 3.5.2 (Partial Least Squares), it describes the algorithm: 1. Standardize each $x_j$ to have mean zero and variance one. ...
just_asking123
2 votes
1 answer
5k views

Understanding how EM algorithm actually works for missing data

I am currently studying the EM algorithm for handling missing data in a data set. I understand that the final goal of the EM algorithm is not to impute data, but to calculate the parameters of interest. ...
Albi Toro
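
For reference, the iteration that formalizes this: with observed data $X_{\text{obs}}$ and missing part $Z$, the E-step forms the expected complete-data log-likelihood and the M-step maximizes it,

$$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X_{\text{obs}},\, \theta^{(t)}}\left[\log L(\theta; X_{\text{obs}}, Z)\right], \qquad \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}),$$

so any imputed values are only an intermediate device; the output of interest is the parameter sequence $\theta^{(t)}$.
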
1 vote
1 answer
99 views

Optimal Rule for Rank Loss

In a binary classification setting, the classification risk of a classifier $h$ is defined by $$ R(h) = \mathbb{P}(Y \neq h(X)), $$ where $(X,Y) \sim P$. It is well known that the classifier that ...
WeakLearner • 6,106
0 votes
1 answer
114 views

MAE (mean absolute error)

For objects $x_1,\dots,x_n$ with correct answers $y_1,\dots,y_n$ from $\mathbb{R}$, construct a constant model $a(x)=c$ for the loss function $$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}|y_i-c|.$$ As I understand, I need to ...
GIFT • 321
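
The missing step is a subgradient argument: for $c$ not equal to any $y_i$,

$$\frac{\partial}{\partial c}\sum_{i=1}^{n}|y_i - c| = \#\{i : y_i < c\} - \#\{i : y_i > c\},$$

which is zero exactly when as many $y_i$ lie below $c$ as above it, so the optimal constant is $c = \operatorname{median}(y_1,\dots,y_n)$.
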
1 vote
0 answers
45 views

Different formulations of within-class scatter matrix

If we have a dataset $X= \{x_1,x_2,\dots,x_n\}$ where all the data points are in a $d$-dimensional feature space and there are $2$ classes $c_1$ and $c_2$, for which $n_1$ points from $X$ are for class $c_1$ ...
ankit • 353
1 vote
0 answers
229 views

L^2 Norm and Chi-Squared distributions for a Gaussian Process

I am new to the computer science and ML community. I learn best by doing, which is why I have created a project for myself to help me get to know Gaussian processes and ML in general. I was pointed to ...
Trevor Haas
0 votes
1 answer
76 views

Error for model building

After building a model from historical data to make predictions, for example $\hat{Y}=\beta_0+\beta_1 X$, if I now want to calculate the MSE or $R^2$ value, do I use the same data that was used to create the model to ...
Basil Abu Mallooh
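
A minimal sketch of the usual answer, assuming scikit-learn and synthetic data for illustration: the training-set MSE is optimistic, so one evaluates on a held-out test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.random((200, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Fit on one portion of the data, evaluate on the held-out portion.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
print("train MSE:", mean_squared_error(y_tr, model.predict(X_tr)))
print("test  MSE:", mean_squared_error(y_te, model.predict(X_te)))
print("test  R^2:", r2_score(y_te, model.predict(X_te)))
```
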
2 votes
1 answer
81 views

Finding the conditional probabilities of a latent dirichlet allocation model

Let's say I'm defining a LDA as the following: For each doc $m$: Sample topic probabilities $\theta_m \sim \text{Dirichlet}(\alpha)$ For each word $n$: Sample a topic $z_{mn} \sim \text{Multinomial}(\theta_m)$ ...
Jonathan • 736
1 vote
0 answers
26 views

Questions about polynomial algebras

If $x$ is a vector of dimension 3, say $(x_1,x_2,x_3)^T$, then $S^d$ is an operator such that $$S^d(x)=\left(\begin{matrix} x_2 & x_1 & 0 & 0 & 0 & \dots & 0 & 0 & 0 \\ ...
Phat Cao
0 votes
1 answer
184 views

Maximum Entropy Continuous Distribution

In Pattern Recognition and Machine Learning Ch 1.6, the author derives the distribution which maximises the differential entropy: $$H(\mathbf{x}) = -\int p(\mathbf{x}) \ln p(\mathbf{x})\, d\mathbf{x}$$ ...
tail_recursion
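
A sketch of the derivation in that section: maximizing $-\int p \ln p \, dx$ subject to normalization, a fixed mean $\mu$, and a fixed variance $\sigma^2$ gives the Lagrangian stationarity condition

$$p(x) = \exp\left(-1 + \lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2\right),$$

and solving for the multipliers from the three constraints yields the Gaussian $p(x) = \mathcal{N}(x \mid \mu, \sigma^2)$ — the maximum-entropy continuous distribution for fixed mean and variance.
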
0 votes
1 answer
101 views

How to minimize the KL divergence with respect to fixed parameters?

I read the LDA paper multiple times but I'm having trouble with the following. Let's say I define an LDA model as: For each doc $m$: Sample topic probabilities $\theta_m \sim \text{Dirichlet}(\alpha)$ For ...
Jonathan • 736
1 vote
0 answers
42 views

Visualizing data using vectors

Say there are 10 houses and we have three pieces of information for each of them: area, nbedrooms, and price. I can view this as 10 different vectors in a space with 3 axes, basically 10 arrows ...
randomness312
6 votes
3 answers
758 views

Application of the chain rule to a $3$-layer neural network

Consider the differentiable functions $L^1(x,\theta^1),L^2(x^2,\theta^2),L^3(x^3,\theta^3)$, where every $x^k,\theta^k$ is a real vector, for $k=1,2,3$. Also define $\theta=(\theta^1,\theta^2,\theta^3)...
Lilla • 2,109
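
Assuming the layers are composed as $x^2 = L^1(x,\theta^1)$ and $x^3 = L^2(x^2,\theta^2)$ (the convention the question appears to use), the chain rule for the deepest parameters reads

$$\frac{\partial L^3}{\partial \theta^1} = \frac{\partial L^3}{\partial x^3}\,\frac{\partial L^2}{\partial x^2}\,\frac{\partial L^1}{\partial \theta^1},$$

a product of Jacobians accumulated from the output layer backwards — exactly the backpropagation recursion.
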
1 vote
3 answers
429 views

Application of the chain rule, and some recursion

Consider the differentiable functions $L^1(x,\theta^1),L^2(x^2,\theta^2),...,L^l(x^l,\theta^l)$, where every $x^k,\theta^k$ is a real vector, for $k=1,...,l$. Also define $\theta=(\theta^1,...,\theta^...
Lilla • 2,109
1 vote
0 answers
119 views

Expected Risk in Machine Learning

I am currently working through some Statistical Learning Theory and the following is confusing me. For a fixed learning algorithm $A$ that maps training data $S$ to a function ("prediction") ...
Claudio Moneo
1 vote
1 answer
374 views

How to compute evidence lower bound (ELBO) when the complete log-likelihood is intractable?

As an example, assume that I have data $\mathbf{X}$, unobserved variables $\mathbf{Z}$ and model parameters $\pmb{\alpha}$, $\pmb{\beta}$. I am omitting mention of any variational parameters. ...
JKB • 21
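
For reference, the standard decomposition behind this question: for any variational distribution $q(\mathbf{Z})$,

$$\mathcal{L}(q) = \mathbb{E}_q\!\left[\log p(\mathbf{X}, \mathbf{Z} \mid \pmb{\alpha}, \pmb{\beta})\right] - \mathbb{E}_q\!\left[\log q(\mathbf{Z})\right] = \log p(\mathbf{X} \mid \pmb{\alpha}, \pmb{\beta}) - \mathrm{KL}\!\left(q(\mathbf{Z}) \,\|\, p(\mathbf{Z} \mid \mathbf{X}, \pmb{\alpha}, \pmb{\beta})\right),$$

and when the first expectation is itself intractable, it is typically bounded further (e.g. via Jensen's inequality) or estimated with Monte Carlo samples from $q$.
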
0 votes
1 answer
65 views

Conditional Probability vs Joint Probability

I have a model that predicts the color of clothing items (red, blue, green, etc.). I have another model that predicts the category of the item (shirt, pants, dress, hat, etc.). Given an image, if I run ...
Hellboy • 135
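
The distinction the question turns on: the two model outputs multiply into a joint only under conditional independence given the image,

$$P(\text{color}, \text{category} \mid \text{image}) = P(\text{color} \mid \text{image})\, P(\text{category} \mid \text{image}),$$

which fails when color and category remain dependent even after seeing the image; in that case one needs a factor such as $P(\text{color} \mid \text{category}, \text{image})$ instead.
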
1 vote
0 answers
89 views

Tree Graphs as mappings

By tree we mean a graph $G$ in which any two vertices $v, y$ are connected by a single path. Equivalently, an undirected, connected, acyclic graph. So, there's this thing in statistics called a ...
Pedro Cavalcante
0 votes
0 answers
35 views

How are random forests used to estimate missing data? Literature on the area?

I am exploring imputation techniques to deal with my missing data and I have come across using a random forest to deal with these values. Does anybody have any literature on how this could be ...
123123 • 49
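
One widely used variant of this idea (the missForest approach) is available through scikit-learn's experimental `IterativeImputer` with a random-forest base estimator; a minimal sketch on toy data:

```python
import numpy as np
# IterativeImputer is experimental in scikit-learn, hence the enable import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# Each feature with missing entries is regressed on the others with a
# random forest, and the features are imputed round-robin until convergence.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    random_state=0,
)
print(imputer.fit_transform(X))
```
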
1 vote
0 answers
33 views

How to Explain the Basis Functions in Regression Splines

I am going through the Elements of Statistical Learning and am currently working through a chapter on using splines in regression. I have a question about deriving the basis functions for the cubic ...
glawley • 87
2 votes
1 answer
255 views

Is There a Connection Between the Minimum $L_1$ Norm Solution and LASSO?

I am reading a book about sparsity, Statistical Learning with Sparsity: The Lasso and Generalizations. I want to know the relationship between the following two optimization problems: $$\min_{\beta} \| \...
XIONG ZENG
1 vote
1 answer
212 views

Stochastic Gradient Descent for iterated expectation?

Normally, SGD comes up in a context like $$\min_\theta \; \mathbb{E}[f(X, \theta)],$$ where $\theta$ is some parameter, $f$ is a function like $f(X, \theta) = (X-\theta)^2$ (to find the mean), and ...
chausies • 2,230
1 vote
0 answers
39 views

Bound on error probability of 1-D ideal Stoller split

Consider the binary decision rule $g_c: \mathcal{X} \rightarrow \{0, 1\}$ given by: $$g_c(x) = \begin{cases} 1& \text{if } x \geq c\\ 0 & \text{otherwise} \end{cases}$$ Show that the minimum ...
dmh • 3,012
2 votes
1 answer
388 views

Bayes classifier for binary decision problem with Reject option

Consider the decision problem where three decisions are valid: $0, 1$ and a third option $reject$. An optimal rule has the lowest probability of error at a fixed "reject" probability. More ...
dmh • 3,012
1 vote
1 answer
82 views

Soft-EM: E-step for fitting mixed linear regression model

I want to derive the formulas for the soft EM algorithm for the following model $P[y_i | x_i, \pi_{1,\dots,m}, a_{1,\dots,m}] = \sum_{j=1}^m \pi_j \frac{1}{\sqrt{2\pi}\sigma} \exp(-\frac{(a_j^T x_i - ...
qwipo • 49
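
For this mixture of linear regressions, the E-step is the usual responsibility computation: given the current parameters, the posterior probability that observation $i$ belongs to component $j$ is

$$\gamma_{ij} = \frac{\pi_j \exp\!\left(-\frac{(a_j^T x_i - y_i)^2}{2\sigma^2}\right)}{\sum_{k=1}^{m} \pi_k \exp\!\left(-\frac{(a_k^T x_i - y_i)^2}{2\sigma^2}\right)},$$

and the M-step then maximizes the $\gamma$-weighted complete-data log-likelihood (weighted least squares for each $a_j$, averages of the $\gamma_{ij}$ for the $\pi_j$).
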
1 vote
0 answers
142 views

Clarification on Likelihood and Maximum Likelihood Estimation (MLE) Notation; PLUS a solution for taking into account the uncertainty of data points

Having read a significant number of papers related to probabilistic methods of machine learning, some of the notation around MLE is still vague to me. So I decided to ask this question once and for all ...
sorooshi
1 vote
1 answer
65 views

Tightness of bound on true risk in the simplest optimistic case

Vapnik (Statistical Learning Theory) describes the "simplest optimistic" case of learning with empirical risk minimization as the case where at least one of the functions we are selecting ...
dmh • 3,012
0 votes
1 answer
144 views

Bayes interpretation of regularization in linear regression

I am deriving L2 regularization by considering Bayes' theorem. In doing so I came across the following article, which stated that the parameter $\theta$ has a probability distribution that ...
cojoye • 47
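
The derivation in one line, assuming a Gaussian likelihood with noise variance $\sigma^2$ and a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\left[\log p(y \mid X, \theta) + \log p(\theta)\right] = \arg\min_{\theta}\left[\|y - X\theta\|_2^2 + \lambda\|\theta\|_2^2\right], \qquad \lambda = \frac{\sigma^2}{\tau^2},$$

so L2 regularization is exactly MAP estimation under that Gaussian prior on the parameters.
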
1 vote
0 answers
58 views

High-probability bounds using pseudo-dimension or Rademacher complexity

Let $F$ be a set of functions mapping $\mathbb{R}^n$ to $[0,1]$ with pseudo-dimension $d$ and let $D$ be a distribution over $\mathbb{R}^n \times [0,1]$. We know that for any $\epsilon, \delta \in (0,...
EMV • 139
0 votes
0 answers
37 views

Why do we use parameters in linear regression for multiple variables problems?

I am new to machine learning and I am confused by the use of parameters in linear regression for multiple variables. I do understand the hypothesis function's parameters (I know that $\theta_0$ is the ...
ParPari
1 vote
0 answers
64 views

Derivation of the M-step in the Gaussian EM algorithm (maximizing a given function over 2 parameters)

Question: The expectation-maximization algorithm is an alternating algorithm, meaning that it alternates between the M-step (maximization step) and the E-step (expectation step). My question is how ...
user4933 • 317
1 vote
0 answers
55 views

Active vs Passive statistical learning: How do we say which one is better?

In statistical learning theory, to pose a regression/classification problem, one starts by selecting a set of points $\{X_1, \dots, X_n\}$; then one labels these to get a dataset $S=\{(X_1,Y_1), \dots, ...
Saleh • 649
0 votes
1 answer
39 views

Linear regression ML notation

Could somebody explain what this notation means: $$\min_w \|Xw - y\|^2_2$$ P.S. Linear regression is described this way in Machine Learning.
Adolf Miszka
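
A minimal sketch of what the notation asks for, assuming NumPy: $\min_w \|Xw - y\|_2^2$ is the ordinary least-squares problem, which `np.linalg.lstsq` solves directly.

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # first column of ones acts as the intercept
y = np.array([1.0, 2.0, 2.5])

# Returns the w minimizing the squared Euclidean norm ||Xw - y||_2^2.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w)
```
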
1 vote
2 answers
218 views

Solution for $\beta$ in ridge regression

The RSS of the ridge regression in matrix form is: $$\mathrm{RSS}(\lambda) = (y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$$ the ridge regression solutions are easily seen to be $$\beta_{\mathrm{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$$ See ...
Trajan • 5,244
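
The missing step between the two displays: setting the gradient of $\mathrm{RSS}(\lambda)$ with respect to $\beta$ to zero,

$$\frac{\partial\, \mathrm{RSS}(\lambda)}{\partial \beta} = -2X^T(y - X\beta) + 2\lambda\beta = 0 \;\Longrightarrow\; (X^TX + \lambda I)\beta = X^T y \;\Longrightarrow\; \beta_{\mathrm{ridge}} = (X^TX + \lambda I)^{-1}X^T y,$$

where $X^TX + \lambda I$ is invertible for any $\lambda > 0$ because it is positive definite.
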
3 votes
2 answers
397 views

What is the expected cost of using LDA?

Suppose that you observe $(X_1,Y_1),\dots,(X_{100},Y_{100})$, which you assume to be i.i.d. copies of a random pair $(X,Y)$ taking values in $\mathbb{R}^2 \times \{1,2\}$. I have that the cost of ...
user
2 votes
0 answers
34 views

Deriving hyperparameter updates in Online Interactive Collaborative Filtering

I've been going through "Online Interactive Collaborative Filtering Using Multi-Armed Bandit with Dependent Arms" by Wang et al. and am unable to understand how the update equations for the ...
Shashank Gupta
0 votes
0 answers
486 views

Maximum KL-divergence between two discrete distributions with non-zero mass on each point of support.

Suppose we are given a discrete probability distribution $p$ defined over a finite set $\mathcal{S}$. We have $p(s) > 0, \forall s\in\mathcal{S}$. Suppose we now want to find the distribution $q$ ...
Brian • 113
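
A minimal sketch, assuming NumPy, of the quantity being maximized: $\mathrm{KL}(p \,\|\, q) = \sum_{s} p(s)\log\frac{p(s)}{q(s)}$, which grows without bound as any $q(s) \to 0$ — which is why the question restricts $q$ to strictly positive mass.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with full support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.98, 0.01, 0.01])
print(kl_divergence(p, q))  # increases as q concentrates its mass
```
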
1 vote
0 answers
123 views

Why (multi marginal) optimal transport?

I recently learned about optimal transport (OT) and its generalization to comparing multiple distributions jointly, called multi-marginal optimal transport (MMOT). In a nutshell, OT does $ \...
SaganTheSag
0 votes
1 answer
723 views

how to find an equation representing a decision boundary in logistic regression

I'm new to machine learning and currently working on logistic regression, but I don't know how to deal with this problem. Let us consider logistic regression for a dataset $(x_n,y_n)\ (x_i \in \mathbb ...
J.Maisel
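
For reference, the standard fact this question needs: with $p(y = 1 \mid x) = \sigma(w^T x + b)$ and $\sigma$ the logistic sigmoid, the decision boundary is the set where the two classes are equally probable,

$$\sigma(w^T x + b) = \tfrac{1}{2} \iff w^T x + b = 0,$$

i.e. a hyperplane in feature space (a straight line when $x \in \mathbb{R}^2$).
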
1 vote
1 answer
69 views

Marginalizing the product of $p(x|z,\mu)$ and $p(z|\pi)$

Consider now $n$ i.i.d. observations of the vector data $(x_1,\dots,x_n)$. Using the pdf, we can write the log-likelihood expression: $$\ell(\boldsymbol x)=\sum_{i=1}^n \ln\!\left(\sum_{k=1}^K \pi_k\, p(x_i|\mu_k)\right)$$ ...
Eilysh Mucha
1 vote
0 answers
24 views

Consistency of regression function estimate

This question is in the context of regression with squared error loss. Let $(X,Y) \in \mathbb{R}^p\times\mathbb{R}$ be random variables with joint distribution ${F}$. We randomly sample a training ...
tygaking
