All Questions
Tagged with statistics machine-learning
659
questions
3
votes
0
answers
110
views
Generating a sequence of i.i.d. permutations from a single uniform random variable
I am learning how the mini-batch gradient descent (MBGD) algorithm works, and I came across one thing that I find a bit weird and don't know how to show. In the MBGD algorithm we have a loop of $N$...
2
votes
1
answer
100
views
Hint on a solved old exam question on probabilistic methods calculation
In my notes I have a solved question from a previous exam, in the probabilistic methods section:
Example: We have $k$ classes $C_1, C_2,...,C_k$ where each $C_i$ has uniform distribution over
$-(2^{...
2
votes
0
answers
71
views
MAP estimation for discriminative models
I have some problems in understanding the MAP estimation for discriminative models.
I will use the notation used in the very first two pages of this paper https://www.microsoft.com/en-us/research/wp-...
1
vote
0
answers
145
views
Logistic Regression | Exercise
I am trying to solve questions 2 to 7 of this exercise sheet.
At the moment, I don't know how to answer these questions.
But I have some ideas for the second question :
Do I have to study ...
1
vote
1
answer
2k
views
Why does multicollinearity cause the standard errors of the coefficients to go up?
I understand that multicollinearity is a problem because the stronger the correlation, the more difficult it is to change one predictor without changing another and it becomes difficult for the model ...
1
vote
1
answer
153
views
k-NN average distance bound
I need to show the following inequality concerning the k-NN algorithm:
$$
$$
Data $ S = \{X_1, X_2, ..., X_n\} $ is split in the $ \leq k$ parts:
$$
S_j = \{ X_i L i - (j-1) \lfloor \frac{n}{k} \...
1
vote
1
answer
457
views
Variance of a Random Forest
I'm reading through Introduction to Statistical Learning (ISL) right now and I'm having trouble with understanding the variance of a random forest. Does anyone know how this variance is derived? What ...
1
vote
1
answer
341
views
Question about VC-dimension [closed]
Because I'm not used to the theory of VC-dimension, I don't know why these statements hold.
Why can an axis-parallel square shatter a set of three points,
but not any set of four points?
1
vote
0
answers
95
views
Empirical Risk minimization, symmetrization lemma
I have a question related to obtaining uniformly good estimates of error for the class of hypothesis function. The following images are taken from the paper: "The Complexity of Learning According ...
1
vote
0
answers
49
views
Getting the coefficients from Partial Least Squares
I am currently reading The Elements of Statistical Learning. Section 3.5.2 (Partial Least Squares) describes the algorithm:
$1.$ Standardize each $x_j$ to have mean zero and variance one. ...
2
votes
1
answer
5k
views
Understanding how the EM algorithm actually works for missing data
I am currently studying the EM algorithm for handling missing data in a data set. I understand that the final goal of the EM algorithm is not to impute data, but to calculate the parameters of interest. ...
1
vote
1
answer
99
views
Optimal Rule for Rank Loss
In a binary classification setting, the classification risk of a classifier $h$ is defined by
$$
R(h) = \mathbb{P}(Y \neq h(X)),
$$
where $(X,Y) \sim P$. It is well known that the classifier that ...
0
votes
1
answer
114
views
MAE (mean absolute error)
For objects $x_1,..., x_n$ with correct answers $y_1,...,y_n$ from $\mathbb{R}$, construct a constant model $a(x)=c$ for the loss function
$$MAE=\frac{1}{n}\sum_{i=1}^{n}|y_i-c|$$
As I understand, I need to ...
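For the constant model $a(x)=c$ above, a standard fact is that any median of $y_1,...,y_n$ minimizes the MAE. The following is a minimal numerical check of that fact, not a derivation; the sample values and the helper name `mae` are made up for illustration.

```python
from statistics import median

def mae(ys, c):
    """Mean absolute error of the constant prediction a(x) = c."""
    return sum(abs(y - c) for y in ys) / len(ys)

# Illustrative targets (made up for this sketch, not from the question).
ys = [1.0, 2.0, 2.5, 7.0, 40.0]

c_star = median(ys)  # candidate minimizer: the sample median

# Compare against a grid of alternative constants in [0, 40].
grid = [i / 10 for i in range(0, 401)]
best_on_grid = min(grid, key=lambda c: mae(ys, c))

print(c_star, best_on_grid, mae(ys, c_star))
```

Note that the outlier 40.0 barely affects the minimizer, which is one reason the median (rather than the mean) appears for absolute loss.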
1
vote
0
answers
45
views
Different formulations of within-class scatter matrix
If we have a dataset $X= \{x_1,x_2,....,x_n\}$ where all the datapoints are in $d$-dimensional feature space and there are $2$ classes $c_1$ and $c_2$ for which $n_1$ points from $X$ are for class $c_1$ ...
1
vote
0
answers
229
views
L^2 Norm and Chi-Squared distributions for a Gaussian Process
I am new to the computer science and ML community. I learn best by doing, which is why I have created a project for myself to help me get to know Gaussian processes and ML in general. I was pointed to ...
0
votes
1
answer
76
views
Error for model building
Suppose I have built a model from historical data to make predictions, for example
$\hat{Y}=B_0+B_1X$
and now I want to calculate the MSE or $R^2$ value. Do I use the same data that was used to create the model to ...
2
votes
1
answer
81
views
Finding the conditional probabilities of a latent dirichlet allocation model
Let's say I'm defining an LDA model as follows:
For each doc $m$:
Sample topic probabilities $\theta_m \sim Dirichlet(\alpha)$
For each word $n$:
Sample a topic $z_{mn} \sim Multinomial(\theta_m)$
...
1
vote
0
answers
26
views
Questions about polynomial algebras
If $x$ is a vector of dimension 3, say $(x_1,x_2,x_3)^T$, then $S^d$ is an operator such that
$$ S^d(x)=\left(
\begin{matrix}
x_2 & x_1 & 0 &0 & 0& ...& 0 &0&0\\
...
0
votes
1
answer
184
views
Maximum Entropy Continuous Distribution
In Pattern Recognition and Machine Learning Ch 1.6, the author derives the distribution which maximises the differential entropy:
$$H(\textbf{x}) = -\int p(\textbf{x}) \ln p(\textbf{x}) \, d\textbf{x}$$
...
0
votes
1
answer
101
views
How to minimize the KL divergence with respect to fixed parameters?
I read the LDA paper multiple times but I'm having trouble with the following. Let's say I define an LDA model as:
For each doc $m$:
Sample topic probabilities $\theta_m \sim Dirichlet(\alpha)$
For ...
1
vote
0
answers
42
views
Visualizing data using vectors
Say there are 10 houses and we have three pieces of information for each of them: area, nbedrooms, price.
I can view this as 10 different vectors in space where there are 3 axes. Basically 10 arrows ...
6
votes
3
answers
758
views
Application of the chain rule to a $3$-layer neural network
Consider the differentiable functions $L^1(x,\theta^1),L^2(x^2,\theta^2),L^3(x^3,\theta^3)$, where every $x_k,\theta^k$ are real vectors, for $k=1,2,3$. Also define $\theta=(\theta^1,\theta^2,\theta^3)...
1
vote
3
answers
429
views
Application of chain rule, and some recursion
Consider the differentiable functions $L^1(x,\theta^1),L^2(x^2,\theta^2),...,L^l(x^l,\theta^l)$, where every $x_k,\theta^k$ are real vectors, for $k=1,...,l$. Also define $\theta=(\theta^1,...,\theta^...
1
vote
0
answers
119
views
Expected Risk in Machine Learning
I am currently working through some Statistical Learning Theory and the following is confusing me.
For a fixed learning algorithm $A$ that maps training data $S$ to a function ("prediction") ...
1
vote
1
answer
374
views
How to compute evidence lower bound (ELBO) when the complete log-likelihood is intractable?
As an example, assume that I have data $\mathbf{X}$, unobserved variables $\mathbf{Z}$ and model parameters $\pmb{\alpha}$, $\pmb{\beta}$. I am omitting any mention of variational parameters. ...
0
votes
1
answer
65
views
Conditional Probability vs Joint Probability
I have a model that predicts the color of clothing items (red, blue, green, etc). I have another model that predicts the category of the item (shirt, pants, dress, hat, etc). Given an image, if I run ...
1
vote
0
answers
89
views
Tree Graphs as mappings
By tree we mean a graph $G$ in which any two vertices $v, y$ are connected by a single path. Equivalently, an undirected, connected, acyclical graph.
So, there's this thing in statistics called a ...
0
votes
0
answers
35
views
How are random forests used to estimate missing data? Literature on the area?
I am exploring imputation techniques to deal with my missing data and I have come across using a random forest to deal with these values. Does anybody have any literature on how this could be ...
1
vote
0
answers
33
views
How to Explain the Basis Functions in Regression Splines
I am going through the Elements of Statistical Learning and am currently working through a chapter on using splines in regression. I have a question about deriving the basis functions for the cubic ...
2
votes
1
answer
255
views
Is There a Connection Between Minimum $ {L}_{1} $ Norm Solution and LASSO?
I am reading a book about sparsity, Statistical Learning with Sparsity: The Lasso and Generalizations. I want to know the relationship between the following two optimization problems: $$\min_{\beta} \| \...
1
vote
1
answer
212
views
Stochastic Gradient Descent for iterated expectation?
Normally, SGD comes up in contexts like $$\min_\theta ~ \mathbb{E}(f(X, \theta)),$$ where $\theta$ is some parameter, $f$ is a function like $f(X, \theta) = (X-\theta)^2$ (to find the mean), and ...
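For the example $f(X, \theta) = (X-\theta)^2$ mentioned above, the stochastic gradient at a single sample $x$ is $2(\theta - x)$. The sketch below is a minimal illustration of plain SGD on that objective only (it does not address the iterated-expectation case); the Gaussian samples, the seed, the function name `sgd_mean`, and the step size $1/(2t)$ are all assumptions of the sketch. With that step size each iterate coincides with the running sample mean.

```python
import random

def sgd_mean(samples, theta0=0.0):
    """SGD on E[(X - theta)^2] using one sample per step."""
    theta = theta0
    for t, x in enumerate(samples, start=1):
        grad = 2.0 * (theta - x)   # gradient of (x - theta)^2 w.r.t. theta
        theta -= grad / (2.0 * t)  # step size 1/(2t), an assumption of this sketch
    return theta

random.seed(0)
samples = [random.gauss(3.0, 1.0) for _ in range(10_000)]

theta_hat = sgd_mean(samples)
sample_mean = sum(samples) / len(samples)

print(theta_hat, sample_mean)
```

The update simplifies to $\theta_t = (1 - 1/t)\,\theta_{t-1} + x_t/t$, which is why the final iterate matches the sample mean up to floating-point error.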
1
vote
0
answers
39
views
Bound on error probability of 1-D ideal Stoller split
Consider the binary decision rule $g_c: \mathcal{X} \rightarrow \{0, 1\}$ given by:
$$g_c(x) =
\begin{cases}
1& \text{if } x \geq c\\
0 & \text{otherwise}
\end{cases}$$
Show that the minimum ...
2
votes
1
answer
388
views
Bayes classifier for binary decision problem with Reject option
Consider the decision problem where three decisions are valid: $0, 1$ and a third option $reject$. An optimal rule has the lowest probability of error at a fixed "reject" probability. More ...
1
vote
1
answer
82
views
Soft-EM: E-step for fitting mixed linear regression model
I want to derive the formulas for the soft EM algorithm for the following model $P[y_i | x_i, \pi_{1,\dots,m}, a_{1,\dots,m}] = \sum_{j=1}^m \pi_j \frac{1}{\sqrt{2\pi}\sigma} exp(-\frac{(a_j^T x_i - ...
1
vote
0
answers
142
views
Clarification on Likelihood and Maximum Likelihood Estimation (MLE) Notation; PLUS a solution for taking into account the uncertainty of data points
Upon reading a significant number of papers related to probabilistic methods of Machine Learning, some of the notation about MLE is still vague to me. So I decided to ask this question once and for all ...
1
vote
1
answer
65
views
Tightness of bound on true risk in the simplest optimistic case
Vapnik (Statistical Learning Theory) describes the "simplest optimistic" case of learning with empirical risk minimization as the case where at least one of the functions we are selecting ...
0
votes
1
answer
144
views
Bayes interpretation of regularization in linear regression
I am deriving L2 regularization by considering Bayes theorem. In doing so I came across the following article which stated that the probability of a parameter theta has a probability distribution that ...
1
vote
0
answers
58
views
High-probability bounds using pseudo-dimension or Rademacher complexity
Let $F$ be a set of functions mapping $\mathbb{R}^n$ to $[0,1]$ with pseudo-dimension $d$ and let $D$ be a distribution over $\mathbb{R}^n \times [0,1]$. We know that for any $\epsilon, \delta \in (0,...
0
votes
0
answers
37
views
Why do we use parameters in linear regression for multiple variables problems?
I am new to machine learning and I am confused with the use of parameters in linear regression for multiple variables. I do understand the hypothesis function's parameters (I know that theta 0 is the ...
1
vote
0
answers
64
views
Derivation of M step in Gaussian EM-Algorithm (Maximize a given function for 2 parameters).
Question
The Expectation-maximization algorithm is an alternating algorithm. This means that it alternates between the M-step (maximization step) and the E-step (expectation step). My question is how ...
1
vote
0
answers
55
views
Active vs Passive statistical learning: How do we say which one is better?
In statistical learning theory, to pose a regression/classification problem, one starts by selecting a set of points $\{X_1, \dots, X_n\}$ then, one labels this to get a dataset $S=\{(X_1,Y_1), \dots, ...
0
votes
1
answer
39
views
Linear regression ML notation
Could somebody explain what this notation means:
$$\min_w ||Xw - y||^2_2$$
P.S. Linear regression is described this way in Machine Learning.
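The expression asks for the weight vector $w$ that minimizes the squared Euclidean norm of the residual $Xw - y$, that is, the sum of squared differences between the predictions $Xw$ and the targets $y$. A minimal Python sketch of evaluating that objective follows; the toy data and the helper name `squared_residual_norm` are made up for illustration.

```python
def squared_residual_norm(X, w, y):
    """Return ||Xw - y||_2^2 for a list-of-rows matrix X."""
    total = 0.0
    for row, target in zip(X, y):
        prediction = sum(x_ij * w_j for x_ij, w_j in zip(row, w))
        total += (prediction - target) ** 2
    return total

# Toy data generated by y = 1 + 2*x (first column of X is the intercept).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

loss_exact = squared_residual_norm(X, [1.0, 2.0], y)  # weights that fit exactly
loss_other = squared_residual_norm(X, [0.0, 2.0], y)  # a worse choice of w

print(loss_exact, loss_other)
```

The $\min_w$ in front of the norm means: among all candidate vectors such as the two tried here, pick the one with the smallest value of this quantity.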
1
vote
2
answers
218
views
Solution for $\beta$ in ridge regression
The RSS of the ridge regression in matrix form is:
$$RSS(\lambda) = (y-X\beta)^T(y-X\beta) + \lambda\beta^T\beta$$
the ridge regression solutions are easily seen to be
$$\beta_{ridge} = (X^TX+\lambda I)^{-1}X^Ty$$
See ...
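One way to see the closed form is to set the gradient of $RSS(\lambda)$ with respect to $\beta$ to zero, which gives $-2X^T(y - X\beta) + 2\lambda\beta = 0$ and hence $(X^TX + \lambda I)\beta = X^Ty$. The sketch below checks this numerically with NumPy; the synthetic data, the seed, the value of `lam`, and the function name `ridge_solution` are assumptions of the sketch.

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y without forming an explicit inverse."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data, made up for this sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 0.7

beta = ridge_solution(X, y, lam)

# Gradient of RSS(lambda) at beta: -2 X^T (y - X beta) + 2 lam beta.
gradient = -2 * X.T @ (y - X @ beta) + 2 * lam * beta

print(beta, np.abs(gradient).max())
```

Using `np.linalg.solve` instead of computing $(X^TX+\lambda I)^{-1}$ explicitly is a numerical-stability choice; both correspond to the same formula.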
3
votes
2
answers
397
views
What is the expected cost of using LDA?
Suppose that you observe $(X_1,Y_1),...,(X_{100},Y_{100})$, which you assume to be i.i.d. copies of a random pair $(X,Y)$ taking values in $\mathbb{R}^2 \times \{1,2\}$.
I have that the cost of ...
2
votes
0
answers
34
views
Deriving hyperparameter updates in Online Interactive Collaborative Filtering
I've been going through "Online Interactive Collaborative Filtering Using Multi-Armed Bandit with Dependent Arms" by Wang et al. and am unable to understand how the update equations for the ...
0
votes
0
answers
486
views
Maximum KL-divergence between two discrete distributions with non-zero mass on each point of support.
Suppose we are given a discrete probability distribution $p$ defined over a finite set $\mathcal{S}$. We have $p(s) > 0, \forall s\in\mathcal{S}$. Suppose we now want to find the distribution $q$ ...
1
vote
0
answers
123
views
Why (multi marginal) optimal transport?
I recently learned about optimal transport (OT) and its generalization to comparing multiple distributions jointly, called multi-marginal optimal transport (MMOT).
In a nutshell, the OT does
$ \...
0
votes
1
answer
723
views
How to find an equation representing a decision boundary in logistic regression
I'm new to machine learning and currently working on logistic regression, but I don't know how to deal with this problem. Let us consider the logistic regression for a dataset $(x_n,y_n)\ (x_i \in \mathbb ...
1
vote
1
answer
69
views
Marginalizing the product of p(x|z,μ) and p(z|π)
Consider now $n$ i.i.d. observations of the vector data $({x_1,...,x_n}).$ Using the pdf, we can write the log-likelihood expression:
$$l(\boldsymbol x)=\sum_{i=1}^n\ln\left(\sum_{k=1}^K\pi_k p(x_i|\mu_k)\right)$$
...
1
vote
0
answers
24
views
Consistency of regression function estimate
This question is in the context of regression with squared error loss.
Let $(X,Y) \in \mathbb{R}^p\times\mathbb{R}$ be random variables with joint distribution ${F}$. We randomly sample a training ...