All Questions
Tagged with statistics machine-learning
660
questions
1
vote
1
answer
103
views
Generalization in Neural Networks: Can one Impose Conditions on the Data?
There is a well-developed theory of generalization bounds for deep neural networks, based on VC dimension and Rademacher complexity. These bounds hold for any underlying "true" distribution
$\...
1
vote
2
answers
441
views
The meaning of "with probability at least $1-\delta$"
In the theoretical analysis of some algorithm for stochastic optimization, we often need to prove that something like
$$\text{error}\leq\epsilon \quad (\text{under some conditions})$$
holds with probability at least $...
0
votes
1
answer
314
views
What statistics books would you recommend for an undergraduate student who wants to be a machine learning engineer?
I'm an undergraduate software engineering student and I will be taking a statistics course this semester. I was thinking of buying a statistics textbook such as Probability and Statistics for ...
2
votes
1
answer
63
views
Probabilistic interpretation of linear regression: implication step
I am reading Andrew Ng's notes on linear regression, and in this section, he attempts to derive the least-squares formula using a probabilistic approach: http://cs229.stanford.edu/summer2020/...
1
vote
0
answers
68
views
Minimizing the variance in a variant of bagging: Weighted Aggregation (Wagging)
In our machine learning course we learned about bagging, and a variant of bagging called Weighted Aggregation was introduced, in which the result is a weighted sum of all the estimators instead of ...
1
vote
0
answers
207
views
Phased version of the Upper Confidence Bound (UCB) algorithm
I am interested in an exercise (specifically Exercise 7.4) from Bandit Algorithms by Tor Lattimore and Csaba Szepesvari. It studies a phased version of the popular UCB algorithm. The algorithm takes ...
0
votes
1
answer
43
views
Matrix division - Bias-Variance Tradeoff
I am currently trying to prove that the ordinary least squares estimate is unbiased for a given dataset
with the bias given as
Why does this identity hold in the following calculation $$(X^...
1
vote
0
answers
41
views
Theoretical Machine Learning: How to calculate the expected risk of a model with unknown distribution $\hat{h}$?
Suppose we have fixed, deterministic feature vectors $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ with an unknown model parameter $\theta^*$ and an error $z \sim N(0,\sigma^2)$. For the feature vector $x_i$ ...
0
votes
2
answers
350
views
Linear Regression: Correlation between predictors and residuals
I am reading Chapter 3 from Elements of Statistical Learning. In the explanation for Forward Stagewise Regression and Least Angle Regression, the authors explain that reducing the correlation between ...
1
vote
0
answers
82
views
What are some functions $\mathbb{R}^+ \to \mathbb{R}$ other than $\log$?
I am interested in functions $f: \mathbb{R}^+ \to \mathbb{R}$, for the purpose of mapping non-negative statistical features of objects (such as lengths) to the whole real line. Then, I intend to use ...
0
votes
0
answers
168
views
Why can I move the summation sign down?
Hi, I'm trying to teach myself machine learning by working through the book "An Introduction to Statistical Learning", and I got stuck on one of the exercise questions.
In the attached image ...
3
votes
1
answer
396
views
The relation between Bregman divergence and KL divergence
I see that the Bregman divergence is defined as $d_\phi(x,y)=\phi(x)-\phi(y)-\langle x-y,\nabla\phi(y)\rangle$, where $x,y\in \mathbb{R}^d$ and $\phi$ is a strictly convex function.
KL divergence is an instance of ...
2
votes
1
answer
105
views
When is it true that $\sup g - \inf g \le 2\sup g$?
I am currently reading the first version of the paper titled "On the Margin Theory of Feedforward Neural Networks" by Colin Wei, Jason D. Lee, Qiang Liu and Tengyu Ma.
In Lemma C.4 of the ...
0
votes
0
answers
60
views
Choosing a loss function to minimize the total sum
I have the following regression problem. I'm forecasting a random variable $X$ for every day of a month, represented as $X_{ij}$ where $i$ is the day of the month and $j$ is the month number. I care if ...
1
vote
0
answers
69
views
Request for reference: uniform convergence for non 0-1 loss functions
In the book "Understanding machine learning", there is Theorem 6.11 with the following statement
Let ${\cal H}$ be a class and let $\tau_{\cal H}$ be its growth function. Then, for every $\...
1
vote
0
answers
13
views
Data Groupings and Bayesian Analysis for a Generative Model
I have a simple question that I think will have potentially many solutions, depending on the level of complexity with which one wants to approach it.
I've built a generative model with two variables ...
1
vote
1
answer
1k
views
How do I compute the derivative of the cross-entropy loss $H(P,Q)$ with respect to the weights $W$?
I'm trying to understand the cross-entropy loss with the Iris dataset for binary classification, where $y=1$ denotes that the plant belongs to Setosa and $y=0$ denotes that it belongs to Non-Setosa.
Consider ...
1
vote
0
answers
212
views
Optimization: max to softmax for convexity?
Assume we have the following optimization problem: for a family of $m$ vectors $\{x_i\}\in \mathbb{R}^n$, a family of $l$ vectors $\{c_i\}\in \mathbb{R}^n$ with $l\ll m$ and for a family of $l$ ...
1
vote
0
answers
22
views
Perplexities about Bayesian inference and model averaging (BMA)
While reading about the Bayesian approach to model selection, I was wondering about the more mathematical meaning of Bayesian model averaging.
Say for example that we are given a dataset $\mathcal{D} = \{\...
1
vote
1
answer
479
views
How to verify whether a metric is of negative type or not?
A metric $d(\cdot,\cdot)$ on a space $S$ is said to be of negative type if, for all $n \geq 2$, $z_{1}, \ldots, z_{n} \in S$, and $\alpha_{1}, \ldots, \alpha_{n} \in \mathbb{R}$
with $\sum\limits_{i=...
2
votes
0
answers
47
views
Measurability issues in the symmetrization step in the proof of $\varepsilon$-sample theorem
Let $(\mathcal{X},d)$ be a metric space and $\mu$ be a Borel probability measure on $(\mathcal{X},d)$. Let $m \in \mathbb{N}$ and define the two probability product measures $\mu^{m} := \otimes_{k=1}^{...
1
vote
0
answers
132
views
Empirical Fisher Information but with unknown true parameters and distribution?
I am not sure if I am asking this correctly. I am working on using Fisher information to examine the information in a model (say, a neural network, for simplicity).
What I know is that the definition of Fisher ...
1
vote
0
answers
51
views
Deriving the regularization term in Bayesian lasso regression
The title is probably not very good; I thought hard about how to phrase this correctly. I'd be grateful if someone tells me it's wrong and how to correct it.
I am practicing for my exams and I have ...
1
vote
0
answers
19
views
How to obtain the parameter update for the multiclass classification (general loss and activation function)?
Consider the feature space $\mathcal{X}=\mathbb{R}^{d}$ and $\mathcal{Y}=\{1,\dots,c\}$ such that $c > 2$. We consider some activation function $\alpha: \mathbb{R}^{c} \to \mathbb{R}^{c}$ and out weight ...
6
votes
2
answers
685
views
BFGS Formula from Kullback-Leibler Divergence
On page 411 in this book, the authors give the following BFGS formula $$ \boxed{\boldsymbol C_{\textrm{BFGS}} = \boldsymbol C + \underbrace{\frac{\boldsymbol g^\top\boldsymbol\delta+\boldsymbol g^\top\...
2
votes
0
answers
75
views
Generative model evaluation metric: Precision & Recall
In this paper, a new metric was proposed to evaluate generative models.
Equation (1) decomposes the generative distribution and the real distribution into two parts w.r.t. their intersection of the ...
0
votes
2
answers
190
views
Bayes classifier: handling conditional expectation / probability
I am learning about the Bayes optimal classifier, and there is a step in a proof I struggle with. One can find this proof also on the Wikipedia page: https://en.wikipedia.org/wiki/Bayes_classifier#...
0
votes
0
answers
36
views
Which standard deviation for model averaging?
I hope Math Stack Exchange is the right place for this question, even though it comes from an AI point of view.
Say I have a machine learning model and for robustness of results, I initialize it with ...
2
votes
1
answer
158
views
Definition of Ergodicity in Theodoridis' Machine Learning
This is related to, but not the same as, https://stats.stackexchange.com/questions/319190/wide-sense-stationary-but-not-ergodic.
Note that I am not assuming stationarity. Theodoridis, in his Machine ...
2
votes
0
answers
88
views
Understanding the $\alpha$-regularity assumption for trees
In this paper, Definition 4 states that a tree grown by recursive partitioning is $\alpha$-regular for some $\alpha>0$ if each split leaves at least a fraction $\alpha$ of the available training ...
2
votes
0
answers
57
views
EM algorithm for the maximum of two normal distributions
Let $X_i \sim N(\mu_1,\sigma^2), Y_i \sim N(\mu_2,\sigma^2)$
$O_i = \max(X_i,Y_i)$
I need to find $\mu_1, \mu_2$ using EM.
My attempt:
First I defined $Z_i = \left\{\begin{matrix}
1, ~ X_i \ge Y_i
\\
0,...
0
votes
1
answer
36
views
Optimizing recursive functions with time series data
I have a recursive function, $f(0,a)$ is known, $f(t+1;a,b)=f(t;a,b)+g(t,b)$ where $a,b$ are constants and $g$ is a function. I also have a sequence of data, $D(t)$. I am trying to optimize $f$ with ...
2
votes
1
answer
532
views
Derivation of the bias-variance tradeoff
I'm having trouble understanding the derivation of the bias-variance tradeoff which is also given in the article on the mean squared error.
Let some data be represented by the random variable $X$ with ...
0
votes
1
answer
66
views
How to prove total test error is independent of the selected learning algorithm
I'm looking at the following proof:
Where:
Note: I'm new to this but I think I understand all the below variables correctly now. Mistakes are possible though.
$f$ is an ideal function with perfect ...
1
vote
0
answers
140
views
Is a Chi-Squared goodness of fit test appropriate for Neural Network regression?
I have always wished that neural network regression gave more interpretable results, and I'm pretty hopeful that chi-squared tests anchor these MSE values in the same way that accuracy anchors ...
0
votes
1
answer
305
views
Support Vector Machine Optimization Problem
The formulation of the SVM optimization problem is:
\begin{equation}
\begin{aligned}
& \max_{w,b} \frac{1}{\|w\|} \\
& \text{ subject to } \\
& y_i(w^{T}x_i+b) \geq 1
\end{aligned}
\end{...
1
vote
0
answers
156
views
Distance weighted uniform sampling - sampling procedure
In the excellent paper Wu, Chao-Yuan; Manmatha, R.; Smola, Alexander J.; Krähenbühl, Philipp (2017): Sampling Matters in Deep Embedding Learning. Available online at https://arxiv.org/pdf/1706.07567.
...
5
votes
0
answers
79
views
Conditional Bias Variance Decomposition
The standard bias variance decomposition says that:
$$
E |f(X) - Y|^2 = \int_{\mathbb{R}^d} |f(x) - m(x)|^2 \mu(dx) + E|m(X) - Y|^2,
$$
where $\mu$ is some distribution over $X$. I am trying to ...
0
votes
1
answer
69
views
Reconstruction error for PCA not equal to zero?
I am working on an assignment for school where they ask us to perform PCA on a data set consisting of 500 data points where each data point is of dimension $p=256$. You usually project your ...
0
votes
1
answer
85
views
Find solution for optimal regression coefficients
Consider the cost function $E(\mathbf{w}) = \displaystyle\frac{1}{2} \sum_{i=1}^{N}{(\mathbf{d}_i - \mathbf{x}_i^T\mathbf{w})^2} + \frac{\lambda}{2} \left\lVert \mathbf{w}\right\rVert^2$ where $(\...
1
vote
0
answers
102
views
Two-way ANOVA for machine learning model analysis.
I have different machine learning models and losses, each trained using 5-fold cross-validation. Would it make sense to run a two-way ANOVA to evaluate which ones perform statistically better? ...
6
votes
1
answer
183
views
Does the law of large numbers hold for covering numbers?
I am self-studying empirical process theory.
I have encountered the covering number $N(\delta,\mathcal{G},P)$, as well as the empirical version $N(\delta,\mathcal{G},P_n)$.
It seems intuitive to ...
0
votes
0
answers
28
views
Proof of unbiased estimator
Assume:
$$ \phi = \int f(x)p(x)dx = E_p(f)$$
Let $x_s \sim p$, $s=1,\dots,S$, be i.i.d. (that is, $p(x_s=x)=p(x)$ and $p(x_1,x_2) = p(x_1) p(x_2)$).
\begin{align}
\hat{\phi} &= \frac{1}{S}\sum_{s=1}^{S}f(x_s) \\
E[\...
0
votes
1
answer
144
views
Linear Regression Prediction Errors
Suppose that we perform linear regression on data $\mathbf{X}$ (an $N \times {(D+1)}$ matrix) and predictions $\mathbf{y}$ (an $N \times 1$ vector). Let $\mathbf{w}$ ($(D+1) \times 1$ vector) be the ...
0
votes
1
answer
100
views
Is the logistic regression cost function in scikit-learn different from standard derivations?
I am trying to understand the math behind logistic regression. Going through a couple of websites, lectures and books, I tried to derive the cost function by thinking of it as the negative of the ...
2
votes
1
answer
247
views
In Fisher’s discriminant for multiple classes, how do you proceed when $S_w$ is a singular matrix (so you can't compute $S_w^{-1}$)?
I am trying to use Fisher’s discriminant for multiple classes to reduce the dimension of the MNIST data set, similar to this post: https://towardsdatascience.com/an-illustrative-introduction-to-...
4
votes
1
answer
2k
views
What is the algorithm for the kNN density estimator?
I am reading Pattern Recognition and Machine Learning by Christopher Bishop. In chapter two he talks about using kNN for density estimation. I want to replicate a plot using Python/R/MATLAB. He is ...
0
votes
0
answers
28
views
Need a probabilistic approach to determine if a data-set A includes all the elements of data-set B
My job is to identify whether the two given datasets are the same. This is to be done on computers using some programming language (C++).
Since the data could be huge, I don't want to read all the elements of ...
1
vote
2
answers
91
views
Covariance of a Vector-Valued Random Variable
I'm reading through Andrew Ng's lecture notes for CS229, and he makes the statement that, for a random variable $Z \in \mathbb{R}^{n}$,
\begin{align}
Cov(Z) &= E[(Z - E[Z])(Z - E[Z])^{T}]\\
&...
2
votes
1
answer
3k
views
Intuition behind the exponential loss function
I'm reading about AdaBoost from the book The Elements of Statistical Learning.
The book mentions that, to train the model, the exponential loss function is used:
$$L(y, f(x)) = e^{-y f(x)},$$
where $...