All Questions

1 vote
1 answer
103 views

Generalization in Neural Networks: Can one Impose Conditions on the Data?

There is a well-developed theory on generalization bounds for deep neural networks, using VC dimensions and Rademacher Complexities. They work for any underlying "true" distribution $\...
Claudio Moneo's user avatar
1 vote
2 answers
441 views

The meaning of "with probability at least $1-\delta$"

In the theoretical analysis of some algorithm for stochastic optimization, we often need to prove that something like $$\text{error}\leq\epsilon \quad (\text{under some conditions})$$ holds with probability at least $...
lazyleo's user avatar
  • 73
0 votes
1 answer
314 views

What statistics books would you recommend for an undergraduate student who wants to be a machine learning engineer?

I'm an undergraduate software engineering student and I will be taking a statistics course this semester. I was thinking of buying a statistics textbook such as Probability and Statistics for ...
anıl ateşsaçan's user avatar
2 votes
1 answer
63 views

Probabilistic interpretation of linear regression: implication step

I am reading Andrew Ng's notes on linear regression, and in this section, he attempts to derive the least-squares formula using a probabilistic approach: http://cs229.stanford.edu/summer2020/...
K Split X's user avatar
  • 6,575
1 vote
0 answers
68 views

Minimizing the variance in a variant of bagging: Weighted Aggregation (Wagging)

In our machine learning course we learned about bagging, and a variant of bagging called Weighted Aggregation was introduced, where the result is a weighted sum of all the estimators instead of ...
Sheen's user avatar
  • 11
1 vote
0 answers
207 views

Phased version of Upper-Confidence Bound Algorithm (UCB)

I am interested in an exercise (specifically Exercise 7.4) from Bandit Algorithms by Tor Lattimore and Csaba Szepesvari. It studies a phased version of the popular UCB algorithm. The algorithm takes ...
Brian's user avatar
  • 113
0 votes
1 answer
43 views

Matrix division - Bias Variance Tradeoff

I am currently trying to prove that the ordinary least squares estimate doesn't have a bias with a given dataset with the bias given as Why does this identity hold in the following calculation $$(X^...
christheliz's user avatar
1 vote
0 answers
41 views

Theoretical Machine Learning: How to calculate the expected risk of a model with unknown distribution $\hat{h}$?

Suppose we have fixed, deterministic feature vectors $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$, an unknown model parameter $\theta^*$, and an error $z \sim N(0,\sigma^2)$. For the feature vector $x_i$ ...
christheliz's user avatar
0 votes
2 answers
350 views

Linear Regression: Correlation between predictors and residuals

I am reading Chapter 3 from Elements of Statistical Learning. In the explanation for Forward Stagewise Regression and Least Angle Regression, the authors explain that reducing the correlation between ...
temp_user's user avatar
1 vote
0 answers
82 views

What are some functions $\mathbb{R}^+ \to \mathbb{R}$ other than $\log$?

I am interested in functions $f: \mathbb{R}^+ \to \mathbb{R}$, for the purpose of mapping non-negative statistical features of objects (such as lengths) to the whole real line. Then, I intend to use ...
tapphughesn's user avatar
0 votes
0 answers
168 views

Why can I move the summation sign down?

Hi, I'm trying to teach myself machine learning by going through the book "An Introduction to Statistical Learning", and I got stuck on one of the exercise questions. In the attached image ...
Kenneth .J's user avatar
3 votes
1 answer
396 views

The relation between Bregman divergence and KL divergence

I see that Bregman divergence is defined as $d_\phi(x,y)=\phi(x)-\phi(y)-\langle x-y,\nabla\phi(y)\rangle$, where $x,y\in \mathbb{R}^d$ and $\phi$ is a strictly convex function. KL divergence is an instance of ...
user1388672's user avatar
2 votes
1 answer
105 views

When is it true that $\sup g - \inf g \le 2\sup g$?

I am currently reading the first version of the paper titled "On the Margin Theory of Feedforward Neural Networks" by Colin Wei, Jason D. Lee, Qiang Liu and Tengyu Ma. In Lemma C.4 of the ...
Stratos supports the strike's user avatar
0 votes
0 answers
60 views

Choosing a loss function to minimize the total sum

I have the following regression problem. I'm forecasting a random variable $X$ for every day of a month, represented as $X_{ij}$ where $i$ is the day of the month and $j$ is the month number. I care if ...
broccoli's user avatar
  • 463
1 vote
0 answers
69 views

Request for reference: uniform convergence for non 0-1 loss functions

In the book "Understanding Machine Learning", there is Theorem 6.11 with the following statement: Let ${\cal H}$ be a class and let $\tau_{\cal H}$ be its growth function. Then, for every $\...
Elnur's user avatar
  • 352
1 vote
0 answers
13 views

Data Groupings and Bayesian Analysis for a Generative Model

I have a simple question that I think will have potentially many solutions, depending on the level of complexity with which one wants to approach it. I've built a generative model with two variables ...
JKM's user avatar
  • 449
1 vote
1 answer
1k views

How do I compute the derivative of the cross-entropy loss $H(P,Q)$ with respect to the weights $W$?

I'm trying to understand the cross-entropy loss with the Iris dataset for binary classification, where $y=1$ denotes that the plant belongs to Setosa and $y=0$ denotes that it belongs to Non-Setosa. Consider ...
JakeMZ's user avatar
  • 283
1 vote
0 answers
212 views

Optimization: max to softmax for convexity?

Assume we have the following optimization problem: for a family of $m$ vectors $\{x_i\}\in \mathbb{R}^n$, a family of $l$ vectors $\{c_i\}\in \mathbb{R}^n$ with $l\ll m$ and for a family of $l$ ...
Marion's user avatar
  • 2,239
1 vote
0 answers
22 views

Perplexities about Bayesian inference and model averaging (BMA)

Reading about the Bayesian approach to model selection, I was wondering about the more mathematical meaning of Bayesian model averaging. Say, for example, that we are given a dataset $\mathcal{D} = \{\...
James Arten's user avatar
  • 1,953
1 vote
1 answer
479 views

How to verify whether a metric is of negative type or not?

A metric $d(\cdot,\cdot)$ on a space $S$ is said to be of negative type if, for all $n \geq 2$, $z_{1}, \ldots, z_{n} \in S$, and $\alpha_{1}, \ldots, \alpha_{n} \in \mathbb{R}$ with $\sum\limits_{i=...
Zhao Zhao's user avatar
2 votes
0 answers
47 views

Measurability issues in the symmetrization step in the proof of $\varepsilon$-sample theorem

Let $(\mathcal{X},d)$ be a metric space and $\mu$ be a Borel probability measure on $(\mathcal{X},d)$. Let $m \in \mathbb{N}$ and define the two probability product measures $\mu^{m} := \otimes_{k=1}^{...
Bob's user avatar
  • 5,783
1 vote
0 answers
132 views

Empirical Fisher Information but with unknown true parameters and distribution?

I am not sure if I am asking this correctly. I am working on using Fisher Information to examine the information in a model (say neural networks for simplicity). What I know is that the definition of Fisher ...
Daniel H. Leung's user avatar
1 vote
0 answers
51 views

Deriving the regularization term in Bayesian lasso regression

The title is probably not very good, although I thought hard about how to phrase it correctly. I'd be grateful if someone tells me it's wrong and how to correct it. I am practicing for my exams and I have ...
oliver's user avatar
  • 675
1 vote
0 answers
19 views

How to obtain the parameter update for the multiclass classification (general loss and activation function)?

Consider the feature space $\mathcal{X}=\mathbb R^{d}$ and $\mathcal{Y}=\{1,...,c\}$ such that $c > 2$. We consider some activation function $\alpha: \mathbb R^{c} \to \mathbb R^{c}$ and out weight ...
MinaThuma's user avatar
  • 998
6 votes
2 answers
685 views

BFGS Formula from Kullback-Leibler Divergence

On page 411 in this book, the authors give the following BFGS formula $$ \boxed{\boldsymbol C_{\textrm{BFGS}} = \boldsymbol C + \underbrace{\frac{\boldsymbol g^\top\boldsymbol\delta+\boldsymbol g^\top\...
LaguerreGroup's user avatar
2 votes
0 answers
75 views

Generative model evaluation metric: Precision & Recall

In this paper, a new metric was proposed to evaluate generative models. Equation (1) decomposes the generative distribution and the real distribution into two parts w.r.t. their intersection of the ...
Code mx's user avatar
  • 31
0 votes
2 answers
190 views

Bayes classifier: handling conditional expectation / probability

I am learning about the Bayes optimal classifier, and there is a step in a proof I struggle with. One can find this proof also on the Wikipedia page: https://en.wikipedia.org/wiki/Bayes_classifier#...
noam.szyfer's user avatar
  • 1,600
0 votes
0 answers
36 views

Which standard deviation for model averaging?

I hope Math Stack Exchange is the right place for this question, even though it comes from an AI point of view. Say I have a machine learning model and for robustness of results, I initialize it with ...
frederik's user avatar
2 votes
1 answer
158 views

Definition of Ergodicity in Theodoridis' Machine Learning

This is related to, but not the same as, https://stats.stackexchange.com/questions/319190/wide-sense-stationary-but-not-ergodic. Note that I am not assuming stationarity. Theodoridis, in his Machine ...
Clarinetist's user avatar
  • 19.6k
2 votes
0 answers
88 views

Understanding the $\alpha$-regularity assumption for trees

In this paper, definition 4 claims that a tree grown by recursive partitioning is $\alpha$-regular for some $\alpha>0$ if each split leaves at least a fraction $\alpha$ of the available training ...
WeakLearner's user avatar
  • 6,106
2 votes
0 answers
57 views

EM algorithm for the maximum of two normal distributions

Let $X_i \sim N(\mu_1,\sigma^2)$, $Y_i \sim N(\mu_2,\sigma^2)$, and $O_i = \max(X_i,Y_i)$. I need to find $\mu_1, \mu_2$ using EM. My attempt: first I defined $Z_i = \left\{\begin{matrix} 1, ~ X_i \ge Y_i \\ 0,...
Roi Hezkiyahu's user avatar
0 votes
1 answer
36 views

Optimizing recursive functions with time series data

I have a recursive function where $f(0,a)$ is known and $f(t+1;a,b)=f(t;a,b)+g(t,b)$, where $a,b$ are constants and $g$ is a function. I also have a sequence of data, $D(t)$. I am trying to optimize $f$ with ...
Xia's user avatar
  • 542
2 votes
1 answer
532 views

Derivation of the bias-variance tradeoff

I'm having trouble understanding the derivation of the bias-variance tradeoff which is also given in the article on the mean squared error. Let some data be represented by the random variable $X$ with ...
20_Limes's user avatar
0 votes
1 answer
66 views

How to prove total test error is independent of the selected learning algorithm

I'm looking at the following proof: Where: Note: I'm new to this but I think I understand all the below variables correctly now. Mistakes are possible though. $f$ is an ideal function with perfect ...
Grant Curell's user avatar
1 vote
0 answers
140 views

Is a Chi-Squared goodness of fit test appropriate for Neural Network regression?

I have always wished that neural network regression gave more interpretable results, and I'm hopeful that chi-squared tests can anchor these MSE values in the same way that accuracy anchors ...
profPlum's user avatar
  • 337
0 votes
1 answer
305 views

Support Vector Machine Optimization Problem

The formulation of the SVM optimization problem is: \begin{equation} \begin{aligned} & \max_{w,b} \frac{1}{\|w\|} \\ & \text{ subject to } \\ & y_i(w^{T}x_i+b) \geq 1 \end{aligned} \end{...
wizz's user avatar
  • 519
1 vote
0 answers
156 views

Distance weighted uniform sampling - sampling procedure

In the excellent paper Wu, Chao-Yuan; Manmatha, R.; Smola, Alexander J.; Krähenbühl, Philipp (2017): Sampling Matters in Deep Embedding Learning. Available online at https://arxiv.org/pdf/1706.07567. ...
2Obe's user avatar
  • 185
5 votes
0 answers
79 views

Conditional Bias Variance Decomposition

The standard bias variance decomposition says that: $$ E |f(X) - Y|^2 = \int_{\mathbb{R}^d} |f(x) - m(x)|^2 \mu(dx) + E|m(X) - Y|^2, $$ where $\mu$ is some distribution over $X$. I am trying to ...
WeakLearner's user avatar
  • 6,106
0 votes
1 answer
69 views

Reconstruction error for PCA not equal to zero?

I am working on an assignment for school where they ask us to perform PCA on a data set consisting of 500 data points where each data point is of dimension $p=256$. You usually project your ...
JBosmans's user avatar
0 votes
1 answer
85 views

Find solution for optimal regression coefficients

Consider the cost function $E(\mathbf{w}) = \displaystyle\frac{1}{2} \sum_{i=1}^{N}{(\mathbf{d}_i - \mathbf{x}_i^T\mathbf{w})^2} + \frac{\lambda}{2} \left\lVert \mathbf{w}\right\rVert^2$ where $(\...
Monya Feldman's user avatar
1 vote
0 answers
102 views

Two-way ANOVA for machine learning model analysis.

I have different machine learning models and losses, each trained using 5-fold cross-validation. Would it make sense to run a two-way ANOVA to evaluate which ones perform statistically better? ...
Ramon's user avatar
  • 123
6 votes
1 answer
183 views

Does the law of large numbers hold for covering numbers?

I am self-studying empirical process theory. I have encountered the covering number $N(\delta,\mathcal{G},P)$, as well as the empirical version $N(\delta,\mathcal{G},P_n)$. It seems intuitive to ...
Idontgetit's user avatar
  • 1,391
0 votes
0 answers
28 views

Proof of unbiased estimator

Assume: $$ \phi = \int f(x)p(x)dx = E_p(f)$$ Let $x_s \sim p$, $s=1,\ldots,S$, be iid (so $p(x_s=x)=p(x)$ and $p(x_1,x_2) = p(x_1) p(x_2)$). \begin{align} \hat{\phi} &= \frac{1}{S}\sum_{s=1}^{S}f(x_s) \\ E[\...
Swakshar Deb's user avatar
0 votes
1 answer
144 views

Linear Regression Prediction Errors

Suppose that we perform linear regression on data $\mathbf{X}$ (an $N \times {(D+1)}$ matrix) and predictions $\mathbf{y}$ (an $N \times 1$ vector). Let $\mathbf{w}$ ($(D+1) \times 1$ vector) be the ...
Bobo's user avatar
  • 409
0 votes
1 answer
100 views

Is the logistic regression cost function in scikit-learn different from standard derivations?

I am trying to understand the math behind logistic regression. Going through a couple of websites, lectures and books, I tried to derive the cost function by thinking of it as the negative of the ...
Anu's user avatar
  • 311
2 votes
1 answer
247 views

In Fisher’s discriminant for multiple classes, how do you proceed when $S_w$ is a singular matrix (so you can't compute $S_w^{-1}$)?

I am trying to use Fisher’s discriminant for multiple classes to reduce the dimension of the MNIST data set, similar to this post: https://towardsdatascience.com/an-illustrative-introduction-to-...
Nicolas Pacheco's user avatar
4 votes
1 answer
2k views

What is the algorithm for the kNN density estimator?

I am reading Pattern Recognition and Machine Learning by Christopher Bishop. In chapter two he talks about using kNN for density estimation. I want to replicate a plot using Python/R/MATLAB. He is ...
Nicolas Pacheco's user avatar
0 votes
0 answers
28 views

Need a probabilistic approach to determine if a dataset A includes all the elements of dataset B

My job is to identify whether two given datasets are the same. This is to be done on computers using some programming language (C++). Since the data could be huge, I don't want to read all the elements of ...
ultimate cause's user avatar
1 vote
2 answers
91 views

Covariance of a Vector-Valued Random Variable

I'm reading through Andrew Ng's lecture notes for CS229 and he makes the statement that, for a random variable $Z \in \mathbb{R}^{n}$, \begin{align} \operatorname{Cov}(Z) &= E[(Z - E[Z])(Z - E[Z])^{T}]\\ &...
keggythekeg's user avatar
2 votes
1 answer
3k views

Intuition behind the exponential loss function

I'm reading about AdaBoost from the book The Elements of Statistical Learning. The book mentions that, to train the model, the exponential loss function is used: $$L(y, f(x)) = e^{-y f(x)},$$ where $...
user3889486's user avatar
