All Questions
Tagged with statistics machine-learning
660
questions
0
votes
0
answers
22
views
Does the probability flow ODE trajectory (in the context of diffusion models) represent a bijective mapping between any distribution and a Gaussian? [closed]
I have read several papers about diffusion models in the context of deep learning, especially this one.
As explained in the paper, by learning the score function $(\nabla \log(p_t(x)))$, probability ...
0
votes
0
answers
21
views
Sample complexity bounds of $L_S(h)$
Fix $\mathscr{H} \subset \mathscr{Y}^\mathscr{X}$ and a loss $\ell : \hat{Y} \times Y \to [0,1]$. Fix $S \in (\mathscr{X} \times \mathscr{Y})^{2m}$. Assume for now that $S$ is not random. Suppose we ...
0
votes
0
answers
25
views
Harmonizing Classification and Regression [closed]
I have recently been encountering explanations of classification and regression which start with discrete label values as defining the former and continuous label values as defining the latter. I have ...
1
vote
0
answers
67
views
Relation between values of $\xi_i$ and $\alpha_i$ in SVM?
I have a question about a property of support vectors of SVM which is stated in subsection "12.2.1 Computing the Support Vector Classifier" of "The Elements of Statistical Learning" ...
0
votes
0
answers
10
views
Paired bootstrap test p-value formula in binary classification
Background
For a binary classification task, let $M(A, Z)$ denote an evaluation metric, such as accuracy, for classifier $A$ and test examples $Z.$ Then, let
$$
\delta(Z) = M(A, Z) - M(B, Z)
$$
denote ...
0
votes
0
answers
39
views
Least squares minimum test error solution
Assume we want to learn a model $y = x^T \beta + \varepsilon$,
where
$\beta \in \mathbb{R}^d$ is constant
$ x \in \mathbb{R}^d$ is the input vector with Gaussian distribution $\mathcal{N}(0,\Sigma_x)$ ...
2
votes
0
answers
20
views
Would like to validate whether the AUC equation is correct or not
I found a paper "Chapi, Kamran, et al. "A novel hybrid artificial intelligence approach for flood susceptibility assessment." Environmental Modelling & Software 95 (2017): 229-245" ...
0
votes
1
answer
16
views
Understanding the Reasoning Behind the Growth Function $m_{\mathcal{H}}(N)=2^N$ for Convex Sets
I am currently reading Learning from Data by Abu-Mostafa et al. and I am struggling to understand the reasoning behind the growth function $m_{\mathcal{H}}(N)=2^N$ for convex sets. Here is the ...
0
votes
1
answer
36
views
Estimating the conditional entropy of a discrete variable conditioned on a continuous variable
I am doing a machine learning project and I am trying to select the best features by computing their mutual information and selecting the ones with the highest information gain. I was looking at this ...
0
votes
0
answers
32
views
How to Upper Bound the Spectral Norm of $\left(XX^T\right)^{-1}\left(XX^T\right)^{-1}X$?
I have an observation matrix $ X \in \mathbb{R}^{n \times n}$. Considering $XX^T$, this matrix can be seen as a correlation matrix between individuals, so it generally has elements close to the ...
1
vote
1
answer
41
views
How to expand the double integral in variational objective function?
I am reading John Paisley's lecture notes on variational inference. In lecture 6, p. 3, he writes the objective function as follows:
$$
\mathcal{L}(a', b', \mu', \Sigma') = \int_{0}^{\infty} \int_{...
0
votes
0
answers
21
views
How to understand the Bayesian likelihood function
$\mathcal{N}(W^T \cdot X, \beta^{-1})$
This is the likelihood distribution for Bayesian linear regression, right? So, the thing is, if I'm doing batch mode Bayesian regression, then:
Weights (W): Size:...
2
votes
1
answer
34
views
How to derive likelihood function
I have been struggling a lot with the concept of likelihood and I'd really appreciate it if someone could verify if my understanding is correct and give input.
If I understand this correctly, we pick ...
0
votes
0
answers
22
views
Bayesian linear regression: finding the likelihood
Pick a single data point $(x,t)$ and calculate and plot the likelihood for this single data point across all $w$ in your parameter space $(w_0 \times w_1)$ (for a single data point it is a univariate ...
1
vote
0
answers
36
views
Bayes classifiers with cost of misclassification
A minimum ECM classifier discriminates the features $\underline{x}$ as belonging to class $t$ ($\delta(\underline{x}) = t$) if $\forall j \ne t$:
$$\sum_{k\ne t} c(t|k) f_k(\underline{x})p_k \le \sum_{k\ne ...
2
votes
1
answer
39
views
Bayesian Inference Intractability
When looking at Bayesian posteriors
$$
p(z \mid x) = \frac{p(x \mid z)p(z)}{\int p(x \mid z')p(z')dz'}
$$
the denominator is commonly intractable. I understand this is due to the possibility of high ...
5
votes
1
answer
209
views
Rigorous Mathematical foundations of Machine Learning / Deep Learning / Neural Networks
I am an Engineering Graduate (with a strong background in Probability/Measure Theory, Linear Algebra and Calculus) wanting to dig deep into Deep Learning and Neural Networks, and I'm looking for ...
1
vote
1
answer
54
views
How to visualize conditional maximum likelihood estimation?
In Probabilistic Machine Learning (Murphy, 2022, p. 8) I'm stuck on this part:
1.2.1.6 Maximum likelihood estimation. When fitting probabilistic models, it is common to use the negative log ...
0
votes
0
answers
18
views
Express the regularized weight in ridge regression in terms of the linear regression solution.
We would like to minimize the quantity
$E_{in}(\vec{w})=\frac{1}{N}\sum_{n=1}^N(\vec{w}^{T}\vec{x}_n-y_n)^2$
under the constraint
$\vec{w}^T\Gamma^T\Gamma\vec{w}\leq C$ where $\Gamma$ is a matrix, $C$ ...
0
votes
0
answers
31
views
Expected squared Error (bagging)
I'm studying from the Deep Learning book (Ian Goodfellow et al.).
On page 256 the text explains that, considering a set of $k$ regression models, each produces an error $\epsilon_i$ for every example, drawn ...
2
votes
1
answer
59
views
Expected value and variance of Sigmoid and SiLU on a normally distributed random variable for variational approximation
I am trying to apply Assumed Density Filtering (ADF) according to the paper Lightweight Probabilistic Deep Networks to my own model, and I need to implement the variational approximation layer of ...
0
votes
1
answer
23
views
Logistic map type function with controlled steepness on either side
I am looking for a function that satisfies the following requirements:
$f(1) = f(-1) = 0$
$f(x) > 0, \forall x \in (-1,1)$
$f$ is differentiable.
Additionally, I would like it to be ...
0
votes
1
answer
33
views
Maximum Likelihood - Information Matrix Identity Derivation
I am trying to derive the information matrix equality for the Poisson distribution with the log-likelihood:
$$\mathcal{L}(\lambda; x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} \left[-\lambda + x_i \log(\lambda) -...
1
vote
0
answers
72
views
Is every convex cone a subset of a half space?
I have come across a proof for this statement (link to paper at end), which, however, I do not understand. It makes use of Lemma 5.5, which I've also included.
Lemma 1. The interior of the complement of ...
3
votes
0
answers
68
views
A self-proof of the Vapnik-Chervonenkis theorem
Theorem: For every $\varepsilon >0$, with probability greater than $1-\varepsilon$
\begin{align*}
R_p(\hat{g}_{n,\mathcal{G}}) - R_{p}(g^*_{p,\mathcal{G}}) \le 2 \sqrt{\dfrac{2V_{\mathcal{G}...
0
votes
0
answers
9
views
Methods for Efficient Feature Aggregation Maintaining Prediction Accuracy
Given a high-dimensional dataset X containing potentially redundant features, how can we efficiently aggregate and/or select features to achieve accurate prediction of target variable Y while reducing ...
0
votes
1
answer
34
views
Why does a shift in the probability distribution over a label space naturally trigger a shift in the distribution over input space?
Can anyone explain this statement?
"Firstly, let’s define the input space as X (sensory observations) and the label space as Y (semantic categories). The data distribution is represented by the ...
0
votes
0
answers
45
views
When to use the chi-square distribution for confidence intervals with Mahalanobis distance?
I'm currently reading this paper: Distance-based detection of out-of-distribution silent failures for Covid-19 lung lesion segmentation, available here: https://arxiv.org/abs/2208.03217
In brief, ...
0
votes
1
answer
67
views
Understanding Friedman’s H-statistic
In "Interpretable Machine Learning: A Guide For Making Black Box Models Explainable", I found the following for Friedman's H-statistic:
$$PD_{jk}(x_j, x_k) = PD_j (x_j) + PD_k (x_k),$$
where ...
0
votes
0
answers
19
views
Gaussian Kernel outputs a 2*m feature map?
I am currently writing my master's thesis on the Double Descent Curve in Neural Networks and as I was doing some research, I came across the paper "On the Double Descent of Random Features Models ...
0
votes
0
answers
116
views
Posterior probabilities in a GMM
This is a statistics/probability question formulated in the context of machine learning (problem 6.17 in Bishop's 'Deep Learning' book). We are modelling the conditional distribution $p(\mathbf{t}|\...
1
vote
0
answers
37
views
When does the optimal model exist in learning theory?
In the context of learning theory, we usually have: data $(x,y)\sim P(x,y)$, with $x\in\mathcal{X}\subseteq\mathbb{R}^d$ and $y\in\mathcal{Y}\subseteq\mathbb{R}^k$, a hypothesis class $\mathcal{F}\...
0
votes
0
answers
68
views
Expectation of the product of Gaussian kernels and their input
I was wondering if anybody knows how to solve: $$\mathbb{E}_{\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})}\left[ (\mathbf{x}_{i} - \mathbf{z})(\mathbf{x}_{j} - \mathbf{z})^{\top} \exp\left( - (\...
0
votes
0
answers
61
views
Known relations between mutual information and covering number?
This is a question about statistical learning theory. Consider a hypothesis class $\mathcal{F}$, parameterized by real vectors $w \in \mathbb{R}^p$. Suppose I have a data distribution $D \sim \mu$ and ...
1
vote
1
answer
61
views
Interpreting a concentration inequality
In the following paper I am slightly confused about the way they use a concentration inequality derived in Lemma A1. In Lemma A1, under the assumption that $(n ,p)$ satisfies $\log p/n^{1/4} \to 0$ as ...
1
vote
0
answers
50
views
OLS and Conditional Expectation Assumption
In OLS, given the model $Y|X = F(X) + U|X$, where $U$ is the residual, we then assume $E(U|X) = 0$ in order to have $\text{prediction} = F(X) = E(Y|X)$. So $E(U|X) = 0$ is an assumption ...
0
votes
0
answers
46
views
Loss Equation for Training DPMs vs DDPMs
I'm currently trying to wrap my head around the training loss functions for DPMs and how they vary from DDPMs; however, there are differences in how the papers describe the processes, making it ...
0
votes
0
answers
50
views
Right inverse of a linear, bounded, nonnegative, self-adjoint and trace-class operator on a closed subspace of a separable Hilbert space
This question is related to Corollary 3 of the paper: Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces by Kenji Fukumizu, et al.
Basically, they first defined the ...
0
votes
0
answers
22
views
How to prove the relation involving difference between value functions of two different policies and the sum of advantage function over time?
In reinforcement learning, how do you prove the following relation between the difference in value functions of two policies?
The value function $V^\pi(s)$ represents the expected cumulative reward ...
0
votes
0
answers
23
views
Does marginal likelihood on the training set always weakly increase for GPs when adding new features, irrespective of the kernel/hyperparams?
I've recently been introducing myself to Gaussian Processes. In Bayesian linear regression, one would expect that when adding new features, the likelihood on the training set would weakly increase due ...
0
votes
0
answers
9
views
What determines the length of the confidence interval for the mean response?
Consider a simple OLS predictor $\hat\beta = (X^TX)^{-1}X^TY$, where $X$ is the design matrix and $Y$ is the response.
Given a new observation's covariate $x$, I can estimate the mean response ...
0
votes
0
answers
22
views
Bias and Variance Decomposition to find an optimal tuning parameter - Population vs Empirical
I have a question regarding this bias and variance decomposition in The Elements of Statistical Learning. In Section 7.2, it mentions $\operatorname{Err}\left(x_0\right)=$
$$E\left[\left(Y-\hat{f}\...
0
votes
0
answers
26
views
Generic Chaining/Dudley's integral for supremum of average of indicator random variables in sup norm metric space?
I want to use generic chaining/Dudley's integral to bound the stochastic process below:
$\mathbb{E}\sup_{t\in T}\frac{1}{n}\sum_{i=1}^n X_{i,t}$, where $X_{i,t}$ takes value either 1 or 0 (binary random ...
2
votes
1
answer
50
views
What is the Rademacher complexity of kernel-based hypotheses with offset?
Let $\mathcal{H}=\{ x\rightarrow \langle \mathbf{w},\Phi(x)\rangle+b : \| \mathbf{w}\|_{\mathbb{H}} \le \Lambda, b\in \mathbb{R}\}$ be a function family, where $\Phi$ is a feature mapping, and $\...
0
votes
0
answers
21
views
Relation between VC-dimension and pseudo-dimension
I am thinking about the relation between the VC-dimension and the pseudo-dimension, and I am confused about them.
Let $H$ be a family of real-valued functions. We can define a function $c(h,t):x\rightarrow ...
0
votes
0
answers
16
views
Does a Gaussian process solve a specified distribution drift problem?
While reading this lecture, I found the concept of the posterior predictive distribution introduced as follows:
$$
P(Y \mid D, X)=\int_{\mathbf{w}} P(Y, \mathbf{w} \mid D, X) d \mathbf{w}=\int_{...
1
vote
0
answers
367
views
What is the correct formula for Within Cluster Sum of Squares?
I am studying clustering with the K-Means algorithm and I stumbled on the "inertia", or "within cluster sum of squares" part. First I would appreciate it if anyone could explain to me ...
1
vote
1
answer
187
views
Proof that $KL(p||q) =0 \iff p(x) = q(x)$
I am studying Machine Learning, and I came across a proof in section 1.6.1 in Bishop's textbook Pattern Recognition and Machine Learning which I couldn't quite understand. The claim is that $KL(p||q) ...
0
votes
1
answer
46
views
Interpreting notation for binary loss function
I am taking an advanced machine learning class, and in the class notes I noticed a notation that I did not recognize. It is the notation for binary loss function (Please note that it is not binary ...
0
votes
0
answers
36
views
Show bagging helps under squared-error loss
This question is about Section 8.7, Bagging, from the Elements of Statistical Learning (ESL) textbook.
Assume our training observations $\left(x_i, y_i\right), i=1, \ldots, N$ are independently drawn from a ...