
As a Computer Science student inexperienced in statistics, I'm looking for some advice on selecting the appropriate statistical test for my dataset.

My data, derived from brain scans, is structured into columns: subject, channels, freqbands, measures, value, and group. It involves recording multiple channels (electrodes) per patient, dividing each signal into various frequency bands (freqbands), and calculating measures such as Shannon entropy for each. Each (channel, freqband, measure) combination therefore reduces to a single data point, giving 1425 data points per subject (19 channels × 5 freqbands × 15 measures) across roughly 170 subjects.
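
To make the structure concrete, here is a minimal sketch of the long format (the channel names and values below are made up):

```python
import pandas as pd

# Toy illustration of the long format; names and values are invented.
rows = [
    # subject,   channel, freqband, measure,           value, group
    ("sub-001", "Fp1",   "alpha",  "shannon_entropy", 0.82,  "A"),
    ("sub-001", "Fp1",   "beta",   "shannon_entropy", 0.61,  "A"),
    ("sub-002", "Fp1",   "alpha",  "shannon_entropy", 0.74,  "B"),
    ("sub-002", "Fp1",   "beta",   "shannon_entropy", 0.55,  "B"),
]
df = pd.DataFrame(rows, columns=["subject", "channel", "freqband",
                                 "measure", "value", "group"])
```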

I aim to determine whether there is a significant difference in values (for specific channel, freqband, and measure combinations) between two groups. Additionally, I'm interested in identifying any significant differences at the channel, measure, or freqband level.

What would be a suitable statistical test for this scenario?

Thanks in advance for any help!


1 Answer


Let $n$ $(i = 1, 2, \ldots, n)$ be the number of rows (patients) and $p$ $(j = 1, 2, \ldots, p)$ the number of features (columns). Since your data have $n = 170$ and $p = 1425$, a problem for most analyses is that the dataset suffers from the curse of dimensionality, that is, $n \ll p$. Therefore, you first need to reduce the dimensionality.
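
As a minimal sketch of getting to that $n \times p$ matrix (the file name is hypothetical, and the column names are the ones described in the question):

```python
import pandas as pd

# Hypothetical file name; the long format has the columns from the
# question: subject, channel, freqband, measure, value, group.
df = pd.read_csv("brain_measures.csv")

# One row per subject, one column per (channel, freqband, measure)
# combination: an n x p matrix, here roughly 170 x 1425.
X = df.pivot_table(index="subject",
                   columns=["channel", "freqband", "measure"],
                   values="value")

# One group label per subject, aligned with the rows of X
# (both pivot_table and groupby sort by subject).
groups = df.groupby("subject")["group"].first()
```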

A reasonably good dimension-reduction method is principal components analysis (PCA), which is typically employed for two reasons: (1) to perform linear dimension reduction, and (2) noise reduction. You could run PCA on the $p \times p$ correlation matrix $\mathbf{R}$ derived from the $p$ columns. Each element $r_{jk}$ of $\mathbf{R}$ represents the correlation between the $j$th and $k$th features. The diagonal elements of $\mathbf{R}$ are always 1, and every $r_{jk}$ lies in the range $[-1, 1]$.
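
Continuing the sketch above, $\mathbf{R}$ can be computed directly from the pivoted matrix:

```python
import numpy as np

# p x p correlation matrix of the feature columns of X (from the
# earlier sketch); rowvar=False treats each column as a variable.
R = np.corrcoef(X.to_numpy(), rowvar=False)
print(R.shape)          # (p, p)
print(np.diag(R)[:5])   # diagonal elements are all 1
```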

Most software packages will take either the correlation matrix or the covariance matrix of the input features; since your measures are on different scales, focus on correlation. PCA is mainly used to derive the principal component score vectors $\mathbf{F}$, or PC scores, an $n \times m$ matrix where $m$ is the number of eigenvalues $\lambda_j > 1$. Results of PCA often yield $m \ll p$.
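
A minimal scikit-learn sketch, assuming the matrix `X` from the earlier sketch: correlation-based PCA is equivalent to PCA on z-scored columns, and the $\lambda_j > 1$ rule is the Kaiser criterion.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the columns so that PCA on Z corresponds to PCA on the
# correlation matrix R of X.
Z = StandardScaler().fit_transform(X)

pca = PCA().fit(Z)

# explained_variance_ approximates the eigenvalues of R (it uses an
# n - 1 denominator, so the values are scaled by n / (n - 1)).
eigenvalues = pca.explained_variance_
m = int(np.sum(eigenvalues > 1))   # Kaiser criterion: keep lambda_j > 1

F = pca.transform(Z)[:, :m]        # n x m matrix of PC scores
```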

Each column of $\mathbf{F}$ forms an $n$-tuple of PC scores. You can perform pattern recognition by plotting the $n$ score values on PC1 against the $n$ score values on PC2, using two different colors for the two groups of patients. PC1 is the first column of $\mathbf{F}$ and PC2 the second, based on the first and second largest eigenvalues $(\lambda_1 > \lambda_2)$ extracted from $\mathbf{R}$.
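
A minimal plotting sketch, assuming `F` and `groups` from the sketches above (the group labels "A" and "B" are placeholders for whatever your two groups are called):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for label, color in [("A", "tab:blue"), ("B", "tab:orange")]:  # assumed labels
    mask = groups.to_numpy() == label
    ax.scatter(F[mask, 0], F[mask, 1], c=color, label=label, alpha=0.7)
ax.set_xlabel("PC1 scores")
ax.set_ylabel("PC2 scores")
ax.legend(title="group")
plt.show()
```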

Fundamentally, there will be $p = 1425$ eigenvalues extracted from $\mathbf{R}$ ($\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_j \ge \cdots \ge \lambda_p$), and most software packages will return them in descending order (with $n < p$, at most $n - 1$ of them are nonzero). You can estimate the amount of variation explained by the $m$ PCs by adding up the first $m$ eigenvalues and dividing by $p = 1425$, since it can be shown that $\sum_j \lambda_j = p$ when correlation is used. I'm guessing the first $m$ of your eigenvalues describe ~60% of the variance.
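
Continuing the sketch, this fraction can be read off directly; scikit-learn's ratio is taken over the total variance, which matches dividing the first $m$ eigenvalues by $p$ in the correlation case.

```python
# Fraction of total variance explained by the first m PCs.
explained = pca.explained_variance_ratio_[:m].sum()
print(f"First {m} PCs explain {explained:.1%} of the total variance")
```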

It is quite typical to feed the first $m$ PCs (PC$1$–PC$m$) into classification runs, with your group variable serving as the class labels (truth table).
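
For example, a minimal sketch using `F` and `groups` from above (logistic regression is just one reasonable choice of classifier, not a prescription):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Classify the two groups from the m PC scores, with 5-fold
# cross-validation to estimate out-of-sample accuracy.
clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, F, groups.to_numpy(), cv=5)
print(f"5-fold CV accuracy: {acc.mean():.2f} +/- {acc.std():.2f}")
```

Strictly speaking, the PCA should be refit inside each cross-validation fold (e.g., with a scikit-learn `Pipeline`) to avoid information leakage, but the sketch above keeps things simple.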

The above should get you started. I kept the software sketches minimal because, as a CS student, you first need to know the matrices, what PCA is used for, and how its results are used later in the analysis workflow/pipeline.

