How to z-transform Fst and -log(p) values for genome wide selection scan?

Question

I have seen many genome-wide scans in the literature that show $F_{ST}$ values in a 'z-transformed' format, which seem to be bounded between 0 and infinity.

Examples are easy to find by searching for 'genome-wide scan $F_{ST}$'; here are two papers where I found such z-transformed $F_{ST}$ values:

Identification of genomic variants putatively targeted by selection during dog domestication (see figure 1)
Genomic divergence during feralization reveals both conserved and distinct mechanisms of parallel weediness evolution (see figure 3)

How are the z-transformed $F_{ST}$ values calculated?

The only references I could find so far are this: Calculate $F_{ST}$ to draw Manhattan plot, and this: Population analysis of the Korean native duck using whole-genome sequencing data, where they just take ($F_{ST}$ - mean($F_{ST}$)) / sd($F_{ST}$). But this gives a lot of negative values, so should it also be wrapped in an absolute value?

Then is it possible to use the z-scores to get -log(p-values)?

JRodrigoF · Accepted Answer · 2023-01-18 12:18:20Z

The standard score can be estimated by using the sample mean and sample standard deviation as estimates of the population values. Just as you wrote it in the question.

The z-score is given by

$ z = \frac{x - \bar{x}}{s}$

where:

$x$ is the sample observation,
$\bar{x}$ is the mean of the sample,
$s$ is the standard deviation of the sample.

A negative value means that the observation lies $|z|$ standard deviations below the mean, and negative $z$-scores are very much expected. Standardization is a simple linear transformation and thus does not remove or change any skewness in the original data. I guess your data is skewed from the beginning, with most values falling below the mean.
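A minimal sketch of the standardization described above, using only the Python standard library. The `fst` values are made up for illustration, and `z_transform` is a hypothetical helper name, not something taken from the cited papers.

```python
import statistics

def z_transform(values):
    """Return (x - mean) / sd for each x, using the sample mean and sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(x - mean) / sd for x in values]

# Hypothetical per-window F_ST values (assumption: not real data).
# One window is strongly differentiated, the rest are near the background level.
fst = [0.02, 0.03, 0.05, 0.04, 0.03, 0.02, 0.45, 0.06]
z_fst = z_transform(fst)

for x, z in zip(fst, z_fst):
    print(f"F_ST = {x:.2f}  z = {z:+.2f}")
```

With a right-skewed input like this one, most windows fall below the mean and therefore get negative z-scores, while the single outlier window gets a large positive one. Taking the absolute value would make low-$F_{ST}$ windows look like outliers too, which is probably not what a selection scan wants.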

Yes, you can compute the $p$-values from the $z$-scores and then apply a negative log transform. You can use a pre-computed $z$ table for this. Check this video for an example.

Thanks for this @JRodrigoF. This guy (YouTube link) is a maestro at teaching critical probability. — M__, Commented Jan 18, 2023 at 13:56
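Instead of a printed $z$ table, the same lookup can be sketched in code. This assumes the $z$-scores are approximately standard normal (which standardized $F_{ST}$ values often are not, given their skew) and that a one-sided, upper-tail test is wanted because only high $F_{ST}$ is of interest. The function names are hypothetical.

```python
import math

def upper_tail_p(z):
    """One-sided p-value P(Z >= z) for a standard normal Z."""
    # erfc is the complementary error function; this identity gives the upper tail.
    return 0.5 * math.erfc(z / math.sqrt(2))

def neg_log10_p(z):
    """-log10 of the one-sided p-value.

    Note: for extremely large z the p-value underflows to 0.0 and
    math.log10 raises ValueError, so such values need separate handling.
    """
    return -math.log10(upper_tail_p(z))

for z in (0.0, 1.96, 3.0, 5.0):
    print(f"z = {z:.2f}  p = {upper_tail_p(z):.3g}  -log10(p) = {neg_log10_p(z):.2f}")
```

For a two-sided test the p-value would be doubled (using $|z|$), but that again treats unusually low $F_{ST}$ as interesting.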

M__ · Accepted Answer · 2023-01-18 15:27:10Z

@JRodrigoF's answer is correct. I'll attempt to explain why the z-score is being used, because it is not immediately obvious.

$F_{ST}$ is not trivial for anyone to understand per se. It is extremely useful for assessing migration - but what is the actual index that Ron Fisher devised?

Simulations suggest that $F_{ST}$ = 0.2 is the threshold above which population structure is considered strong. Thus $F_{ST}$ = 0 is panmixia and $F_{ST}$ = 1 is zero migration. Everything in between is difficult to interpret, except for the 0.2 threshold, and that only if the theoretical assumptions behind the threshold calculation apply to real-world data. Thus the Z-score is an interesting concept.

In other words, the 0.2 threshold simulation suggests $F_{ST}$ is a non-linear index, which matters for understanding it. Non-linear indices are a bit hairy because we tend to read them as linear, so that 0.1 to 0.3 is one increment of migration and 0.3 to 0.5 is another. This is not true according to the simulation studies.

Z-scores get around this difficulty by simply standardising the combined $F_{ST}$, and that's why they are used.

It becomes particularly interesting if $F_{ST}$ is calculated per allele, because then "fitness traits" can be tracked. The allele doesn't necessarily need to be the actual fitness trait because of molecular "hitchhiking" (old John Maynard Smith terminology). Thus the $F_{ST}$ Z-score within a population within a genome is the beginning of a data-mining process. This is why, for example, it becomes such a big deal for the domestication of dogs, particularly if the comparison is with a dog breed that is closest to the wolf genome. The actual traits selected for in dogs can now be identified. It would also operate between genomes, to identify outliers in dog dispersal, but the behavioural modifications that make dogs dogs are very interesting.

The question is whether probabilities small enough to make the log transformation meaningful are generated. I'm not sure that probabilities well beyond "very highly significant" are generated in this instance. If they are, then a log transformation is useful.

For background, the log scale is used to transform very small probabilities rather than writing them in e notation. It makes them more understandable for comparative purposes and allows you to do cool things like add and subtract them instead of multiplying and dividing. The smaller the probability value, the higher the transformed value, so in this case the higher the value, the more interesting the result.
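A small sketch of why adding logs is preferred over multiplying very small probabilities. The two p-values are hypothetical and chosen only to trigger floating-point underflow.

```python
import math

# Hypothetical, extremely small probabilities (assumption: illustration only).
p_values = [1e-200, 1e-200]

# Multiplying directly underflows: 1e-400 is below the smallest positive double.
product = p_values[0] * p_values[1]

# Adding the logs keeps the information: log10(a * b) = log10(a) + log10(b).
log_sum = sum(math.log10(p) for p in p_values)

print(product)   # underflows to zero
print(log_sum)   # approximately -400
```

The sum of the logs is still an ordinary number that can be compared across loci, whereas the direct product has lost all information.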

However, if the field simply does not use log transformation then it's not cool, because no-one will be able to comparatively interpret the transformed values. In phylogenetics it's done all the time, because the probabilities are really small, but anyone can immediately understand the values - because everyone uses this system. A key part of stats is clarity, not necessarily for those outside the field, but certainly for those within the field.

Very minor quibble: it was Sewall Wright, not Fisher, who invented F-statistics. — user438383, Commented Jan 20, 2023 at 18:13
@user438383 okay, but F is for Fisher (the statistic is named after him), so he must have done something with the development of the stat to get his name recognised like this. — M__, Commented Jan 20, 2023 at 18:34
Actually these are not Fisher's F statistics (in fact Fisher didn't develop the F-test, it was just named after him), but Sewall Wright's family of F-statistics, or fixation indices. Confusing, I know... — user438383, Commented Jan 23, 2023 at 9:41
@user438383, ok I dunno. The reality is Fisher invented so much of classical statistics prior to Bayes and maximum likelihood that discovering something he didn't invent really surprises me. — M__, Commented Jan 23, 2023 at 10:10
