How are the "effect_allele" and "other_allele" bases chosen in a PGS file from the PGS Catalog?

Question

How are the "effect_allele" and "other_allele" bases chosen in a PGS file from the PGS Catalog? For example, for PGS002723 (https://www.pgscatalog.org/score/PGS002723/), if I set the "effect_allele" to "ref", then "ref" is equal to the reference base of hg38 in 98.60% of the cases on chromosome 22. On chromosome 22, there are 205 PGS variants where the "effect_allele" does not match the reference base of hg38. In all of these 205 cases, the "other_allele" matches the hg38 base.

Why does PGS Catalog encode the hg38 reference base in 14,477 of cases for chromosome 22 as "effect_allele" and in 205 of cases as "other_allele"?

Finally, how can I deal with this encoding if I want to compute the polygenic score on a certain cohort? Can I simply invert the sign of the "effect_weight" column for these 205 cases where the "other_allele" matches the reference?

I paste my reproducible code below:

suppressMessages(library(BSgenome.Hsapiens.UCSC.hg38))
suppressMessages(library(data.table))
suppressMessages(library(R.utils))
suppressMessages(library(stringr))

setwd('/cluster/home/criccio/PRS/')

source('scripts/get_hg38_base.R')

if (!dir.exists('output/table'))  dir.create('output/table')

PGS_ID <- 'PGS002723'

# Example to ask questions online
system(paste0('wget -P output/ --recursive --no-parent ftp://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/', PGS_ID, '/ScoringFiles/'))

scoring_file <- data.table::fread(paste0('output/ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/', PGS_ID, '/ScoringFiles/Harmonized/', PGS_ID, '_hmPOS_GRCh38.txt.gz'))
data.table::setnames(scoring_file, c('chr_name', 'hm_pos', 'effect_allele', 'other_allele'), c('chr', 'pos', 'ref', 'alt'))

pgs_variants_per_chrom <- scoring_file[, .N, chr]
data.table::setnames(pgs_variants_per_chrom, 'N', 'nb_pgs_in_scoring_file')

chromosome_number <- 22

scoring_file[chr == chromosome_number, ref_hg38 := sapply(scoring_file[chr == chromosome_number, pos], get_hg38_chr22_base)]
scoring_file[chr == chromosome_number, ref_hg38 := toupper(ref_hg38)]
scoring_file[chr == chromosome_number, table(ref == ref_hg38)] # 14477 TRUE and 205 FALSE
scoring_file[chr == chromosome_number, 100 * mean(ref == ref_hg38, na.rm = TRUE)] # 98.60%

scoring_file_mismatch <- scoring_file[chr == chromosome_number & ref != ref_hg38] 
scoring_file_mismatch[chr == chromosome_number, mean(alt == ref_hg38)] # 100% match
data.table::fwrite(scoring_file_mismatch, paste0('output/table/', PGS_ID, '_chr_pos_ref_mismatching_hg38_chr', chromosome_number, '.csv'), sep = ',')

get_hg38_base.R

library(BSgenome.Hsapiens.UCSC.hg38)
Hsapiens_chr22 <- strsplit(as.character(Hsapiens$chr22), '')[[1]]
get_hg38_chr22_base <- function(pos) Hsapiens_chr22[pos]

BigMistake · Accepted Answer · 2023-06-14 20:36:18Z

Effect allele is almost always the reference allele. It theoretically does not have to be, so it's good to not rely on that fact. However, in practice it is typically a safe bet. At first glance, the answer to your question could be this:

The polygenic score from the PGS catalog is using GRCh37. Your genome build is hg38. I expect that some alleles in the score swapped ref/alt positions between builds, resulting in your 205 cases where the effect allele matches the alt allele.

To confirm, could you list a few examples of cases that matched as expected and a few examples of cases that didn't match?

You don't need to deal with this encoding if you want to compute a polygenic score on a cohort. You can match based on the allele itself. (Also, there are existing tools which will calculate polygenic scores for you.) Here is an example (just a snippet, you'd have to iterate through your files, etc. for this to work):

            if PGS_row scoring file identifier like rsID or position == vcf_row identifier: # When the SNP is in both files
            effect_allele = PGS_row[2] # Wherever the affect allele is in your scoring file
            effect_weight = float(PGS_row[4])
            dosage (of ref allele) = (depends on your VCF formatting)

            if effect_allele == vcf_row[3]: # Effect allele matches reference allele
                running_total += (2 - dosage) * effect_weight # We assume dosage of alt is 2 minus dosage of ref. So if you have 2 alt alleles, 2-2 = 0 and there will be nothing added for this SNP.
                num_additions += 1
            elif effect_allele == vcf_row[4]: # Effect allele matches alternate allele
                running_total += dosage * effect_weight
                num_additions += 1
            else:
                nomatch = nomatch + 1

# Calculate final output score
final_output_score = running_total / num_additions
percent_matched = (num_additions)/((nomatch)+(num_additions))

You shouldn't invert the sign of the "effect_weight" column for these 205 cases where the "other_allele" matches the reference. Not having the effect allele doesn't add the opposite beta to the score, it simply doesn't change it.

Try this:

Do you still see those 205 cases if you use a genome build matching the PGS file? (I.e., GRCh37.)
Do you still see those 205 cases if you use the LiftOver PGS scoring file with hg38 coordinates? (Download it here.)

Thanks a lot. This answer is really helpful. Yes, I checked the scoring file of GRCh37 and there were no mismatches, so it must be as you say that the reference base changed between GRCh37 and GRCh38. I also thought inverting the sign was not OK but now you convinced me. Indeed, it add weight if you have the effect allele. Cheers! — Cristian Riccio, Commented Jun 15, 2023 at 7:42

Stack Exchange Network

How are the "effect_allele" and "other_allele" bases chosen in a PGS file from the PGS Catalog?

1 Answer 1

Not the answer you're looking for? Browse other questions tagged
genomics
gwas
or ask your own question.

Hot Network Questions

How are the "effect_allele" and "other_allele" bases chosen in a PGS file from the PGS Catalog?

1 Answer 1

Not the answer you're looking for? Browse other questions tagged genomicsgwas or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
genomics
gwas
or ask your own question.