2
$\begingroup$

How are the "effect_allele" and "other_allele" bases chosen in a PGS file from the PGS Catalog? For example, for PGS002723 (https://www.pgscatalog.org/score/PGS002723/), if I set the "effect_allele" to "ref", then "ref" is equal to the reference base of hg38 in 98.60% of the cases on chromosome 22. On chromosome 22, there are 205 PGS variants where the "effect_allele" does not match the reference base of hg38. In all of these 205 cases, the "other_allele" matches the hg38 base.

Why does PGS Catalog encode the hg38 reference base in 14,477 of cases for chromosome 22 as "effect_allele" and in 205 of cases as "other_allele"?

Finally, how can I deal with this encoding if I want to compute the polygenic score on a certain cohort? Can I simply invert the sign of the "effect_weight" column for these 205 cases where the "other_allele" matches the reference?

I paste my reproducible code below:

suppressMessages(library(BSgenome.Hsapiens.UCSC.hg38))
suppressMessages(library(data.table))
suppressMessages(library(R.utils))
suppressMessages(library(stringr))

setwd('/cluster/home/criccio/PRS/')

source('scripts/get_hg38_base.R')

if (!dir.exists('output/table'))  dir.create('output/table')

PGS_ID <- 'PGS002723'

# Example to ask questions online
system(paste0('wget -P output/ --recursive --no-parent ftp://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/', PGS_ID, '/ScoringFiles/'))

scoring_file <- data.table::fread(paste0('output/ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/', PGS_ID, '/ScoringFiles/Harmonized/', PGS_ID, '_hmPOS_GRCh38.txt.gz'))
data.table::setnames(scoring_file, c('chr_name', 'hm_pos', 'effect_allele', 'other_allele'), c('chr', 'pos', 'ref', 'alt'))

pgs_variants_per_chrom <- scoring_file[, .N, chr]
data.table::setnames(pgs_variants_per_chrom, 'N', 'nb_pgs_in_scoring_file')

chromosome_number <- 22

scoring_file[chr == chromosome_number, ref_hg38 := sapply(scoring_file[chr == chromosome_number, pos], get_hg38_chr22_base)]
scoring_file[chr == chromosome_number, ref_hg38 := toupper(ref_hg38)]
scoring_file[chr == chromosome_number, table(ref == ref_hg38)] # 14477 TRUE and 205 FALSE
scoring_file[chr == chromosome_number, 100 * mean(ref == ref_hg38, na.rm = TRUE)] # 98.60%

scoring_file_mismatch <- scoring_file[chr == chromosome_number & ref != ref_hg38] 
scoring_file_mismatch[chr == chromosome_number, mean(alt == ref_hg38)] # 100% match
data.table::fwrite(scoring_file_mismatch, paste0('output/table/', PGS_ID, '_chr_pos_ref_mismatching_hg38_chr', chromosome_number, '.csv'), sep = ',')

get_hg38_base.R

library(BSgenome.Hsapiens.UCSC.hg38)
Hsapiens_chr22 <- strsplit(as.character(Hsapiens$chr22), '')[[1]]
get_hg38_chr22_base <- function(pos) Hsapiens_chr22[pos]
$\endgroup$

1 Answer 1

2
$\begingroup$

Effect allele is almost always the reference allele. It theoretically does not have to be, so it's good to not rely on that fact. However, in practice it is typically a safe bet. At first glance, the answer to your question could be this:

The polygenic score from the PGS catalog is using GRCh37. Your genome build is hg38. I expect that some alleles in the score swapped ref/alt positions between builds, resulting in your 205 cases where the effect allele matches the alt allele.

To confirm, could you list a few examples of cases that matched as expected and a few examples of cases that didn't match?

You don't need to deal with this encoding if you want to compute a polygenic score on a cohort. You can match based on the allele itself. (Also, there are existing tools which will calculate polygenic scores for you.) Here is an example (just a snippet, you'd have to iterate through your files, etc. for this to work):

            if PGS_row scoring file identifier like rsID or position == vcf_row identifier: # When the SNP is in both files
            effect_allele = PGS_row[2] # Wherever the affect allele is in your scoring file
            effect_weight = float(PGS_row[4])
            dosage (of ref allele) = (depends on your VCF formatting)

            if effect_allele == vcf_row[3]: # Effect allele matches reference allele
                running_total += (2 - dosage) * effect_weight # We assume dosage of alt is 2 minus dosage of ref. So if you have 2 alt alleles, 2-2 = 0 and there will be nothing added for this SNP.
                num_additions += 1
            elif effect_allele == vcf_row[4]: # Effect allele matches alternate allele
                running_total += dosage * effect_weight
                num_additions += 1
            else:
                nomatch = nomatch + 1

# Calculate final output score
final_output_score = running_total / num_additions
percent_matched = (num_additions)/((nomatch)+(num_additions))

You shouldn't invert the sign of the "effect_weight" column for these 205 cases where the "other_allele" matches the reference. Not having the effect allele doesn't add the opposite beta to the score, it simply doesn't change it.

Try this:

  1. Do you still see those 205 cases if you use a genome build matching the PGS file? (I.e., GRCh37.)
  2. Do you still see those 205 cases if you use the LiftOver PGS scoring file with hg38 coordinates? (Download it here.)
$\endgroup$
1
  • 1
    $\begingroup$ Thanks a lot. This answer is really helpful. Yes, I checked the scoring file of GRCh37 and there were no mismatches, so it must be as you say that the reference base changed between GRCh37 and GRCh38. I also thought inverting the sign was not OK but now you convinced me. Indeed, it add weight if you have the effect allele. Cheers! $\endgroup$ Commented Jun 15, 2023 at 7:42

Not the answer you're looking for? Browse other questions tagged or ask your own question.