2
$\begingroup$

I ran my 23andMe data through the Sanger Imputation Service that uses Eagle v2.4 for phasing, PBWT v3.1 for imputation.

However, some aspects of the results are very confusing to me. Perhaps see #3 (the most confusing) first.

Point 1

  1. There are 62,623 rsIDs for chromosome 3 in my 23andMe file and 2,821,894 SNPs (actually, unique rsIDs) after imputation. However, only 60,144 of the original 62,623 rsIDs are in that ~2.8 million. Where did the other 2,479 go? Here are some examples of missing rsIDs: rs1516332, rs12374025, rs1170695, rs2731343 and rs63749864.

EDIT Point 1

Looking at rs1516332, the position is 3:105793, which does have a result in the file, but it doesn't have an rsID:

3    105793    .    C    T    .    PASS    RefPanelAF=0.17878;AN=2;AC=0;INFO=1    GT:ADS:DS:GP    0|0:0,0:0:1,0,0

Looking at rs63749864 (3:13564048) it's missing by rsID and position.

EDIT Point 1 (again)

As described here: https://github.com/richarddurbin/pbwt/issues/51

"... of the 2,479 'missing' SNPs, 1,806 can be 'found' [by position] leaving 673 'anchors' [or rsIDs called by 23andMe] missing from chromosome 3."

Point 2

  1. Some imputed rsIDs appear more than once in the results. How should I interpret that? e.g.
3   64071   rs116059446 G   T   .   PASS    RefPanelAF=0.000369572;AN=2;AC=0;INFO=1 GT:ADS:DS:GP    0|0:0,0:0:1,0,0
3   64071   rs116059446 G   A   .   PASS    RefPanelAF=9.2393e-05;AN=2;AC=0;INFO=1  GT:ADS:DS:GP    0|0:0,0:0:1,0,0

EDIT Point 2 From the answer below, this is a variant with more than two alleles: rs116059446

Seems the format of this VCF file only allows one alt per line, hence the two alts are split over two lines.

See my question here: Is it valid VCF not to 'squash' positions with more than one ALT allele?

And the answer from Durbo here: https://github.com/richarddurbin/pbwt/issues/52#issuecomment-1062951105

In the above case, the genotype is easy to interpret (REF/REF in both lines).

However, I'm now finding examples that I just can't interpret:

3   255395  rs331869    C   G   .   PASS    RefPanelAF=0.385864;AN=2;AC=1;INFO=1    GT:ADS:DS:GP    1|0:1,0:1:0,1,0
3   255395  rs331869    C   T   .   PASS    RefPanelAF=0.980998;AN=2;AC=2;INFO=1    GT:ADS:DS:GP    1|1:1,1:2:0,0,1

Is my genotype CG or TT at this position?

Point 3

  1. Sometimes a rsID appears twice but at different locations:
3   4942430 rs71634747  G   A   .   PASS    RefPanelAF=0.445673;AN=2;AC=1;INFO=1    GT:ADS:DS:GP    1|0:1,0:1:0,1,0
3   4942432 rs71634747  C   G   .   PASS    RefPanelAF=0.248907;AN=2;AC=0;INFO=1    GT:ADS:DS:GP    0|0:0,0:0:1,0,0

Unless I'm really confused, this just seems wrong. If it wasn't for this weirdness, I'd seek concrete answers for #1 and #2, but it could be that they are both just a symptom of this potential error?

$\endgroup$
1
  • $\begingroup$ Oh, [email protected] isn't responding to my email. I have asked the above of them. $\endgroup$
    – Dan Bolser
    Commented Mar 7, 2022 at 14:22

2 Answers 2

3
$\begingroup$
  1. Don't know. These could be retired or inconsistent SNPs (e.g. shifted locations, or otherwise found to be different from expected). It might be that the imputation suggests that the SNP should be something other than what is observed. I can't see anything obvious from the dbSNP information that I've looked at.

  2. a) In this specific case for rs116059446, that's a SNP that has more than two variants. One way of normalising VCF files is to rearrange them so that each line only refer to a single polymorphism from the reference sequence, so if there are two non-reference variants, the variant will get two lines in the VCF file.

    b) For rs331869, this is another SNP with more than two variants recorded in dbSNP. Given that you're digging through the weeds of a huge dataset, the way of thinking about this needs to change from "Is this common" to "Is this possible". Problems are much more likely to exist with SNPs that aren't biallelic, because it indicates that something funky is going on with genetics or testing. I'll propose three possibilities for what could be happening here, in what I consider to be descending order of likelihood:

    1. you have one non-reference allele on one of the sister chromosome copies, and another non-reference allele on another copy
    2. you have a somatic mutation (i.e. a mutation that happened during your lifetime), and some of your cells have one variant, and others have a different variant.
    3. the probes are tagging different regions in your genome from what is expected. Probes should be location-unique, but that's impossible to guarantee for everyone, because genetic variation is a thing that happens.
  3. In this specific case for rs71634747, it looks like that SNP has been retired, and has merged with rs724135. It's possible that this is a multi-nucleotide variant where variants are only observed in tandem. This is an intronic variant, so seeing a 3-nucleotide variation is a little unexpected, but given that you're picking out exceptions from a very large dataset, not completely unexpected.

I think the main take-home from this is that you shouldn't ever expect results to be a perfect representation of reality. It can be useful to look at raw data to help understand what's going on, but there's too much weird stuff going on with biology, chemistry and physics for anything to be 100% accurate.

If you did exactly the same test at a later date, some of your 23andMe results will be different. If you did imputation at a later date, with a different program version, or different imputation reference dataset, your results are likely to be different.

Accuracy can be substantially improved by using different technologies to attempt to discover the same thing, and only concentrating on consistently-reported differences. You could try an Infinium Omni2.5 SNPchip, for example, or whole-genome sequencing. Unfortunately, this will lead to another problem: what should be done with the inconsistent results? The answer to that is situation dependent, and is one of the hardest challenges of bioinformatics.

$\endgroup$
3
  • $\begingroup$ Thanks for these answers. I'm not expecting data to reflect reality (I've been doing bioinformatics for too long for that ;-) I just expect data to be sane. This is just a basic sanity test, flagging things that looked wrong. I think point 1 and 3 are now tidied up (I don't think the software is behaving correctly, but I can at least understand how the software is behaving, and it's not bonkers. $\endgroup$
    – Dan Bolser
    Commented Mar 10, 2022 at 13:54
  • $\begingroup$ I still don't understand point 2. Lets look at your potential sources for this data sanity check failure... 2b.1) makes perfect sense, but I would expect the genotype to be 0/1 and 0/1. That would be easy to interpret. That's not what I see (1/0 and 1/1). 2b.2) This may be true, but the data could never show that (imputation). 2b.3) Same, not possible in this data (imputed). Increasingly I think this call is just wrong. Not biologically wrong, just fails basic sanity checks. $\endgroup$
    – Dan Bolser
    Commented Mar 10, 2022 at 13:57
  • $\begingroup$ Like imagine I showed you a vcf that had 3 3333 . A G . . . GT 4/4... Well, you could say, ah ha, this is a sign of a quadruplex DNA at this location in this individual... well maybe. I'd just say the VCF is broken. It's just wrong / impossible to interpret (imho). $\endgroup$
    – Dan Bolser
    Commented Mar 10, 2022 at 14:02
0
$\begingroup$

The answer is to relax and throw away the 0.001% of stuff that doesn't make sense, lower your standards and just get on with finishing the job you set out to do without actually caring about consistency, logic, meaning or formats that are actually defined or interpretable.

Nobody cares about what these SNPs mean, how to interpret them programmatically, why there are version issues in the output of the imputation server or why it throws away ground truth in the output, leaving the user to struggle with all these issues. I mean... if nobody cares, why should I?

I've written up as detailed a description of the problems as I can here: https://github.com/richarddurbin/pbwt/issues/52 and here: https://github.com/richarddurbin/pbwt/issues/51

and written a script to throw away all the junk here: https://github.com/Geromics/covcheck/blob/wip/report-v2/Research/debug_imputation_results.py

$\endgroup$

Not the answer you're looking for? Browse other questions tagged or ask your own question.