I ran my 23andMe data through the Sanger Imputation Service, which uses Eagle v2.4 for phasing and PBWT v3.1 for imputation.
However, some aspects of the results are very confusing to me. Point 3 is the most confusing, so perhaps read that first.
Point 1
- There are 62,623 rsIDs for chromosome 3 in my 23andMe file and 2,821,894 SNPs (actually, unique rsIDs) after imputation. However, only 60,144 of the original 62,623 rsIDs are in that ~2.8 million. Where did the other 2,479 go? Here are some examples of missing rsIDs: rs1516332, rs12374025, rs1170695, rs2731343 and rs63749864.
EDIT Point 1
Looking at rs1516332, the position is 3:105793, which does have a record in the file, but that record has no rsID:
3 105793 . C T . PASS RefPanelAF=0.17878;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0
Looking at rs63749864 (3:13564048), it's missing by both rsID and position.
EDIT Point 1 (again)
As described here: https://github.com/richarddurbin/pbwt/issues/51
"... of the 2,479 'missing' SNPs, 1,806 can be 'found' [by position] leaving 673 'anchors' [or rsIDs called by 23andMe] missing from chromosome 3."
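For anyone who wants to reproduce this kind of count, here is a minimal Python sketch of the comparison. It assumes the 23andMe positions and the imputed VCF are on the same genome build, and the function name `classify_anchors` and its input shapes are my own invention, not part of any tool mentioned here.

```python
def classify_anchors(anchors, vcf_lines):
    """Sort 23andMe rsIDs into found-by-ID, found-by-position-only, or missing.

    anchors:   dict mapping rsID -> (chrom, pos), taken from the 23andMe file
               (hypothetical input shape, assumed to share the VCF's build).
    vcf_lines: iterable of VCF text lines, header lines included or not.
    """
    ids, positions = set(), set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF headers
        chrom, pos, vid = line.split()[:3]
        positions.add((chrom, int(pos)))
        if vid != ".":
            ids.update(vid.split(";"))  # the ID column may hold several IDs

    result = {"by_id": [], "by_position_only": [], "missing": []}
    for rsid, (chrom, pos) in anchors.items():
        if rsid in ids:
            result["by_id"].append(rsid)
        elif (chrom, pos) in positions:
            result["by_position_only"].append(rsid)
        else:
            result["missing"].append(rsid)
    return result


# Two records copied from this post, plus three of the anchors discussed.
example_vcf = [
    "3 64071 rs116059446 G T . PASS RefPanelAF=0.000369572;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0",
    "3 105793 . C T . PASS RefPanelAF=0.17878;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0",
]
example_anchors = {
    "rs116059446": ("3", 64071),
    "rs1516332": ("3", 105793),
    "rs63749864": ("3", 13564048),
}
result = classify_anchors(example_anchors, example_vcf)
print(result)
```

On this tiny example, rs116059446 is found by ID, rs1516332 only by position, and rs63749864 not at all, which matches the three cases described above.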
Point 2
- Some imputed rsIDs appear more than once in the results. How should I interpret that? e.g.
3 64071 rs116059446 G T . PASS RefPanelAF=0.000369572;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0
3 64071 rs116059446 G A . PASS RefPanelAF=9.2393e-05;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0
EDIT Point 2
From the answer below, this is a variant with more than two alleles: rs116059446.
It seems this VCF file only allows one ALT per line, hence the two ALTs are split over two lines.
See my question here: "Is it valid VCF not to 'squash' positions with more than one ALT allele?"
And the answer from Durbo here: https://github.com/richarddurbin/pbwt/issues/52#issuecomment-1062951105
In the above case, the genotype is easy to interpret (REF/REF in both lines).
However, I'm now finding examples that I just can't interpret:
3 255395 rs331869 C G . PASS RefPanelAF=0.385864;AN=2;AC=1;INFO=1 GT:ADS:DS:GP 1|0:1,0:1:0,1,0
3 255395 rs331869 C T . PASS RefPanelAF=0.980998;AN=2;AC=2;INFO=1 GT:ADS:DS:GP 1|1:1,1:2:0,0,1
Is my genotype CG or TT at this position?
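To make the problem concrete, here is a rough Python sketch of how I would try to combine the split lines into one call per haplotype. It rests on assumptions of mine: that the genotypes are phased (`|`), that a `1` on a line means "this haplotype carries that line's ALT", and that a `0` only means "not this ALT". The function `haplotype_alleles` is hypothetical and not something PBWT or the Sanger service provides.

```python
def haplotype_alleles(records):
    """Combine one-ALT-per-line records for a single site into per-haplotype calls.

    records: VCF lines sharing CHROM, POS and REF, each with a single ALT and
             one phased diploid sample (assumption).
    Returns a two-element list, one entry per haplotype: the REF allele, an
    ALT allele, or a 'conflict:...' marker if several lines claim that haplotype.
    """
    ref = None
    claimed = [set(), set()]  # ALT alleles asserted for haplotype 1 and 2
    for line in records:
        fields = line.split()
        ref, alt = fields[3], fields[4]
        fmt_keys = fields[8].split(":")
        gt = fields[9].split(":")[fmt_keys.index("GT")]
        for hap_index, allele in enumerate(gt.split("|")):
            if allele == "1":
                claimed[hap_index].add(alt)

    calls = []
    for alts in claimed:
        if not alts:
            calls.append(ref)  # no line claims an ALT here, so treat as REF
        elif len(alts) == 1:
            calls.append(next(iter(alts)))
        else:
            calls.append("conflict:" + "/".join(sorted(alts)))
    return calls


rs116059446 = [
    "3 64071 rs116059446 G T . PASS RefPanelAF=0.000369572;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0",
    "3 64071 rs116059446 G A . PASS RefPanelAF=9.2393e-05;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0",
]
rs331869 = [
    "3 255395 rs331869 C G . PASS RefPanelAF=0.385864;AN=2;AC=1;INFO=1 GT:ADS:DS:GP 1|0:1,0:1:0,1,0",
    "3 255395 rs331869 C T . PASS RefPanelAF=0.980998;AN=2;AC=2;INFO=1 GT:ADS:DS:GP 1|1:1,1:2:0,0,1",
]
print(haplotype_alleles(rs116059446))
print(haplotype_alleles(rs331869))
```

Under these assumptions the first site comes out as `['G', 'G']`, while rs331869 comes out as `['conflict:G/T', 'T']`: both lines claim the first haplotype, which is exactly why I can't read a single genotype off these two records.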
Point 3
- Sometimes an rsID appears twice but at different locations:
3 4942430 rs71634747 G A . PASS RefPanelAF=0.445673;AN=2;AC=1;INFO=1 GT:ADS:DS:GP 1|0:1,0:1:0,1,0
3 4942432 rs71634747 C G . PASS RefPanelAF=0.248907;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0
Unless I'm really confused, this just seems wrong. If it weren't for this weirdness, I'd seek concrete answers for #1 and #2, but could they both just be symptoms of this potential error?
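To see how widespread this is, a short Python sketch like the following could list every rsID that occurs at more than one position. The function name `ids_at_multiple_positions` is made up for illustration, and it assumes plain-text VCF lines (decompress the file first if it is gzipped).

```python
from collections import defaultdict


def ids_at_multiple_positions(vcf_lines):
    """Return {rsID: sorted list of (chrom, pos)} for IDs seen at more than one position."""
    seen = defaultdict(set)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF headers
        chrom, pos, vid = line.split()[:3]
        if vid == ".":
            continue  # record carries no ID
        for rsid in vid.split(";"):
            seen[rsid].add((chrom, int(pos)))
    # Split multi-allelic lines share one position, so they are not reported.
    return {rsid: sorted(places) for rsid, places in seen.items() if len(places) > 1}


example = [
    "3 64071 rs116059446 G T . PASS RefPanelAF=0.000369572;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0",
    "3 64071 rs116059446 G A . PASS RefPanelAF=9.2393e-05;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0",
    "3 4942430 rs71634747 G A . PASS RefPanelAF=0.445673;AN=2;AC=1;INFO=1 GT:ADS:DS:GP 1|0:1,0:1:0,1,0",
    "3 4942432 rs71634747 C G . PASS RefPanelAF=0.248907;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0",
]
duplicates = ids_at_multiple_positions(example)
print(duplicates)
```

On these four records only rs71634747 is reported, at 3:4942430 and 3:4942432; the rs116059446 pair from Point 2 is not, because both of its lines sit at the same position.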