1
$\begingroup$

I'm referring to the text described here:

These tools [NCBI remap, CrossMap] operate only on the sites present in an input VCF, and return the representation of those sites in a new genome assembly. This does not capture all variation, however. Consider an individual who has sequence reads that indicate they match the GRCh37 reference genome assembly at position GRCh37.chr1:169,519,049 (i.e. the individual's genotype is T/T). Because the individual is homozygous reference at that site, there will be no variation present in their VCF file created on GRCh37. However, the analogous position on the updated GRCh38 reference genome assembly, position GRCh38.chr1:169,549,811, has the reference base C. Consequently, if the individual's read data were analyzed on GRCh38, they would be identified as homozygous for a C->T SNP. Because this site is not present in the input GRCh37 VCF, it is never added when creating a GRCh38 VCF by these other tools.

How important is this? Do you think it is relevant to any types of analysis?

Performance of liftover methods

$\endgroup$
1
  • $\begingroup$ I'm mostly just wondering if the bioinformatics community thinks the added benefit is large enough to be relevant. I haven't encountered a need for this in a project. $\endgroup$
    – BigMistake
    Commented Jun 4, 2023 at 19:56

1 Answer

2
$\begingroup$

All this is saying is that the tool doesn't lift over sites that aren't present in the VCF. I can't imagine this being a big deal in any analysis I've done, and it's exactly what I'd expect the tool to do. There's no rule that invariant sites (with respect to the reference) can't be in a VCF, so if you are concerned, just retain the invariant sites in the VCF file at the calling stage.
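To make the quoted example concrete, here is a minimal Python sketch of what happens to a homozygous-reference site when the reference base differs between assemblies. It is a toy illustration, not how NCBI Remap or CrossMap work internally: `COORD_MAP`, `GRCH38_REF`, and `lift_site` are hypothetical names, the coordinates and bases are hard-coded from the quoted text rather than read from a chain file or FASTA, and strand flips and indels are ignored.

```python
# Toy lookup tables, hard-coded from the quoted example (assumptions, not real
# chain-file or FASTA lookups).
COORD_MAP = {("chr1", 169519049): ("chr1", 169549811)}  # GRCh37 -> GRCh38
GRCH38_REF = {("chr1", 169549811): "C"}                 # GRCh38 reference base


def lift_site(chrom, pos, ref, alt, genotype):
    """Re-express one SNV site against the new reference.

    genotype is a tuple of allele indices (0 = REF, 1 = ALT);
    alt is None for an invariant (homozygous-reference) site.
    """
    new_chrom, new_pos = COORD_MAP[(chrom, pos)]
    new_ref = GRCH38_REF[(new_chrom, new_pos)]

    # Translate allele indices into the bases the individual actually carries.
    alleles = [ref] + ([alt] if alt else [])
    bases = [alleles[i] for i in genotype]

    # Any carried base that differs from the new reference becomes an ALT allele.
    new_alts = sorted({b for b in bases if b != new_ref})
    new_alleles = [new_ref] + new_alts
    new_gt = tuple(new_alleles.index(b) for b in bases)

    return new_chrom, new_pos, new_ref, ",".join(new_alts) or ".", new_gt


# The T/T site is invariant on GRCh37, so it only exists in the VCF if
# invariant sites were retained at the calling stage.
print(lift_site("chr1", 169519049, "T", None, (0, 0)))
# ('chr1', 169549811, 'C', 'T', (1, 1))
```

The point of the sketch is that the function can only produce the GRCh38 C->T call if it is handed the GRCh37 record in the first place. If the site was dropped as invariant during calling, there is nothing for a VCF-based liftover to work with.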

$\endgroup$
