$\begingroup$

I have a txt file summarising the results of a GWAS on a European population. Its structure is as follows:

# Four SNPs on chromosome 1 (every column must have the same length)
data = data.frame(chr = c('1', '1', '1', '1'),
                  bp = c(740098, 787889, 952073, 993492),
                  snp = c('rs12138618', 'rs4951864', 'rs3128126', 'rs4075116'),
                  p = c(0.1, 0.04, 0.7, 0.9))

Imagine the first two SNPs listed (rs12138618, rs4951864) belong to the same LD block (for r^2 >= 0.1 or a similar criterion). Where can I get this information (a table or database), or which R package can I use to label these dependent SNPs and obtain something like:

   chr    bp        snp    p    ld_block
1   1 740098 rs12138618 0.10        21
2   1 787889  rs4951864 0.04        21
3   1 952073  rs3128126 0.70        29
4   1 993492  rs4075116 0.90        34

My dataset is large, so a computationally efficient approach is preferred.

Note: Obviously I am making up the numbers, but I hope my point is clear. I am also new to GWAS, so I may well be missing or misunderstanding concepts.

What if I do not have the genotype data but I know that my population is CEU? Is there any reference genotype data that I can provide to PLINK instead?

$\endgroup$
  • $\begingroup$ Is your overall goal to find the 'best' SNP within each block of LD, or do you explicitly need to know all SNPs within some LD block for another reason? Can you clarify what the overall goal is, please? $\endgroup$
    – user438383
    Commented Nov 21, 2022 at 9:48
  • $\begingroup$ I want to know explicitly which LD block each SNP belongs to, in order to obtain something similar to the output shown in my question. My problem is that I do not know where to get "reference genotype data" for a European population (I thought 1000G might have something, but I did not manage to find it). $\endgroup$
    – Gero
    Commented Nov 21, 2022 at 11:11
  • $\begingroup$ @Gero what if one SNP is in LD with more than one LD block? Most frequently SNPs are in continuous LD with other SNPs upstream and downstream, and you can't neatly split them, though there are of course inference techniques for segmentation. You can infer "haplotype blocks", which are one way of segmenting this information. Usually this is somewhat data dependent. Do you want to do this according to your specific data using a tool, or using some global reference (e.g. tag SNPs from HapMap)? $\endgroup$ Commented Nov 21, 2022 at 19:40
  • $\begingroup$ My plan was to: 1) Compute an LD matrix, 2) find all SNP pairs above an r^2 threshold, 3) create LD blocks by treating LD as a transitive property (if one SNP is in LD with more than one block, then I will consider those "2 blocks" as 1 big unique block). Even if LD is continuous, as long as the distribution of the SNPs in the genome is not homogeneous I guess many blocks will be obtained, hopefully ∼1,000,000 $\endgroup$
    – Gero
    Commented Nov 23, 2022 at 17:43
  • $\begingroup$ On the other hand, thank you for the tools. I'll try to investigate how they work and do the maths for the segmentation! (P.S. My plan is to use a global/population reference). $\endgroup$
    – Gero
    Commented Nov 23, 2022 at 17:44
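
The three-step plan from the comments (pairwise r^2, threshold, transitive merging) amounts to finding connected components in a graph whose edges are SNP pairs above the threshold. Below is a minimal base-R sketch of that idea, not a definitive implementation. The function name `label_ld_blocks`, the `ld_pairs` table with its column names, and the r^2 values in it are all hypothetical and invented for illustration; in practice the pairs would come from whatever LD computation is run on the reference genotypes.

```r
# Toy GWAS summary table, same layout as in the question.
gwas <- data.frame(
  chr = c("1", "1", "1", "1"),
  bp  = c(740098, 787889, 952073, 993492),
  snp = c("rs12138618", "rs4951864", "rs3128126", "rs4075116"),
  p   = c(0.1, 0.04, 0.7, 0.9),
  stringsAsFactors = FALSE
)

# ASSUMPTION: hypothetical pairwise LD table; the r2 values are made up.
ld_pairs <- data.frame(
  snp_a = c("rs12138618", "rs4951864", "rs3128126"),
  snp_b = c("rs4951864", "rs3128126", "rs4075116"),
  r2    = c(0.35, 0.02, 0.01),
  stringsAsFactors = FALSE
)

# Label each SNP with a block ID, treating LD as transitive
# (simple union-find over the pairs with r2 >= r2_min).
label_ld_blocks <- function(snps, pairs, r2_min = 0.1) {
  parent <- seq_along(snps)  # every SNP starts as its own block

  # Follow parent links until reaching a SNP that is its own parent.
  find_root <- function(i) {
    while (parent[[i]] != i) i <- parent[[i]]
    i
  }

  # Keep only pairs above the threshold whose SNPs are both in the table.
  keep <- pairs$r2 >= r2_min &
    pairs$snp_a %in% snps & pairs$snp_b %in% snps
  linked <- pairs[keep, , drop = FALSE]

  # Merge the blocks of each linked pair.
  for (k in seq_len(nrow(linked))) {
    a <- find_root(match(linked$snp_a[k], snps))
    b <- find_root(match(linked$snp_b[k], snps))
    if (a != b) parent[max(a, b)] <- min(a, b)
  }

  # Renumber the roots as consecutive block IDs in order of appearance.
  roots <- vapply(seq_along(snps), find_root, integer(1))
  match(roots, unique(roots))
}

gwas$ld_block <- label_ld_blocks(gwas$snp, ld_pairs, r2_min = 0.1)
print(gwas)
```

With these invented r^2 values only the first two SNPs are linked, so they share a block and the other two each get their own. The block IDs are arbitrary labels, as in the question's example output. This sketch has no path compression, so for millions of SNPs a dedicated graph library's connected-components routine would likely be a better fit.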
