1
$\begingroup$

This question was also asked on Biostars

I am analyzing errors in homopolymer regions by comparing the polished HG002 reference genome against a hifiasm assembly built in-house from long ONT duplex reads only.

I am currently working on haplotype-aware long-read error correction, so the diploid HG002 genome and its difficult regions (homopolymers, segmental duplications (SDs), and tandem repeats) are important for this study. Assemblies built only from long ONT duplex or HiFi reads tend to give a lower-quality consensus in these regions.

I have not been able to find homopolymer .bed files for the HG002 genome. All I could find were the homopolymer .bed files for the CHM13, GRCh38, and GRCh37 reference genomes here.

Thanks a lot!

$\endgroup$
0

2 Answers

1
$\begingroup$

If you click through on UCSC, you can get to the UCSC source files for HG002:

      GCA_018852605.2.2bit                          2023-12-24 09:52  732M  
      GCA_018852605.2.2bit.bpt                      2023-12-24 09:52  4.5K  
      GCA_018852605.2.agp.gz                        2023-12-14 07:12  512   
      GCA_018852605.2.chrom.sizes.txt               2023-12-14 07:12  474   
      GCA_018852605.2.chromAlias.bb                 2024-02-15 12:08   39K  
      GCA_018852605.2.chromAlias.txt                2024-02-15 12:08  806   
      GCA_018852605.2.fa.gz                         2023-12-24 09:52  878M  
      GCA_018852605.2.repeatMasker.out.gz           2023-12-22 12:36  160M  
      GCA_018852605.2.repeatMasker.version.txt      2023-12-22 09:51  1.7K  
      GCA_018852605.2.repeatModeler.2bit            2023-12-25 19:37  729M  
      GCA_018852605.2.repeatModeler.families.fa.gz  2023-12-25 14:49  205K  
      GCA_018852605.2.repeatModeler.families.stk.gz 2023-12-25 14:49  4.6M  
      GCA_018852605.2.repeatModeler.out.gz          2023-12-25 17:38  135M  
      GCA_018852605.2.repeatModeler.version.txt     2023-12-25 14:55  1.4K  
      GCA_018852605.2.rmod.log.txt                  2023-12-25 14:49  5.1K  
      GCA_018852605.2.rmsk.customLib.fa.gz          2023-12-25 14:49  205K  
      GCA_018852605.2.trans.gfidx                   2023-12-24 09:52  4.3G  
      GCA_018852605.2.untrans.gfidx                 2023-12-24 09:52  2.0G  
      GCA_018852605.2_assembly_report.txt           2023-12-23 08:58  2.9K  
      bbi/                                          2024-01-10 14:37    -   
      genes/                                        2024-01-10 14:37    -   
      groups.txt                                    2024-03-09 14:52  542   
      html/                                         2024-03-09 14:54    -   
      hub.txt                                       2024-03-09 14:52  5.3K  
      ixIxx/                                        2024-01-10 14:37    -   
      trackDb.txt                                   2024-01-10 14:37  4.4K  

In the BigBed (bbi/) directory you can see several annotation tracks:

      GCA_018852605.2_ASM1885260v2.allGaps.bb              2023-12-24 08:50   19K  
      GCA_018852605.2_ASM1885260v2.assembly.bb             2023-12-14 07:12   22K  
      GCA_018852605.2_ASM1885260v2.augustus.bb             2023-12-24 19:30  2.0M  
      GCA_018852605.2_ASM1885260v2.cpgIslandExt.bb         2023-12-24 16:56  1.0M  
      GCA_018852605.2_ASM1885260v2.cpgIslandExtUnmasked.bb 2023-12-24 16:55  1.7M  
      GCA_018852605.2_ASM1885260v2.cytoBand.bb             2023-12-14 07:12   21K  
      GCA_018852605.2_ASM1885260v2.gap.bb                  2023-12-14 07:12   24K  
      GCA_018852605.2_ASM1885260v2.gc5Base.bw              2023-12-14 07:12  1.5G  
      GCA_018852605.2_ASM1885260v2.rModel.align.bb         2023-12-25 19:30  1.1G  
      GCA_018852605.2_ASM1885260v2.rModel.bb               2023-12-25 19:23  313M  
      GCA_018852605.2_ASM1885260v2.rmsk.align.bb           2023-12-22 13:59  1.0G  
      GCA_018852605.2_ASM1885260v2.rmsk.bb                 2023-12-22 13:53  320M  
      GCA_018852605.2_ASM1885260v2.simpleRepeat.bb         2023-12-24 08:48   48M  
      GCA_018852605.2_ASM1885260v2.tandemDups.bb           2023-12-14 07:12   50M  
      GCA_018852605.2_ASM1885260v2.windowMasker.bb         2023-12-24 09:51  110M  
      GCA_018852605.2_ASM1885260v2.xenoRefGene.bb          2023-12-24 18:00  3.4M  

I don't see a dedicated homopolymer BED file, but the RepeatMasker outputs are there, and you should be able to extract this information from them.
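
One way to pull homopolymer calls out of the RepeatMasker output is sketched below. It assumes the standard RepeatMasker `.out` column layout (query sequence in column 5, 1-based begin and end in columns 6 and 7, repeat name in column 10, repeat class in column 11) and that mononucleotide runs are reported as `Simple_repeat` entries named `(A)n`, `(C)n`, `(G)n` or `(T)n`. The function names and the output file name are hypothetical. RepeatMasker applies its own length and score thresholds, so shorter homopolymers will probably be missing from the result.

```python
import gzip
import re

# Mononucleotide simple repeats are named like "(A)n" in RepeatMasker output.
HOMOPOLYMER_NAME = re.compile(r"^\(([ACGT])\)n$")


def homopolymer_bed_lines(rmsk_lines):
    """Yield BED lines (chrom, start, end, base) for homopolymer repeats."""
    for line in rmsk_lines:
        fields = line.split()
        # Skip header and blank lines: data rows start with an integer score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        match = HOMOPOLYMER_NAME.match(fields[9])
        if fields[10] != "Simple_repeat" or match is None:
            continue
        chrom = fields[4]
        start = int(fields[5]) - 1  # .out is 1-based inclusive; BED is 0-based
        end = int(fields[6])        # BED end is exclusive, so no adjustment
        yield f"{chrom}\t{start}\t{end}\t{match.group(1)}"


def write_homopolymer_bed(rmsk_path, bed_path):
    """Convert a gzipped RepeatMasker .out file to a homopolymer BED file."""
    with gzip.open(rmsk_path, "rt") as src, open(bed_path, "w") as dst:
        for bed_line in homopolymer_bed_lines(src):
            dst.write(bed_line + "\n")
```

For example, `write_homopolymer_bed("GCA_018852605.2.repeatMasker.out.gz", "homopolymers.bed")` would write one line per mononucleotide repeat, with the repeated base in the fourth column.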

$\endgroup$
3
  • 1
    $\begingroup$ The tandemdups one might be the OP's best bet. UCSC also has a "segmental duplications" track which I don't see in the list here but is probably even better. $\endgroup$
    – terdon
    Commented Apr 18 at 17:32
  • 2
    $\begingroup$ I was able to work with this resource. Its microsatellites directory contains a homopolymer regions file in bigBed format named v1.0.1.mnr10.bb. There is no explanation for the name 'mnr', but I guess it stands for mononucleotide repeats. $\endgroup$
    – Panda_1996
    Commented Apr 19 at 6:06
  • $\begingroup$ @Panda_1996 if that solved your problem, you should feel free to post an answer to your own question with this information. $\endgroup$
    Commented Apr 19 at 16:15
0
$\begingroup$

I was able to get the HG002 annotation files (link), which are hosted on the Human Pangenome S3 bucket (https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html).

The article cited below is what led me to the annotation directory.

Liao, WW., Asri, M., Ebler, J. et al. A draft human pangenome reference. Nature 617, 312–324 (2023). https://doi.org/10.1038/s41586-023-05896-x

$\endgroup$
