5
$\begingroup$

I have some RNA-seq data where there are reads from the "host" as well as from several bacteria species. In this experimental context, I am interested in the host associated reads and the latter is regarded as "contaminants".

I am using BBSplit to split host- and bacteria-associated reads using my genome of interest as well as a "repo" containing thousands of bacterial genomes that are likely to be present in the host.

This repo has multiple genome assemblies for the same species making the BBSplit step computationally expensive. The total size of the repo is approximatly 25GB, more than 25X the human genome! I am wondering what would be the best criteria to pick one assembly of a particular genome over another assembly.

All in all, different genome assemblies of a given species have more or less similar total lengths, %GC, number of contigs meaning that picking any would probably be fine but not being a microbiologist, I would like to hear expert's opinions on this. My current take is to take the assembly with the largest size for each species.

$\endgroup$
11
  • $\begingroup$ Have you tried testing a couple of randomly chosen ones? Do your results vary significantly? $\endgroup$
    – terdon
    Commented Feb 26 at 13:20
  • $\begingroup$ This is rather difficult to test as this would require comparing count matrices generated after removing reads associated with one bacterial genome assembly for another. There are hundreds of species and each has multiple assemblies. The biggest problem is the computational overhead with the mapping with BBSplit. I believe the best way to proceed would be generating a pan-genome for each species and then map to these but I don't have much experince with working with pan-genomes. $\endgroup$
    – haci
    Commented Feb 26 at 13:28
  • $\begingroup$ do you have the genome of your organism of interest/host? Or even a transcriptome of a related one? You may also cluster reads in your FASTQ files using clumpify from BBMap prior to any mapping/species assignment to speed upthe subsequent steps. $\endgroup$
    – darked89
    Commented Feb 26 at 13:30
  • 1
    $\begingroup$ @haci imho you can just map all the reads to the human genome using splice aware mapper (i.e. STAR). The overwhelming majority of reads from non-eukariots will simply not map to a human genome. If you are not interested in them, there is no need to perform a costly filtering. $\endgroup$
    – darked89
    Commented Mar 1 at 15:20
  • 1
    $\begingroup$ I tend to agree with @darked89 here. Most reads that come from bacteria will simply not map when you map to the host. Moreover, for those reads that actually would map to both, host and bacteria, you have no way of knowing where the respective RNA actually originates from. Removing those would assume that they certainly not from the host. The fact that some reads that map to human also map to a huge collection of bacteria does not really prove that they are of bacterial origin. $\endgroup$
    – Niklas
    Commented Mar 6 at 18:43

2 Answers 2

1
$\begingroup$

A couple of ideas:

  1. Use kraken2 to classify the reads. If you are able to convert your "repo" to kraken2 format, then the matching will be very fast and relatively accurate. Also, using bwa-mem on a joint database could work.
  2. I am not a big fan of custom and not well defined "repos". Why don't you just use the official RefSeq repository of prokaryote genomes? (https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/). I don't remember its size, but I remember using it without any problem.
$\endgroup$
5
  • $\begingroup$ Thank you for the suggestions. 1) I have used Kraken2 as well as Centrifuge for read classification. Apparently the latter is better for shorter reads. Anyway, both failed to call more or less half the reads, probably because my read size of 50bp. 2) What I am using is the human oral microbiome database, they fetch genomes from RefSeq. And then again, I don't want to map to everything but to a collection of "good representatives". $\endgroup$
    – haci
    Commented Mar 2 at 19:06
  • $\begingroup$ Some more "random" ideas, in addition to the idea of choosing the largest assembly for each species: 1) you can select for each species the Reference assembly, which is indicated in RefSeq. 2) I wouldn't be against darked89 suggestion (map to humans only) but I also understand that this should be done only in case of emergency. 3) Have you considered mapping with STAR on the joint human- bacterial database? $\endgroup$ Commented Mar 3 at 9:23
  • $\begingroup$ Thanks! Taking the reference for each species is definetly a good idea. And reagerding your third point, can you please elaborate, where can I find this human-bacterial database? Good old Google failed me this time. For STAR, I would need a genome FASTA and associated annotation GTF file and I belive such a combined genome would be GBs big. And regarding your 2nd point, how do you explain the fact that I end up with different count matrices with and w/o filtering for bacterial reads? $\endgroup$
    – haci
    Commented Mar 3 at 14:31
  • 1
    $\begingroup$ Regarding 3rd point: you merge human and all the bacterial fasta in a multifasta and the build the STAR index with that. You are right it will be several gigabytes, but it is possible that once you create the index (IF creating the index gives no problems) STAR will be fast enough in mapping. Regarding 2nd point: I think you are right in supposing that some gene portions may be not too dissimilar from bacterial transcripts, and thus it would be better to map on all the genomes at the same time. $\endgroup$ Commented Mar 3 at 18:00
  • $\begingroup$ My mapping attempts with BBMap took days with a the full set of genomes (hence the question in the first place) not to mention many problems I had encountered while creating the BBMap index. I don't think doing it with STAR will be much much faster. Thanks anyway. $\endgroup$
    – haci
    Commented Mar 3 at 18:07
-1
$\begingroup$
  1. What I would suggest, without knowing much biology about the system is it's a tree problem. The difficulty is knowing whether the repo is prokaryotic. To solve it I'd take the 16S gene and make a tree - each 16S sequence contains its genome ID. I'd then draw a tree, e.g. IQ-TREE is good and fast. This will describe the overall diversity and provides and informed basis of how to represent each species.

  2. Your approach certainly has merit because again if these are bacteria they go through gene loss - which can cause problems. Selecting the largest genome ensures you minimise this loss.

What I would do is combine 1 and 2. This is to label the taxa that has the largest genome and then produce the 16S tree. It will then be obvious by manual inspection if your approach is representative of the genetic diversity. If it isn't .... you can do the whole thing manually (could be painful), the other approach is to use a tree reading package/module/library. I use ETE3, you'd use ape (@haci's from memory is a big R codes). The issue is traversing the tree from the 'biggest genome' to the most distant genetically related member of the species. What you do is identify the MRCA (most recent common ancestor) for that species and pick a taxa that is in the opposite clade to the biggest genome and any member will do. Thus this is automating the process via Python or R code so thousands of genomes/species can be screened.

To try and be clearer, what you are doing in the code is moving the tree one node at a time and checking if all the members below it are the same species. As soon as the answer is 'no', it's the previous node thats the MRCA for the species. In a bifurcating tree that divides the data into its two biggest groups of genetic diversity.

This approach ensures you get a reasonable representation of the species, by selecting two members of it. One is the largest genome.

It's a while since I've used ETE3, so I can't remember the methods, there's probably a MRCA method for a given species. You can do this empirically, doing a recursive climb up the tree from the biggest genome until your monophyly contains a difference species. The previous traversal is the MRCA, then you ask it to divide the species into two clades and any member of the clade without the biggest clade is what you want.

I agree that it needs a bit of "tree-thinking" to do this ... but it's doable. You then use the genome IDs to filter your repo and produce a greatly slimmed down repo.

You could alternative switch to an HPC method (e.g. AWS) using the entire repo

The alternative approach is know a lot about the biology/ pathogenesis of the problem.

$\endgroup$
3
  • $\begingroup$ Well the answer is correct, regardless of the number of downvotes. Explaining it clearly - particularly in code - is a lot of text. Basically to automate a representative genetic diversity data set is not trivial, unless there's an external biological rationale, but here nothing specified. $\endgroup$
    – M__
    Commented Feb 28 at 22:09
  • 1
    $\begingroup$ Thanks @M_ for the detailed answer. The contaminants are bacteria and the host is human. I think I will first try your suggestion of using 16 trees to see if taking the largest genome was a good call. The fact that I have over 500 species, I somehow need to automate the representative/not-a-good-representative call. $\endgroup$
    – haci
    Commented Mar 1 at 14:24
  • 2
    $\begingroup$ BTW, regarding the downvote in this answer, I don't really find it useful downvoting but then not explaining why. In this case, here is a complex answer in a domain that I am not experienced (hence the question) and now I don't not know which part might be "off". Giving a reason for the downvote would spark a discussion that the community would benefit. $\endgroup$
    – haci
    Commented Mar 1 at 14:30

Not the answer you're looking for? Browse other questions tagged or ask your own question.