Without knowing much about the biology of the system, I'd suggest treating this as a tree problem. The difficulty is knowing whether the repo is prokaryotic. If it is, I'd take the 16S gene from each genome, with each 16S sequence labelled with its genome ID, and build a tree from them; IQ-TREE is good and fast. The tree describes the overall diversity and provides an informed basis for deciding how to represent each species.
Your approach certainly has merit because, again, if these are bacteria they undergo gene loss, which can cause problems. Selecting the largest genome minimises the effect of this loss.
What I would do is combine 1 and 2: label the taxon with the largest genome in each species and then produce the 16S tree. It will then be obvious by manual inspection whether your approach is representative of the genetic diversity. If it isn't, you can do the whole thing manually (could be painful); the other approach is to use a tree-reading package/module/library. I use ETE3, you'd use ape (@haci, from memory, is a big R coder). The issue is traversing the tree from the 'biggest genome' to the most genetically distant member of the species. What you do is identify the MRCA (most recent common ancestor) for that species and pick a taxon in the opposite clade to the biggest genome; any member will do. This automates the process via Python or R code, so thousands of genomes/species can be screened.
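As a rough illustration of the labelling step, here is a minimal Python sketch. The `Genome` record layout, the `__` separator and the `LARGEST` flag are all assumptions made for the example, not anything defined by the repo or by IQ-TREE; the idea is only that the largest genome of each species ends up visibly marked in the leaf names of the 16S tree.

```python
from collections import namedtuple

# Hypothetical record layout: one row per genome in the repo.
Genome = namedtuple("Genome", ["genome_id", "species", "size_bp"])


def largest_per_species(genomes):
    """Return {species: Genome}, keeping the largest genome of each species."""
    best = {}
    for g in genomes:
        current = best.get(g.species)
        if current is None or g.size_bp > current.size_bp:
            best[g.species] = g
    return best


def leaf_label(genome, best):
    """Build a 16S leaf label, flagging the largest genome of its species."""
    is_largest = best[genome.species].genome_id == genome.genome_id
    flag = "__LARGEST" if is_largest else ""
    # "__" is an arbitrary separator chosen to stay Newick-friendly.
    return f"{genome.genome_id}__{genome.species}{flag}"


# Toy data, invented purely for illustration.
genomes = [
    Genome("G1", "sp_A", 4_600_000),
    Genome("G2", "sp_A", 5_100_000),
    Genome("G3", "sp_B", 2_000_000),
]
best = largest_per_species(genomes)
labels = [leaf_label(g, best) for g in genomes]
print(labels)
```

These labels would be used as the sequence names in the 16S alignment, so the flagged taxa are easy to spot when the tree is inspected by eye.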
To try and be clearer, what you are doing in the code is moving up the tree one node at a time, starting from the biggest genome, and checking whether all the members below the node are the same species. As soon as the answer is 'no', the previous node is the MRCA for the species. In a bifurcating tree, that node divides the species into its two biggest groups of genetic diversity.
This approach ensures you get a reasonable representation of the species by selecting two members of it, one of which is the largest genome.
It's a while since I've used ETE3, so I can't remember the methods; there's probably an MRCA method for a given species. You can also do this empirically, with a recursive climb up the tree from the biggest genome until your clade contains a different species. The node from the previous step is the MRCA; you then ask it to divide the species into two clades, and any member of the clade without the biggest genome is what you want.
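The climb described above can be sketched in plain Python without any tree library. The `Node` class and the toy tree are hypothetical stand-ins for a parsed 16S tree, and the sketch assumes the tree has already been rooted (IQ-TREE output is unrooted, so an outgroup or midpoint rooting would be needed first). In ETE3, I believe `get_common_ancestor` and `check_monophyly` cover similar ground, but check the documentation rather than trusting this from memory.

```python
class Node:
    """Minimal rooted tree node; leaves carry a genome ID and a species."""

    def __init__(self, children=(), genome_id=None, species=None):
        self.children = list(children)
        self.genome_id = genome_id
        self.species = species
        self.parent = None
        for child in self.children:
            child.parent = self

    def leaves(self):
        """Return all leaf nodes at or below this node."""
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]


def species_mrca(start_leaf):
    """Climb from start_leaf while every leaf below is the same species."""
    node = start_leaf
    while node.parent is not None:
        below = node.parent.leaves()
        if any(leaf.species != start_leaf.species for leaf in below):
            break  # the parent mixes species, so `node` is the last pure clade
        node = node.parent
    return node


def second_representative(start_leaf):
    """Pick any leaf from a child clade of the MRCA that lacks start_leaf."""
    mrca = species_mrca(start_leaf)
    for child in mrca.children:
        if start_leaf not in child.leaves():
            return child.leaves()[0]
    return None  # the species has only one member in its pure clade


# Toy rooted tree: (((G1,G2),(G3,G4)),G5), with G1-G4 in sp_A and G5 in sp_B.
g1, g2, g3, g4 = (Node(genome_id=f"G{i}", species="sp_A") for i in range(1, 5))
g5 = Node(genome_id="G5", species="sp_B")
clade_a = Node([Node([g1, g2]), Node([g3, g4])])
root = Node([clade_a, g5])

# Suppose G2 is the largest sp_A genome.
other = second_representative(g2)
print(other.genome_id)
```

One caveat: the climb stops at the largest single-species clade containing the biggest genome, which is only the species MRCA if the species is monophyletic in the 16S tree. Members that fall elsewhere in the tree would be missed and need a manual look.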
I agree that it needs a bit of "tree-thinking" to do this ... but it's doable. You then use the genome IDs to filter your repo and produce a greatly slimmed down repo.
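For the final filtering step, a minimal sketch might look like the following. It assumes, purely for illustration, that the repo is a multi-FASTA file and that the genome ID is the first whitespace-delimited token of each header; the real repo layout may well differ, in which case only the header parsing needs to change.

```python
import io


def filter_fasta(handle, keep_ids):
    """Yield the lines of FASTA records whose header ID is in keep_ids.

    Assumption: the genome ID is the first whitespace-delimited token
    after '>' in each header line.
    """
    keep = False
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in keep_ids
        if keep:
            yield line


# Toy repo held in memory; a real run would use open("repo.fasta") instead.
repo = io.StringIO(">G1 sp_A\nACGT\n>G2 sp_A\nGGCC\nTTAA\n>G3 sp_B\nAAAA\n")
slim = "".join(filter_fasta(repo, {"G2", "G3"}))
print(slim)
```

Because it streams line by line, the same function should cope with a large repo without loading it all into memory.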
Alternatively, you could switch to an HPC approach (e.g. AWS) and use the entire repo.
The other alternative is to know a lot about the biology/pathogenesis of the problem.
clumpify from BBMap prior to any mapping/species assignment to speed up the subsequent steps.
i.e. STAR). The overwhelming majority of reads from non-eukaryotes will simply not map to a human genome. If you are not interested in them, there is no need to perform a costly filtering.