1 May 2013- Singapore. This is an Open Science post intended to spark ideas and conversations with other scientists.
I was thinking about some of the assembly problems that have been described in large metagenomic (e.g. soil) samples by @ctitusbrown, @phylogenomics and many others.
To start with - I am not an “NGS” metagenomics expert. Iam a “FGS” metagenomics guy. We published log-odds scores for detecting over 100 species by amino acid composition in 2002. (http://www.biomedcentral.com/1471-2105/3/39). I know a thing or two about large scale bioinformatics pipelines and optimizations.
So. Here are my current thoughts about inverting the metagenomics assembly pipeline.
In a nutshell:
Short read -> RPS-Blast -> PSSM Hierarchy Bin -> Bin Assembly -> Bin contigs -> RPS-Blast hybrid ejection -> Function Assignment -> Phylogenetic assignment -> Taxon Bin -> Assembly.
Inversion? How is this possible?
First - the "small database effect".
What is the small database effect? It is the inverse effect of the large database problem. Google "cancer" and you get (today) 644 million hits. The larger the database, the more hits you get. A smaller database gives you correspondingly fewer hits.
Consider that the number of protein functional classes (PSSMs or HMMS) is smaller than the number of possible species. Organisms are built on common protein building blocks, and there are more ways to organize the blocks and encode them (i.e. species) than there are blocks themselves.
In principle, searching for the function of a short read should outperform a search for the taxon of a short read, in terms of both hit quality and number. Can this work in practice? I think so. Here’s why. (I have tried out steps 1-3 they work from the command-line RPS-BLAST, which got me started on this idea).
- RPS-BLAST can find PSSMs that match sequences of 9-12 amino acids. Change the threshold E-score to around 100. Multiple PSSMs may be hit in the process, and some of these may be false positives. See point 3 about reducing the number of hits by up to an order of magnitude, and points 6 about false positive hits.
- Short reads on the order of 30bp can be input directly into RPS-BLAST as nucleotides. Select the appropriate genetic code. Assume everything encodes protein and close your eyes for the moment.
- CDDs are not independent, rather they are curated. The PSSMs are organized into a hierarchy of evolutionary families and superfamilies. Most hits are in fact within a branch of the hierarchy. The collection of over 40,000 PSSMs is itself redundant and a last-common family or superfamily can be readily found by traversing the ancestral hierarchy. This can collapse the number of candidate PSSM hits to the short sequence considerably. In some cases, by an order of magnitude. For example 50 PSSM hits can collapse into 2-5 superfamily CDD PSSMs.
- With the assignment of lowest common ancestor CDD PSSMs to the short read sequence, one can use the resulting PSSM / CDD identifier as a database key for binning. A short read sequence can be assigned to multiple PSSM bins without harm at this point. By grouping together all the short-reads into PSSM based bins, one can subdivide the metagenomics dataset into potentially related protein functional units by putative peptide sequence.
- Once the reads are binned, one can assemble the sequences in each PSSM bin into contigs independently, by feeding the binned read set into a conventional assembler.
- One can remove erroneous hybrid contigs from the bin by a second RPS-BLAST scan against the parent PSSM or CDD PSSM branch hierarchy. This is a tiny fraction of the entire PSSM set. At this stage the successful contigs should represent species fragments with coding regions based on nucleotide and reading frame overlap, with minimal interference from random overlap. The resulting set of contigs is functionally related to the PSSM, so the assignment of function is already in hand.
- Digital normalization could be applied to the contigs within the PSSM bin to remove redundant information for further genome assembly. There will be a pile of un-binned reads as well that do not map to any PSSMs. These could be assembled/digitally normalized separately from the binned sequences, and merged for the final assembly pass.
- The PSSM itself could be used a scaffold for sub-assembly. That will require some code, but it is probably not necessary.
- The processed contigs can be sent to another system in the pipeline for phylogenetic assignment (e.g. BlastX ). Final assembly can performed assuming species specific sequences are grouped better by the preceding steps.
- PSSMs are divisible and thus can be cut up and subdivided into smaller PSSMs. If there are unmatched portions of the original PSSMs in a set of contigs, they can be chopped into mini-PSSMs, related in the existing CDD hierarchy to the parent PSSM we cut them from. This now changes the search statistics. A new custom RPS-BLAST database can then be compiled out of these leftovers. The unassigned or discarded short reads can be searched against this. This mini-PSSM database will be much smaller in size than the original. This could further extend the previously found contigs by assigning new, weaker hit contigs into the existing PSSM bins up the hierarchy.
Feel free to substitute HMM for PSSM in most of this. However I have no idea whether implementations of protein family HMM searching can use short nt sequences as input (points 1,2), or how to achieve the hierarchical reduction in number of hits without an evolutionary model for the entire set, as curated by CDD (point 3), or whether a HMM can be simply divisible (point 10) without a lot of re-processing.