Metagenomics uses sequencing technologies to study genetic sequences from whole microbial communities. Binning metagenomic reads is the most fundamental step in metagenomic studies and is essential for understanding microbial functions, compositions, and interactions in environmental samples. Various taxonomy-dependent and taxonomy-independent approaches have been developed based on information such as sequence similarity, sequence composition, or k-mer frequency. However, there is still room for improvement: it remains challenging to bin reads from species with similar or low abundance and to bin reads from unknown species. In this dissertation, we introduce one taxonomy-independent and three taxonomy-dependent approaches to improve the performance of metagenomic read binning. The taxonomy-independent method, called MBBC, bins reads by considering k-mer frequency in reads, without reference genomes. The first two taxonomy-dependent methods both bin reads by measuring the similarity of reads to Markov chains trained on different taxa. The major difference between these two methods is that the first selects potential taxa with a taxonomic decision tree, while the second, called MBMC, selects potential taxa using the ordinary least squares (OLS) method. The third taxonomy-dependent method bins reads by combining MBMC with clustering of Markov chains from the assembled reads. In tests on both simulated and real datasets, these tools showed superior or comparable performance relative to various state-of-the-art methods. We anticipate that our tools can significantly improve the accuracy of metagenomic read binning and thus be widely applied to real environmental samples.
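The idea of binning a read by its similarity to Markov chains trained on different taxa can be illustrated with a minimal sketch. This is a generic illustration, not the dissertation's implementation: the chain order, the pseudocount smoothing, the log-likelihood score, and all function names (`train_markov_chain`, `log_likelihood`, `assign_read`) are assumptions made for the example, and the taxa-selection steps (decision tree, OLS) are not shown.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov_chain(sequences, order=2, pseudocount=1.0):
    """Estimate log transition probabilities of a fixed-order Markov chain.

    Hypothetical helper: counts each (context -> next base) transition in the
    training sequences and smooths the counts with a pseudocount.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            context, nxt = seq[i - order:i], seq[i]
            # Skip transitions that involve ambiguous bases such as N.
            if nxt in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][nxt] += 1.0

    model = {}
    for context, nxt_counts in counts.items():
        total = sum(nxt_counts.values()) + pseudocount * len(ALPHABET)
        model[context] = {
            base: math.log((nxt_counts.get(base, 0.0) + pseudocount) / total)
            for base in ALPHABET
        }
    return model


def log_likelihood(read, model, order=2):
    """Sum log transition probabilities of the read under one taxon's chain.

    Contexts or bases never seen in training fall back to a uniform
    probability (an assumption of this sketch).
    """
    uniform = math.log(1.0 / len(ALPHABET))
    score = 0.0
    for i in range(order, len(read)):
        context, nxt = read[i - order:i], read[i]
        score += model.get(context, {}).get(nxt, uniform)
    return score


def assign_read(read, models, order=2):
    """Return the taxon whose Markov chain gives the read the highest score."""
    return max(models, key=lambda taxon: log_likelihood(read, models[taxon], order))


# Toy usage with two made-up "taxa" that differ in sequence composition.
models = {
    "taxon_a": train_markov_chain(["ATATATATATATATAT" * 4]),
    "taxon_b": train_markov_chain(["GCGCGCGCGCGC" * 4]),
}
best_for_at_read = assign_read("ATATATATAT", models)
best_for_gc_read = assign_read("GCGCGCGC", models)
```

In this toy setting each read is assigned to the taxon whose training sequence shares its composition. A real pipeline would train on reference genomes and restrict the candidate taxa before scoring.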
Doctor of Philosophy (Ph.D.)
College of Engineering and Computer Science
Doctoral Dissertation (Open Access)
Wang, Ying, "Computational Approaches for Binning Metagenomic Reads" (2016). Electronic Theses and Dissertations, 2004-2019. 5344.