Keywords
Machine learning, computational biology, multi-modal analysis, time series analysis
Abstract
Advancements in high-throughput technologies have led to an exponential increase in the generation of multi-modal data in computational biology. These datasets, comprising diverse biological measurements such as genomics, transcriptomics, proteomics, metabolomics, and imaging data, offer a comprehensive view of biological systems at various levels of complexity. However, integrating and analyzing such heterogeneous data present significant challenges due to differences in data modalities, scales, and noise levels. Another challenge for multi-modal analysis is the complex interaction network that the modalities share. Understanding the intricate interplay between different biological modalities is essential for unraveling the underlying mechanisms of complex biological processes, including disease pathogenesis, drug response, and cellular function. Machine learning algorithms have emerged as indispensable tools for studying multi-modal data in computational biology, enabling researchers to extract meaningful insights, identify biomarkers, and predict biological outcomes. In this dissertation, we first propose a multi-modal integration framework that takes two interconnected data modalities and their interaction network and iteratively updates the modalities into new representations that better predict disease outcomes. The deep learning-based model underscores the importance of, and the performance gains achieved by, incorporating network information into the integration process. Additionally, a multi-modal framework is developed to estimate protein expression from mRNA and microRNA (miRNA) expression, along with the mRNA-miRNA interaction network. The proposed network propagation model simulates in vivo miRNA regulation of mRNA translation, offering a cost-effective alternative to experimental protein quantification.
Analysis reveals that predicted protein expression correlates more strongly with ground-truth protein expression than mRNA expression does. Moreover, the effectiveness of integrative models is contingent upon the quality of input data modalities and the completeness of interaction networks, with missing values and network noise adversely affecting downstream tasks. To address these challenges, two multi-modal imputation models are proposed for imputing missing values in time series data. The first model imputes missing values in time series gene expression using single nucleotide polymorphism (SNP) data for children at high risk of type 1 diabetes. The imputed gene expression allows us to predict, at birth, progression toward type 1 diabetes with a six-year prediction horizon. Subsequently, a follow-up study introduces a generalized multi-modal imputation framework capable of imputing missing values in time series data using either another time series or cross-sectional data collected from the same set of samples. These models excel at imputation tasks, whether values are missing at random or an entire time step in the series is absent. By leveraging the additional modality, they can also estimate a completely missing time series without prior values. Finally, to mitigate noise in the interaction network, a link prediction framework for drug-target interaction prediction is developed. This study demonstrates exceptional performance in cold-start predictions and investigates the efficacy of large language models for such predictions. Through a comprehensive review and evaluation of state-of-the-art algorithms, this dissertation aims to provide researchers with valuable insights, methodologies, and tools for harnessing the rich information embedded within multi-modal biological datasets.
Completion Date
2024
Semester
Spring
Committee Chair
Zhang, Wei
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Computer Science
Degree Program
Computer Science
Format
application/pdf
Identifier
DP0028292
URL
https://purls.library.ucf.edu/go/DP0028292
Language
English
Rights
In copyright
Release Date
May 2024
Length of Campus-only Access
None
Access Status
Doctoral Dissertation (Open Access)
Campus Location
Orlando (Main) Campus
STARS Citation
Ahmed, Khandakar Tanvir, "Machine Learning Algorithms to Study Multi-Modal Data for Computational Biology" (2024). Graduate Thesis and Dissertation 2023-2024. 123.
https://stars.library.ucf.edu/etd2023/123
Accessibility Status
Meets minimum standards for ETDs/HUTs