Keywords

Machine learning, computational biology, multi-modal analysis, time series analysis

Abstract

Advancements in high-throughput technologies have led to an exponential increase in the generation of multi-modal data in computational biology. These datasets, comprising diverse biological measurements such as genomics, transcriptomics, proteomics, metabolomics, and imaging data, offer a comprehensive view of biological systems at various levels of complexity. However, integrating and analyzing such heterogeneous data present significant challenges due to differences in data modalities, scales, and noise levels. Another challenge for multi-modal analysis is the complex interaction network that the modalities share. Understanding the intricate interplay between different biological modalities is essential for unraveling the underlying mechanisms of complex biological processes, including disease pathogenesis, drug response, and cellular function. Machine learning algorithms have emerged as indispensable tools for studying multi-modal data in computational biology, enabling researchers to extract meaningful insights, identify biomarkers, and predict biological outcomes. In this dissertation, we first propose a multi-modal integration framework that takes two interconnected data modalities and their interaction network and iteratively updates the modalities into new representations that better predict disease outcomes. The deep learning-based model underscores the importance of incorporating network information into the integration process and the performance gains achieved by doing so. Additionally, a multi-modal framework is developed to estimate protein expression from mRNA and microRNA (miRNA) expression, along with the mRNA-miRNA interaction network. The proposed network propagation model simulates in vivo miRNA regulation of mRNA translation, offering a cost-effective alternative to experimental protein quantification.
Analysis reveals that predicted protein expression correlates more strongly with ground-truth protein expression than mRNA expression does. Moreover, the effectiveness of integrative models is contingent upon the quality of input data modalities and the completeness of interaction networks, with missing values and network noise adversely affecting downstream tasks. To address these challenges, two multi-modal imputation models are proposed for imputing missing values in time series data. The first model imputes missing values in time series gene expression using single nucleotide polymorphism (SNP) data for children at high risk of type 1 diabetes. The imputed gene expression allows us to predict, at birth, the progression towards type 1 diabetes with a six-year prediction horizon. Subsequently, a follow-up study introduces a generalized multi-modal imputation framework capable of imputing missing values in time series data using either another time series or cross-sectional data collected from the same set of samples. These models excel at imputation tasks, whether values are missing at random or an entire time step in the series is absent. By leveraging the additional modality, they can also estimate a completely missing time series without prior values. Finally, to mitigate noise in the interaction network, a link prediction framework for drug-target interaction prediction is developed. This study demonstrates exceptional performance in cold-start predictions and investigates the efficacy of large language models for such predictions. Through a comprehensive review and evaluation of state-of-the-art algorithms, this dissertation aims to provide researchers with valuable insights, methodologies, and tools for harnessing the rich information embedded within multi-modal biological datasets.

Completion Date

2024

Semester

Spring

Committee Chair

Zhang, Wei

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Degree Program

Computer Science

Format

application/pdf

Identifier

DP0028292

URL

https://purls.library.ucf.edu/go/DP0028292

Language

English

Rights

In copyright

Release Date

May 2024

Length of Campus-only Access

None

Access Status

Doctoral Dissertation (Open Access)

Campus Location

Orlando (Main) Campus

Accessibility Status

Meets minimum standards for ETDs/HUTs
