Keywords
biologically informed neural networks, visible neural networks, variational autoencoders, deep learning, single-cell RNA sequencing
Abstract
Deep learning has been widely applied to the analysis of high-dimensional biological omics data, especially single-cell RNA sequencing (scRNA-seq) data. Incorporating biological information into the architecture of variational autoencoders (VAEs) has been shown to enhance model interpretability, allowing the model’s behavior to reflect the underlying mechanisms of the biological system on which its architecture is based. In recent years, several studies have employed these biologically informed VAEs to model large-scale single-cell transcriptomics data, with models correctly differentiating between cell states and identifying active pathways. However, systematic benchmarking and comparison of different biologically informed VAE architectures remain limited.
In this study, I evaluated and compared three recent biologically informed VAE models: OntoVAE, VEGA, and expiMap. I focused on their ability to identify condition-relevant biological pathway activity using existing pathway annotation information. These models were also compared against a traditional approach that does not use deep learning. The results revealed significant inconsistencies between the VAE models, with OntoVAE demonstrating the best overall performance on the test data. Furthermore, I found that model performance was highly sensitive to data preprocessing and ontology filtering.
To demonstrate practical application, I used OntoVAE to analyze a recent Parkinson’s disease scRNA-seq dataset. The model prioritized multiple pathways connected to genes associated with Parkinson’s disease. In some cases, OntoVAE appeared to prioritize more biologically relevant pathways than a traditional non-deep-learning approach applied to the same data.
Overall, my findings suggest that, for current biologically informed VAEs, simpler model architectures tend to perform better at identifying relevant pathways, and that data preprocessing, ontology filtering, and training strategy play critical roles in determining performance. Future models should aim for robustness to differences in training strategy and should be capable of identifying biological mechanisms at a finer resolution than the pathway level.
Completion Date
2025
Semester
Fall
Committee Chair
Hu, Haiyan
Degree
Master of Science (M.S.)
College
College of Engineering and Computer Science
Department
Computer Science
Identifier
DP0029723
Document Type
Thesis
Campus Location
Orlando (Main) Campus
STARS Citation
Principato, Marjorie R., "An Exploration and Evaluation of Biologically Informed Variational Autoencoders for scRNA-seq Data Analysis" (2025). Graduate Thesis and Dissertation post-2024. 488.
https://stars.library.ucf.edu/etd2024/488