Keywords

biologically informed neural networks, visible neural networks, variational autoencoders, deep learning, single-cell RNA sequencing

Abstract

Deep learning has been widely applied to the analysis of high-dimensional biological omics data, especially single-cell RNA sequencing (scRNA-seq) data. Incorporating biological information into the architecture of variational autoencoders (VAEs) has been shown to enhance model interpretability, so that the model’s behavior reflects the mechanisms of the biological system on which its architecture is based. In recent years, several studies have used these biologically informed VAEs to model large-scale single-cell transcriptomics data, with models correctly differentiating between cell states and identifying active pathways. However, systematic benchmarking and comparison of different biologically informed VAE architectures remain limited.

In this study, I evaluated and compared three recent biologically informed VAE models: OntoVAE, VEGA, and expiMap. I focused on their ability to identify condition-relevant biological pathway activity using existing pathway annotation information. These models were also compared against a traditional approach that does not use deep learning. The results revealed significant inconsistencies between the VAE models, with OntoVAE demonstrating the best overall performance on the test data. Furthermore, I found that model performance was highly sensitive to data preprocessing and ontology filtering.

To demonstrate practical application, I used OntoVAE to analyze a recent Parkinson’s disease scRNA-seq dataset. The model prioritized multiple pathways connected to genes associated with Parkinson’s disease. In some cases, OntoVAE appeared to prioritize more biologically relevant pathways than a traditional non-deep-learning approach applied to the same data.

Overall, my findings suggest that, for current biologically informed VAEs, simpler model architectures tend to perform better at identifying relevant pathways, and that data preprocessing, ontology filtering, and training strategy play critical roles in determining performance. Future models should aim for robustness to differences in training strategy and should be capable of identifying biological mechanisms at higher resolutions than the pathway level.

Completion Date

2025

Semester

Fall

Committee Chair

Hu, Haiyan

Degree

Master of Science (M.S.)

College

College of Engineering and Computer Science

Department

Computer Science

Format

PDF

Identifier

DP0029723

Document Type

Thesis

Campus Location

Orlando (Main) Campus
