ORCID

https://orcid.org/0009-0002-1424-5589

Keywords

Supervised Topic Modeling, Bayesian Shrinkage, Latent Dirichlet Allocation (LDA), Scientific Text Mining, Variable Selection

Abstract

Topic modeling has become a powerful tool for analyzing large-scale textual corpora across scientific domains. However, traditional unsupervised approaches like Latent Dirichlet Allocation (LDA) often produce redundant or weakly informative topics, limiting their interpretability and predictive utility. This dissertation presents a dual-track study that advances the field of interpretable topic modeling and domain-specific text analysis by (1) quantitatively analyzing research trends in Cyber Aggression and Abuse (CAA), and (2) developing a supervised topic selection framework for forensic science literature.

In the first part, we apply LDA to a corpus of 2,309 journal abstracts on CAA sourced from the Web of Science database. This study is among the first to systematically map latent themes in CAA literature, uncovering hot and cold topic clusters, regional publication trends, and temporal shifts in research focus. Key themes include psychological impacts, identity-based harms, detection techniques, and digital interventions. Our findings highlight an evolving interdisciplinary landscape, offering novel insights for scholars, practitioners, and policymakers working on online harm mitigation.

In the second part, we introduce a supervised topic refinement method that integrates LDA with the Mixed-type Multivariate Bayesian Shrinkage Prior (Mt-MBSP) model. We apply this framework to thousands of forensic science abstracts published in the Journal of Forensic Sciences (2009–2022), aiming to identify a sparse, semantically coherent, and domain-relevant set of topics. Mt-MBSP uses document-category labels to prioritize predictive topics, and the resulting topic set outperforms unsupervised baselines such as BERTopic on coherence and diversity metrics. We further validate the selected topics through classification experiments and a theoretical generalization analysis based on VC-dimension bounds, confirming their practical and statistical utility.

Together, these contributions offer a framework for interpretable and predictive topic modeling in applied scientific domains. Our work advances both the methodology of topic selection and the empirical understanding of research trajectories in forensic science and CAA.

Completion Date

2025

Semester

Summer

Committee Chair

Larry Tang

Degree

Doctor of Philosophy (Ph.D.)

College

College of Sciences

Department

Statistics and Data Science

Format

PDF

Identifier

DP0029510

Language

English

Document Type

Thesis

Campus Location

Orlando (Main) Campus
