ORCID
https://orcid.org/0009-0002-1424-5589
Keywords
Supervised Topic Modeling, Bayesian Shrinkage, Latent Dirichlet Allocation (LDA), Scientific Text Mining, Variable Selection
Abstract
Topic modeling has become a powerful tool for analyzing large-scale textual corpora across scientific domains. However, traditional unsupervised approaches such as Latent Dirichlet Allocation (LDA) often produce redundant or weakly informative topics, which limits their interpretability and predictive utility. This dissertation presents a dual-track study that advances interpretable topic modeling and domain-specific text analysis by (1) quantitatively analyzing research trends in Cyber Aggression and Abuse (CAA) and (2) developing a supervised topic selection framework for forensic science literature.
In the first part, we apply LDA to a corpus of 2,309 journal abstracts on CAA sourced from the Web of Science database. This study is among the first to systematically map latent themes in CAA literature, uncovering hot and cold topic clusters, regional publication trends, and temporal shifts in research focus. Key themes include psychological impacts, identity-based harms, detection techniques, and digital interventions. Our findings highlight an evolving interdisciplinary landscape, offering novel insights for scholars, practitioners, and policymakers working on online harm mitigation.
In the second part, we introduce a supervised topic refinement method that integrates LDA with the Mixed-type Multivariate Bayesian Shrinkage Prior (Mt-MBSP) model. We apply this framework to thousands of forensic science abstracts published in the Journal of Forensic Sciences (2009–2022), aiming to identify a sparse, semantically coherent, and domain-relevant set of topics. Mt-MBSP leverages document-category labels to prioritize predictive topics, and the selected topics outperform those of unsupervised baselines such as BERTopic on coherence and diversity metrics. We further validate the selected topics through classification experiments and a theoretical generalization analysis based on VC-dimension bounds, confirming their practical and statistical utility.
Together, these contributions offer a robust framework for interpretable and predictive topic modeling in applied scientific domains. Our work advances both the methodological landscape of topic selection and the empirical understanding of research trajectories in forensic science and CAA detection.
Completion Date
2025
Semester
Summer
Committee Chair
Larry Tang
Degree
Doctor of Philosophy (Ph.D.)
College
College of Sciences
Department
Statistics and Data Science
Identifier
DP0029510
Language
English
Document Type
Thesis
Campus Location
Orlando (Main) Campus
STARS Citation
Alipour Yengejeh, Amir, "Supervised Topic Modeling for Scientific Texts: Bayesian Shrinkage Methods and Domain-Specific Applications in Forensics and Online Aggression Sciences" (2025). Graduate Thesis and Dissertation post-2024. 266.
https://stars.library.ucf.edu/etd2024/266