ORCID
0009-0004-8842-5990
Keywords
Multi-modal learning, Multimodal Large Language Models, Video Understanding, Computer Vision, Machine Learning
Abstract
Multimodal learning aims to develop systems that understand and reason across multiple sources of information such as images, videos, audio, and text. These capabilities are central to applications including visual search, digital assistants, robotics, and visual question answering. Despite recent progress in multimodal large language models, current approaches struggle to capture the structured complexity of real-world visual data. Fine-grained visual details are often lost during multimodal alignment, models frequently fail to capture nuanced human intent, and existing methods remain limited in understanding long-form videos with complex temporal dynamics. Furthermore, many evaluation benchmarks rely heavily on explicit visual cues, so they can overestimate reasoning ability and fail to assess deeper relational understanding.
This dissertation addresses these limitations through complementary advances that strengthen multimodal learning across both images and videos. First, we introduce X-Former, a multimodal fusion framework that unifies contrastive and reconstruction learning to preserve fine-grained spatial representations while maintaining strong cross-modal alignment. Improved perception strengthens these models, but effective deployment also requires alignment with human intent. To this end, we propose SMPRO, the first self-supervised visual preference alignment framework that models multiple preferences through differentiable multi-preference ranking without requiring costly human annotations.
Extending multimodal learning to video understanding, we introduce UDL, an unsupervised discriminative embedding framework that discovers sub-actions directly from long videos, enabling temporal structure learning without dense supervision. However, beyond learning temporal representations, effective multimodal systems must also preserve the internal structure of each modality during cross-modal alignment. We therefore propose Multi-SK, a structure-preserving multimodal learning framework that maintains intra-modal relationships while learning shared representations. Finally, we introduce VRR-QA, a benchmark for visual relational reasoning in videos that emphasizes implicit relationships and contextual dependencies between events.
Together, these contributions advance multimodal learning by improving perception, preference alignment, temporal structure discovery, and structure-preserving representation learning, while introducing a new evaluation benchmark for deeper reasoning in video understanding.
Completion Date
2026
Semester
Spring
Committee Chair
Shah, Mubarak
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Computer Science
Document Type
Dissertation
Identifier
DP0053117
STARS Citation
Sirnam, Swetha, "Advancing Multi-Modal Learning" (2026). Graduate Studies Theses and Dissertations 2026. 180.
https://stars.library.ucf.edu/gradstudies_etd_2026/180
Accessibility Statement
This item was created or digitized prior to April 24, 2027, or is a reproduction of legacy media created before that date. It is preserved in its original, unmodified state specifically for research, reference, or historical recordkeeping. In accordance with the ADA Title II Final Rule, the University Libraries provides accessible versions of archival materials upon request. To request an accommodation for this item, please submit an accessibility request form.