ORCID

0009-0004-8842-5990

Keywords

Multi-modal learning, Multimodal Large Language Models, Video Understanding, Computer Vision, Machine Learning

Abstract

Multimodal learning aims to develop systems that understand and reason across multiple sources of information such as images, videos, audio, and text. These capabilities are central to applications including visual search, digital assistants, robotics, and visual question answering. Despite recent progress in multimodal large language models, current approaches struggle to capture the structured complexity of real-world visual data. Fine-grained visual details are often lost during multimodal alignment, models frequently fail to capture nuanced human intent, and existing methods remain limited in understanding long-form videos with complex temporal dynamics. Furthermore, many evaluation benchmarks rely heavily on explicit visual cues, which can overestimate reasoning ability and fail to assess deeper relational understanding.

This dissertation addresses these limitations through complementary advances that strengthen multimodal learning across both images and videos. First, we introduce X-Former, a multimodal fusion framework that unifies contrastive and reconstruction learning to preserve fine-grained spatial representations while maintaining strong cross-modal alignment. Improved perception strengthens a model, but effective deployment also requires alignment with human intent. To that end, we propose SMPRO, the first self-supervised visual preference alignment framework that models multiple preferences through differentiable multi-preference ranking without requiring costly human annotations.

Extending multimodal learning to video understanding, we introduce UDL, an unsupervised discriminative embedding framework that discovers sub-actions directly from long videos, enabling temporal structure learning without dense supervision. However, beyond learning temporal representations, effective multimodal systems must also preserve the internal structure of each modality during cross-modal alignment. We therefore propose Multi-SK, a structure-preserving multimodal learning framework that maintains intra-modal relationships while learning shared representations. Finally, we introduce VRR-QA, a benchmark for visual relational reasoning in videos that emphasizes implicit relationships and contextual dependencies between events.

Together, these contributions advance multimodal learning by improving perception, preference alignment, temporal structure discovery, and structure-preserving representation learning, while introducing a new evaluation benchmark for deeper reasoning in video understanding.

Completion Date

2026

Semester

Spring

Committee Chair

Shah, Mubarak

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Format

PDF

Document Type

Dissertation

Identifier

DP0053117
