ORCID

0000-0002-9068-7429

Keywords

Video LLMs, Video Understanding, Action Recognition, Computer Vision, Machine Learning

Abstract

This dissertation investigates the problem of general video understanding, aiming to develop models capable of robust, adaptable, and semantically rich interpretation of video data across diverse domains and tasks. Despite significant progress in video analysis, existing methods remain limited by restricted vocabularies, inadequate fine-grained discrimination, and vulnerability to noise and adversarial perturbations. To address these challenges, the research advances four complementary directions encompassing multimodal fusion, open-vocabulary recognition, robustness enhancement, and unified video–language modeling. First, a multimodal prototype contrastive framework is proposed for fine-grained video classification, integrating visual and auditory cues through a transformer-based architecture and supervised contrastive learning with class prototypes. Evaluations on the newly introduced SRI-APPROVE dataset demonstrate substantial gains in distinguishing visually similar but semantically distinct categories. Second, an open-vocabulary multi-label classification framework is developed by combining contrastive vision–language encoders with large language model–guided prompt generation, enabling zero-shot recognition of novel actions and entities beyond predefined label sets. Third, the study identifies a robustness deficiency in contrastive self-supervised learning (CSL), showing that false-negative pairs during training reduce adversarial resilience. A corrective strategy that adaptively removes such pairs improves robustness by up to two-thirds relative to supervised baselines without requiring adversarial training. Finally, a unified Video Large Language Model (ViLL-E) is presented, combining generative and embedding-based learning to jointly address video captioning, retrieval, and question answering. ViLL-E achieves state-of-the-art performance across multiple benchmarks and introduces new zero-shot retrieval capabilities. Collectively, these contributions advance the conceptual and practical foundations for general-purpose video understanding, supporting the development of intelligent, adaptable, and dependable multimodal systems for real-world applications.

Completion Date

2025

Semester

Fall

Committee Chair

Shah, Mubarak

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Format

PDF

Identifier

DP0029842

Document Type

Thesis

Campus Location

Orlando (Main) Campus