ORCID
0000-0002-9068-7429
Keywords
Video LLMs, Video Understanding, Action Recognition, Computer Vision, Machine Learning
Abstract
This dissertation investigates the problem of general video understanding, aiming to develop models capable of robust, adaptable, and semantically rich interpretation of video data across diverse domains and tasks. Despite significant progress in video analysis, existing methods remain limited by restricted vocabularies, inadequate fine-grained discrimination, and vulnerability to noise and adversarial perturbations. To address these challenges, the research advances four complementary directions encompassing multimodal fusion, open-vocabulary recognition, robustness enhancement, and unified video–language modeling. First, a multimodal prototype contrastive framework is proposed for fine-grained video classification, integrating visual and auditory cues through a transformer-based architecture and supervised contrastive learning with class prototypes. Evaluations on the newly introduced SRI-APPROVE dataset demonstrate substantial gains in distinguishing visually similar but semantically distinct categories. Second, an open-vocabulary multi-label classification framework is developed by combining contrastive vision–language encoders with large language model–guided prompt generation, enabling zero-shot recognition of novel actions and entities beyond predefined label sets. Third, the study identifies a robustness deficiency in contrastive self-supervised learning (CSL), showing that false-negative pairs during training reduce adversarial resilience. A corrective strategy that adaptively removes such pairs improves robustness by up to two-thirds relative to supervised baselines, without requiring adversarial training. Finally, a unified Video Large Language Model (ViLL-E) is presented, combining generative and embedding-based learning to jointly address video captioning, retrieval, and question answering. ViLL-E achieves state-of-the-art performance across multiple benchmarks and introduces new zero-shot retrieval capabilities.
Collectively, these contributions advance the conceptual and practical foundations for general-purpose video understanding, supporting the development of intelligent, adaptable, and dependable multimodal systems for real-world applications.
Completion Date
2025
Semester
Fall
Committee Chair
Shah, Mubarak
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Computer Science
Format
Identifier
DP0029842
Document Type
Thesis
Campus Location
Orlando (Main) Campus
STARS Citation
Gupta, Rohit, "Towards Achieving General Video Understanding" (2025). Graduate Thesis and Dissertation post-2024. 453.
https://stars.library.ucf.edu/etd2024/453