Keywords

weakly-supervised, semi-supervised, spatio-temporal video understanding, multimodal foundation models, video action detection, spatio-temporal video grounding

Abstract

Although deep learning has advanced video understanding, its deployment is limited by two challenges: reliance on manual annotations and restricted ability to operate beyond closed-world settings. Spatio-temporal labeling is costly, while real-world applications demand models that interpret novel, open-ended queries. This dissertation develops methods to improve label efficiency and enable flexible, open-world video analysis.

To alleviate the annotation bottleneck, we establish a semi-supervised framework for video action detection based on consistency regularization, in which a model learns to make stable predictions across augmented views of the same video. This is challenging because generic augmentations often affect only static regions, providing limited signal about dynamic actions. We therefore introduce spatio-temporal consistency constraints that model motion dynamics: two regularizers, temporal coherency and gradient smoothness, leverage the continuous nature of human actions to keep predicted localizations coherent over time.
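For concreteness, a minimal PyTorch-style sketch of these constraints is given below. The tensor shapes, function names, and exact loss forms are illustrative assumptions, not the dissertation's implementation:

```python
import torch.nn.functional as F

def consistency_loss(model, clip, augment):
    """Consistency regularization: predictions on two augmented
    views of the same clip should agree."""
    p1 = model(augment(clip))  # assumed (B, T, H, W) localization maps
    p2 = model(augment(clip))
    return F.mse_loss(p1, p2)

def temporal_coherency_loss(pred):
    """Actions evolve continuously, so localization maps of
    consecutive frames should stay close."""
    return F.l1_loss(pred[:, 1:], pred[:, :-1])

def gradient_smoothness_loss(pred):
    """Penalize second-order temporal differences so the predicted
    region moves smoothly instead of jumping between frames."""
    grad = pred[:, 1:] - pred[:, :-1]  # first-order temporal gradient
    return (grad[:, 1:] - grad[:, :-1]).abs().mean()
```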

Building on this, we refine our approach through an improved student–teacher framework. Conventional student–teacher training can accumulate errors when the teacher produces noisy pseudo-labels. We mitigate this with an error recovery module that learns from the student's mistakes on labeled data and transfers corrective knowledge to improve the teacher's predictions on unlabeled videos.
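A hedged sketch of such an error recovery scheme, assuming a standard mean-teacher setup with exponential-moving-average weight updates; recovery_net and the specific losses are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # The teacher's weights track an exponential moving average
    # of the student's, as in standard mean-teacher training.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def error_recovery_losses(student, teacher, recovery_net, labeled, unlabeled):
    x, y = labeled
    # 1) On labeled clips, recovery_net learns to map the student's
    #    (possibly wrong) prediction toward the ground truth.
    recovered = recovery_net(student(x).detach())
    loss_recovery = F.mse_loss(recovered, y)
    # 2) On unlabeled clips, the same correction refines the teacher's
    #    pseudo-labels before they supervise the student.
    with torch.no_grad():
        pseudo = recovery_net(teacher(unlabeled))
    loss_unsup = F.mse_loss(student(unlabeled), pseudo)
    return loss_recovery, loss_unsup
```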

While effective within known categories, these methods are insufficient for real-world scenarios that require understanding beyond predefined labels. We address the open-world challenge of grounding free-form textual queries in video without bounding-box supervision. Direct adaptation of vision–language models lacks spatio-temporal grounding and fine-grained alignment. We introduce a method that links textual queries to spatio-temporal predictions, together with a progressive learning framework that builds understanding gradually: first recognizing simple sub-actions, then adapting to dynamic contexts.
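As a rough illustration of query-to-region linking, one could score patch embeddings from a vision–language backbone against the query's text embedding. This is a coarse zero-shot baseline under assumed shapes, not the proposed method:

```python
import torch
import torch.nn.functional as F

def ground_query(patch_feats, text_feat):
    """Coarse spatio-temporal grounding via cosine similarity.

    patch_feats: (T, H, W, D) frame-patch embeddings from a
    vision-language backbone; text_feat: (D,) query embedding.
    Returns a (T, H, W) relevance map; thresholding it yields a
    rough spatio-temporal tube without any box supervision.
    """
    v = F.normalize(patch_feats, dim=-1)
    q = F.normalize(text_feat, dim=-1)
    return torch.einsum('thwd,d->thw', v, q)
```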

Finally, we propose a context-aware progressive learning approach that uses surrounding context to refine object identification and action interpretation. A self-paced spatio-temporal curriculum guides the model from coarse cues to finer distinctions. Together, these advances move the field toward scalable, practical systems that interpret complex visual narratives with minimal supervision.
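One simple instantiation of self-paced weighting, shown purely for illustration; the threshold rule and its schedule here are the generic self-paced learning recipe, not the dissertation's curriculum:

```python
import torch

def self_paced_weights(losses, lam):
    """Samples whose current loss falls below the threshold lam are
    admitted to training; annealing lam upward over epochs moves the
    curriculum from easy, coarse cues to harder, fine distinctions."""
    return (losses < lam).float()

# Usage: per-sample losses -> binary weights -> weighted objective.
losses = torch.tensor([0.2, 0.9, 0.4])
weights = self_paced_weights(losses, lam=0.5)  # tensor([1., 0., 1.])
weighted_loss = (weights * losses).sum() / weights.sum().clamp(min=1)
```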

Completion Date

2025

Semester

Fall

Committee Chair

Yogesh Singh Rawat

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Document Type

Dissertation/Thesis

Campus Location

Orlando (Main) Campus
