Keywords
weakly-supervised, semi-supervised, spatio-temporal video understanding, multimodal foundation models, video action detection, spatio-temporal video grounding
Abstract
Although deep learning has advanced video understanding, its deployment is limited by two challenges: reliance on manual annotations and restricted ability to operate beyond closed-world settings. Spatio-temporal labeling is costly, while real-world applications demand models that interpret novel, open-ended queries. This dissertation develops methods to improve label efficiency and enable flexible, open-world video analysis.
To alleviate the annotation bottleneck, we establish a semi-supervised framework for video action detection using consistency regularization, where a model learns stable predictions across augmented views of the same video. This is challenging because generic augmentations often affect only static regions, providing limited information about dynamic actions. We introduce spatio-temporal consistency constraints that model motion dynamics. Two regularizers, temporal coherency and gradient smoothness, leverage the continuous nature of human actions to ensure predicted localizations remain coherent.
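The constraints above can be illustrated with a minimal sketch. This is not the dissertation's implementation; it assumes per-frame localization maps of shape (T, H, W) and uses mean-squared differences as stand-ins for the consistency, temporal-coherency, and gradient-smoothness terms:

```python
import numpy as np

def consistency_loss(pred_a, pred_b):
    # Consistency regularization: predictions on two augmented
    # views of the same video should agree (here, MSE).
    return float(np.mean((pred_a - pred_b) ** 2))

def temporal_coherency(preds):
    # Penalize abrupt changes between consecutive frames,
    # reflecting the continuity of human actions.
    diffs = np.diff(preds, axis=0)
    return float(np.mean(diffs ** 2))

def gradient_smoothness(preds):
    # Penalize jerky localization trajectories via
    # second-order temporal differences.
    second = np.diff(preds, n=2, axis=0)
    return float(np.mean(second ** 2))

# Toy example: localization maps for T=8 frames on a 4x4 grid.
rng = np.random.default_rng(0)
view_a = rng.random((8, 4, 4))
view_b = view_a + 0.01 * rng.standard_normal((8, 4, 4))

loss = (consistency_loss(view_a, view_b)
        + temporal_coherency(view_a)
        + gradient_smoothness(view_a))
```

In a real system these terms would be weighted and applied to network outputs on unlabeled videos alongside a supervised loss on the labeled subset.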
Building on this, we refine our approach through an improved student–teacher framework. Traditional models can accumulate errors when the teacher produces noisy pseudo-labels. We mitigate this with an error recovery module that learns from the student’s mistakes on labeled data and transfers corrective knowledge to improve teacher predictions on unlabeled videos.
While effective within known categories, these methods are insufficient for real-world scenarios requiring understanding beyond predefined labels. We address the open-world challenge of grounding free-form textual queries in video without bounding-box supervision. Direct adaptation of vision–language models lacks spatio-temporal grounding and fine-grained alignment. We introduce a method linking textual queries to spatio-temporal predictions and a progressive learning framework that builds understanding gradually: first recognizing simple sub-actions, then adapting to dynamic contexts.
Finally, we propose a context-aware progressive learning approach utilizing surrounding context to refine object identification and action interpretation. A self-paced spatio-temporal curriculum guides the model from coarse cues to finer distinctions. Together, these advances move the field toward scalable, practical systems interpreting complex visual narratives with minimal supervision.
Completion Date
2025
Semester
Fall
Committee Chair
Yogesh Singh Rawat
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Computer Science
Document Type
Dissertation/Thesis
Campus Location
Orlando (Main) Campus
STARS Citation
Kumar, Akash, "Towards Label-Efficient Approaches For Dense Video Tasks" (2025). Graduate Thesis and Dissertation post-2024. 468.
https://stars.library.ucf.edu/etd2024/468