Keywords
Computer Vision, Video Understanding, Action Detection, Vision-Language, Weakly-Supervised
Abstract
In this work, we focus on scaling open-vocabulary action detection. Existing approaches to action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) the parameter-heavy adaptations needed to convert a pretrained vision-language contrastive model into a detector, which risk overfitting the newly added, non-pretrained parameters to base action classes.
First, we introduce an encoder-only multimodal model for video action detection, reducing the reliance on parameter-heavy architectural additions. Second, we introduce a simple weakly supervised training strategy that exploits an existing closed-set action detection dataset for pretraining. Finally, we depart from the ill-posed base-to-novel benchmark used in prior open-vocabulary action detection work and devise a new benchmark that evaluates on existing closed-set action detection datasets without ever training on them, presenting results that can serve as baselines for future work.
Completion Date
2025
Semester
Spring
Committee Chair
Rawat, Yogesh Singh
Degree
Master of Science (M.S.)
College
College of Engineering and Computer Science
Department
Computer Science
Identifier
DP0029395
Document Type
Dissertation/Thesis
Campus Location
Orlando (Main) Campus
STARS Citation
Sia, Zhen Hao, "Weakly-Supervised Scaling for Open-Vocabulary Action Detection" (2025). Graduate Thesis and Dissertation post-2024. 226.
https://stars.library.ucf.edu/etd2024/226