Keywords

Computer Vision, Video Understanding, Action Detection, Vision-Language, Weakly-Supervised

Abstract

In this work, we focus on scaling open-vocabulary action detection. Existing approaches to action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) the parameter-heavy adaptations required to convert a pretrained vision-language contrastive model into a detector, which risk overfitting the added, non-pretrained parameters to the base action classes.

First, we introduce an encoder-only multimodal model for video action detection, reducing reliance on parameter-heavy additions. Second, we introduce a simple weakly supervised training strategy that exploits an existing closed-set action detection dataset for pretraining. Finally, we depart from the ill-posed base-to-novel benchmark used in prior open-vocabulary action detection work and devise a new benchmark that evaluates on existing closed-set action detection datasets without ever using them for training, reporting new results that can serve as baselines for future work.

Completion Date

2025

Semester

Spring

Committee Chair

Rawat, Yogesh Singh

Degree

Master of Science (M.S.)

College

College of Engineering and Computer Science

Department

Computer Science

Identifier

DP0029395

Document Type

Dissertation/Thesis

Campus Location

Orlando (Main) Campus
