Keywords
Video Action Understanding, Action Classification, Temporal Action Localization, Action Detection, Computer Vision
Abstract
Video action understanding involves comprehending the actions performed by humans in videos. Central to this task are four fundamental questions: What, When, Where, and Who, which correspond to action classification, temporal action localization, action detection, and actor recognition, respectively. Despite notable progress on these tasks, many challenges persist, and in this dissertation we propose novel solutions to address them.
First, we address the challenges in action classification (``What?''), specifically multi-view action recognition. We propose a novel transformer decoder-based model with learnable view and action queries that enforces the learning of action features robust to viewpoint shifts. Next, we focus on temporal action localization (``What?'' and ``When?'') and address challenges introduced by the multi-label setting. Our solution leverages the inherent relationships between complex actions in real-world videos: we introduce an attention-based architecture that models these relationships for temporal action localization.
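As a rough illustration of the first contribution (not the dissertation's actual implementation), the following PyTorch sketch shows how learnable view and action queries can cross-attend to backbone video features through a transformer decoder. All names, dimensions, and prediction heads here (ViewActionDecoder, num_views, num_actions, dim) are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ViewActionDecoder(nn.Module):
        """Minimal sketch: learnable view and action queries cross-attend to
        video features via a transformer decoder. Sizes are assumptions."""
        def __init__(self, dim=512, num_views=5, num_actions=60, depth=2, heads=8):
            super().__init__()
            self.view_queries = nn.Parameter(torch.randn(num_views, dim))
            self.action_queries = nn.Parameter(torch.randn(num_actions, dim))
            layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
            self.view_head = nn.Linear(dim, 1)    # one logit per view query
            self.action_head = nn.Linear(dim, 1)  # one logit per action query

        def forward(self, video_feats):
            # video_feats: (B, T, dim) tokens from a video backbone
            b = video_feats.size(0)
            queries = torch.cat([self.view_queries, self.action_queries], dim=0)
            queries = queries.unsqueeze(0).expand(b, -1, -1)
            out = self.decoder(tgt=queries, memory=video_feats)
            n_v = self.view_queries.size(0)
            view_logits = self.view_head(out[:, :n_v]).squeeze(-1)      # (B, num_views)
            action_logits = self.action_head(out[:, n_v:]).squeeze(-1)  # (B, num_actions)
            return view_logits, action_logits

Training the action queries jointly with view queries in this style lets the decoder factor apart view-specific and action-specific evidence, which is the intuition behind viewpoint-robust action features.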
Next, we propose \textit{Gabriella}, a real-time online system for activity detection (``What?'', ``When?'', and ``Where?'') in security videos. Our solution has three stages: tubelet extraction, activity classification, and online tubelet merging. For tubelet extraction, we propose a localization network that detects potential foreground regions to generate action tubelets. The detected tubelets are assigned activity class scores by the classification network and merged by our proposed Tubelet-Merge Action-Split (TMAS) algorithm to form the final action detections. Finally, we introduce the novel task of joint action and actor recognition (``What?'' and ``Who?'') and solve it with disentangled representation learning: our transformer-based model simultaneously identifies both subjects (actors) and their actions, learning to separate actor and action features by employing supervised contrastive losses alongside a standard cross-entropy loss.
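To make the final training objective concrete, below is a minimal PyTorch sketch of a supervised contrastive loss (in the style of Khosla et al., 2020) applied separately to actor and action embeddings and combined with cross-entropy. The function names, loss weighting lambda_con, and temperature are illustrative assumptions, not the dissertation's exact formulation.

    import torch
    import torch.nn.functional as F

    def supcon_loss(features, labels, temperature=0.1):
        """Supervised contrastive loss over one embedding space.
        features: (B, D); labels: (B,) integer class ids."""
        features = F.normalize(features, dim=1)
        sim = features @ features.t() / temperature            # (B, B) similarities
        self_mask = torch.eye(len(labels), dtype=torch.bool,
                              device=features.device)
        pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
        logits = sim.masked_fill(self_mask, float('-inf'))     # exclude self-pairs
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        log_prob = log_prob.masked_fill(self_mask, 0.0)        # avoid -inf * 0 = nan
        pos_count = pos_mask.sum(1).clamp(min=1)
        return -(log_prob * pos_mask).sum(1).div(pos_count).mean()

    def total_loss(actor_emb, action_emb, actor_logits, action_logits,
                   actor_labels, action_labels, lambda_con=0.5):
        # Cross-entropy supervises the classifiers; the contrastive terms pull
        # together same-actor (and same-action) embeddings in their own spaces,
        # encouraging the two feature sets to disentangle.
        ce = F.cross_entropy(actor_logits, actor_labels) \
           + F.cross_entropy(action_logits, action_labels)
        con = supcon_loss(actor_emb, actor_labels) \
            + supcon_loss(action_emb, action_labels)
        return ce + lambda_con * con

Applying the contrastive term per embedding space means actor identity cannot leak into the action features (and vice versa) without being penalized, which is the core idea behind the disentanglement.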
Completion Date
2024
Semester
Summer
Committee Chair
Shah, Mubarak
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Computer Science
Format
application/pdf
Identifier
DP0028563
URL
https://purls.library.ucf.edu/go/DP0028563
Language
English
Release Date
8-15-2024
Length of Campus-only Access
None
Access Status
Doctoral Dissertation (Open Access)
Campus Location
Orlando (Main) Campus
STARS Citation
Tirupattur, Praveen, "Video Action Understanding: Action Classification, Temporal Localization, And Detection" (2024). Graduate Thesis and Dissertation 2023-2024. 359.
https://stars.library.ucf.edu/etd2023/359
Accessibility Status
Meets minimum standards for ETDs/HUTs