Keywords

Video Action Understanding, Action Classification, Temporal Action Localization, Action Detection, Computer Vision

Abstract

Video action understanding involves comprehending human actions depicted in videos. Central to this task are four fundamental questions: What, When, Where, and Who. These questions encapsulate the essence of action classification, temporal action localization, action detection, and actor recognition. Despite notable progress on these tasks, many challenges persist, and in this dissertation we propose innovative solutions to tackle them head-on.

First, we address the challenges in action classification ("What?"), specifically multi-view action recognition. We propose a novel transformer decoder-based model with learnable view and action queries that enforces the learning of action features robust to viewpoint shifts. Next, we focus on temporal action localization ("What?" and "When?") and address the challenges introduced by the multi-label setting. Our solution leverages the inherent relationships between complex actions in real-world videos: we introduce an attention-based architecture that models these relationships to localize actions in time.
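
To make the multi-view idea concrete, the following is a minimal PyTorch sketch, not the dissertation's exact architecture, of a transformer decoder whose learnable view and action queries cross-attend to backbone video features; all module names, dimensions, and query counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ViewActionDecoder(nn.Module):
    """Hypothetical sketch: learnable view/action queries over video tokens."""
    def __init__(self, dim=512, num_views=4, num_actions=60, num_layers=2):
        super().__init__()
        # One learnable query per candidate view and per action class.
        self.view_queries = nn.Parameter(torch.randn(num_views, dim))
        self.action_queries = nn.Parameter(torch.randn(num_actions, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.action_head = nn.Linear(dim, 1)  # one score per action query

    def forward(self, video_feats):
        # video_feats: (B, T, dim) spatio-temporal tokens from a video backbone.
        b = video_feats.size(0)
        queries = torch.cat([self.view_queries, self.action_queries], dim=0)
        queries = queries.unsqueeze(0).expand(b, -1, -1)
        out = self.decoder(queries, video_feats)           # (B, V + A, dim)
        action_out = out[:, self.view_queries.size(0):]    # keep the action slots
        return self.action_head(action_out).squeeze(-1)    # (B, num_actions) logits
```

The intuition behind such a split is that dedicated view queries can absorb view-specific cues, leaving the action slots free to encode view-invariant evidence.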

Next, we propose Gabriella, a real-time online system for activity detection ("What?", "When?", and "Where?") in security videos. Our solution has three stages: tubelet extraction, activity classification, and online tubelet merging. For tubelet extraction, we propose a localization network that detects potential foreground regions to generate action tubelets. The detected tubelets are assigned activity class scores by the classification network and merged by our proposed Tubelet-Merge Action-Split (TMAS) algorithm into the final action detections. Finally, we introduce the novel task of joint action and actor recognition ("What?" and "Who?"), which requires simultaneously identifying subjects (actors) and their actions, and solve it using disentangled representation learning. Our transformer-based model learns to separate actor and action features by employing supervised contrastive losses alongside the standard cross-entropy loss.
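
To illustrate the training objective of this last contribution, here is a hedged PyTorch sketch of a standard supervised contrastive (SupCon) term applied separately to actor and action embeddings, combined with cross-entropy on the classifier logits; the temperature, the equal weighting of terms, and the function names are assumptions rather than the dissertation's exact formulation.

```python
import torch
import torch.nn.functional as F

def supcon_loss(feats, labels, temperature=0.1):
    """Supervised contrastive loss: feats (B, D) embeddings, labels (B,) ids."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t() / temperature                    # pairwise similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    sim = sim.masked_fill(self_mask, float('-inf'))          # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)          # avoid -inf * 0 = nan
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_per_anchor = pos.sum(1).clamp(min=1)                 # guard lone anchors
    return -(log_prob * pos).sum(1).div(pos_per_anchor).mean()

def disentangled_loss(actor_feats, action_feats, actor_logits, action_logits,
                      actor_labels, action_labels):
    # Cross-entropy keeps both heads discriminative; the two SupCon terms pull
    # same-actor (resp. same-action) embeddings together in separate spaces.
    return (F.cross_entropy(actor_logits, actor_labels)
            + F.cross_entropy(action_logits, action_labels)
            + supcon_loss(actor_feats, actor_labels)
            + supcon_loss(action_feats, action_labels))
```

Applying the contrastive term to each embedding space with its own labels is what encourages the actor and action features to disentangle, since examples sharing an actor but not an action (and vice versa) are pulled together in one space while remaining separable in the other.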

Completion Date

2024

Semester

Summer

Committee Chair

Shah, Mubarak

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Format

application/pdf

Identifier

DP0028563

URL

https://purls.library.ucf.edu/go/DP0028563

Language

English

Release Date

8-15-2024

Length of Campus-only Access

None

Access Status

Doctoral Dissertation (Open Access)

Campus Location

Orlando (Main) Campus

Accessibility Status

Meets minimum standards for ETDs/HUTs
