Videos capture the inherently sequential nature of the real world, making automatic video understanding essential for automatic understanding of the real world. Owing to major advances in camera, communication, and storage hardware, video has become a widely used data format for crucial applications such as home automation, security, analysis, robotics, and autonomous driving. Existing methods for video understanding require heavy computation and large amounts of training data to perform well, which limits how quickly videos can be processed and how much data can be labeled for training. Real-world video understanding requires analyzing dense scenes and sequential information, so processing time and labeling cost grow with scene density and video length. It is therefore crucial to develop video understanding methods that reduce processing time and labeling cost. In this dissertation, we first propose a method to improve network efficiency for video understanding tasks and then provide methods to improve annotation efficiency for these tasks. Through these works, we aim to improve both network efficiency and data annotation efficiency, in an effort to encourage wider development and adoption of large-scale video understanding methods. First, we propose an end-to-end neural network that performs faster video actor-action detection. The proposed network reduces the need for extra region proposal computation and post-processing filters, which simplifies training and increases inference speed. Next, we propose an active learning based sparse labeling method that makes annotation of large video datasets efficient. It selects a few useful frames from videos for annotation, reducing annotation cost while maintaining the dataset's usefulness for video understanding tasks. We also provide a method to train existing video understanding models using such sparse annotations.
Then, we propose a clustering-based hybrid active learning method that selects useful videos as well as useful frames for annotation, reducing annotation cost even further. Finally, we study the relationship between different types of annotations and how they affect video understanding tasks. We extensively evaluate and analyze our methods on various datasets and downstream tasks to show that they enable efficient video understanding with faster networks and limited sparse annotations.



Graduation Date





Rawat, Yogesh Singh


Doctor of Philosophy (Ph.D.)


College of Engineering and Computer Science


Computer Science

Degree Program

Computer Science


CFE0009773; DP0027881





Release Date

August 2023

Length of Campus-only Access


Access Status

Doctoral Dissertation (Open Access)