Research in human action recognition strives to develop increasingly generalized methods that are robust to intra-class variability and inter-class ambiguity. Recent years have seen tremendous strides in improving recognition accuracy on ever larger and more complex benchmark datasets comprising realistic "in the wild" action videos. Unfortunately, the all-encompassing, dense, global representations that bring about such improvements often benefit from inherent characteristics, specific to datasets and classes, that do not necessarily reflect knowledge about the entity to be recognized. This results in specific models that perform well within datasets but generalize poorly. Furthermore, training supervised action recognition and detection methods requires several precise spatio-temporal manual annotations to achieve good recognition and detection accuracy. For instance, current deep learning architectures require millions of accurately annotated videos to learn robust action classifiers. However, these annotations are quite difficult to obtain. In the first part of this dissertation, we explore the reasons for poor classifier performance when tested on novel datasets, and quantify the effect of scene backgrounds on action representations and recognition. We attempt to address the problem of recognizing human actions while training and testing on distinct datasets when test videos are neither labeled nor available during training. In this scenario, neither learning a joint vocabulary nor domain transfer techniques are applicable. We perform different types of partitioning of the GIST feature space for several datasets and compute measures of background scene complexity, as well as of the extent to which scenes are helpful in action classification. We then propose a new process to obtain a measure of confidence in each pixel of the video being a foreground region, using motion, appearance, and saliency together in a 3D Markov Random Field (MRF) based framework.
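As a rough illustration of the cue fusion described above, the sketch below averages per-pixel motion, appearance, and saliency scores into a unary term and then smooths it over a 6-connected spatio-temporal neighborhood, in the spirit of a mean-field style update on a 3D grid. It is not the dissertation's formulation: the equal cue weights, the `smooth_weight` and `n_iters` parameters, and the function name `foreground_confidence` are all assumptions made for this example.

```python
import numpy as np


def foreground_confidence(motion, appearance, saliency,
                          smooth_weight=0.5, n_iters=10):
    """Fuse three (T, H, W) cue volumes in [0, 1] into one smoothed volume.

    Hypothetical simplification: equal cue weights and a fixed number of
    neighbor-averaging iterations stand in for real MRF inference.
    """
    # Unary term: plain average of the three cues (assumed weighting).
    unary = (motion + appearance + saliency) / 3.0
    conf = unary.copy()
    for _ in range(n_iters):
        # Replicate border values so every voxel has six neighbors.
        padded = np.pad(conf, 1, mode="edge")
        neighbors = (
            padded[:-2, 1:-1, 1:-1] + padded[2:, 1:-1, 1:-1]    # previous / next frame
            + padded[1:-1, :-2, 1:-1] + padded[1:-1, 2:, 1:-1]  # up / down
            + padded[1:-1, 1:-1, :-2] + padded[1:-1, 1:-1, 2:]  # left / right
        ) / 6.0
        # Blend the evidence at each voxel with its neighborhood average.
        conf = (1.0 - smooth_weight) * unary + smooth_weight * neighbors
    return np.clip(conf, 0.0, 1.0)
```

With `smooth_weight=0` the result is just the averaged cues; larger values pull each voxel toward its spatial and temporal neighbors, which damps isolated responses that only one frame or one pixel supports.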
We also propose multiple ways to exploit the foreground confidence: to improve the bag-of-words vocabulary and the histogram representation of a video, and to derive a novel histogram decomposition based representation and kernel. The above-mentioned work provides the probability of each pixel belonging to the actor; however, it does not give the precise spatio-temporal location of the actor. Furthermore, the above framework would require precise spatio-temporal manual annotations to train an action detector. However, manual annotations in videos are laborious, require several annotators, and contain human biases. Therefore, in the second part of this dissertation, we propose a weakly labeled approach to automatically obtain spatio-temporal annotations of actors in action videos. We first obtain a large number of action proposals in each video. To capture the few most representative action proposals in each video and avoid processing thousands of them, we rank them using optical flow and saliency in a 3D-MRF based framework and select a few proposals using a MAP based proposal subset selection method. We demonstrate that this ranking preserves the high-quality action proposals. Several such proposals are generated for each video of the same action. Our next challenge is to iteratively select one proposal from each video so that all proposals are globally consistent. We formulate this as a Generalized Maximum Clique Problem (GMCP) using shape, global, and fine-grained similarity of proposals across the videos. The output of our method is the most action-representative proposal from each video. Our method can also annotate multiple instances of the same action in a video. Moreover, action detection experiments using annotations obtained by our method and several baselines demonstrate the superiority of our approach. The above-mentioned annotation method uses multiple videos of the same action.
Therefore, in the third part of this dissertation, we tackle the problem of spatio-temporal action localization in a video without assuming the availability of multiple videos or any prior annotations. The action is localized by employing images downloaded from the Internet using the action label. Given web images, we first dampen image noise using a random walk and avoid distracting backgrounds within images using image action proposals. Then, given a video, we generate multiple spatio-temporal action proposals. We suppress camera- and background-generated proposals by exploiting optical flow gradients within proposals. To obtain the most action-representative proposals, we propose to reconstruct action proposals in the video by leveraging the action proposals in images. Moreover, we preserve the temporal smoothness of the video and reconstruct all proposal bounding boxes jointly, using constraints that push the coefficients for each bounding box toward a common consensus and thus enforce coefficient similarity across multiple frames. We solve this optimization problem using a variant of the two-metric projection algorithm. Finally, the video proposal that has the lowest reconstruction cost and is motion salient is used to localize the action. Our method is applicable not only to trimmed videos but also to action localization in untrimmed videos, which is a very challenging problem. In the last part of this dissertation, we propose a novel approach to generate a few properly ranked action proposals from a large number of noisy proposals. The proposed approach begins by dividing each proposal into sub-proposals. We assume that the quality of a proposal remains the same within each sub-proposal. We then employ a graph optimization method to recombine the sub-proposals of all action proposals in a single video in order to optimally build new action proposals and rank them by the combined node and edge scores.
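One way to picture the recombination step is as a best-path search through a layered graph, where each layer holds the sub-proposals of one temporal segment, node scores rate individual sub-proposals, and edge scores rate how well two consecutive sub-proposals fit together. The sketch below uses a Viterbi-style dynamic program for this. The text does not specify the graph optimization method used, so the layered structure, the additive scoring, and the name `best_recombined_proposal` are assumptions made only for illustration.

```python
def best_recombined_proposal(node_scores, edge_scores):
    """Pick one sub-proposal per temporal segment to maximize the summed score.

    node_scores[k][i]    : score of sub-proposal i in segment k.
    edge_scores[k][i][j] : compatibility of sub-proposal i in segment k with
                           sub-proposal j in segment k + 1.
    Returns (total_score, path), where path[k] is the chosen index in segment k.
    Hypothetical sketch: additive node + edge scores are an assumption.
    """
    best = list(node_scores[0])  # best score of any path ending at each node
    back = []                    # back-pointers, one list per transition
    for k in range(1, len(node_scores)):
        new_best, pointers = [], []
        for j, node in enumerate(node_scores[k]):
            # Score of reaching node j from every node in the previous segment.
            candidates = [best[i] + edge_scores[k - 1][i][j]
                          for i in range(len(best))]
            i_star = max(range(len(candidates)), key=candidates.__getitem__)
            new_best.append(candidates[i_star] + node)
            pointers.append(i_star)
        best = new_best
        back.append(pointers)
    # Trace the highest-scoring path back from the last segment.
    end = max(range(len(best)), key=best.__getitem__)
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[end], path
```

This returns only the single best recombination. Producing a ranked list, as the approach describes, would require keeping the top few paths per node rather than one, which is omitted here for brevity.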
For an untrimmed video, we first divide the video into shots and then build the above-mentioned graph within each shot. Our method generates a few ranked proposals that can be better than all the existing underlying proposals. Our experimental results validate that properly ranked action proposals can significantly boost action detection results. Our extensive experimental results on different challenging and realistic action datasets, comparisons with several competitive baselines, and detailed analysis of each step of the proposed methods validate the proposed ideas and frameworks.
Doctor of Philosophy (Ph.D.)
College of Engineering and Computer Science
Doctoral Dissertation (Open Access)
Sultani, Waqas, "Weakly Labeled Action Recognition and Detection" (2017). Electronic Theses and Dissertations. 5513.