Motion is a missing information in an image, however, it is a valuable cue for action recognition. Thus, lack of motion information in a single image makes action recognition for still images inherently a very challenging problem in computer vision. In this dissertation, we show that both spatial and temporal patterns provide crucial information for recognizing human actions. Therefore, action recognition depends not only on the spatially-salient pixels, but also on the temporal patterns of those pixels. To address the challenge caused by the absence of temporal information in a single image, we introduce five effective action classification methodologies along with a new still image action recognition dataset. These include (1) proposing a new Spatial-Temporal Convolutional Neural Network, STCNN, trained by fine-tuning a CNN model, pre-trained on appearance-based classification only, over a novel latent space-time domain, named Ranked Saliency Map and Predicted Optical Flow, or RankSM-POF for short, (2) introducing a novel unsupervised Zero-shot approach based on low-rank Tensor Decomposition, named ZTD, (3) proposing the concept of temporal image, a compact representation of hypothetical sequence of images and then using it to design a new hierarchical deep learning network, TICNN, for still image action recognition, (4) introducing a dataset for STill image Action Recognition (STAR), containing over 1M images across 50 different human body-motion action categories. UCF-STAR is the largest dataset in the literature for action recognition in still images, exposing the intrinsic difficulty of action recognition through its realistic scene and action complexity. Moreover, TSSTN, a two-stream spatiotemporal network, is introduced to model the latent temporal information in a single image, and using it as prior knowledge in a two-stream deep network, (5) proposing a parallel heterogeneous meta- learning method to combine STCNN and ZTD through a stacking approach into an ensemble classifier of the proposed heterogeneous base classifiers. Altogether, this work demonstrates benefits of UCF-STAR as a large-scale still images dataset, and show the role of latent motion information in recognizing human actions in still images by presenting approaches relying on predicting temporal information, yielding higher accuracy on widely-used datasets.

Graduation Date





Foroosh, Hassan


Doctor of Philosophy (Ph.D.)


College of Engineering and Computer Science


Computer Science

Degree Program

Computer Science







Release Date

December 2020

Length of Campus-only Access


Access Status

Doctoral Dissertation (Open Access)