A Temporal Sequence Learning For Action Recognition And Prediction
Abstract
In this work, we present a method to represent a video as a sequence of words, and to learn the temporal ordering of those words as the key information for predicting and recognizing human actions. We leverage core concepts from the Natural Language Processing (NLP) literature on sentence classification to solve the problems of action prediction and action recognition. Each frame is converted into a word, represented as a vector using the Bag of Visual Words (BoW) encoding method. The words are then combined into a sentence that represents the video. The sequence of words in different actions is learned with a simple but effective Temporal Convolutional Neural Network (T-CNN) that captures the temporal ordering of information in a video sentence. We demonstrate that a key characteristic of the proposed method is its low latency, i.e., its ability to predict an action accurately from a partial sequence (sentence). Experiments on two datasets, UCF101 and HMDB51, show that the method on average reaches 95% of its final accuracy within the first half of the video frames. Results also demonstrate that our method achieves performance comparable to the state of the art in action recognition (i.e., at the completion of the sentence), in addition to action prediction.
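The pipeline described in the abstract can be sketched end to end: quantize each frame's local descriptors into a BoW histogram (the "word"), stack the words into a "sentence" matrix, and score it with a temporal convolution followed by max-pooling over time. The sketch below is a minimal numpy illustration with random data and random (untrained) weights standing in for learned ones; all sizes (vocabulary, kernel width, filter count) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): visual-word vocabulary,
# frames per video, action classes, temporal kernel width, conv filters.
V, T, C, K, F = 64, 30, 5, 3, 16

def frame_to_word(descriptors, codebook):
    """Quantize one frame's local descriptors into a BoW histogram (its 'word')."""
    # Squared distance from every descriptor to every codeword, then
    # nearest-centroid assignment and a normalized count histogram.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    counts = np.bincount(d2.argmin(1), minlength=len(codebook))
    return counts / max(counts.sum(), 1)

# Toy video: T frames, each with 20 random 8-D local descriptors.
codebook = rng.normal(size=(V, 8))
sentence = np.stack([frame_to_word(rng.normal(size=(20, 8)), codebook)
                     for _ in range(T)])          # (T, V): the video "sentence"

# One temporal convolution (F filters of width K over the word axis),
# ReLU, global max-pool over time, then a linear classifier.
W_conv = rng.normal(size=(F, K, V)) * 0.1
feat = np.stack([np.maximum((sentence[t:t + K] * W_conv).sum((1, 2)), 0)
                 for t in range(T - K + 1)])      # (T-K+1, F)
pooled = feat.max(0)                              # (F,) max over time
W_cls = rng.normal(size=(F, C)) * 0.1
logits = pooled @ W_cls
probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # class probabilities, shape (C,)
```

Because the pooling is over whatever time steps are available, the same forward pass can score a truncated sentence (e.g. `sentence[:T // 2]`), which corresponds to the low-latency, partial-sequence prediction setting the abstract emphasizes.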
Publication Date
5-3-2018
Publication Title
Proceedings - 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018
Volume
2018-January
Number of Pages
352-361
Document Type
Article; Proceedings Paper
Personal Identifier
scopus
DOI Link
https://doi.org/10.1109/WACV.2018.00045
Copyright Status
Unknown
Scopus ID
85050910347 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/85050910347
STARS Citation
Cho, Sangwoo and Foroosh, Hassan, "A Temporal Sequence Learning For Action Recognition And Prediction" (2018). Scopus Export 2015-2019. 10047.
https://stars.library.ucf.edu/scopus2015/10047