Action Recognition With Temporal Scale-Invariant Deep Learning Framework

Keywords

action recognition; CNN; LSTM

Abstract

Recognizing actions from video features is an important problem in a wide range of applications. In this paper, we propose a temporal scale-invariant deep learning framework for action recognition that is robust to changes in action speed. Specifically, a video is first split into several sub-action clips, and a keyframe is selected from each sub-action clip. The spatial and motion features of the keyframe are extracted separately by two Convolutional Neural Networks (CNN) and combined in a convolutional fusion layer that learns the relationship between the features. Then, Long Short-Term Memory (LSTM) networks are applied to the fused features to capture long-term temporal cues. Finally, the action prediction scores of the LSTM networks are combined by linear weighted summation. Extensive experiments are conducted on two popular and challenging benchmarks, UCF-101 and HMDB51 Human Actions. On both benchmarks, our framework outperforms state-of-the-art methods, achieving 93.7% accuracy on UCF-101 and 69.5% on HMDB51.
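The final fusion step described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names and the uniform default weights are assumptions; the paper only states that per-clip LSTM prediction scores are combined by linear weighted summation.

```python
# Sketch of the score-fusion step: per-clip class-score vectors from the
# LSTM are combined into a single video-level prediction by linear
# weighted summation. Uniform weights are an assumed default.

def fuse_scores(clip_scores, weights=None):
    """Linearly combine per-clip class-score vectors into one video-level vector."""
    n_clips = len(clip_scores)
    n_classes = len(clip_scores[0])
    if weights is None:
        # Assumption for illustration: equal weight per sub-action clip.
        weights = [1.0 / n_clips] * n_clips
    fused = [0.0] * n_classes
    for w, scores in zip(weights, clip_scores):
        for c, s in enumerate(scores):
            fused[c] += w * s
    return fused

def predict(clip_scores, weights=None):
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(clip_scores, weights)
    return max(range(len(fused)), key=fused.__getitem__)
```

For example, three clips voting `[0.7, 0.2, 0.1]`, `[0.2, 0.6, 0.2]`, and `[0.6, 0.3, 0.1]` over three classes fuse (with uniform weights) to a vector whose largest entry is class 0.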

Publication Date

2-1-2017

Publication Title

China Communications

Volume

14

Issue

2

Number of Pages

163-172

Document Type

Article

Personal Identifier

scopus

DOI Link

https://doi.org/10.1109/CC.2017.7868164

Scopus ID

85015159779 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/85015159779

