Robust 3D Action Recognition Through Sampling Local Appearances And Global Distributions

Keywords

3-D action recognition; depth data; human-computer interaction (HCI); spatial-temporal interest point (STIP)

Abstract

Three-dimensional (3-D) action recognition has broad applications in human-computer interaction and intelligent surveillance. However, recognizing similar actions remains challenging because previous methods fail to capture motion and shape cues effectively from noisy depth data. In this paper, we propose a novel two-layer Bag-of-Visual-Words (BoVW) model, which suppresses noise and jointly encodes both motion and shape cues. First, background clutter is removed by a background modeling method designed for depth data. Then, motion and shape cues are jointly used to generate two kinds of robust and distinctive spatial-temporal interest points (STIPs): motion-based STIPs and shape-based STIPs. In the first layer of our model, a multiscale 3-D local steering kernel descriptor is proposed to describe the local appearances of cuboids around motion-based STIPs. In the second layer, a spatial-temporal vector descriptor is proposed to describe the spatial-temporal distributions of shape-based STIPs. Using the BoVW model, motion and shape cues are combined to form a fused action representation. Our model performs favorably compared with common STIP detection and description methods. Thorough experiments verify that our model is effective in distinguishing similar actions and robust to background clutter, partial occlusions, and pepper noise.
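The fusion step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes hard nearest-word assignment, L1-normalized histograms, and simple concatenation as the fusion rule, none of which the abstract specifies. The function names `bovw_histogram` and `fused_representation` are hypothetical, and the descriptor arrays stand in for the 3-D local steering kernel and spatial-temporal vector descriptors, which are not reproduced here.

```python
import numpy as np


def bovw_histogram(descriptors, codebook):
    """Return an L1-normalized histogram of nearest visual words.

    descriptors: array of shape (n, d), one row per interest point.
    codebook:    array of shape (k, d), one row per visual word.
    Hard assignment to the nearest word is an assumption of this sketch.
    """
    # Squared Euclidean distance from every descriptor to every word: (n, k).
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Index of the closest visual word for each descriptor.
    words = np.argmin(dists, axis=1)
    # Count how often each word occurs, including words never selected.
    hist = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist


def fused_representation(motion_desc, motion_codebook, shape_desc, shape_codebook):
    """Concatenate the motion-layer and shape-layer histograms.

    Concatenation is an assumed fusion rule, chosen only for illustration.
    """
    motion_hist = bovw_histogram(motion_desc, motion_codebook)
    shape_hist = bovw_histogram(shape_desc, shape_codebook)
    return np.concatenate([motion_hist, shape_hist])
```

In this sketch each layer keeps its own codebook, so the fused vector has one bin per motion word followed by one bin per shape word and can be passed to any standard classifier.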

Publication Date

8-1-2018

Publication Title

IEEE Transactions on Multimedia

Volume

20

Issue

8

Page Range

1932-1947

Document Type

Article

Personal Identifier

scopus

DOI Link

https://doi.org/10.1109/TMM.2017.2786868

Scopus ID

85039776367 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/85039776367

