Learning semantic features for action recognition via diffusion maps

    Authors

    J. E. Liu; Y. Yang; I. Saleemi; M. Shah

    Abbreviated Journal Title

    Comput. Vis. Image Underst.

    Keywords

    Action recognition; Bag of video words; Semantic visual vocabulary; Diffusion Maps; Pointwise Mutual Information; DIMENSIONALITY REDUCTION; MOTION; Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic

    Abstract

    Efficient modeling of actions is critical for recognizing human actions. Recently, the bag of video words (BoVW) representation, in which features computed around spatiotemporal interest points are quantized into video words based on their appearance similarity, has been widely and successfully explored. The performance of this representation, however, is highly sensitive to two main factors: the granularity, and therefore the size, of the vocabulary; and the space in which features and words are clustered, i.e., the distance measure between data points at different levels of the hierarchy. The goal of this paper is to propose a representation and learning framework that addresses both of these limitations. We present a principled approach to learning a semantic vocabulary from a large number of video words using a Diffusion Maps embedding. As opposed to the flat vocabularies used in traditional methods, we propose to exploit the hierarchical nature of feature vocabularies representative of human actions. Spatiotemporal features computed around interest points in videos form the lowest level of representation. Video words are then obtained by clustering those spatiotemporal features. Each video word is then represented by a vector of Pointwise Mutual Information (PMI) values between that video word and the training video clips, and is treated as a mid-level feature. At the highest level of the hierarchy, our goal is to further cluster the mid-level features while exploiting semantically meaningful distance measures between them. We conjecture that the mid-level features produced by similar video sources (action classes) must lie on a certain manifold. To capture the relationship between these features, and to retain it during clustering, we propose to use diffusion distance as a measure of similarity between them. The underlying idea is to embed the mid-level features into a lower-dimensional space, so as to construct a compact yet discriminative high-level vocabulary. Unlike some of the supervised vocabulary construction approaches and unsupervised methods such as pLSA and LDA, Diffusion Maps can capture the local relationships between the mid-level features on the manifold. We have tested our approach on diverse datasets and have obtained very promising results. (C) 2011 Elsevier Inc. All rights reserved.
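
    The two middle steps of the hierarchy described in the abstract (PMI vectors for video words, then diffusion distance between those vectors) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The toy count matrix, the Gaussian kernel, the bandwidth sigma, the diffusion time t, the clamping of PMI at zero, and all function names are assumptions made for illustration. The distance is computed directly from the t-step transition matrix rather than from an eigenvector embedding; the two agree when every eigenvector is retained.

```python
import math

def pmi_matrix(counts):
    """counts[w][v] = occurrences of video word w in clip v.
    Returns positive PMI (negative or undefined values are clamped to 0;
    the clamping is an assumption of this sketch, not taken from the paper)."""
    total = sum(sum(row) for row in counts)
    word_tot = [sum(row) for row in counts]
    clip_tot = [sum(col) for col in zip(*counts)]
    pmi = []
    for w, row in enumerate(counts):
        out = []
        for v, c in enumerate(row):
            if c == 0:
                out.append(0.0)
            else:
                value = math.log(c * total / (word_tot[w] * clip_tot[v]))
                out.append(max(0.0, value))
        pmi.append(out)
    return pmi

def markov_matrix(features, sigma=1.0):
    """Row-normalised Gaussian affinities between feature vectors.
    Also returns the stationary distribution (degree / total degree)."""
    n = len(features)
    kernel = [[math.exp(-sum((a - b) ** 2
                             for a, b in zip(features[i], features[j]))
                        / (2.0 * sigma ** 2))
               for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in kernel]
    P = [[kernel[i][j] / degree[i] for j in range(n)] for i in range(n)]
    stationary = [d / sum(degree) for d in degree]
    return P, stationary

def matmul(A, B):
    """Plain dense matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def diffusion_distance(P, stationary, i, j, t=2):
    """Distance between rows i and j of P^t, weighted by 1 / stationary."""
    Pt = P
    for _ in range(t - 1):
        Pt = matmul(Pt, P)
    return math.sqrt(sum((Pt[i][z] - Pt[j][z]) ** 2 / stationary[z]
                         for z in range(len(P))))

# Toy data: words 0 and 1 occur mostly in clips 0-1, words 2 and 3 in clips 2-3.
counts = [[8, 6, 0, 1],
          [7, 9, 1, 0],
          [0, 1, 9, 7],
          [1, 0, 6, 8]]
pmi = pmi_matrix(counts)
P, stationary = markov_matrix(pmi, sigma=1.0)
d_same = diffusion_distance(P, stationary, 0, 1)   # words from the same group
d_other = diffusion_distance(P, stationary, 0, 2)  # words from different groups
print(d_same, d_other)
```

    In this toy setting, words that co-occur with the same clips end up closer in diffusion distance than words that do not, which is the property the final clustering step relies on.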

    Journal Title

    Computer Vision and Image Understanding

    Volume

    116

    Issue/Number

    3

    Publication Date

    1-1-2012

    Document Type

    Article

    Language

    English

    First Page

    361

    Last Page

    377

    WOS Identifier

    WOS:000299800000006

    ISSN

    1077-3142
