Title

What's Making That Sound?

Keywords

Audiovisual processing; Comparative reasoning; Multimodal analysis; Winner-take-all hash

Abstract

In this paper, we investigate techniques for localizing the sound source in a video recorded with a single microphone. The visual object whose motion generates the sound is located and segmented by analyzing the synchronization between object motion and audio energy. We first apply an effective region-tracking algorithm to segment the video into a number of spatio-temporal region tracks, each representing the temporal evolution of an appearance-coherent image structure (i.e., an object). We then extract each object's motion feature as its average acceleration in each frame. Meanwhile, the short-time Fourier transform is applied to the audio signal to extract an audio energy feature as the audio descriptor. We further impose a nonlinear transformation on both the audio and visual descriptors to obtain audio and visual codes in a common rank-correlation space. Finally, the correlation between an object and the audio signal is evaluated simply by computing the Hamming distance between the audio and visual codes generated in the previous steps. We evaluate the proposed method both qualitatively and quantitatively on a number of challenging test videos. In particular, the proposed method is compared with a state-of-the-art audiovisual source-localization algorithm. The results demonstrate the superior performance of the proposed algorithm in spatio-temporal localization and segmentation of audio sources in the visual domain.
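The keywords list a winner-take-all (WTA) hash, which is one standard way to realize the rank-correlation coding and Hamming-distance matching described in the abstract. The sketch below is an illustrative approximation, not the paper's implementation: the descriptor dimensions, permutation count, window size `k`, and the synthetic audio/motion descriptors are all assumptions made for the example.

```python
import numpy as np

def wta_hash(x, permutations, k):
    """Winner-take-all hash of descriptor x: for each random permutation,
    record the index of the maximum among the first k permuted elements.
    The resulting code depends only on rank order, not on magnitudes."""
    return np.array([int(np.argmax(x[p[:k]])) for p in permutations])

def hamming(a, b):
    """Hamming distance between two WTA codes (count of differing symbols);
    a smaller distance indicates stronger rank correlation."""
    return int(np.sum(a != b))

rng = np.random.default_rng(0)
dim, n_perms, k = 32, 64, 4            # assumed sizes for illustration
perms = [rng.permutation(dim) for _ in range(n_perms)]

# Synthetic stand-ins: an audio-energy descriptor, a motion descriptor
# synchronized with it, and an unrelated object's motion descriptor.
audio = rng.normal(size=dim)
visual_sync = audio + 0.1 * rng.normal(size=dim)
visual_other = rng.normal(size=dim)

code_a = wta_hash(audio, perms, k)
d_sync = hamming(code_a, wta_hash(visual_sync, perms, k))
d_other = hamming(code_a, wta_hash(visual_other, perms, k))
print(d_sync < d_other)  # the synchronized object should be much closer
```

Because the WTA code encodes only relative orderings, this comparison is robust to monotonic rescalings of either descriptor, which is the usual motivation for matching audio and visual streams in a common rank space.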

Publication Date

11-3-2014

Publication Title

MM 2014 - Proceedings of the 2014 ACM Conference on Multimedia

Number of Pages

147-156

Document Type

Article; Proceedings Paper

Personal Identifier

scopus

DOI Link

https://doi.org/10.1145/2647868.2654936

Scopus ID

84913590863 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/84913590863
