Scopus Export 2015-2019

Video Fill In The Blank Using LR/RL LSTMs With Spatial-Temporal Attentions

Abstract

Given a video and a description sentence with one missing word (the 'source sentence'), the Video-Fill-In-the-Blank (VFIB) problem is to find the missing word automatically. Both the contextual information of the sentence and the visual cues from the video are important for inferring the missing word accurately. The blank splits the source sentence into two fragments: the left fragment (before the blank) and the right fragment (after the blank). Traditional Recurrent Neural Networks cannot encode this structure accurately because the missing word can vary widely in both its location and its type. For example, the missing word can be the first word or fall in the middle of the sentence, and it can be a verb or an adjective. In this paper, we propose a framework to tackle the textual encoding: two separate LSTMs (the LR and RL LSTMs) encode the left and right sentence fragments, and a novel structure combines each fragment with an external memory corresponding to the opposite fragment. For the visual encoding, end-to-end spatial and temporal attention models select discriminative visual representations for finding the missing word. In our experiments, we demonstrate the superior performance of the proposed method on the challenging VFIB problem. Furthermore, we introduce an extended and more generalized version of VFIB that is not limited to a single blank. Our experiments indicate that our method generalizes to such more realistic scenarios.
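
As a rough illustration of the two ideas in the abstract, the sketch below splits a source sentence at its blank and applies a generic soft-attention pooling step. It is not the authors' implementation. The blank marker `_____`, the function names, and the choice to reverse the right fragment (so that a right-to-left reader would end next to the blank) are assumptions made for this example, and the attention function is the standard softmax-weighted average rather than the paper's specific spatial or temporal model.

```python
import math

BLANK = "_____"  # hypothetical blank marker; the abstract does not specify one


def split_at_blank(sentence, blank=BLANK):
    """Split a source sentence into fragments around its single blank.

    Returns (left, right_reversed): the left fragment in reading order, and
    the right fragment reversed (an assumption about how an RL encoder would
    consume it), so both sequences end at the word adjacent to the blank.
    """
    tokens = sentence.split()
    if tokens.count(blank) != 1:
        raise ValueError("expected exactly one blank")
    i = tokens.index(blank)
    left = tokens[:i]
    right_reversed = tokens[i + 1:][::-1]
    return left, right_reversed


def soft_attention_pool(features, scores):
    """Generic soft attention: softmax the scores, then weight-average features."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, pooled
```

Either fragment may be empty, which covers the case mentioned in the abstract where the missing word is the first word of the sentence. In a full model, the scores passed to `soft_attention_pool` would be learned from the sentence encoding and the per-frame or per-region features; here they are plain inputs.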

Publication Date

12-22-2017

Publication Title

Proceedings of the IEEE International Conference on Computer Vision

Volume

2017-October

Pages

1416-1425

Document Type

Article; Proceedings Paper

Personal Identifier

scopus

DOI Link

https://doi.org/10.1109/ICCV.2017.157

Copyright Status

Unknown

Scopus ID

85041905925 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/85041905925

STARS Citation

Mazaheri, Amir; Zhang, Dong; and Shah, Mubarak, "Video Fill In The Blank Using LR/RL LSTMs With Spatial-Temporal Attentions" (2017). Scopus Export 2015-2019. 6981.
https://stars.library.ucf.edu/scopus2015/6981
