With the ever-growing number of videos available online, it is more important than ever to learn how to process and understand video data. Although convolutional neural networks have revolutionized representation learning from images and videos, they do not explicitly model entities within the given input. It would be useful for learned models to represent part-to-whole relationships within a given image or video. To this end, a novel neural network architecture - capsule networks - has been proposed. Capsule networks add extra structure to allow for the modeling of entities and have shown great promise when applied to image data. By grouping neural activations and propagating information from one layer to the next through a routing-by-agreement procedure, capsule networks are able to learn part-to-whole relationships as well as robust object representations.

In this dissertation, we explore how capsule networks can be generalized to video and used to effectively solve several video understanding problems. First, we generalize capsule networks from the image domain so that they can process 3-dimensional video data. Our proposed video capsule network (VideoCapsuleNet) tackles the problem of video action detection. We introduce capsule-pooling in the convolutional capsule layer to make the voting algorithm tractable in the 3-dimensional video domain. The network's routing-by-agreement inherently models the action representations, and various action characteristics are captured by the predicted capsules. We show that VideoCapsuleNet is able to successfully produce pixel-wise localizations of actions present in videos.

While action detection only requires a coarse localization, we show that video capsule networks can also generate fine-grained segmentations. To that end, we propose a capsule-based approach for video object segmentation, CapsuleVOS, which can segment several frames at once conditioned on a reference frame and segmentation mask.
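The routing-by-agreement procedure referred to above can be illustrated with a minimal NumPy sketch of dynamic routing: each input capsule casts a vote for each output capsule, and coupling coefficients are iteratively sharpened toward the outputs that agree with those votes. This is a generic illustration of the technique, not the dissertation's implementation; the function names, shapes, and fixed iteration count are assumptions made for clarity.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Non-linearity that shrinks short vectors toward zero and long
    # vectors toward unit length, so vector norm can encode existence.
    norm2 = np.sum(s * s, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def routing_by_agreement(votes, n_iters=3):
    """votes: (n_in, n_out, d) array of predictions ("votes") cast by each
    input capsule for each output capsule. Returns (n_out, d) output poses."""
    n_in, n_out, _ = votes.shape
    logits = np.zeros((n_in, n_out))               # routing logits b_ij
    for _ in range(n_iters):
        c = np.exp(logits)
        c = c / c.sum(axis=1, keepdims=True)       # coupling coefficients (softmax over outputs)
        s = np.einsum('io,iod->od', c, votes)      # weighted sum of agreeing votes
        v = squash(s)                              # candidate output capsules
        logits = logits + np.einsum('iod,od->io', votes, v)  # reward agreement
    return v
```

Capsule-pooling, as described in the abstract, reduces the number of votes entering this loop by pooling votes within a receptive field, which is what makes the procedure tractable for 3-dimensional video inputs.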
This conditioning is performed through a novel routing algorithm for attention-based, efficient capsule selection. We address two challenging issues in video object segmentation: the segmentation of small objects and the occlusion of objects across time. The first issue is addressed with a zooming module; the second with a novel memory module based on recurrent neural networks.

The above works show that capsule networks can effectively localize actors and objects within videos. Next, we address the integration of video and text for the task of actor and action video segmentation from a sentence. We propose a novel capsule-based approach that performs pixel-level localization based on a natural language query describing the actor of interest. We encode both the video and the textual input in the form of capsules, and propose a visual-textual routing mechanism that fuses these capsules to successfully localize the actor and action within all frames of a video.

The previous works are all fully supervised: they are trained on manually annotated data, which is often time-consuming and costly to acquire. Finally, we propose a novel method for self-supervised learning that does not rely on manually annotated data. We present a capsule network that jointly learns high-level concepts and their relationships across different low-level multimodal (video, audio, and text) input representations. To adapt the capsules to large-scale input data, we propose a routing-by-self-attention mechanism that selects relevant capsules, which are then used to generate a final joint multimodal feature representation. This allows us to learn robust representations from noisy video data and to scale up the size of the capsule network, compared to traditional routing methods, while remaining computationally efficient.
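The routing-by-self-attention idea can be sketched as follows: capsules from all modalities are scored against one another with a single-head self-attention pass, the resulting agreement scores give each capsule a relevance weight, and the weighted capsules are pooled into one joint representation. This is a hypothetical sketch under assumed shapes; the projection matrices would be learned in practice but are randomly initialized here, and all function and variable names are illustrative rather than taken from the dissertation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_routing(capsules, d_k=None):
    """capsules: (n, d) pose vectors pooled from all modalities
    (e.g. video, audio, and text). Returns a joint feature vector and
    a per-capsule relevance score."""
    n, d = capsules.shape
    d_k = d_k or d
    rng = np.random.default_rng(0)
    # Hypothetical learned query/key/value projections (random here).
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
    Q, K, V = capsules @ Wq, capsules @ Wk, capsules @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n) pairwise agreement
    relevance = attn.mean(axis=0)            # how strongly each capsule is attended to
    joint = relevance @ V                    # relevance-weighted pooling -> (d_k,)
    return joint, relevance
```

Because the attention pass scores all capsules in a single matrix product rather than iterating a routing loop, a selection mechanism of this kind scales to many more capsules than iterative agreement routing, which is consistent with the efficiency claim in the abstract.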
Doctor of Philosophy (Ph.D.)
College of Engineering and Computer Science
Doctoral Dissertation (Open Access)
Duarte, Kevin, "Capsule Networks for Video Understanding" (2021). Electronic Theses and Dissertations, 2020-. 856.
Restricted to the UCF community until December 2021; it will then be open access.