With the ever-growing number of videos available online, it is more important than ever to learn how to process and understand video data. Although convolutional neural networks have revolutionized representation learning from images and videos, they do not explicitly model entities within the given input. It would be useful for learned models to represent part-to-whole relationships within a given image or video. To this end, a novel neural network architecture - capsule networks - has been proposed. Capsule networks add extra structure to allow for the modeling of entities and have shown great promise when applied to image data. By grouping neural activations and propagating information from one layer to the next through a routing-by-agreement procedure, capsule networks are able to learn part-to-whole relationships as well as robust object representations.

In this dissertation, we explore how capsule networks can be generalized to video and used to effectively solve several video understanding problems. First, we generalize capsule networks from the image domain so that they can process 3-dimensional video data. Our proposed video capsule network (VideoCapsuleNet) tackles the problem of video action detection. We introduce capsule-pooling in the convolutional capsule layer to make the voting algorithm tractable in the 3-dimensional video domain. The network's routing-by-agreement inherently models the action representations, and various action characteristics are captured by the predicted capsules. We show that VideoCapsuleNet is able to successfully produce pixel-wise localizations of actions present in videos.

While action detection only requires a coarse localization, we show that video capsule networks can also generate fine-grained segmentations. To that end, we propose a capsule-based approach for video object segmentation, CapsuleVOS, which can segment several frames at once conditioned on a reference frame and segmentation mask.
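The routing-by-agreement procedure referred to above can be illustrated with a minimal NumPy sketch of dynamic routing: each input capsule casts a vote for each output capsule, and coupling coefficients are iteratively sharpened toward the outputs that agree with those votes. This is a generic illustration of the technique, not the dissertation's implementation; the function names, shapes, and fixed iteration count are assumptions made for clarity.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Non-linearity that shrinks short vectors toward zero and long
    # vectors toward unit length, so vector norm can encode existence.
    norm2 = np.sum(s * s, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def routing_by_agreement(votes, n_iters=3):
    """votes: (n_in, n_out, d) array of predictions ("votes") cast by each
    input capsule for each output capsule. Returns (n_out, d) output poses."""
    n_in, n_out, _ = votes.shape
    logits = np.zeros((n_in, n_out))               # routing logits b_ij
    for _ in range(n_iters):
        c = np.exp(logits)
        c = c / c.sum(axis=1, keepdims=True)       # coupling coefficients (softmax over outputs)
        s = np.einsum('io,iod->od', c, votes)      # weighted sum of agreeing votes
        v = squash(s)                              # candidate output capsules
        logits = logits + np.einsum('iod,od->io', votes, v)  # reward agreement
    return v
```

Capsule-pooling, as described in the abstract, reduces the number of votes entering this loop by pooling votes within a receptive field, which is what makes the procedure tractable for 3-dimensional video inputs.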
This conditioning is performed through a novel routing algorithm for attention-based, efficient capsule selection. We address two challenging issues in video object segmentation: the segmentation of small objects and the occlusion of objects across time. The first issue is addressed with a zooming module; the second with a novel memory module based on recurrent neural networks.

The above works show that capsule networks can effectively localize actors and objects within videos. Next, we address the integration of video and text for the task of actor and action video segmentation from a sentence. We propose a novel capsule-based approach that performs pixel-level localization based on a natural language query describing the actor of interest. We encode both the video and the textual input in the form of capsules, and propose a visual-textual routing mechanism that fuses these capsules to successfully localize the actor and action within all frames of a video.

The previous works are all fully supervised: they are trained on manually annotated data, which is often time-consuming and costly to acquire. Finally, we propose a novel method for self-supervised learning that does not rely on manually annotated data. We present a capsule network that jointly learns high-level concepts and their relationships across different low-level multimodal (video, audio, and text) input representations. To adapt the capsules to large-scale input data, we propose a routing-by-self-attention mechanism that selects relevant capsules, which are then used to generate a final joint multimodal feature representation. This allows us to learn robust representations from noisy video data and to scale up the size of the capsule network, compared to traditional routing methods, while remaining computationally efficient.
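The routing-by-self-attention idea can be sketched as follows: capsules from all modalities are scored against one another with a single-head self-attention pass, the resulting agreement scores give each capsule a relevance weight, and the weighted capsules are pooled into one joint representation. This is a hypothetical sketch under assumed shapes; the projection matrices would be learned in practice but are randomly initialized here, and all function and variable names are illustrative rather than taken from the dissertation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_routing(capsules, d_k=None):
    """capsules: (n, d) pose vectors pooled from all modalities
    (e.g. video, audio, and text). Returns a joint feature vector and
    a per-capsule relevance score."""
    n, d = capsules.shape
    d_k = d_k or d
    rng = np.random.default_rng(0)
    # Hypothetical learned query/key/value projections (random here).
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
    Q, K, V = capsules @ Wq, capsules @ Wk, capsules @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n) pairwise agreement
    relevance = attn.mean(axis=0)            # how strongly each capsule is attended to
    joint = relevance @ V                    # relevance-weighted pooling -> (d_k,)
    return joint, relevance
```

Because the attention pass scores all capsules in a single matrix product rather than iterating a routing loop, a selection mechanism of this kind scales to many more capsules than iterative agreement routing, which is consistent with the efficiency claim in the abstract.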
Doctor of Philosophy (Ph.D.)
College of Engineering and Computer Science
Doctoral Dissertation (Open Access)
Duarte, Kevin, "Capsule Networks for Video Understanding" (2021). Electronic Theses and Dissertations, 2020-. 856.
Restricted to the UCF community until December 2021; it will then be open access.