Representing spatio-temporal information in videos has proven to be a harder task than action recognition in videos involving multiple actions. A single activity consists of many smaller actions, and these can provide a better understanding of the activity. This work represents the varying information in a scene-graph format, resulting in a directed temporal information graph that can be used to answer temporal questions and gain better insight into the video. It uses the Action Genome dataset, a variation of the Charades dataset, to capture pairwise relationships in a graph. The model performs significantly better than the dataset's benchmark results, providing state-of-the-art results in predicate classification. The work presents a novel spatio-temporal scene graph for videos, represented as a directed acyclic graph that maximizes the information in the scene. The results obtained in the counting task suggest some interesting findings, which are described in the work. The graph can be used for reasoning at a much lower computational cost, as explored in this work, and for other downstream tasks such as video captioning and action recognition, helping to bridge the gap between videos and textual analysis.
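As a rough illustration of the idea summarized above, the following minimal sketch stores per-frame pairwise relationships as nodes of a directed graph whose edges only point forward in time (so it is acyclic), and answers a simple counting question over it. This is not the thesis's implementation: the `Relation` class, the functions `build_temporal_graph` and `count_occurrences`, and the example frames and predicates are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One pairwise relationship observed in a single frame (hypothetical schema)."""
    frame: int
    subject: str
    predicate: str
    obj: str


def build_temporal_graph(relations):
    """Link every relation to the relations of the next observed frame.

    Edges always go from an earlier frame to a later one, so the graph
    is directed and acyclic by construction.
    """
    by_frame = {}
    for rel in relations:
        by_frame.setdefault(rel.frame, []).append(rel)
    frames = sorted(by_frame)
    edges = {rel: [] for rel in relations}
    for earlier, later in zip(frames, frames[1:]):
        for src in by_frame[earlier]:
            edges[src].extend(by_frame[later])
    return frames, by_frame, edges


def count_occurrences(frames, by_frame, subject, predicate, obj):
    """Count separate runs of consecutive frames in which the relation holds."""
    count = 0
    active = False
    for frame in frames:
        present = any(
            (r.subject, r.predicate, r.obj) == (subject, predicate, obj)
            for r in by_frame[frame]
        )
        if present and not active:
            count += 1  # the relation has just (re)started
        active = present
    return count


# Illustrative data only: a person holds a cup, looks at it, then holds it again.
relations = [
    Relation(0, "person", "holding", "cup"),
    Relation(1, "person", "holding", "cup"),
    Relation(2, "person", "looking_at", "cup"),
    Relation(3, "person", "holding", "cup"),
]
frames, by_frame, edges = build_temporal_graph(relations)
holds = count_occurrences(frames, by_frame, "person", "holding", "cup")
```

Under these assumptions, a counting question such as "how many times did the person hold the cup?" reduces to a single pass over the ordered frames, which hints at why reasoning over a graph can be cheaper than reprocessing the video.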
Da Vitoria Lobo, Niels
Master of Science (M.S.)
College of Engineering and Computer Science
Masters Thesis (Open Access)
Tumkur Narasimhamurthy, Kesar, "Spatio-Temporal Representation for Reasoning with Action Genome" (2021). Electronic Theses and Dissertations, 2020-. 778.