Vision models have grown in popularity and performance on many tasks since the emergence of large-scale datasets, improved access to computational resources, and new model architectures such as the transformer. However, it is still not well understood whether these models can be deployed in the real world. Because these models are "black-box" architectures, we do not fully understand what they are truly learning. An understanding of what models learn "under the hood" would enable targeted improvements for real-world scenarios. Motivated by this, we benchmark these impressive visual models on newly proposed datasets and tasks that probe their robustness and general understanding, using semantics as both a diagnostic tool and an avenue for improvement. We first propose a new task of graphical representation for video, using language as a semantic signal to enable fast and interpretable video understanding through cross-attention between language and video. We then explore the robustness of video action-recognition models: given real-world shifts from the original video distribution on which deep learning models are trained, where do models fail, and how can these failures be addressed? Next, we explore the robustness of video-language models for text-to-video retrieval: given real-world shifts in either the video or the text distribution on which models were trained, how do models fail, and where can improvements be made? Findings in this work indicate that visual-language models may struggle with human-level understanding. We therefore benchmark visual-language models on conceptual understanding of object relations, attribute-object relations, and context-object relations by proposing new datasets. Across all works in this dissertation, we empirically identify both weaknesses and strengths of large vision models and potential areas of improvement.
Through this research, we aim to contribute to the advancement of computer vision model understanding, paving the way for more robust and generalizable models that can effectively handle real-world scenarios.
Rawat, Yogesh Singh
Doctor of Philosophy (Ph.D.)
College of Engineering and Computer Science
Doctoral Dissertation (Open Access)
Chantry, Madeline, "A Study on Robustness and Semantic Understanding of Visual Models" (2023). Electronic Theses and Dissertations, 2020-. 1850.