Abstract
Vision models have grown in popularity and improved in performance on many tasks with the emergence of large-scale datasets, improved access to computational resources, and new model architectures such as the transformer. However, it is still not well understood whether these models can be deployed in the real world. Because these models are "black-box" architectures, we do not fully understand what they are truly learning. An understanding of what models learn "under the hood" would enable improvements for real-world scenarios. Motivated by this, we benchmark these visual models on their robustness and general understanding using newly proposed datasets and tasks, with semantics serving as both a probe and an area of improvement. We first propose a new task of graphical representation for video, using language as a semantic signal to enable quick and interpretable video understanding through cross-attention between language and video. We then explore the robustness of video action-recognition models: given real-world shifts away from the video distribution on which deep learning models are trained, where do models fail, and how can we address these failures? Next, we explore the robustness of video-language models for text-to-video retrieval: given real-world shifts in either the video or the text distribution on which models were trained, how do models fail, and where can improvements be made? Findings in this work indicate that visual-language models may struggle with human-level understanding. We therefore benchmark visual-language models on conceptual understanding of object relations, attribute-object relations, and context-object relations by proposing new datasets. Across all works in this dissertation, we empirically identify both weaknesses and strengths of large vision models, along with potential areas of improvement.
Through this research, we aim to contribute to the advancement of computer vision model understanding, paving the way for more robust and generalizable models that can effectively handle real-world scenarios.
Notes
If this is your thesis or dissertation and you want to learn how to access it, or want more information about readership statistics, contact us at STARS@ucf.edu.
Graduation Date
2023
Semester
Summer
Advisor
Rawat, Yogesh Singh
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Computer Science
Degree Program
Computer Science
Identifier
CFE0009704; DP0027811
URL
https://purls.library.ucf.edu/go/DP0027811
Language
English
Release Date
August 2023
Length of Campus-only Access
None
Access Status
Doctoral Dissertation (Open Access)
STARS Citation
Chantry, Madeline, "A Study on Robustness and Semantic Understanding of Visual Models" (2023). Electronic Theses and Dissertations, 2020-2023. 1850.
https://stars.library.ucf.edu/etd2020/1850