Vision models have grown in popularity and performance on many tasks since the emergence of large-scale datasets, improved access to computational resources, and new model architectures such as the transformer. However, it is still not well understood whether these models can be deployed in the real world. Because these models are "black-box" architectures, we do not fully understand what they are truly learning. An understanding of what models learn "under the hood" would enable targeted improvements for real-world scenarios. Motivated by this, we benchmark these impressive visual models on newly proposed datasets and tasks that probe their robustness and general understanding, using semantics as both a diagnostic tool and an avenue for improvement. We first propose a new task of graphical representation for video, using language as a semantic signal to enable fast and interpretable video understanding through cross-attention between language and video. We then explore the robustness of video action-recognition models: given real-world shifts from the original video distribution on which deep learning models are trained, where do models fail, and how can these failures be addressed? Next, we explore the robustness of video-language models for text-to-video retrieval: given real-world shifts in either the video or the text distribution on which models were trained, how do models fail, and where can improvements be made? Findings in this work indicate that visual-language models may struggle with human-level understanding. We therefore benchmark visual-language models on conceptual understanding of object relations, attribute-object relations, and context-object relations by proposing new datasets. Across all works in this dissertation, we empirically identify both weaknesses and strengths of large vision models and potential areas of improvement.
Through this research, we aim to contribute to the advancement of computer vision model understanding, paving the way for more robust and generalizable models that can effectively handle real-world scenarios.
Rawat, Yogesh Singh
Doctor of Philosophy (Ph.D.)
College of Engineering and Computer Science
Doctoral Dissertation (Open Access)
Chantry, Madeline, "A Study on Robustness and Semantic Understanding of Visual Models" (2023). Electronic Theses and Dissertations, 2020-. 1850.