Abstract

Given visual input and a natural language question about it, the visual question answering (VQA) task is to answer the question correctly. For a VQA system to be reliable and trustworthy, it must link the text (question and answer) to specific visual regions. This dissertation first explores the VQA task in a multi-modal setting where questions are based on video as well as subtitles. An algorithm is introduced that processes each modality and fuses their features to solve the task. Additionally, to understand the model's emphasis on visual data, this study collects a diagnostic set of questions that, in a human annotator's judgment, strictly require knowledge of the visual input. The next phase of this research deals with grounding in VQA systems without any detectors or object annotations. To this end, weak supervision is employed for grounding by training on the VQA task alone. In the initial part of this study, a rubric is provided to measure grounding performance. This reveals that high accuracy is no guarantee of good grounding, i.e., a system can produce the correct answer without attending to the visual evidence. Techniques are introduced to improve VQA grounding by combining attention and capsule networks. This approach benefits the grounding ability of both CNNs and transformers. Lastly, the dissertation focuses on question answering in videos. By depicting activities and objects, as well as their relationships, as a graph, a video can be represented compactly while capturing the information needed to produce an answer. An algorithm is devised that learns to construct such graphs and uses question-to-graph attention; this solution obtains significant improvement on complex reasoning-based questions on the STAR and AGQA benchmarks. Hence, by obtaining higher accuracy and better grounding, this dissertation bridges the gap between task accuracy and explainability of reasoning in VQA systems.

Notes

If this is your thesis or dissertation and you want to learn how to access it, or want more information about readership statistics, contact us at STARS@ucf.edu

Graduation Date

2022

Semester

Fall

Advisor

Shah, Mubarak

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Degree Program

Computer Science

Format

application/pdf

Identifier

CFE0009419; DP0027142

URL

https://purls.library.ucf.edu/go/DP0027142

Language

English

Release Date

December 2022

Length of Campus-only Access

None

Access Status

Doctoral Dissertation (Open Access)

STARS Citation

Urooj, Aisha, "Visual Question Answering: Exploring Trade-offs Between Task Accuracy and Explainability" (2022). Electronic Theses and Dissertations, 2020-2023. 1448.
https://stars.library.ucf.edu/etd2020/1448

Included in

Computer Sciences Commons
