Keywords
Video Question Answering, NarrativeML, Causal Reasoning, Counterfactual Reasoning, Vision-Language Models, Narrative Representation
Abstract
This thesis examines how structured narrative representation and reasoning strategies jointly affect performance on Video Question Answering (VideoQA). Although recent advances in large multimodal and language models have improved video description and retrieval, causal and counterfactual reasoning remain challenging. This thesis studies an inference-only pipeline in which videos are converted into natural-language narratives, and those narratives may optionally be transformed into a structured NarrativeML format, before multiple-choice questions are answered without task-specific fine-tuning. NarrativeML is a machine-readable narrative representation that explicitly encodes entities, events, and their causal and temporal relations. Three representational contexts are evaluated: Narrative, NarrativeML, and Both (i.e., Narrative and NarrativeML together). Two reasoning strategies are evaluated alongside them: Chain-of-Thought (CoT) prompting for stepwise textual inference and Narrative Template-based Prompting (NTP) for structured causal reasoning. Experiments are conducted on the Causal-VidQA benchmark, whose official test split contains 5,429 samples across four question types: descriptive, explanatory, predictive, and counterfactual. The aggregate large-scale evaluation reported in this thesis uses 5,379 samples after excluding a 50-sample development subset drawn from that split. CoT is applied to Narrative and Both for all question types, and to NarrativeML for descriptive and explanatory questions, whereas NTP is used only with NarrativeML for predictive and counterfactual questions. With an overall accuracy of 66.73%, Narrative + CoT is the strongest fully automated setting in this thesis and remains competitive with the fine-tuned systems compared here. On the same 5,379-sample evaluation set, NarrativeML generally underperforms Narrative.
A similar broad pattern is observed on a small sample of the NExT-QA benchmark. Overall, the results suggest that representation choice matters, but its effect depends on question type, extraction quality, and the reasoning strategy used at inference time.
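The pairing of representational context, question type, and reasoning strategy described in the abstract can be sketched as a small lookup rule. This is only an illustration under stated assumptions: the function name `select_strategy` and the constant names are hypothetical and do not come from the thesis, which does not publish its implementation here. The sample counts are the ones stated in the abstract.

```python
# Illustrative sketch only: all identifiers below are hypothetical,
# chosen to mirror the wording of the abstract.

CONTEXTS = ("Narrative", "NarrativeML", "Both")
QUESTION_TYPES = ("descriptive", "explanatory", "predictive", "counterfactual")

# Question types for which the abstract says NTP replaces CoT,
# and only when the context is NarrativeML alone.
NTP_QUESTION_TYPES = ("predictive", "counterfactual")


def select_strategy(context: str, question_type: str) -> str:
    """Return "CoT" or "NTP" for a (context, question type) pair."""
    if context not in CONTEXTS:
        raise ValueError(f"unknown context: {context!r}")
    if question_type not in QUESTION_TYPES:
        raise ValueError(f"unknown question type: {question_type!r}")
    if context == "NarrativeML" and question_type in NTP_QUESTION_TYPES:
        return "NTP"
    # Narrative and Both use CoT for every question type; NarrativeML
    # uses CoT for descriptive and explanatory questions.
    return "CoT"


# Sample counts stated in the abstract.
OFFICIAL_TEST_SAMPLES = 5429
DEV_SUBSET_SAMPLES = 50
EVALUATION_SAMPLES = OFFICIAL_TEST_SAMPLES - DEV_SUBSET_SAMPLES
```

Under this rule, only two of the twelve (context, question type) combinations use NTP, so any comparison of NarrativeML against Narrative on predictive and counterfactual questions also changes the reasoning strategy, not just the representation.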
Completion Date
2026
Semester
Spring
Committee Chair
Karmaker Santu
Degree
Master of Science (M.S.)
College
College of Engineering and Computer Science
Department
Computer Science
Document Type
Dissertation/Thesis
Identifier
DP0053122
STARS Citation
Truong, Hoang Bao, "Using Inferences from Natural Language Narratives to Improve Video Question Answering" (2026). Graduate Studies Theses and Dissertations 2026. 199.
https://stars.library.ucf.edu/gradstudies_etd_2026/199