Keywords

Video Question Answering, NarrativeML, Causal Reasoning, Counterfactual Reasoning, Vision-Language Models, Narrative Representation

Abstract

This thesis examines how structured narrative representation and reasoning strategies jointly affect performance on Video Question Answering (VideoQA). Although recent advances in large multimodal and language models have improved video description and retrieval, causal and counterfactual reasoning remain challenging. The thesis studies an inference-only pipeline in which videos are converted into natural-language narratives, which may optionally be transformed into a structured NarrativeML format, before multiple-choice questions are answered without task-specific fine-tuning. NarrativeML is a machine-readable narrative representation that explicitly encodes entities, events, and their causal and temporal relations. Three representational contexts are evaluated: Narrative, NarrativeML, and Both (i.e., Narrative and NarrativeML together). They are paired with two reasoning strategies: Chain-of-Thought (CoT) prompting for stepwise textual inference and Narrative Template-based Prompting (NTP) for structured causal reasoning. Experiments are conducted on the Causal-VidQA benchmark, whose official test split contains 5,429 samples across four question types: descriptive, explanatory, predictive, and counterfactual. The aggregate large-scale evaluation reported in this thesis uses 5,379 of these samples, after excluding a 50-sample development subset drawn from that split. CoT is applied to Narrative and Both for all question types and to NarrativeML for descriptive and explanatory questions, whereas NTP is used only with NarrativeML for predictive and counterfactual questions. With an overall accuracy of 66.73%, Narrative + CoT is the strongest fully automated setting in this thesis and remains competitive with the fine-tuned systems compared here. On the same 5,379-sample evaluation set, NarrativeML generally underperforms Narrative. A similar broad pattern is observed on a small sample of the NExT-QA benchmark. Overall, the results suggest that representation choice matters, but its effect depends on question type, extraction quality, and the reasoning strategy used at inference time.
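
The pairing of reasoning strategies with representations described in the abstract can be summarized as a small lookup. The following is an illustrative sketch only, not the thesis's implementation: the function name `select_strategy` and the constant names are hypothetical, and only the pairing rules and sample counts come from the abstract.

```python
# Illustrative sketch; names are hypothetical, rules are taken from the abstract.
QUESTION_TYPES = ("descriptive", "explanatory", "predictive", "counterfactual")
REPRESENTATIONS = ("Narrative", "NarrativeML", "Both")


def select_strategy(representation: str, question_type: str) -> str:
    """Return the reasoning strategy the abstract pairs with this setting."""
    if representation not in REPRESENTATIONS:
        raise ValueError(f"unknown representation: {representation!r}")
    if question_type not in QUESTION_TYPES:
        raise ValueError(f"unknown question type: {question_type!r}")
    # NTP is used only with NarrativeML, and only for predictive and
    # counterfactual questions; every other combination uses CoT.
    if representation == "NarrativeML" and question_type in ("predictive", "counterfactual"):
        return "NTP"
    return "CoT"


# Evaluation-set size: official test split minus the development subset.
OFFICIAL_TEST_SAMPLES = 5429
DEV_SUBSET_SAMPLES = 50
EVAL_SAMPLES = OFFICIAL_TEST_SAMPLES - DEV_SUBSET_SAMPLES

if __name__ == "__main__":
    for rep in REPRESENTATIONS:
        for qtype in QUESTION_TYPES:
            print(f"{rep:12s} {qtype:15s} {select_strategy(rep, qtype)}")
    print("evaluation samples:", EVAL_SAMPLES)
```

Under these rules, only two of the twelve representation and question-type combinations use NTP; the remaining ten use CoT.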

Completion Date

2026

Semester

Spring

Committee Chair

Karmaker Santu

Degree

Master of Science (M.S.)

College

College of Engineering and Computer Science

Department

Computer Science

Document Type

Dissertation/Thesis

Identifier

DP0053122

Accessibility Statement

This item was created or digitized prior to April 24, 2027, or is a reproduction of legacy media created before that date. It is preserved in its original, unmodified state specifically for research, reference, or historical recordkeeping. In accordance with the ADA Title II Final Rule, the University Libraries provides accessible versions of archival materials upon request. To request an accommodation for this item, please submit an accessibility request form.