Keywords
Large Language Model, Complex Reasoning, Heterogeneous Data Structure, Benchmarking, NLP, Post-Training
Abstract
Large language models (LLMs) increasingly face real-world tasks that require complex reasoning over heterogeneous inputs, including text, numbers, and structured data. This dissertation first investigates the capability boundaries of LLMs in such scenarios. Through the MeetingBank and SportsMetrics benchmarks, we show that while LLMs exhibit linguistic fluency, they struggle with factual accuracy, information density, and quantitative cross-referencing in meeting summarization and sports analysis. Using DecipherPref, a pairwise evaluation framework grounded in the Bradley-Terry-Luce model, we further probe LLMs' inherent preferences for information-rich and lengthy inputs, and we identify critical deficiencies in financial decision-making through DeFine, where models lack precise insight into uncertainty and key quantitative factors. To address these limitations, this dissertation proposes solutions spanning reinforcement learning, data synthesis, ladder fine-tuning, and experience-evolving agents. We introduce SportsGen, a controllable simulator for generating complex reasoning data, and DeFine, a framework leveraging analogical reasoning for financial forecasting. Building on SportsGen, we propose Ladder Training, a progressive curriculum strategy with accuracy-plateau auto-promotion that preserves general reasoning while more than doubling in-domain accuracy, revealing catastrophic forgetting during supervised fine-tuning as a property of data ordering rather than an intrinsic cost of optimization. To enhance verifiability, we develop STRUX, a reinforcement learning framework that uses iterative self-reflection to distill unstructured transcripts into structured, evidence-based investment explanations. For scenarios lacking task-specific data, we propose E2A (Experience Evolving Agent), a training-free framework that manages problem-solving rules through a biologically inspired hypothesis lifecycle.
Together, these methods generalize beyond their original tasks to mathematics, multi-hop logic, and long-context reasoning, advancing LLM capabilities in complex, open-domain environments.
Completion Date
2026
Semester
Spring
Committee Chair
Foroosh, Hassan
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Computer Science
Document Type
Dissertation
Identifier
DP0053163
STARS Citation
Hu, Yebowen, "Benchmarking and Advancing Heterogeneous Data Reasoning with Large Language Models" (2026). Graduate Studies Theses and Dissertations 2026. 88.
https://stars.library.ucf.edu/gradstudies_etd_2026/88
Accessibility Statement
This item was created or digitized prior to April 24, 2027, or is a reproduction of legacy media created before that date. It is preserved in its original, unmodified state specifically for research, reference, or historical recordkeeping. In accordance with the ADA Title II Final Rule, the University Libraries provides accessible versions of archival materials upon request. To request an accommodation for this item, please submit an accessibility request form.