Keywords

Large Language Model, Complex Reasoning, Heterogeneous Data Structure, Benchmarking, NLP, Post-Training

Abstract

Large language models (LLMs) increasingly face real-world tasks that require complex reasoning over heterogeneous inputs, including text, numbers, and structured data. This dissertation first investigates the capability boundaries of LLMs in such scenarios. Through the MeetingBank and SportsMetrics benchmarks, we show that while LLMs exhibit linguistic fluency, they struggle with factual accuracy, information density, and quantitative cross-referencing in meeting summarization and sports analysis. Using DecipherPref, a pairwise evaluation framework grounded in the Bradley-Terry-Luce model, we further probe LLMs' inherent preferences for information-rich and lengthy inputs, and we identify critical deficiencies in financial decision-making through DeFine, where models lack precise insight into uncertainty and key quantitative factors. To address these limitations, this dissertation proposes solutions spanning reinforcement learning, data synthesis, ladder fine-tuning, and experience-evolving agents. We introduce SportsGen, a controllable simulator for generating complex reasoning data, and DeFine, a framework leveraging analogical reasoning for financial forecasting. Building on SportsGen, we propose Ladder Training, a progressive curriculum strategy with accuracy-plateau auto-promotion that preserves general reasoning while more than doubling in-domain accuracy. This reveals catastrophic forgetting during supervised fine-tuning to be a property of data ordering rather than an intrinsic cost of optimization. To enhance verifiability, we develop STRUX, a reinforcement learning framework that uses iterative self-reflection to distill unstructured transcripts into structured, evidence-based investment explanations. For scenarios lacking task-specific data, we propose E2A (Experience Evolving Agent), a training-free framework that manages problem-solving rules through a biologically inspired hypothesis lifecycle. Together, these methods generalize beyond their original tasks to mathematics, multi-hop logic, and long-context reasoning, advancing LLM capabilities in complex, open-domain environments.

Completion Date

2026

Semester

Spring

Committee Chair

Foroosh, Hassan

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Format

PDF

Document Type

Dissertation

Identifier

DP0053163

Accessibility Statement

This item was created or digitized prior to April 24, 2027, or is a reproduction of legacy media created before that date. It is preserved in its original, unmodified state specifically for research, reference, or historical recordkeeping. In accordance with the ADA Title II Final Rule, the University Libraries provides accessible versions of archival materials upon request. To request an accommodation for this item, please submit an accessibility request form.