Keywords
Text-to-Code, Text-to-SQL, Semantic Parsing
Abstract
Databases play a vital role in today's digital landscape, enabling effective data storage, management, and retrieval for businesses and other organizations. However, interacting with databases often requires knowledge of query (e.g., SQL) and analysis, which can be a barrier for many users. In natural language processing, the text-to-code task, which converts natural language text into query and analysis code, bridges this gap by allowing users to access and manipulate data using everyday language. This dissertation investigates different challenges in text-to-code (including text-to-SQL as a subtask), with a focus on four primary contributions to the field. As a solution to the lack of statistical analysis in current text-to-code tasks, we introduce SIGMA, a text-to-Code dataset with statistical analysis, featuring 6000 questions with Python code labels. Baseline models show promising results, indicating that our new task can support both statistical analysis and SQL queries simultaneously. Second, we present Ar-Spider, the first Arabic cross-domain text-to-SQL dataset that addresses multilingual limitations. We have conducted experiments with LGESQL and S²SQL models, enhanced by our Context Similarity Relationship (CSR) approach, which demonstrates competitive performance, reducing the performance gap between the Arabic and English text-to-SQL datasets. Third, we address context-dependent text-to-SQL task, often overlooked by current models. The SParC dataset was explored by utilizing different question representations and in-context learning prompt engineering techniques. Then, we propose GAT-SQL, an advanced prompt engineering approach that improves both zero-shot and in-context learning experiments. GAT-SQL sets new benchmarks in both SParC and CoSQL datasets. Finally, we introduce Ar-SParC, a context-dependent Arabic text-to-SQL dataset that enables users to interact with the model through a series of interrelated questions. In total, 40 experiments were conducted to investigate this dataset using various prompt engineering techniques, and a novel technique called GAT Corrector was developed, which significantly improved the performance of all baseline models.
Completion Date
2024
Semester
Fall
Committee Chair
Wang, Liqiang
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Computer Science
Degree Program
Computer Science
Format
Identifier
DP0029003
Language
English
Release Date
December 2024
Access Status
Dissertation/Thesis
Campus Location
Orlando (Main) Campus
STARS Citation
almohaimeed, saleh, "Towards Robust and Accurate Text-to-Code Generation" (2024). Graduate Thesis and Dissertation post-2024. 4.
https://stars.library.ucf.edu/etd2024/4
Accessibility Status
PDF accessibility verified using Adobe Acrobat Pro Accessibility Checker