Keywords
Identity-by-Descent, Biobank-Scale Data, Benchmarking, Haplotype Matching, Parallel Algorithm, High Performance Computing
Abstract
As genomic biobank initiatives continue to grow, the availability of large-scale genotype datasets, encompassing hundreds of thousands to millions of individuals, has transformed genetic research and biomedical discovery. However, the sheer volume of this data presents major computational barriers. Efficient and scalable methods are urgently needed to process and extract meaningful signals from biobank-scale data using modern multi-core architectures. One central task in this domain is the detection of identity-by-descent (IBD) segments, which underpins a range of applications including genealogical inference, disease mapping, phasing, and population structure analysis.
This dissertation addresses these challenges by presenting a sequence of contributions that combine rigorous benchmarking, novel algorithm design, and high-performance implementation. First, we construct a comprehensive and reproducible benchmarking framework to evaluate the performance, accuracy, and resource efficiency of widely used IBD detection tools. This benchmark provides a clear picture of how different algorithms behave under real-world biobank conditions and serves as a guide for researchers seeking the most appropriate tool for their analytical goals.
Second, we introduce HP-PBWT, the first haplotype-based parallel implementation of the positional Burrows-Wheeler transform (PBWT). Unlike traditional PBWT-based tools that process haplotypes serially, HP-PBWT exploits haplotype-based parallelism to leverage modern multi-core CPUs, dramatically reducing runtime while maintaining the core mathematical structure of PBWT. This work bridges the gap between classical algorithmic theory and practical scalability on modern hardware.
Finally, we develop RaPID2, an improved IBD detection tool that incorporates both performance-oriented engineering and algorithmic adaptability. RaPID2 integrates fixed-size and dynamic window modes, applies memory-efficient representations, and introduces multiple levels of task parallelism. It replicates the detection power of its predecessor, RaPID, while achieving significant speedups, up to 32-fold in high-resolution settings. Compared to other popular tools like hap-IBD, RaPID2 delivers competitive accuracy with dramatically improved runtime at moderate IBD thresholds such as 2 cM.
Together, these contributions demonstrate how high-performance computing principles can be harnessed to meet the demands of modern genomics. The tools and frameworks developed in this dissertation are released as open-source software, enabling transparent evaluation and future integration into large-scale genetic studies.
Completion Date
2025
Semester
Summer
Committee Chair
Zhang, Shaojie
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Computer Science
Format
Identifier
DP0029618
Language
English
Document Type
Thesis
Campus Location
Orlando (Main) Campus
STARS Citation
Tang, Kecong, "Algorithms And Benchmarking For Parallel Identity-By-Descent Segment Detection" (2025). Graduate Thesis and Dissertation post-2024. 379.
https://stars.library.ucf.edu/etd2024/379