Keywords

Identity-by-Descent, Biobank-Scale Data, Benchmarking, Haplotype Matching, Parallel Algorithm, High Performance Computing

Abstract

As genomic biobank initiatives continue to grow, the availability of large-scale genotype datasets, encompassing hundreds of thousands to millions of individuals, has transformed genetic research and biomedical discovery. However, the sheer volume of this data presents major computational barriers. Efficient and scalable methods are urgently needed to process and extract meaningful signals from biobank-scale data using modern multi-core architectures. One central task in this domain is the detection of identity-by-descent (IBD) segments, which underpins a range of applications including genealogical inference, disease mapping, phasing, and population structure analysis.

This dissertation addresses these challenges by presenting a sequence of contributions that combine rigorous benchmarking, novel algorithm design, and high-performance implementation. First, we construct a comprehensive and reproducible benchmarking framework to evaluate the performance, accuracy, and resource efficiency of widely used IBD detection tools. This benchmark provides a clear picture of how different algorithms behave under real-world biobank conditions and serves as a guide for researchers seeking the most appropriate tool for their analytical goals.

Second, we introduce HP-PBWT, the first haplotype-based parallel implementation of the positional Burrows-Wheeler transform (PBWT). Unlike traditional PBWT-based tools that process haplotypes serially, HP-PBWT exploits haplotype-based parallelism to leverage modern multi-core CPUs, dramatically reducing runtime while maintaining the core mathematical structure of PBWT. This work bridges the gap between classical algorithmic theory and practical scalability on modern hardware.

Finally, we develop RaPID2, an improved IBD detection tool that incorporates both performance-oriented engineering and algorithmic adaptability. RaPID2 integrates fixed-size and dynamic window modes, applies memory-efficient representations, and introduces multiple levels of task parallelism. It replicates the detection power of its predecessor, RaPID, while achieving significant speedups, up to 32-fold in high-resolution settings. Compared to other popular tools like hap-IBD, RaPID2 delivers competitive accuracy with dramatically improved runtime at moderate IBD thresholds such as 2 cM.

Together, these contributions demonstrate how high-performance computing principles can be harnessed to meet the demands of modern genomics. The tools and frameworks developed in this dissertation are released as open-source software, enabling transparent evaluation and future integration into large-scale genetic studies.

Completion Date

2025

Semester

Summer

Committee Chair

Zhang, Shaojie

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Format

PDF

Identifier

DP0029618

Language

English

Document Type

Thesis

Campus Location

Orlando (Main) Campus
