Keywords
PBWT, Algorithm, Identity-by-Descent, Genetic ancestry, Biobank
Abstract
Haplotype matching is the task of finding identical segments shared among a set of aligned genetic sequences. Haplotype matches matter because they reflect underlying biological phenomena. Long matching segments may indicate identity-by-descent segments, which imply some degree of genealogical relatedness between the individuals carrying the haplotypes. Similarly, long matching segments that are adjacent to each other may indicate recombination events. Advances in genotyping technology have made it feasible to genotype large numbers of individuals, giving rise to many biobanks across the world. Analyzing genetic data at this scale requires efficient algorithms.
In this dissertation, we develop positional Burrows-Wheeler transform (PBWT)-based methods that support efficient haplotype matching queries and demonstrate their application to global ancestry inference. PBWT is an efficient data structure, introduced by Richard Durbin, that enables haplotype matching in a bi-allelic haplotype panel. We generalize the haplotype matching problem to composite haplotype matching patterns and provide combinatorial definitions of crossover recombination patterns, phasing errors, and gene conversion. We develop a space-efficient, single-scan algorithm that uses multiple PBWT columns to find composite haplotype matches. Furthermore, we address the memory bottleneck of dynamic PBWT algorithms by developing a dynamic compressed PBWT called Dynamic μ-PBWT: we run-length compress the PBWT columns and store them in B+ trees, which enables efficient insertion and deletion of haplotypes. Lastly, we apply haplotype matching queries to develop a reference-based global ancestry inference tool. We use identity-by-descent (IBD) segments shared between the query individual and the reference panel to infer the query individual's global ancestry. We show that this method is an efficient alternative in populations with high IBD sharing. We also develop a method to refine the reference panel so that it better approximates the ancestral haplotypes.
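To make the idea of run-length compressed PBWT columns concrete, the following is a minimal sketch of the basic PBWT column construction for a small bi-allelic panel, followed by run-length encoding of each column. It is an illustration only and not the dissertation's Dynamic μ-PBWT: the B+ tree storage and the insertion/deletion operations are not shown, and the names `pbwt_columns`, `run_length_encode`, and `panel` are hypothetical, chosen here for the example.

```python
from itertools import groupby


def pbwt_columns(haplotypes):
    """Yield (order, column) for each site of a bi-allelic panel.

    `order` lists haplotype indices sorted by their reversed prefixes
    ending just before the current site; `column` holds the alleles at
    the current site read in that order (the PBWT column).
    """
    n_sites = len(haplotypes[0]) if haplotypes else 0
    order = list(range(len(haplotypes)))
    for k in range(n_sites):
        column = [haplotypes[i][k] for i in order]
        yield order, column
        # Stable partition: haplotypes with allele 0 come first, then
        # those with allele 1, each group keeping its previous order.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones


def run_length_encode(column):
    """Compress a column into (allele, run_length) pairs."""
    return [(allele, len(list(run))) for allele, run in groupby(column)]


# Toy panel (assumed data): 4 haplotypes, 4 bi-allelic sites.
panel = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
]

for site, (order, column) in enumerate(pbwt_columns(panel)):
    print(site, order, column, run_length_encode(column))
```

Because the sort order groups haplotypes that share long matches ending at the current site, the columns tend to contain long runs of identical alleles in real panels, which is what makes run-length compression worthwhile.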
Completion Date
2025
Semester
Fall
Committee Chair
Zhang, Shaojie
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Department of Computer Science
Format
Identifier
DP0029814
Document Type
Thesis
Campus Location
Orlando (Main) Campus
STARS Citation
Shakya, Pramesh, "PBWT-based Methods for Biobank-Scale Haplotype Data Analysis" (2025). Graduate Thesis and Dissertation post-2024. 498.
https://stars.library.ucf.edu/etd2024/498