Keywords

PBWT, Algorithm, Identity-by-Descent, Genetic ancestry, Biobank

Abstract

Haplotype matching is the task of finding identical segments shared among a set of aligned genetic sequences. Haplotype matches are important because they reflect underlying biological phenomena. Long matching segments may indicate identity-by-descent (IBD) segments, which imply some degree of genealogical relatedness between the individuals carrying the haplotypes. Similarly, long matching segments that are adjacent to each other may indicate recombination events. Advances in genotyping technology have made it feasible to genotype large numbers of individuals, resulting in many biobanks across the world. Data at this scale requires efficient algorithms for analysis.

In this dissertation, we develop positional Burrows-Wheeler transform (PBWT)-based methods that allow efficient haplotype matching queries and show their application to global ancestry inference. PBWT is an efficient data structure, introduced by Richard Durbin, that enables haplotype matching on a bi-allelic haplotype panel. We generalize the haplotype matching problem to composite haplotype matching patterns and provide combinatorial definitions of crossover recombination patterns, errors of phasing algorithms, and gene conversion. We develop a space-efficient, single-scan algorithm that uses multiple PBWT columns to find composite haplotype matches. Furthermore, we address the memory bottleneck of dynamic PBWT algorithms and develop a dynamic compressed PBWT called Dynamic μ-PBWT. We run-length compress the PBWT columns and store them in B+ trees for efficient updates. This enables efficient insertion of a haplotype into, and deletion of a haplotype from, the Dynamic μ-PBWT. Lastly, we apply haplotype matching queries to develop a reference-based global ancestry inference tool. We use IBD segments shared between the query individual and the reference panel to infer the global ancestry of the query individual. We show that this method is an efficient alternative in populations with high IBD sharing. We also develop a method to refine the reference panel so that it can better approximate the ancestral haplotypes.
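As a rough illustration of the two building blocks named above (PBWT columns and their run-length compression), the following minimal Python sketch builds the columns of a small bi-allelic panel in positional prefix order and run-length encodes each one. It is not the dissertation's implementation: the function names `pbwt_columns` and `run_length_encode` and the toy panel are hypothetical, the B+ tree storage and the dynamic updates of Dynamic μ-PBWT are not modeled, and the construction follows the standard stable partition by allele from the original PBWT description.

```python
from typing import List, Tuple


def pbwt_columns(haplotypes: List[List[int]]) -> List[Tuple[List[int], List[int]]]:
    """Return, for each site k, (order, column).

    `order` lists the haplotype indices sorted by their reversed prefixes
    ending just before site k; `column` holds the alleles at site k in
    that order. Alleles are assumed to be 0 or 1 (bi-allelic panel).
    """
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    order = list(range(len(haplotypes)))
    result = []
    for k in range(n_sites):
        column = [haplotypes[i][k] for i in order]
        result.append((order, column))
        # Stable partition: haplotypes carrying 0 at site k come first,
        # then those carrying 1, each group keeping its previous order.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
    return result


def run_length_encode(column: List[int]) -> List[Tuple[int, int]]:
    """Compress a column into (allele, run_length) pairs."""
    runs: List[List[int]] = []
    for allele in column:
        if runs and runs[-1][0] == allele:
            runs[-1][1] += 1
        else:
            runs.append([allele, 1])
    return [(allele, length) for allele, length in runs]


if __name__ == "__main__":
    # Hypothetical toy panel: 4 haplotypes, 4 bi-allelic sites.
    panel = [
        [0, 1, 0, 1],
        [1, 1, 0, 0],
        [0, 1, 0, 1],
        [1, 0, 1, 0],
    ]
    for k, (order, column) in enumerate(pbwt_columns(panel)):
        print(k, order, column, run_length_encode(column))
```

Because haplotypes that share a long match ending at a site are adjacent in the prefix order, the columns tend to contain long runs of equal alleles, which is what makes run-length compression effective on large panels.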

Completion Date

2025

Semester

Fall

Committee Chair

Zhang, Shaojie

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Department of Computer Science

Format

PDF

Identifier

DP0029814

Document Type

Thesis

Campus Location

Orlando (Main) Campus
