Abstract

The positional Burrows-Wheeler transform (PBWT) is a foundational data structure for representing haplotype matches of biobank scale. Once the PBWT panel of a set of haplotypes are constructed, efficient algorithms are available for “All vs. All” positional substring matching, finding exact matches of substrings in pre-aligned strings, for haplotypes within the panel, and “One vs. All” positional substring match query for an out-of-panel haplotype against all haplotypes in the panel. While the original PBWT was designed from linear reference genomes, GBWT was proposed to extend PBWT to genome graphs that allow large insertions and deletions. However, there are no GBWT algorithms for haplotype matching. In this work, we develop the efficient algorithms for “All vs. All” and “One vs. All” haplotype set-maximal and long matching algorithms for GBWT. For a GBWT containing a panel of paths P, we show algorithms similar to the matching algorithms of PBWT. Our algorithms achieves theoretically optimal time complexity to output all “All vs. All” matches in time linear to the size of the input panel (O(∑|Pi| + |out put|)), and quasilinear time to the length of the query path for “One vs. All” path match queries (O(|Q| log σ + |out put| log σ ), where σ is the maximum out- degree in the GBWT and out put is the set of discovered path matches). Under the constant σ assumption made by gPBWT and GBWT, these algorithms are in fact linear. Our algorithms open the possibilities for applications of efficient positional substring matching in pangenome references such as identical-by-descent (IBD) segment identification and genotype imputation.

Thesis Completion

2023

Semester

Summer

Thesis Chair/Advisor

Zhang, Shaojie

Degree

Bachelor of Science (B.S.)

College

College of Engineering and Computer Science

Department

Computer Science

Degree Program

Mathematics, Computer Engineering

Language

English

Access Status

Campus Access

Length of Campus-only Access

3 years

Release Date

8-15-2026

Share

COinS