The positional Burrows-Wheeler transform (PBWT) is a foundational data structure for representing haplotype matches of biobank scale. Once the PBWT panel of a set of haplotypes are constructed, efficient algorithms are available for “All vs. All” positional substring matching, finding exact matches of substrings in pre-aligned strings, for haplotypes within the panel, and “One vs. All” positional substring match query for an out-of-panel haplotype against all haplotypes in the panel. While the original PBWT was designed from linear reference genomes, GBWT was proposed to extend PBWT to genome graphs that allow large insertions and deletions. However, there are no GBWT algorithms for haplotype matching. In this work, we develop the efficient algorithms for “All vs. All” and “One vs. All” haplotype set-maximal and long matching algorithms for GBWT. For a GBWT containing a panel of paths P, we show algorithms similar to the matching algorithms of PBWT. Our algorithms achieves theoretically optimal time complexity to output all “All vs. All” matches in time linear to the size of the input panel (O(∑|Pi| + |out put|)), and quasilinear time to the length of the query path for “One vs. All” path match queries (O(|Q| log σ + |out put| log σ ), where σ is the maximum out- degree in the GBWT and out put is the set of discovered path matches). Under the constant σ assumption made by gPBWT and GBWT, these algorithms are in fact linear. Our algorithms open the possibilities for applications of efficient positional substring matching in pangenome references such as identical-by-descent (IBD) segment identification and genotype imputation.
Bachelor of Science (B.S.)
College of Engineering and Computer Science
Mathematics, Computer Engineering
Length of Campus-only Access
Villalobos, Seba, "Efficient Haplotype Matching on Biobank-Scale Reference Graphs" (2023). Honors Undergraduate Theses. 1477.