Algorithmic Framework For Approximate Matching Under Bounded Edits With Applications To Sequence Analysis
Abstract
We present a novel algorithmic framework for solving approximate sequence matching problems that permit a bounded total number k of mismatches, insertions, and deletions. The core of the framework relies on transforming an approximate matching problem into a corresponding exact matching problem on suitably edited string suffixes, while carefully controlling the required number of such edited suffixes to enable the design of efficient algorithms. For a total input size of n, our framework limits the number of generated edited suffixes to no more than a factor of O(log kn) of the input size (for any constant k), and restricts the algorithm to linear space usage by overlapping the generation and processing of edited suffixes. Our framework improves the best known upper bound of n2k1.5/2Ω(logn/k) for the classic k-edit longest common substring problem [Abboud, Williams, and Yu; SODA 2015] to yield the first strictly sub-quadratic time algorithm that runs in O(nlog kn) time and O(n) space for any constant k. We present similar subquadratic time and linear space algorithms for (i) computing the alignment-free distance between two genomes based on the k-edit average common substring measure, (ii) mapping reads/read fragments to a reference genome while allowing up to k edits, and (iii) computing all-pair maximal k-edit common substrings (also, suffix/prefix overlaps), which has applications in clustering and assembly. We expect our algorithmic framework to be a broadly applicable theoretical tool, and may inspire the design of practical heuristics and software.
Publication Date
1-1-2018
Publication Title
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume
10812 LNBI
Number of Pages
211-224
Document Type
Article; Proceedings Paper
Personal Identifier
scopus
DOI Link
https://doi.org/10.1007/978-3-319-89929-9_14
Copyright Status
Unknown
Socpus ID
85046126687 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/85046126687
STARS Citation
Thankachan, Sharma V.; Aluru, Chaitanya; Chockalingam, Sriram P.; and Aluru, Srinivas, "Algorithmic Framework For Approximate Matching Under Bounded Edits With Applications To Sequence Analysis" (2018). Scopus Export 2015-2019. 9500.
https://stars.library.ucf.edu/scopus2015/9500