A Practical And Efficient Algorithm For The K-Mismatch Shortest Unique Substring Finding Problem
Keywords
Hamming distance; Mismatch; Shortest unique substring; String
Abstract
This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of $O(nog^k n )$, while maintaining a practical space complexity at $O(kn)$, where n is the string length. When $k>0$, which is the hard case, our new proposal significantly improves the any-case $O(n^2)$ time complexity of the prior best method for k-mismatch shortest unique substring finding. Experimental study shows that our new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, our method processes a 200KB sample DNA sequence with $k=1$ in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, resulting in further significant practical performance improvement. As an example, when using 8 cores, the parallel implementations both achieved processing times that are less than $1/4$ that of the serial implementation, when processing a 10MB sample DNA sequence with $k=2$. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that the trade-off of additional memory usage for significantly improved processing times will be desirable and needed by many users. For example, the best prior solution may spend years to finish a DNA sample of 200MB for any $k>0$, while this new proposal, using 24 cores, can finish processing a sample of this size with $k=1$ in $206.376$ seconds with a peak memory usage of 46GB, which is both easily available and affordable on Cloud for many users. It is expected that this new efficient and practical algorithm for k-mismatch shortest unique substring finding will prove useful to those using the measure on long sequences in fields such as computational biology.
Publication Date
8-15-2018
Publication Title
ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
Number of Pages
428-437
Document Type
Article; Proceedings Paper
Personal Identifier
scopus
DOI Link
https://doi.org/10.1145/3233547.3233564
Copyright Status
Unknown
Socpus ID
85056076989 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/85056076989
STARS Citation
Allen, Daniel R.; Thankachan, Sharma V.; and Xu, Bojian, "A Practical And Efficient Algorithm For The K-Mismatch Shortest Unique Substring Finding Problem" (2018). Scopus Export 2015-2019. 10083.
https://stars.library.ucf.edu/scopus2015/10083