Title
Discovering Similar Passages Within Large Text Documents
Keywords
passage retrieval; plagiarism detection; text alignment
Abstract
We present a novel general method for discovering similar passages within large text documents based on adapting and extending the well-known Smith-Waterman dynamic programming local sequence alignment algorithm. We extend that algorithm for large document analysis by defining: (a) a recursive procedure for discovering multiple non-overlapping aligned passages within a given document pair; (b) a matrix splicing method for processing long texts; (c) a chaining method for combining sequence strands; and (d) an inexact similarity measure for determining token matches. We show that an implementation of this method is computationally efficient and produces very high precision with good recall for several types of order-based plagiarism and that it achieves higher overall performance than the best reported methods against the PAN 2013 text alignment test corpus. © 2014 Springer International Publishing.
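For orientation, the baseline that the abstract builds on, Smith-Waterman local alignment, can be sketched over word tokens as below. This is a minimal illustration and not the authors' implementation: the function name `smith_waterman`, the scoring values (match=2, mismatch=-1, gap=-1), and the use of exact token equality are all assumptions made for the example. The paper's extensions (recursive multi-passage discovery, matrix splicing, strand chaining, and the inexact similarity measure) are not shown.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment of token lists a and b.

    Returns (score, (a_start, a_end), (b_start, b_end)) with half-open spans.
    Scoring values are illustrative assumptions, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # Exact equality here; the paper uses an inexact similarity measure.
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # align the two tokens
                          H[i - 1][j] + gap,      # skip a token of a
                          H[i][j - 1] + gap)      # skip a token of b
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, end_i), (j, end_j)


source = "the quick brown fox jumps".split()
suspect = "a quick brown fox sleeps".split()
score, a_span, b_span = smith_waterman(source, suspect)
shared = source[a_span[0]:a_span[1]]
```

In this toy pair, `shared` holds the tokens of the aligned passage from the first text. The full score matrix grows with the product of the two document lengths, which suggests why a method aimed at long texts would need something like the matrix splicing step the abstract mentions.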
Publication Date
1-1-2014
Publication Title
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume
8685 LNCS
Page Range
98-109
Document Type
Article; Proceedings Paper
Personal Identifier
scopus
DOI Link
https://doi.org/10.1007/978-3-319-11382-1_10
Copyright Status
Unknown
Scopus ID
84906777083 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/84906777083
STARS Citation
Glinos, Demetrios, "Discovering Similar Passages Within Large Text Documents" (2014). Scopus Export 2010-2014. 9206.
https://stars.library.ucf.edu/scopus2010/9206