Title

Discovering Similar Passages Within Large Text Documents

Keywords

passage retrieval; plagiarism detection; text alignment

Abstract

We present a novel general method for discovering similar passages within large text documents based on adapting and extending the well-known Smith-Waterman dynamic programming local sequence alignment algorithm. We extend that algorithm for large document analysis by defining: (a) a recursive procedure for discovering multiple non-overlapping aligned passages within a given document pair; (b) a matrix splicing method for processing long texts; (c) a chaining method for combining sequence strands; and (d) an inexact similarity measure for determining token matches. We show that an implementation of this method is computationally efficient and produces very high precision with good recall for several types of order-based plagiarism and that it achieves higher overall performance than the best reported methods against the PAN 2013 text alignment test corpus. © 2014 Springer International Publishing.
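As a rough illustration of the base technique the abstract names, the following is a minimal sketch of Smith-Waterman local alignment applied to word tokens with an inexact token match. It is not the authors' implementation: the scoring values (`match`, `mismatch`, `gap`), the `threshold`, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration, and the recursive multi-passage search, matrix splicing, and chaining steps described in the abstract are not shown.

```python
from difflib import SequenceMatcher


def token_similarity(a: str, b: str) -> float:
    """Hypothetical inexact match: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def smith_waterman(src, susp, match=2, mismatch=-1, gap=-1, threshold=0.8):
    """Return (score, (src_start, src_end), (susp_start, susp_end)).

    Ranges are half-open token indices of the single best local alignment.
    All scoring parameters are illustrative assumptions.
    """
    rows, cols = len(src) + 1, len(susp) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def score(i, j):
        # Tokens count as a match when their similarity reaches the threshold.
        similar = token_similarity(src[i - 1], susp[j - 1]) >= threshold
        return match if similar else mismatch

    # Fill the scoring matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + score(i, j),  # align the two tokens
                H[i - 1][j] + gap,              # skip a source token
                H[i][j - 1] + gap,              # skip a suspicious token
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    end_i, end_j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(i, j):
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, end_i), (j, end_j)


if __name__ == "__main__":
    src = ["a", "b", "the", "colour", "red", "x"]
    susp = ["q", "the", "color", "red", "z", "w"]
    best, (s0, s1), (t0, t1) = smith_waterman(src, susp)
    print(best, src[s0:s1], susp[t0:t1])
```

In this sketch "colour" and "color" are treated as matching tokens because their character-level similarity exceeds the assumed threshold, so the aligned passage spans three tokens in each text. A full-matrix version like this uses memory proportional to the product of the two document lengths, which is presumably the kind of cost that motivates the matrix splicing step mentioned in the abstract.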

Publication Date

1-1-2014

Publication Title

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Volume

8685 LNCS

Pages

98-109

Document Type

Article; Proceedings Paper

Personal Identifier

scopus

DOI Link

https://doi.org/10.1007/978-3-319-11382-1_10

Scopus ID

84906777083 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/84906777083

