Ranked Document Retrieval For Multiple Patterns

Keywords

Compressed suffix array; Succinct encoding; Suffix array; Suffix tree; Weighted ancestor query

Abstract

Let D={T1,T2,…,TD} be a collection of D documents having n characters in total. Given two patterns P and Q, and an integer k>0, we consider the following queries. • top-k forbidden pattern query: Among all documents containing P, but not Q, report the k documents most relevant to P.• top-k two pattern query: Among all documents that contain both P and Q, report the k documents most relevant to P.For the above two queries, we provide a linear space index with O(|P|+|Q|+nk) query time, under some standard relevance functions such as PageRank and TermFrequency. The document listing version of the above two problems asks to report all t documents that either contain P, but not Q, or contain both P and Q, depending on the query type. As a corollary of the top-k result, we obtain a linear space and O(|P|+|Q|+nt) query time solution for the document listing problems. We conjecture that any significant improvement over these results is highly unlikely. We also consider the scenario when the query consists of more than two patterns. Finally, we present space-efficient indexes for these problems.

Publication Date

10-25-2018

Publication Title

Theoretical Computer Science

Volume

746

Number of Pages

98-111

Document Type

Article

Personal Identifier

scopus

DOI Link

https://doi.org/10.1016/j.tcs.2018.06.029

Socpus ID

85049522666 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/85049522666

This document is currently not available here.

Share

COinS