Space-Efficient Indexes For Forbidden Extension Queries

Keywords

Data structures; Document retrieval; String algorithms; Suffix trees; Top-k

Abstract

Document listing is a fundamental problem in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem such as ranked document retrieval, document listing with two patterns and forbidden patterns have been studied. We introduce the problem of document retrieval with forbidden extension. Let D={T1,T2,…,TD} be a collection of D string documents of n characters in total, and P+ and P− be two query patterns, where P+ is a proper prefix of P−. We call P− as the forbidden extension of the included pattern P+. A forbidden extension query 〈P+,P−〉 asks to report all occ documents in D that contains P+ as a substring, but does not contain P− as one. A top-k forbidden extension query 〈P+,P−,k〉 asks to report those k documents among the occ documents that are most relevant to P+, where each document is given a unique fixed score (PageRank) and the relevance of a document is determined based on its score. We present a linear index (in words) with an O(|P−|+occ) query time for the document listing problem. For the top-k version of the problem, we achieve the following space-time trade-offs: • O(n) space (in words) and O(|P−|log⁡σ+k) query time.• |CSA|+|CSA⁎|+Dlog⁡n/D+O(n) bits and O(search(P−)+k⋅tSA⋅log2+ϵ⁡n) query time, where ϵ>0 is an arbitrarily small constant.• |CSA|+O(nlog⁡D) bits and O(search(P−)+(k+log⁡D)log⁡D) query time.Here σ is the size of the alphabet set, CSA (of size |CSA| bits) is the compressed suffix array (CSA) of the concatenated text of all documents, CSAd is the CSA of Td and |CSA⁎|=∑d=1D|CSAd|. Also, search(P−) is the time for pattern matching and tSA is the time to find suffix (or inverse suffix) array value.

Publication Date

5-1-2018

Publication Title

Journal of Discrete Algorithms

Volume

50

Number of Pages

23-35

Document Type

Article

Personal Identifier

scopus

DOI Link

https://doi.org/10.1016/j.jda.2018.09.001

Socpus ID

85055133466 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/85055133466

This document is currently not available here.

Share

COinS