Space-Efficient Indexes For Forbidden Extension Queries
Keywords
Data structures; Document retrieval; String algorithms; Suffix trees; Top-k
Abstract
Document listing is a fundamental problem in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem such as ranked document retrieval, document listing with two patterns and forbidden patterns have been studied. We introduce the problem of document retrieval with forbidden extension. Let D={T1,T2,…,TD} be a collection of D string documents of n characters in total, and P+ and P− be two query patterns, where P+ is a proper prefix of P−. We call P− as the forbidden extension of the included pattern P+. A forbidden extension query 〈P+,P−〉 asks to report all occ documents in D that contains P+ as a substring, but does not contain P− as one. A top-k forbidden extension query 〈P+,P−,k〉 asks to report those k documents among the occ documents that are most relevant to P+, where each document is given a unique fixed score (PageRank) and the relevance of a document is determined based on its score. We present a linear index (in words) with an O(|P−|+occ) query time for the document listing problem. For the top-k version of the problem, we achieve the following space-time trade-offs: • O(n) space (in words) and O(|P−|logσ+k) query time.• |CSA|+|CSA⁎|+Dlogn/D+O(n) bits and O(search(P−)+k⋅tSA⋅log2+ϵn) query time, where ϵ>0 is an arbitrarily small constant.• |CSA|+O(nlogD) bits and O(search(P−)+(k+logD)logD) query time.Here σ is the size of the alphabet set, CSA (of size |CSA| bits) is the compressed suffix array (CSA) of the concatenated text of all documents, CSAd is the CSA of Td and |CSA⁎|=∑d=1D|CSAd|. Also, search(P−) is the time for pattern matching and tSA is the time to find suffix (or inverse suffix) array value.
Publication Date
5-1-2018
Publication Title
Journal of Discrete Algorithms
Volume
50
Number of Pages
23-35
Document Type
Article
Personal Identifier
scopus
DOI Link
https://doi.org/10.1016/j.jda.2018.09.001
Copyright Status
Unknown
Socpus ID
85055133466 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/85055133466
STARS Citation
Biswas, Sudip; Ganguly, Arnab; Shah, Rahul; and Thankachan, Sharma V., "Space-Efficient Indexes For Forbidden Extension Queries" (2018). Scopus Export 2015-2019. 9193.
https://stars.library.ucf.edu/scopus2015/9193