Title
Non-Derivable Itemsets For Fast Outlier Detection In Large High-Dimensional Categorical Data
Keywords
Anomaly detection; Categorical datasets; Frequent itemset mining; Non-Derivable itemsets; Outlier detection
Abstract
Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pair-wise distances among points. Recently, outlier detection methods were proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by the well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a δ parameter. Our proposed methods use a far smaller collection of sets than the FI collection in order to compute an anomaly score for each data point. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in terms of speed and scalability over FI-based Outlier Detection method. What is significant is that NDI-based methods exhibit similar or better detection accuracy compared to the FI-based methods, which supports our claim that the NDI representation is especially well-suited for the task of detecting outliers. At the same time, the NDI approximation scheme, NADIs is shown to exhibit similar accuracy to the NDI-based method for various δ values and further runtime performance gains. Finally, we offer an in-depth discussion and experimentation regarding the trade-offs of the proposed algorithms and the choice of parameter values. © 2010 Springer-Verlag London Limited.
Publication Date
12-1-2011
Publication Title
Knowledge and Information Systems
Volume
29
Issue
3
Number of Pages
697-725
Document Type
Article
Personal Identifier
scopus
DOI Link
https://doi.org/10.1007/s10115-010-0343-7
Copyright Status
Unknown
Socpus ID
81155135134 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/81155135134
STARS Citation
Koufakou, Anna; Secretan, Jimmy; and Georgiopoulos, Michael, "Non-Derivable Itemsets For Fast Outlier Detection In Large High-Dimensional Categorical Data" (2011). Scopus Export 2010-2014. 2321.
https://stars.library.ucf.edu/scopus2010/2321