"Non-derivable itemsets for fast outlier detection in large high-dimens"

Faculty Bibliography 2010s

Title

Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data

Authors

A. Koufakou; J. Secretan;M. Georgiopoulos

Comments

Authors: contact us about adding a copy of your work at STARS@ucf.edu

Abbreviated Journal Title

Knowl. Inf. Syst.

Keywords

Outlier detection; Anomaly detection; Frequent itemset mining; Non-Derivable itemsets; Categorical datasets; DATA SETS; ALGORITHMS; Computer Science, Artificial Intelligence; Computer Science, Information; Systems

Abstract

Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pair-wise distances among points. Recently, outlier detection methods were proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by the well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a delta parameter. Our proposed methods use a far smaller collection of sets than the FI collection in order to compute an anomaly score for each data point. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in terms of speed and scalability over FI-based Outlier Detection method. What is significant is that NDI-based methods exhibit similar or better detection accuracy compared to the FI-based methods, which supports our claim that the NDI representation is especially well-suited for the task of detecting outliers. At the same time, the NDI approximation scheme, NADIs is shown to exhibit similar accuracy to the NDI-based method for various delta values and further runtime performance gains. Finally, we offer an in-depth discussion and experimentation regarding the trade-offs of the proposed algorithms and the choice of parameter values.

Journal Title

Knowledge and Information Systems

Volume

Issue/Number

Publication Date

1-1-2011

Document Type

Article

DOI Link

http://dx.doi.org/10.1007/s10115-010-0343-7

Language

English

First Page

697

Last Page

725

WOS Identifier

WOS:000297131000007

ISSN

0219-1377

Recommended Citation

"Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data" (2011). Faculty Bibliography 2010s. 1503.
https://stars.library.ucf.edu/facultybib2010/1503

Find in your library

COinS

Faculty Bibliography 2010s

Title

Authors

Authors

Comments

Abbreviated Journal Title

Keywords

Abstract

Journal Title

Volume

Issue/Number

Publication Date

Document Type

DOI Link

Language

First Page

Last Page

WOS Identifier

ISSN

Recommended Citation

Explore

Connect

Faculty Bibliography 2010s

Title

Authors

Authors

Comments

Abbreviated Journal Title

Keywords

Abstract

Journal Title

Volume

Issue/Number

Publication Date

Document Type

DOI Link

Language

First Page

Last Page

WOS Identifier

ISSN

Recommended Citation

Share

Explore

Connect