Keywords

Data Mining, Outlier Detection, Anomaly Detection, Distributed Datasets, Categorical Datasets, Mixed-Type Attribute Datasets

Abstract

An important problem that appears often when analyzing data involves identifying irregular or abnormal data points called outliers. This problem broadly arises under two scenarios: when outliers are to be removed from the data before analysis, and when useful information or knowledge can be extracted by the outliers themselves. Outlier Detection in the context of the second scenario is a research field that has attracted significant attention in a broad range of useful applications. For example, in credit card transaction data, outliers might indicate potential fraud; in network traffic data, outliers might represent potential intrusion attempts. The basis of deciding if a data point is an outlier is often some measure or notion of dissimilarity between the data point under consideration and the rest. Traditional outlier detection methods assume numerical or ordinal data, and compute pair-wise distances between data points. However, the notion of distance or similarity for categorical data is more difficult to define. Moreover, the size of currently available data sets dictates the need for fast and scalable outlier detection methods, thus precluding distance computations. Additionally, these methods must be applicable to data which might be distributed among different locations. In this work, we propose novel strategies to efficiently deal with large distributed data containing mixed-type attributes. Specifically, we first propose a fast and scalable algorithm for categorical data (AVF), and its parallel version based on MapReduce (MR-AVF). We extend AVF and introduce a fast outlier detection algorithm for large distributed data with mixed-type attributes (ODMAD). Finally, we modify ODMAD in order to deal with very high-dimensional categorical data. Experiments with large real-world and synthetic data show that the proposed methods exhibit large performance gains and high scalability compared to the state-of-the-art, while achieving similar accuracy detection rates.

Notes

If this is your thesis or dissertation, and want to learn how to access it or for more information about readership statistics, contact us at STARS@ucf.edu

Graduation Date

2009

Advisor

Georgiopoulos, Michael

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Electrical Engineering and Computer Science

Degree Program

Computer Engineering

Format

application/pdf

Identifier

CFE0002734

URL

http://purl.fcla.edu/fcla/etd/CFE0002734

Language

English

Release Date

September 2009

Length of Campus-only Access

None

Access Status

Doctoral Dissertation (Open Access)

Share

COinS