Keywords
Data Mining, Outlier Detection, Anomaly Detection, Distributed Datasets, Categorical Datasets, Mixed-Type Attribute Datasets
Abstract
An important problem that appears often when analyzing data involves identifying irregular or abnormal data points called outliers. This problem broadly arises under two scenarios: when outliers are to be removed from the data before analysis, and when useful information or knowledge can be extracted by the outliers themselves. Outlier Detection in the context of the second scenario is a research field that has attracted significant attention in a broad range of useful applications. For example, in credit card transaction data, outliers might indicate potential fraud; in network traffic data, outliers might represent potential intrusion attempts. The basis of deciding if a data point is an outlier is often some measure or notion of dissimilarity between the data point under consideration and the rest. Traditional outlier detection methods assume numerical or ordinal data, and compute pair-wise distances between data points. However, the notion of distance or similarity for categorical data is more difficult to define. Moreover, the size of currently available data sets dictates the need for fast and scalable outlier detection methods, thus precluding distance computations. Additionally, these methods must be applicable to data which might be distributed among different locations. In this work, we propose novel strategies to efficiently deal with large distributed data containing mixed-type attributes. Specifically, we first propose a fast and scalable algorithm for categorical data (AVF), and its parallel version based on MapReduce (MR-AVF). We extend AVF and introduce a fast outlier detection algorithm for large distributed data with mixed-type attributes (ODMAD). Finally, we modify ODMAD in order to deal with very high-dimensional categorical data. Experiments with large real-world and synthetic data show that the proposed methods exhibit large performance gains and high scalability compared to the state-of-the-art, while achieving similar accuracy detection rates.
Notes
If this is your thesis or dissertation, and want to learn how to access it or for more information about readership statistics, contact us at STARS@ucf.edu
Graduation Date
2009
Advisor
Georgiopoulos, Michael
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Electrical Engineering and Computer Science
Degree Program
Computer Engineering
Format
application/pdf
Identifier
CFE0002734
URL
http://purl.fcla.edu/fcla/etd/CFE0002734
Language
English
Release Date
September 2009
Length of Campus-only Access
None
Access Status
Doctoral Dissertation (Open Access)
STARS Citation
Koufakou, Anna, "Scalable And Efficient Outlier Detection In Large Distributed Data Sets With Mixed-type Attributes" (2009). Electronic Theses and Dissertations. 3986.
https://stars.library.ucf.edu/etd/3986