A Scalable and Efficient Outlier Detection Strategy for Categorical Data
Abstract
Outlier detection has received significant attention in many applications, such as credit card fraud detection and network intrusion detection. Most existing research efforts focus on numerical datasets and cannot be directly applied to categorical datasets, where there is little sense in ordering the data and calculating distances among data points. Furthermore, a number of current outlier detection methods require quadratic time with respect to the dataset size and usually need multiple scans of the data; these features are undesirable when the datasets are large and scattered over multiple geographically distributed sites. In this paper, we focus on, and experimentally evaluate, a few representative current outlier detection approaches (one based on entropy and two based on frequent itemsets) that are geared towards categorical datasets. In addition, we introduce a simple, scalable and efficient outlier detection algorithm that has the advantage of discovering outliers in categorical datasets by performing a single scan of the dataset. We compare this newly introduced algorithm with the existing outlier detection strategies mentioned above. The conclusion from this comparison is that the simple outlier detection algorithm that we introduce is more efficient (faster) than the existing strategies, and as effective (accurate) in discovering outliers.
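The abstract does not spell out how the proposed algorithm works, so the following is only a minimal sketch of one generic way to score categorical records without ordering values or computing distances: count how often each attribute value occurs, then rate each record by the average frequency of its values. The function names, the scoring rule, and the toy records are all assumptions made for illustration; this is not presented as the thesis's algorithm.

```python
from collections import Counter


def frequency_scores(rows):
    """Score each row by the mean frequency of its attribute values.

    Lower scores mean rarer values, i.e. more outlier-like rows.
    This scoring rule is an illustrative assumption.
    """
    if not rows:
        return []
    n_attrs = len(rows[0])
    # One counting pass: how often each value occurs in each attribute column.
    counts = [Counter() for _ in range(n_attrs)]
    for row in rows:
        for j, value in enumerate(row):
            counts[j][value] += 1
    # Score every row from the counts alone; no pairwise distances are needed.
    return [
        sum(counts[j][value] for j, value in enumerate(row)) / n_attrs
        for row in rows
    ]


def lowest_scoring(rows, k):
    """Return the indices of the k rows with the lowest scores."""
    scores = frequency_scores(rows)
    return sorted(range(len(rows)), key=lambda i: scores[i])[:k]


# Hypothetical toy records: (protocol, service, flag).
rows = [
    ("tcp", "http", "SF"),
    ("tcp", "http", "SF"),
    ("tcp", "http", "SF"),
    ("tcp", "smtp", "SF"),
    ("udp", "dns", "REJ"),
]
```

In this sketch the cost grows linearly with the number of records and attributes, in contrast to the quadratic, distance-based methods the abstract mentions. Calling `lowest_scoring(rows, 1)` on the toy records picks out the last record, whose values each occur only once.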
Notes
This item is only available in print in the UCF Libraries. If this is your thesis or dissertation, you can help us make it available online for use by researchers around the world. Contact STARS for more information.
Thesis Completion
2007
Semester
Spring
Advisor
Georgiopoulos, Michael
Degree
Bachelor of Science (B.S.)
College
College of Engineering and Computer Science
Degree Program
Computer Engineering
Subjects
Dissertations, Academic -- Engineering and Computer Science; Engineering and Computer Science -- Dissertations, Academic
Format
Identifier
DP0022173
Language
English
Access Status
Open Access
Length of Campus-only Access
None
Document Type
Honors in the Major Thesis
Recommended Citation
Ortiz, Enrique, "A Scalable and Efficient Outlier Detection Strategy for Categorical Data" (2007). HIM 1990-2015. 656.
https://stars.library.ucf.edu/honorstheses1990-2015/656