A Scalable and Efficient Outlier Detection Strategy for Categorical Data
Outlier detection has received significant attention in many applications, such as credit card fraud detection and network intrusion detection. Most existing research focuses on numerical datasets and cannot be directly applied to categorical datasets, where there is little sense in ordering the data or calculating distances among data points. Furthermore, a number of current outlier detection methods require time quadratic in the dataset size and usually need multiple scans of the data; these properties are undesirable when the datasets are large and scattered over multiple geographically distributed sites. In this paper, we focus on and experimentally evaluate a few representative current outlier detection approaches (one based on entropy and two based on frequent itemsets) that are geared towards categorical datasets. In addition, we introduce a simple, scalable and efficient outlier detection algorithm that discovers outliers in categorical datasets by performing a single scan of the dataset. This newly introduced algorithm is compared with the existing outlier detection strategies mentioned above. The conclusion from this comparison is that the simple algorithm we introduce is more efficient (faster) than the existing strategies and as effective (accurate) in discovering outliers.
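To make the idea of scoring categorical records without distances concrete, the following is a minimal sketch that assumes a simple frequency-based score: each record is scored by the average number of times its attribute values occur in the dataset, and the lowest-scoring records are treated as outlier candidates. This is an illustrative assumption, not the algorithm introduced in the thesis, whose details the abstract does not give. The function names `frequency_scores` and `lowest_scoring` and the sample records are hypothetical. The counts are gathered in one pass over the data; the sketch then re-reads the in-memory records to score them.

```python
from collections import Counter


def frequency_scores(records):
    """Return one score per record: the mean dataset frequency of its values.

    Hypothetical illustration only; no ordering or distance between
    categorical values is needed, just per-attribute counts.
    """
    if not records:
        return []
    n_attrs = len(records[0])
    # One Counter per attribute, filled in a single pass over the records.
    counts = [Counter() for _ in range(n_attrs)]
    for rec in records:
        for j, value in enumerate(rec):
            counts[j][value] += 1
    # Score each record from the collected counts (records are re-read here).
    return [
        sum(counts[j][value] for j, value in enumerate(rec)) / n_attrs
        for rec in records
    ]


def lowest_scoring(records, k):
    """Return the indices of the k records with the lowest frequency score."""
    scores = frequency_scores(records)
    order = sorted(range(len(records)), key=lambda i: scores[i])
    return order[:k]


if __name__ == "__main__":
    # Made-up sample data: (protocol, service, status).
    sample = (
        [("tcp", "http", "ok")] * 3
        + [("udp", "dns", "ok")] * 2
        + [("icmp", "ftp", "fail")]
    )
    print(lowest_scoring(sample, 1))
```

In this made-up sample, the last record is the only one whose values each occur once, so it receives the lowest score. The cost of the sketch grows linearly with the number of records and attributes, in contrast to the quadratic cost of pairwise-distance methods mentioned above.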
This item is only available in print in the UCF Libraries.
Bachelor of Science (B.S.)
College of Engineering and Computer Science
Dissertations, Academic -- Engineering and Computer Science; Engineering and Computer Science -- Dissertations, Academic
Honors in the Major Thesis
Ortiz, Enrique, "A Scalable and Efficient Outlier Detection Strategy for Categorical Data" (2007). HIM 1990-2015. 656.