A Scalable and Efficient Outlier Detection Strategy for Categorical Data

Abstract

Outlier detection has received significant attention in many applications, such as credit card fraud detection and network intrusion detection. Most of the existing research efforts focus on numerical datasets and cannot be directly applied to categorical sets where there is little sense in ordering the data and calculating distances among data points. Furthermore, a number of the current outlier detection methods require quadratic time with respect to the dataset size and usually need multiple scans of the data; these features are undesirable when the datasets are large and scattered over multiple geographically distributed sites. In this paper, we focus and evaluate, experimentally, a few representative current outlier detection approaches ( one based on entropy and two based on frequent itemsets) that are geared towards categorical sets. In addition, we introduce a simple, scalable and efficient outlier detection algorithm that has the advantage of discovering outliers in categorical datasets by performing a single scan of the dataset. This newly introduced outlier detection algorithm is compared with the existing, and aforementioned outlier detection strategies. The conclusion from this comparison is that the simple outlier detection algorithm that we introduce is more efficient (faster) than the existing strategies, and as effective (accurate) in discovering outliers.

Notes

This item is only available in print in the UCF Libraries. If this is your thesis or dissertation, you can help us make it available online for use by researchers around the world by STARS for more information.

Thesis Completion

2007

Semester

Spring

Advisor

Georgiopoulos, Michael

Degree

Bachelor of Science (B.S.)

College

College of Engineering and Computer Science

Degree Program

Computer Engineering

Subjects

Dissertations, Academic -- Engineering and Computer Science; Engineering and Computer Science -- Dissertations, Academic

Format

Print

Identifier

DP0022173

Language

English

Access Status

Open Access

Length of Campus-only Access

None

Document Type

Honors in the Major Thesis

This document is currently not available here.

Share

COinS