Abstract

Building accurate classifiers for predicting group membership is difficult when data are skewed or imbalanced, as is typical of real-world data sets. As a result, the classifier tends to be biased towards the over-represented group. This is known as the class imbalance problem, which induces bias in the classifier, particularly when the imbalance is high. Class-imbalanced data usually also suffer from intrinsic data properties beyond imbalance alone. The problem intensifies at the larger levels of imbalance most commonly found in observational studies. Extreme cases of class imbalance are found in many domains, including fraud detection, mammography for cancer detection, and post-term births. These rare events are usually the most costly or carry the highest level of risk and are therefore of most interest. To combat class imbalance, the machine learning community has relied upon embedded, data preprocessing, and ensemble learning approaches. Exploratory research has identified several factors that perpetuate misclassification in class-imbalanced data. However, the relationship between the learner and imbalanced data remains poorly understood across the competing approaches. Data preprocessing approaches are appealing because they divide the problem space in two, which allows for simpler models. However, most of these approaches have little theoretical basis, although in some cases empirical evidence supports the improvement. The main goal of this research is to introduce new a priori based re-sampling methods that improve concept learning within class-imbalanced data. The results highlight the robustness of these techniques' performance on publicly available data sets from different domains containing various levels of imbalance. The theoretical and empirical reasons for these results are explored and discussed.

Notes

If this is your thesis or dissertation and you want to learn how to access it, or would like more information about readership statistics, contact us at STARS@ucf.edu

Graduation Date

2016

Semester

Spring

Advisor

Xanthopoulos, Petros

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Degree Program

Modeling and Simulation; Engineering

Format

application/pdf

Identifier

CFE0006169

URL

http://purl.fcla.edu/fcla/etd/CFE0006169

Language

English

Release Date

May 2016

Length of Campus-only Access

None

Access Status

Doctoral Dissertation (Open Access)

STARS Citation

Rivera, William, "A Priori Synthetic Sampling for Increasing Classification Sensitivity in Imbalanced Data Sets" (2016). Electronic Theses and Dissertations. 4895.
https://stars.library.ucf.edu/etd/4895

Included in

Engineering Commons
