Automated Machine Learning: Intelligent Binning Data Preparation and Regularized Regression Classifier
Abstract
Automated machine learning (AutoML), the practice of automating the complete pipeline from raw dataset to trained machine learning model, has become a prominent trend. It not only relieves data scientists' workload but also allows non-experts to complete modeling tasks without a solid understanding of statistical inference and machine learning. One limitation of AutoML frameworks is that data quality can differ significantly from batch to batch. Consequently, the fitted model can perform very poorly on some batches because of distribution shift in some numerical predictors. In this dissertation, we develop an intelligent binning method to resolve this problem. In addition, various regularized regression classifiers (RRCs), including Ridge, Lasso, and Elastic Net regression, are tested to further enhance model performance after binning. We focus on the binary classification problem and have developed an AutoML framework in Python that handles the entire data preparation process, including data partitioning and intelligent binning. The system has been tested extensively through simulations and analyses of real datasets, and the results show that (1) all models perform better with intelligent binning on both balanced and imbalanced binary classification problems; (2) regression-based methods are more sensitive to intelligent binning than tree-based methods, and RRCs can outperform tree-based methods when intelligent binning is used; (3) weighted RRC obtains the best results among the methods compared; and (4) our framework is an effective and reliable tool for conducting AutoML.
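The abstract does not describe the intelligent binning algorithm itself, so the following is only a minimal sketch of the general idea, under stated assumptions: plain equal-frequency (quantile) binning stands in for the dissertation's method, an L2 (Ridge) penalty on logistic regression stands in for an RRC, and inverse-frequency class weights are one possible reading of "weighted RRC". All function names and parameter values are hypothetical. The sketch shows why binning can blunt distribution shift: cut points are learned on the training batch, so an extreme value in a later batch falls into an end bin and receives the same score as any other value in that bin.

```python
import bisect
import math
import random

def quantile_edges(values, n_bins):
    """Return n_bins - 1 interior cut points at equal-frequency quantiles."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]

def bin_index(value, edges):
    """Map a value to a bin; out-of-range values fall into the end bins."""
    return bisect.bisect_right(edges, value)

def one_hot(index, n_bins):
    """Encode a bin index as an indicator vector."""
    return [1.0 if j == index else 0.0 for j in range(n_bins)]

def sigmoid(z):
    # Clip z so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def fit_weighted_ridge_logistic(X, y, l2=0.01, lr=0.5, epochs=300):
    """Batch gradient descent on class-weighted log loss plus an L2 penalty."""
    n, d = len(X), len(X[0])
    pos = sum(y)
    # Assumed weighting scheme: inverse class frequency ("balanced" weights).
    w_pos, w_neg = n / (2.0 * pos), n / (2.0 * (n - pos))
    coef, intercept = [0.0] * d, 0.0
    for _ in range(epochs):
        grad, grad0 = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(intercept + sum(c * v for c, v in zip(coef, xi)))
            err = (w_pos if yi == 1 else w_neg) * (p - yi)
            grad0 += err
            for j, v in enumerate(xi):
                grad[j] += err * v
        intercept -= lr * grad0 / n
        # The L2 penalty shrinks coefficients; the intercept is not penalized.
        coef = [c - lr * (g / n + l2 * c) for c, g in zip(coef, grad)]
    return coef, intercept

# Toy imbalanced training batch: the label is 1 only for large x.
random.seed(0)
train_x = [random.gauss(0.0, 1.0) for _ in range(400)]
train_y = [1 if v > 1.0 else 0 for v in train_x]

N_BINS = 5
edges = quantile_edges(train_x, N_BINS)
X = [one_hot(bin_index(v, edges), N_BINS) for v in train_x]
coef, intercept = fit_weighted_ridge_logistic(X, train_y)

def score(value):
    """Predicted probability of class 1 for a raw numerical value."""
    features = one_hot(bin_index(value, edges), N_BINS)
    return sigmoid(intercept + sum(c * v for c, v in zip(coef, features)))

# A shifted value far outside the training range lands in the top bin.
print(score(3.0), score(300.0), score(-2.0))
```

In this sketch `score(3.0)` and `score(300.0)` are identical because both values fall into the top bin, which is the robustness property binning is meant to provide; how the dissertation chooses bin boundaries or weights may well differ.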
Graduation Date
2023
Semester
Spring
Advisor
Wang, Chung-Ching
Degree
Doctor of Philosophy (Ph.D.)
College
College of Sciences
Department
Statistics and Data Science
Degree Program
Big Data Analytics
Identifier
CFE0009637; DP0027673
URL
https://purls.library.ucf.edu/go/DP0027673
Language
English
Release Date
May 2023
Length of Campus-only Access
None
Access Status
Doctoral Dissertation (Open Access)
STARS Citation
Zhu, Jianbin, "Automated Machine Learning: Intelligent Binning Data Preparation and Regularized Regression Classifier" (2023). Electronic Theses and Dissertations, 2020-2023. 1706.
https://stars.library.ucf.edu/etd2020/1706