Abstract
This paper investigates the impact of the LASSO, mRMR, SHAP, and Reinforcement Feature Selection techniques on random forest models for the breast cancer subtype markers ER, HER2, PR, and TN. It also identifies a small subset of biomarkers that could potentially cause the disease and explains them using explainable AI techniques. This matters because in areas such as healthcare, where a model's output contributes to the diagnosis of an individual, understanding why the model makes a specific decision is essential for reliable AI. A further contribution is the use of feature selection methods to identify a small subset of biomarkers capable of predicting whether a specific RNA sequence will be positive for one of the cancer labels. The study begins by obtaining baseline accuracy metrics using a random forest model on The Cancer Genome Atlas's breast cancer database. It then explores the effects of feature selection with different numbers of selected features, which significantly influenced model accuracy, and selects a small number of potential biomarkers that may produce a specific type of breast cancer. Once the biomarkers were selected, the explainable AI techniques SHAP and LIME were applied to the models, providing insight into influential biomarkers and their impact on predictions. The main results are threefold: some of the subsets share biomarkers that had high influence over the model prediction; the LASSO and Reinforcement Feature Selection sets scored the highest accuracy of all sets; and SHAP and LIME gave some insight into how the selected features affect the models' predictions.
Thesis Completion
2023
Semester
Fall
Thesis Chair/Advisor
Wang, Liqiang
Degree
Bachelor of Science (B.S.)
College
College of Engineering and Computer Science
Department
Computer Science
Degree Program
Computer Science
Language
English
Access Status
Open Access
Release Date
12-15-2023
Recommended Citation
La Rosa Giraud, David E., "Biomarker Identification for Breast Cancer Types Using Feature Selection and Explainable AI Methods" (2023). Honors Undergraduate Theses. 1524.
https://stars.library.ucf.edu/honorstheses/1524