Data Science and Data Mining

Advancing Cancer Classification through Machine Learning Analysis of RNA-Seq Gene Expression Data

Emil Agbemade, University of Central Florida
Amina Issoufou Anaroua, University of Central Florida
Dimitri Bamba, University of Central Florida

Keywords

Cancer classification, gene expression data, RNA-Seq, machine learning, feature selection, dimensionality reduction, network analysis.

Abstract

This study examines the classification of several cancer types using the RNA-Seq (HiSeq) PANCAN dataset from the UCI Machine Learning Repository, which contains gene expression data across multiple tumor samples. To improve cancer diagnosis and treatment, our methodology addresses the challenges inherent in high-dimensional datasets, such as the Hughes effect and the curse of dimensionality, through feature selection and machine learning. We use tree-based algorithms, particularly Random Forest, to reduce the dataset to the seventy genes most relevant for tumor classification, and we apply PCA and Kernel PCA for dimensionality reduction, which enables the visualization of non-linear patterns in gene expression data. The research further investigates the gene interaction network through network analysis, employing modularity metrics to identify significant community structures linked to biological processes in cancer. Our model evaluation assesses various machine learning models, highlighting the precision and low test error rates of SVM, Logistic Regression, and KNN, which suggests that these models effectively exploit the dataset’s inherent separability. The study’s approach provides a systematic framework for analyzing gene expression data and supports further research into the genetic mechanisms of cancer, with implications for personalized medicine and treatment strategies.
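The workflow summarized in the abstract (Random Forest feature ranking, PCA and Kernel PCA projection, then a comparison of SVM, Logistic Regression, and KNN) can be illustrated with a minimal sketch. This is not the authors' code: it assumes scikit-learn, uses a synthetic matrix from `make_classification` as a stand-in for the PANCAN expression data, and every name and hyperparameter in it (`N_GENES`, `top_genes`, `n_estimators=300`, the RBF kernel, `n_neighbors=5`) is a hypothetical choice for illustration. Only the figure of seventy retained genes comes from the abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# ASSUMPTION: synthetic stand-in for a (samples x genes) expression matrix.
# This is NOT the PANCAN dataset; sizes and class count are arbitrary.
X, y = make_classification(
    n_samples=300, n_features=500, n_informative=40, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 1: rank features by Random Forest importance and keep the top 70.
# The ranking uses training data only, so the test split stays unseen.
N_GENES = 70
forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X_train, y_train)
top_genes = np.argsort(forest.feature_importances_)[::-1][:N_GENES]
X_train_sel = X_train[:, top_genes]
X_test_sel = X_test[:, top_genes]

# Step 2: two-dimensional projections for visualization.
# PCA is linear; Kernel PCA with an RBF kernel can expose non-linear structure.
Z_train = StandardScaler().fit_transform(X_train_sel)
pca_2d = PCA(n_components=2).fit_transform(Z_train)
kpca_2d = KernelPCA(n_components=2, kernel="rbf").fit_transform(Z_train)

# Step 3: compare the three classifiers named in the abstract by test error.
models = {
    "SVM": SVC(kernel="rbf"),
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
test_error = {}
for name, model in models.items():
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train_sel, y_train)
    test_error[name] = 1.0 - pipeline.score(X_test_sel, y_test)

for name, err in test_error.items():
    print(f"{name}: test error = {err:.3f}")
```

In this sketch the feature ranking is fitted on the training split only, a design choice that keeps the reported test error from being optimistically biased by the selection step; whether the original study partitioned its data this way is not stated in the abstract.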

Semester

Spring 2024

Instructor Name

Xie, Rui

STARS Citation

Agbemade, Emil; Anaroua, Amina Issoufou; and Bamba, Dimitri, "Advancing Cancer Classification through Machine Learning Analysis of RNA-Seq Gene Expression Data" (2024). Data Science and Data Mining. 16.
https://stars.library.ucf.edu/data-science-mining/16

Accessibility Status

PDF accessibility verified using Adobe Acrobat Pro Accessibility Checker

Included in

Data Science Commons
