Cancer classification, gene expression data, RNA-Seq, machine learning, feature selection, dimensionality reduction, network analysis.


This study examines the classification of cancer types using the RNA-Seq (HiSeq) PANCAN dataset from the UCI Machine Learning Repository, a collection of gene expression measurements across multiple tumor samples. With the aim of improving cancer diagnosis and treatment, our methodology addresses the challenges inherent in high-dimensional data, such as the Hughes effect and the curse of dimensionality, through feature selection and machine learning. Tree-based algorithms, particularly Random Forest, reduce the dataset to the seventy genes most relevant for tumor classification, and PCA and Kernel PCA are applied for dimensionality reduction, enabling the visualization of non-linear patterns in gene expression data. The research further investigates the gene interaction network through network analysis, using modularity metrics to identify significant community structures linked to biological processes in cancer. Our evaluation compares several machine learning models and highlights the precision and low test error rates of SVM, Logistic Regression, and KNN, which suggests that these models effectively exploit the dataset’s inherent separability. The study provides a systematic framework for analyzing gene expression data and lays groundwork for further research into the genetic mechanisms of cancer, with implications for personalized medicine and treatment strategies.
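
The pipeline summarized above (Random Forest gene ranking, PCA and Kernel PCA projection, classifier evaluation) can be illustrated with a minimal sketch. This is not the study's code: it assumes scikit-learn, substitutes synthetic data from make_classification for the PANCAN expression matrix, and uses illustrative hyperparameters and variable names (top_genes, test_error) that do not come from the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in for a samples-by-genes expression matrix
# with five tumor classes; the real dataset is not loaded here.
X, y = make_classification(
    n_samples=300, n_features=500, n_informative=40, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Rank genes by Random Forest importance (fit on training data only)
# and keep the seventy highest-ranked columns.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
top_genes = np.argsort(forest.feature_importances_)[::-1][:70]

# Standardize the selected genes using training-set statistics.
scaler = StandardScaler().fit(X_train[:, top_genes])
Z_train = scaler.transform(X_train[:, top_genes])
Z_test = scaler.transform(X_test[:, top_genes])

# Two-dimensional projections for visualization: linear PCA and
# Kernel PCA with an RBF kernel (kernel choice is an assumption).
X_pca = PCA(n_components=2).fit_transform(Z_train)
X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(Z_train)

# Evaluate one of the classifiers named in the abstract on held-out samples.
svm = SVC(kernel="linear").fit(Z_train, y_train)
test_error = 1.0 - svm.score(Z_test, y_test)
print(f"SVM test error on synthetic data: {test_error:.3f}")
```

Fitting the forest and the scaler on the training split only keeps the held-out error estimate from being influenced by the test samples. The error printed here reflects the synthetic data and says nothing about the results reported in the study.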


Spring 2024

Instructor Name

Xie, Rui

Accessibility Status

PDF accessibility verified using Adobe Acrobat Pro Accessibility Checker

Included in

Data Science Commons