Keywords
Genetic study, Single nucleotide polymorphism (SNP) markers, Lasso feature selection, RMSE (Root Mean Square Error)
Abstract
Maize, one of the most valuable crop species, has been the subject of genetic study and experimentation for more than a century. Over time, small genetic changes have produced remarkable adaptations, so that related species show a wide spectrum of similarities and differences. Because it is common practice to genotype thousands of single nucleotide polymorphism (SNP) markers across thousands of individuals, the data set we are dealing with suffers from the "small n, large p" problem: there are noticeably more predictor variables than observations. The original data set has n = 4981 rows and p = 7390 columns, of which around 487 rows are entirely missing. During pre-processing we removed these empty rows, along with any columns that did not apply to our analysis. We then split the data into a training set and a test set in an 80:20 ratio. To address the high dimensionality, we use Lasso regularization for feature selection. Lasso, an extension of linear regression, adds a regularization term to the least-squares loss function; this term penalizes model complexity by shrinking the coefficient weights, driving some of them to exactly zero. With a penalty parameter λ of 0.2, the procedure identified 22 features as highly significant to the study. Finally, a Lasso regression model was fitted, and its RMSE on the test set was 3.494.
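The workflow described in the abstract (drop entirely missing rows, split 80:20, fit a Lasso with λ = 0.2, report the test RMSE) can be sketched as follows. This is a minimal illustration and not the author's code: the abstract does not name the software used, so the sketch relies only on NumPy, runs on synthetic SNP-like data, and uses hypothetical names such as `lasso_coordinate_descent`. It also assumes the penalty is scaled as (1/(2n))·||y − b0 − Xb||² + λ·||b||₁, which may differ from the scaling used in the study.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - b0 - X b||^2 + lam * ||b||_1 (assumed scaling)."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centring handles the intercept
    col_sq = (Xc ** 2).sum(axis=0) / n       # per-feature mean squared value
    beta = np.zeros(p)
    resid = yc.copy()                        # residual for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:             # constant column carries no signal
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = Xc[:, j] @ resid / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += Xc[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Synthetic stand-in for the SNP matrix (genotypes coded 0/1/2); NOT the study data.
rng = np.random.default_rng(0)
n_rows, p = 220, 50
X_raw = rng.integers(0, 3, size=(n_rows, p)).astype(float)
true_beta = np.zeros(p)
true_beta[:5] = [3.0, -2.0, 2.5, -3.0, 2.0]  # only five informative markers
y_raw = X_raw @ true_beta + rng.normal(0.0, 1.0, size=n_rows)

# Simulate entirely missing rows, then drop them as in the pre-processing step.
X_raw[rng.choice(n_rows, size=20, replace=False)] = np.nan
keep = ~np.isnan(X_raw).all(axis=1)
X, y = X_raw[keep], y_raw[keep]

# 80:20 train/test split on shuffled row indices.
idx = rng.permutation(len(y))
n_train = int(0.8 * len(y))
train, test = idx[:n_train], idx[n_train:]

intercept, beta = lasso_coordinate_descent(X[train], y[train], lam=0.2)
selected = np.flatnonzero(beta)              # features with non-zero weight
pred = intercept + X[test] @ beta
rmse = float(np.sqrt(np.mean((y[test] - pred) ** 2)))
print(f"selected {selected.size} of {p} features, test RMSE = {rmse:.3f}")
```

The features kept by the model are those with non-zero coefficients, which is how a Lasso fit doubles as a feature-selection step. On real data, λ would normally be chosen by cross-validation rather than fixed in advance.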
Semester
Summer 2023
Course Name
STA 5703 Data Mining 1
Instructor Name
Xie, Rui
College
College of Sciences
STARS Citation
Agbemade, Emil, "Variable Selection and Regression Analysis" (2023). Data Science and Data Mining. 11.
https://stars.library.ucf.edu/data-science-mining/11