Genetic study, Single nucleotide polymorphism (SNP) markers, Lasso feature selection, RMSE (Root Mean Square Error)


Maize, one of the most valuable crop species, has been the subject of genetic study and experimentation for more than a century. Over time, small genetic changes have produced remarkable adaptations, leaving related species with a wide spectrum of similarities and differences. Because it is common practice to genotype thousands of single nucleotide polymorphism (SNP) markers across thousands of individuals, our data set presents a "small n, large p" problem: there are noticeably more predictor variables than observations. The original data set has n = 4981 and p = 7390, with around 487 entirely missing rows. We removed these empty rows during pre-processing, along with any columns not relevant to our analysis. We then split the data into a training set and a test set in an 80:20 ratio. To address the high dimensionality, we use Lasso regularization for feature selection. Lasso, an extension of linear regression, adds a regularization term to the least-squares loss function; this term penalizes model complexity by shrinking the coefficient weights, driving some of them to exactly zero. With a penalty parameter λ of 0.2, this procedure identified 22 features as highly significant to the study. Finally, a Lasso regression model was fit, and its RMSE on the test set was 3.494.
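
The workflow described above (drop empty rows, split 80:20, fit a Lasso model, report test RMSE) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the data are synthetic, the 0/1/2 genotype coding, the sizes, and the helper names `soft_threshold` and `lasso_coordinate_descent` are hypothetical, and the objective is assumed to be (1/(2n))·||y − Xw||² + λ·||w||₁, solved here by plain coordinate descent. Only the 80:20 split and λ = 0.2 are taken from the abstract.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - b0 - X w||^2 + lam * ||w||_1 (assumed form)."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centring handles the intercept
    col_sq = (Xc ** 2).sum(axis=0) / n       # per-feature mean squared value
    w = np.zeros(p)
    resid = yc.copy()                        # residual for the current w
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                     # constant column: leave weight at 0
            # Covariance of feature j with the residual that excludes feature j
            rho = Xc[:, j] @ resid / n + col_sq[j] * w[j]
            w_new = soft_threshold(rho, lam) / col_sq[j]
            if w_new != w[j]:
                resid -= Xc[:, j] * (w_new - w[j])
                w[j] = w_new
    intercept = y_mean - x_mean @ w
    return w, intercept

# --- Synthetic stand-in for the SNP data (all sizes and values are made up) ---
rng = np.random.default_rng(0)
n_rows, p = 200, 50
X_raw = rng.integers(0, 3, size=(n_rows, p)).astype(float)  # assumed 0/1/2 coding
X_raw[rng.choice(n_rows, size=20, replace=False), :] = np.nan  # fully missing rows

# Pre-processing: drop rows that are entirely missing
keep = ~np.isnan(X_raw).all(axis=1)
X = X_raw[keep]

# Hypothetical response driven by only five of the markers, plus noise
true_idx = np.array([3, 11, 19, 27, 42])
y = X[:, true_idx] @ np.full(5, 3.0) + rng.normal(0.0, 1.0, size=X.shape[0])

# 80:20 train/test split
perm = rng.permutation(X.shape[0])
n_train = int(0.8 * X.shape[0])
train, test = perm[:n_train], perm[n_train:]
X_train, y_train = X[train], y[train]
X_test, y_test = X[test], y[test]

# Fit with the penalty value quoted in the abstract
w, b0 = lasso_coordinate_descent(X_train, y_train, lam=0.2)
selected = np.flatnonzero(w)                 # features with non-zero weight

# Root mean square error on the held-out test set
rmse = float(np.sqrt(np.mean((y_test - (X_test @ w + b0)) ** 2)))
print("selected features:", selected.size, "test RMSE:", rmse)
```

The features that survive with non-zero weights play the role of the selected markers, and the printed RMSE is computed only on rows the model never saw during fitting. The numbers produced by this sketch come from made-up data and are not comparable to the 22 features or the RMSE of 3.494 reported above.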


Summer 2023

Course Name

STA 5703 Data Mining 1

Instructor Name

Xie, Rui


College of Sciences

Included in

Data Science Commons