Keywords

Machine Learning, GWAS, Elastic Net, Principal Component Regression (PCR), Partial Least Squares (PLS), Variable Selection, Predictive Modeling in Genomics

Description

This study examines the prediction of male flowering time in maize using high-dimensional genomic data within a genome-wide association study framework. Penalized regression and latent-variable dimension reduction methods are compared to address challenges related to multicollinearity, dimensionality, and variable selection in genomic prediction. A standardized preprocessing and cross-validation strategy is applied to ensure robust model evaluation. The findings illustrate the complementary roles of regularization and dimension reduction techniques for modeling complex polygenic traits in plant genomics.

Abstract

This study compares three machine learning approaches, Elastic Net (ENET), Principal Component Regression (PCR), and Partial Least Squares (PLS), for variable selection and prediction within a high-dimensional maize GWAS framework. The goal was to accurately predict time to male flowering, a complex polygenic trait, from a large number of highly correlated genetic markers. The ENET model, which combines L1 and L2 penalties, delivered the highest predictive accuracy and identified a small subset of the most influential genetic variants. In contrast, PCR and PLS, both of which rely on dimension reduction, offered a notable advantage in computational speed and model stability. The findings indicate that while ENET provides the most accurate genomic prediction, the latent-variable methods offer an efficient and competitive alternative for analyzing complex traits.

Course Name

STA 5703 Data Mining 1

Instructor Name

Dr. Emil Agbemade

Rights

Creative Commons Attribution 4.0 International License

College

College of Sciences
