Keywords
LASSO, SNP markers, Regularization parameter
Abstract
This study employs Lasso regression to analyze high-dimensional genetic data for predicting flowering time in maize, specifically Days to Anthesis (DtoA). Lasso, or Least Absolute Shrinkage and Selection Operator, is a form of linear regression that adds an L1 penalty to the model, encouraging sparsity by shrinking some coefficients exactly to zero. This property makes Lasso well suited to feature selection in large datasets, as it retains the most influential predictors while discarding irrelevant variables. Unlike Ridge regression, which applies an L2 penalty on the squared magnitude of the coefficients, Lasso's L1 penalty induces sparsity and therefore yields a clearer interpretation of the selected variables. This feature selection capability is crucial in genetic studies, where the number of predictors far exceeds the number of observations. The study systematically compares different values of the Lasso penalty parameter (λ) to examine the trade-off between model performance and sparsity. Smaller values of λ admit more variables into the model, increasing complexity and the risk of overfitting, while larger values promote sparsity, which can reduce accuracy if too many informative variables are removed. Ridge regression, although useful for regularizing the model and reducing overfitting, does not perform variable selection to the same degree, because it shrinks coefficients toward zero without fully eliminating them. By selecting the optimal λ for Lasso, we obtain a model that is both interpretable and effective for identifying genetic markers. This approach provides a robust framework for feature selection in genetic studies and highlights Lasso's advantage over Ridge in contexts where both accuracy and variable interpretability are essential.
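
The abstract does not reproduce the analysis code; the following is a minimal Python sketch, assuming scikit-learn and NumPy with simulated SNP-like genotype data rather than the study's actual maize dataset, that illustrates how the L1 penalty zeroes out coefficients as λ grows while the L2 penalty of Ridge only shrinks them. All variable names and the simulated data are illustrative assumptions.

# Minimal sketch (simulated data, not the study's pipeline): Lasso vs. Ridge
# sparsity as the regularization strength lambda increases.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_lines, n_snps = 200, 2000                          # far more markers than observations
X = rng.integers(0, 3, size=(n_lines, n_snps)).astype(float)  # 0/1/2 genotype codes
true_beta = np.zeros(n_snps)
true_beta[:10] = rng.normal(0.0, 1.0, 10)            # only 10 markers affect the trait
y = X @ true_beta + rng.normal(0.0, 1.0, n_lines)    # simulated Days to Anthesis

for lam in [0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=lam, max_iter=10000).fit(X, y)
    ridge = Ridge(alpha=lam).fit(X, y)
    print(f"lambda={lam}: Lasso keeps {np.sum(lasso.coef_ != 0)} markers, "
          f"Ridge keeps {np.sum(ridge.coef_ != 0)}")

In practice the penalty λ would be chosen by cross-validation (for example scikit-learn's LassoCV), mirroring the study's comparison of λ values against model accuracy and sparsity.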
STARS Citation
Kesse, Godfred Ahenkroa, "Variable Selection using Lasso Regression" (2025). Data Science and Data Mining. 28.
https://stars.library.ucf.edu/data-science-mining/28