Keywords

Cardiovascular Diseases (CVD), Machine Learning, Random Forest, XGBoost, AUC-ROC, Predictive Modeling, Feature Importance, Linear Classifiers, Feature Independence, Precision, Accuracy, Recall, Gradient Boosting, Generalization.

Abstract

Cardiovascular diseases (CVD) are a leading cause of mortality, necessitating advanced predictive models to aid early detection and prevention. This study explores the application of machine learning techniques, including Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, and XGBoost, to predict CVD risk using a dataset of 69,997 observations encompassing demographic, clinical, and lifestyle factors. Data preprocessing involved one-hot encoding of categorical variables and feature scaling to ensure compatibility with all models. Model performance was evaluated using accuracy, precision, recall, F1-score, and AUC-ROC. Among the models, XGBoost achieved the highest accuracy at 74%, leveraging its gradient-boosting framework to handle feature interactions and imbalanced data effectively. Random Forest, with an accuracy of 73%, provided insights into feature importance, highlighting systolic blood pressure and age as critical predictors. In contrast, KNN performed worst, with an accuracy of 66%, which the study attributes to its sensitivity to scaling and high-dimensional data. These findings underscore the potential of ensemble methods such as XGBoost and Random Forest to support clinical decision-making and public health strategies for mitigating CVD risk.
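
For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal sketch of how accuracy, precision, recall, F1-score, and AUC-ROC can be computed for a binary classifier. It is not the study's code: the function names, the 0.5 decision threshold, and the eight toy labels and scores are all assumptions made up for illustration, and the printed values say nothing about the reported results.

```python
# Illustrative only: toy labels and scores, not data from the study.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from hard 0/1 predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def auc_roc(y_true, scores):
    """AUC-ROC as the probability that a random positive case is scored
    above a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical ground truth (1 = CVD present) and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.7]

# Assumed decision threshold of 0.5 to turn probabilities into labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = auc_roc(y_true, scores)
print(metrics)
print("auc_roc:", auc)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUC-ROC is computed from the raw scores and summarizes ranking quality across all thresholds, which is one reason the two kinds of metric are usually reported together.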

Included in

Data Science Commons
