Keywords
Data science salary prediction, linear regression, model diagnostics, Box-Cox transformation, multicollinearity.
Description
This paper presents a statistical analysis of entry-level data science salaries in the United States using multiple linear regression on a dataset covering 2020 to 2024. The study identifies key factors that influence salary outcomes, including job role, experience level, employment type, work arrangement, residency status, and company size. After addressing non-normality, heteroscedasticity, and multicollinearity through data transformation and variable selection, the final model offers improved interpretability and modest predictive power. The findings provide insights for job seekers, educators, and employers seeking to understand and benchmark data science compensation.
Abstract
Accurate salary prediction in data science helps shape career expectations, inform educational strategies, and guide organizational hiring decisions. This study investigates the key factors influencing entry-level data science salaries in the United States by applying a multiple linear regression model to a dataset spanning 2020 to 2024. Through data preprocessing, transformation, and diagnostic evaluation, we identify how job role, experience level, employment type, work arrangement, residency status, and company size affect compensation. Despite challenges such as outliers, heteroscedasticity, and non-normal residuals, model refinements such as the Box-Cox transformation and variable selection improve predictive performance. The final model, while modest in explanatory power, offers actionable insights into salary determinants and lays the groundwork for future improvements in salary prediction.
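The workflow summarized above (a Box-Cox transformation of the response followed by a multiple linear regression on dummy-coded predictors) might be sketched roughly as follows. This is an illustrative sketch only, not the paper's code: the data are synthetic, and the variable names (`remote`, `size`, `salary`) and all coefficient values are assumptions introduced for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500

# Hypothetical predictors: a binary remote-work flag and a three-level company size.
remote = rng.integers(0, 2, size=n)
size = rng.integers(0, 3, size=n)  # 0 = small (baseline), 1 = medium, 2 = large

# Synthetic right-skewed, strictly positive salaries (illustrative only).
log_salary = 11.0 + 0.10 * remote + 0.08 * size + rng.normal(0.0, 0.35, size=n)
salary = np.exp(log_salary)

# Box-Cox: scipy estimates lambda by maximum likelihood when lmbda is omitted.
salary_bc, lam = stats.boxcox(salary)

# Design matrix: intercept, remote flag, dummy columns for medium and large.
X = np.column_stack([
    np.ones(n),
    remote,
    (size == 1).astype(float),
    (size == 2).astype(float),
])

# Ordinary least squares fit on the transformed response.
beta, *_ = np.linalg.lstsq(X, salary_bc, rcond=None)
resid = salary_bc - X @ beta
r_squared = 1.0 - resid @ resid / np.sum((salary_bc - salary_bc.mean()) ** 2)

print(f"estimated lambda: {lam:.3f}")
print(f"R^2 on transformed scale: {r_squared:.3f}")
print(f"skewness before: {stats.skew(salary):.3f}, after: {stats.skew(salary_bc):.3f}")
```

Coefficients fitted this way are on the Box-Cox scale, so they would need to be back-transformed before being read as dollar amounts.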
Instructor Name
Dr. Jongik Chung
Rights
This work is licensed under a Creative Commons Attribution 4.0 International License.
College
College of Sciences
STARS Citation
Deb, Dipok, "Data Science Job Salary Prediction Using Linear Regression" (2025). Data Science and Data Mining. 41.
https://stars.library.ucf.edu/data-science-mining/41