Harnessing Large Language Models for Automated Essay Scoring in Public Health
Location
Space Coast
Start Date
28-5-2025 2:45 PM
End Date
28-5-2025 3:10 PM
Description
Automated Essay Scoring (AES) using Large Language Models (LLMs) has emerged as a promising solution for assessing student writing, offering faster grading, unbiased evaluation, and detailed feedback. This study investigates the performance of two commonly used LLM-based tools, ChatGPT and Copilot, in scoring essays from an introductory Public Health course. It evaluates how closely these tools align with human rater judgments and examines the impact of prompt engineering on scoring accuracy. Quadratic Weighted Kappa (QWK) scores were used to measure agreement between each model and the manual grader, and deviations were analyzed to identify discrepancies for each criterion. Results indicate that ChatGPT aligns more closely with manual grading (QWK = 0.5342) than Copilot (QWK = 0.2186), with Copilot exhibiting greater score variability and larger deviations across criteria. Despite its stronger performance, ChatGPT underestimates scores in specific areas such as recommendations, highlighting room for improvement. This study underscores the potential of LLMs for AES while identifying critical areas for optimization, paving the way for their effective integration into educational assessment frameworks.
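As a rough illustration of the agreement metric named above, the sketch below computes QWK between a human grader's scores and an LLM's scores using scikit-learn's cohen_kappa_score with quadratic weights. The score vectors and the 0-5 rubric scale are invented placeholders, not data from the study; only the metric itself follows the abstract.

```python
# Minimal sketch of a QWK computation, assuming per-essay rubric
# scores on an integer 0-5 scale. The arrays are hypothetical
# placeholders, not the study's data.
from sklearn.metrics import cohen_kappa_score

human_scores = [4, 3, 5, 2, 4, 3, 5, 1, 4, 2]  # manual grader
llm_scores   = [4, 2, 5, 3, 3, 3, 4, 1, 4, 2]  # LLM-based tool

# weights="quadratic" turns Cohen's kappa into Quadratic Weighted
# Kappa, which penalizes large disagreements more than small ones.
qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"QWK = {qwk:.4f}")
```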
Recommended Citation
Mehra, Shabnam, "Harnessing Large Language Models for Automated Essay Scoring in Public Health" (2025). Teaching and Learning with AI Conference Presentations. 38.
https://stars.library.ucf.edu/teachwithai/2025/wednesday/38