Harnessing Large Language Models for Automated Essay Scoring in Public Health

Location

Space Coast

Start Date

28-5-2025 2:45 PM

End Date

28-5-2025 3:10 PM

Description

Automated Essay Scoring (AES) using Large Language Models (LLMs) has emerged as a promising approach to assessing student writing, offering faster grading, more consistent evaluation, and detailed feedback. This study investigates the performance of two commonly used LLM-based tools, ChatGPT and Copilot, in scoring essays from a Public Health Introduction course. It evaluates how closely these tools align with human rater judgments and examines the impact of prompt engineering on scoring accuracy. Quadratic Weighted Kappa (QWK) scores were used to measure agreement between the models and the manual grader, and per-criterion deviations were analyzed to identify discrepancies. Results indicate that ChatGPT aligns more closely with manual grading (QWK = 0.5342) than Copilot (QWK = 0.2186), with Copilot exhibiting greater score variability and larger deviations across criteria. Despite its stronger overall performance, ChatGPT underestimates scores on specific criteria, such as recommendations, highlighting areas for improvement. This study underscores the potential of LLMs for AES while identifying critical areas for optimization, paving the way for their effective integration into educational assessment frameworks.
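For context, QWK weights disagreements by the squared distance between rating categories, so a model that misses by two rubric points is penalized more than one that misses by one. A minimal sketch of how model-human agreement of this kind might be computed, using scikit-learn and hypothetical score arrays (not the study's data):

```python
# Minimal sketch: Quadratic Weighted Kappa between a human grader and an
# LLM scorer. The score lists below are hypothetical examples only.
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-essay rubric scores on a 1-5 scale.
human_scores = [4, 3, 5, 2, 4, 3, 5, 1]
model_scores = [4, 4, 5, 2, 3, 3, 4, 2]

# weights="quadratic" penalizes disagreements by the squared distance
# between rating categories (a 2-point gap costs more than a 1-point gap).
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.4f}")
```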
