Harnessing Large Language Models for Automated Essay Scoring in Public Health
Contributor
University of Central Florida. Faculty Center for Teaching and Learning; University of Central Florida. Division of Digital Learning; Teaching and Learning with AI Conference (2025 : Orlando, Fla.)
Location
Space Coast
Start Date
28-5-2025 2:45 PM
End Date
28-5-2025 3:10 PM
Publisher
University of Central Florida Libraries
Keywords
Automated essay scoring; Large language models; Public health education; Grading accuracy; Prompt engineering
Subjects
Academic writing--Evaluation; English language--Writing--Evaluation; Grading and marking (Students)--Computer-assisted instruction; Academic writing--Computer-assisted instruction; Language and education--Evaluation
Description
Automated Essay Scoring (AES) using Large Language Models (LLMs) has emerged as a promising approach to assessing student writing, offering faster grading, unbiased evaluation, and detailed feedback. This study investigates the performance of two commonly used LLM-based tools, ChatGPT and Copilot, in scoring essays from an introductory Public Health course. It evaluates how closely these tools align with human rater judgments and examines the impact of prompt engineering on scoring accuracy. Quadratic Weighted Kappa (QWK) scores were used to measure agreement between each model and the manual grader, and score deviations were analyzed to identify discrepancies for each rubric criterion. Results indicate that ChatGPT aligns more closely with manual grading (QWK = 0.5342) than Copilot (QWK = 0.2186), with Copilot exhibiting greater score variability and larger deviations across criteria. Despite its stronger performance, ChatGPT underestimates scores in specific areas such as recommendations, highlighting room for improvement. This study underscores the potential of LLMs for AES while identifying critical areas for optimization, paving the way for their effective integration into educational assessment frameworks.
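For readers unfamiliar with the agreement measure named in the abstract, the sketch below shows one common way to compute Quadratic Weighted Kappa between a human rater's scores and an LLM's scores using scikit-learn. The score lists are illustrative placeholders, not data from the study, and the rubric scale is assumed.

```python
# Minimal sketch: Quadratic Weighted Kappa (QWK) between a human rater and
# an LLM grader. The scores below are hypothetical examples on an assumed
# 0-5 rubric scale, not the study's data.
from sklearn.metrics import cohen_kappa_score

human_scores = [4, 3, 5, 2, 4, 3, 5, 1, 4, 2]
llm_scores   = [4, 2, 5, 3, 4, 4, 4, 1, 3, 2]

# With weights="quadratic", cohen_kappa_score penalizes disagreements by the
# squared distance between the two ratings, which is the QWK statistic.
qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"QWK = {qwk:.4f}")
```

In this framing, a QWK near 1 indicates strong agreement with the human grader, a value near 0 indicates agreement no better than chance, and larger gaps between adjacent scores are penalized more heavily than small ones.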
Language
eng
Type
Presentation
Format
application/vnd.openxmlformats-officedocument.presentationml.presentation
Rights Statement
All Rights Reserved
Audience
Faculty, Students, Instructional designers
Recommended Citation
Mehra, Shabnam, "Harnessing Large Language Models for Automated Essay Scoring in Public Health" (2025). Teaching and Learning with AI Conference Presentations. 38.
https://stars.library.ucf.edu/teachwithai/2025/wednesday/38