Harnessing Large Language Models for Automated Essay Scoring in Public Health

Contributor

University of Central Florida. Faculty Center for Teaching and Learning; University of Central Florida. Division of Digital Learning; Teaching and Learning with AI Conference (2025 : Orlando, Fla.)

Location

Space Coast

Start Date

May 28, 2025, 2:45 PM

End Date

May 28, 2025, 3:10 PM

Publisher

University of Central Florida Libraries

Keywords

Automated essay scoring; Large language models; Public health education; Grading accuracy; Prompt engineering

Subjects

Academic writing--Evaluation; English language--Writing--Evaluation; Grading and marking (Students)--Computer-assisted instruction; Academic writing--Computer-assisted instruction; Language and education--Evaluation

Description

Automated Essay Scoring (AES) using Large Language Models (LLMs) has emerged as a promising solution for assessing student writing, offering faster grading, more consistent evaluation, and detailed feedback. This study investigates the performance of two commonly used LLM-based tools, ChatGPT and Copilot, in scoring essays from an introductory Public Health course. It evaluates how well these tools align with human rater judgments and examines the impact of prompt engineering on scoring accuracy. Quadratic Weighted Kappa (QWK) was used to measure agreement between each model and the manual grader, and per-criterion deviations were analyzed to identify discrepancies. Results indicate that ChatGPT aligns more closely with manual grading (QWK = 0.5342) than Copilot (QWK = 0.2186), with Copilot exhibiting greater score variability and larger deviations across criteria. Despite its stronger performance, ChatGPT underestimates scores on specific criteria such as recommendations, highlighting room for improvement. This study underscores the potential of LLMs for AES while identifying critical areas for optimization, paving the way for their effective integration into educational assessment frameworks.
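
For context on the agreement metric: QWK penalizes rater disagreements by the square of their distance on the score scale, so near-misses count less than large discrepancies. Below is a minimal sketch of computing QWK between a human grader and an LLM grader, assuming an integer rubric scale; the score lists are hypothetical, and scikit-learn's cohen_kappa_score with quadratic weights is shown as one standard implementation, not necessarily the tooling used in the study.

# Minimal QWK sketch (hypothetical data; assumes scikit-learn is installed).
# Definition: QWK = 1 - (sum_ij w_ij * O_ij) / (sum_ij w_ij * E_ij),
# where w_ij = (i - j)^2 / (N - 1)^2, O is the observed rating matrix,
# and E is the expected matrix under rater independence.
from sklearn.metrics import cohen_kappa_score

human_scores = [4, 3, 5, 2, 4, 3]  # manual grader, one score per essay
llm_scores = [4, 2, 5, 3, 3, 3]    # LLM grader on the same essays

qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"QWK = {qwk:.4f}")

A QWK of 1.0 indicates perfect agreement, 0 indicates agreement no better than chance; values in the 0.5 range, like ChatGPT's 0.5342 reported above, are conventionally read as moderate agreement.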

Language

eng

Type

Presentation

Format

application/vnd.openxmlformats-officedocument.presentationml.presentation

Rights Statement

All Rights Reserved

Audience

Faculty, Students, Instructional designers
