Keywords
high-performance computing, parallel job scheduling, schedule quality, constraint programming, I/O-aware scheduling, Slurm
Abstract
As High-Performance Computing (HPC) becomes increasingly prevalent and resource-intensive, there is a growing need for the development of more efficient job schedulers, which play a crucial role in the performance of HPC clusters. This dissertation manifests a comprehensive approach to this complex issue, contributing to three major components of the problem: (1) metrics of job packing efficiency and fairness, (2) advanced scheduling algorithms, and (3) job resource utilization prediction techniques.
To ensure high relevance of the results, this study emphasizes scheduling objectives. Therefore, scheduling quality metrics are investigated first, yielding a set of metrics that allow comparing alternative schedules and evaluating scheduling goals trade-offs. The set of metrics enables the first comprehensive analysis of effects of different scheduling improvement approaches on several aspects of scheduling quality, covering a variety of list scheduling algorithms as well as constraint programming optimization schedulers. The contribution to the third research area covers techniques to measure and estimate resource usage data. It reports a first-of-a-kind evaluation of various job runtime prediction techniques in improving scheduling quality, demonstrates an approach capable of estimating job parameters beyond the runtime, and explores measuring resources consumed by a job in an HPC cluster.
The dissertation concludes with a practical demonstration of these concepts through an I/O-aware scheduling prototype that measures real-time resource utilization, autonomously determines job resource requirements the scheduler needs, and implements full-featured multi-resource backfill scheduling that accounts for the specific properties of the parallel file system bandwidth resource. The study exhibits the advantages of further reducing I/O congestion—beyond the capability of generic I/O-aware scheduling—and presents the Workload-adaptive scheduling strategy that attains such improvement. This approach features a “two-group” approximation technique to maintain efficient performance regardless of zero-throughput job availability. An evaluation conducted on a real HPC cluster demonstrates the effectiveness of the novel strategy.
Completion Date
2024
Semester
Summer
Committee Chair
Damian Dechev
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Department of Computer Science
Degree Program
Computer Science
Format
application/pdf
Release Date
8-15-2024
Length of Campus-only Access
None
Access Status
Doctoral Dissertation (Open Access)
Campus Location
Orlando (Main) Campus
STARS Citation
Goponenko, Alexander V., "Objective-Driven Strategies for HPC Job Scheduling" (2024). Graduate Thesis and Dissertation 2023-2024. 344.
https://stars.library.ucf.edu/etd2023/344
Accessibility Status
Meets minimum standards for ETDs/HUTs