Keywords
concurrent data structures, high performance computing, transactions, persistence, backfill scheduling, run time prediction
Abstract
Modern High Performance Computing (HPC) systems are made up of thousands of server-grade compute nodes linked through a high-speed network interconnect. Each node has tens or even hundreds of CPU cores, with counts continuing to grow on newer HPC clusters. This results in a need to make use of millions of cores per cluster. Fully leveraging these resources is difficult, and there is an active need to design software that scales and fully utilizes the hardware. In this dissertation, we address this gap with a dual approach, considering both intra-node (single node) and inter-node (across nodes) concerns. To aid intra-node performance, we propose two novel concurrent data structures: a transactional vector and a persistent hash map. These designs have broad applicability in any multi-core environment but are particularly useful in HPC, which commonly features many cores per node. For inter-node performance, we propose a metrics-driven approach to improve scheduling quality, using predicted run times to backfill jobs more accurately and aggressively. We augment this approach with application input parameters to further improve the run time predictions. Improved scheduling reduces the number of idle nodes in an HPC cluster, maximizing job throughput. We find that our data structures outperform the prior state-of-the-art while offering additional features. Our backfill technique likewise outperforms previous approaches in simulations, and our run time predictions are significantly more accurate than conventional approaches. Code for these works is freely available, and we plan to deploy these techniques more broadly on real HPC systems in the future.
Completion Date
2024
Semester
Summer
Committee Chair
Dechev, Damian
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Department of Computer Science
Degree Program
Computer Science
Format
application/pdf
Identifier
DP0028478
URL
https://purls.library.ucf.edu/go/DP0028478
Language
English
Release Date
8-15-2024
Length of Campus-only Access
None
Access Status
Doctoral Dissertation (Open Access)
Campus Location
Orlando (Main) Campus
STARS Citation
Lamar, Kenneth M., "Advances in High Performance Computing Through Concurrent Data Structures and Predictive Scheduling" (2024). Graduate Thesis and Dissertation 2023-2024. 273.
https://stars.library.ucf.edu/etd2023/273
Accessibility Status
Meets minimum standards for ETDs/HUTs