Keywords

concurrent data structures, high performance computing, transactions, persistence, backfill scheduling, run time prediction

Abstract

Modern High Performance Computing (HPC) systems are made up of thousands of server-grade compute nodes linked through a high-speed network interconnect. Each node has tens or even hundreds of CPU cores each, with counts continuing to grow on newer HPC clusters. This results in a need to make use of millions of cores per cluster. Fully leveraging these resources is difficult. There is an active need to design software that scales and fully utilizes the hardware. In this dissertation, we address this gap with a dual approach, considering both intra-node (single node) and inter-node (across node) concerns. To aid in intra-node performance, we propose two novel concurrent data structures: a transactional vector and a persistent hash map. These designs have broad applicability in any multi-core environment but are particularly useful in HPC, which commonly features many cores per node. For inter-node performance, we propose a metrics-driven approach to improve scheduling quality, using predicted run times to backfill jobs more accurately and aggressively. This is augmented using application input parameters to further improve these run time predictions. Improved scheduling reduces the number of idle nodes in an HPC cluster, maximizing job throughput. We find that our data structures outperform the prior state-of-the-art while offering additional features. Our backfill technique likewise outperforms previous approaches in simulations, and our run time predictions were significantly more accurate than conventional approaches. Code for these works is freely available, and we have plans to deploy these techniques more broadly on real HPC systems in the future.

Completion Date

2024

Semester

Summer

Committee Chair

Dechev, Damian

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Department of Computer Science

Degree Program

Computer Science

Format

application/pdf

Identifier

DP0028478

URL

https://purls.library.ucf.edu/go/DP0028478

Language

English

Release Date

8-15-2024

Length of Campus-only Access

None

Access Status

Doctoral Dissertation (Open Access)

Campus Location

Orlando (Main) Campus

Accessibility Status

Meets minimum standards for ETDs/HUTs

Share

COinS