Achieving Load Balance For Parallel Data Access On Distributed File Systems
Keywords
bipartite matching; distributed file systems; HDFS; Parallel data access
Abstract
The distributed file system HDFS is widely deployed as the bedrock for many parallel big data analyses. However, when multiple parallel applications run over the shared file system, the data requests from different processes/executors are unfortunately served in a surprisingly imbalanced fashion on the distributed storage servers. These imbalanced access patterns among storage nodes arise because (a) unlike conventional parallel file systems, which use striping policies to evenly distribute data among storage nodes, data-intensive file systems such as HDFS store each data unit, referred to as a chunk file, with several copies based on a relatively random policy, which can result in an uneven data distribution among storage nodes; and (b) under the data retrieval policy in HDFS, the more data a storage node contains, the higher the probability that the node is selected to serve the data. Therefore, on the nodes serving multiple chunk files, the data requests from different processes/executors compete for shared resources such as the hard disk head and network bandwidth, resulting in degraded I/O performance. In this paper, we first conduct a complete analysis of how remote and imbalanced read/write patterns occur and how they are affected by the size of the cluster. We then propose novel methods, referred to as Opass, to optimize parallel data reads and to reduce the imbalance of parallel writes on distributed file systems. Our proposed methods can benefit parallel data-intensive analysis with various parallel data access strategies. Opass adopts new matching-based algorithms to match processes to data so as to compute the maximum degree of data locality and balanced data access. Furthermore, to reduce the imbalance of parallel writes, Opass employs a heatmap for monitoring the I/O statuses of storage nodes and performs an HM-LRU policy to select a locally optimal storage node for serving write requests.
Experiments are conducted on PRObE's Marmot 128-node cluster testbed and the results from both benchmark and well-known parallel applications show the performance benefits and scalability of Opass.
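The abstract does not spell out the Opass algorithms, so the following is only a minimal sketch of the general idea behind matching-based balanced access, not the authors' method. It assumes each chunk may be served by any node holding one of its replicas, and it uses a standard augmenting-path bipartite matching with a per-node capacity to find the smallest maximum load. The function names `assign_chunks` and `balanced_assignment`, and the chunk and node names in the usage note, are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def assign_chunks(replicas: Dict[str, List[str]], cap: int) -> Optional[Dict[str, str]]:
    """Assign every chunk to one replica node, with at most `cap` chunks per node.

    Returns a chunk -> node mapping, or None if no such assignment exists.
    """
    served: Dict[str, List[str]] = {}  # node -> chunks currently assigned to it
    owner: Dict[str, str] = {}         # chunk -> node currently serving it

    def try_place(chunk: str, visited: Set[str]) -> bool:
        for node in replicas[chunk]:
            if node in visited:
                continue
            visited.add(node)
            load = served.setdefault(node, [])
            if len(load) < cap:
                load.append(chunk)
                owner[chunk] = node
                return True
            # Node is full: try to move one of its chunks to another replica.
            for other in list(load):
                if try_place(other, visited):
                    load.remove(other)
                    load.append(chunk)
                    owner[chunk] = node
                    return True
        return False

    for chunk in replicas:
        if not try_place(chunk, set()):
            return None
    return owner


def balanced_assignment(replicas: Dict[str, List[str]]) -> Tuple[int, Dict[str, str]]:
    """Find the smallest per-node cap for which all chunks can be served."""
    if not replicas:
        return 0, {}
    for cap in range(1, len(replicas) + 1):
        result = assign_chunks(replicas, cap)
        if result is not None:
            return cap, result
    raise ValueError("at least one chunk has no replica node")
```

For example, with `replicas = {"c1": ["n1", "n2"], "c2": ["n1", "n3"], "c3": ["n1", "n2"], "c4": ["n1", "n3"]}`, always reading from the first listed replica would send all four requests to `n1`, whereas `balanced_assignment(replicas)` finds an assignment in which no node serves more than two chunks.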
Publication Date
3-1-2018
Publication Title
IEEE Transactions on Computers
Volume
67
Issue
3
Pages
388-402
Document Type
Article
Personal Identifier
scopus
DOI Link
https://doi.org/10.1109/TC.2017.2749229
Copyright Status
Unknown
Scopus ID
85029168086 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/85029168086
STARS Citation
Huang, Dan; Han, Dezhi; Wang, Jun; Yin, Jiangling; and Chen, Xunchao, "Achieving Load Balance For Parallel Data Access On Distributed File Systems" (2018). Scopus Export 2015-2019. 9696.
https://stars.library.ucf.edu/scopus2015/9696