Datanet: A Data Distribution-Aware Method For Sub-Dataset Analysis On Distributed File Systems
Abstract
In this paper, we study the problem of sub-dataset analysis over distributed file systems, e.g., the Hadoop file system. Our experiments show that the sub-datasets' distribution over HDFS blocks can often cause the corresponding analysis to suffer from a seriously imbalanced parallel execution. This is because the locality of individual sub-datasets is hidden by the Hadoop file system and the content clustering of sub-datasets results in some computational nodes carrying out much more workload than others. We conduct a comprehensive analysis on how the imbalanced computing patterns occur and their sensitivity to the size of a cluster. We then propose a novel method to optimize sub-dataset analysis over distributed storage systems referred to as DataNet. DataNet aims to achieve distribution-aware and workload-balanced computing and consists of the following three parts. Firstly, we propose an efficient algorithm with linear complexity to obtain the meta-data of sub-dataset distributions. Secondly, we design an elastic storage structure called ElasticMap based on the HashMap and BloomFilter techniques to store the meta-data. Thirdly, we employ a distribution-aware algorithm for sub-dataset applications to achieve a workload-balance in parallel execution. Our proposed method can benefit different sub-dataset analyses with various computational requirements. Experiments are conducted on PRObE's Marmot 128-node cluster testbed and the results show the performance benefits of DataNet.
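The three parts named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how ElasticMap divides work between the hash map and the Bloom filter, so the split used here (exact sizes for large sub-datasets, membership only for small ones) is an assumption, as are the names `TinyBloom`, `build_block_meta`, `estimate`, and `balance`, the `threshold` and `small_guess` parameters, and the greedy heaviest-first scheduler, which ignores data locality.

```python
import hashlib
from collections import Counter


class TinyBloom:
    """Minimal Bloom filter (hypothetical helper, not taken from the paper)."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set packed into one Python int

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_block_meta(records, threshold):
    """One linear pass over a block's (sub_dataset_id, size) records.

    Assumption: sub-datasets at or above `threshold` keep an exact size in a
    dict; smaller ones are only remembered in the Bloom filter.
    """
    sizes = Counter()
    for sub_id, size in records:
        sizes[sub_id] += size
    exact, bloom = {}, TinyBloom()
    for sub_id, total in sizes.items():
        if total >= threshold:
            exact[sub_id] = total
        else:
            bloom.add(sub_id)
    return exact, bloom


def estimate(meta, sub_id, small_guess):
    """Estimated amount of `sub_id` in a block, given its metadata."""
    exact, bloom = meta
    if sub_id in exact:
        return exact[sub_id]
    return small_guess if sub_id in bloom else 0


def balance(block_loads, num_nodes):
    """Greedy balancing: heaviest block first, onto the least-loaded node."""
    loads = [0] * num_nodes
    assignment = {}
    for block_id, load in sorted(block_loads.items(), key=lambda kv: -kv[1]):
        node = min(range(num_nodes), key=loads.__getitem__)
        assignment[block_id] = node
        loads[node] += load
    return assignment, loads


# Toy input: four blocks, each a list of (sub_dataset_id, size) records.
blocks = {
    "b0": [("A", 60), ("B", 5), ("A", 40)],
    "b1": [("A", 10), ("C", 90)],
    "b2": [("A", 50)],
    "b3": [("B", 70)],
}
metas = {bid: build_block_meta(recs, threshold=20) for bid, recs in blocks.items()}
loads_for_a = {bid: estimate(m, "A", small_guess=10) for bid, m in metas.items()}
assignment, node_loads = balance(loads_for_a, num_nodes=2)
print(loads_for_a, assignment, node_loads)
```

In this toy input, sub-dataset "A" is concentrated in block b0. Assigning blocks by their estimated "A" content rather than by block count places b0 alone on one node and the three lighter blocks on the other.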
Publication Date
7-18-2016
Publication Title
Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
Number of Pages
504-513
Document Type
Article; Proceedings Paper
Personal Identifier
scopus
DOI Link
https://doi.org/10.1109/IPDPS.2016.33
Copyright Status
Unknown
Scopus ID
84983247120 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/84983247120
STARS Citation
Wang, Jun; Yin, Jiangling; Zhou, Jian; Zhang, Xuhong; and Wang, Ruijun, "Datanet: A Data Distribution-Aware Method For Sub-Dataset Analysis On Distributed File Systems" (2016). Scopus Export 2015-2019. 4473.
https://stars.library.ucf.edu/scopus2015/4473