Title
Co-Located Compute And Binary File Storage In Data-Intensive Computing
Keywords
Data-intensive; Hadoop; HDFS; HPC analytics applications; MapReduce
Abstract
With the rapid development of computation capability, the massive increase in data volume has rendered compute-intensive clusters inadequate for HPC analysis of large-scale data sets, owing to the huge amount of data transferred over the network. Co-located compute and storage has been introduced in data-intensive clusters to avoid this network bottleneck by launching computation on the nodes where most of the input data reside. Chunk-based storage systems are typical examples: they split data into blocks and randomly store the blocks across nodes, and the records that serve as input to the analysis are read from these blocks. This method implicitly assumes that a single record resides on a single node, so that data transfer can be avoided. However, this assumption does not always hold, because there is a gap between records and blocks: the current solution overlooks the relationship between the computation unit (a record) and the storage unit (a block). When a record fits within one block, there is no data transfer. In practice, however, one record can consist of several blocks. This is especially true for binary files, which incur extra data transfer in preparing the input data before the analysis, since the blocks belonging to a single record are scattered randomly across the data nodes regardless of the semantics of the records. To address these problems, we develop two solutions in this paper: a Record-Based Block Distribution (RBBD) framework and a data-centric scheduling algorithm, Weighted Set Cover Scheduling (WSCS), for scheduling tasks. The RBBD framework for data-intensive analytics aims to eliminate the gap between records and blocks and to achieve zero data transfer among nodes. WSCS further improves performance by optimizing the combination of nodes selected for a task.
Our experiments show that overlooking the record-block relationship can cause severe performance problems when a record is comprised of several blocks scattered across different nodes. Our proposed novel data storage strategy, Record-Based Block Distribution (RBBD), optimizes the block distribution according to the record-block relationship. Combined with our novel Weighted Set Cover Scheduling (WSCS), it efficiently reduces extra data transfers and ultimately improves the performance of the chunk-based storage system. Using our RBBD framework and WSCS in a chunk-based storage system, our extensive experiments show that data transfer decreases by 36.4% on average, the scheduling algorithm outperforms the random algorithm by 51%-62%, and the deviation from the ideal solution is no more than 6.8%. © 2012 IEEE.
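The node-selection problem the abstract describes, choosing a combination of nodes whose stored blocks together cover all blocks of a record at minimum total cost, is an instance of weighted set cover. The paper's WSCS algorithm is not reproduced here; the sketch below is only the classic greedy approximation for weighted set cover, with hypothetical names (`greedy_weighted_set_cover`, `node_blocks`, `node_cost`), to illustrate the kind of optimization involved.

```python
# Illustrative greedy approximation for weighted set cover. This is NOT the
# paper's WSCS implementation; it sketches the underlying problem: pick a
# set of nodes whose blocks cover one record, minimizing total cost.
def greedy_weighted_set_cover(blocks, node_blocks, node_cost):
    """Greedily choose nodes covering all of `blocks`.

    blocks:      set of block IDs making up one record (the universe).
    node_blocks: dict node -> set of block IDs stored on that node.
    node_cost:   dict node -> cost of using that node (e.g. transfer penalty).
    Returns (chosen_nodes, total_cost); raises ValueError if uncoverable.
    """
    uncovered = set(blocks)
    chosen, total = [], 0.0
    while uncovered:
        # Standard greedy rule: best cost per newly covered block.
        best = min(
            (n for n in node_blocks if node_blocks[n] & uncovered),
            key=lambda n: node_cost[n] / len(node_blocks[n] & uncovered),
            default=None,
        )
        if best is None:
            raise ValueError("record blocks cannot be covered by available nodes")
        chosen.append(best)
        total += node_cost[best]
        uncovered -= node_blocks[best]
    return chosen, total
```

For example, for a record split into blocks {1, 2, 3, 4} with node A holding {1, 2, 3} at cost 1.0, node B holding {3, 4} at cost 1.0, and node C holding {4} at cost 0.2, the greedy rule picks C (cost 0.2 per new block) and then A, covering the record at total cost 1.2.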
Publication Date
11-5-2012
Publication Title
Proceedings - 2012 IEEE 7th International Conference on Networking, Architecture and Storage, NAS 2012
Number of Pages
199-206
Document Type
Article; Proceedings Paper
Personal Identifier
scopus
DOI Link
https://doi.org/10.1109/NAS.2012.29
Copyright Status
Unknown
Scopus ID
84868090112 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/84868090112
STARS Citation
Xiao, Qiangju; Shang, Pengju; and Wang, Jun, "Co-Located Compute And Binary File Storage In Data-Intensive Computing" (2012). Scopus Export 2010-2014. 4753.
https://stars.library.ucf.edu/scopus2010/4753