Title

Co-Located Compute And Binary File Storage In Data-Intensive Computing

Keywords

Data-intensive; Hadoop; HDFS; HPC analytics applications; MapReduce

Abstract

As computation capability has developed rapidly, the massive increase in data volume has made compute-intensive clusters ill-suited to HPC analysis of large-scale data sets, because of the huge amount of data that must be transferred over the network. Co-located compute and storage has been introduced in data-intensive clusters to avoid this network bottleneck by launching the computation on the nodes where most of the input data reside. Chunk-based storage systems are typical examples: they split data into blocks and store the blocks randomly across nodes. Records, the input data for the analysis, are read from blocks. This method implicitly assumes that a single record resides on a single node, so that data transfer can be avoided. However, this assumption does not always hold, because there is a gap between records and blocks. The current solution overlooks the relationship between the computation unit (the record) and the storage unit (the block). When a record fits within one block, no data transfer is needed. In practice, however, one record can consist of several blocks. This is especially true for binary files, which incur extra data transfer to assemble the input data before the analysis can run. Blocks belonging to a single record are scattered randomly across the data nodes, regardless of the semantics of the records. To address these problems, we develop two solutions in this paper: a Record-Based Block Distribution (RBBD) framework, and a data-centric scheduler that uses Weighted Set Cover Scheduling (WSCS) to schedule tasks. The RBBD framework for data-intensive analytics aims to eliminate the gap between records and blocks and to achieve zero data transfer among nodes. WSCS is proposed to further improve performance by optimizing the combination of nodes.
Our experiments show that overlooking the relationship between records and blocks can cause severe performance problems when a record comprises several blocks scattered across different nodes. Our proposed data storage strategy, Record-Based Block Distribution (RBBD), optimizes the block distribution according to that relationship. Combined with our proposed scheduling method, Weighted Set Cover Scheduling (WSCS), it efficiently reduces extra data transfers and ultimately improves the performance of the chunk-based storage system. With the RBBD framework and WSCS in a chunk-based storage system, our extensive experiments show that data transfer decreases by 36.4% on average and that the scheduling algorithm outperforms the random algorithm by 51%-62%; the deviation from the ideal solutions is no more than 6.8%. © 2012 IEEE.
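
The abstract does not spell out how WSCS selects nodes. As a rough illustration of the general idea behind weighted set cover, the sketch below uses the standard greedy heuristic, which is not necessarily the authors' algorithm. It assumes each node holds complete copies of some records and carries a cost, for example an estimate of the data it would have to fetch remotely. The names `greedy_weighted_set_cover`, `node_records`, and `node_cost`, and all the sample data, are hypothetical.

```python
def greedy_weighted_set_cover(records, node_records, node_cost):
    """Choose nodes until every record is covered.

    Standard greedy heuristic for weighted set cover (an assumption;
    the paper's WSCS may differ): at each step pick the node with the
    lowest cost per newly covered record.
    """
    uncovered = set(records)
    chosen = []
    while uncovered:
        best_node, best_ratio = None, float("inf")
        # Sorted iteration keeps tie-breaking deterministic.
        for node in sorted(node_records):
            gain = len(node_records[node] & uncovered)
            if gain == 0:
                continue  # this node adds nothing new
            ratio = node_cost[node] / gain
            if ratio < best_ratio:
                best_node, best_ratio = node, ratio
        if best_node is None:
            raise ValueError("some records are not held by any node")
        chosen.append(best_node)
        uncovered -= node_records[best_node]
    return chosen


# Hypothetical cluster: which records each node can serve locally,
# and an illustrative cost for running tasks on that node.
records = ["r1", "r2", "r3", "r4"]
node_records = {
    "n1": {"r1", "r2"},
    "n2": {"r2", "r3", "r4"},
    "n3": {"r1"},
}
node_cost = {"n1": 2.0, "n2": 3.0, "n3": 0.5}

schedule = greedy_weighted_set_cover(records, node_records, node_cost)
total_cost = sum(node_cost[n] for n in schedule)
print(schedule, total_cost)  # ['n3', 'n2'] 3.5
```

In this toy input the heuristic first takes `n3` (cost 0.5 for one record) and then `n2` (cost 3.0 for the remaining three), covering all four records with two nodes. The greedy rule is only an approximation and can miss the optimal combination on other inputs.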

Publication Date

11-5-2012

Publication Title

Proceedings - 2012 IEEE 7th International Conference on Networking, Architecture and Storage, NAS 2012

Pages

199-206

Document Type

Article; Proceedings Paper

Personal Identifier

scopus

DOI Link

https://doi.org/10.1109/NAS.2012.29

Scopus ID

84868090112 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/84868090112
