Title

Concentric Layout, A New Scientific Data Distribution Scheme In Hadoop File System

Abstract

The volume of data generated by scientific simulations, sensors, monitors, and optical telescopes is growing at a dramatic rate. To analyze the raw data quickly and with efficient use of space, a data preprocessing step is needed to achieve better performance in the data analysis phase. Current research shows an increasing trend of adopting the MapReduce framework for large-scale data processing. However, the data access patterns generally applied to scientific data sets are not directly supported by the current MapReduce framework. This gap between the requirements of analytics applications and the properties of the MapReduce framework motivates us to provide support for these data access patterns in MapReduce. In our work, we studied the data access patterns in matrix files and propose a new concentric data layout to facilitate matrix data access and analysis in the MapReduce framework. The concentric data layout is a hierarchical data layout that maintains the dimensional property of large data sets. In contrast to the continuous data layout adopted in the current Hadoop framework, the concentric data layout stores the data from the same sub-matrix in one chunk and then stores the chunks symmetrically at a higher level. This matches matrix-like computation well. The concentric data layout preprocesses the data beforehand and optimizes subsequent runs of MapReduce applications. Experiments show that the concentric data layout improves overall performance, reducing the execution time by about 38% when reading a 64 GB file. It also mitigates the overhead of reading unused data and increases the useful data efficiency by 32% on average. © 2010 IEEE.
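The contrast the abstract draws between a continuous layout and a sub-matrix-per-chunk layout can be illustrated with a small sketch. This is not the authors' concentric layout, whose higher-level symmetric placement is not described here; it assumes a plain square-tile layout as a stand-in, and every function name and parameter below is hypothetical. It counts how many chunks a sub-matrix read touches under each layout and what fraction of the bytes read is actually wanted.

```python
def chunks_row_major(n_cols, chunk_elems, r0, r1, c0, c1):
    """Chunk ids touched when reading rows r0..r1-1, cols c0..c1-1
    of a matrix stored row by row in fixed-size chunks (assumed layout)."""
    touched = set()
    for r in range(r0, r1):
        first = (r * n_cols + c0) // chunk_elems       # chunk holding the row's first wanted element
        last = (r * n_cols + c1 - 1) // chunk_elems    # chunk holding the row's last wanted element
        touched.update(range(first, last + 1))
    return touched


def chunks_tiled(n_cols, tile, r0, r1, c0, c1):
    """Chunk ids touched when every tile x tile sub-matrix is one chunk
    (a simplified stand-in for a sub-matrix-per-chunk layout)."""
    tiles_per_row = n_cols // tile
    touched = set()
    for tr in range(r0 // tile, (r1 - 1) // tile + 1):
        for tc in range(c0 // tile, (c1 - 1) // tile + 1):
            touched.add(tr * tiles_per_row + tc)
    return touched


def useful_fraction(n_touched, chunk_elems, r0, r1, c0, c1):
    """Wanted elements divided by all elements in the chunks that were read."""
    wanted = (r1 - r0) * (c1 - c0)
    return wanted / (n_touched * chunk_elems)


# Hypothetical example: a 64 x 64 matrix, 256 elements per chunk (16 x 16 tiles),
# reading the 16 x 16 sub-matrix at rows 16..31, columns 16..31.
row_major = chunks_row_major(64, 256, 16, 32, 16, 32)
tiled = chunks_tiled(64, 16, 16, 32, 16, 32)
print(len(row_major), useful_fraction(len(row_major), 256, 16, 32, 16, 32))
print(len(tiled), useful_fraction(len(tiled), 256, 16, 32, 16, 32))
```

In this toy setting the row-by-row layout reads four chunks to serve a request that fits in a single tile, so most of the data read is unused. The figures are illustrative only and are unrelated to the measurements reported in the abstract.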

Publication Date

10-27-2010

Publication Title

Proceedings - 2010 IEEE International Conference on Networking, Architecture and Storage, NAS 2010

Number of Pages

231-239

Document Type

Article; Proceedings Paper

Personal Identifier

scopus

DOI Link

https://doi.org/10.1109/NAS.2010.59

Scopus ID

77958136674 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/77958136674

