Hierarchical Spark: A Multi-Cluster Big Data Computing Framework

Keywords

Big data; Hierarchical; Hybrid Cloud; Multi-cluster; Spark

Abstract

The volume of newly generated data grows every day, and so does the demand for analyzing it, which poses major challenges for big data computing platforms. The computing resources of a single cluster are often insufficient to meet these needs, so demand for distributed computing resources is rising sharply. In addition, as cloud computing platforms grow in popularity, many organizations with data security concerns favor the hybrid cloud, a multi-cluster environment composed of both public and private clouds, in order to keep sensitive data local. These scenarios all point to the need to move big data computing to multi-cluster environments. In this paper, we present a hierarchical multi-cluster big data computing framework built upon Apache Spark. Our framework supports combining heterogeneous Spark computing clusters. Through an integrated controller, it also supports submitting, monitoring, and executing Spark workflows. Our experimental results show that the proposed framework not only makes it possible to distribute Spark workflows across multiple clusters, but also provides a significant performance improvement over a single-cluster environment by optimizing the utilization of multi-cluster computing resources.

Publication Date

9-8-2017

Publication Title

IEEE International Conference on Cloud Computing, CLOUD

Volume

2017-June

Number of Pages

90-97

Document Type

Article; Proceedings Paper

Personal Identifier

scopus

DOI Link

https://doi.org/10.1109/CLOUD.2017.20

Scopus ID

85032230273 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/85032230273
