Optimize Parallel Data Access In Big Data Processing

Abstract

In recent years, the Hadoop Distributed File System (HDFS) has been deployed as the bedrock for many parallel big data processing systems, such as graph processing systems, MPI-based parallel programs, and Scala/Java-based Spark frameworks, which can efficiently support iterative and interactive data analysis in memory. The first part of my dissertation focuses on parallel data access on distributed file systems, e.g., HDFS. Because the distributed I/O resources and the global data distribution are often not taken into consideration, data requests from parallel processes/executors are frequently served in a remote or imbalanced fashion on the storage servers. To address these problems, we develop I/O middleware systems and matching-based algorithms that map parallel data requests to storage servers so that local and balanced data access can be achieved. The last part of my dissertation presents our plans to improve the performance of interactive data access in big data analysis. Specifically, most interactive analysis programs scan through the entire data set regardless of which data is actually required. We plan to develop a content-aware method that quickly accesses the required data without this laborious scanning process.
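
The idea of mapping data requests to storage servers for local and balanced access can be illustrated with a small sketch. This is not the algorithm from the paper; it is a generic capacity-bounded bipartite matching (augmenting paths), and the function name `match_blocks_to_nodes`, the `replicas` mapping, and the per-node quota rule are all assumptions made for illustration. Each block is assigned to a node that holds one of its replicas, and no node receives more than an equal share of the blocks.

```python
from typing import Dict, List, Optional, Set


def match_blocks_to_nodes(
    replicas: Dict[str, List[str]], nodes: List[str]
) -> Dict[str, Optional[str]]:
    """Assign each block to a node holding a replica of it.

    Hypothetical illustration: each node accepts at most ceil(#blocks / #nodes)
    blocks. A block mapped to None could not be placed locally within the quota
    and would have to be read remotely.
    """
    quota = -(-len(replicas) // len(nodes))  # ceiling division
    assigned: Dict[str, List[str]] = {n: [] for n in nodes}

    def try_assign(block: str, visited: Set[str]) -> bool:
        for node in replicas[block]:
            if node in visited or node not in assigned:
                continue
            visited.add(node)
            if len(assigned[node]) < quota:
                assigned[node].append(block)
                return True
            # Node is full: try to relocate one of its blocks to another replica.
            for other in list(assigned[node]):
                if try_assign(other, visited):
                    assigned[node].remove(other)
                    assigned[node].append(block)
                    return True
        return False

    for block in replicas:
        try_assign(block, set())

    owner: Dict[str, Optional[str]] = {b: None for b in replicas}
    for node, blocks in assigned.items():
        for b in blocks:
            owner[b] = node
    return owner


# Example: three blocks, three nodes, so each node may serve one block.
example_replicas = {"b0": ["n0", "n1"], "b1": ["n0"], "b2": ["n1", "n2"]}
example_owner = match_blocks_to_nodes(example_replicas, ["n0", "n1", "n2"])
print(example_owner)
```

In this example a greedy first-fit would place b0 on n0 and leave b1 without a local node; the augmenting step moves b0 to n1 so that every block is read locally and every node serves exactly one block.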

Publication Date

7-7-2015

Publication Title

Proceedings - 2015 IEEE/ACM 15th International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2015

Number of Pages

721-724

Document Type

Article; Proceedings Paper

Personal Identifier

scopus

DOI Link

https://doi.org/10.1109/CCGrid.2015.168

Scopus ID

84941248060 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/84941248060

