Keywords

Hadoop systems, parallel data analysis, high performance computing

Abstract

To facilitate big data processing, many dedicated data-intensive storage systems such as Google File System(GFS), Hadoop Distributed File System(HDFS) and Quantcast File System(QFS) have been developed. Currently, the Hadoop Distributed File System(HDFS) [20] is the state-of-art and most popular open-source distributed file system for big data processing. It is widely deployed as the bedrock for many big data processing systems/frameworks, such as the script-based pig system, MPI-based parallel programs, graph processing systems and scala/java-based Spark frameworks. These systems/applications employ parallel processes/executors to speed up data processing within scale-out clusters. Job or task schedulers in parallel big data applications such as mpiBLAST and ParaView can maximize the usage of computing resources such as memory and CPU by tracking resource consumption/availability for task assignment. However, since these schedulers do not take the distributed I/O resources and global data distribution into consideration, the data requests from parallel processes/executors in big data processing will unfortunately be served in an imbalanced fashion on the distributed storage servers. These imbalanced access patterns among storage nodes are caused because a). unlike conventional parallel file system using striping policies to evenly distribute data among storage nodes, data-intensive file systems such as HDFS store each data unit, referred to as chunk or block file, with several copies based on a relative random policy, which can result in an uneven data distribution among storage nodes; b). based on the data retrieval policy in HDFS, the more data a storage node contains, the higher the probability that the storage node could be selected to serve the data. Therefore, on the nodes serving multiple chunk files, the data requests from different processes/executors will compete for shared resources such as hard disk head and network bandwidth. Because of this, the makespan of the entire program could be significantly prolonged and the overall I/O performance will degrade. The first part of my dissertation seeks to address aspects of these problems by creating an I/O middleware system and designing matching-based algorithms to optimize data access in parallel big data processing. To address the problem of remote data movement, we develop an I/O middleware system, called SLAM, which allows MPI-based analysis and visualization programs to benefit from locality read, i.e, each MPI process can access its required data from a local or nearby storage node. This can greatly improve the execution performance by reducing the amount of data movement over network. Furthermore, to address the problem of imbalanced data access, we propose a method called Opass, which models the data read requests that are issued by parallel applications to cluster nodes as a graph data structure where edges weights encode the demands of load capacity. We then employ matching-based algorithms to map processes to data to achieve data access in a balanced fashion. The final part of my dissertation focuses on optimizing sub-dataset analyses in parallel big data processing. Our proposed methods can benefit different analysis applications with various computational requirements and the experiments on different cluster testbeds show their applicability and scalability.

Notes

If this is your thesis or dissertation, and want to learn how to access it or for more information about readership statistics, contact us at STARS@ucf.edu

Graduation Date

2015

Semester

Fall

Advisor

Wang, Jun

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Electrical Engineering and Computer Engineering

Degree Program

Computer Engineering

Format

application/pdf

Identifier

CFE0006021

URL

http://purl.fcla.edu/fcla/etd/CFE0006021

Language

English

Release Date

December 2015

Length of Campus-only Access

None

Access Status

Doctoral Dissertation (Open Access)

Subjects

Dissertations, Academic -- Engineering and Computer Science; Engineering and Computer Science -- Dissertations, Academic

STARS Citation

Yin, Jiangling, "Research on High-performance and Scalable Data Access in Parallel Big Data Computing" (2015). Electronic Theses and Dissertations. 1417.
https://stars.library.ucf.edu/etd/1417

Download

Included in

Computer Engineering Commons

COinS

Electronic Theses and Dissertations

Research on High-performance and Scalable Data Access in Parallel Big Data Computing

Keywords

Abstract

Notes

Graduation Date

Semester

Advisor

Degree

College

Department

Degree Program

Format

Identifier

URL

Language

Release Date

Length of Campus-only Access

Access Status

Subjects

STARS Citation

Included in

Browse Advisors

Explore

Connect

Electronic Theses and Dissertations

Research on High-performance and Scalable Data Access in Parallel Big Data Computing

Author

Keywords

Abstract

Notes

Graduation Date

Semester

Advisor

Degree

College

Department

Degree Program

Format

Identifier

URL

Language

Release Date

Length of Campus-only Access

Access Status

Subjects

STARS Citation

Included in

Share

Browse Advisors

Explore

Connect