Terabyte- and petabyte-scale datasets are rapidly becoming the norm for organizations across a wide range of industries. These widespread data-intensive computing needs have inspired innovations in parallel and distributed computing, which have been an effective way to tackle massive computing workloads for decades. One significant example is MapReduce, which is both a programming model for expressing distributed computations on huge datasets and an execution framework for data-intensive computing on commodity clusters. Since it was originally proposed by Google, MapReduce has become the most popular technology for data-intensive computing. While Google maintains its own proprietary implementation of MapReduce, an open-source implementation called Hadoop has gained wide adoption in the rest of the world. The combination of Hadoop and Cloud platforms has made data-intensive computing far more accessible and affordable than before. This dissertation addresses the performance of data-intensive computing on Cloud platforms from three aspects: task assignment, replica placement, and straggler identification. Task assignment and replica placement are both closely related to load balancing, a key issue that can significantly affect the performance of parallel and distributed applications. Task assignment schemes strive to balance the data processing load among cluster nodes to minimize job completion time, while replica placement policies aim to assign block replicas to cluster nodes according to their processing capabilities so that data locality is exploited to the maximum extent. Straggler identification is also crucial for data-intensive computing, because the overall performance of a parallel and distributed application is often determined by its slowest node.
The results of extensive evaluation tests confirm that the schemes and policies proposed in this dissertation can improve the performance of data-intensive applications running on Cloud platforms.
Doctor of Philosophy (Ph.D.)
College of Engineering and Computer Science
Electrical Engineering and Computer Engineering
Doctoral Dissertation (Open Access)
Dai, Wei, "Improving the Performance of Data-intensive Computing on Cloud Platforms" (2017). Electronic Theses and Dissertations. 5509.