Optimize Parallel Data Access In Big Data Processing
Abstract
In recent years, the Hadoop Distributed File System (HDFS) has been deployed as the bedrock for many parallel big data processing systems, such as graph processing systems, MPI-based parallel programs, and Scala/Java-based Spark frameworks, which can efficiently support iterative and interactive data analysis in memory. The first part of my dissertation focuses on studying parallel data access in distributed file systems, e.g., HDFS. Since the distributed I/O resources and global data distribution are often not taken into consideration, the data requests from parallel processes/executors will unfortunately be served in a remote and imbalanced fashion on the storage servers. To address these problems, we develop I/O middleware systems and matching-based algorithms that map parallel data requests to storage servers such that local and balanced data access can be achieved. The last part of my dissertation presents our plans to improve the performance of interactive data access in big data analysis. Specifically, most interactive analysis programs scan through the entire data set regardless of which data is actually required. We plan to develop a content-aware method to quickly access the required data without this laborious scanning process.
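To illustrate the idea of mapping parallel requests to storage servers for locality and balance, the following is a minimal sketch only, not the dissertation's actual matching-based algorithm: each request lists the servers holding a replica of its block (HDFS-style replication), and a greedy rule assigns the request to the least-loaded replica holder so that reads stay local while per-server load stays roughly even. The request and server identifiers are hypothetical.

    from collections import defaultdict

    def assign_requests(requests):
        """requests: list of (request_id, [servers holding a replica of that block])."""
        load = defaultdict(int)   # number of requests assigned to each server so far
        assignment = {}
        for req_id, replica_servers in requests:
            # keep the read local: only consider servers that host a replica,
            # and pick the one with the smallest current load
            target = min(replica_servers, key=lambda s: load[s])
            assignment[req_id] = target
            load[target] += 1
        return assignment

    if __name__ == "__main__":
        # hypothetical example: four block requests, three servers, 3-way replication
        reqs = [
            ("blk_0", ["s1", "s2", "s3"]),
            ("blk_1", ["s1", "s2", "s3"]),
            ("blk_2", ["s1", "s3", "s2"]),
            ("blk_3", ["s2", "s3", "s1"]),
        ]
        print(assign_requests(reqs))

A full matching-based formulation would instead solve an assignment problem over all requests and servers at once; the greedy rule above is only meant to convey the locality-plus-balance objective described in the abstract.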
Publication Date
7-7-2015
Publication Title
Proceedings - 2015 IEEE/ACM 15th International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2015
Number of Pages
721-724
Document Type
Article; Proceedings Paper
Personal Identifier
scopus
DOI Link
https://doi.org/10.1109/CCGrid.2015.168
Copyright Status
Unknown
Scopus ID
84941248060 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/84941248060
STARS Citation
Yin, Jiangling and Wang, Jun, "Optimize Parallel Data Access In Big Data Processing" (2015). Scopus Export 2015-2019. 2044.
https://stars.library.ucf.edu/scopus2015/2044