Taming Big Data SVM with Locality-Aware Scheduling

Keywords

data locality; HDFS; MPI; parallel SVM; read performance

Abstract

Incorporating the MPI programming model into a data-intensive file system for big data applications is significant for performance-optimization research. In this paper we port an MPI-based SVM solver, originally developed for HPC environments, to the Hadoop Distributed File System (HDFS), and analyze the performance bottlenecks the solver faces there. Storage expansion on HDFS is known to produce a skewed data distribution; as a result, some hot nodes continually receive concentrated I/O requests while other nodes must issue remote requests. These remote requests lengthen I/O delays on the hot nodes, creating a performance bottleneck for our solver. We therefore improved the I/O-intensive data-preprocessing stage with a deterministic scheduling method. The improved solver shows a balanced read pattern across nodes: the time ratio between the longest and shortest processes is reduced by 60%, and the average read time drops by 78%. The amount of data served by each node also shows a small variance compared with the directly ported SVM algorithm. We believe our design avoids the overhead introduced by remote I/O operations, which will benefit many algorithms that cope with data at large scale.
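The abstract does not spell out the scheduling method, but the core idea it describes, assigning each block of input data to a process co-located with a replica so reads stay local and evenly spread, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the function name, data shapes, and the least-loaded tie-breaking rule are all assumptions.

```python
from collections import defaultdict

def locality_aware_schedule(block_replicas, proc_node):
    """Deterministically assign HDFS blocks to MPI processes, preferring
    a process co-located with a replica and balancing per-process load.

    block_replicas: {block_id: [node, ...]}  replica locations per block
    proc_node:      {rank: node}             host node of each MPI process
    Returns {rank: [block_id, ...]}.
    """
    load = {rank: 0 for rank in proc_node}        # blocks assigned so far
    node_procs = defaultdict(list)                # node -> ranks hosted there
    for rank, node in sorted(proc_node.items()):
        node_procs[node].append(rank)

    assignment = defaultdict(list)
    # Iterate blocks in a fixed order so the schedule is deterministic.
    for block in sorted(block_replicas):
        local = [r for n in block_replicas[block] for r in node_procs[n]]
        # Fall back to a remote read only when no replica is co-located.
        candidates = local if local else list(load)
        rank = min(candidates, key=lambda r: (load[r], r))  # least-loaded wins
        assignment[rank].append(block)
        load[rank] += 1
    return dict(assignment)
```

Because ties are broken by rank and blocks are visited in sorted order, every process computes the same schedule independently, which avoids hot nodes receiving concentrated requests from remote readers.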

Publication Date

1-11-2017

Publication Title

Proceedings - 2016 International Conference on Advanced Cloud and Big Data, CBD 2016

Number of Pages

37-44

Document Type

Article; Proceedings Paper

Personal Identifier

scopus

DOI Link

https://doi.org/10.1109/CBD.2016.017

Scopus ID

85013151764 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/85013151764
