A New Reliability Model In Replication-Based Big Data Storage Systems
Keywords
Continuous Time Markov Chains (CTMC); Copyset; Multi-way replication; Random declustering; Reliability
Abstract
Reliability is a critical metric in the design and development of replication-based big data storage systems such as the Hadoop Distributed File System (HDFS). In a system with thousands of machines and storage devices, even infrequent failures become likely. In Google File System, the annual disk failure rate is 2.88%, which means that one would expect to see 8760 disk failures in a year. Unfortunately, given an increasing number of node failures, how often a cluster starts losing data as it is scaled out has not been well investigated. Moreover, there is no systematic method for quantifying the reliability of multi-way replication-based data placement methods, which have been widely used in enterprise large-scale storage systems to improve I/O parallelism. In this paper, we develop a new reliability model that incorporates the probability of replica loss to investigate the system reliability of multi-way declustering data layouts and analyze their potential for parallel recovery. Our comprehensive simulation results on MATLAB and SHARPE show that the shifted declustering data layout outperforms the random declustering layout in a multi-way replication scale-out architecture, in terms of data loss probability and system reliability, by up to 63% and 85%, respectively. Our study of both 5-year and 10-year system reliability under various recovery bandwidth settings shows that the shifted declustering layout surpasses the two baseline approaches in both cases, consuming up to 79% and 87% less recovery bandwidth than the copyset layout, and 4.8% and 10.2% less recovery bandwidth than the random layout.
Publication Date
10-1-2017
Publication Title
Journal of Parallel and Distributed Computing
Volume
108
Number of Pages
14-27
Document Type
Article
Personal Identifier
scopus
DOI Link
https://doi.org/10.1016/j.jpdc.2017.02.001
Copyright Status
Unknown
Scopus ID
85014316828 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/85014316828
STARS Citation
Wang, Jun; Wu, Huafeng; and Wang, Ruijun, "A New Reliability Model In Replication-Based Big Data Storage Systems" (2017). Scopus Export 2015-2019. 5255.
https://stars.library.ucf.edu/scopus2015/5255