With the increasing popularity of data-intensive applications as well as the large-scale computing and storage systems, current data centers and supercomputers are often dealing with extremely large data-sets. To store and process this huge amount of data reliably and energy-efficiently, three major challenges should be taken into consideration for the system designers. Firstly, power conservation–Multicore processors or CMPs have become a mainstream in the current processor market because of the tremendous improvement in transistor density and the advancement in semiconductor technology. However, the increasing number of transistors on a single die or chip reveals a super-linear growth in power consumption . Thus, how to balance system performance and power-saving is a critical issue which needs to be solved effectively. Secondly, system reliability–Reliability is a critical metric in the design and development of replication-based big data storage systems such as Hadoop File System (HDFS). In the system with thousands machines and storage devices, even in-frequent failures become likely. In Google File System, the annual disk failure rate is 2:88%,which means you were expected to see 8,760 disk failures in a year. Unfortunately, given an increasing number of node failures, how often a cluster starts losing data when being scaled out is not well investigated. Thirdly, energy efficiency–The fast processing speeds of the current generation of supercomputers provide a great convenience to scientists dealing with extremely large data sets. The next generation of "exascale" supercomputers could provide accurate simulation results for the automobile industry, aerospace industry, and even nuclear fusion reactors for the very first time. However, the energy cost of super-computing is extremely high, with a total electricity bill of 9 million dollars per year. Thus, conserving energy and increasing the energy efficiency of supercomputers has become critical in recent years. This dissertation proposes new solutions to address the above three key challenges for current large-scale storage and computing systems. Firstly, we propose a novel power management scheme called MAR (model-free, adaptive, rule-based) in multiprocessor systems to minimize the CPU power consumption subject to performance constraints. By introducing new I/O wait status, MAR is able to accurately describe the relationship between core frequencies, performance and power consumption. Moreover, we adopt a model-free control method to filter out the I/O wait status from the traditional CPU busy/idle model in order to achieve fast responsiveness to burst situations and take full advantage of power saving. Our extensive experiments on a physical testbed demonstrate that, for SPEC benchmarks and data-intensive (TPC-C) benchmarks, an MAR prototype system achieves 95.8-97.8% accuracy of the ideal power saving strategy calculated offline. Compared with baseline solutions, MAR is able to save 12.3-16.1% more power while maintain a comparable performance loss of about 0.78-1.08%. In addition, more simulation results indicate that our design achieved 3.35-14.2% more power saving efficiency and 4.2-10.7% less performance loss under various CMP configurations as compared with various baseline approaches such as LAST, Relax, PID and MPC. Secondly, we create a new reliability model by incorporating the probability of replica loss to investigate the system reliability of multi-way declustering data layouts and analyze their potential parallel recovery possibilities. Our comprehensive simulation results on Matlab and SHARPE show that the shifted declustering data layout outperforms the random declustering layout in a multi-way replication scale-out architecture, in terms of data loss probability and system reliability by upto 63% and 85% respectively. Our study on both 5-year and 10-year system reliability equipped with various recovery bandwidth settings shows that, the shifted declustering layout surpasses the two baseline approaches in both cases by consuming up to 79 % and 87% less recovery bandwidth for copyset, as well as 4.8% and 10.2% less recovery bandwidth for random layout. Thirdly, we develop a power-aware job scheduler by applying a rule based control method and taking into account real world power and speedup profiles to improve power efficiency while adhering to predetermined power constraints. The intensive simulation results shown that our proposed method is able to achieve the maximum utilization of computing resources as compared to baseline scheduling algorithms while keeping the energy cost under the threshold. Moreover, by introducing a Power Performance Factor (PPF) based on the real world power and speedup profiles, we are able to increase the power efficiency by up to 75%.
Doctor of Philosophy (Ph.D.)
College of Engineering and Computer Science
Electrical Engineering and Computer Engineering
Length of Campus-only Access
Doctoral Dissertation (Open Access)
Wang, Ruijun, "Developing New Power Management and High-Reliability Schemes in Data-Intensive Environment" (2016). Electronic Theses and Dissertations. 5433.