Sapprox: Enabling Efficient And Accurate Approximations On Sub-Datasets With Distribution-Aware Online Sampling

Abstract

In this paper, we aim to enable both efficient and accurate approximations on arbitrary sub-datasets of a large dataset. Due to the prohibitive storage overhead of caching offline samples for each sub-dataset, existing offline sample based systems provide high accuracy results for only a limited number of sub-datasets, such as the popular ones. On the other hand, current online sample based approximation systems, which generate samples at runtime, do not take into account the uneven storage distribution of a sub-dataset. They work well for uniform distribution of a sub-dataset while suffer low sampling efficiency and poor estimation accuracy on unevenly distributed sub-datasets. To address the problem, we develop a distribution aware method called Sapprox. Our idea is to collect the occurrences of a sub-dataset at each logical partition of a dataset (storage distribution) in the distributed system, and make good use of such information to facilitate online sampling. There are three thrusts in Sapprox. First, we develop a probabilistic map to reduce the exponential number of recorded sub-datasets to a linear one. Second, we apply the cluster sampling with unequal probability theory to implement a distribution-aware sampling method for efficient online subdataset sampling. Third, we quantitatively derive the optimal sampling unit size in a distributed file system by associating it with approximation costs and accuracy. We have implemented Sapprox into Hadoop ecosystem as an example system and open sourced it on GitHub. Our comprehensive experimental results show that Sapprox can achieve a speedup by up to 20x over the precise execution.

Publication Date

1-1-2016

Publication Title

Proceedings of the VLDB Endowment

Volume

10

Issue

3

Number of Pages

109-120

Document Type

Article; Proceedings Paper

Personal Identifier

scopus

DOI Link

https://doi.org/10.14778/3021924.3021928

Socpus ID

85020420470 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/85020420470

This document is currently not available here.

Share

COinS