Sapprox: Enabling Efficient And Accurate Approximations On Sub-Datasets With Distribution-Aware Online Sampling
Abstract
In this paper, we aim to enable both efficient and accurate approximations on arbitrary sub-datasets of a large dataset. Due to the prohibitive storage overhead of caching offline samples for each sub-dataset, existing offline sample based systems provide high accuracy results for only a limited number of sub-datasets, such as the popular ones. On the other hand, current online sample based approximation systems, which generate samples at runtime, do not take into account the uneven storage distribution of a sub-dataset. They work well for uniform distribution of a sub-dataset while suffer low sampling efficiency and poor estimation accuracy on unevenly distributed sub-datasets. To address the problem, we develop a distribution aware method called Sapprox. Our idea is to collect the occurrences of a sub-dataset at each logical partition of a dataset (storage distribution) in the distributed system, and make good use of such information to facilitate online sampling. There are three thrusts in Sapprox. First, we develop a probabilistic map to reduce the exponential number of recorded sub-datasets to a linear one. Second, we apply the cluster sampling with unequal probability theory to implement a distribution-aware sampling method for efficient online subdataset sampling. Third, we quantitatively derive the optimal sampling unit size in a distributed file system by associating it with approximation costs and accuracy. We have implemented Sapprox into Hadoop ecosystem as an example system and open sourced it on GitHub. Our comprehensive experimental results show that Sapprox can achieve a speedup by up to 20x over the precise execution.
Publication Date
1-1-2016
Publication Title
Proceedings of the VLDB Endowment
Volume
10
Issue
3
Number of Pages
109-120
Document Type
Article; Proceedings Paper
Personal Identifier
scopus
DOI Link
https://doi.org/10.14778/3021924.3021928
Copyright Status
Unknown
Socpus ID
85020420470 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/85020420470
STARS Citation
Zhang, Xuhong; Wang, Jun; and Yin, Jiangling, "Sapprox: Enabling Efficient And Accurate Approximations On Sub-Datasets With Distribution-Aware Online Sampling" (2016). Scopus Export 2015-2019. 4202.
https://stars.library.ucf.edu/scopus2015/4202