Log-Based Abnormal Task Detection And Root Cause Analysis For Spark
Keywords
Log Analysis; Abnormal Task; Root
Abstract
Application delays caused by abnormal tasks are common problems in big data computing frameworks. An abnormal task in Spark, which may run slowly without error or warning logs, not only reduces its resident node's performance, but also affects other nodes' efficiency. Spark log files report neither the root causes of abnormal tasks, nor where and when abnormal scenarios happen. Although Spark provides a 'speculation' mechanism to detect straggler tasks, it can only detect tailed stragglers in each stage. Since the root causes of abnormalities are complicated, there are no effective ways to detect them. This paper proposes an approach that detects abnormalities and analyzes their root causes using Spark log files. Unlike common online monitoring or analysis tools, our approach is a purely off-line method that can analyze abnormalities accurately. Our approach consists of four steps. First, a parser preprocesses raw log files to generate structured log data. Second, in each stage of a Spark application, we choose features related to the execution time and data locality of each task, as well as the memory usage and garbage collection of each node. Third, based on the selected features, we detect where and when abnormalities happen. Finally, we analyze the problems using weighted factors to decide the probability of each root cause. In this paper, we consider four potential root causes of abnormalities: CPU, memory, network, and disk. The proposed method has been tested on real-world Spark benchmarks. To simulate various root-cause scenarios, we conducted interference injections related to CPU, memory, network, and disk. Our experimental results show that the proposed approach is accurate in detecting abnormal tasks as well as in finding their root causes.
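As an illustration of the first three steps described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes log lines shaped like Spark's TaskSetManager "Finished task" messages, uses task duration as the only feature, and flags a task when its duration exceeds the stage mean by k standard deviations. The function names `parse` and `abnormal_tasks`, the record fields, and the threshold rule are all hypothetical; the abstract does not specify the detection rule, and the weighted root-cause step is not covered here.

```python
import re
from collections import defaultdict
from statistics import mean, pstdev

# Assumed line shape, modeled on Spark's TaskSetManager "Finished task" messages.
FINISHED = re.compile(
    r"Finished task \d+\.\d+ in stage (?P<stage>\d+)\.\d+ "
    r"\(TID (?P<tid>\d+)\) in (?P<ms>\d+) ms on (?P<host>\S+)"
)


def parse(lines):
    """Step 1 (sketch): turn raw log lines into structured records."""
    for line in lines:
        m = FINISHED.search(line)
        if m:  # lines that do not match the assumed shape are skipped
            yield {
                "stage": int(m["stage"]),
                "tid": int(m["tid"]),
                "ms": int(m["ms"]),
                "host": m["host"],
            }


def abnormal_tasks(records, k=1.5):
    """Steps 2-3 (simplified): per stage, flag tasks slower than mean + k * std.

    The threshold rule is an assumption made for illustration only.
    """
    by_stage = defaultdict(list)
    for r in records:
        by_stage[r["stage"]].append(r)

    flagged = []
    for tasks in by_stage.values():
        durations = [t["ms"] for t in tasks]
        threshold = mean(durations) + k * pstdev(durations)
        flagged.extend(t for t in tasks if t["ms"] > threshold)
    return flagged
```

Calling `abnormal_tasks(parse(open(path)))` on a log file would return the flagged records, each carrying the stage, task ID, duration, and host, which is the "where and when" information a later root-cause step would need. A fuller version would add the data locality, memory usage, and garbage collection features that the abstract lists.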
Publication Date
9-7-2017
Publication Title
Proceedings - 2017 IEEE 24th International Conference on Web Services, ICWS 2017
Pages
389-396
Document Type
Article; Proceedings Paper
Personal Identifier
scopus
DOI Link
https://doi.org/10.1109/ICWS.2017.135
Copyright Status
Unknown
Scopus ID
85032347370 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/85032347370
STARS Citation
Lu, Siyang; Rao, Bing Bing; Wei, Xiang; Tak, Byungchul; and Wang, Long, "Log-Based Abnormal Task Detection And Root Cause Analysis For Spark" (2017). Scopus Export 2015-2019. 7192.
https://stars.library.ucf.edu/scopus2015/7192