Title

Log-Based Abnormal Task Detection And Root Cause Analysis For Spark

Keywords

Log Analysis; Abnormal Task; Root

Abstract

Application delays caused by abnormal tasks are common problems in big data computing frameworks. An abnormal task in Spark, which may run slowly without error or warning logs, not only reduces its resident node's performance, but also affects other nodes' efficiency. Spark log files report neither root causes of abnormal tasks, nor where and when abnormal scenarios happen. Although Spark provides a 'speculation' mechanism to detect straggler tasks, it can only detect tailed stragglers in each stage. Since the root causes of abnormalities are complicated, there are no effective ways to detect root causes. This paper proposes an approach to detect abnormalities and analyze their root causes using Spark log files. Unlike common online monitoring or analysis tools, our approach is a pure off-line method that can analyze abnormality accurately. Our approach consists of four steps. First, a parser preprocesses raw log files to generate structured log data. Second, in each stage of a Spark application, we choose features related to execution time and data locality of each task, as well as memory usage and garbage collection of each node. Third, based on the selected features, we detect where and when abnormalities happen. Finally, we analyze the problems using weighted factors to decide the probability of root causes. In this paper, we consider four potential root causes of abnormalities, which include CPU, memory, network, and disk. The proposed method has been tested on real-world Spark benchmarks. To simulate various scenarios of root causes, we conducted interference injections related to CPU, memory, network, and disk. Our experimental results show that the proposed approach is accurate in detecting abnormal tasks as well as finding the root causes.
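The abstract does not give the detection rule or the weighting scheme, so the following is only a minimal sketch of how the last two steps (per-stage detection, then weighted root-cause scoring) might look. Everything in it is an assumption made for illustration: the record fields, the mean-plus-k-standard-deviations threshold, the two features, and the weight table are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical structured records, as a log parser might emit them.
# Field names are assumptions, not the paper's schema.
TASKS = [
    {"stage": 0, "task": 0, "duration_ms": 100, "gc_ms": 5, "locality": "NODE_LOCAL"},
    {"stage": 0, "task": 1, "duration_ms": 110, "gc_ms": 6, "locality": "NODE_LOCAL"},
    {"stage": 0, "task": 2, "duration_ms": 105, "gc_ms": 4, "locality": "NODE_LOCAL"},
    {"stage": 0, "task": 3, "duration_ms": 95, "gc_ms": 5, "locality": "NODE_LOCAL"},
    {"stage": 0, "task": 4, "duration_ms": 400, "gc_ms": 200, "locality": "NODE_LOCAL"},
    {"stage": 1, "task": 5, "duration_ms": 200, "gc_ms": 8, "locality": "NODE_LOCAL"},
    {"stage": 1, "task": 6, "duration_ms": 205, "gc_ms": 9, "locality": "NODE_LOCAL"},
    {"stage": 1, "task": 7, "duration_ms": 205, "gc_ms": 7, "locality": "NODE_LOCAL"},
    {"stage": 1, "task": 8, "duration_ms": 200, "gc_ms": 8, "locality": "NODE_LOCAL"},
]

# Placeholder weights linking each feature to each candidate root cause.
# The values are invented for this sketch.
WEIGHTS = {
    "cpu": {"gc_ratio": 0.2, "remote_read": 0.1},
    "memory": {"gc_ratio": 0.7, "remote_read": 0.0},
    "network": {"gc_ratio": 0.0, "remote_read": 0.8},
    "disk": {"gc_ratio": 0.1, "remote_read": 0.1},
}


def detect_abnormal(tasks, k=1.5):
    """Flag tasks whose duration exceeds mean + k * std within their own stage.

    This threshold rule is an assumed stand-in for the paper's detector.
    """
    by_stage = defaultdict(list)
    for t in tasks:
        by_stage[t["stage"]].append(t)

    abnormal = []
    for group in by_stage.values():
        durations = [t["duration_ms"] for t in group]
        threshold = mean(durations) + k * pstdev(durations)
        abnormal.extend(t for t in group if t["duration_ms"] > threshold)
    return abnormal


def extract_features(task):
    """Derive two illustrative features from a task record."""
    local = task["locality"] in ("PROCESS_LOCAL", "NODE_LOCAL")
    return {
        "gc_ratio": task["gc_ms"] / task["duration_ms"],
        "remote_read": 0.0 if local else 1.0,
    }


def rank_root_causes(task, weights=WEIGHTS):
    """Turn weighted feature sums into a normalized score per root cause."""
    features = extract_features(task)
    scores = {
        cause: sum(w * features[name] for name, w in table.items())
        for cause, table in weights.items()
    }
    total = sum(scores.values())
    if total == 0:
        return {cause: 0.0 for cause in scores}
    return {cause: s / total for cause, s in scores.items()}
```

Calling `detect_abnormal(TASKS)` and passing each flagged record to `rank_root_causes` yields one normalized score per candidate cause (CPU, memory, network, disk). Grouping by stage before thresholding reflects the abstract's point that features are chosen within each stage rather than across the whole application.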

Publication Date

9-7-2017

Publication Title

Proceedings - 2017 IEEE 24th International Conference on Web Services, ICWS 2017

Pages

389-396

Document Type

Article; Proceedings Paper

Personal Identifier

scopus

DOI Link

https://doi.org/10.1109/ICWS.2017.135

Scopus ID

85032347370 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/85032347370

This document is currently not available here.
