Arvind Outlier Detection Tp 2012 Final




  • 8/12/2019 Arvind Outlier Detection Tp 2012 Final


Method to detect software performance regression based on outliers and change point in a time-series

    Arvind Murthy

Performance Engineering, Yahoo SDC, Bangalore, India (email: [email protected])

    August 15, 2012

    Abstract

In order to adopt Continuous Integration (CI), performance regressions need to be monitored for the integrated software stack. Performance metrics of a software system are time-series in nature. Change-point detection is the problem of discovering points at which properties of time-series data change. Outlier detection is the problem of discovering abnormal or deviating data points with respect to the distribution of the time-series. This covers a broad range of real-world problems, actively discussed in the statistics and data mining communities. Most previous work addresses detection of outliers and change points separately. In this paper, we present a novel unified approach for joint detection of outliers and change points. We currently use the solution as part of a continuous integration stack in order to detect performance regressions of the Hadoop [12] stack across releases. In our method, change point detection is subsumed into the problem of detecting outliers. Our core idea is to compare the deviation of a data point from its short-term median with its long-term average. We use this idea to construct an online algorithm that detects outliers immediately and change points within a few time steps of their occurrence. Its worst-case complexity is O(n). Experiments with synthetic datasets, as well as application to real datasets, show why the algorithm succeeds.

    1 Introduction

The problem of discovering time-points at which properties of time-series data change attracts a lot of attention in the data mining community [5, 4, 7, 8, 9]. This problem is referred to as change-point detection or event detection and covers a broad range of real-world problems such as fraud detection in cellular systems, intrusion detection in computer networks, irregular-motion detection in vision systems, signal segmentation in data streams, and fault detection in engineering systems.

A statistical definition of a change point is a data point where the probability distribution of the data changes significantly in its parameters or even its nature, e.g. the mean or variance of the data changes markedly, or the distribution changes from normal to heavy-tailed. An example of the latter is network traffic, which becomes heavy-tailed as more and more traffic gets multiplexed onto the same network link.

On the other hand, an outlier is a data point that deviates markedly from the rest of the data. Discriminating between outliers and change points is challenging. This is because a sudden large change in the distribution at some time point could initially be mistaken as merely an outlier with respect to the previous data points. But if these outliers persist for long enough, then they are really part of a new distribution.

At Yahoo! we have encountered the problem of change point detection in our automated performance regression detection system Trumpet [13]. This system detects and alerts on changes in software performance from one build/version/release to the next. This is better than manual analysis, which is limited in accuracy and scales poorly in engineer man-hours. Currently Trumpet is applied to alerting on performance regressions in the Hadoop stack. The time-series here are sequences of performance metrics of consecutive builds as measured in benchmark runs. Examples of such metrics are application run-time and cluster throughput. The resulting time-series of such performance metrics do not typically follow any well-known distribution. Moreover they are non-stationary, i.e. their mean, variance and other statistical parameters can change over time as trends or jumps. This makes the problem very challenging.

Current change point and outlier detection methods assume that the data follows a certain distribution and/or model the time-series as an autoregressive process. Unfortunately their accuracy in detecting change points is sensitive to outliers. Some outlier detection approaches such as [1, 2, 3, 5, 10, 11] depend on static thresholds to identify outliers, or have to make multiple passes, where in each pass a fixed number of outliers are detected and filtered out. Many of these methods assume a normal distribution; that makes them less robust to real-world data with its arbitrary and non-stationary distributions.

Our key idea is to use the ratio of the deviation from the short-term median to the long-term mean as a score to classify a data point. The median is taken over a shorter window prior to the current data point, while the mean is taken over a longer window including the current data point. Using such sliding windows we can track trends adaptively while ignoring out-of-date statistics. This score is used to classify the current point as an outlier or not. When a sequence of such outliers is detected at successive data points, we retrospectively classify the start of the sequence as a change point. Change point detection is thus reduced to detecting a sufficient number of successive outliers.

The remainder of this paper is organized as follows. In Section 2, we formulate the detection problem with background. In Section 3, we develop the proposed algorithm. A Hadoop-based case study is presented in Section 4. We report experimental results on synthetic and real-world datasets in Section 5. Related and alternative models are discussed in Section 6. We conclude with a summary and possible future work in Section 7.

    2 Background

To illustrate the problem, consider a time-series of throughput values observed in a sorting system that sorts a fixed amount of data using different versions of software. Most of the throughput variations may be statistically regular, but once in a while there may be an outlier point, i.e. a marked deviation from the previous data. In general, detecting such outliers is important because they may be caused by an anomaly inside the sorting software or the system environment in which it runs.

We enumerate the requirements for an outlier and change-detection algorithm:

1) The process should be on-line: an outlier has to be detected as it appears. A change point has to be detected within some constant number of observations after the change happens.

2) The detection should be adaptive to a non-stationary time-series and robust to a wide variety of distributions. Preferably it should not assume specific distributions.

The major theme of this paper is to demonstrate a unified solution for detecting outliers and change points whilst honoring the above requirements. Our algorithm, called OCP detect (Outlier Change-Point detect), honors the on-line requirement: 1) it computes an outlier-classification score for a data point immediately at the time at which it occurs, without waiting; change points get detected after a constant number of successive outliers occur. 2) Moreover, it honors the adaptive requirement, as it uses a sliding-window technique to forget old data and adapts the thresholds after change points are detected.

    3 The OCP detect Algorithm

We denote a data sequence as {x_i : i = 1, 2, ..., N}, where i is the time variable. p denotes the number of data points to be observed before initiating analysis. Cx_i denotes the current data point in the time series being considered for analysis. t and s denote the thresholds for change point and outlier detection respectively. By threshold we mean the magnitude of fluctuation allowed in the data points within which they won't be classified. Window sizes w and v denote the number of data points to consider in order to compute the median and mean respectively. We choose the median over a small window, as it is less sensitive to outliers and helps in localizing the deviation. We require w < v.
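As a sketch, the windowed statistics and the two scores might be computed as follows. The exact score formulas are only partially legible above, so the denominator (the long-term mean amplified by the threshold t) and the form of Score2 here are our assumptions:

```python
from statistics import mean, median

def scores(series, i, w=5, v=10, t=0.02):
    """Hypothetical reconstruction of the OCP-detect scores.

    Score1: deviation of the current point Cx_i from the short-term
    median, relative to the threshold-amplified long-term mean.
    Score2: deviation of the long-term mean from the short-term median,
    over the same denominator (one plausible reading of the text).
    """
    x = series[i]
    short_median = median(series[i - w:i])     # w points *before* Cx_i
    long_mean = mean(series[i - v + 1:i + 1])  # v points *including* Cx_i
    denom = t * long_mean
    score1 = abs(x - short_median) / denom
    score2 = abs(long_mean - short_median) / denom
    return score1, score2
```

With such sliding windows, a single spike inflates Score1 immediately, but moves the long-term mean (and hence Score2) only slightly.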


For the preceding point, the long-term mean is 58. The resulting scores are Score1 = 0.865 and Score2 = 0.002.

Consider the last point in the time-series, whose value is 50. The short-term median for this point is computed over the points 38 to 42 (marked blue) and the value is 60. The long-term mean for this point is computed over the points 34 to 43 (marked red) and the value is 57. The resulting scores are Score1 = 0.880 and Score2 = 0.004.

We can observe in the above example that the mean is more sensitive to data points than the median. We consider the magnitude of deviation of the current point from the mean and the median. We also consider the deviation of the mean from the median itself. Using these deviations we establish a relation that forms Score1 and Score2. Note that it is not possible for the magnitudes of deviation between the mean, the median and the current data point to all be the same. However, they could be very close to each other, in which case this has to be handled as a corner case, discussed further below. The current data point and the median being close to each other signifies that the data point is not an outlier in the short-term window; this would be reflected in a small Score1. In the example above, Score1 > Score2: the current data point is away from both the mean and the median, and the deviation of the median from the mean has increased compared to the previous point. We cannot base the classification decision on the median alone; we still need to quantify the deviation from the mean. Thus we quantify the deviation relative to the mean, the median and the current data point.

Score1 is the ratio of the absolute difference between the current data point and the median to the mean amplified by the threshold. The median is over the short-term window w and the mean over the longer-term window v. As explained using the figure above, taking such a ratio makes the score much more robust to fluctuations in the long-term mean, such as those due to past outliers. We heuristically found a suitable ratio of w to v. If Score1 > Score2, we classify the point as an outlier. Score2 is used as a data-dependent cutoff to classify Score1 as an outlier or not. We gave some examples in the previous section explaining this advantage. Score1 is sensitive to the mean and to the deviation of the current data point from the median. Score2 is sensitive to the deviation of the current point from the mean and to the deviation of the mean from the median. Within window v, Score1 will be greater than Score2 in the presence of outliers or for data points subsequent to a change point.

Step 4: Test the scores. Score1 being higher than Score2 indicates a stronger possibility of the current data point being an outlier or change point.

Score1_i > Score2_i and (Cx_i < LL_i or Cx_i > UL_i) : V_i = 1

Score1_i <= Score2_i or (LL_i <= Cx_i <= UL_i) : V_i = 0

That is, in addition to Score1_i exceeding Score2_i, we make the additional check that the current data point is beyond a certain band [LL_i, UL_i] around the median. The classification state is then saved in vector V.
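In code, the rule above is a pair of conditions. A minimal sketch — the construction of LL/UL as a band of plus or minus s (fractional) around the short-term median is our assumption, as the paper does not spell it out here:

```python
def band(short_median, s=0.02):
    # Assumed limits: +/- s (as a fraction) around the short-term median
    return short_median * (1 - s), short_median * (1 + s)

def classify_state(score1, score2, x, ll, ul):
    """Step 4 sketch: V_i = 1 only when Score1 exceeds Score2 AND the
    current point lies outside the [LL, UL] band; otherwise V_i = 0."""
    return 1 if score1 > score2 and (x < ll or x > ul) else 0
```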

Step 5: Classify outlier and change point based on the state information in vector V.

V_i = 1 : Vs_i = V_i + V_(i-1) + V_(i-2)

V_i = 0 : Vs_i = 0

First pass on vector Vs:

Vs_i = 3 and (Vs_(i-1) = 3 or Vs_(i-2) = 3) : Vs_i = 1

Vs_i = 1 and (Vs_(i-1) = 1 or Vs_(i-2) = 1) : Vs_i = Vs_i + Vs_(i-1)

Second pass on vector Vs:

Vs_i = 0 : no change; adapt LL, UL to Cx_i

Vs_i = 1 or 2, and Cx_i deviates by more than the s% threshold : outlier

Vs_i > 2 : change point; adapt LL, UL to the change

If the current point has a higher Score1, as indicated in V_i, we consider the past three states to classify it. Vector element Vs_i expresses the sum of the states V_i, V_(i-1), V_(i-2); it stores the state of the current data point with respect to the past two data points. Vs is processed in two passes. In the first pass, if Vs_i indicates 3, it means outliers have been detected in the past and the current point could be a change point. We test the previous states to make sure no change point was detected in the past two data points; if one was, then we infer the current point is an outlier. Similarly, if Vs_i indicates 1, it is possible for the current point to be an outlier.

Finally, in the second pass on Vs, we classify Cx_i. If the state of the current point is 0, then there is no significant deviation. If the state for the current point is 1 or 2 whilst the current data point deviates by more than the s% threshold, we infer it is an outlier; a state of 2 also signifies a higher possibility of Cx_(i-1) being a change point. If the state is 3, we infer with confidence that the point two positions prior to the current one is the change point.
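The two passes can be sketched as follows. This is a reconstruction from the description above, not the paper's own pseudo-code, and the s% deviation check of the second pass is omitted for brevity:

```python
def two_pass_classify(v):
    """Classify each point from the state vector V (0/1 per point).

    Build Vs_i = V_i + V_(i-1) + V_(i-2) when V_i = 1 (else 0); the
    first pass then demotes a 3 to 1 if a change point was already
    flagged in the previous two steps, and merges adjacent 1-states.
    The second pass maps the final states to labels.
    """
    n = len(v)
    vs = [0] * n
    for i in range(n):
        if v[i] == 1:
            vs[i] = v[i] + (v[i - 1] if i >= 1 else 0) + (v[i - 2] if i >= 2 else 0)
    for i in range(n):  # first pass adjustments
        if vs[i] == 3 and ((i >= 1 and vs[i - 1] == 3) or (i >= 2 and vs[i - 2] == 3)):
            vs[i] = 1  # change point already reported: treat as outlier
        elif vs[i] == 1 and ((i >= 1 and vs[i - 1] == 1) or (i >= 2 and vs[i - 2] == 1)):
            vs[i] = vs[i] + vs[i - 1]
    labels = []
    for s in vs:  # second pass: map states to labels
        if s == 0:
            labels.append("normal")
        elif s in (1, 2):
            labels.append("outlier")       # subject to the s% deviation check
        else:
            labels.append("change-point")  # the change started two points earlier
    return labels
```

An isolated spike yields a lone outlier label, while three successive flags yield a change-point label whose onset is two points earlier.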

Step 6: Report the classification of outliers and change points as signified in the Vs vector.

Step 7: Persist the scores and the state elements of vectors V and Vs for the past data points N, N-1 and N-2. We also compute and persist the median and mean over the window sizes while classifying the current data point. This step enables an online implementation.

Upon arrival of a new data point we retrieve the state information persisted in Step 7: the median, mean, and scores. It is not required to process from the start of the time-series. This improves the efficiency and performance of the solution and makes it on-line: outliers are detected immediately and change points within a few subsequent steps.
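A minimal sketch of the persisted state and the constant-time per-point update (the field names are ours; the paper does not specify a storage format):

```python
def new_state():
    # State carried between points: recent raw values plus the last
    # three V states and the current limits
    return {"window": [], "v_states": [0, 0, 0], "ll": None, "ul": None}

def on_new_point(state, x, w=5, v=10):
    """Append the new point and forget anything older than the longest
    window, so no pass over the full series is ever required."""
    state["window"].append(x)
    del state["window"][:-max(w, v)]  # keep only the last max(w, v) points
    return state
```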

    4 Hadoop Case Study

This solution is currently being used to detect change points in Hadoop's performance characteristics. Hadoop development results in continuous code change and feature addition by developers. The use case is to track Hadoop performance build-on-build or nightly. Two time-series of releases are followed to detect performance anomalies.

Change point detection reduces the cost and turnaround time to detect and fix performance regressions. Outliers need to be discarded as they influence further analysis. Performance of the Hadoop stack is benchmarked by running a suite of tests and monitoring the metrics reported by the benchmarking system. With 8 to 10 benchmarks considered and multiple metrics per benchmark to be analyzed, it is evident that the manual effort of achieving this doesn't scale.

As an example of this case study, let us consider the sort benchmark, which sorts a given amount of data on a Hadoop cluster. Sort throughput is the metric that reflects the performance of the stack in sorting data. The results of using this system to detect outliers and change points are shown below:

    Figure - 2

Throughput changes occur because of changes in data size, changes in the number of nodes in the cluster, or changes in the Hadoop software version. The changes in throughput resulted in multiple change points, evident from Figure - 2. The solution detects all such points, marked red. Similarly, there were outliers occurring because of external factors such as network latency, node failure, or the choice of node made at random by Hadoop. Such points were also detected, and are marked blue in the above figure. Note how the algorithm adapted the LL and UL limits to the changing median. The analysis and notification system uses this solution and takes action upon change points.

    5 Experiments

Experiments with synthetic data sets were performed; simulations used data generated from a mixed Gaussian distribution. For some experiments we constructed non-stationary time-series from segments of different normal distributions with standard deviation varying from 1 to 2 times the mean.

    Figure - 3

As Figure - 3 shows, all outliers and change points could be detected. The test data consists of changes of small, medium and large magnitude, and contains outliers in each segment. The algorithm was able to detect all outliers as well as classify change points.

    Figure - 4

In the second experiment the time-series is also non-stationary, with a gradual trend in the form of small changes over small time steps; the mean and median change over the sub-parts. We can observe that the solution correctly detects the outliers and change points. Furthermore, the box around the time-series representing the LL and UL limits adapts to the changes in mean and median over time, as shown in Figure - 4.
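Series of this kind are straightforward to generate. The sketch below builds a three-segment series with two change points and a few injected outliers; the segment means, outlier positions and the 2% noise level are our illustrative choices, not the paper's exact setup:

```python
import random

def synthetic_series(seed=7):
    """Non-stationary test series: three segments drawn from normal
    distributions with different means (change points at indices 40
    and 80), with outliers injected at fixed positions."""
    rng = random.Random(seed)
    series = []
    for seg_mean, n in [(50.0, 40), (70.0, 40), (55.0, 40)]:
        series += [rng.gauss(seg_mean, 0.02 * seg_mean) for _ in range(n)]
    for i in (20, 60, 100):  # inject one outlier inside each segment
        series[i] *= 1.5
    return series
```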


Figure - 5

In the third experiment, the data set has the mean increasing by 6% and the standard deviation ranging from 2 to 3. It is evident from Figure - 5 that outliers are detected with s = t = 2%.

    6 Related work

Inter-Quartile Range (IQR) test: IQR is based on percentile statistics. Just like the median, percentiles presume a specific ordering of observations in a dataset. The sort process required by this method turns out to be costly, especially when the dataset is huge. For example, for a data set with a billion observations, calculating the percentiles using the piecewise-parabolic algorithm could take approximately five times the time needed to generate the distribution moments.
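For comparison, the standard IQR (Tukey fence) test can be sketched as below; note that computing the quartiles implies ordering the data, which is the cost objected to above:

```python
from statistics import quantiles

def iqr_outliers(data, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the
    conventional Tukey fence."""
    q1, _, q3 = quantiles(data, n=4)  # orders the data internally
    spread = q3 - q1
    lo, hi = q1 - k * spread, q3 + k * spread
    return [x for x in data if x < lo or x > hi]
```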

Chauvenet's criterion [10]: As stated in the criterion, if the expected number of measurements at least as bad as the suspect measurement is less than 1/2, then the suspect measurement should be rejected. The complementary error function in this model gives the residual area under the tails of a distribution. This cannot be applied to non-stationary time-series data. The criterion also does not apply when the data distribution is strongly bi- or multi-modal.

Peirce's criterion [2, 3]: Theoretically accounts for the case where there is more than one suspect observation, in contrast to Chauvenet's criterion. However, while Peirce's criterion applies more generally, it is still restricted to data sets with Gaussian distributions.

A piecewise segmented function is used in [9] to accommodate time-dependent data; the change points are qualified as the points between successive segments. A change point may be detected by discovering the point such that the total error of local model fits of segments to the data before and after that point is minimized.

However, it is computationally expensive to converge to such a point, as the local model fitting is required as many times as the number of points between the successive change points every time data is input. Further, it is not guaranteed to work well if the time-series cannot be well represented by a simple piecewise segmented function.

CUSUM (cumulative sum) [1]: Detects changes in mean on the order of 2 standard deviations or less; its success depends on the sequence being Gaussian.
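A one-sided CUSUM in the spirit of [1] can be sketched as follows; the reference value k and the decision threshold h are illustrative choices, and a symmetric lower-side sum would be kept in practice:

```python
def cusum_upper(series, target, k=0.5, h=5.0):
    """Page's one-sided upper CUSUM: accumulate deviations above
    target + k and raise an alarm when the sum crosses h.  Effective
    for sustained mean shifts, but assumes roughly Gaussian noise."""
    s, alarms = 0.0, []
    for i, x in enumerate(series):
        s = max(0.0, s + (x - target - k))
        if s > h:
            alarms.append(i)
            s = 0.0  # restart after each alarm
    return alarms
```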

GLR (generalized likelihood ratio) [5]: The logarithm of the likelihood ratio between two consecutive intervals in time-series data is monitored to detect change points. This premise has been extensively explored in the data mining community in connection with real-world applications, for example in approaches based on novelty detection, maximum-likelihood estimation and online learning of autoregressive models.

Direct density-ratio estimation has been actively explored in the machine learning community, e.g., Kernel Mean Matching [6]. However, this is a batch method and is not suitable for change-point detection given the sequential nature of time-series data analysis.

We also considered the Grubbs, Tietjen-Moore and Generalized Extreme Studentized Deviate tests [11]; these hold good for stationary time-series and assume normality of the distribution.

Further, our approach is distinguished from previous works in the following regards:

1) In most previous work, outlier detection and change point detection have not been addressed together. This paper outlines the connection between them and presents a unified method for dealing with both of them from the viewpoint of non-stationary time-series. In our solution, we can detect outliers and change points simultaneously on the basis of an identical score and the state vector employed in the algorithm.

2) The proposed change-detection algorithm is computationally efficient and achieves high detection accuracy.

3) In most previous work, a normality assumption is made regarding the underlying distribution; in contrast, the presented solution makes no such assumption, and the idea is applicable to non-stationary time-series.

    7 Summary

We have proposed a solution for classifying outliers and change points in non-stationary time series. It is addressed in two parts: scoring and classification. In the scoring part we compute scores that reflect outliers; we incrementally discover and keep the state of outliers in the data series. Specifically, in this solution we used a sliding window to forget old statistics, so the effect of past data is gradually discounted. The algorithm is characterized by its ability to address outliers and change points at the same time. This enables us to deal with non-stationary sources. The current implementation and usage indicate the success of the algorithm. The method has been put to use in the case study discussed; the solution is however generic and could be used for any software stack.

Specifically, we addressed the problem of change point detection as a problem of outlier detection. This gives a unifying view of outlier detection and change point detection.

The following issues remain for future exploration: improving the outlier/change point detection accuracy by adding a prediction layer, thereby reducing the complexity of carefully tuning the thresholds and window sizes. The idea is to develop a layer that predicts the next data point with some probability. The layer would also learn to readjust probabilities based on observed deviations and enhance accuracy.

    8 References

[1] Basseville, M. and Nikiforov, I. V., Detection of Abrupt Changes: Theory and Application, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1993.

[2] B. A. Gould, On Peirce's criterion for the rejection of doubtful observations, with tables for facilitating its application, Astronomical Journal IV, 83, 81-87 (1855).

[3] Benjamin Peirce, Criterion for the rejection of doubtful observations, Astronomical Journal II, 45, 161-163 (1852). (http://articles.adsabs.harvard.edu/cgi-bin/nph-iarticle_query?1852AJ......2..161P;data_type=PDF_HIGH)

[4] Brodsky, B. and Darkhovsky, B., Nonparametric Methods in Change-Point Problems, Kluwer Academic Publishers, 1993.

[5] Gustafsson, F., Adaptive Filtering and Change Detection, John Wiley & Sons Inc., 2000.

[6] Huang, J., Smola, A., Gretton, A., Borgwardt, K. M. and Scholkopf, B., Correcting Sample Selection Bias by Unlabeled Data, in Advances in Neural Information Processing Systems, Vol. 19, MIT Press, Cambridge, 2007.

[7] Kifer, D., Ben-David, S. and Gehrke, J., Detecting Change in Data Streams, in Proc. of the 30th Int'l Conf. on Very Large Data Bases (2004), pp. 180-191.

[8] V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley & Sons, 1994.

[9] V. Guralnik and J. Srivastava, Event detection from time series data, in Proc. KDD-99, pp. 33-42, 1999.

[10] William Chauvenet, A Manual of Spherical and Practical Astronomy V.II, Lippincott, Philadelphia, 1st Ed. (1863); reprint of 1891 5th Ed.: Dover, NY (1960).

[11] NIST/SEMATECH e-Handbook of Statistical Methods. http://www.itl.nist.gov/div898/handbook/index.htm

[12] Apache™ Hadoop™. http://hadoop.apache.org/

[13] Trumpet: Automated Hadoop stack performance analysis system. http://tiny.corp.yahoo.com/xpnV3Y-
