Scalable Anomaly Detection with Spark and SOS
Transcript of Scalable Anomaly Detection with Spark and SOS
![Page 1: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/1.jpg)
ScalableAnomaly Detectionwith Spark and SOS
Strata NYCSeptember 26, 2019
![Page 2: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/2.jpg)
Hi there, my name is Jeroen Janssens
![Page 3: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/3.jpg)
Today
● SOS, World!● Anomalies and outliers● Evaluating outlier-selection algorithms● Various approaches to outlier selection● Stochastic Outlier Selection● Conclusion
![Page 4: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/4.jpg)
SOS, World!
01-sos-world.ipynb
![Page 5: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/5.jpg)
Implementations of SOS
● Python: http://bit.ly/sos-python● Spark: http://bit.ly/sos-spark● R: http://bit.ly/sos-r● Flink: http://bit.ly/sos-flink
![Page 6: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/6.jpg)
Anomalies and outliers
![Page 7: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/7.jpg)
An anomaly is an observation or event that deviates qualitatively from what is considered to be normal, according to a
domain expert.
![Page 8: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/8.jpg)
Detecting anomalies is important
● Expensive● Dangerous● Mess up your model
![Page 9: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/9.jpg)
Human anomaly detection may suffer from
● Fatigue● Information overload● Emotional bias
![Page 10: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/10.jpg)
Feature-vector representation
![Page 11: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/11.jpg)
Dissimilarity-matrix representation
![Page 12: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/12.jpg)
From anomaly to outlier
![Page 13: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/13.jpg)
An outlier is a data point that deviates quantitatively from the majority of the
data points, according to an outlier-selection algorithm.
![Page 14: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/14.jpg)
The symbiotic relationship between the domain expert and the algorithm
![Page 15: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/15.jpg)
Data flow diagram
Data flow diagram illustrating the relationship between the domain expert (square) and the outlier-selection algorithm (top circle).
![Page 16: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/16.jpg)
Six Euler diagrams (1/2)
![Page 17: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/17.jpg)
Six Euler diagrams (2/2)
![Page 18: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/18.jpg)
Evaluating outlier-selection algorithms
![Page 19: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/19.jpg)
Confusion matrix
Computer says no.
![Page 20: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/20.jpg)
Four possible outcomes
![Page 21: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/21.jpg)
Evaluation
Illustration of relabelling a multi-class data set into multiple one-class data sets.
![Page 22: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/22.jpg)
Anomalies are rare
In order to evaluate the algorithm we simulate anomalies to be rare. Banana for scale.
![Page 23: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/23.jpg)
Outlier scores
The dashed line indicates the threshold chosen by the domain expert.
![Page 24: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/24.jpg)
ROC curve
An ROC curve plots the false alarm rate against the hit rate for all possible thresholds.
![Page 25: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/25.jpg)
Various approachesto outlier selection
![Page 26: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/26.jpg)
Distribution-based outlier selection
![Page 27: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/27.jpg)
Distance-based outlier selection
Size does matter
![Page 28: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/28.jpg)
Density-based outlier-selection
![Page 29: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/29.jpg)
Stochastic Outlier Selection
![Page 30: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/30.jpg)
Stochastic Outlier Selection
● Unsupervised outlier selection algorithm● Employs concept of affinity (inspired by t-SNE)● One parameter: perplexity● Computes outlier probabilities
![Page 31: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/31.jpg)
t-Distributed Neighbor Embedding (t-SNE; Van der Maaten, Hinton) employs affinity to perform dimensionality reduction
![Page 32: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/32.jpg)
A data point is selected as an outlier when all the other data points have
insufficient affinity with it.
![Page 33: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/33.jpg)
From input to output
![Page 34: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/34.jpg)
From feature matrix to dissimilarity matrix
![Page 35: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/35.jpg)
From input to output
![Page 36: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/36.jpg)
Smooth neighborhoods
![Page 37: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/37.jpg)
Affinity between data points
![Page 38: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/38.jpg)
From input to output
![Page 39: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/39.jpg)
From affinity to binding probability
The binding matrix B is obtained by normalising each row in the affinity matrix A.
![Page 40: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/40.jpg)
Binding probabilities form a graph
![Page 41: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/41.jpg)
Binding probabilities form a graph
![Page 42: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/42.jpg)
Stochastic Neighbor Graph
A data point belongs to the outlier class when no it is not selected by any other data points.
![Page 43: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/43.jpg)
Three SNGs
The three SNGs Ga, Gb, and Gc are sampled from the discrete probability distribution P(G).
![Page 44: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/44.jpg)
Set of all SNGs
![Page 45: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/45.jpg)
Approximating outlier probabilities by sampling SNGs
![Page 46: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/46.jpg)
Demo: Sampling SNGs inCoffeeScript and D3
http://bit.ly/sos-d3
![Page 47: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/47.jpg)
Computing outlier probabilities through marginalisation
![Page 48: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/48.jpg)
Computing outlier probabilities in closed form
![Page 49: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/49.jpg)
Proof!
![Page 50: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/50.jpg)
Selecting outliers
![Page 51: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/51.jpg)
Adaptive variances via the perplexity parameter
![Page 52: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/52.jpg)
Continuous binary search
![Page 53: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/53.jpg)
Perplexity influences outlier probabilities
![Page 54: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/54.jpg)
Evaluation and comparison
![Page 55: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/55.jpg)
Putlier-score plots
![Page 56: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/56.jpg)
Real-world datasets
![Page 57: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/57.jpg)
Synthetic datasets
![Page 58: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/58.jpg)
Synthetic datasets
![Page 59: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/59.jpg)
Synthetic datasets
![Page 60: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/60.jpg)
Synthetic datasets
![Page 61: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/61.jpg)
SOS performs significantly better
![Page 62: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/62.jpg)
Spark implementation of SOS
![Page 63: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/63.jpg)
Spark implementation of SOS
● Developed by Fokko Driesprong● Works with DataFrame API● Available on GitHub● Plan is to make it part of MLLib
![Page 64: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/64.jpg)
SOS on PySpark
92-pyspark-sos.ipynb
![Page 65: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/65.jpg)
Summary
● Outlier selection can support the detection of anomalies● SOS is an intuitive and probabilistic algorithm to select outliers● SOS has a very good performance● No free lunch
![Page 66: Scalable Anomaly Detection with Spark and SOS](https://reader033.fdocuments.in/reader033/viewer/2022051813/6283293ef60b137c3b6af1cb/html5/thumbnails/66.jpg)
Thank you! Here are some links
● Blog: http://bit.ly/sos-blog● D3 Demo: http://bit.ly/sos-d3● Python implementation: http://bit.ly/sos-python● Spark implementation: http://bit.ly/sos-spark● R implementation: http://bit.ly/sos-r● Flink implementation: http://bit.ly/sos-flink