Anomaly Detection and Automatic Labeling with Deep Learning
Adam Gibson
Anomaly Detection with Deep Learning
November 2017
SKYMIND OVERVIEW
Founded 2014 · Funding $6.3M · OSS: 3,700 GitHub forks, 300,000+ downloads/mo.
Team: 35 employees; 25 engineers; 7 PhDs
OUR BOOK GIVEAWAY! (pub. Aug. 2017)
SKIL: Our Production Deep Learning Solution
confidential 4
• There are 5 main approaches to anomaly detection:
  • Probabilistic-based
  • Distance-based
  • Domain-based
  • Reconstruction-based
  • Information-theoretic-based
• All of these methods have some drawback that prevents them from being applicable to every type of data. They either:
  • Make built-in assumptions about the data (like Gaussian Mixture Models)
  • Require specific domain knowledge
  • Detect only certain patterns of anomalies
  • Are not suitable for data with high temporal dependencies
  • Are not suitable for multivariate data, or are computationally infeasible at our scale
• Multiple approaches are necessary to have a comprehensive detection pipeline
Anomaly Detection Approaches
Anomaly Detection Example
Cluster-Based Methods 1
Cluster-based methods work by creating a dictionary of non-anomalous data and finding the entry that best matches the actual data at inference time.
If the pattern was never seen before, the reconstruction will be very different from the actual data.
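The dictionary-matching idea above can be sketched as follows. This is an illustrative stand-in (plain numpy, not the talk's actual pipeline): the "dictionary" is a set of stored normal patterns, a sample is matched against its closest entry, and a large mismatch signals a never-seen pattern.

```python
import numpy as np

# Stand-in dictionary of non-anomalous patterns (in practice these would
# be learned, e.g. cluster centroids of normal data).
rng = np.random.default_rng(1)
dictionary = rng.normal(0.0, 1.0, size=(20, 4))

def best_match_error(sample, dictionary):
    """Distance between a sample and its best-matching dictionary entry."""
    return np.linalg.norm(dictionary - sample, axis=1).min()

seen = np.zeros(4)           # resembles the normal data
unseen = np.full(4, 10.0)    # pattern never seen before
e_seen = best_match_error(seen, dictionary)
e_unseen = best_match_error(unseen, dictionary)
```

A never-seen pattern ends up far from every dictionary entry, so its best-match error is much larger than that of a familiar pattern.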
Cluster-Based Methods 2
Data
Best Reconstruction
Difference
Join raw data → Transform → Feed groups into the autoencoder and save the reconstruction error of the center
Input Data → Reconstruction
Example workflow for anomaly detection
Example of a VAE (Variational Autoencoder) detecting anomalies
Low reconstruction error
High reconstruction error
Data Processing Step 3: Training
The autoencoder will be trained to recreate the input data as closely as possible (one row at a time)
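A minimal sketch of this training step (illustrative only, not Skymind's actual code): a tiny linear autoencoder trained row by row with SGD to recreate each input as closely as possible.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))            # toy data: 100 rows, 8 columns

d, h, lr = 8, 3, 0.01                    # input dim, bottleneck dim, step size
W_enc = rng.normal(scale=0.1, size=(d, h))
W_dec = rng.normal(scale=0.1, size=(h, d))

mse_before = np.mean((X @ W_enc @ W_dec - X) ** 2)

for epoch in range(50):
    for x in X:                          # one row at a time
        z = x @ W_enc                    # encode into the bottleneck
        x_hat = z @ W_dec                # decode back to the input space
        err = x_hat - x                  # gradient of 0.5 * ||x_hat - x||^2
        grad_z = err @ W_dec.T           # backprop through the decoder
        W_dec -= lr * np.outer(z, err)
        W_enc -= lr * np.outer(x, grad_z)

mse_after = np.mean((X @ W_enc @ W_dec - X) ** 2)
```

Because the bottleneck is narrower than the input, the network is forced to learn the dominant structure of the normal data; reconstruction error drops over training.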
Data Processing Step 4: Ranking
The trained autoencoder will then be run on the new data, and we will store the error (sum of mean squared difference per column, per row) in a ranking engine. Unusual patterns will have a high reconstruction error.
Error = (Reconstruction − Input)², stored per row in an Error Table
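The ranking step can be sketched like this (a toy stand-in for the trained autoencoder): score each row by its squared reconstruction error and sort most-anomalous-first, as a ranking engine would.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[7] += 8.0                                        # inject an unusual row

def reconstruct(X):
    """Toy 'model': reconstruct every row as the column means."""
    return np.tile(X.mean(axis=0), (len(X), 1))

errors = ((reconstruct(X) - X) ** 2).sum(axis=1)   # per-row error table
ranking = np.argsort(errors)[::-1]                 # most anomalous first
```

The injected row cannot be reconstructed from the normal patterns, so it tops the ranking.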
Problem: No Free Lunch
Wolpert’s no free lunch theorem states that there is no single machine learning algorithm that can perform well on every task. Deep learning is itself a set of very different techniques that are good or bad at various problems.
The system will have to use various algorithms to detect different types of anomalies and possibly different root causes.
Class imbalance will be a problem at the beginning of the system’s lifetime. Labeled data will be overshadowed by the unlabeled data, and the anomaly detectors will not be able to improve for some time. One possible solution is to use pseudo-labeling to label all data with a trained classifier. These labels will be very noisy at first, so they might have to be deployed in stages.
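Staged pseudo-labeling can be sketched as follows (all data and the classifier are illustrative): a simple classifier trained on the small labeled set assigns provisional labels to unlabeled data, and only confident predictions are promoted into the training set for the next stage.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small labeled set: two 2-D classes around (-2,-2) and (2,2).
X_lab = np.vstack([rng.normal(-2, 1, (10, 2)), rng.normal(2, 1, (10, 2))])
y_lab = np.array([0] * 10 + [1] * 10)
X_unlab = rng.normal(2, 1, (100, 2))              # large unlabeled pool

# Toy classifier: nearest class mean; confidence = distance margin.
means = np.stack([X_lab[y_lab == c].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(X_unlab[:, None, :] - means[None, :, :], axis=2)
pseudo = dists.argmin(axis=1)                     # noisy provisional labels
confidence = np.abs(dists[:, 0] - dists[:, 1])    # crude margin score

keep = confidence > 1.0                           # stage 1: confident only
X_train = np.vstack([X_lab, X_unlab[keep]])
y_train = np.concatenate([y_lab, pseudo[keep]])
```

Later stages would retrain on the enlarged set and lower the confidence threshold, which is one way to roll the noisy labels out gradually.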
• Systems need to be able to handle terabytes of data at minimum
• The system is required to train a large number of neural networks within a short time-frame; GPU servers are a cost-effective way to obtain the necessary computational resources.
• GPU servers are storage-inefficient, so the system will employ the Hadoop File System and Spark on commodity servers to meet the storage requirements.
• The system must scale to larger problems.
System Requirements
• Use k-means and t-SNE cluster highlighting to label data points
• Uses the representation from the autoencoder to automatically group data
• t-SNE visualization allows highlighting and automatic labeling
• Use KNN and VP-trees to sample the hidden activations learned from the neural net to interactively label
Automatic Labeling
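The grouping step can be sketched with a tiny hand-rolled k-means (k=2; the data and initialization are illustrative): cluster the autoencoder's hidden activations so a label applied to one point propagates to its whole cluster. In the talk this is paired with a t-SNE view for interactive highlighting.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = np.vstack([rng.normal(0, 0.3, (50, 3)),    # activations, group A
                    rng.normal(3, 0.3, (50, 3))])   # activations, group B

# Deterministic init: the first point and the point farthest from it.
far = np.linalg.norm(hidden - hidden[0], axis=1).argmax()
centers = np.stack([hidden[0], hidden[far]])
for _ in range(10):                                  # Lloyd's iterations
    d = np.linalg.norm(hidden[:, None] - centers[None], axis=2)
    assign = d.argmin(axis=1)
    centers = np.stack([hidden[assign == c].mean(axis=0) for c in (0, 1)])

# Label one representative per cluster, then propagate to the cluster:
cluster_names = {assign[0]: "normal", assign[50]: "anomalous"}
labels = [cluster_names[a] for a in assign]
```

Labeling two representative points labels all 100, which is the payoff of clustering the learned representation rather than the raw data.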
• Autoencoders can be trained to identify causes of certain kinds of behavior
• “Spikes” in reconstruction error on time series can be used to detect problems in infrastructure as well as in network monitoring (dropped connections, unusually high latency)
• Use KNN and VP-trees to sample the hidden activations learned from the neural net to interactively label
Root Cause Analysis
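Spike detection on a reconstruction-error time series can be sketched like this (the window size and 3-sigma threshold are illustrative choices, not the talk's settings): flag a point when it exceeds the trailing window's mean by several standard deviations.

```python
import numpy as np

def spike_indices(errors, window=20, n_sigma=3.0):
    """Indices where the error jumps above the trailing-window baseline."""
    spikes = []
    for t in range(window, len(errors)):
        base = errors[t - window:t]
        if errors[t] > base.mean() + n_sigma * base.std():
            spikes.append(t)
    return spikes

rng = np.random.default_rng(0)
errors = rng.normal(1.0, 0.1, 200)   # steady background reconstruction error
errors[120] = 5.0                    # e.g. a dropped-connection incident
spikes = spike_indices(errors)
```

Matching the timestamps of such spikes against infrastructure events (deployments, link failures) is one way to move from detection toward root cause.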
• The system will not use redundant environments, since disaster recovery is not a requirement.
• Inside an environment the system will have redundant hardware and can tolerate the loss of one GPU node and one App node without service degradation.
• This system does not employ a remote backup strategy because all data is ephemeral and can be recreated from the data on S3.
Design Considerations for Production
THANK YOU