
17th IEEE International Conference on Machine Learning and Applications

Unsupervised Anomaly Detection in Energy Time Series Data

using Variational Recurrent Autoencoders with Attention

João Pereira & Margarida Silveira

Signal and Image Processing Group, Institute for Systems and Robotics

Instituto Superior Técnico (Lisbon, Portugal)

Orlando, Florida, USA

December 19th, 2018

Introduction

Anomaly detection is about finding patterns in data that do not conform to expected or normal behaviour.

[Figure: application domains of anomaly detection]
- Communications: network intrusion, network KPI monitoring
- Robotics: assisted feeding, automated inspection
- Energy: smart maintenance, fault detection
- Healthcare: medical diagnostics, patient monitoring
- Security: video surveillance, fraud detection


Main Challenges

- Most data in the world are unlabelled: in a dataset D = {(x^(i), y*^(i))}_{i=1}^N, the anomaly labels y*^(i) are typically unavailable.

- Annotating large datasets is difficult, time-consuming and expensive.

[Figure: example series with normal and anomalous segments]

- Time series have temporal structure/dependencies (see the windowing sketch below): x = (x_1, x_2, ..., x_T), x_t ∈ R^{d_x}.
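The windowing step this implies is straightforward. A minimal sketch in Python; the window length and stride here are illustrative choices (the experiments later in the talk use T = 32 on 15-minute data):

```python
import numpy as np

def sliding_windows(series: np.ndarray, T: int, stride: int = 1) -> np.ndarray:
    """Segment a long (N, d_x) series into windows of shape (T, d_x)."""
    if series.ndim == 1:
        series = series[:, None]              # promote univariate data to (N, 1)
    starts = range(0, series.shape[0] - T + 1, stride)
    return np.stack([series[s:s + T] for s in starts])

# Example: one week of 15-minute readings (96 samples per day), univariate
week = np.random.rand(7 * 96)
X = sliding_windows(week, T=32, stride=16)    # (num_windows, 32, 1)
```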

The Principle in a Nutshell

- Based on a Variational Autoencoder [1, 2]

[Figure: VAE pipeline — input time series x = (x_1, x_2, ..., x_T), x_t ∈ R^{d_x} → encoder q_φ(z|x) → latent space (z ∈ R^{d_z}) with parameters (µ_z, σ_z) and sample z = µ_z + σ_z ⊙ ε → decoder p_θ(x|z) → reconstruction parameters (µ_xt, b_xt), t = 1, ..., T]

- Train a VAE on data with mostly normal patterns;
- It reconstructs normal data well, while it fails to reconstruct anomalous data;
- The quality of the reconstructions is used as anomaly score (a minimal sketch follows).

[1] Kingma & Welling, Auto-Encoding Variational Bayes, ICLR'14
[2] Rezende et al., Stochastic Backpropagation and Approximate Inference in Deep Generative Models, ICML'14

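The detection rule this principle implies can be sketched in a few lines, assuming some trained model exposing a `reconstruct` function; the percentile-based threshold is an illustrative choice, not something specified on the slide:

```python
import numpy as np

def anomaly_scores(X: np.ndarray, reconstruct) -> np.ndarray:
    """L1 reconstruction error per window; X has shape (num_windows, T, d_x)."""
    X_hat = reconstruct(X)                    # assumed to return an array shaped like X
    return np.abs(X - X_hat).sum(axis=(1, 2))

def detect(scores_val: np.ndarray, scores_test: np.ndarray, q: float = 99.0) -> np.ndarray:
    """Flag windows whose score exceeds the q-th percentile of validation scores."""
    tau = np.percentile(scores_val, q)
    return scores_test > tau
```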

Proposed Approach

Two-stage pipeline: a Reconstruction Model followed by Detection.

Reconstruction Model

[Figure: architecture — the input sequence x_1, x_2, ..., x_T (x_t ∈ R^{d_x}) is corrupted with additive noise (+n) and fed to an encoder Bi-LSTM; a variational layer maps the encoder states to (µ_z, σ_z) and the latent code z; a variational self-attention mechanism produces context vectors c_1, ..., c_T from the encoder states via attention weights a_tj and a Linear + SoftPlus layer giving (µ_ct, σ_ct); a decoder Bi-LSTM outputs the reconstruction parameters (µ_xt, b_xt), t = 1, ..., T]

Corruption — denoising autoencoding criterion. Additive Gaussian noise:
x̃ = x + n, n ∼ Normal(0, σ_n² I)
Vincent et al., Extracting and Composing Robust Features with Denoising Autoencoders, ICML'08
Im et al., Denoising Criterion for Variational Auto-Encoding Framework, AAAI'17

Encoder — learning temporal dependencies. Bidirectional Long Short-Term Memory network:
h_t = [→h_t; ←h_t]
- 256 units, 128 in each direction
- Sparse regularization: Ω(z) = λ Σ_{i=1}^{d_z} |z_i|
Hochreiter & Schmidhuber, Long Short-Term Memory, Neural Computation'97
Graves et al., Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition, ICANN'05

Variational latent space. The variational parameters are computed by neural networks,
(µ_z, σ_z) = Encoder(x̃)
and z is sampled from the approximate posterior q_φ(z|x) via the reparameterization trick:
z = µ_z + σ_z ⊙ ε, ε ∼ Normal(0, I)
Kingma & Welling, Auto-Encoding Variational Bayes, ICLR'14

Variational self-attention mechanism. Combines self-attention with variational inference:
c_t^det = Σ_{j=1}^T a_tj h_j, (µ_ct, σ_ct) = NN(c_t^det), c_t ∼ Normal(µ_ct, σ_ct² I)
Vaswani et al., Attention is All You Need, NIPS'17

Decoder. Another Bi-LSTM, run over the latent code and context vectors:
Decoder ≡ Bi-LSTM([z; c])
Output: reconstruction parameters (µ_xt, b_xt), t = 1, ..., T, under a Laplace log-likelihood:
log p_θ(x_t|z, c) ∝ −‖x_t − µ_xt‖₁
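A minimal PyTorch sketch of the corruption, encoder and variational layer described above; the exact layer sizes and the use of the last hidden state as sequence summary are assumptions (the slide only fixes 128 units per direction):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Denoising Bi-LSTM encoder producing (mu_z, sigma_z) and a sample z."""
    def __init__(self, d_x: int, d_z: int, hidden: int = 128, sigma_n: float = 0.1):
        super().__init__()
        self.sigma_n = sigma_n
        self.lstm = nn.LSTM(d_x, hidden, batch_first=True, bidirectional=True)
        self.to_mu = nn.Linear(2 * hidden, d_z)
        self.to_sigma = nn.Linear(2 * hidden, d_z)

    def forward(self, x):                                 # x: (batch, T, d_x)
        x_tilde = x + self.sigma_n * torch.randn_like(x)  # corruption: x~ = x + n
        h, _ = self.lstm(x_tilde)                         # h_t = [fwd; bwd], 256-dim
        h_T = h[:, -1]                                    # summary state (an assumption)
        mu_z = self.to_mu(h_T)
        sigma_z = F.softplus(self.to_sigma(h_T))          # keep the std positive
        z = mu_z + sigma_z * torch.randn_like(sigma_z)    # reparameterization trick
        return z, mu_z, sigma_z, h                        # h is reused by the attention
```

The sparse regularizer Ω(z) = λ Σ|z_i| would simply be added to the training loss as `lam * z.abs().sum(dim=1).mean()`.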

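A sketch of the variational self-attention mechanism; the scaled dot-product score is an assumption borrowed from Vaswani et al., since the slide only specifies the weighted sum and the Linear/SoftPlus variational layer:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalSelfAttention(nn.Module):
    """Stochastic context vectors c_t from encoder states h."""
    def __init__(self, d_h: int):
        super().__init__()
        self.to_mu = nn.Linear(d_h, d_h)
        self.to_sigma = nn.Linear(d_h, d_h)

    def forward(self, h):                                    # h: (batch, T, d_h)
        scores = h @ h.transpose(1, 2) / math.sqrt(h.size(-1))
        a = F.softmax(scores, dim=-1)                        # weights a_tj, rows sum to 1
        c_det = a @ h                                        # c_t^det = sum_j a_tj h_j
        mu_c = self.to_mu(c_det)                             # Linear
        sigma_c = F.softplus(self.to_sigma(c_det))           # SoftPlus
        c = mu_c + sigma_c * torch.randn_like(sigma_c)       # c_t ~ N(mu_ct, sigma_ct^2 I)
        return c, mu_c, sigma_c, a                           # a is the attention map
```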
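And a sketch of the decoder with its Laplace output; how z is combined with the context vectors is not shown on the slide, so broadcasting z to every timestep before the concatenation [z; c] is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Bi-LSTM over [z; c] emitting Laplace parameters (mu_xt, b_xt)."""
    def __init__(self, d_x: int, d_z: int, d_c: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(d_z + d_c, hidden, batch_first=True, bidirectional=True)
        self.to_mu = nn.Linear(2 * hidden, d_x)
        self.to_b = nn.Linear(2 * hidden, d_x)

    def forward(self, z, c):                       # z: (batch, d_z), c: (batch, T, d_c)
        z_seq = z.unsqueeze(1).expand(-1, c.size(1), -1)
        h, _ = self.lstm(torch.cat([z_seq, c], dim=-1))
        mu_x = self.to_mu(h)
        b_x = F.softplus(self.to_b(h)) + 1e-6      # positive Laplace scale
        return mu_x, b_x

def laplace_nll(x, mu_x, b_x):
    """Negative Laplace log-likelihood; for fixed b it reduces to the L1 error."""
    return (torch.abs(x - mu_x) / b_x + torch.log(2 * b_x)).sum(dim=(1, 2))
```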

Loss Function

L(θ, φ; x^(n)) =
  − E_{z ∼ q_φ(z|x^(n)), c_t ∼ q_φ^a(c_t|x^(n))} [ log p_θ(x^(n) | z, c) ]      (reconstruction term)
  + λ_KL [ D_KL( q_φ(z|x^(n)) ‖ p_θ(z) )                                        (latent space KL loss)
         + η Σ_{t=1}^T D_KL( q_φ^a(c_t|x^(n)) ‖ p_θ(c_t) ) ]                    (attention KL loss)

D_KL denotes the Kullback-Leibler divergence.
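With diagonal Gaussian posteriors and standard Normal priors (the priors are an assumption consistent with the VAE references), both KL terms have the usual closed form; a sketch reusing `laplace_nll` from the decoder sketch above:

```python
import torch

def gaussian_kl(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over all non-batch dims."""
    kl = 0.5 * (mu.pow(2) + sigma.pow(2) - 2.0 * torch.log(sigma) - 1.0)
    return kl.flatten(1).sum(dim=1)

def loss_fn(x, mu_x, b_x, mu_z, sigma_z, mu_c, sigma_c,
            lambda_kl: float = 1.0, eta: float = 1.0):
    recon = laplace_nll(x, mu_x, b_x)   # reconstruction term
    kl_z = gaussian_kl(mu_z, sigma_z)   # latent space KL loss
    kl_c = gaussian_kl(mu_c, sigma_c)   # attention KL loss, summed over t
    return (recon + lambda_kl * (kl_z + eta * kl_c)).mean()
```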

Training Framework

Optimization & Regularization
- About 270k parameters to optimize
- AMSGrad optimizer [3]
- Xavier weight initialization [4]
- Denoising autoencoding criterion [5]
- Sparse regularization in the encoder Bi-LSTM [6]
- KL cost annealing [7]
- Gradient clipping [8]

Training executed on a single GPU (NVIDIA GTX 1080 Ti).

[3] Reddi, Kale & Kumar, On the Convergence of Adam and Beyond, ICLR'18
[4] Glorot & Bengio, Understanding the Difficulty of Training Deep Feedforward Neural Networks, AISTATS'10
[5] Im et al., Denoising Criterion for Variational Auto-Encoding Framework, AAAI'17
[6] Arpit et al., Why Regularized Auto-Encoders Learn Sparse Representation?, ICML'16
[7] Bowman, Vinyals et al., Generating Sentences from a Continuous Space, SIGNLL'16
[8] Pascanu et al., On the Difficulty of Training Recurrent Neural Networks, ICML'13
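A sketch of this recipe in PyTorch; the learning rate, clipping norm and annealing horizon are illustrative values, and `loss_fn` is assumed to close over the model and accept the current KL weight:

```python
import torch
import torch.nn as nn

def init_xavier(model: nn.Module):
    """Xavier/Glorot initialization for every weight matrix."""
    for p in model.parameters():
        if p.dim() >= 2:
            nn.init.xavier_uniform_(p)

def train(model, loader, loss_fn, epochs: int = 50, anneal_steps: int = 2000):
    init_xavier(model)
    # AMSGrad variant of Adam (Reddi, Kale & Kumar)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
    step = 0
    for _ in range(epochs):
        for x in loader:
            lambda_kl = min(1.0, step / anneal_steps)   # KL cost annealing
            loss = loss_fn(x, lambda_kl=lambda_kl)
            opt.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # gradient clipping
            opt.step()
            step += 1
```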

Proposed Approach

Second stage of the pipeline: Detection.

Anomaly Scores

- Use the reconstruction error as anomaly score:
  Anomaly Score = E_{z_l ∼ q_φ(z|x)} [ ‖ x − E[p_θ(x|z_l)] ‖₁ ],  where E[p_θ(x|z_l)] = µ_x

- Take the variability of the reconstructions into account:
  Anomaly Score = −E_{z_l ∼ q_φ(z|x)} [ log p(x|z_l) ]   ("reconstruction probability")

Both expectations are estimated by sampling z_l ∼ q_φ(z|x) = Normal(µ_z, σ_z² I).
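A Monte Carlo sketch of both scores, assuming a `model` that draws fresh samples z_l and c_l on every call and returns the Laplace parameters:

```python
import torch

@torch.no_grad()
def mc_scores(model, x, L: int = 16):
    """Returns (reconstruction error, negative log reconstruction probability)."""
    rec_err = 0.0
    nll = 0.0
    for _ in range(L):
        mu_x, b_x = model(x)                                   # one posterior sample
        rec_err = rec_err + torch.abs(x - mu_x).sum(dim=(1, 2))
        nll = nll + (torch.abs(x - mu_x) / b_x + torch.log(2 * b_x)).sum(dim=(1, 2))
    return rec_err / L, nll / L
```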

Experiments & Results

Time Series Data

Solar PV Generation

[Figure: normalized energy over four days (samples 0-384); production in a day without clouds shows a smooth bell-shaped daily profile]

- Provided by [partner logo in original slide]
- Recorded every 15 min (96 samples per day)
- Data normalized to the installed capacity
- Daily seasonality
- Unlabelled

Variational Latent Space

[Figure: the z-space in 2D for normal training data (X_train^normal), visualized with t-SNE and PCA]

- T = 32 (< 96), d_z = 3
- Online mode
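A sketch of such a projection, assuming a matrix Z of latent codes (one row per window) collected from the encoder:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def plot_latent(Z: np.ndarray):
    """Side-by-side 2-D views of Z (num_windows, d_z) via t-SNE and PCA."""
    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    for ax, pts, title in zip(axes,
                              (TSNE(n_components=2).fit_transform(Z),
                               PCA(n_components=2).fit_transform(Z)),
                              ("t-SNE", "PCA")):
        ax.scatter(pts[:, 0], pts[:, 1], s=5)
        ax.set_title(title)
    plt.show()
```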

Finding Anomalies in the Latent Space

[Figure: latent space Z with clusters labelled Cloudy Day, Snow, Spike, Inverter Fault, Brief Shading, Normal]

Anomaly Scores

[Figure: six example days — Ground Truth, Brief Shading, Inverter Fault, Spike, Snow, Cloudy Day — each energy curve annotated with two colour bars, the reconstruction error (top bar) and the negative reconstruction probability (bottom bar), coloured from low to high]

Attention Visualization

[Figure: attention map construction — each context vector c_t is a weighted sum of the encoder hidden states for x_1, x_2, ..., x_T with weights a_t1, a_t2, ..., a_tT; stacking the weight rows for t = 1, ..., T yields a T×T attention map]

[Figure: example attention maps (Output Timestep [h] vs. Input Timestep [h], attention weights on a log colour scale from 10⁻³ to 10⁰) shown alongside the corresponding daily energy curves (Energy vs. Time [h])]
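A plotting sketch for such a map, given the T×T weight matrix returned by the attention module sketched earlier; the logarithmic colour scale mirrors the 10⁻³–10⁰ range on the slide:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

def plot_attention_map(A: np.ndarray, hours_per_step: float = 0.25):
    """Heatmap of attention weights (rows: output timestep, cols: input timestep)."""
    span = A.shape[0] * hours_per_step
    plt.imshow(A, norm=LogNorm(vmin=1e-3, vmax=1.0),
               extent=(0, span, span, 0), cmap="viridis")
    plt.xlabel("Input Timestep [h]")
    plt.ylabel("Output Timestep [h]")
    plt.colorbar(label="Attention Weights")
    plt.show()
```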

Conclusions & Future Work

- Effective at detecting anomalies in time series data;
- Unsupervised;
- Suitable for both univariate and multivariate data;
- Efficient: inference and anomaly score computation are fast;
- General: works with other kinds of sequential data (e.g., text, videos).

Future work:
- Exploit the usefulness of the attention maps for detection;
- Make it robust to changes of the normal pattern over time.


Acknowledgements

Thank you for your attention!

mail@joao-pereira.pt

www.joao-pereira.pt
