
17th IEEE International Conference on Machine Learning and Applications

Unsupervised Anomaly Detection in Energy Time Series Data

using Variational Recurrent Autoencoders with Attention

João Pereira & Margarida Silveira

Signal and Image Processing Group, Institute for Systems and Robotics

Instituto Superior Técnico (Lisbon, Portugal)

Orlando, Florida, USA

December 19th, 2018

Introduction

Anomaly detection is about finding patterns in data that do not conform to expected or normal behaviour.

[Figure: application domains of anomaly detection]
- Communications: network intrusion, network KPI monitoring
- Robotics: assisted feeding, automated inspection
- Energy: smart maintenance, fault detection
- Healthcare: medical diagnostics, patient monitoring
- Security: video surveillance, fraud detection


Main Challenges

- Most data in the world are unlabelled: in a dataset D = {(x^(i), y*^(i))}_{i=1}^N, the anomaly labels y*^(i) are typically unavailable.

- Annotating large datasets is difficult, time-consuming and expensive.

[Figure: example series with normal and anomalous segments]

- Time series have temporal structure/dependencies (see the windowing sketch below): x = (x_1, x_2, ..., x_T), x_t ∈ R^{d_x}.
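The windowing step this implies is straightforward. A minimal sketch in Python; the window length and stride here are illustrative choices (the experiments later in the talk use T = 32 on 15-minute data):

```python
import numpy as np

def sliding_windows(series: np.ndarray, T: int, stride: int = 1) -> np.ndarray:
    """Segment a long (N, d_x) series into windows of shape (T, d_x)."""
    if series.ndim == 1:
        series = series[:, None]              # promote univariate data to (N, 1)
    starts = range(0, series.shape[0] - T + 1, stride)
    return np.stack([series[s:s + T] for s in starts])

# Example: one week of 15-minute readings (96 samples per day), univariate
week = np.random.rand(7 * 96)
X = sliding_windows(week, T=32, stride=16)    # (num_windows, 32, 1)
```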

The Principle in a Nutshell

- Based on a Variational Autoencoder [1, 2]

[Figure: VAE pipeline — input time series x = (x_1, x_2, ..., x_T), x_t ∈ R^{d_x} → encoder q_φ(z|x) → latent space (z ∈ R^{d_z}) with parameters (µ_z, σ_z) and sample z = µ_z + σ_z ⊙ ε → decoder p_θ(x|z) → reconstruction parameters (µ_xt, b_xt), t = 1, ..., T]

- Train a VAE on data with mostly normal patterns;
- It reconstructs normal data well, while it fails to reconstruct anomalous data;
- The quality of the reconstructions is used as anomaly score (a minimal sketch follows).

[1] Kingma & Welling, Auto-Encoding Variational Bayes, ICLR'14
[2] Rezende et al., Stochastic Backpropagation and Approximate Inference in Deep Generative Models, ICML'14

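The detection rule this principle implies can be sketched in a few lines, assuming some trained model exposing a `reconstruct` function; the percentile-based threshold is an illustrative choice, not something specified on the slide:

```python
import numpy as np

def anomaly_scores(X: np.ndarray, reconstruct) -> np.ndarray:
    """L1 reconstruction error per window; X has shape (num_windows, T, d_x)."""
    X_hat = reconstruct(X)                    # assumed to return an array shaped like X
    return np.abs(X - X_hat).sum(axis=(1, 2))

def detect(scores_val: np.ndarray, scores_test: np.ndarray, q: float = 99.0) -> np.ndarray:
    """Flag windows whose score exceeds the q-th percentile of validation scores."""
    tau = np.percentile(scores_val, q)
    return scores_test > tau
```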

Proposed Approach

Two-stage pipeline: a Reconstruction Model followed by Detection.

Reconstruction Model

[Figure: architecture — the input sequence x_1, x_2, ..., x_T (x_t ∈ R^{d_x}) is corrupted with additive noise (+n) and fed to an encoder Bi-LSTM; a variational layer maps the encoder states to (µ_z, σ_z) and the latent code z; a variational self-attention mechanism produces context vectors c_1, ..., c_T from the encoder states via attention weights a_tj and a Linear + SoftPlus layer giving (µ_ct, σ_ct); a decoder Bi-LSTM outputs the reconstruction parameters (µ_xt, b_xt), t = 1, ..., T]

Corruption — denoising autoencoding criterion. Additive Gaussian noise:
x̃ = x + n, n ∼ Normal(0, σ_n² I)
Vincent et al., Extracting and Composing Robust Features with Denoising Autoencoders, ICML'08
Im et al., Denoising Criterion for Variational Auto-Encoding Framework, AAAI'17

Encoder — learning temporal dependencies. Bidirectional Long Short-Term Memory network:
h_t = [→h_t; ←h_t]
- 256 units, 128 in each direction
- Sparse regularization: Ω(z) = λ Σ_{i=1}^{d_z} |z_i|
Hochreiter & Schmidhuber, Long Short-Term Memory, Neural Computation'97
Graves et al., Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition, ICANN'05

Variational latent space. The variational parameters are computed by neural networks,
(µ_z, σ_z) = Encoder(x̃)
and z is sampled from the approximate posterior q_φ(z|x) via the reparameterization trick:
z = µ_z + σ_z ⊙ ε, ε ∼ Normal(0, I)
Kingma & Welling, Auto-Encoding Variational Bayes, ICLR'14

Variational self-attention mechanism. Combines self-attention with variational inference:
c_t^det = Σ_{j=1}^T a_tj h_j, (µ_ct, σ_ct) = NN(c_t^det), c_t ∼ Normal(µ_ct, σ_ct² I)
Vaswani et al., Attention is All You Need, NIPS'17

Decoder. Another Bi-LSTM, run over the latent code and context vectors:
Decoder ≡ Bi-LSTM([z; c])
Output: reconstruction parameters (µ_xt, b_xt), t = 1, ..., T, under a Laplace log-likelihood:
log p_θ(x_t|z, c) ∝ −‖x_t − µ_xt‖₁
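A minimal PyTorch sketch of the corruption, encoder and variational layer described above; the exact layer sizes and the use of the last hidden state as sequence summary are assumptions (the slide only fixes 128 units per direction):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Denoising Bi-LSTM encoder producing (mu_z, sigma_z) and a sample z."""
    def __init__(self, d_x: int, d_z: int, hidden: int = 128, sigma_n: float = 0.1):
        super().__init__()
        self.sigma_n = sigma_n
        self.lstm = nn.LSTM(d_x, hidden, batch_first=True, bidirectional=True)
        self.to_mu = nn.Linear(2 * hidden, d_z)
        self.to_sigma = nn.Linear(2 * hidden, d_z)

    def forward(self, x):                                 # x: (batch, T, d_x)
        x_tilde = x + self.sigma_n * torch.randn_like(x)  # corruption: x~ = x + n
        h, _ = self.lstm(x_tilde)                         # h_t = [fwd; bwd], 256-dim
        h_T = h[:, -1]                                    # summary state (an assumption)
        mu_z = self.to_mu(h_T)
        sigma_z = F.softplus(self.to_sigma(h_T))          # keep the std positive
        z = mu_z + sigma_z * torch.randn_like(sigma_z)    # reparameterization trick
        return z, mu_z, sigma_z, h                        # h is reused by the attention
```

The sparse regularizer Ω(z) = λ Σ|z_i| would simply be added to the training loss as `lam * z.abs().sum(dim=1).mean()`.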

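A sketch of the variational self-attention mechanism; the scaled dot-product score is an assumption borrowed from Vaswani et al., since the slide only specifies the weighted sum and the Linear/SoftPlus variational layer:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalSelfAttention(nn.Module):
    """Stochastic context vectors c_t from encoder states h."""
    def __init__(self, d_h: int):
        super().__init__()
        self.to_mu = nn.Linear(d_h, d_h)
        self.to_sigma = nn.Linear(d_h, d_h)

    def forward(self, h):                                    # h: (batch, T, d_h)
        scores = h @ h.transpose(1, 2) / math.sqrt(h.size(-1))
        a = F.softmax(scores, dim=-1)                        # weights a_tj, rows sum to 1
        c_det = a @ h                                        # c_t^det = sum_j a_tj h_j
        mu_c = self.to_mu(c_det)                             # Linear
        sigma_c = F.softplus(self.to_sigma(c_det))           # SoftPlus
        c = mu_c + sigma_c * torch.randn_like(sigma_c)       # c_t ~ N(mu_ct, sigma_ct^2 I)
        return c, mu_c, sigma_c, a                           # a is the attention map
```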
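And a sketch of the decoder with its Laplace output; how z is combined with the context vectors is not shown on the slide, so broadcasting z to every timestep before the concatenation [z; c] is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Bi-LSTM over [z; c] emitting Laplace parameters (mu_xt, b_xt)."""
    def __init__(self, d_x: int, d_z: int, d_c: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(d_z + d_c, hidden, batch_first=True, bidirectional=True)
        self.to_mu = nn.Linear(2 * hidden, d_x)
        self.to_b = nn.Linear(2 * hidden, d_x)

    def forward(self, z, c):                       # z: (batch, d_z), c: (batch, T, d_c)
        z_seq = z.unsqueeze(1).expand(-1, c.size(1), -1)
        h, _ = self.lstm(torch.cat([z_seq, c], dim=-1))
        mu_x = self.to_mu(h)
        b_x = F.softplus(self.to_b(h)) + 1e-6      # positive Laplace scale
        return mu_x, b_x

def laplace_nll(x, mu_x, b_x):
    """Negative Laplace log-likelihood; for fixed b it reduces to the L1 error."""
    return (torch.abs(x - mu_x) / b_x + torch.log(2 * b_x)).sum(dim=(1, 2))
```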

Loss Function

L(θ, φ; x^(n)) =
  − E_{z ∼ q_φ(z|x^(n)), c_t ∼ q_φ^a(c_t|x^(n))} [ log p_θ(x^(n) | z, c) ]      (reconstruction term)
  + λ_KL [ D_KL( q_φ(z|x^(n)) ‖ p_θ(z) )                                        (latent space KL loss)
         + η Σ_{t=1}^T D_KL( q_φ^a(c_t|x^(n)) ‖ p_θ(c_t) ) ]                    (attention KL loss)

D_KL denotes the Kullback-Leibler divergence.
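With diagonal Gaussian posteriors and standard Normal priors (the priors are an assumption consistent with the VAE references), both KL terms have the usual closed form; a sketch reusing `laplace_nll` from the decoder sketch above:

```python
import torch

def gaussian_kl(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over all non-batch dims."""
    kl = 0.5 * (mu.pow(2) + sigma.pow(2) - 2.0 * torch.log(sigma) - 1.0)
    return kl.flatten(1).sum(dim=1)

def loss_fn(x, mu_x, b_x, mu_z, sigma_z, mu_c, sigma_c,
            lambda_kl: float = 1.0, eta: float = 1.0):
    recon = laplace_nll(x, mu_x, b_x)   # reconstruction term
    kl_z = gaussian_kl(mu_z, sigma_z)   # latent space KL loss
    kl_c = gaussian_kl(mu_c, sigma_c)   # attention KL loss, summed over t
    return (recon + lambda_kl * (kl_z + eta * kl_c)).mean()
```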

Training Framework

Optimization & Regularization
- About 270k parameters to optimize
- AMSGrad optimizer [3]
- Xavier weight initialization [4]
- Denoising autoencoding criterion [5]
- Sparse regularization in the encoder Bi-LSTM [6]
- KL cost annealing [7]
- Gradient clipping [8]

Training executed on a single GPU (NVIDIA GTX 1080 Ti).

[3] Reddi, Kale & Kumar, On the Convergence of Adam and Beyond, ICLR'18
[4] Glorot & Bengio, Understanding the Difficulty of Training Deep Feedforward Neural Networks, AISTATS'10
[5] Im et al., Denoising Criterion for Variational Auto-Encoding Framework, AAAI'17
[6] Arpit et al., Why Regularized Auto-Encoders Learn Sparse Representation?, ICML'16
[7] Bowman, Vinyals et al., Generating Sentences from a Continuous Space, SIGNLL'16
[8] Pascanu et al., On the Difficulty of Training Recurrent Neural Networks, ICML'13
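A sketch of this recipe in PyTorch; the learning rate, clipping norm and annealing horizon are illustrative values, and `loss_fn` is assumed to close over the model and accept the current KL weight:

```python
import torch
import torch.nn as nn

def init_xavier(model: nn.Module):
    """Xavier/Glorot initialization for every weight matrix."""
    for p in model.parameters():
        if p.dim() >= 2:
            nn.init.xavier_uniform_(p)

def train(model, loader, loss_fn, epochs: int = 50, anneal_steps: int = 2000):
    init_xavier(model)
    # AMSGrad variant of Adam (Reddi, Kale & Kumar)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
    step = 0
    for _ in range(epochs):
        for x in loader:
            lambda_kl = min(1.0, step / anneal_steps)   # KL cost annealing
            loss = loss_fn(x, lambda_kl=lambda_kl)
            opt.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # gradient clipping
            opt.step()
            step += 1
```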

Proposed Approach

Second stage of the pipeline: Detection.

Anomaly Scores

- Use the reconstruction error as anomaly score:
  Anomaly Score = E_{z_l ∼ q_φ(z|x)} [ ‖ x − E[p_θ(x|z_l)] ‖₁ ],  where E[p_θ(x|z_l)] = µ_x

- Take the variability of the reconstructions into account:
  Anomaly Score = −E_{z_l ∼ q_φ(z|x)} [ log p(x|z_l) ]   ("reconstruction probability")

Both expectations are estimated by sampling z_l ∼ q_φ(z|x) = Normal(µ_z, σ_z² I).
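A Monte Carlo sketch of both scores, assuming a `model` that draws fresh samples z_l and c_l on every call and returns the Laplace parameters:

```python
import torch

@torch.no_grad()
def mc_scores(model, x, L: int = 16):
    """Returns (reconstruction error, negative log reconstruction probability)."""
    rec_err = 0.0
    nll = 0.0
    for _ in range(L):
        mu_x, b_x = model(x)                                   # one posterior sample
        rec_err = rec_err + torch.abs(x - mu_x).sum(dim=(1, 2))
        nll = nll + (torch.abs(x - mu_x) / b_x + torch.log(2 * b_x)).sum(dim=(1, 2))
    return rec_err / L, nll / L
```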

Experiments & Results

Time Series Data

Solar PV Generation

[Figure: normalized energy over four days (samples 0-384); production in a day without clouds shows a smooth bell-shaped daily profile]

- Provided by [partner logo in original slide]
- Recorded every 15 min (96 samples per day)
- Data normalized to the installed capacity
- Daily seasonality
- Unlabelled

Variational Latent Space

[Figure: the z-space in 2D for normal training data (X_train^normal), visualized with t-SNE and PCA]

- T = 32 (< 96), d_z = 3
- Online mode
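A sketch of such a projection, assuming a matrix Z of latent codes (one row per window) collected from the encoder:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def plot_latent(Z: np.ndarray):
    """Side-by-side 2-D views of Z (num_windows, d_z) via t-SNE and PCA."""
    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    for ax, pts, title in zip(axes,
                              (TSNE(n_components=2).fit_transform(Z),
                               PCA(n_components=2).fit_transform(Z)),
                              ("t-SNE", "PCA")):
        ax.scatter(pts[:, 0], pts[:, 1], s=5)
        ax.set_title(title)
    plt.show()
```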

Finding Anomalies in the Latent Space

[Figure: latent space Z with clusters labelled Cloudy Day, Snow, Spike, Inverter Fault, Brief Shading, Normal]

Anomaly Scores

[Figure: six example days — Ground Truth, Brief Shading, Inverter Fault, Spike, Snow, Cloudy Day — each energy curve annotated with two colour bars, the reconstruction error (top bar) and the negative reconstruction probability (bottom bar), coloured from low to high]

Attention Visualization

[Figure: attention map construction — each context vector c_t is a weighted sum of the encoder hidden states for x_1, x_2, ..., x_T with weights a_t1, a_t2, ..., a_tT; stacking the weight rows for t = 1, ..., T yields a T×T attention map]

[Figure: example attention maps (Output Timestep [h] vs. Input Timestep [h], attention weights on a log colour scale from 10⁻³ to 10⁰) shown alongside the corresponding daily energy curves (Energy vs. Time [h])]
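A plotting sketch for such a map, given the T×T weight matrix returned by the attention module sketched earlier; the logarithmic colour scale mirrors the 10⁻³–10⁰ range on the slide:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

def plot_attention_map(A: np.ndarray, hours_per_step: float = 0.25):
    """Heatmap of attention weights (rows: output timestep, cols: input timestep)."""
    span = A.shape[0] * hours_per_step
    plt.imshow(A, norm=LogNorm(vmin=1e-3, vmax=1.0),
               extent=(0, span, span, 0), cmap="viridis")
    plt.xlabel("Input Timestep [h]")
    plt.ylabel("Output Timestep [h]")
    plt.colorbar(label="Attention Weights")
    plt.show()
```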

Conclusions & Future Work

- Effective at detecting anomalies in time series data;
- Unsupervised;
- Suitable for both univariate and multivariate data;
- Efficient: inference and anomaly score computation are fast;
- General: works with other kinds of sequential data (e.g., text, videos).

Future work:
- Exploit the usefulness of the attention maps for detection;
- Make it robust to changes of the normal pattern over time.


Acknowledgements

Thank you for your attention!

mail@joao-pereira.pt

www.joao-pereira.pt
