Streaming Pattern Discovery in Multiple Time-Series Jimeng Sun Spiros Papadimitriou Christos Faloutsos PARALLEL DATA LABORATORY Carnegie Mellon University


Streaming Pattern Discovery in Multiple Time-Series

Jimeng Sun

Spiros Papadimitriou

Christos Faloutsos

PARALLEL DATA LABORATORY
Carnegie Mellon University


<your name here> © Apr 21, 2023 http://www.pdl.cmu.edu/ 2

Motivation

• Co-evolving time series (data streams) appear in many different applications, e.g.:
  • Disk access traffic in network clusters
  • Internet flow traffic in a network
  • Temperatures in a large building
  • Chlorine concentration in a water distribution network
• Values are typically correlated
• Would be very useful if we could summarize them on the fly


Example

[Figure: water distribution network, normal operation. Chlorine concentrations vs. time across Phase 1, Phase 2 and Phase 3, for sensors near the leak and sensors away from the leak.]


Goals

• Discover “hidden” (latent) variables for:
  • Summarization of main trends for users
  • Efficient forecasting, spotting outliers/anomalies
• Incremental, real-time computation
• Limited memory requirements


Example: chlorine measurements

[Figure: water distribution network, normal operation followed by a major leak. Chlorine concentrations across Phase 1, Phase 2 and Phase 3, for sensors near the leak and sensors away from the leak.]


Example: hidden variable

We would like to discover a few “hidden (latent) variables” that summarize the key trends

[Figure: Phase 1. Actual measurements (n streams) of chlorine concentrations and the k hidden variable(s), k = 1.]


Example: hidden variable tracking

[Figure: Phase 1 and Phase 2. Actual measurements (n streams) of chlorine concentrations and the k hidden variable(s), k = 2.]


Example: hidden variable tracking

[Figure: Phase 1, Phase 2 and Phase 3. Actual measurements (n streams) of chlorine concentrations and the k hidden variable(s), k = 1.]


Method outline

• Step 1: How to capture correlations?

• Step 2: How to do it incrementally, when we have a very large number of points?

• Step 3: How to dynamically adjust the number of hidden variables?


1. How to capture correlations?

• First sensor

[Figure: temperature T1 vs. time, ranging between 20°C and 30°C.]


1. How to capture correlations?

• First sensor
• Second sensor

[Figure: temperature T2 vs. time, ranging between 20°C and 30°C.]


1. How to capture correlations

• Correlations:
• Let’s take a closer look at the first three value-pairs…

[Figure: scatter plot of temperature T2 against temperature T1, both axes from 20°C to 30°C.]


1. How to capture correlations

• First three lie (almost) on a line in the space of value-pairs…
  • O(n) numbers for the slope, and
  • One number for each value-pair (offset on line)
• offset = “hidden variable”

[Figure: scatter plot of temperature T2 against temperature T1, both axes from 20°C to 30°C, with the points for time=1, time=2 and time=3 close to a line.]
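To make the "slope plus one offset per value-pair" idea concrete, here is a minimal sketch in Python with NumPy. The readings are made up for illustration, and the sketch assumes the line passes through the origin (no mean-centering); `fit_direction` is a hypothetical helper name, not something from the talk.

```python
import numpy as np

def fit_direction(points):
    """Unit direction of the best-fit line through the origin."""
    # The first right singular vector is the direction that minimizes
    # the sum of squared distances from the rows of `points` to the line.
    _, _, vt = np.linalg.svd(points, full_matrices=False)
    w = vt[0]
    return w if w.sum() >= 0 else -w  # fix the sign so offsets are positive here

# Hypothetical readings: rows are time=1, 2, 3; columns are (T1, T2) in deg C.
pairs = np.array([[20.0, 20.5],
                  [25.0, 24.6],
                  [30.0, 30.2]])

w = fit_direction(pairs)               # n = 2 numbers describe the slope
offsets = pairs @ w                    # one number per value-pair: the hidden variable
reconstructed = np.outer(offsets, w)   # points mapped back onto the line
rel_err = np.linalg.norm(pairs - reconstructed) / np.linalg.norm(pairs)
```

Each row of `pairs` is replaced by a single offset, and `rel_err` measures how much is lost by keeping only that offset and the shared direction `w`.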


1. How to capture correlations

• Other pairs also follow the same pattern: they lie (approximately) on this line

[Figure: scatter plot of temperature T2 against temperature T1, both axes from 20°C to 30°C.]


Method outline

• Step 1: How to capture correlations?

• Step 2: How to do it incrementally, when we have a very large number of points?

• Step 3: How to dynamically adjust the number of hidden variables?


Experiments: chlorine concentration

• 166 streams
• 2 hidden variables (~4% error)

[Figure: measurements from sensor and reconstruction from hidden variables. Data: CMU Civil Engineering.]


Experiments: chlorine concentration

[Figure: hidden variables. Data: CMU Civil Engineering.]

• Both capture global, periodic pattern
• Second: ~ first, but “phase-shifted”
• Can express any “phase-shift”…


Conclusion

• Many settings with hundreds of streams, but
  • Stream values are, by nature, related
• We proposed a method that
  • discovers hidden variables as a summary of the main trends for users
  • requires only incremental computation, without buffering any past data
• Future work:
  • Apply to more applications, e.g., performance monitoring for storage systems and network systems.


Related work

• Stream SVD [Guha, Gunopulos, Koudas / KDD03]
• StatStream [Zhu, Shasha / VLDB02]
• Clustering
  • [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE],
  • [Lin, Vlachos, Keogh, Gunopulos / EDBT04]
• Classification
  • [Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]
• Piecewise approximations
  • [Palpanas, Vlachos, Keogh, et al / ICDE 2004]


Experiments: Light measurements

• 54 sensors
• 2-4 hidden variables (~6% error)

[Figure: measurement and reconstruction.]


Experiments: Light measurements

• 1 & 2: main trend (as before)
• 3 & 4: potential anomalies and outliers

[Figure: hidden variables, with two labeled “intermittent”.]


Stream correlations

• Step 1: How to capture correlations?

• Step 2: How to do it incrementally, when we have a very large number of points?

• Step 3: How to dynamically adjust the number of hidden variables?


2. Incremental update

• For each new point
  • Project onto current line
  • Estimate error

[Figure: scatter plot of temperature T2 against temperature T1, both axes from 20°C to 30°C, showing the new value and its error relative to the current line.]


2. Incremental update

• For each new point
  • Project onto current line
  • Estimate error
  • Rotate line in the direction of the error and in proportion to its magnitude
• O(n) time

[Figure: scatter plot of temperature T2 against temperature T1, both axes from 20°C to 30°C, showing the new value and its error relative to the current line.]


2. Incremental update

• For each new point
  • Project onto current line
  • Estimate error
  • Rotate line in the direction of the error and in proportion to its magnitude

[Figure: scatter plot of temperature T2 against temperature T1, both axes from 20°C to 30°C.]


Stream correlations

Principal Component Analysis (PCA)

• The “line” is the first principal component (PC) vector

• This line is optimal: it minimizes the sum of squared projection errors
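The optimality claim can be checked numerically. The sketch below uses synthetic two-sensor data and compares the first principal direction, obtained with `numpy.linalg.svd`, against a sweep of other directions. It assumes lines through the origin (the data are not mean-centered), and `projection_error` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

def projection_error(points, w):
    """Sum of squared distances from each row of `points` to the line spanned by unit vector w."""
    residual = points - np.outer(points @ w, w)
    return float(np.sum(residual ** 2))

# Synthetic data: two sensors that follow the same hidden value plus small noise.
rng = np.random.default_rng(0)
t = rng.uniform(20.0, 30.0, size=200)
points = np.column_stack([t, t]) + rng.normal(0.0, 0.3, size=(200, 2))

# First principal direction (uncentered): first right singular vector.
_, _, vt = np.linalg.svd(points, full_matrices=False)
first_pc = vt[0]
best = projection_error(points, first_pc)

# Sweep other directions through the origin; none should beat the first PC.
angles = np.linspace(0.0, np.pi, 181)
others = [projection_error(points, np.array([np.cos(a), np.sin(a)])) for a in angles]
```

Comparing `best` with `min(others)` shows that no direction in the sweep gives a smaller sum of squared projection errors.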


2. Incremental update

Given number of hidden variables k

• Assuming k is known
• We know how to update the slope
  • (detailed equations in paper)
• For each new point x and for i = 1, …, k:
  • $y_i := w_i^T x$ (proj. onto $w_i$)
  • $d_i \leftarrow d_i + y_i^2$ (energy $\propto$ i-th eigenval.)
  • $e_i := x - y_i w_i$ (error)
  • $w_i \leftarrow w_i + (1/d_i)\, y_i e_i$ (update estimate)
  • $x \leftarrow x - y_i w_i$ (repeat with remainder)

[Figure: new point x, its projection y1 onto w1, the error e1, and the updated w1.]
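The per-point update steps listed above translate almost line by line into code. The following is a sketch, not the authors' implementation: the function name `update`, the initialization of the weights to unit vectors and of the energies to a small positive constant, and the synthetic two-sensor stream are all assumptions made for illustration.

```python
import numpy as np

def update(W, d, x):
    """One streaming step for a new point x.

    W: (k, n) array whose rows are the current estimates w_i (updated in place).
    d: (k,) array of accumulated energies d_i (updated in place).
    Returns the k hidden variables y_i for this point.
    """
    x = np.array(x, dtype=float)       # working copy: the remainder
    y = np.zeros(len(W))
    for i in range(len(W)):
        y[i] = W[i] @ x                # project onto w_i
        d[i] += y[i] ** 2              # accumulate energy
        e = x - y[i] * W[i]            # projection error
        W[i] += (y[i] / d[i]) * e      # rotate w_i toward the error
        x -= y[i] * W[i]               # continue with the remainder
    return y

# Assumed setup: n = 2 sensors, k = 1 hidden variable.
n, k = 2, 1
W = np.eye(k, n)                       # assumed initialization: unit vectors
d = np.full(k, 1e-3)                   # assumed initialization: small positive energy
rng = np.random.default_rng(0)
for _ in range(500):
    t = rng.uniform(20.0, 30.0)        # synthetic shared temperature
    x = np.array([t, t]) + rng.normal(0.0, 0.1, size=n)
    y = update(W, d, x)

direction = W[0] / np.linalg.norm(W[0])
```

On this synthetic stream `direction` ends up close to the diagonal (1, 1)/√2, and each call touches only k vectors of length n, so no past points need to be stored.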


Stream correlations

• Step 1: How to capture correlations?

• Step 2: How to do it incrementally, when we have a very large number of points?

• Step 3: How to dynamically adjust k, the number of hidden variables?


3. Number of hidden variables

• If we had three sensors with similar measurements
• Again: points would lie on a line (i.e., one hidden variable, k=1), but in 3-D space

[Figure: value-tuple space with axes T1, T2 and T3.]


3. Number of hidden variables

• Assume one sensor intermittently gets stuck
• Now, no line can give a good approximation

[Figure: value-tuple space with axes T1, T2 and T3.]


3. Number of hidden variables

• Assume one sensor intermittently gets stuck
• Now, no line can give a good approximation
• But a plane will do (two hidden variables, k = 2)

[Figure: value-tuple space with axes T1, T2 and T3.]
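A small synthetic experiment illustrates why a stuck sensor calls for a second hidden variable. The data below are invented: three sensors share one sinusoidal signal, and the third is forced to a hypothetical stuck reading of 0 during alternate blocks of 100 ticks. `energy_fraction` is a helper name assumed for this sketch; it uses the singular values of the data matrix to measure how much squared value the best line or plane retains.

```python
import numpy as np

def energy_fraction(X, k):
    """Fraction of total squared value (energy) kept by the best k-dimensional approximation of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

t = np.arange(400)
signal = 25.0 + 5.0 * np.sin(2 * np.pi * t / 50)   # synthetic shared temperature
X = np.column_stack([signal, signal, signal])      # three sensors, similar measurements

stuck = (t // 100) % 2 == 1                        # alternate blocks of 100 ticks
X_stuck = X.copy()
X_stuck[stuck, 2] = 0.0                            # hypothetical stuck reading on sensor 3

line_ok = energy_fraction(X, 1)            # line, all sensors healthy
line_stuck = energy_fraction(X_stuck, 1)   # line, one sensor intermittently stuck
plane_stuck = energy_fraction(X_stuck, 2)  # plane, one sensor intermittently stuck
```

With healthy sensors a line keeps essentially all of the energy; once the third sensor gets stuck, the line's share drops noticeably while a plane recovers it.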


Number of hidden variables (PCs)

• Keep track of energy maintained by approximation with k variables (PCs):
  • Reconstruction accuracy, w.r.t. total squared error
• Increment (or decrement) k if fraction of energy maintained goes below (or above) a threshold
  • If below 95%, $k \leftarrow k + 1$
  • If above 98%, $k \leftarrow k - 1$
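The threshold rule can be written as a small function. This is a sketch under assumptions: the name `adjust_k`, its parameters, and the clamping of k to the range 1..n are illustrative choices, while the 95% and 98% thresholds come from the slide.

```python
def adjust_k(k, total_energy, kept_energy, n, low=0.95, high=0.98):
    """Return the new number of hidden variables.

    total_energy: energy (sum of squared values) of the input streams so far.
    kept_energy:  energy retained by the current k hidden variables.
    n:            number of streams; k is kept within 1..n (an assumption of this sketch).
    """
    if total_energy <= 0:
        return k                      # nothing observed yet
    fraction = kept_energy / total_energy
    if fraction < low and k < n:
        return k + 1                  # approximation too coarse: add a hidden variable
    if fraction > high and k > 1:
        return k - 1                  # more hidden variables than needed: drop one
    return k

# Example calls with made-up energies for n = 3 streams.
grow = adjust_k(1, 100.0, 90.0, n=3)     # 90% kept, below the 95% threshold
shrink = adjust_k(2, 100.0, 99.0, n=3)   # 99% kept, above the 98% threshold
keep = adjust_k(2, 100.0, 96.0, n=3)     # between the two thresholds
```

In a streaming setting the two energies would be running sums that are updated with every new point, so the check adds only constant work per step.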