
Rare Time Series Motif Discovery from Unbounded Streams

Nurjahan Begum and Eamonn Keogh

VLDB 2015

Talk Outline

• Ubiquity of time series

• What are time series motifs?

• Rare Motif Discovery

• Conclusions


Time Series is Ubiquitous

[Figure: example time series from many domains: unstructured audio stream, sensors on machines, shapes, handwriting, motion capture, human speech, web clicks, electrocardiogram, insect wingbeat sound]


What are time series motifs?

- Approximately repeated subsequences

- An example: Activity Recognition

[Figure: activity-recognition time series (x-axis 0 to 1000) with segments labeled walking, walking, stretching, walking, and vacuuming]

Motifs are useful as a subroutine for:

- Classification

- Clustering

- Rule Discovery

- Anomaly Detection
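To make "approximately repeated subsequences" concrete, here is a minimal batch sketch that finds the closest pair of non-overlapping subsequences in a short series. The z-normalized Euclidean distance and the names `znorm`, `euclidean`, and `best_motif_pair` are assumptions chosen for illustration; the slides do not specify a distance measure.

```python
import math

def znorm(x):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    if std == 0:
        return [0.0] * len(x)
    return [(v - mean) / std for v in x]

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def best_motif_pair(series, m):
    """Return (distance, i, j) for the closest pair of non-overlapping
    length-m subsequences (assumed distance: z-normalized Euclidean)."""
    subs = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    best = (math.inf, -1, -1)
    for i in range(len(subs)):
        # Starting j at i + m skips overlapping (trivial) matches.
        for j in range(i + m, len(subs)):
            d = euclidean(subs[i], subs[j])
            if d < best[0]:
                best = (d, i, j)
    return best

# Toy series: the shape 0, 1, 3, 1 appears twice, ten points apart.
series = [0, 0, 1, 3, 1, 0, 0, 5, 2, 7, 0, 0, 1, 3, 1, 0, 4, 9]
dist, i, j = best_motif_pair(series, 5)
```

This batch search compares every pair and needs the whole series in memory, which is exactly what an unbounded stream does not allow.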


Rare Motif Discovery

• Motivation

• Algorithms

– Brute Force

– Limited cache

• Performance Improvement

– Changing Data Representation

– Sticky Cache

• Experiments


What are time series motifs?

- Approximately repeated subsequences

- An example: Activity Recognition

[Figure: activity-recognition time series (x-axis 0 to 1000) with segments labeled walking, walking, stretching, walking, and vacuuming]

Situations where current motif-finding algorithms can perform poorly or fail:

- Far apart in space (motifs occurring in different data chunks)

- Infrequent (computationally expensive!)

Rare Motifs: A real life example

[Figure: solar panel current (mA, y-axis 0 to 40) shown for 131, 129, and 127 days ago and for 3 days ago, 2 days ago, and now, with the four months in between omitted]

A never-ending time series stream from a weather station’s solar panel, only a fraction of which we can buffer. A pattern we are observing now seems to have also occurred about four months ago.


Brute Force Approach

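[Figure: a stream of items I1, I2, I3, I4, …, Ik]

We report that the “current item is a motif pattern” if we find that D(k + 1, j) < T and j < k + 1, where the current item is at position k + 1 and j indexes an earlier item (D: distance between two items; T: match threshold).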

Brute Force Approach

• Brute Force with Limited Memory

– A cache of fixed size w

Success Metric: expected number of objects we see before we report success

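The brute-force test and its limited-memory variant can be sketched as one streaming loop. This is a minimal illustration, not the authors' implementation: the function name `stream_motif_search`, the uniformly random eviction when the cache is full, and the 0-based indices are all assumptions (the slides index items from 1 and do not state a replacement policy for the plain cache).

```python
import random

def stream_motif_search(stream, dist, threshold, cache_size=None, seed=0):
    """Yield (k, j) whenever item k matches an earlier cached item j,
    i.e. dist(item_k, item_j) < threshold with j < k.

    cache_size=None keeps every item (plain brute force). Otherwise the
    cache holds at most cache_size items, and a uniformly random resident
    is evicted when it is full (an assumed policy).
    """
    rng = random.Random(seed)
    cache = []  # list of (index, item) pairs seen so far
    for k, item in enumerate(stream):
        for j, old in cache:
            if dist(item, old) < threshold:
                yield (k, j)
                break
        if cache_size is not None and len(cache) >= cache_size:
            cache.pop(rng.randrange(len(cache)))
        cache.append((k, item))

def abs_diff(a, b):
    """Toy distance on scalar items."""
    return abs(a - b)

stream = [10.0, 20.0, 30.0, 40.0, 10.1]
unbounded = list(stream_motif_search(stream, abs_diff, 0.5))
tiny_cache = list(stream_motif_search(stream, abs_diff, 0.5, cache_size=1))
```

With an unbounded cache the last item matches the first. With a cache of size 1 the first item has been evicted long before its repeat arrives, so the motif is missed, which is why the number of objects seen before success is a natural metric.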


Changing Data Representation

Emulating a virtually large cache:

- Downsampling the data

- Reducing the dimensionality of the data

- Reducing the cardinality of the data

[Figure: expected number of objects processed before success (y-axis, 0 to 12000) versus virtual cache size (x-axis), with curves for Dimensionality Reduction, Cardinality Reduction, and Downsampling]
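The three representation changes can be sketched as below. The specific techniques are assumptions picked for illustration: piecewise averaging (PAA) stands in for dimensionality reduction and uniform quantization for cardinality reduction. The helper `items_that_fit` is hypothetical and only shows why a smaller representation acts like a larger cache for a fixed memory budget.

```python
def downsample(x, factor):
    """Keep every factor-th point."""
    return x[::factor]

def paa(x, segments):
    """Piecewise averaging: the mean of each of `segments` equal-width
    frames. Assumes len(x) is divisible by segments."""
    width = len(x) // segments
    return [sum(x[i * width:(i + 1) * width]) / width for i in range(segments)]

def quantize(x, levels, lo, hi):
    """Map each value to one of `levels` integer symbols over [lo, hi]."""
    step = (hi - lo) / levels
    return [min(levels - 1, max(0, int((v - lo) / step))) for v in x]

def items_that_fit(budget_bits, n_values, bits_per_value):
    """Number of cached items a fixed memory budget can hold (hypothetical helper)."""
    return budget_bits // (n_values * bits_per_value)

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
down = downsample(x, 2)           # 4 values instead of 8
reduced = paa(x, 4)               # 4 frame means instead of 8 values
symbols = quantize(x, 4, 0.0, 8.0)  # 8 symbols from a 4-letter alphabet

raw_capacity = items_that_fit(5120, 8, 64)      # 8 values at 64 bits each
quantized_capacity = items_that_fit(5120, 8, 2)  # 8 values at 2 bits each
```

In this toy setting, halving the number of values doubles the number of items that fit, while dropping from 64-bit values to a 4-symbol alphabet multiplies it by 32. The price is a coarser comparison between items.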


Sticky Cache

• A magic cache where potential motif patterns tend to remain for longer

• Biased cache replacement policy

[Figure: probability of success (y-axis, 0.4 to 1) versus number of objects seen before success (x-axis, 0 to 1500), with curves P1, P50, and P100]

P100 = Probability of discarding an element from R is 100 times greater

P50 = Probability of discarding an element from R is 50 times greater
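A biased replacement policy can be sketched as a weighted random choice of the entry to evict. This is a hedged illustration: it assumes R denotes cache entries tagged as random patterns, and that "100 times greater" is relative to entries tagged as potential motifs. The names `pick_victim` and `eviction_counts` are hypothetical.

```python
import random

def pick_victim(tags, bias, rng):
    """Pick the index of the cache entry to evict.

    tags[i] is 'R' (random pattern) or 'M' (potential motif). Each 'R'
    entry gets weight `bias` and each 'M' entry weight 1, so an 'R' entry
    is `bias` times more likely to be discarded than an 'M' entry.
    """
    weights = [bias if t == "R" else 1 for t in tags]
    return rng.choices(range(len(tags)), weights=weights, k=1)[0]

def eviction_counts(tags, bias, trials=10000, seed=0):
    """Count how often each entry is chosen for eviction over many trials."""
    rng = random.Random(seed)
    counts = [0] * len(tags)
    for _ in range(trials):
        counts[pick_victim(tags, bias, rng)] += 1
    return counts

tags = ["M", "R", "R", "R"]
sticky = eviction_counts(tags, bias=100)  # P100-style policy
uniform = eviction_counts(tags, bias=1)   # P1-style policy, no bias
```

With a bias of 100, the single potential-motif entry is evicted in roughly 1 of 301 draws, against roughly 1 in 4 under the unbiased policy, so it stays in the cache far longer.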

Sticky Cache

Algorithm for detecting potential motifs

– Discretize each time series subsequence

– Query the Bloom Filter for the instance in question

• If the Bloom Filter saw the instance before: tag it as a potential motif pattern

• Else: tag it as a random pattern
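The tagging steps above can be sketched with a small hand-rolled Bloom filter. Everything here is an assumption made for illustration: the `BloomFilter` class, its size and hash construction, the min-max `discretize` function with a 4-letter alphabet, and the choice to add unseen words to the filter after querying.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over strings (illustrative parameters)."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, key):
        # Derive n_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))

def discretize(subseq, levels=4):
    """Min-max scale the subsequence and map each value to 'a', 'b', ..."""
    lo, hi = min(subseq), max(subseq)
    span = (hi - lo) or 1.0  # avoid division by zero on constant input
    return "".join(
        chr(ord("a") + min(levels - 1, int((v - lo) / span * levels)))
        for v in subseq
    )

def tag(subseq, bloom):
    """Tag a subsequence, remembering its discretized word if it is new."""
    word = discretize(subseq)
    if word in bloom:
        return "potential motif"
    bloom.add(word)
    return "random"

bloom = BloomFilter()
first = tag([0, 1, 3, 1, 0], bloom)
repeat = tag([0.1, 1.0, 3.1, 1.1, 0.0], bloom)  # same shape, slightly noisy
other = tag([5, 4, 3, 2, 1], bloom)
```

A Bloom filter never forgets a word it has stored, but it can occasionally report an unseen word as seen. In this scheme such a false positive only means a random pattern is treated as a potential motif and kept in the cache a little longer.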


Which approach is best?

Comparison of all approaches in commensurate scale

[Figure: expected number of elements seen before success (y-axis) versus virtual cache size (x-axis, 0 to 350), with curves for Downsampling, Dimensionality Reduction, Cardinality Reduction, and Cardinality Reduction with Sticky Cache]

Rare Motif Discovery

• Motivation

• Algorithms

– Brute Force

– Limited cache

• Performance Improvement

– Changing Data Representation

– Sticky Cache

• Case Studies

Dishwasher

• Time Series: Dishwasher + Refrigerator

• Motif Length: 160 (2 hrs 40 mins)

• Sampling Rate: 0.017 Hz

[Figure: dishwasher motif occurrences on Day 11 and Day 19, with the section between them omitted (x-axis 7.88 to 8, ×10^4)]

[Figure: Ground Truth versus Motifs Detected, each on a timeline from Day 70 to Day 350]

White-crowned Sparrow (Zonotrichia leucophrys)

• Time Series Length: 2245824 (10 hours)

• Sampling Frequency: 62.3 Hz

• Motif Length: 188 (3 sec)

[Figure: time series labeled 37 minutes and 140 minutes, with the section between them omitted]

[Figure: panels A, B, and C, each with x-axis 0 to 200; A is annotated 36 min 54 sec and 2.3 hours, B 1 min 57 sec, C 31 min 27 sec]

Dataset: NPR August 01, 2013

• Time Series Length: 29 hr 21 min 57 sec

• MFCC space length: 6596741 (about 6.6 million)

• Sampling Frequency: 62.4 Hz

• Motif Length: 4 sec

Conclusions

• We address the problem of detecting rare motifs

– Changing Data Representation

– Sticky Cache

• All the code and data for this paper are publicly available!

Thank you!