Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Post on 13-Aug-2015


Mining Correlations on Massive Bursty Time Series Collections

Tomasz Kuśmierczyk and Kjetil Nørvåg

Problem statement

bursty streams

streams of bursts (obtained with one of the many available burst detection methods)

correlated bursts

identify correlated bursty streams

Problem: Massive Collections

● identify pairs: correlation ≥ threshold
● N ~ millions of streams
● naive (all pairs) solution complexity ~ $N^2$

● pruning
● indexing

Motivation

● any source of a large number of streams:
  ○ social media
  ○ web page view counts
  ○ traffic monitoring sensors
  ○ smart grid (electricity consumption meters)
  ○ and more

Correlated bursts

● different lengths
● different heights
● slight shifts

but
● should overlap

Correlated bursty streams

For two streams i and j with burst sets $E_i$ and $E_j$:

● $e_i$, $e_j$: number of bursts per stream
● $o_i^j$: number of bursts from stream i overlapping with j
● $o_j^i$: number of bursts from stream j overlapping with i

$$J(E_i, E_j) = \frac{\min(o_i^j, o_j^i)}{e_i + e_j - \min(o_i^j, o_j^i)}$$

Example (figure: streams $E_i$ and $E_j$ drawn on a shared time axis):

$e_i = 4$, $o_i^j = 3$

$e_j = 3$, $o_j^i = 2$

$\min(o_i^j, o_j^i) = \min(3, 2) = 2$

$J(E_i, E_j) = 2 / (4 + 3 - 2) = 2/5$
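As a minimal sketch of the measure above, the counts and J can be computed directly from burst intervals. It assumes each burst is a closed interval `(start, end)` and that two bursts overlap when their intervals share at least one point; the slides do not fix this detail. The function names are illustrative and not taken from the paper.

```python
def bursts_overlap(a, b):
    """Assumption: bursts are closed intervals (start, end)."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_counts(bursts_i, bursts_j):
    """Return (o_i^j, o_j^i): bursts of i overlapping j, and bursts of j overlapping i."""
    o_ij = sum(any(bursts_overlap(a, b) for b in bursts_j) for a in bursts_i)
    o_ji = sum(any(bursts_overlap(b, a) for a in bursts_i) for b in bursts_j)
    return o_ij, o_ji


def burst_jaccard(bursts_i, bursts_j):
    """J(E_i, E_j) = min(o_i^j, o_j^i) / (e_i + e_j - min(o_i^j, o_j^i))."""
    m = min(overlap_counts(bursts_i, bursts_j))
    return m / (len(bursts_i) + len(bursts_j) - m)


# Made-up intervals chosen to reproduce the counts of the slide example.
E_i = [(0, 2), (3, 4), (5, 6), (10, 11)]   # e_i = 4
E_j = [(1, 3.5), (5.5, 7), (20, 21)]       # e_j = 3

print(overlap_counts(E_i, E_j))  # (3, 2)
print(burst_jaccard(E_i, E_j))   # 0.4, i.e. 2/5
```

This quadratic comparison is only meant to make the definition concrete; it is the per-pair check that the indexes described later try to avoid running on every pair.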

Enumerating pairs

Order streams according to number of bursts:

❏ FOREACH base count b
    ❏ FOREACH b’ IN connected counts of b
        ❏ compare streams with b and b’ bursts

Pruning

● for each base count b we need to consider only connected counts b’ such that:

$J_T \cdot b \le b' \le b$

where $J_T$ is the threshold, b is a particular base count, and b’ ranges over the possible connected counts
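The enumeration and pruning steps can be sketched together as follows. The bound is consistent with the measure J from the earlier example: the smaller overlap count cannot exceed b’, so for b’ ≤ b the value of J is at most b’/b. The helper names and the dictionary input format are assumptions made for illustration, and exact fractions are used only to avoid floating-point rounding at the boundary.

```python
import math
from collections import defaultdict
from fractions import Fraction
from itertools import combinations, product


def connected_counts(b, jt):
    """Counts b' with J_T * b <= b' <= b (the pruning rule)."""
    lowest = math.ceil(Fraction(str(jt)) * b)
    return range(max(lowest, 1), b + 1)


def candidate_pairs(burst_counts, jt):
    """Yield stream pairs that survive pruning.

    burst_counts: hypothetical input, a dict stream_id -> number of bursts.
    """
    by_count = defaultdict(list)
    for stream, count in burst_counts.items():
        by_count[count].append(stream)

    for b in sorted(by_count):                 # FOREACH base count b
        for b2 in connected_counts(b, jt):     # FOREACH connected count b'
            if b2 not in by_count:
                continue
            if b2 == b:
                # same count: every unordered pair inside the group
                yield from combinations(by_count[b], 2)
            else:
                yield from product(by_count[b], by_count[b2])


print(list(connected_counts(20, 0.95)))  # [19, 20]
```

Each surviving pair still has to be compared; pruning only removes pairs whose burst counts alone rule out reaching the threshold.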

Interval Boxes (IB) index

● k-subset of bursts = k-dim box
● k-dimensional R-trees
● k-dim boxes overlapping = at least k bursts overlap

[Figure: for example (k=2), the representation of stream $E_i$ as 2-dimensional boxes]

[Figure: an indexed box and a query box overlapping, so that $\min(o_i^j, o_j^i) \ge k$]
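To make the box idea concrete, here is a minimal sketch that builds the k-dimensional boxes of a stream and tests two streams for overlapping boxes. It deliberately swaps the R-tree for a brute-force scan, so it shows only the geometry, not the index. It assumes closed intervals and bursts ordered by start time; all names are illustrative.

```python
from itertools import combinations


def interval_boxes(bursts, k):
    """Every k-subset of time-ordered bursts becomes one k-dim box.

    Dimension d of a box is the interval of the d-th burst in the subset.
    """
    return list(combinations(sorted(bursts), k))


def boxes_overlap(box_a, box_b):
    """Boxes overlap when their intervals overlap in every dimension."""
    return all(a[0] <= b[1] and b[0] <= a[1] for a, b in zip(box_a, box_b))


def has_overlapping_boxes(bursts_i, bursts_j, k):
    """True if some box of i overlaps some box of j (brute force, no R-tree)."""
    boxes_j = interval_boxes(bursts_j, k)
    return any(boxes_overlap(p, q)
               for p in interval_boxes(bursts_i, k)
               for q in boxes_j)


E_i = [(0, 2), (3, 4), (5, 6), (10, 11)]
E_j = [(1, 3.5), (5.5, 7), (20, 21)]

print(has_overlapping_boxes(E_i, E_j, 2))  # True
print(has_overlapping_boxes(E_i, E_j, 3))  # False
```

A stream with e bursts produces C(e, k) boxes, which is why index size grows quickly with k and with the number of bursts per stream.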

Interval Boxes (IB) index: mining

● mining:
  ○ for each base count b maintain an IB (R-trees) index
  ○ query it with streams having connected counts b’

[Figure: one IB index per base count (b=1, b=2, b=3, b=4); queries produce candidate pairs of streams with $\min(o_i^j, o_j^i) \ge k$, which lead to the correlated output pairs]

IB index: what dimensionality k?

● small k (IB Low Dimensional = IBLD)
  ○ small indexes
  ○ large number of candidate pairs

● high k (IB High Dimensional = IBHD)
  ○ large indexes
  ○ small number of candidate pairs
  ○ $k_{max} = J_T \cdot b$ (correlation ≥ threshold guaranteed)

IBHD index in practice

● to speed up processing, some k-subsets are skipped
● some pairs may be missed in multiple-overlap cases
● efficiency-effectiveness tradeoff

List-based (LS) index: bins

separate bin for each (not pruned) pair b, b’:

b=1: b’=2, b’=3, b’=4, b’=5
b=2: b’=3, b’=4, b’=5
b=3: b’=4, b’=5, b’=6
b=4: b’=5, b’=6, b’=7

List-based (LS) index: single bin

[Figure: a single bin drawn along the time axis; labels: time, time granularity]

LS index: mining algorithm

● Returns $o_i^j$ and $o_j^i$
● Only for pairs $E_i$, $E_j$ that have at least one overlap
● Immediate validation of each pair’s correlation J

LS index: mining algorithm

● For each set of burst pointers (time moment):
  ○ identify NEW, OLD, ENDING (simple set operations)
  ○ maintain map OVERLAPS = burst → set of overlapping streams
  ○ update counts $o_i^j$ and $o_j^i$

[Figure: time axis showing the current time moment (set of pointers), the bursts active in the current moment, the bursts active in the previous moment, and the NEW, OLD and ENDING bursts]

Hybrid index

● LS index works well when:
  ○ low number of overlaps
  ○ high number of bursts per stream

● IBHD index works well when:
  ○ low number of bursts per stream
  ○ high number of overlaps

● Solution: Hybrid index: IBHD index for low and LS for high base counts

Experimental evaluation

● Wikipedia page views from the years 2011-2013
● Kleinberg’s burst extraction
● streams having at least 5 bursts
● 2.1M streams and 43M bursts in total
● 10 bursts per stream on average
● mean burst length 28h

Mining & building

Threshold: $J_T = 0.95$

Hybrid mining

Number of streams: N = 2.1M

Number of generated pairs

Threshold: $J_T = 0.95$ (<10% pairs missing)

How_I_Met_Your_Mother_(season_7) Two_and_a_Half_Men_(season_9)

Process_(computing) Central_processing_unit

Endoplasmic_reticulum Ribosome

Greatest_Hits,_Vol._2_(Ronnie_Milsap_album) Greatest_Hits,_Vol._3_(Ronnie_Milsap_album)

DigiTech_JamMan Lexicon_JamMan

Humanistic_psychology Positive_psychology

Computational limits for Naive/LS index

What’s more in the paper?

● formal definitions and proofs
● considerations of combinatorial aspects
● multiple overlap cases
● on-line maintenance of indexes

Questions?

Tomasz Kuśmierczyk
tomaszku@idi.ntnu.no

Thank you!

LS index: mining

● For each set of burst pointers (time moment):
  ○ identify NEW, OLD, ENDING (simple set operations)
  ○ new overlapping bursts: NEW × OLD ∪ NEW × NEW
  ○ remove ENDING and add new overlapping bursts to the map OVERLAPS = burst → set of overlapping streams
  ○ update counts $o_i^j$ and $o_j^i$ for new overlapping bursts, with the help of OVERLAPS
● For each i and j in o: calculate $\min(o_i^j, o_j^i)$ and J

[Figure: time axis showing the current time moment (set of pointers), the bursts active in the current moment, the bursts active in the previous moment, and the NEW, OLD and ENDING bursts]
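The steps above can be sketched as a sweep over discrete time moments. This is a simplified illustration, not the authors' implementation: it assumes integer, inclusive burst boundaries, assumes bursts within one stream never overlap each other, and recomputes the active set by scanning all bursts instead of advancing per-bin pointers. The names `ls_sweep` and `correlated_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product


def ls_sweep(streams):
    """Return counts[(i, j)] = o_i^j for every pair with at least one overlap.

    streams: dict stream_id -> list of (start, end), integer and inclusive.
    """
    bursts = [(s, n, start, end)
              for s, blist in streams.items()
              for n, (start, end) in enumerate(blist)]
    t_min = min(b[2] for b in bursts)
    t_max = max(b[3] for b in bursts)

    counts = defaultdict(int)
    overlaps = {}        # OVERLAPS: burst -> set of overlapping streams
    previous = set()
    for t in range(t_min, t_max + 1):
        current = {(s, n) for s, n, start, end in bursts if start <= t <= end}
        new = current - previous
        old = current & previous
        ending = previous - current

        for burst in ending:              # ended bursts cannot overlap again
            overlaps.pop(burst, None)
        for burst in new:
            overlaps[burst] = set()

        # new overlapping bursts: NEW x OLD united with NEW x NEW
        for a, b in list(product(new, old)) + list(combinations(new, 2)):
            for x, y in ((a, b), (b, a)):
                # count burst x once per overlapping stream
                if x[0] != y[0] and y[0] not in overlaps[x]:
                    overlaps[x].add(y[0])
                    counts[(x[0], y[0])] += 1
        previous = current
    return counts


def correlated_pairs(streams, jt):
    """Validate J for each overlapping pair and keep those with J >= jt."""
    counts = ls_sweep(streams)
    result = {}
    for (i, j), o_ij in counts.items():
        if i < j:
            m = min(o_ij, counts.get((j, i), 0))
            jaccard = m / (len(streams[i]) + len(streams[j]) - m)
            if jaccard >= jt:
                result[(i, j)] = jaccard
    return result


demo = {"i": [(0, 2), (3, 4), (5, 6), (10, 11)],
        "j": [(1, 3), (6, 7), (20, 21)]}
print(correlated_pairs(demo, 0.3))  # {('i', 'j'): 0.4}
```

Only pairs that actually overlap ever get an entry in `counts`, which matches the property that the LS index returns counts only for pairs with at least one overlap.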