Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

32
Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09 Supervi sor Dr Koh Speaker Nonhlanhla Shongwe April 26, 2010

description

Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09. Preview. Introduction Challenge Method DMM framework Distance Function Selection Experiments Conclusion. Introduction. Mining stream for knowledge mining such as Clustering - PowerPoint PPT Presentation

Transcript of Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Page 1: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Mining Data Streams with Periodically changing

DistributionsYingying Tao, Tamer Ozsu CIKM’09

Supervisor Dr Koh Speaker Nonhlanhla Shongwe

April 26, 2010

Page 2: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Preview Introduction

Challenge

Method

DMM framework

Distance Function Selection

Experiments

Conclusion

Page 3: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Introduction Mining stream for knowledge mining such as

Clustering Classification Frequent patterns discovery , has become important

Important characteristic of unbounded data stream is that the underlying distributions can show important changes over time, leading to dynamic data streams.

Page 4: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Challenge The problem of mining dynamic data streams

Balance accuracy with efficiency – highly accurate mining techniques are generally computationally expensive

Some question to ask: for dynamic data streams Is the distribution changes entirely random and unpredictable Is it possible for the distribution changes to follow certain

patterns?

Page 5: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Method Propose a method for mining dynamic data streams

Where important observed distributions patterns are stored And compared new detected changes with these patterns

Page 6: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Method Two streams with the same distribution

Mining results such as List of all frequent items / itemset for frequent patterns discovery Set of clusters / classes for clustering and classification should be the

same If distribution change is detected and a match is found

Possible to skip the re-mining process Directly output the mining results for the archived distribution

This is called the match-and-reuse strategy

Page 7: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Method Issues to be resolved

Pattern selection Selecting and storing important distributions that have a high

probability of occurring in the future Pattern representation

Storing each pattern succinctly Matching

Efficient procedure for rapid data streams with high accuracy

Page 8: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

DMM framework DMM framework stands for Detect, Match and Mine

Consist of four sequences Choosing representative set Change detection Pattern matching Stream mining

All processes are independent

Page 9: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

DMM framework Window model

Generate reference window (choosing representative set)

Change detection

Distribution matching (Pattern matching)

Choosing important distribution

Page 10: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Window model Two windows on Stream S

Time-based Defines the time intervals Denoted by Wt

Called observation window Implemented as a tumbling window Moves forward at each clock tick

Page 11: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Window model cont’s Count-based

Contains a sub stream with fixed number of elements Denoted by Wr Called reference window

The size of the reference window (|Wr|) and time intervals of Wt are predefined values

Page 12: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

DMM framework Window model

Generate reference window (choosing representative set)

Change detection

Distribution matching (Pattern matching)

Choosing important distribution

Page 13: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Generate reference window (choosing representative set)

Wr stores a set of data elements that represents a current distribution of S

The size needs to be small due to memory limitations Inaccurate results if we use a small data set to represent a

distribution if the distribution is complicated Due to this problem, use a dynamic reference window (Wr)

Page 14: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Generate reference window (choosing representative set)

Merge and select process Dynamic reference window (Wr) Merge Wt and Wr to get a larger substream |Wr|+|Wt| Select |Wr| elements from the merged window (Wr +Wt) and

replace the stream in Wr by the new set Merge and select process is triggered every time Wt tumbles

Page 15: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Generate reference window (choosing representative set)

Page 16: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Generate reference window (choosing representative set)

Selecting representative set: Two-step sampling approach First-step sampling approach

Estimate the density function of Wr + Wt

K= kernel function h= smoothing parameter (bandwidth) si= data element in Wr + Wt

Page 17: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Generate reference window (choosing representative set)

Selecting representative set: Two-step sampling approach K is set to (Standard Gaussian function mean = 0 variance = 1)

Then the density function

h=value between 0 or 1

Page 18: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Generate reference window (choosing representative set)

Selecting representative set: Two-step sampling approach With the density function we are able to estimate the “shape” of

the current distribution

X-axis is the = value of the data s (v(s)) in Wr +Wt Y-axis is the probability (p(v)) for all data values

Page 19: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Generate reference window (choosing representative set)

Selecting representative set: Two-step sampling approach Second-step sampling approach

Page 20: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Generate reference window (choosing representative set)

Selecting representative set: Two-step sampling approach Second-step sampling approach

First calculate the start and end values for each partition

Page 21: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

DMM framework Window model

Generate reference window (choosing representative set)

Change detection

Distribution matching (Pattern matching)

Choosing important distribution

Page 22: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Change detection Online change detection technique that is not restricted to

specific stream processing application

Wt tumbles, the change detection procedure is triggered

Compare the distributions of substreams in both Wr and Wt windows If the distance is greater than the predefined maximum matching

distance, then a distribution change is flagged

Page 23: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

DMM framework Window model

Generate reference window (choosing representative set)

Change detection

Distribution matching (Pattern matching)

Choosing important distribution

Page 24: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Distribution matching (Pattern matching)

We use the appropriate distance measure to check their similarity

If a match is found, then the persevered mining results are outputted

The maximum predefined maximum matching is important Smaller implies a higher accuracy Larger increases the possibility of a new distribution to match a

pattern in the preserved set

Page 25: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

DMM framework Window model

Generate reference window (choosing representative set)

Change detection

Distribution matching (Pattern matching)

Choosing important distribution

Page 26: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Choosing important distribution

Use heuristic rules Distribution that have occurred in the stream for more times are

more important The longer a distribution lasts in the streams lifespan the more

important it is Distribution that has mining results with higher accuracy is more

important that a distributions with less accurate mining results

Page 27: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Distance Function Selection Dynamic Time Wrapping (DTW), Longest Common Subsequence

(LCSS), Edit Distance on Real Sequence (EDR) and Relativized Discrepancy (RD)

A proper distance function that can be used with DMM Efficient , with the ability to stretching

Page 28: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Experiments Change detection

Kernel density approach (KD) Distance function-based approach(DF)

Page 29: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Experiments Distribution matching evaluation

Data from Tropical Atmosphere Ocean Sea surface temperatures. 12 218 streams each with a length of 962

Page 30: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Experiments Efficiency with and without DMM

Adopt a popular decision tree-based clustering technique VFDT to cluster the temperatures Best decision tree generators for dynamic data streams

Time is reduced by 31.3%

Page 31: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Conclusion Introduced a DMM framework to mine dynamic data streams

Window model Generate reference window (choosing representative set) Change detection Distribution matching (Pattern matching) Choosing important distribution

Experiments that showing DMM performs better

Page 32: Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Thank you for your attention