Transcript of COMP5331 Data Stream, prepared and presented by Raymond Wong (raywong@cse).

Page 1

COMP5331: Data Stream
Prepared by Raymond Wong
Presented by Raymond Wong
raywong@cse

Page 2

Data Mining over Static Data

1. Association
2. Clustering
3. Classification

Static Data → Output (Data Mining Results)

Page 3

Data Mining over Data Streams

1. Association
2. Clustering
3. Classification

Output (Data Mining Results)

Unbounded Data

Real-time Processing

Page 4

Data Streams

1 2 …

Less recent More recent

Each point: a transaction

Page 5

Data Streams

                | Traditional Data Mining      | Data Stream Mining
Data Type       | Static data of limited size  | Dynamic data of unlimited size (which arrives at high speed)
Memory          | Limited                      | Limited (more challenging)
Efficiency      | Time-consuming               | Efficient
Output          | Exact answer                 | Approximate (or exact) answer

Page 6

Entire Data Streams

1 2 …

Less recent More recent

Each point: a transaction

Obtain the data mining results from all data points read so far

Page 7

Data Streams with Sliding Window

1 2 …

Less recent More recent

Each point: a transaction

Obtain the data mining results over a sliding window

Page 8

Data Streams

Entire Data Streams
Data Streams with Sliding Window

Page 9

Entire Data Streams

Association: frequent pattern/item
Clustering
Classification

Page 10

Frequent Item over Data Streams

Let N be the current length of the data stream.

Let s be the support threshold (in fraction, e.g., 20%).

Problem: we want to find all items with frequency >= sN.

1 2 …

Less recent More recent

Each point: a transaction
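Before looking at stream algorithms, the problem statement can be illustrated with an exact hash-map baseline (a sketch of my own, not from the slides; it needs one counter per distinct item, which is exactly what a stream of unlimited size cannot afford):

```python
from collections import Counter

def frequent_items_exact(stream, s):
    """Exact baseline: count every item and report those with
    frequency >= s*N, where N is the stream length."""
    counts = Counter(stream)
    n = len(stream)
    return {item for item, f in counts.items() if f >= s * n}

# A toy stream of N = 10 transactions (one item each, for simplicity):
stream = ["I1"] * 4 + ["I2"] * 5 + ["I3"]
print(frequent_items_exact(stream, 0.2))  # I1 and I2 reach 20% of N
```

The algorithms in the following slides approximate this output while keeping far fewer counters.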

Page 11

Data Streams

                | Traditional Data Mining      | Data Stream Mining
Data Type       | Static data of limited size  | Dynamic data of unlimited size (which arrives at high speed)
Memory          | Limited                      | Limited (more challenging)
Efficiency      | Time-consuming               | Efficient
Output          | Exact answer                 | Approximate (or exact) answer

Page 12

Data Streams

Static Data → Output (Data Mining Results):
  Frequent item: I1
  Infrequent item: I2, I3

Unbounded Data → Output (Data Mining Results):
  Frequent item: I1, I3
  Infrequent item: I2

Page 13

False Positive/Negative, e.g.:

Expected Output: frequent item: I1; infrequent items: I2, I3
Algorithm Output: frequent items: I1, I3; infrequent item: I2

False Positive: the item is classified as a frequent item, but in fact the item is infrequent.

Which item is one of the false positives? I3
More? No.
No. of false positives = 1

If we say "the algorithm has no false positives", then all truly infrequent items are classified as infrequent items in the algorithm output.

Page 14

False Positive/Negative, e.g.:

Expected Output: frequent items: I1, I3; infrequent item: I2
Algorithm Output: frequent item: I1; infrequent items: I2, I3

False Negative: the item is classified as an infrequent item, but in fact the item is frequent.

Which item is one of the false negatives? I3
More? No.
No. of false negatives = 1; no. of false positives = 0

If we say "the algorithm has no false negatives", then all truly frequent items are classified as frequent items in the algorithm output.
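The two examples above reduce to simple set differences (a sketch of my own; the function name is not from the slides):

```python
def false_positives_negatives(true_frequent, algo_frequent):
    """Compare the algorithm's frequent-item set against the ground truth."""
    fp = algo_frequent - true_frequent  # reported frequent, truly infrequent
    fn = true_frequent - algo_frequent  # reported infrequent, truly frequent
    return fp, fn

# Page 13's example: 1 false positive (I3), no false negatives.
print(false_positives_negatives({"I1"}, {"I1", "I3"}))
# Page 14's example: no false positives, 1 false negative (I3).
print(false_positives_negatives({"I1", "I3"}, {"I1"}))
```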

Page 15

Data Streams

                | Traditional Data Mining      | Data Stream Mining
Data Type       | Static data of limited size  | Dynamic data of unlimited size (which arrives at high speed)
Memory          | Limited                      | Limited (more challenging)
Efficiency      | Time-consuming               | Efficient
Output          | Exact answer                 | Approximate (or exact) answer

We need to introduce an input error parameter ε.

Page 16

Data Streams

Static Data → Output (Data Mining Results):
  Frequent item: I1
  Infrequent item: I2, I3

Unbounded Data → Output (Data Mining Results):
  Frequent item: I1, I3
  Infrequent item: I2

Page 17

Data Streams

Static Data → store the statistics of all items: I1: 10, I2: 8, I3: 12
Unbounded Data → estimate the statistics of all items: I1: 10, I2: 4, I3: 10

N: total no. of occurrences of items; here N = 20
ε = 0.2, so εN = 4

Item | True Frequency | Estimated Frequency | Diff. D | D <= εN?
I1   | 10             | 10                  | 0       | Yes
I2   | 8              | 4                   | 4       | Yes
I3   | 12             | 10                  | 2       | Yes
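The table above is a direct arithmetic check; a minimal sketch with the slide's numbers:

```python
# Page 17's numbers: check that every estimate is within eps*N of the truth.
true_freq = {"I1": 10, "I2": 8, "I3": 12}
est_freq = {"I1": 10, "I2": 4, "I3": 10}
N, eps = 20, 0.2

for item, f in true_freq.items():
    diff = f - est_freq[item]
    print(item, diff, diff <= eps * N)  # diffs 0, 4, 2 are all <= eps*N = 4
```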

Page 18

ε-deficient synopsis

Let N be the current length of the stream (or the total no. of occurrences of items).
Let ε be an input parameter (a real number from 0 to 1).

An algorithm maintains an ε-deficient synopsis if its output satisfies the following properties:

Condition 1: There is no false negative, i.e., all truly frequent items are classified as frequent items in the algorithm output.
Condition 2: The difference between the estimated frequency and the true frequency is at most εN.
Condition 3: All items whose true frequencies are less than (s-ε)N are classified as infrequent items in the algorithm output.
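The three conditions can be checked mechanically. A sketch of my own (the function name and the choice s = 0.5 for the page-17 numbers are assumptions, not from the slides):

```python
def is_eps_deficient(true_freq, est_freq, reported, s, eps, N):
    """Check the three eps-deficient-synopsis conditions above."""
    # Condition 1: no false negative.
    cond1 = all(e in reported for e, f in true_freq.items() if f >= s * N)
    # Condition 2: estimates are within eps*N of the truth.
    cond2 = all(abs(f - est_freq.get(e, 0)) <= eps * N
                for e, f in true_freq.items())
    # Condition 3: items with true frequency < (s - eps)*N are not reported.
    cond3 = all(e not in reported
                for e, f in true_freq.items() if f < (s - eps) * N)
    return cond1 and cond2 and cond3

# Page 17's numbers with an assumed s = 0.5: a valid synopsis.
print(is_eps_deficient({"I1": 10, "I2": 8, "I3": 12},
                       {"I1": 10, "I2": 4, "I3": 10},
                       {"I1", "I3"}, s=0.5, eps=0.2, N=20))
```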

Page 19

Frequent Pattern Mining over Entire Data Streams

Algorithms:
1. Sticky Sampling Algorithm
2. Lossy Counting Algorithm
3. Space-Saving Algorithm

Page 20

Sticky Sampling Algorithm

Inputs to Sticky Sampling:
  s: support threshold
  ε: error parameter
  δ: confidence parameter

Unbounded Data → Sticky Sampling → Output: frequent items / infrequent items
Statistics of items are stored in the memory.

Page 21

Sticky Sampling Algorithm

The sampling rate r varies over the lifetime of a stream.
δ: confidence parameter (a small real number)
Let t = (1/ε) ln(1/(sδ)).

Data No.    | r (sampling rate)
1 ~ 2t      | 1
2t+1 ~ 4t   | 2
4t+1 ~ 8t   | 4
…           | …

Page 22

Sticky Sampling Algorithm

The sampling rate r varies over the lifetime of a stream.
δ: confidence parameter (a small real number)
Let t = (1/ε) ln(1/(sδ)).

E.g., s = 0.02, ε = 0.01, δ = 0.1, so t = 622:

Data No.                    | r (sampling rate)
1 ~ 2t      (1 ~ 1244)      | 1
2t+1 ~ 4t   (1245 ~ 2488)   | 2
4t+1 ~ 8t   (2489 ~ 4976)   | 4
…                           | …
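The value t = 622 above can be reproduced from the formula (a sketch of my own; rounding up to a whole number is an assumption that matches the slide's figures):

```python
import math

def sticky_t(s, eps, delta):
    """t = (1/eps) * ln(1/(s*delta)), rounded up to a whole number."""
    return math.ceil((1.0 / eps) * math.log(1.0 / (s * delta)))

print(sticky_t(0.02, 0.01, 0.1))  # 622, so r = 1 for data no. 1 ~ 1244
print(sticky_t(0.5, 0.35, 0.5))   # 4, matching the next slide's example
```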

Page 23

Sticky Sampling Algorithm

The sampling rate r varies over the lifetime of a stream.
δ: confidence parameter (a small real number)
Let t = (1/ε) ln(1/(sδ)).

E.g., s = 0.5, ε = 0.35, δ = 0.5, so t = 4:

Data No.                | r (sampling rate)
1 ~ 2t      (1 ~ 8)     | 1
2t+1 ~ 4t   (9 ~ 16)    | 2
4t+1 ~ 8t   (17 ~ 32)   | 4
…                       | …

Page 24

Sticky Sampling Algorithm

1. S: empty list; it will contain entries (e, f), where e is an element and f is its estimated frequency.
2. When data e arrives:
   If e exists in S, increment f in (e, f).
   If e does not exist in S, add entry (e, 1) with probability 1/r (where r is the current sampling rate).
3. Just after r changes:
   For each entry (e, f), repeatedly toss a coin with P(head) = 1/r until the outcome of the coin toss is head.
   For each tail outcome, decrement f in (e, f); if f = 0, delete the entry (e, f).
4. [Output] Get a list of items where f + εN >= sN.
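The steps above can be sketched in code (my own sketch under the slide's definitions; variable names and the rounding of t are assumptions):

```python
import math
import random

def sticky_sampling(stream, s, eps, delta, seed=1):
    """S maps element -> estimated frequency f; r is 1 for the first
    2t elements, then doubles at each boundary 2t, 4t, 8t, ..."""
    rng = random.Random(seed)
    t = math.ceil((1.0 / eps) * math.log(1.0 / (s * delta)))
    S = {}
    r, boundary, n = 1, 2 * t, 0
    for e in stream:
        n += 1
        if e in S:
            S[e] += 1
        elif rng.random() < 1.0 / r:      # sampled with probability 1/r
            S[e] = 1
        if n == boundary:                 # r is about to change
            r *= 2
            boundary += r * t
            for e2 in list(S):            # diminish counts for the new r
                while rng.random() >= 1.0 / r:   # tail
                    S[e2] -= 1
                    if S[e2] == 0:
                        del S[e2]
                        break
    return {e for e, f in S.items() if f + eps * n >= s * n}

# With a stream shorter than 2t, r stays 1 and counting is exact:
print(sticky_sampling(["A"] * 25 + ["B"] * 15, s=0.5, eps=0.1, delta=0.1))
```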

Page 25

Analysis

ε-deficient synopsis: Sticky Sampling computes an ε-deficient synopsis with probability at least 1-δ.

Memory consumption: Sticky Sampling occupies at most (2/ε) ln(1/(sδ)) entries on average.

Page 26

Frequent Pattern Mining over Entire Data Streams

Algorithms:
1. Sticky Sampling Algorithm
2. Lossy Counting Algorithm
3. Space-Saving Algorithm

Page 27

Lossy Counting Algorithm

Inputs to Lossy Counting:
  s: support threshold
  ε: error parameter

Unbounded Data → Lossy Counting → Output: frequent items / infrequent items
Statistics of items are stored in the memory.

Page 28

Lossy Counting Algorithm

1 2 …

Less recent More recent

Each point: a transaction

The stream is divided into buckets: Bucket 1, Bucket 2, Bucket 3, …, Bucket bcurrent.

N: current length of stream
Bucket width w = ⌈1/ε⌉
bcurrent = ⌈N/w⌉

Page 29

Lossy Counting Algorithm

1. D: empty set; it will contain entries (e, f, Δ), where e is an element, f is the frequency of e since this entry was inserted into D, and Δ is the max. possible error in f.
2. When data e arrives:
   If e exists in D, increment f in (e, f, Δ).
   If e does not exist in D, add entry (e, 1, bcurrent - 1).
3. Remove some entries in D whenever N ≡ 0 (mod w), i.e., whenever the stream reaches a bucket boundary. The rule of deletion is: (e, f, Δ) is deleted if f + Δ <= bcurrent.
4. [Output] Get a list of items where f + εN >= sN.
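The steps above can be sketched in code (my own sketch under the slide's definitions; variable names are assumptions):

```python
import math

def lossy_counting(stream, s, eps):
    """D maps element -> (f, delta); buckets have width w = ceil(1/eps)."""
    w = math.ceil(1.0 / eps)
    D, n = {}, 0
    for e in stream:
        n += 1
        b_current = math.ceil(n / w)
        if e in D:
            f, delta = D[e]
            D[e] = (f + 1, delta)
        else:
            D[e] = (1, b_current - 1)     # max. possible error so far
        if n % w == 0:                    # bucket boundary: prune
            D = {e2: (f, delta) for e2, (f, delta) in D.items()
                 if f + delta > b_current}
    return {e for e, (f, delta) in D.items() if f + eps * n >= s * n}

print(lossy_counting(["A"] * 8 + ["B"] * 2, s=0.5, eps=0.1))  # {'A'}
```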

Page 30

Lossy Counting Algorithm

ε-deficient synopsis: Lossy Counting computes an ε-deficient synopsis (with 100% confidence).

Memory consumption: Lossy Counting occupies at most (1/ε) ln(εN) entries.

Page 31

Comparison

Algorithm       | ε-deficient synopsis | Memory Consumption
Sticky Sampling | 1-δ confidence       | (2/ε) ln(1/(sδ))
Lossy Counting  | 100% confidence      | (1/ε) ln(εN)

E.g., s = 0.02, ε = 0.01, δ = 0.1, N = 1000:
Sticky Sampling memory = 1243; Lossy Counting memory = 231
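The memory figures here and on the next two slides follow from the two bounds, assuming the natural logarithm and rounding up (assumptions of mine that reproduce the slide's numbers):

```python
import math

def sticky_memory(s, eps, delta):
    return math.ceil((2.0 / eps) * math.log(1.0 / (s * delta)))

def lossy_memory(eps, N):
    return math.ceil((1.0 / eps) * math.log(eps * N))

print(sticky_memory(0.02, 0.01, 0.1))     # 1243 (independent of N)
print(lossy_memory(0.01, 1000))           # 231
print(lossy_memory(0.01, 1_000_000))      # 922
print(lossy_memory(0.01, 1_000_000_000))  # 1612
```

Note that Sticky Sampling's bound does not depend on N, so Lossy Counting wins for small N but loses for very large N.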

Page 32

Comparison

Algorithm       | ε-deficient synopsis | Memory Consumption
Sticky Sampling | 1-δ confidence       | (2/ε) ln(1/(sδ))
Lossy Counting  | 100% confidence      | (1/ε) ln(εN)

E.g., s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000:
Sticky Sampling memory = 1243; Lossy Counting memory = 922

Page 33

Comparison

Algorithm       | ε-deficient synopsis | Memory Consumption
Sticky Sampling | 1-δ confidence       | (2/ε) ln(1/(sδ))
Lossy Counting  | 100% confidence      | (1/ε) ln(εN)

E.g., s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000,000:
Sticky Sampling memory = 1243; Lossy Counting memory = 1612

Page 34

Frequent Pattern Mining over Entire Data Streams

Algorithms:
1. Sticky Sampling Algorithm
2. Lossy Counting Algorithm
3. Space-Saving Algorithm

Page 35

Sticky Sampling Algorithm

Inputs to Sticky Sampling:
  s: support threshold
  ε: error parameter
  δ: confidence parameter

Unbounded Data → Sticky Sampling → Output: frequent items / infrequent items
Statistics of items are stored in the memory.

Page 36

Lossy Counting Algorithm

Inputs to Lossy Counting:
  s: support threshold
  ε: error parameter

Unbounded Data → Lossy Counting → Output: frequent items / infrequent items
Statistics of items are stored in the memory.

Page 37

Space-Saving Algorithm

Inputs to Space-Saving:
  s: support threshold
  M: memory parameter

Unbounded Data → Space-Saving → Output: frequent items / infrequent items
Statistics of items are stored in the memory.

Page 38

Space-Saving

M: the greatest number of possible entries stored in the memory

Page 39

Space-Saving Algorithm

1. D: empty set; it will contain entries (e, f, Δ), where e is an element, f is the frequency of e since this entry was inserted into D, and Δ is the max. possible error in f.
2. pe = 0
3. When data e arrives:
   If e exists in D, increment f in (e, f, Δ).
   If e does not exist in D:
     If the size of D = M:
       pe ← min over entries in D of (f + Δ)
       Remove all entries whose f + Δ = pe
     Add entry (e, 1, pe).
4. [Output] Get a list of items where f + Δ >= sN.
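The steps above can be sketched in code (my own sketch under the slide's definitions; variable names are assumptions):

```python
def space_saving(stream, s, M):
    """D maps element -> (f, delta) and never holds more than M entries."""
    D, n = {}, 0
    for e in stream:
        n += 1
        if e in D:
            f, delta = D[e]
            D[e] = (f + 1, delta)
        else:
            pe = 0
            if len(D) == M:               # full: evict the minimum entries
                pe = min(f + delta for f, delta in D.values())
                D = {e2: (f, delta) for e2, (f, delta) in D.items()
                     if f + delta != pe}
            D[e] = (1, pe)                # new entry may overestimate by pe
    return {e for e, (f, delta) in D.items() if f + delta >= s * n}

# M = 2 entries for a stream dominated by "A":
print(space_saving(["A"] * 6 + ["B", "C", "B", "C"], s=0.5, M=2))  # {'A'}
```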

Page 40

Space-Saving

Greatest error: let E be the greatest error in any estimated frequency (as a fraction of N). Then E <= 1/M.

ε-deficient synopsis: Space-Saving computes an ε-deficient synopsis if E <= ε.

Page 41

Comparison

Algorithm       | ε-deficient synopsis           | Memory Consumption
Sticky Sampling | 1-δ confidence                 | (2/ε) ln(1/(sδ))
Lossy Counting  | 100% confidence                | (1/ε) ln(εN)
Space-Saving    | 100% confidence (where E <= ε) | M

E.g., s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000,000:
Sticky Sampling memory = 1243; Lossy Counting memory = 1612.
For Space-Saving, the memory M can be very large (e.g., 4,000,000); since E <= 1/M, the error is very small.

Page 42

Data Streams

Entire Data Streams
Data Streams with Sliding Window

Page 43

Data Streams with Sliding Window

Association: frequent pattern/itemset
Clustering
Classification

Page 44

Sliding Window

Mining frequent itemsets in a sliding window, e.g.:
t1: I1 I2
t2: I1 I3 I4
…

Goal: to find the frequent itemsets in a sliding window over the transactions t1, t2, …

Page 45

Sliding Window

The sliding window covers the last 4 batches: B1, B2, B3, B4.
Each batch has its own storage.

Page 46

Sliding Window

The sliding window covers the last 4 batches. When a new batch B5 arrives, the whole oldest batch (B1) is removed, so the window becomes B2, B3, B4, B5.
Each batch has its own storage.
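The batch-based window can be sketched with one storage per batch (my own sketch, using a plain Counter as the per-batch storage; real systems would keep richer per-batch summaries):

```python
from collections import Counter, deque

def sliding_window_counts(batches, window_size=4):
    """Keep one storage (here a Counter) per batch; when a new batch
    arrives and the window is full, drop the whole oldest batch."""
    window = deque()
    for batch in batches:
        if len(window) == window_size:
            window.popleft()            # remove the whole oldest batch
        window.append(Counter(batch))   # per-batch storage
    total = Counter()
    for c in window:
        total += c
    return total

# B1..B5: once B5 arrives, B1's storage is dropped entirely.
batches = [["I1", "I2"], ["I1"], ["I3"], ["I1", "I3"], ["I2"]]
print(sliding_window_counts(batches))  # counts over B2, B3, B4, B5 only
```

Dropping a whole batch at once is what makes the per-batch storage layout attractive: no per-transaction bookkeeping is needed at expiry time.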