IncSpan: Incremental Mining of Sequential Patterns in Large Database

IncSpan: Incremental Mining of Sequential Patterns in Large Database. Hong Cheng, Xifeng Yan, Jiawei Han. Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD'04). Advisor: Jia-Ling Koh. Speaker: Chun-Wei Hsieh. 02/25/2005


Page 1: IncSpan: Incremental Mining of Sequential Patterns in Large Database

IncSpan: Incremental Mining of Sequential Patterns in Large Database

Hong Cheng, Xifeng Yan, Jiawei Han

Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD'04)

Advisor: Jia-Ling Koh

Speaker: Chun-Wei Hsieh

02/25/2005

Page 2: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Problem

Databases are updated incrementally (for example, customer shopping transaction sequences, weather sequences, and patient treatment sequences).

Two kinds of database updates:

(1) INSERT: inserting new sequences (new customers).

(2) APPEND: appending new itemsets/items to the existing sequences (newly purchased items for existing customers).

Page 3: IncSpan: Incremental Mining of Sequential Patterns in Large Database

The properties of updates:

INSERT: If a sequence is infrequent in both DB and db, it cannot be frequent in DB’.

APPEND: Even if a sequence is infrequent in both DB and db, it might be frequent in DB’.

When the database is updated with a combination of INSERT and APPEND, we can treat INSERT as a special case of APPEND by treating the inserted sequences as transactions appended to an empty sequence in the original database.
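The difference between the two update kinds can be seen on a toy database. The following is a minimal sketch, not code from the paper: the helper names `contains` and `support`, the toy data, and the representation of a sequence as a list of item sets are all assumptions made for illustration. Under INSERT the support counts of DB and db simply add up, so with a relative threshold the support ratio in DB’ is a weighted average of the two and cannot exceed both. Under APPEND a pattern can span the old and the new part of one sequence, so it may appear in DB’ without appearing in DB or db.

```python
def contains(seq, pattern):
    """True if `pattern` is a subsequence of `seq` (both are lists of item sets)."""
    i = 0
    for itemset in seq:
        # Greedily match the next pattern element against the current itemset.
        if i < len(pattern) and pattern[i] <= itemset:
            i += 1
    return i == len(pattern)


def support(db, pattern):
    """Number of sequences in `db` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in db)


pattern = [{"a"}, {"b"}]  # the pattern <(a)(b)>

# INSERT: DB' is DB plus whole new sequences, so support counts add up.
DB = [[{"a"}, {"b"}], [{"a"}], [{"c"}], [{"c"}]]
db_inserted = [[{"b"}], [{"a"}, {"b"}]]
DB_after_insert = DB + db_inserted
insert_supports = (
    support(DB, pattern),
    support(db_inserted, pattern),
    support(DB_after_insert, pattern),
)

# APPEND: each new piece is attached to the end of an existing sequence.
DB2 = [[{"a"}], [{"a"}], [{"a"}]]
db_appended = [[{"b"}], [{"b"}], [{"b"}]]
DB_after_append = [old + new for old, new in zip(DB2, db_appended)]
append_supports = (
    support(DB2, pattern),
    support(db_appended, pattern),
    support(DB_after_append, pattern),
)

print(insert_supports)  # supports in DB, db, DB' for INSERT
print(append_supports)  # supports in DB, db, DB' for APPEND
```

In the APPEND case the pattern is absent from both the original and the appended parts, yet every updated sequence contains it.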

Page 4: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Examples

[Figure: examples of INSERT and APPEND database updates]

Page 5: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Preliminary Concepts

An original sequence database: DB = {S1, S2, ..., Sn}

An appended sequence database: DB’ = DB ∪ db

Min_sup: a minimum support threshold

FS: the set of frequent sequential patterns

Buffer ratio: μ ≤ 1

SFS: the set of semi-frequent sequential patterns

The problem of incremental sequential pattern mining is to mine the set of frequent subsequences FS’ in DB’ based on FS instead of mining on DB’ from scratch.
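The roles of Min_sup and the buffer ratio can be sketched as a three-way classification. The slide does not spell out the thresholds, so this sketch assumes the usual reading: a pattern is frequent when its support is at least Min_sup, and semi-frequent when its support is below Min_sup but at least μ * Min_sup. The function name `classify` is hypothetical.

```python
def classify(sup, min_sup, mu):
    """Classify a pattern by its support count.

    Assumed thresholds (not stated on the slide):
      frequent       sup >= min_sup
      semi-frequent  mu * min_sup <= sup < min_sup
      infrequent     sup < mu * min_sup
    """
    if sup >= min_sup:
        return "frequent"
    if sup >= mu * min_sup:
        return "semi-frequent"
    return "infrequent"


# With the setting used in the later examples (Min_sup = 3, mu = 0.6),
# the semi-frequent band starts at 1.8, so only support 2 falls inside it.
for sup in range(5):
    print(sup, classify(sup, min_sup=3, mu=0.6))
```

Patterns in the semi-frequent band are the ones buffered in SFS, so that a small increase in support can promote them without rescanning the whole database.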

Page 6: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Buffering Semi-frequent Patterns

When the database DB is updated to DB’, there are several possibilities:

1. A pattern which is frequent in DB is still frequent in DB’.

2. A pattern which is semi-frequent in DB becomes frequent in DB’.

3. A pattern which is semi-frequent in DB is still semi-frequent in DB’.

4. The appended database db brings new items.

5. A pattern which is infrequent in DB becomes frequent in DB’.

6. A pattern which is infrequent in DB becomes semi-frequent in DB’.

Cases (1) to (3) are trivial cases.

Page 7: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Case (4):

The appended database db brings new items that do not appear in DB.

Property: An item which does not appear in DB and is brought by db has no information in FS or SFS.

Solution: Scan the database LDB for single items. Then use each new frequent item as a prefix to construct a projected database and discover frequent and semi-frequent sequences recursively.

Page 8: IncSpan: Incremental Mining of Sequential Patterns in Large Database

LDB and ODB

LDB is the set of sequences in DB’ which are appended with items/itemsets.

ODB is the set of sequences in DB which are appended with items/itemsets in DB’.
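The two sets can be derived from the original database and the appended parts. This is a minimal sketch under stated assumptions: sequences are keyed by a hypothetical sequence id, an update always adds whole itemsets at the end of a sequence, and an inserted sequence is treated as an append to an empty sequence (so it shows up in LDB but has no counterpart in ODB). The function name `split_updated` is made up for illustration.

```python
def split_updated(db, appended):
    """Return (LDB, ODB) as dicts keyed by sequence id.

    db:       {seq_id: list of item sets}   original database DB
    appended: {seq_id: list of item sets}   parts added by the update db
    """
    # ODB: the old versions of the sequences that receive an append.
    odb = {sid: db[sid] for sid in appended if sid in db}
    # LDB: the updated versions; an unknown id is an INSERT, i.e. an
    # append to an empty sequence.
    ldb = {sid: db.get(sid, []) + tail for sid, tail in appended.items()}
    return ldb, odb


DB = {1: [{"a"}], 2: [{"b"}], 3: [{"c"}]}
db_update = {1: [{"b"}], 4: [{"a"}, {"c"}]}  # append to 1, insert 4

LDB, ODB = split_updated(DB, db_update)
print(LDB)
print(ODB)
```

Only LDB and ODB need to be scanned to adjust supports, because sequences outside them are unchanged by the update.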

Page 9: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Case (4): examples (c)

Min_sup = 3, μ = 0.6

Page 10: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Case (5):

A pattern which is infrequent in DB becomes frequent in DB’.

Property: If an infrequent sequence p’ in DB becomes frequent in DB’, all of its prefix subsequences must also be frequent in DB’.

Solution: Start from its frequent prefix p in FS and construct the p-projected database; we will then discover p’.

A sequence p’ which changes from infrequent to frequent must have Δsup(p’) > (1 - μ)*min_sup.

If sup_LDB(p) < (1 - μ)*min_sup, we can safely prune search with prefix p.

Page 11: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Case (5): examples (a,c)

Min_sup = 3, μ = 0.6

Page 12: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Case (5): theorem

For a frequent pattern p, if its support in LDB satisfies sup_LDB(p) < (1 - μ)*min_sup, then there is no sequence p’ having p as prefix that changes from infrequent in DB to frequent in DB’.

Proof: p’ was infrequent in DB, so sup_DB(p’) < μ*min_sup. (1)

If sup_LDB(p) < (1 - μ)*min_sup, then sup_LDB(p’) ≤ sup_LDB(p) < (1 - μ)*min_sup.

Since sup_LDB(p’) = sup_ODB(p’) + Δsup(p’), we have Δsup(p’) ≤ sup_LDB(p’) < (1 - μ)*min_sup. (2)

Since sup_DB’(p’) = sup_DB(p’) + Δsup(p’), combining (1) and (2), we have sup_DB’(p’) < min_sup. So p’ cannot be frequent in DB’.
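The pruning condition and the bound in the proof can be checked numerically. The sketch below is illustrative only: `can_prune` and `bound_holds` are hypothetical names, and exact fractions are used for μ so that the strict inequalities are not disturbed by floating-point rounding. `bound_holds` takes the worst case allowed by the proof, where the support gain of p’ equals sup_LDB(p), and confirms that a pruned prefix can never lift an infrequent p’ to Min_sup.

```python
from fractions import Fraction


def can_prune(sup_ldb_p, min_sup, mu):
    """Pruning test from the theorem: sup_LDB(p) < (1 - mu) * min_sup."""
    return sup_ldb_p < (1 - mu) * min_sup


def bound_holds(min_sup, mu):
    """Brute-force check of the proof for integer support counts.

    old_sup plays the role of sup_DB(p') with sup_DB(p') < mu * min_sup,
    delta plays the role of the support gain of p', bounded by sup_LDB(p).
    """
    for old_sup in range(min_sup + 1):
        if not old_sup < mu * min_sup:
            continue  # p' was not infrequent in DB
        for delta in range(min_sup + 1):
            if can_prune(delta, min_sup, mu) and old_sup + delta >= min_sup:
                return False  # would contradict the theorem
    return True


mu = Fraction(3, 5)  # buffer ratio 0.6, as in the examples
print(can_prune(1, 3, mu))  # 1 < 1.2
print(can_prune(2, 3, mu))  # 2 >= 1.2
print(all(bound_holds(m, mu) for m in range(1, 51)))
```

With Min_sup = 3 and μ = 0.6 the cut-off is 1.2, so a frequent prefix whose support in LDB is 0 or 1 does not need to be projected.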

Page 13: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Case (6):

A pattern which is infrequent in DB becomes semi-frequent in DB’.

Property: If an infrequent sequence p’ becomes semi-frequent in DB’, all of its prefix subsequences must be either frequent or semi-frequent.

Solution: Start from its prefix p in FS or SFS and construct the p-projected database; we will then discover p’.

Page 14: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Case (6): examples (be)

Min_sup = 3, μ = 0.6

Page 15: IncSpan: Incremental Mining of Sequential Patterns in Large Database

IncSpan

Step 1: Scan LDB for single items, as shown in case (4).

Step 2: Check every pattern in FS and SFS in LDB to adjust the support of those patterns.

Step 2.1: If a pattern becomes frequent, add it to FS’. Then check whether it meets the projection condition. If so, use it as a prefix to project the database, as shown in case (5).

Step 2.2: If a pattern is semi-frequent, add it to SFS’.
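Step 2 can be sketched as a single pass over the buffered patterns. This is an illustrative sketch rather than the paper's algorithm: the function name `update_buffers`, the representation of a pattern as a tuple of frozensets, and the way the support change is computed (support in LDB minus support in ODB) are assumptions. The projection itself is not performed here; qualifying patterns are only collected.

```python
def contains(seq, pattern):
    """True if `pattern` is a subsequence of `seq` (sequences of item sets)."""
    i = 0
    for itemset in seq:
        if i < len(pattern) and pattern[i] <= itemset:
            i += 1
    return i == len(pattern)


def update_buffers(old_sup, ldb, odb, min_sup, mu):
    """Adjust supports of buffered patterns and rebuild FS' and SFS'.

    old_sup: {pattern: support in DB} for every pattern in FS and SFS.
    Returns (FS', SFS', patterns meeting the projection condition).
    """
    fs_new, sfs_new, to_project = {}, {}, []
    for pattern, sup in old_sup.items():
        sup_ldb = sum(contains(s, pattern) for s in ldb)
        sup_odb = sum(contains(s, pattern) for s in odb)
        new_sup = sup + sup_ldb - sup_odb  # only updated sequences change
        if new_sup >= min_sup:
            fs_new[pattern] = new_sup  # Step 2.1
            if sup_ldb >= (1 - mu) * min_sup:  # projection condition
                to_project.append(pattern)
        elif new_sup >= mu * min_sup:
            sfs_new[pattern] = new_sup  # Step 2.2
    return fs_new, sfs_new, to_project


A, B, C = frozenset("a"), frozenset("b"), frozenset("c")
old_sup = {(A,): 3, (A, B): 2, (C,): 2}
odb = [[{"a"}], [{"a"}, {"c"}]]
ldb = [[{"a"}, {"b"}], [{"a"}, {"c"}, {"b"}]]

fs_new, sfs_new, to_project = update_buffers(old_sup, ldb, odb, min_sup=3, mu=0.6)
print(fs_new, sfs_new, to_project)
```

In this toy run the buffered semi-frequent pattern <(a)(b)> gains two supporting sequences and moves into FS’ without any scan of the unchanged part of the database.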

Page 16: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Algorithm

Page 17: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Reverse Pattern Matching

Let s be an original sequence, sa the part appended to it, and s’ the updated sequence (s followed by sa).

Since the appended items are always at the end of the original sequence, reverse pattern matching is more efficient than projection from the front.

If the last item of p is not supported by sa, we can prune searching.

If the last item of p is supported by sa, we have to check whether s’ supports p. If p is not supported by s’, we can prune searching and keep sup(p) unchanged. Otherwise we have to check whether s supports p. If s supports p, keep sup(p) unchanged; otherwise, increase sup(p) by 1.
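The decision steps above can be sketched as a function that returns how much sup(p) changes for one updated sequence. The names `contains_reverse` and `support_delta` are hypothetical, and the sketch assumes that an append adds whole itemsets after the end of s, so the check on the "last item" is done on the last itemset of p. Matching runs from the back of the sequence, which is where the appended part sits.

```python
def contains_reverse(seq, pattern):
    """True if `pattern` is a subsequence of `seq`, matching right to left."""
    j = len(pattern) - 1
    for itemset in reversed(seq):
        if j >= 0 and pattern[j] <= itemset:
            j -= 1
    return j < 0


def support_delta(pattern, s, sa):
    """Change in sup(pattern) caused by appending `sa` to `s` (0 or 1)."""
    last = pattern[-1]
    # The last element of p must occur in the appended part, otherwise
    # the append cannot create a new occurrence of p.
    if not any(last <= itemset for itemset in sa):
        return 0
    # The updated sequence s' = s + sa must support p.
    if not contains_reverse(s + sa, pattern):
        return 0
    # If s already supported p, it was counted before the update.
    return 0 if contains_reverse(s, pattern) else 1


p = [{"a"}, {"b"}]
print(support_delta(p, [{"a"}], [{"b"}]))          # new occurrence
print(support_delta(p, [{"a"}, {"b"}], [{"b"}]))   # already counted
print(support_delta(p, [{"c"}], [{"b"}]))          # s' does not support p
print(support_delta(p, [{"a"}], [{"c"}]))          # pruned by the first test
```

The first test is the cheap one: it looks only at the appended part, so most buffered patterns are dismissed without touching the original sequence.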

Page 18: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Shared Projection

When we detect a subsequence that needs a projected database, we do not do the projection immediately. Instead we label it. After checking and labeling all the sequences, we do the projection by traversing the sequential pattern tree.

For example, <(a)(b)(c)(d)> and <(a)(b)(c)(e)> share the prefix <(a)(b)(c)>, so the projected database DB’|<(a)(b)(c)> can be built once and used for both.
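The saving from sharing can be sketched with a cache keyed by prefix. This is not the tree traversal described on the slide: memoization is swapped in as a simpler stand-in, the names `project` and `shared_projections` are hypothetical, and the projection is simplified to "keep what follows the first itemset containing the element", which ignores extensions inside the same itemset.

```python
def project(db, element):
    """Simplified projection of `db` on one pattern element (an item set)."""
    projected = []
    for seq in db:
        for i, itemset in enumerate(seq):
            if element <= itemset:
                projected.append(seq[i + 1:])  # suffix after the first match
                break
    return projected


def shared_projections(db, labeled):
    """Project `db` on every labeled pattern, reusing common prefixes.

    Returns ({pattern: projected database}, number of projection steps).
    """
    cache = {(): db}
    steps = 0

    def get(prefix):
        nonlocal steps
        if prefix not in cache:
            # Extend the parent's projection by one element and remember it.
            cache[prefix] = project(get(prefix[:-1]), prefix[-1])
            steps += 1
        return cache[prefix]

    return {p: get(p) for p in labeled}, steps


a, b, c, d, e = (frozenset(x) for x in "abcde")
db_new = [
    [{"a"}, {"b"}, {"c"}, {"d"}, {"e"}],
    [{"a"}, {"b"}, {"c"}, {"e"}],
]
labeled = [(a, b, c, d), (a, b, c, e)]

result, steps = shared_projections(db_new, labeled)
print(steps)  # projection steps actually performed
```

Projecting the two labeled patterns independently would take eight single-element steps; with the shared prefix <(a)(b)(c)> computed once, five are enough in this sketch.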

Page 19: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Experiment

(a) Varying min_sup

(b) Varying percentage of updated sequences

Page 20: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Experiment

(a) Varying buffer ratio

(c) Memory usage under varied min_sup

Page 21: IncSpan: Incremental Mining of Sequential Patterns in Large Database

Experiment

(b) Multiple increments of database

(c) Varying # of sequences (in 1000) in DB