LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets

Linear time Closed itemset Miner

Takeaki Uno (National Institute of Informatics)

Tatsuya Asai (Kyushu University)

Hiroaki Arimura (Kyushu University)

Yuzo Uchida (Kyushu University)

19/Nov/2003 FIMI 2003

Motivation

- We want to solve difficult problems in a short time

Few solutions for small supports

Many solutions even for large supports

#closed sets = #freq. sets

#closed sets << #freq. sets

[Figure: data sets placed between small supports and large supports: retail, accidents, IBMdatas, chess, connect, mushroom, kosarak, pumsb*, pumsb, BMS-POS, BMS-web1,2]

・ database reduction

・ remove infrequent items

・ sparse/dense (occ-deliv/diffsets)

・ exact enumeration of closed item sets

・ generation of all/maximal item sets from closed item sets

Outline of Our Research

- Exact enumeration of closed item sets

(no sophisticated pruning, no post-processing, and no memory for storing the closed item sets already obtained)

- Enumerate all/maximal frequent item sets using closed item sets

- Algorithms for updating occurrences and checking maximality in dense/sparse cases, and their adaptive hybrid

- Save additional memory use

(right first sweep, adjacency matrix only for large transactions)

Exact Enumeration of Closed Item Sets

- Introduce an acyclic parent-child relationship on freq. closed sets

(it induces a tree-shaped traversal route)

- Traverse the route in a depth-first manner

(find a child, and go to it)

Exact enumeration (time linear in #closed sets)

Any child is found by taking a closure (in short time)

No need to store the obtained item sets (small memory)

Can enumerate all closed item sets (even without min. support)

[Figure: traversal tree with root (= φ)]

Definition of Parent

X : closed item set

parent of X = closure of X ∩ {1,…,i}

where i is the maximum s.t. X ≠ closure of X ∩ {1,…,i}

parent of X ⊂ X, so the relationship is acyclic

X' = child of X ⇔ X' is the closure of X ∪ {i} for some i, and (cond) X' \ X includes no item < i

All children are found by taking the closure of X ∪ {i}

(cond) can be checked in short time by using some algorithms
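The depth-first traversal described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `closure`, `closed_item_sets` and `DB` are hypothetical, items are assumed to be positive integers, and `minsup` is assumed to be at least 1. The extra restriction that a child is generated only with an item larger than the one that generated X (`core` below) is an assumption that is not stated on the slide; it is added here so that each closed set is reached from exactly one parent.

```python
from typing import FrozenSet, Iterator, List, Sequence


def closure(db: Sequence[FrozenSet[int]], occ: List[int]) -> FrozenSet[int]:
    """Largest item set contained in every transaction listed in `occ`."""
    result = set(db[occ[0]])
    for t in occ[1:]:
        result &= db[t]
    return frozenset(result)


def closed_item_sets(db: Sequence[FrozenSet[int]], minsup: int = 1) -> Iterator[FrozenSet[int]]:
    """Depth-first traversal of the parent-child tree; no found set is stored."""
    items = sorted(set().union(*db))

    def visit(x: FrozenSet[int], occ: List[int], core: int) -> Iterator[FrozenSet[int]]:
        yield x
        for i in items:
            if i <= core or i in x:
                continue
            occ_i = [t for t in occ if i in db[t]]  # occurrences of X ∪ {i}
            if len(occ_i) < minsup:
                continue
            child = closure(db, occ_i)
            # (cond): the closure must not add any item smaller than i
            if any(j < i for j in child - x):
                continue
            yield from visit(child, occ_i, i)

    all_occ = list(range(len(db)))
    if all_occ and len(all_occ) >= minsup:
        # The root is the closure of the empty set; 0 is below every item.
        yield from visit(closure(db, all_occ), all_occ, 0)


# Hypothetical example database (six transactions over items 1..9).
DB = [frozenset(t) for t in (
    {1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2},
)]
found = list(closed_item_sets(DB, minsup=1))
```

Because every set is produced by a single generator walk, duplicates never need to be detected against a table of earlier results.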

[Figure: closed item set X and its child X']

Closure = maximal item set with the same occurrences
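With the closure taken as the intersection of all occurrences, the parent of a closed item set can be computed directly from the definition of parent. The sketch below is only an illustration; `closure`, `parent` and `DB` are hypothetical names, and X is assumed to be closed and to occur in at least one transaction.

```python
from typing import FrozenSet, Optional, Sequence


def closure(db: Sequence[FrozenSet[int]], itemset: FrozenSet[int]) -> FrozenSet[int]:
    """Maximal item set with the same occurrences as `itemset`."""
    occ = [t for t in db if itemset <= t]
    if not occ:
        raise ValueError("item set occurs in no transaction")
    return frozenset.intersection(*occ)


def parent(db: Sequence[FrozenSet[int]], x: FrozenSet[int]) -> Optional[FrozenSet[int]]:
    """Closure of X ∩ {1,…,i} for the largest i at which that closure differs from X."""
    xs = sorted(x)
    # Dropping the largest items one by one enumerates every distinct prefix X ∩ {1,…,i}.
    for k in range(len(xs) - 1, -1, -1):
        candidate = closure(db, frozenset(xs[:k]))
        if candidate != x:
            return candidate
    return None  # x is the root (the closure of the empty set)


# Hypothetical example database.
DB = [frozenset(t) for t in (
    {1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2},
)]
p = parent(DB, frozenset({1, 2, 7, 9}))  # frozenset({1, 7, 9})
```

Following `parent` repeatedly from any closed set ends at the root, which is why the relationship induces a tree.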

Computation of Occurrences of X ∪ {i} for Sparse and Dense Cases

- In sparse case, by tracing items of each occurrence of X

(occurrence deliver : maybe a known technique)

- In dense case, use diffsets (proposed by Zaki)
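For the dense case, the diffset idea can be sketched as below: instead of the occurrences of X ∪ {i}, keep the occurrences of X that are lost when i is added. This is a small illustration under assumed names (`occurrences`, `diffset`, `extend_diffset`, `DENSE_DB` are hypothetical), not the code used in LCM.

```python
from typing import FrozenSet, Sequence, Set


def occurrences(db: Sequence[FrozenSet[int]], itemset: FrozenSet[int]) -> Set[int]:
    """Indices of the transactions containing `itemset`."""
    return {t for t, trans in enumerate(db) if itemset <= trans}


def diffset(occ_x: Set[int], occ_xi: Set[int]) -> Set[int]:
    """Occurrences of X that do not contain the added item i."""
    return occ_x - occ_xi


def extend_diffset(d_i: Set[int], d_j: Set[int]) -> Set[int]:
    """Diffset of X ∪ {i, j} relative to X ∪ {i}, from the diffsets of i and j relative to X."""
    return d_j - d_i


# Hypothetical dense database: most transactions contain most items.
DENSE_DB = [frozenset(t) for t in ({1, 2, 3}, {1, 2, 3}, {1, 2}, {1, 3}, {1, 2, 3})]
occ_x = occurrences(DENSE_DB, frozenset({1}))
d2 = diffset(occ_x, occurrences(DENSE_DB, frozenset({1, 2})))
d3 = diffset(occ_x, occurrences(DENSE_DB, frozenset({1, 3})))
d23 = extend_diffset(d2, d3)
support_x2 = len(occ_x) - len(d2)      # support of {1, 2}
support_x23 = support_x2 - len(d23)    # support of {1, 2, 3}
```

In dense data the diffsets are much shorter than the occurrence lists, so the set differences touch far fewer elements.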

Adaptive Hybrid Algorithm

In each iteration, we choose the better of the two methods according to estimates of their computation time

Maximal and All Frequent Sets

- Maximal frequent sets generated from closed item sets

- All frequent sets (hypercube decomposition)

-- decompose classes of closed item sets into complete sublattices

-- enumerate pairs of greatest/least elements of sublattices

-- generate others from the pairs
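The last step of the hypercube decomposition, generating the other members of a sublattice from its least and greatest elements, can be sketched as follows. The function name `expand_sublattice` is hypothetical, and the sketch assumes the pair really bounds a complete sublattice, that is, every set between the two has the same occurrences.

```python
from itertools import combinations
from typing import FrozenSet, Iterator


def expand_sublattice(least: FrozenSet[int], greatest: FrozenSet[int]) -> Iterator[FrozenSet[int]]:
    """Yield every item set Y with least ⊆ Y ⊆ greatest (a hypercube of 2^k sets)."""
    free = sorted(greatest - least)  # the k items that may be switched on or off
    for r in range(len(free) + 1):
        for extra in combinations(free, r):
            yield least | frozenset(extra)


# Example pair: {1} and {1, 7, 9} bound a hypercube of 2^2 = 4 sets.
members = list(expand_sublattice(frozenset({1}), frozenset({1, 7, 9})))
```

Only the pair has to be found by mining; the remaining sets are written out without touching the database again.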

[Figure: the class of a closed item set drawn as a 01 lattice between 000…0 and 111…1]

Result

[Figure: data sets (retail, accidents, IBMdatas, chess, connect, mushroom, kosarak, pumsb*, pumsb, BMS-POS, BMS-web1,2) placed between small supports and large supports, with regions marked "fast", "fast if support is small", "fast or usual" and "slower than others"]

Conclusion

- For data sets s.t. #freq. closed sets << #freq. sets

- large business datasets: BMS-web1,2, retail

- machine learning datasets with small supports: UCI repository

exact enumeration of closed item sets and hypercube decomposition perform well

(fast without pruning, tries, or other existing methods)

- These techniques are orthogonal to other techniques (database reduction, pruning infrequent items, …), so for further speed-up we can do better for large supports / accidents (blue area).

- Parameter of the hybrid is not tuned: not fast for kosarak, IBMdatas (now faster)

We think…

● What is the real problem (bottleneck)?

---- Mining structured item sets

(closed item sets, association rules with threshold, …)

● Is it only a counting problem?

---- For all frequent item set mining, yes.

The problem is how to make the occurrences of an item set

from other item sets (choose best way, represent

● Is maximal item set useful?

---- Closed item sets are useful!!

They have applications to classification and association rule mining

Some Observations

Is pruning of infrequent sets really necessary?

- Computing occurrences for infrequent items from X

Usually, < 1/2. Really need to prune?

[Figure: frequency of X, X ∪ {1}, X ∪ {2}, X ∪ {3}, X ∪ {4}, X ∪ {5}]

Need for accelerating occurrence computation?

- Almost all computation is for updating occurrences

- There is a best e for getting the occurrences of X from X - e

Can we design an algorithm that chooses e in each iteration? How do we find this e? Does this accelerate?

(we can evaluate the lower bound of occurrence computation)


Right First Sweep

- Generate recursive calls in decreasing order of items

- Clear memory after the recursive call

- Re-use the memory in the following recursive calls

Child iterations need no memory
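One way to read the right first sweep is that a single array of occurrence buckets is shared by every iteration. The sketch below illustrates this on a plain frequent item set backtracking, not on the closed set enumeration itself; the function name, the use of items numbered from 0, and the example data are all assumptions made for illustration.

```python
from typing import Iterable, List, Sequence, Tuple


def frequent_item_sets(db: Sequence[Sequence[int]], minsup: int) -> List[Tuple[int, ...]]:
    """Backtracking over items 0..n-1 using one shared array of occurrence buckets."""
    n_items = 1 + max((max(t) for t in db if t), default=-1)
    buckets: List[List[int]] = [[] for _ in range(n_items)]  # shared by all iterations
    found: List[Tuple[int, ...]] = []

    def deliver(occ: Iterable[int], tail: int) -> None:
        # Fill only the buckets of items larger than `tail`.
        for t in occ:
            for item in db[t]:
                if item > tail:
                    buckets[item].append(t)

    def visit(x: Tuple[int, ...], tail: int) -> None:
        # On entry, buckets[i] holds the occurrences of x ∪ {i} for every i > tail.
        for i in range(n_items - 1, tail, -1):  # decreasing order of items
            occ_i = buckets[i]
            if len(occ_i) >= minsup:
                child = x + (i,)
                found.append(child)
                deliver(occ_i, i)  # writes only to buckets above i, which are empty by now
                visit(child, i)    # returns with those buckets empty again
            occ_i.clear()          # clear after the recursive call; the slot is re-used later

    deliver(range(len(db)), -1)
    visit((), -1)
    return found


result = frequent_item_sets([[0, 1, 2], [0, 2], [1, 2]], minsup=2)
```

Since the child for item i only writes to buckets of larger items, and those were already processed and cleared, no child iteration needs bucket memory of its own.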

[Figure: occurrence lists of X ∪ {10}, X ∪ {11}, X ∪ {12}, X ∪ {13}, X ∪ {14}]

Occurrence deliver

Compute T(X ∪ {i}) by tracing each occurrence of X

In sparse cases, fast
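Occurrence deliver can be sketched in a few lines: each occurrence of X is scanned once, and its index is appended to the bucket of every item it contains. The names `occurrence_deliver` and `DB` are hypothetical and the data is made up for illustration.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Sequence


def occurrence_deliver(
    db: Sequence[FrozenSet[int]], occ_x: List[int], x: FrozenSet[int]
) -> Dict[int, List[int]]:
    """Return T(X ∪ {i}) for every item i outside X, in one pass over T(X)."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for t in occ_x:          # trace each occurrence of X once
        for item in db[t]:
            if item not in x:
                buckets[item].append(t)
    return dict(buckets)


# Hypothetical example database.
DB = [frozenset(t) for t in (
    {1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2},
)]
x = frozenset({7, 9})
occ_x = [t for t, trans in enumerate(DB) if x <= trans]
delivered = occurrence_deliver(DB, occ_x, x)
```

The cost is proportional to the total size of the occurrences of X, which is small when the data is sparse.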

[Figure: occurrences A, B, C, D, E of X and the resulting occurrence lists of X ∪ {10}, X ∪ {11}, X ∪ {12}, X ∪ {13}, X ∪ {14}]

Checking (cond) of Closure

- Check (cond): (closure of X ∪ {i}) \ X includes no item < i

- In sparse case, find an occurrence not including j, for all possible items j

- In dense case, update occurrences of all frequent X ∪ {j}, and compute T(X ∪ {i} ∪ {j})

Much faster than computing the closure of X ∪ {i}
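The sparse-case check can be sketched as below: (cond) fails exactly when some item j < i outside X appears in every occurrence of X ∪ {i}, so it is enough to look for one occurrence that misses each candidate j. The names `satisfies_cond` and `DB` are hypothetical, and the occurrence list is assumed to be non-empty.

```python
from typing import FrozenSet, List, Sequence


def satisfies_cond(
    db: Sequence[FrozenSet[int]], x: FrozenSet[int], occ_xi: List[int], i: int
) -> bool:
    """True if the closure of X ∪ {i} adds no item smaller than i."""
    # Only items of the first occurrence can belong to every occurrence.
    for j in db[occ_xi[0]]:
        if j >= i or j in x:
            continue
        if all(j in db[t] for t in occ_xi):
            return False  # j < i lies in the closure, so (cond) is violated
    return True


# Hypothetical example database.
DB = [frozenset(t) for t in (
    {1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2},
)]
root = frozenset()
occ_7 = [t for t, trans in enumerate(DB) if 7 in trans]
occ_9 = [t for t, trans in enumerate(DB) if 9 in trans]
ok_7 = satisfies_cond(DB, root, occ_7, 7)  # closure is {7, 9}: no item below 7 is added
ok_9 = satisfies_cond(DB, root, occ_9, 9)  # closure is {7, 9}: item 7 < 9 is added
```

The inner test usually stops at the first occurrence that lacks j, so the closure itself never has to be built when the check fails early.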

[Figure: occurrences A, B, C of X ∪ {i} checked against the items 1, 2, …, 14]

Results

[Figure: running time (sec) versus minsup (%) on connect, IBM T10I4D100K, BMS-WebView-2 and BMS-WebView-1, for all, closed and maximal frequent item sets. Algorithms compared: LCMfreq, LCM, LCMmax, fpgrowth, fp_eclat, fp_apriori, mafia_fi, mafia_fci, mafia_mfi]
