Cartesian Contour: A Concise Representation for a Collection of Frequent Sets


Ruoming Jin, Kent State University

Joint work with Yang Xiang and Lin Liu (KSU)

Frequent Pattern Mining

• Summarizing the underlying datasets, providing key insights
• Key building block for the data mining toolbox
  – Association rule mining
  – Classification
  – Clustering
  – Change detection
  – etc.
• Application domains
  – Business, biology, chemistry, WWW, computer/networking security, software engineering, …

The Problem

• The number of patterns is too large
• Attempts
  – Maximal Frequent Itemsets
  – Closed Frequent Itemsets
  – Non-Derivable Itemsets
  – Compressed or Top-k Patterns
  – …
• Tradeoff
  – Significant information loss
  – Large size

Pattern Summarization

• Using a small number of itemsets to best represent the entire collection of frequent itemsets
  – The Spanning Set Approach [Afrati-Gionis-Mannila, KDD04]
  – Exact Description = Maximal Frequent Itemsets
• Our problem:
  – Can we find a concise representation that allows both exact and approximate summarization of a collection of frequent itemsets?

Basic Idea

{A,B,G,H}, {A,B,I,J}, {A,B,K,L}
{C,D,G,H}, {C,D,I,J}, {C,D,K,L}
{E,F,G,H}, {E,F,I,J}, {E,F,K,L}

9 itemsets, 36 items.

Pictured as a Cartesian product: {{A,B},{C,D},{E,F}} × {{G,H},{I,J},{K,L}}

1 biclique, 6 itemsets, 12 items
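The saving on this slide can be checked with a short Python sketch. The helper name `cartesian_product` is an illustrative assumption, not something named in the talk; it unions every itemset on one side with every itemset on the other.

```python
from itertools import product

def cartesian_product(left, right):
    """Union every itemset in `left` with every itemset in `right`."""
    return {x | y for x, y in product(left, right)}

left = [frozenset("AB"), frozenset("CD"), frozenset("EF")]
right = [frozenset("GH"), frozenset("IJ"), frozenset("KL")]

itemsets = cartesian_product(left, right)

# Explicit listing: (number of itemsets, total number of items).
explicit_size = (len(itemsets), sum(len(s) for s in itemsets))

# Cartesian-product form: only the two sides need to be stored.
compact_size = (len(left) + len(right), sum(len(s) for s in left + right))

print(explicit_size, compact_size)  # (9, 36) (6, 12)
```

The two sides together store 12 items, while the expanded listing stores 36.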

Covering

[Figure: a Cartesian covering of the frequent itemsets; labels: Cartesian Covering, Non-frequent itemsets]

Problem Formulation

• Cartesian product
  – e.g. {{A,B},{E}} × {{C,D}} = {{A,B,C,D},{C,D,E}}
• Cost of a Cartesian product
  – e.g. 1 biclique, 3 itemsets, and 5 items
• Covering
  – e.g. C({{A,B},{E}} × {{C,D}}) = 2^{A,B,C,D} ∪ 2^{C,D,E}
    = {{A},{B},{C},{D},{A,B},{A,C},{A,D},{B,C},{B,D},{C,D},{A,B,C},{A,B,D},{A,C,D},{B,C,D},{A,B,C,D}} ∪ {{C},{D},{E},{C,D},{C,E},{D,E},{C,D,E}}

How can we use Cartesian products to concisely represent a collection of frequent itemsets?
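A minimal Python sketch of the two notions on this slide, cost and covering, for the product {{A,B},{E}} × {{C,D}}. The function names `cost` and `cover` are assumptions made for illustration, and leaving the empty itemset out of the cover is a choice of this sketch rather than something the slides state.

```python
from itertools import combinations, product

def cost(left, right):
    """Cost of one Cartesian product as (bicliques, itemsets, items)."""
    return (1, len(left) + len(right), sum(len(s) for s in left + right))

def cover(left, right):
    """All non-empty subsets of every itemset generated by left x right."""
    covered = set()
    for x, y in product(left, right):
        items = sorted(x | y)
        for k in range(1, len(items) + 1):
            covered.update(frozenset(c) for c in combinations(items, k))
    return covered

left = [frozenset("AB"), frozenset("E")]
right = [frozenset("CD")]

product_cost = cost(left, right)
covered = cover(left, right)

print(product_cost)   # (1, 3, 5)
print(len(covered))   # 19
```

The 19 covered itemsets are the 15 non-empty subsets of {A,B,C,D} plus the 7 non-empty subsets of {C,D,E}, minus the 3 that both share ({C}, {D}, {C,D}).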

Exact and Approximate Covering

Exact Representation
{{G}} × {{A,B}} and {∅} × {{C,D}}
Cost: 2 bicliques, 4 itemsets, 6 items
False positives: none

Approximate Representation
{{G}} × {{A,B},{C,D}}
Cost: 1 biclique, 3 itemsets, 5 items
False positives: {G,C}, {G,D}, {G,C,D}
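The false-positive lists can be reproduced with a small Python sketch. It assumes the maximal frequent itemsets behind this example are {G,A,B} and {C,D}, which is inferred from the false positives listed on the slide rather than stated there; `downward_closure` is a hypothetical helper name.

```python
from itertools import combinations

def downward_closure(itemsets):
    """All non-empty subsets of the given itemsets."""
    closed = set()
    for s in itemsets:
        items = sorted(s)
        for k in range(1, len(items) + 1):
            closed.update(frozenset(c) for c in combinations(items, k))
    return closed

# Assumed maximal frequent itemsets for the slide's example.
frequent = downward_closure([frozenset("GAB"), frozenset("CD")])

# Approximate: {{G}} x {{A,B},{C,D}} generates {G,A,B} and {G,C,D}.
approx_cover = downward_closure([frozenset("GAB"), frozenset("GCD")])
# Exact: one product generates {G,A,B}, a second generates {C,D} alone.
exact_cover = downward_closure([frozenset("GAB"), frozenset("CD")])

false_pos_approx = sorted("".join(sorted(s)) for s in approx_cover - frequent)
false_pos_exact = exact_cover - frequent

print(false_pos_approx)       # ['CDG', 'CG', 'DG']
print(len(false_pos_exact))   # 0
```

Both representations cover every frequent itemset; the approximate one pays for its lower cost with three itemsets that are not frequent.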

Covering Maximal Frequent Itemsets

Itemset groups: {{ABC}, {CDE}}, {{GHI}, {JKL}}, {{MNO}, {PQR}}, {{STU}, {VWX}}

Maximal frequent itemsets: ABCSTU, ABCGHI, CDESTU, CDEGHI, CDEVWX, CDEJKL, MNOVWX, MNOGHI, PQRJKL

Problem Reformulation

Given maximal frequent itemsets: {{A,B,C,I}, {A,B,D,I}, {C,E,F,J}, {D,E,F,J}, {D,G,H,K}, {E,G,H,K}}

[Figure: the frequent itemsets, with an exact representation and an approximate representation, each drawn with Cartesian products C1 and C2]

Minimal Biclique Set Cover Problem

Ground Set: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

1, 2, 3, 4, 6, 7, 8, 9

5,10,11

NP-hardness

• By reducing the Minimal Biclique Set Cover problem to our problem, we can prove that both Problem 1 (exact) and Problem 2 (approximate) are NP-hard.

• Minimal Biclique Set Cover is a variant of the classical Set Cover problem.

Can we use the standard set-cover greedy algorithm?

Naïve greedy algorithm

• Greedy algorithm:
  – Each time, choose the biclique C_i = X_i × Y_i with the lowest price: its cost divided by the number of still-uncovered elements in the set S_i that it covers.
  – |X_i| + |Y_i| is the cost.
  – This method has a logarithmic approximation bound.
• The problem?
  – The number of candidate bicliques is 2^(|X|+|Y|)!
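As a rough illustration of the naive greedy rule, the Python sketch below repeatedly picks the candidate with the lowest price, taking the price to be (|X_i| + |Y_i|) divided by the number of newly covered elements. The function name `greedy_cover`, the vertex labels, and the candidate list are all hypothetical; enumerating every candidate up front is exactly the step that does not scale.

```python
def greedy_cover(ground, candidates):
    """Pick bicliques by lowest price until the ground set is covered.

    candidates: list of (X, Y, covered), where `covered` is the set of
    ground elements that the biclique X x Y covers.
    """
    uncovered = set(ground)
    chosen = []
    while uncovered:
        best, best_price = None, float("inf")
        for X, Y, covered in candidates:
            gain = len(covered & uncovered)
            if gain == 0:
                continue  # covers nothing new, so it is never worth buying
            price = (len(X) + len(Y)) / gain
            if price < best_price:
                best, best_price = (X, Y, covered), price
        if best is None:
            raise ValueError("candidates cannot cover the ground set")
        chosen.append(best)
        uncovered -= best[2]
    return chosen

# Hypothetical candidates over a ground set of 11 elements.
ground = range(1, 12)
candidates = [
    (("x1", "x2"), ("y1", "y2", "y3", "y4"), {1, 2, 3, 4, 6, 7, 8, 9}),
    (("x3",), ("y5", "y6", "y7"), {5, 10, 11}),
    (("x1",), ("y1",), {1}),
]
picked = greedy_cover(ground, candidates)
print([sorted(c[2]) for c in picked])
# [[1, 2, 3, 4, 6, 7, 8, 9], [5, 10, 11]]
```

The first pick has price 6/8 and the second 4/3; the single-edge candidate is never chosen because it covers nothing new by then.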

Candidate Reduction

• Assume one side of the biclique candidate is known; how do we choose the other side?

Greedy Algorithm

[Figure: a biclique candidate with one side fixed; the other side is split into single Y-vertex bicliques and sorted by coverage (labels: Covering 4, Covering 3, Covering 3)]

• Add 1st single Y-vertex biclique: cost = 1
• Add 2nd single Y-vertex biclique: cost = 5/7
• Add 3rd single Y-vertex biclique: cost = 6/8 > 5/7, so stop
• Cheapest sub-biclique found!
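The split-and-sort step might be sketched in Python as follows. The function name `cheapest_sub_biclique` is hypothetical, and the inputs (an X side of size 3 and Y-vertices covering 4, 3, and 1 new elements) are assumptions picked so that the running prices come out as 1, 5/7, and 6/8. The sketch also assumes that the gains of different Y-vertices simply add up.

```python
from fractions import Fraction

def cheapest_sub_biclique(x_size, y_gain):
    """Fix X, then add Y-vertices by decreasing coverage while the price drops.

    y_gain maps each Y-vertex to the number of uncovered elements that the
    single Y-vertex biclique X x {y} would cover.
    """
    ranked = sorted(y_gain.items(), key=lambda kv: kv[1], reverse=True)
    chosen, covered, best_price = [], 0, None
    for y, gain in ranked:
        if covered + gain == 0:
            break  # nothing left to cover
        price = Fraction(x_size + len(chosen) + 1, covered + gain)
        if best_price is not None and price >= best_price:
            break  # adding this vertex would not lower the price
        chosen.append(y)
        covered += gain
        best_price = price
    return chosen, best_price

chosen, price = cheapest_sub_biclique(3, {"y1": 4, "y2": 3, "y3": 1})
print(chosen, price)  # ['y1', 'y2'] 5/7
```

Stopping at the first vertex that fails to lower the price mirrors the three steps on the slide; it is a greedy choice, not an exhaustive search over all subsets of Y.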

Approximation Bound of the Greedy Algorithm

The greedy SubBiclique procedure finds a sub-biclique whose price is at most e/(e-1) times the price of the optimal (cheapest) sub-biclique!

Further Reduction

• Using Idea 1 alone, the time complexity is still exponential: the number of candidates drops from 2^(|X|+|Y|) to 2^|X|.
• How to reduce this further?
  – Are all the combinations equally important?
  – No, because some are more likely to connect to the Y side.
  – Our solution: frequent itemset mining!

Using Frequent Itemset Mining
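One way to read this step: treat each Y-vertex as a transaction over the X-vertices it connects to, so that a frequent set of X-vertices together with its supporting transactions forms a biclique candidate. The Python sketch below uses brute-force level-wise enumeration instead of a real mining tool, and the function name, the support threshold, and the split of the example itemsets into X and Y sides are all assumptions made for illustration.

```python
from itertools import combinations

def biclique_candidates(adjacency, min_support):
    """adjacency: Y-vertex -> set of X-vertices it is connected to.

    Returns {frequent X-subset: supporting Y-vertices}; each pair is a biclique.
    """
    x_vertices = sorted(set().union(*adjacency.values()))
    candidates = {}
    for k in range(1, len(x_vertices) + 1):
        found = False
        for xs in combinations(x_vertices, k):
            support = frozenset(
                y for y, nbrs in adjacency.items() if set(xs) <= nbrs
            )
            if len(support) >= min_support:
                candidates[frozenset(xs)] = support
                found = True
        if not found:
            break  # anti-monotonicity: no larger X-subset can be frequent
    return candidates

# Edges assumed from the itemsets ABCSTU, ABCGHI, CDESTU, CDEGHI, CDEVWX,
# CDEJKL, MNOVWX, MNOGHI, PQRJKL, split as X = first half, Y = second half.
adjacency = {
    "GHI": {"ABC", "CDE", "MNO"},
    "STU": {"ABC", "CDE"},
    "VWX": {"CDE", "MNO"},
    "JKL": {"CDE", "PQR"},
}
found = biclique_candidates(adjacency, min_support=2)
print(sorted(found[frozenset({"ABC", "CDE"})]))  # ['GHI', 'STU']
```

With a minimum support of 2, five X-subsets survive, including {ABC, CDE} supported by {GHI, STU}, a biclique that generates ABCGHI, ABCSTU, CDEGHI, and CDESTU.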

Overall Algorithm

• Step 1: Use the Frequent Itemset Mining tool to find all the (one-side maximal) biclique candidates;

• Step 2: Calculate the cheapest sub-biclique for each candidate using the greedy procedure;

• Step 3: Compare all the sub-bicliques, choose the cheapest one;

• Step 4: If the MFIs are completely covered, stop; otherwise go to Step 2.

Approximation Bound

Our algorithm has an e/(e-1) · (ln(n) + 1) approximation ratio with respect to the candidate set (all sub-bicliques with one side coming from frequent itemset mining).

Speed-up techniques (1)

• Using closed itemsets for X and Y
  – Initially, X and Y each contain all the frequent itemsets (FI).
  – Using X × Y = {X_1 ∪ Y_1, …, X_1 ∪ Y_m, …, X_l ∪ Y_1, …, X_l ∪ Y_m} to cover the MFIs is similar to factorizing the MFIs;
  – The maximal factor itemsets of the MFIs are closed itemsets, whose number is much smaller!

Speed-up techniques (2)

[Figure: frequent itemsets linked to their supporting transactions, comparing a sparse graph with a dense graph]

• Sparse graph: the number of frequent itemsets is small; valuable biclique candidates are not fully used!
• Dense graph: the number of frequent itemsets is big; handling those candidates is too slow!

TRADEOFF

Speed-up techniques (3)

• Iterative procedure
  – There is a large number of closed itemsets;
  – Covering the MFIs in one pass can produce a huge number of biclique candidates;
  – So the MFIs are covered over several passes;
  – The support level is reduced gradually!

Experiments

• Data sets:

Conclusion

• We propose an interesting summarization problem that considers the interaction between frequent patterns.

• We transform this problem into a generalized minimal biclique covering problem and design an approximation algorithm with a provable bound.

• The experimental results demonstrate the effectiveness and efficiency of our approach.

Thank you !!!

Reference

[Bayardo98] Roberto J. Bayardo Jr. Efficiently mining long patterns from databases. SIGMOD98.
[Pasquier99] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering frequent closed itemsets for association rules. ICDT99.
[Calder07] Toon Calders and Bart Goethals. Non-derivable itemset mining. Data Min. Knowl. Discov. 07.
[Han02] Jiawei Han, Jianyong Wang, Ying Lu, and Petre Tzvetkov. Mining top-k frequent closed patterns without minimum support. ICDM02.
[Xin06] Dong Xin, Hong Cheng, Xifeng Yan, and Jiawei Han. Extracting redundancy-aware top-k patterns. KDD06.
[Xin05] Dong Xin, Jiawei Han, Xifeng Yan, and Hong Cheng. Mining compressed frequent-pattern sets. VLDB05.
[Afrati04] Foto Afrati, Aristides Gionis, and Heikki Mannila. Approximating a collection of frequent sets. KDD04.
[Yan05] Xifeng Yan, Hong Cheng, Jiawei Han, and Dong Xin. Summarizing itemset patterns: a profile-based approach. KDD05.
[Wang06] Chao Wang and Srinivasan Parthasarathy. Summarizing itemset patterns using probabilistic models. KDD06.
[Jin08] Ruoming Jin, Muad Abu-Ata, Yang Xiang, and Ning Ruan. Effective and efficient itemset pattern summarization: regression-based approaches. KDD08.
[Xiang08] Yang Xiang, Ruoming Jin, David Fuhry, and Feodor F. Dragan. Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. KDD08.

Related Work

• K-itemset approximation: [Afrati04].
  – Difference:
    • their work is a special case of our work;
    • their work is expensive for exact description;
    • our work uses set cover and max-k cover methods.
• Restoring the frequency of frequent itemsets: [Yan05, Wang06, Jin08].
• Hyperrectangle covering problem: [Xiang08].
