Cartesian Contour: A Concise Representation for a Collection of Frequent Sets


Page 1: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Ruoming Jin Kent State University

Joint work with Yang Xiang and Lin Liu (KSU)

Page 2: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Frequent Pattern Mining

• Summarizing the underlying datasets, providing key insights
• Key building block for data mining toolbox
  – Association rule mining
  – Classification
  – Clustering
  – Change detection
  – etc.
• Application Domains
  – Business, biology, chemistry, WWW, computer/networking security, software engineering, …

Page 3: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

The Problem

• The number of patterns is too large
• Attempts
  – Maximal Frequent Itemsets
  – Closed Frequent Itemsets
  – Non-Derivable Itemsets
  – Compressed or Top-k Patterns
  – …
• Tradeoff
  – Significant information loss
  – Large size

Page 4: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Pattern Summarization

• Using a small number of itemsets to best represent the entire collection of frequent itemsets
  – The Spanning Set Approach [Afrati-Gionis-Mannila, KDD04]
  – Exact Description = Maximal Frequent Itemsets
• Our problem:
  – Can we find a concise representation that allows both exact and approximate summarization of a collection of frequent itemsets?

Page 5: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Basic Idea

{A,B,G,H}, {A,B,I,J}, {A,B,K,L}
{C,D,G,H}, {C,D,I,J}, {C,D,K,L}
{E,F,G,H}, {E,F,I,J}, {E,F,K,L}

9 itemsets, 36 items.

Cartesian Product: {{A,B},{C,D},{E,F}} × {{G,H},{I,J},{K,L}}

1 biclique, 6 itemsets, 12 items.

[Figure labels: Picturing, Covering]
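The saving on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `cartesian_product` and `total_items` are made up here and are not from the talk.

```python
from itertools import product

def cartesian_product(xs, ys):
    """Union every itemset in xs with every itemset in ys."""
    return {x | y for x, y in product(xs, ys)}

def total_items(itemsets):
    """Total number of items written down across a collection of itemsets."""
    return sum(len(s) for s in itemsets)

# The two sides of the biclique from the slide.
X = [frozenset("AB"), frozenset("CD"), frozenset("EF")]
Y = [frozenset("GH"), frozenset("IJ"), frozenset("KL")]

expanded = cartesian_product(X, Y)

# Listing the itemsets explicitly: 9 itemsets, 36 items.
print(len(expanded), total_items(expanded))
# Storing the two sides of one biclique: 6 itemsets, 12 items.
print(len(X) + len(Y), total_items(X + Y))
```

The product is never materialized in the concise form; only the two sides are stored.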

Page 6: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Cartesian Covering

Non-frequent itemsets

Page 7: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Problem Formulation

• Cartesian product
  – e.g. $\{\{A,B\},\{E\}\} \times \{\{C,D\}\} = \{\{A,B,C,D\},\{E,C,D\}\}$
• Cost of a Cartesian product
  – e.g. 1 biclique, 3 itemsets, and 5 items
• Covering
  – e.g. $C(\{\{A,B\},\{E\}\} \times \{\{C,D\}\}) = 2^{\{A,B,C,D\}} \cup 2^{\{E,C,D\}}$
    $= \{\{A\},\{B\},\{C\},\{D\},\{A,B\},\{A,C\},\{A,D\},\{B,C\},\{B,D\},\{C,D\},\{A,B,C\},\{A,B,D\},\{A,C,D\},\{B,C,D\},\{A,B,C,D\}\}$
    $\cup\ \{\{E\},\{C\},\{D\},\{E,C\},\{E,D\},\{C,D\},\{E,C,D\}\}$

How can we use Cartesian products to concisely represent a collection of frequent itemsets?

Page 8: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Exact and Approximate Covering

Exact Representation

$\{\{G\}\} \times \{\{A,B\}\} \ \cup\ \{\emptyset\} \times \{\{C,D\}\}$

Cost: 2 bicliques, 4 itemsets, 6 items

False positive: none

Approximate Representation

$\{\{G\}\} \times \{\{A,B\},\{C,D\}\}$

Cost: 1 biclique, 3 itemsets, 5 items

False positive: {G,C},{G,D},{G,C,D}
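A small Python sketch can reproduce the false positives listed on this slide. It rests on assumptions: the frequent itemsets are taken to be the non-empty subsets of {G,A,B} and {C,D}, the exact representation is read as {G} paired with {A,B} plus the empty itemset paired with {C,D}, and the helper names (`powerset`, `cover`, `cost`) are hypothetical.

```python
from itertools import combinations, product

def powerset(itemset):
    """All non-empty subsets of an itemset."""
    items = sorted(itemset)
    return {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)}

def cover(bicliques):
    """Itemsets covered by a list of (X, Y) Cartesian products."""
    covered = set()
    for xs, ys in bicliques:
        for x, y in product(xs, ys):
            covered |= powerset(x | y)
    return covered

def cost(bicliques):
    """(number of bicliques, number of itemsets stored on both sides)."""
    return len(bicliques), sum(len(xs) + len(ys) for xs, ys in bicliques)

fs = frozenset
# Assumed collection of frequent itemsets for this example.
frequent = powerset("GAB") | powerset("CD")

approx = [([fs("G")], [fs("AB"), fs("CD")])]
exact = [([fs("G")], [fs("AB")]), ([fs()], [fs("CD")])]

false_pos_approx = cover(approx) - frequent
false_pos_exact = cover(exact) - frequent

print(cost(approx), sorted("".join(sorted(s)) for s in false_pos_approx))
print(cost(exact), sorted("".join(sorted(s)) for s in false_pos_exact))
```

The single-biclique representation is cheaper but covers three itemsets that are not frequent, while the two-biclique representation covers exactly the frequent itemsets.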

Page 9: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Covering Maximal Frequent Itemsets

{{ABC}, {CDE}}

{{GHI}, {JKL}}

{{MNO}, {PQR}}

{{STU}, {VWX}}

ABCSTU

ABCGHI

CDESTU

CDEGHI

CDEVWX

CDEJKL

MNOVWX

MNOGHI

PQRJKL

Page 10: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Problem Reformulation

Given Maximal Frequent Itemsets: {{ABCI}, {ABDI}, {CEFJ}, {DEFJ}, {DGHK}, {EGHK}}

Frequent Itemsets

Exact representation

Approximate representation

C1 C2 C1 C2

Page 11: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Minimal Biclique Set Cover Problem

Ground Set: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

1, 2,3,4,6,7,8,9

5,10,11

Page 12: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

NP-hardness

• By reducing the Minimal Biclique Set Cover problem to our problem, we can prove that both Problem 1 (exact) and Problem 2 (approximate) are NP-hard.

• Minimal Biclique Set Cover is a variant of the classical Set Cover problem

Can we use the standard set-cover greedy algorithm?

Page 13: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Naïve greedy algorithm

• Greedy algorithm:
  – Each time, choose the biclique $C_i = X_i \times Y_i$ with the lowest price $\frac{|X_i| + |Y_i|}{|S_i|}$, where $S_i$ is the set of still-uncovered elements that $C_i$ covers.
  – $|X_i| + |Y_i|$ is the cost.
  – This method has a logarithmic approximation bound.
• The problem?
  – The number of candidate bicliques is $2^{|X|+|Y|}$!
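The greedy rule on this slide is the standard set-cover greedy applied to bicliques. The sketch below assumes the candidates are already given explicitly as triples (X side, Y side, covered elements) and that the price is cost divided by the number of newly covered elements. The vertex names and the three candidates are invented for illustration; only the ground set 1..11 comes from the slides.

```python
from fractions import Fraction

def greedy_biclique_cover(candidates, ground):
    """Repeatedly pick the candidate with the lowest price.

    candidates: list of (xs, ys, covered) triples, where covered is the set
    of ground elements the biclique xs x ys covers.
    """
    uncovered = set(ground)
    chosen = []
    while uncovered:
        best, best_price = None, None
        for xs, ys, covered in candidates:
            gain = len(covered & uncovered)
            if gain == 0:
                continue  # covers nothing new, so it has no finite price
            price = Fraction(len(xs) + len(ys), gain)
            if best_price is None or price < best_price:
                best, best_price = (xs, ys, covered), price
        if best is None:
            raise ValueError("candidates cannot cover the ground set")
        chosen.append((best[0], best[1], best_price))
        uncovered -= best[2]
    return chosen

ground = range(1, 12)
candidates = [
    ({"x1", "x2"}, {"y1", "y2"}, {1, 2, 3, 4, 6, 7, 8, 9}),
    ({"x3"}, {"y3"}, {5, 10, 11}),
    ({"x1"}, {"y3"}, {1, 5}),
]
solution = greedy_biclique_cover(candidates, ground)
for xs, ys, price in solution:
    print(sorted(xs), sorted(ys), price)
```

With every subset of X and of Y allowed as a side, an explicit candidate list like this one would have on the order of 2^(|X|+|Y|) entries, which is why the candidates must be pruned before this loop is practical.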

Page 14: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Candidate Reduction

• Assuming one side of the biclique candidate is known, how do we choose the other side?

Page 15: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Greedy Algorithm

Fixed!

Split and sort

Covering 4 Covering 3 Covering 3

Add 1st single Y-vertex Biclique

Cost = 1;

Add 2nd single Y-vertex Biclique

Cost = 5/7;

Add 3rd single Y-vertex Biclique

Cost = 6/8 > 5/7

Cheapest sub-biclique!

Biclique Candidate
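The steps on this slide can be sketched in Python as follows. This is a reading of the slide, not the authors' code: it assumes the Y-vertex with the largest number of newly covered elements is added at each step (a max-k-cover style choice) and that the cheapest prefix seen is kept. The vertex names and coverage sets are invented so that the prices come out as 1, 5/7 and 6/8.

```python
from fractions import Fraction

def cheapest_sub_biclique(x_side, y_cover, uncovered):
    """Greedy search for a cheap sub-biclique when the X side is fixed.

    y_cover maps each Y-vertex y to the ground elements covered by the
    single-Y-vertex biclique x_side x {y}.
    """
    remaining = dict(y_cover)
    covered, picked = set(), []
    best_prefix, best_price = [], None
    while remaining:
        # Add the Y-vertex that covers the most still-uncovered elements.
        y = max(remaining,
                key=lambda v: len((remaining[v] & uncovered) - covered))
        gain = (remaining.pop(y) & uncovered) - covered
        if not gain:
            break  # nothing new can be covered any more
        picked.append(y)
        covered |= gain
        price = Fraction(len(x_side) + len(picked), len(covered))
        if best_price is None or price < best_price:
            best_prefix, best_price = list(picked), price
    return best_prefix, best_price

x_side = ["x1", "x2", "x3"]          # fixed side, |X| = 3
y_cover = {
    "y1": {1, 2, 3, 4},              # covering 4
    "y2": {5, 6, 7},                 # covering 3
    "y3": {4, 7, 8},                 # covering 3, but only one new element
}
best_y, best_price = cheapest_sub_biclique(x_side, y_cover, set(range(1, 9)))
print(best_y, best_price)
```

On this input the prices of the three prefixes are 4/4, 5/7 and 6/8, so the procedure keeps the first two Y-vertices.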

Page 16: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Approximation Bound of the Greedy Algorithm

The greedy SubBiclique procedure finds a sub-biclique whose price is at most $\frac{e}{e-1}$ times the price of the optimal (cheapest) sub-biclique!

Page 17: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Further Reduction

• Using only Idea 1, the time complexity is still exponential: $2^{|X|+|Y|}$ is reduced only to $2^{|X|}$.
• How to reduce this further?
  – Are all the combinations equally important?
  – No, because some are more likely to connect to the Y side.
  – Our solution: frequent itemset mining!

Page 18: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Using Frequent Itemset Mining

Page 19: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Overall Algorithm

• Step 1: Use the Frequent Itemset Mining tool to find all the (one-side maximal) biclique candidates;

• Step 2: Calculate the cheapest sub-biclique for each candidate using the greedy procedure;

• Step 3: Compare all the sub-bicliques, choose the cheapest one;

• Step 4: If the maximal frequent itemsets (MFI) are completely covered, stop; otherwise go to Step 2.

Page 20: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Approximation Bound

Our algorithm has an approximation ratio of $\frac{e}{e-1}(\ln n + 1)$ with respect to the candidate set (all sub-bicliques with one side coming from frequent itemset mining).

Page 21: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Speed-up techniques (1)

• Using closed itemsets for X and Y
  – Initially, X and Y each contain all the frequent itemsets (FI).
  – Using $X \times Y = \{X_1 Y_1, \ldots, X_1 Y_l, \ldots, X_m Y_1, \ldots, X_m Y_l\}$ to cover the MFI is similar to factorizing the MFI;
  – The MFI's maximal factor itemsets are closed itemsets, whose number is much smaller!

Page 22: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Speed-up techniques (2)

[Figure: graph linking frequent itemsets to their supporting transactions. Labels: A,B; C,D; E,F; A,G; K,L; G,H; B,G; A,I; G; I; J; M; A; B; G,I; M,G; B,I]

Sparse Graph: the number of frequent itemsets is small; valuable biclique candidates are not fully used!

Dense Graph: the number of frequent itemsets is large; handling those candidates is too slow!

TRADEOFF

Page 23: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Speed-up techniques (3)

• Iterative procedure
  – There is a large number of closed itemsets;
  – Covering the MFI in a single pass can produce a huge number of biclique candidates;
  – So the MFI is covered over several passes;
  – The support level is reduced gradually!

Page 24: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Experiments

• Data sets:

Page 25: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets
Page 26: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets
Page 27: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Conclusion

• We propose an interesting summarization problem that considers the interaction between frequent patterns

• We transform this problem into a generalized minimal biclique covering problem and design an approximation algorithm with a bound

• The experimental results demonstrate the effectiveness and efficiency of our approach

Page 28: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Thank you !!!

Page 29: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Reference

[Bayardo98] Roberto J. Bayardo Jr. Efficiently mining long patterns from databases. SIGMOD98.
[Pasquier99] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering frequent closed itemsets for association rules. ICDT99.
[Calder07] Toon Calders and Bart Goethals. Non-derivable itemset mining. Data Min. Knowl. Discov. 07.
[Han02] Jiawei Han, Jianyong Wang, Ying Lu, and Petre Tzvetkov. Mining top-k frequent closed patterns without minimum support. ICDM02.
[Xin06] Dong Xin, Hong Cheng, Xifeng Yan, and Jiawei Han. Extracting redundancy-aware top-k patterns. KDD06.
[Xin05] Dong Xin, Jiawei Han, Xifeng Yan, and Hong Cheng. Mining compressed frequent-pattern sets. VLDB05.
[Afrati04] Foto Afrati, Aristides Gionis, and Heikki Mannila. Approximating a collection of frequent sets. KDD04.
[Yan05] Xifeng Yan, Hong Cheng, Jiawei Han, and Dong Xin. Summarizing itemset patterns: a profile-based approach. KDD05.
[Wang06] Chao Wang and Srinivasan Parthasarathy. Summarizing itemset patterns using probabilistic models. KDD06.
[Jin08] Ruoming Jin, Muad Abu-Ata, Yang Xiang, and Ning Ruan. Effective and efficient itemset pattern summarization: regression-based approaches. KDD08.
[Xiang08] Yang Xiang, Ruoming Jin, David Fuhry, and Feodor F. Dragan. Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. KDD08.

Page 30: Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Related Work

• K-itemset approximation: [Afrati04].
  – Difference:
    • their work is a special case of our work;
    • their work is expensive for exact description;
    • our work uses set cover and max-k cover methods.
• Restoring the frequency of frequent itemsets: [Yan05, Wang06, Jin08].
• Hyperrectangle covering problem: [Xiang08].