Approximate Mining of Consensus Sequential Patterns
Transcript of Approximate Mining of Consensus Sequential Patterns
Hye-Chung (Monica) Kum
University of North Carolina at Chapel Hill, Computer Science Department
School of Social Work
http://www.cs.unc.edu/~kum/approxMAP
Approximate Mining of Consensus Sequential Patterns
Seq=<(A,B,D)(B)(C,D)>
Knowledge Discovery & Data mining (KDD)
"The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data"
The goal is to discover and present knowledge in a form that is easily comprehensible to humans in a timely manner,
combining ideas drawn from databases, machine learning, artificial intelligence, knowledge-based systems, information retrieval, statistics, pattern recognition, visualization, and parallel and distributed computing
Fayyad, Piatetsky-Shapiro, Smyth 1996
What is KDD ?
Purpose – Extract useful information
Source – Operational or Administrative Data
Examples
– VIC card database for buying patterns
– monthly welfare service patterns
Example
Analyze buying patterns for sales marketing
TID  Transaction
1    {Diapers, Hotdogs, Buns, Beer}
2    {Bread, Milk, Diapers, Wipes, Beer}
3    {Milk, Diapers, Beer, Water}
4    {Bread, Milk, Bananas, Cereal}
5    {Bread, Milk, Diapers, Beer}
6    {Steak, Corn, Coke, Beer}
7    {Milk, Orange Juice, Diapers, Baby Food}
8    {Bread, Milk, Diapers, Beer}
VIC card : 4/8 = 50%
VIC card : 5/8 = 63%
Overview
What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion
Sequential Pattern Mining
CID  TID  Transaction
1    1    {Diapers, Hotdogs, Buns, Beer}
1    2    {Bread, Milk, Diapers, Wipes, Beer}
1    3    {Milk, Diapers, Beer, Water}
2    4    {Bread, Milk, Bananas, Cereal}
2    6    {Steak, Corn, Coke, Beer}
3    5    {Bread, Milk, Diapers, Beer}
3    7    {Milk, Orange Juice, Diapers, Baby Food}
3    8    {Bread, Milk, Diapers, Beer}
C  Sequential Transaction
1  {Dp, HD, Buns, Br} {Bread, Mk, Dp, Wipes, Br} {Mk, Dp, Br, Wt}
2  {Bread, Mk, Bananas, Cereal} {Steak, Corn, Coke, Br}
3  {Bread, Mk, Dp, Br} {Mk, OJ, Dp, Baby Food} {Bread, Mk, Dp, Br}
Detecting patterns in sequences of sets
Welfare Program Participation Patterns
What are the common participation patterns?
What are the variations to them?
How do different policies affect these patterns?
Cid  Sequential Transaction
1    {W(elfare) M(edi) F(oodstamp)} {WMF} {WMF} {MF} {MF} {F}
2    {WMF} {WMF} {WMF} {WMF} {WMF} {M} {M}
3    {F} {F} {F} {WMF} {WMF} {WMF} {MF} {MF} {F}
Thesis Statement
The author of this dissertation asserts that multiple alignment is an effective model to uncover the underlying trend in sequences of sets.
I will show that approxMAP
– is a novel method to apply multiple alignment techniques to sequences of sets,
– will effectively extract the underlying trend in the data
– by organizing the large database into clusters
– as well as give reasonable descriptors (weighted sequences and consensus sequences) for the clusters via multiple alignment.
Furthermore, I will show that approxMAP
– is robust to its input parameters,
– is robust to noise and outliers in the data,
– is scalable with respect to the size of the database,
– and, in comparison to the conventional support model, can better recover the underlying pattern with little confounding information under most circumstances.
In addition, I will demonstrate the usefulness of approxMAP using real world data.
Thesis Statement
Multiple alignment is an effective model to uncover the underlying trend in sequences of sets.
ApproxMAP is a novel method to apply multiple alignment techniques to sequences of sets.
ApproxMAP can recover the underlying patterns with little confounding information under most circumstances including those in which the conventional methods fail.
I will demonstrate the usefulness of approxMAP using real world data.
Sequential Pattern Mining
Detecting patterns in sequences of sets
Sequence : seq1 = < (A,B,D)(B)(C,D)(B,C) >
Itemset : s13 = (C,D)
Items : I = {A, B, C, D}
• Nseq: Total # of sequences in the Database
• Lseq: Avg # of itemsets in a sequence
• Iseq : Avg # of items in an itemset
• Lseq * Iseq : Avg length of a sequence
Conventional Methods : Support Model
Super-sequence : (A,B,D)(B)(C,D)(B,C)
Sub-sequence : (A)(B)(C,D)
Support (P) : # of super-sequences of P in D
Given D and a user threshold min_sup,
– find the complete set of P s.t. Support(P) ≥ min_sup
Methods
– Breadth first – Apriori Principle (GSP)
  • R. Agrawal and R. Srikant : ICDE 95 & EDBT 96
– Depth first – pattern growth (PrefixSpan)
  • J. Han and J. Pei : SIGKDD 2000 & ICDE 2001
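The support model above reduces to a subset-wise subsequence test. Below is a minimal sketch of it on the grocery example; the function and variable names are my own, not from the slides.

```python
# A pattern P (a list of itemsets) is supported by a sequence S (also a list
# of itemsets) if each itemset of P is a subset of some itemset of S, in order.

def contains(seq, pat):
    """True if pat is a sub-sequence of seq (ordered, itemset-wise subset match)."""
    i = 0
    for itemset in seq:
        if i < len(pat) and pat[i] <= itemset:  # subset test
            i += 1
    return i == len(pat)

def support(db, pat):
    """# of sequences in db that are super-sequences of pat."""
    return sum(contains(seq, pat) for seq in db)

db = [
    [{"Dp", "HD", "Buns", "Br"}, {"Bread", "Mk", "Dp", "Wipes", "Br"}, {"Mk", "Dp", "Br", "Wt"}],
    [{"Bread", "Mk", "Bananas", "Cereal"}, {"Steak", "Corn", "Coke", "Br"}],
    [{"Bread", "Mk", "Dp", "Br"}, {"Mk", "OJ", "Dp", "Baby Food"}, {"Bread", "Mk", "Dp", "Br"}],
]
pat = [{"Dp", "Br"}, {"Mk", "Dp"}, {"Mk", "Dp", "Br"}]
print(support(db, pat))  # 2 -> the 2/3 = 67% support on the next slide
```

The greedy earliest-match scan in `contains` is sufficient here: matching each pattern itemset at the first possible position never rules out a later match.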
Example: Support Model
{Dp, Br} {Mk, Dp} {Mk, Dp, Br} : 2/3 = 67%
2^L - 1 = 2^7 - 1 = 127 subsequences
C  Sequential Transaction
1  {Dp, HD, Buns, Br} {Bread, Mk, Dp, Wipes, Br} {Mk, Dp, Br, Wt}
2  {Bread, Mk, Bananas, Cereal} {Steak, Corn, Coke, Br}
3  {Bread, Mk, Dp, Br} {Mk, OJ, Dp, Baby Food} {Bread, Mk, Dp, Br}
– {Dp, Br} {Mk, Dp} {Mk, Br}
– {Dp, Br} {Mk, Dp} {Mk, Dp}
– {Mk, Dp} {Mk, Dp, Br}
– {Dp, Br} {Mk, Dp, Br}
– … etc …
– {Br} {Mk, Dp} {Mk, Dp, Br}
– {Dp} {Mk, Dp} {Mk, Dp, Br}
– {Dp, Br} {Dp} {Mk, Dp, Br}
– {Dp, Br} {Mk} {Mk, Dp, Br}
– {Dp, Br} {Mk, Dp} {Dp, Br}
Inherent Problems : the model
Support
– cannot distinguish between statistically significant patterns and random occurrences
Theoretically
– Short random sequences occur often in long sequential data simply by chance
Empirically
– # of spurious patterns grows exponentially w.r.t. Lseq
Inherent Problems : exact match
A pattern gets support only if
– the pattern is exactly contained in the sequence
Often may not find general long patterns
Example
– many customers may share similar buying habits
– few of them follow exactly the same pattern
Inherent Problems : Complete set
Mines the complete set
– Too many trivial patterns
Given long sequences with noise
– too expensive and too many patterns
– 2^L - 1 = 2^10 - 1 = 1023
Finding max / closed sequential patterns
– is non-trivial
– In a noisy environment, still too many max/closed patterns
Possible Models
Support model
– Patterns in sets – unordered list
Multiple alignment model
– Find common patterns among strings
– Simple ordered list of characters
Multiple Alignment
Line up the sequences to detect the trend
– Find common patterns among strings
– DNA / bio sequences
P A T T T E R N
P A T E R M
P T T R N
O A T T E R B
P S Y Y R T N
Consensus : P A T T E R N
Edit Distance
Pairwise Score (edit distance) : dist(seq1, seq2)
– Minimum # of ops required to change seq1 to seq2
– Ops = INDEL(a) and/or REPLACE(a,b)
– Recurrence relation
Multiple Alignment Score
– ∑ PS(seqi, seqj) over all pairs of sequences (1 ≤ i < j ≤ N)
– Optimal alignment : minimum score
P A T T T E R N
P A T _ _ E R M
      INDEL INDEL   REPL
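The recurrence referred to above is the standard dynamic-programming edit distance. A minimal sketch, with each position an itemset so that the PATTERN strings become one-character sets; REPLACE here uses the normalized set difference introduced on the later op-cost slides, and INDEL costs 1. Names are my own.

```python
def repl(x, y):
    """Normalized set difference R(X,Y) = (|X-Y| + |Y-X|) / (|X| + |Y|)."""
    return (len(x - y) + len(y - x)) / (len(x) + len(y))

def dist(s1, s2):
    """Edit distance between two sequences of itemsets."""
    n, m = len(s1), len(s2)
    # D[i][j] = min cost to turn s1[:i] into s2[:j]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = float(i)                                # i INDELs
    for j in range(1, m + 1):
        D[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i-1][j-1] + repl(s1[i-1], s2[j-1]),  # REPLACE
                          D[i-1][j] + 1,                          # INDEL
                          D[i][j-1] + 1)
    return D[n][m]

# PATTTERN vs PATERM, one character per itemset:
s1 = [set(c) for c in "PATTTERN"]
s2 = [set(c) for c in "PATERM"]
print(dist(s1, s2))  # 3.0 : two INDELs (the extra T T) plus one REPLACE(N, M)
```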
Consensus Sequence
Weighted Sequence :
– compression of aligned sequences into one sequence
strength(i, j) = (# of occurrences of item i in position j) / (total # of sequences)
– A : 3/3 = 100%
– E : 1/3 = 33%
– H : 1/3 = 33%
Consensus itemset (j) : min_strength = 2
– { ia | ia ∈ I and strength(ia, j) ≥ min_strength }
Consensus sequence :
– concatenation of the consensus itemsets
seq1           (A)           ()       (B)                (DE)
seq2           (AE)          (H)      (BC)               (E)
seq3           (A)           ()       (BCG)              (D)
Weighted Seq   (A:3, E:1):3  (H:1):1  (B:3, C:2, G:1):3  (D:2, E:2):3   3
Consensus Seq  (A)                    (BC)               (DE)
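The compression and consensus steps can be sketched directly from the example above. This is my own minimal sketch (gaps in the alignment are represented as `None`; function names are mine):

```python
from collections import Counter

def weighted_sequence(aligned):
    """Per aligned position: (item counts, # of sequences with a non-gap itemset)."""
    ws = []
    for col in zip(*aligned):
        items, n = Counter(), 0
        for itemset in col:
            if itemset is not None:
                n += 1
                items.update(itemset)
        ws.append((items, n))
    return ws

def consensus(ws, min_strength):
    """Keep items occurring at least min_strength times; drop empty positions."""
    seq = []
    for items, _ in ws:
        itemset = {i for i, c in items.items() if c >= min_strength}
        if itemset:
            seq.append(itemset)
    return seq

aligned = [
    [{"A"},      None,  {"B"},           {"D", "E"}],  # seq1
    [{"A", "E"}, {"H"}, {"B", "C"},      {"E"}],       # seq2
    [{"A"},      None,  {"B", "C", "G"}, {"D"}],       # seq3
]
ws = weighted_sequence(aligned)
print([sorted(s) for s in consensus(ws, min_strength=2)])  # [['A'], ['B', 'C'], ['D', 'E']]
```

The second position, (H:1):1, falls below the min_strength cutoff of 2 and so contributes nothing to the consensus, matching the slide.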
Multiple Alignment
Sequential Pattern Mining
Given
– N sequences of sets,
– Op costs (INDEL & REPLACE) for itemsets, and
– Strength thresholds for consensus sequences
To
(1) partition the N sequences into K sets of sequences such that the sum of the K multiple alignment scores is minimum,
(2) find the multiple alignment for each partition, and
(3) find the pattern consensus sequence and the variation consensus sequence for each partition
Overview
What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion
ApproxMAP
(Approximate Multiple Alignment Pattern mining)
Exact solution : Too expensive!
Approximation Method : ApproxMAP
– Organize into K partitions
  • Use clustering
– Compress each partition into
  • weighted sequences
– Summarize each partition into
  • Pattern consensus sequence
  • Variation consensus sequence
Tasks
Op costs (INDEL & REPLACE) for itemsets
Organize into K partitions
– Use clustering
Compress each partition into
– weighted sequences
Summarize each partition into
– Pattern consensus sequence
– Variation consensus sequence
Op costs for itemsets
Normalized set difference
– R(X,Y) = (|X-Y| + |Y-X|) / (|X| + |Y|)
– 0 ≤ R ≤ 1, metric
– INDEL(X) = R(X, ∅) = 1
Jaccard coefficient
– 1 - |X∩Y| / |X∪Y|
– = 1 - |X∩Y| / (|X-Y| + |Y-X| + |X∩Y|)
Sørensen coefficient : simple index
– Gives greater "weight" to common elements
– 1 - 2|X∩Y| / (|X-Y| + |Y-X| + 2|X∩Y|)
– = (|X| + |Y| - 2|X∩Y|) / (|X| + |Y|) = R(X,Y)
REPLACE cost examples
(a) → (a)    0
(a) → (ab)   1/3
(ab) → (ac)  1/2
(a) → (b)    1
(ab) → (cd)  1
(a) → ()     1
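A quick numeric check of the slide's claim that the Sørensen dissimilarity equals the normalized set difference R(X,Y), run over the REPLACE cost table above. This is my own sketch; the names are mine.

```python
def R(x, y):
    """Normalized set difference R(X,Y) = (|X-Y| + |Y-X|) / (|X| + |Y|)."""
    return (len(x - y) + len(y - x)) / (len(x) + len(y))

def sorensen(x, y):
    """Sorensen dissimilarity 1 - 2|X∩Y| / (|X| + |Y|)."""
    return 1 - 2 * len(x & y) / (len(x) + len(y))

pairs = [({"a"}, {"a"}), ({"a"}, {"a", "b"}), ({"a", "b"}, {"a", "c"}),
         ({"a"}, {"b"}), ({"a", "b"}, {"c", "d"})]
for x, y in pairs:
    # the two definitions agree on every pair (up to float rounding)
    assert abs(R(x, y) - sorensen(x, y)) < 1e-12
print([R(x, y) for x, y in pairs])  # [0.0, 0.333..., 0.5, 1.0, 1.0]
```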
Tasks
Op costs (INDEL & REPLACE) for itemsets
Organize into K partitions
– Use clustering
Compress each partition into
– weighted sequences
Summarize each partition into
– Pattern consensus sequence
– Variation consensus sequence
Organize : Partition into K sets
Goal :
– To minimize the sum of the K multiple alignment scores
– Group similar sequences
Approximate
– Calculate N*N proximity matrix
  • Pairwise score : edit distance
– Any clustering that works best for your data
Organize : Clustering
Desirable Properties
– Form groups of arbitrary shape and size
– Can estimate the number of clusters from the data
Density Based Clustering
k-nearest neighbor : Partition based at the valleys of the density estimate
Density of sequence = n / (|D| * d) ∝ n / d
– n & d : Based on user defined k nearest neighbor space
– n : # of neighbors
– d : size of neighbor region
Parameter k : Neighbor space
– Can cluster at different resolutions as desired
General : Uniform kernel k-NN clustering
– Efficient = O(kN)
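The slides only outline this step (density proportional to n/d over a user-defined k-nearest-neighbor region, with clusters separated at density valleys). The sketch below is one plausible uniform-kernel k-NN realization over a precomputed proximity matrix, not necessarily the dissertation's exact procedure; all names are mine.

```python
def knn_density_cluster(dist, k):
    """Cluster indices 0..N-1 given a full symmetric distance matrix (list of lists)."""
    N = len(dist)
    # for each point: its neighbors sorted by distance (the point itself comes first)
    order = [sorted(range(N), key=lambda j: dist[i][j]) for i in range(N)]
    d = [dist[i][order[i][k]] for i in range(N)]                      # distance to k-th neighbor
    n = [sum(dist[i][j] <= d[i] for j in range(N)) - 1 for i in range(N)]
    density = [n[i] / d[i] if d[i] > 0 else float("inf") for i in range(N)]
    parent = list(range(N))
    for i in range(N):
        nbrs = order[i][1:k + 1]
        best = max(nbrs, key=lambda j: density[j])
        if density[best] > density[i]:        # climb toward a denser neighbor
            parent[i] = best
    def root(i):                              # follow links up to a density peak
        while parent[i] != i:
            i = parent[i]
        return i
    return [root(i) for i in range(N)]        # cluster label = index of the peak

# Toy proximity matrix: six 1-D points in two well-separated groups.
pts = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(a - b) for b in pts] for a in pts]
labels = knn_density_cluster(dist, k=2)
print(labels)  # two clusters: {0, 1, 2} and {3, 4, 5}
```

Each sequence attaches to its densest neighbor, so clusters grow around local density peaks and split at the valleys; the work per point is bounded by its k neighbors, in line with the O(kN) claim above (given the sorted neighbor lists).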
Tasks
Op costs (INDEL & REPLACE) for itemsets
Organize into K partitions
– Use clustering
Compress each partition into
– weighted sequences
Summarize each partition into
– Pattern consensus sequence
– Variation consensus sequence
Data Compression : Multiple Alignment
Optimal multiple alignment : too expensive!
Greedy approximation
– Incrementally align, in density-descending order
– Pairwise alignment
  • Sequence to Weighted sequence
ID    Lexically sorted sequence cluster
seq3  (A) (B) (DE)
seq4  (A) (BCG) (D)
seq2  (AE) (H) (B) (D)
seq1  (AG) (F) (BC) (AE) (H)
seq5  (BCI) (DE)
Multiple Alignment
seq3  (A) (B) (DE)
seq2  (AE) (H) (B) (D)
WS1   (A:2,E:1):2 (H:1):1 (B:2):2 (D:2,E:1):2   2
seq4  (A) (BCG) (D)
WS2   (A:3,E:1):3 (H:1):1 (B:3,C:1,G:1):3 (D:3,E:1):3   3
seq5  (BCI) (DE)
WS3   (A:3,E:1):3 (H:1):1 (B:4,C:2,G:1,I:1):4 (D:4,E:2):4   4
seq1  (AG) (F) (BC) (AE) (H)
WS4   (A:4,E:1,G:1):4 (H:1,F:1):2 (B:5,C:3,G:1,I:1):5 (A:1,D:4,E:3):5 (H:1):1   5
Op Cost for Itemset to weighted itemset : Rw
Replace( (A:3,E:1):3 – 4 , (AG) ) = ?
seq3  (A)    R = 1/3
seq2  (AE)   R = 1/2
seq4  (A)    R = 1/3     Avg over non-null = 7/18 = 35/90
seq5  ()     INDEL = 1   Tot Avg = 65/120
seq1  (AG)
WS3   (A:3,E:1):3 – 4
R'w(Xw,Y) = [ weight(X) + |Y|*wX - 2*weight(X∩Y) ] / [ weight(X) + |Y|*wX ]
R'w = [ (4 + 2*3) - 2*3 ] / (4 + 2*3) = 2/5 = 36/90
Rw(Xw,Y) = [ R'w(Xw,Y)*wX + 1*(n - wX) ] / n
Rw = [ (2/5)*3 + 1 ] / 4 = 11/20 = 66/120
Op Cost for Itemset to weighted itemset : Rw
Op cost
– R'w(Xw,Y) = [ weight(X) + |Y|*wX - 2*weight(X∩Y) ] / [ weight(X) + |Y|*wX ]
– Rw(Xw,Y) = [ R'w*wX + n - wX ] / n
– 0 ≤ Rw ≤ 1, metric
– INDEL(Xw) = Rw(Xw, ∅) = INDEL(Y) = Rw(∅, Y) = 1
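The Rw formulas can be checked against the worked example, Replace((A:3,E:1):3–4, (AG)) = 11/20, slightly above the true average cost 65/120 = 13/24. A minimal sketch, with names of my own choosing:

```python
from fractions import Fraction as F

def rw(weights, wx, y, n):
    """Replace cost between a weighted itemset Xw and a plain itemset Y.
    weights: item -> count in Xw; wx: # of sequences with a non-null itemset
    at this position; n: total # of sequences in the weighted sequence."""
    weight_x = sum(weights.values())
    weight_xy = sum(weights.get(i, 0) for i in y)        # weight(X ∩ Y)
    rpw = F(weight_x + len(y) * wx - 2 * weight_xy,
            weight_x + len(y) * wx)                      # R'w(Xw, Y)
    return (rpw * wx + (n - wx)) / n                     # Rw(Xw, Y)

cost = rw({"A": 3, "E": 1}, wx=3, y={"A", "G"}, n=4)
print(cost)  # 11/20 = 66/120, vs. the true average cost 65/120
```

The `(n - wx)` term charges a full INDEL for each sequence with a null itemset at this position (seq5 in the example), which is what keeps Rw close to the true average.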
Tasks
Op costs (INDEL & REPLACE) for itemsets
Organize into K partitions
– Use clustering
Compress each partition into
– weighted sequences
Summarize each partition into
– Pattern consensus sequence
– Variation consensus sequence
Summarize: Generate and Present results
N sequences → K weighted sequences
Weighted sequence : huge
– compression of all sequences
ID    Aligned Sequence cluster
seq3  (A) (B) (DE)
seq2  (AE) (H) (B) (D)
seq4  (A) (BCG) (D)
seq5  (BCI) (DE)
seq1  (AG) (F) (BC) (AE) (H)
WS4   (A:4,E:1,G:1):4 (H:1,F:1):2 (B:5,C:3,G:1,I:1):5 (A:1,D:4,E:3):5 (H:1):1   5
< (E:1, L:1, R:1, T:1, V:1, d:1) (A:1, B:9, C:8, D:8, E:12, F:1, L:4, P:1, S:1, T:8, V:5, X:1, a:1, d:10, e:2, f:1, g:1, p:1) (B:99, C:96, D:91, E:24, F:2, G:1, L:15, P:7, R:2, S:8, T:95, V:15, X:2, Y:1, a:2, d:26, e:3, g:6, l:1, m:1) (A:5, B:16, C:5, D:3, E:13, F:1, H:2, L:7, P:1, R:2, S:7, T:6, V:7, Y:3, d:3, g:1) (A:13, B:126, C:27, D:1, E:32, G:5, H:3, J:1, L:1, R:1, S:32, T:21, V:1, W:3, X:2, Y:8, d:13, e:1, f:8, i:2, p:7, l:3, g:1) (A:12, B:6, C:28, D:1, E:28, G:5, H:2, J:6, L:2, S:137, T:10, V:2, W:6, X:8, Y:124, a:1, d:6, g:2, i:1, l:1, m:2) (A:135, B:2, C:23, E:36, G:12, H:124, K:1, L:4, O:2, R:2, S:27, T:6, V:6, W:10, X:3, Y:8, Z:2, a:1, d:6, g:1, h:2, j:1, k:5, l:3, m:7, n:1) (A:11, B:1, C:5, E:12, G:3, H:10, L:7, O:4, S:5, T:1, V:7, W:3, X:2, Y:3, a:1, m:2) (A:31, C:15, E:10, G:15, H:25, K:1, L:7, M:1, O:1, R:4, S:12, T:10, V:6, W:3, Y:3, Z:3, d:7, h:3, j:2, l:1, n:1, p:1, q:1) (A:3, C:5, E:4, G:7, H:1, K:1, R:1, T:1, W:2, Z:2, a:1, d:1, h:1, n:1) (A:20, C:27, E:13, G:35, H:7, K:7, L:111, N:2, O:1, Q:3, R:11, S:10, T:20, V:111, W:2, X:2, Y:3, Z:8, a:1, b:1, d:13, h:9, j:1, n:1, o:2) (A:17, B:2, C:14, E:17, F:1, G:31, H:8, K:13, L:2, M:2, N:1, R:22, S:2, T:140, U:1, V:2, W:2, X:1, Z:13, a:1, b:8, d:6, h:14, n:6, p:1, q:1) (A:12, B:7, C:5, E:13, G:16, H:5, K:106, L:8, N:2, O:1, R:32, S:3, T:29, V:9, X:2, Z:9, b:16, c:5, d:5, h:7, l:1) (A:7, B:1, C:9, E:5, G:7, H:3, K:7, R:8, S:1, T:10, X:1, Z:3, a:2, b:3, c:1, d:5, h:3) (A:1, B:1, H:1, R:1, T:1, b:2, c:1) (A:3, B:2, C:2, E:6, F:2, G:4, H:2, K:20, M:2, N:3, R:19, S:3, T:11, U:2, X:4, Z:34, a:3, b:11, c:2, d:4) (H:1, Y:1, a:1, d:1) > : 162
Presentation model
Frequent items :
– definite pattern items
– Cutoff : 50%
Common items :
– uncertain
– Cutoff : 20%
Rare items :
– Noise items
[Figure: item strength scale from w=100% down to w=0%, banded into Frequent (≥50%), Common (≥20%), and Rare items]
Visualization
Pattern Consensus Sequence :
– Cutoff : Minimum cluster strength (50%)
– Frequent items
Variation Consensus Sequence :
– Cutoff : Minimum cluster strength (20%)
– Frequent items + common items
Color scheme : 100% : 85% : 70% : 50% : 35% : 20%
(B:61%, C:59%, D:56%, T:59%) (B:78%) (S:85%, Y:77%) (A:83%, E:22%, H:77%) (G:22%, L:69%, V:69%) (T:86%) (K:65%) (B:21%) : 162
Pattern   (B C D T) (B) (S Y) (A H) (L V) (T) (K) : 162
Variation (B C D T) (B) (S Y) (A E H) (G L V) (T) (K) (Z) : 162
Example: Given 10 seqs lexically sorted
ID     Full Sequence Database (lexically sorted)  Cluster  Lseq  Len
seq1   (A) (B, C, Y) (D)                          1        3     5
seq2   (A) (X) (B, C) (A, E) (Z)                  1        5     7
seq3   (A, I) (Z) (K) (L, M)                      2        4     6
seq4   (A, L) (D, E)                              1        2     4
seq5   (I, J) (B) (K) (L)                         2        4     5
seq6   (I, J) (L, M)                              2        2     5
seq7   (I, J) (K) (J, K) (L) (M)                  2        5     7
seq8   (I, M) (K) (K, M) (L, M)                   2        4     7
seq9   (J) (K) (L, M)                             2        3     4
seq10  (V) (K, W) (Z)                             2        3     4
Color scheme <100: 85: 70: 50: 35: 20>
Cluster 1 (cluster strength = 40% = 2 sequences)
seq1  (A) (B, C, Y) (D)
seq4  (A, L) (D, E)
seq2  (A) (X) (B, C) (A, E) (Z)
Weighted Seq   (A:3, L:1):3 (X:1):1 (B:2, C:2, Y:1):2 (A:1, D:2, E:2):3 (Z:1):1   3
Consensus Pat  (A) (B, C) (D, E)
Cluster 2 (cluster strength = 40% = 3 sequences)
seq9   (J) (K) (L, M)
seq5   (I, J) (B) (K) (L)
seq3   (A, I) (Z) (K) (L, M)
seq7   (I, J) (K) (J, K) (L) (M)
seq8   (I, M) (K) (K, M) (L, M)
seq6   (I, J) (L, M)
seq10  (V) (K, W) (Z)
Weighted Seq   (A:1, I:5, J:4, M:1):6 (B:1, K:2, V:1, Z:1):5 (J:1, K:6, M:1, W:1):6 (L:6, M:4):6 (M:1, Z:1):2   7
Consensus Pat (w≥3)  (I, J) (K) (L, M)
Consensus Var (w≥2)  (I, J) (K) (K) (L, M)
Example: Support Model (20% = 2 seqs)
id  pattern    sup   id  pattern      sup   id  pattern          sup
1   (A)        4     17  (A) (D)      2     33  (I,J) (K)        2
2   (B)        3     18  (A) (E)      2     34  (I,J) (L)        3
3   (C)        2     19  (A) (Z)      2     35  (I,J) (M)        2
4   (D)        2     20  (A) (B,C)    2     36  (I) (K) (K)      2
5   (E)        2     21  (I) (K)      4     37  (I) (K) (L)      2
6   (I)        5     22  (I) (L)      5     38  (I) (K) (M)      2
7   (J)        4     23  (I) (M)      4     39  (I) (K) (L,M)    2
8   (K)        6     24  (I) (L,M)    3     40  (J) (K) (L)      2
9   (L)        7     25  (J) (K)      3     41  (J) (K) (M)      2
10  (M)        5     26  (J) (L)      4     42  (K) (K) (L)      2
11  (Z)        3     27  (J) (M)      3     43  (K) (K) (M)      2
12  (B,C)      2     28  (J) (L,M)    2     44  (I,J) (K) (L)    2
13  (I,J)      2     29  (K) (K)      2     45  (I) (K) (K) (L)  2
14  (L,M)      2     30  (K) (L)      5     46  (I) (K) (K) (M)  2
15  (A) (B)    2     31  (K) (M)      4
16  (A) (C)    2     32  (K) (L,M)    3
Consensus patterns : (A) (B,C) (D,E) and (I,J) (K) (L,M)
ApproxMAP
(Approximate Multiple Alignment Pattern mining)
Approximation Method : ApproxMAP
– Organize into K partitions = O(Nseq² Lseq² Iseq)
  • Proximity matrix = O(Nseq² Lseq² Iseq)
  • Clustering = O(k Nseq)
– Compress each partition = O(n L²)
  • Weighted sequences = O(n L²)
– Summarize each partition = O(1)
  • Pattern consensus sequence
  • Variation consensus sequence
– Time complexity : O(Nseq² Lseq² Iseq)
  • 2 optimizations
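The proximity-matrix term dominates because every pair of sequences is compared with an O(Lseq² Iseq) dynamic program. A minimal sketch of such a normalized edit distance between sequences of itemsets, using the replace cost (|X|+|Y|-2|X∩Y|)/(|X|+|Y|) from the appendix slide (the exact indel costs and normalization in ApproxMAP may differ):

```python
def repl(x, y):
    """Normalized replace cost: 0 for identical itemsets, 1 for disjoint ones."""
    return (len(x) + len(y) - 2 * len(x & y)) / (len(x) + len(y))

INDEL = 1.0  # assumed cost of inserting or deleting one itemset

def seq_dist(s1, s2):
    """Edit-distance DP over two sequences of itemsets, normalized to [0, 1]."""
    n, m = len(s1), len(s2)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i * INDEL
    for j in range(m + 1):
        dp[0][j] = j * INDEL
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + repl(s1[i - 1], s2[j - 1]),
                           dp[i - 1][j] + INDEL,
                           dp[i][j - 1] + INDEL)
    return dp[n][m] / max(n, m)

d0 = seq_dist([{'A', 'B', 'D'}, {'B'}, {'C', 'D'}],
              [{'A', 'B', 'D'}, {'B'}, {'C', 'D'}])  # identical sequences
d1 = seq_dist([{'A'}, {'B'}], [{'X'}, {'Y'}, {'Z'}])  # nothing in common
```

Filling one Lseq x Lseq table per pair, for all Nseq² pairs, gives exactly the O(Nseq² Lseq² Iseq) proximity-matrix cost quoted above.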
Overview
What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion
Evaluation
Up to now: only performance / scalability. What about quality?
– What kind of patterns will the model generate?
– Evaluate the correctness of the model
Why?
– Basis for comparison of different models
– Essential in understanding the results of approximate solutions
Evaluation Method
Given
– a set of base patterns B : E(F_B) & E(L_B)
– a set of result patterns P
How?
– Map each Pi to the best Bj
  • based on the Longest Common Subsequence
  • over all Bj
  • max_{res pat} |LCS(B, P)|
Item level Confusion Matrix

                        Predicted (Result Patterns)
                        +                   -
Actual (Base Pat)   +   Pattern Items       Missed Items
                    -   Extraneous Items    N/A
Evaluation Criteria : Item level
Recoverability
– degree of base pattern items found in the results (weighted)
– R = sum over B of E(F_B) * [ max_{res pat P} |LCS(B, P)| / E(L_B) ]
– cut off so that 0 ≤ R ≤ 1
Precision
– degree of pattern items in the result patterns
– Precision = Pattern Items / (Pattern Items + Extraneous Items)
                        Predicted (Result Patterns)
                        +                   -
Actual (Base Pat)   +   Pattern Items       Missed Items
                    -   Extraneous Items    N/A
Evaluation Criteria : Sequence level
Spurious patterns
– result patterns with more Extraneous Items than Pattern Items
Max patterns : determine the max pattern for each Bj
– of all Pi mapped to a particular Bj
– the Pi with the Longest Common Subsequence
– max_{res pat P} |LCS(B, P)|
Redundant patterns
– all other patterns

Ntotal = Nmax + Nspur + Nredun
Evaluation Example

Base patterns:
30% : (A)(BC)(DE)
70% : (IJ)(K)(LM)

Result patterns:
(A)(B)(DE)   (A)(BC)(D)   (B)(BC)(DE)
(IJ)(LM)     (J)(K)(LM)   (IJ)(KX)(LM)
(XY)(K)(Z)

Ntotal = 7
Spurious = 1
Recoverability = (30%)*4/5 + (70%)*5/5 = 94%
Redundant = 4
Precision = 1 - 5/31 = 84%
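This worked example can be reproduced with a short script. It is an illustrative sketch: `lcs_items` scores an alignment by the total number of shared items, which is one reasonable reading of the |LCS(B, P)| notation above, and the max/redundant bookkeeping is simplified (one max pattern per matched base).

```python
from collections import defaultdict

def lcs_items(p, b):
    """DP over two sequences of itemsets; score = total number of shared items."""
    dp = [[0] * (len(b) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1] + len(p[i - 1] & b[j - 1]))
    return dp[-1][-1]

# Base patterns with their expected frequencies E(F_B)
bases = [(0.30, [{'A'}, {'B', 'C'}, {'D', 'E'}]),
         (0.70, [{'I', 'J'}, {'K'}, {'L', 'M'}])]

results = [[{'A'}, {'B'}, {'D', 'E'}], [{'A'}, {'B', 'C'}, {'D'}],
           [{'B'}, {'B', 'C'}, {'D', 'E'}], [{'I', 'J'}, {'L', 'M'}],
           [{'J'}, {'K'}, {'L', 'M'}], [{'I', 'J'}, {'K', 'X'}, {'L', 'M'}],
           [{'X', 'Y'}, {'K'}, {'Z'}]]

# Map each result pattern to the base it shares the most items with
mapping = []
for p in results:
    j = max(range(len(bases)), key=lambda jj: lcs_items(p, bases[jj][1]))
    shared = lcs_items(p, bases[j][1])
    total = sum(len(s) for s in p)
    mapping.append((j, shared, total))

# Spurious: more extraneous than pattern items
n_spur = sum(1 for j, shared, total in mapping if total - shared > shared)

# One max pattern per base that attracted at least one result pattern
best = defaultdict(int)
for j, shared, total in mapping:
    best[j] = max(best[j], shared)
n_max = len(best)
n_redun = len(results) - n_max - n_spur

recoverability = sum(f * max(lcs_items(p, b) for p in results) / sum(len(s) for s in b)
                     for f, b in bases)
precision = sum(m[1] for m in mapping) / sum(m[2] for m in mapping)
```

Running it recovers the slide's numbers: 1 spurious, 4 redundant, recoverability 94%, precision 26/31 ≈ 84%.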
Synthetic data
Patterned data : IBM synthetic data generator
– Given certain DB parameters, outputs
  • the sequence DB
  • the base patterns used to generate it : E(F_B) and E(L_B)
– R. Agrawal and R. Srikant : ICDE 95 & EDBT 96
Random data
– Independence both within and across itemsets
Patterned data + systematic noise
– Randomly change an item with probability 1-α
– Yang, SIGMOD 2002
Patterned data + systematic outliers
– random sequences added to the DB
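The noise model above can be sketched as follows. This is a sketch of the corruption step only, not the generator's actual code; `add_noise` and its parameters are illustrative:

```python
import random

def add_noise(db, alpha, items, seed=1):
    """Corrupt a sequence DB: each item is swapped for a random different
    item with probability 1 - alpha (alpha = 1 leaves the data untouched)."""
    rng = random.Random(seed)
    noisy = []
    for seq in db:
        new_seq = []
        for itemset in seq:
            new_set = set()
            for item in itemset:
                if rng.random() < 1 - alpha:
                    item = rng.choice([x for x in items if x != item])
                new_set.add(item)
            new_seq.append(new_set)
        noisy.append(new_seq)
    return noisy

db = [[{'A', 'B'}, {'C'}], [{'A'}, {'B', 'C'}]]
clean = add_noise(db, 1.0, items=['A', 'B', 'C', 'D'])  # alpha = 1: no corruption
```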
Overview
What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion
Results
ApproxMAP
– Pattern consensus sequences
– No null or one-itemset patterns
Machine : swan
– 2GHz Intel Xeon processor
– 2GB of memory
– Public machine
  • Difficult to get consistent running time measurements
  • Thanks !
Database Parameters

Notation  Meaning                             Value
||I||     # of items                          100
|| ||     # of potentially freq itemsets      500
Ipat      Avg. # of items per itemset in BP   2
Lpat      Avg. # of itemsets per base pat     7
Npat      # of base pattern sequences         10
Nseq      # of data sequences                 1000
Lseq      Avg. # of itemsets per data seq     10
Iseq      Avg. # of items per itemset in DB   2.5
BasePi (E(F_B):E(L_B)) ||P|| Pattern <100: 85: 70: 50: 35: 20>

BaseP1 (0.21:0.66) 14 <(15 16 17 66) (15) (58 99) (2 74) (31 76) (66) (62) (93)>
PatConSeq1 13 <(15 16 17 66) (15) (58 99) (2 74) (31 76) (66) (62)>
VarConSeq1 18 <(15 16 17 66) (15 22) (58 99) (2 74) (24 31 76) (24 66) (50 62) (93)>

BaseP2 (0.161:0.83) 22 <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51) (66) (2 22 58) (63 74 99)>
PatConSeq2 19 <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51) (66) (2 22 58)>
VarConSeq2 25 <(22 50 66) (16) (29 99) (22 58 94) (2 45 58 67) (12 28 36) (2 50) (24 96) (51) (66) (2 22 58)>
PatConSeq3 15 <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51)>
VarConSeq3 15 <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51)>

BaseP3 (0.141:0.82) 14 <(22) (22) (58) (2 16 24 63) (24 65 93) (6) (11 15 74)>
PatConSeq4 11 <(22) (22) (58) (2 16 24 63) (24 65 93) (6)>
VarConSeq4 13 <(22) (22) (22) (58) (2 16 24 63) (2 24 65 93) (6 50)>

BaseP4 (0.131:0.90) 15 <(31 76) (58 66) (16 22 30) (16) (50 62 66) (2 16 24 63)>
PatConSeq5 11 <(31 76) (58 66) (16 22 30) (16) (50 62 66)>
VarConSeq5 11 <(31 76) (58 66) (16 22 30) (16) (50 62 66) (16 24)>

BaseP5 (0.123:0.81) 14 <(43) (2 28 73) (96) (95) (2 74) (5) (2) (24 63) (20) (93)>
PatConSeq6 13 <(43) (2 28 73) (96) (95) (2 74) (5) (2) (24 63) (20)>
VarConSeq6 16 <(22 43) (2 28 73) (58 96) (95) (2 74) (5) (2 66) (24 63) (20)>

BaseP6 (0.121:0.77) 9 <(63) (16) (2 22) (24) (22 50 66) (50)>
PatConSeq7 8 <(63) (16) (2 22) (24) (22 50 66)>
VarConSeq7 9 <(63) (16) (2 22) (24) (22 50 66)>

BaseP7 (0.054:0.60) 13 <(70) (58 66) (22) (74) (22 41) (2 74) (31 76) (2 74)>
PatConSeq8 16 <(70) (58) (22 58 66) (22 58) (74) (22 41) (2 74) (31 76) (2 74)>
VarConSeq8 18 <(70) (58 66) (22 58 66) (22 58) (74) (22 41) (2 22 66 74) (31 76) (2 74)>
PatConSeq9 0 (cluster size was only 5 sequences, so no pattern consensus sequence was produced)
VarConSeq9 8 <(70) (58 66) (74) (74) (22 41) (74)>

BaseP8 (0.014:0.91) 17 <(20 22 23 96) (50) (51 63) (58) (16) (2 22) (50) (23 26 36) (10 74)>
BaseP9 (0.038:0.78) 7 <(88) (24 58 78) (22) (58) (96)>
BaseP10 (0.008:0.66) 17 <(16) (2 23 74 88) (24 63) (20 96) (91) (40 62) (15) (40) (29 40 99)>
IBM Synthetic Data Generator : 10 base patterns → D = 1000 sequences

Id:L_B   E(F_B):E(L_B)   Base Pattern
B1:14    0.21:0.66       [15,16,17,66] [15] [58,99] [2,74] [31,76] [66] [62] [93]
B2:22    0.161:0.83      [22,50,66] [16] [29,99] [94] [45,67] … [2,22,58] [63,74,99]
B3:14    0.141:0.82      [22] [22] [58] [2,16,24,63] [24,65,93] [6] [11,15,74]
…
B10:17   0.008:0.66      [16] [2,23,74,88] [24,63] [20,96] [91] [40,62] … [29,40,99]

ApproxMAP : D → cluster1 (162 seqs), cluster2, …, cluster9 → wseq1 (162 seqs), wseq2, …, wseq9
wseq1, the weighted sequence for cluster1 (162 seqs):
< (E:1, L:1, R:1, T:1, V:1, d:1) (A:1, B:9, C:8, D:8, E:12, F:1, L:4, P:1, S:1, T:8, V:5, X:1, a:1, d:10, e:2, f:1, g:1, p:1) (B:99, C:96, D:91, E:24, F:2, G:1, L:15, P:7, R:2, S:8, T:95, V:15, X:2, Y:1, a:2, d:26, e:3, g:6, l:1, m:1) (A:5, B:16, C:5, D:3, E:13, F:1, H:2, L:7, P:1, R:2, S:7, T:6, V:7, Y:3, d:3, g:1) (A:13, B:126, C:27, D:1, E:32, G:5, H:3, J:1, L:1, R:1, S:32, T:21, V:1, W:3, X:2, Y:8, d:13, e:1, f:8, i:2, p:7, l:3, g:1) (A:12, B:6, C:28, D:1, E:28, G:5, H:2, J:6, L:2, S:137, T:10, V:2, W:6, X:8, Y:124, a:1, d:6, g:2, i:1, l:1, m:2) (A:135, B:2, C:23, E:36, G:12, H:124, K:1, L:4, O:2, R:2, S:27, T:6, V:6, W:10, X:3, Y:8, Z:2, a:1, d:6, g:1, h:2, j:1, k:5, l:3, m:7, n:1) (A:11, B:1, C:5, E:12, G:3, H:10, L:7, O:4, S:5, T:1, V:7, W:3, X:2, Y:3, a:1, m:2) (A:31, C:15, E:10, G:15, H:25, K:1, L:7, M:1, O:1, R:4, S:12, T:10, V:6, W:3, Y:3, Z:3, d:7, h:3, j:2, l:1, n:1, p:1, q:1) (A:3, C:5, E:4, G:7, H:1, K:1, R:1, T:1, W:2, Z:2, a:1, d:1, h:1, n:1) (A:20, C:27, E:13, G:35, H:7, K:7, L:111, N:2, O:1, Q:3, R:11, S:10, T:20, V:111, W:2, X:2, Y:3, Z:8, a:1, b:1, d:13, h:9, j:1, n:1, o:2) (A:17, B:2, C:14, E:17, F:1, G:31, H:8, K:13, L:2, M:2, N:1, R:22, S:2, T:140, U:1, V:2, W:2, X:1, Z:13, a:1, b:8, d:6, h:14, n:6, p:1, q:1) (A:12, B:7, C:5, E:13, G:16, H:5, K:106, L:8, N:2, O:1, R:32, S:3, T:29, V:9, X:2, Z:9, b:16, c:5, d:5, h:7, l:1) (A:7, B:1, C:9, E:5, G:7, H:3, K:7, R:8, S:1, T:10, X:1, Z:3, a:2, b:3, c:1, d:5, h:3) (A:1, B:1, H:1, R:1, T:1, b:2, c:1) (A:3, B:2, C:2, E:6, F:2, G:4, H:2, K:20, M:2, N:3, R:19, S:3, T:11, U:2, X:4, Z:34, a:3, b:11, c:2, d:4) (H:1, Y:1, a:1, d:1) > : 162
ApproxMAP : wseq1 (162 seqs), …, wseq9 → PatConSeq1/VarConSeq1, …, PatConSeq9/VarConSeq9
Color scheme : 100%: 85%: 70%: 50%: 35%: 20%

Weighted consensus itemsets (wseq1):
(B:61%, C:59%, D:56%, T:59%) (B:78%) (S:85%, Y:77%) (A:83%, E:22%, H:77%) (G:22%, L:69%, V:69%) (T:86%) (K:65%) (B:21%) : 162

Pattern   (B C D T) (B) (S Y) (A H) (L V) (T) (K) : 162
Variation (B C D T) (B) (S Y) (A E H) (G L V) (T) (K) (Z) : 162
Evaluation
8 patterns returned
– 7 max patterns
– 1 redundant pattern
– 0 spurious patterns
– 1 null pattern
Recoverability : 91.16%
Precision : 97.17% (Extraneous Items : 3/106)
Comparative Study
Conventional sequential pattern mining
– Support model
Empirical analysis
– Totally random data
– Patterned data
– Patterned data + noise
– Patterned data + outliers
Robustness w.r.t. noise
[Chart: Recoverability (%) vs. noise level (1-alpha), from 0% to 50%, comparing Support (min_sup=5%) and Multiple Alignment (k=6, theta=50%)]
Evaluation : Comparison

                    ApproxMAP                      Support Model
Random data         No patterns                    Numerous spurious patterns

Patterned data      k=6 & MinStrgh=30%             MinSup=5%
(10 patterns        Recoverability : 91.16%        Recoverability : 91.59%
embedded into       Precision : 97.17%             Precision : 96.29%
1000 seqs)          Extraneous Items : 3/106       Extraneous : 66,058/1,782,583
                    8 patterns returned            253,782 patterns returned
                    1 redundant pattern            253,714 redundant patterns
                    0 spurious patterns            58 spurious patterns

Noise               Robust                         Not robust; recoverability degrades fast

Outliers            Robust                         Somewhat robust
Understanding ApproxMAP
5 experiments
– k in kNN clustering
– strength cutoff
– order of alignment
– Optimization 1 : reduced precision in the proximity matrix
– Optimization 2 : sample-based iterative clustering
A realistic DB
Notation  Meaning                             Value
||I||     # of items                          1,000
|| ||     # of potentially freq itemsets      5,000
Ipat      Avg. # of items per itemset in BP   2
Lpat      Avg. # of itemsets per base pat     14 = 0.7*Lseq
Npat      # of base pattern sequences         100
Nseq      # of data sequences                 10,000
Lseq      Avg. # of itemsets per data seq     20
Iseq      Avg. # of items per itemset in DB   2.5
Input parameters
[Chart: Evaluation criteria (%) - Recoverability and Precision - vs. theta, the strength threshold (0-100%)]
[Chart: Evaluation criteria (%) - Recoverability and Precision - vs. k for kNN clustering (2-10)]
The order in multiple alignment
Understanding ApproxMAP
Optimization 1 : Lseq
– Running time reduced to 40%
Optimization 2 : Nseq
– Running time reduced to 10%-40%
– For negligible reduction in recoverability
Effects of the DB param & scalability
4 experiments
– ||I|| : # of unique items in the database
  • density of the database
  • 1,000 - 10,000
– Nseq : # of sequences in the data
  • 10,000 - 100,000
– Lseq : Avg. # of itemsets per data seq
  • 10 - 50
– Iseq : Avg. # of items per itemset in DB
  • 2.5 - 10
[Chart: Running time (sec) vs. |I|, the # of unique items in D (0-10,000)]
[Chart: Running time (sec) vs. Nseq (0-100,000)]
[Chart: Running time (sec) vs. Lseq (0-50)]
[Chart: Running time (sec) vs. Iseq (0-20)]
Overview
What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion
Case Study : Real data
Monthly services to children with an A&N report : 992 sequences
15 interpretable and useful patterns, e.g.
(RPT)(INV,FC)(FC) ..11.. (FC)
– 419 sequences
(RPT)(INV,FC)(FC)(FC)
– 57 sequences
(RPT)(INV,FC,T)(FC,T)(FC,HM)(FC)(FC,HM)
– 39 sequences
Overview
What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion
Conclusion : why does it work well?
Robust to random & weak patterned noise
Very good at organizing sequences
Long sequence data that are not random have unique signatures
What have I done?
defines a new model, Multiple Alignment Sequential Pattern Mining,
describes a novel solution, ApproxMAP (for APPROXimate Multiple Alignment Pattern mining), which introduces
– a new metric for itemsets,
– weighted sequences : a new representation of alignment information,
– and the effective use of strength cutoffs to control the level of detail included in the consensus patterns,
designs a general evaluation method to assess the quality of results from sequential pattern mining algorithms,
What have I done?
employs the evaluation method to run an extensive set of empirical evaluations of approxMAP on synthetic data,
employs the evaluation method to compare the effectiveness of approxMAP to the conventional methods based on the support model,
derives the expected support of a random sequence under the null hypothesis of no patterns in the database, to better understand the behavior of support-based methods, and
demonstrates the usefulness of approxMAP using real world data.
Future Work
Sample-based iterative clustering
– memory management
Distance metric
– multisets
– taxonomy tree
Strength cutoff
– automatic detection of customized strength cutoffs
Local alignment
Thank You !

Advisors
– Wei Wang (02-04)
– James Coggins (00-02)
– Prasun Dewan (99-00)
– Kye Hedlund (96-99)
– Jan Prins (95-96)
SW advisor
– Dean Duncan (98-04)
Other people
– Janet Jones
– Kim Flair
– Susan Paulsen
Fellow students
– Priyank Porwal, Andrew Leaver-Fay, Leland Smith
Committee
– Stephen Aylward
– Jan Prins
– Andrew Nobel
Other faculty
– Jian Pei
– Jack Snoeyink
– J. S. Marron
– Stephen Pizer
– Stephen Weiss
Colleagues
– Sang-Uok Kum, Jisung Kim
– Alexandra, Michelle
– Aron, Chris
Family
– Sohmee, my mom, dad, sister
Rw : weighted distance metric

Rw(X:w, Y) = [ R'w * wX + (n - wX) ] / n

R'w(X:w, Y) = [ weight(X) + |Y|*wX - 2*weight(X∩Y) ] / [ weight(X) + |Y|*wX ]

(cf. the unweighted replace cost: R(X, Y) = (|X| + |Y| - 2*|X∩Y|) / (|X| + |Y|))

ID    Aligned Sequence
seq3  (A) (B) (DE)
seq2  (AE) (H) (B) (D)
seq4  (A) (BCG) (D)
seq5  (BCI) (DE)
seq1  (AG) (F) (BC) (AE) (H)
WS4   (A:4,E:1,G:1):4 (H:1,F:1):2 (B:5,C:3,G:1,I:1):5 (A:1,D:4,E:3):5 (H:1):1 : 5
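The two formulas transcribe directly into code. This is a sketch under the assumption that weight(X∩Y) sums the weights of X's items that appear in Y; the helper names are illustrative:

```python
def r_prime(x, wx, y):
    """Weighted replace cost between a weighted itemset x ({item: weight},
    itemset weight wx) and a plain itemset y."""
    weight_x = sum(x.values())
    weight_xy = sum(w for item, w in x.items() if item in y)  # weight(X ∩ Y)
    denom = weight_x + len(y) * wx
    return (weight_x + len(y) * wx - 2 * weight_xy) / denom

def r_w(x, wx, y, n):
    """The n - wx sequences not represented in x contribute the maximal cost 1."""
    return (r_prime(x, wx, y) * wx + (n - wx)) / n

# First itemset of WS4 above: (A:4, E:1, G:1):4 from n = 5 sequences, against (A)
d_prime = r_prime({'A': 4, 'E': 1, 'G': 1}, 4, {'A'})   # (6 + 4 - 8) / 10 = 0.2
d_full = r_w({'A': 4, 'E': 1, 'G': 1}, 4, {'A'}, 5)     # (0.2*4 + 1) / 5 = 0.36
```

The gap between d_prime and d_full shows how the one sequence missing from this itemset's weight (n - wX = 1) pulls the distance up.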