©Jan-20 Christopher W. Clifton 120
Materials adapted from Profs. Jennifer Neville and Dan Goldwasser
CS37300:
Data Mining & Machine Learning
Pattern Mining
Prof. Chris Clifton
31 March 2020
Data mining components
• Task specification: Pattern discovery
• Data representation: Homogeneous IID data
• Knowledge representation
• Learning technique
Pattern discovery
• Models describe an entire dataset (or a large part of it)
• Patterns characterize local aspects of the data
• Pattern: predicate that returns “true” for the instances in
the data where the pattern occurs and “false” otherwise
• Task: find descriptive associations between variables
Examples
• Supermarket transaction database
– 10% of the customers buy wine and cheese
• Telecommunications alarms database
– If alarms A and B occur within 30 seconds of each other then alarm C occurs within 60 seconds with p=0.5
• Web log dataset
– If a person visits the CNN website, there is a 60% chance the person will visit the ABC News website in the same month
Transactional Data
• Pattern discovery often involves transactional data
– Transactions viewed as IID
• Differences from “traditional” data
– Volume
– Dimensionality
– Sparsity
• We’ll see why these matter
Transaction ID | Beer | Eggs | Flour | Toilet Paper
1 | 0 | 1 | 1 | 1
2 | 1 | 1 | 0 | 1
3 | 0 | 1 | 0 | 1
4 | 0 | 1 | 1 | 1
5 | 0 | 0 | 0 | 1
Pattern in tabular data
• Primitive pattern: subset of all possible observations over variables X1,...,Xp
– If Xk is categorical then Xk=c is a primitive pattern
– If Xk is ordinal then Xk≤c is a primitive pattern
• Start from primitive patterns and combine using logical connectives such as AND and OR
– age<40 AND income<100,000
– chips=1 AND (beer=1 OR soda=1)
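As a concrete illustration (a minimal Python sketch, not from the slides; the helper names `primitive_eq`, `primitive_lt`, `p_and`, and `p_or` are my own), primitive patterns can be represented as predicates over a row and combined with logical connectives:

```python
def primitive_eq(var, value):
    """Primitive pattern Xk = c (categorical variable)."""
    return lambda row: row[var] == value

def primitive_lt(var, value):
    """Primitive pattern Xk < c (ordinal variable)."""
    return lambda row: row[var] < value

def p_and(*patterns):
    """Conjunction: true where every sub-pattern is true."""
    return lambda row: all(p(row) for p in patterns)

def p_or(*patterns):
    """Disjunction: true where at least one sub-pattern is true."""
    return lambda row: any(p(row) for p in patterns)

# age<40 AND income<100,000
young_modest = p_and(primitive_lt("age", 40), primitive_lt("income", 100_000))

# chips=1 AND (beer=1 OR soda=1)
pattern = p_and(primitive_eq("chips", 1),
                p_or(primitive_eq("beer", 1), primitive_eq("soda", 1)))

print(pattern({"chips": 1, "beer": 0, "soda": 1}))  # True
print(pattern({"chips": 1, "beer": 0, "soda": 0}))  # False
```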
Pattern space
• Set of legal patterns; defined through a set of primitive patterns and operators to combine primitives
– Example: If variables X1,...,Xp are all binary we can define the space of patterns to be all conjunctions of the form (Xi1=1) AND (Xi2=1) AND ... AND (Xik=1)
• Typically there is a generalization/specialization relationship between patterns
– Pattern α is more general than pattern β if, whenever β occurs, α occurs as well. This also means that pattern β is more specific than pattern α
– Examples: age<40 AND income<100,000 is more specific than age<40; chips=1 is more general than chips=1 AND (beer=1 OR soda=1)
– This property is used during search
Pattern properties
• Patterns that occur frequently are often of interest
– Frequency fr(θ) of a pattern θ is defined as the fraction of observations in the dataset for which θ is true
– Also known as support
• May also be interested in:
– Novelty
– Understandability
– Informativeness
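The frequency definition above can be sketched directly (an illustrative snippet, not course code; the `support` helper and the dictionary row encoding are my own), using the Beer/Eggs/Flour/Toilet Paper table from earlier:

```python
rows = [  # the five-transaction table from the slides
    {"beer": 0, "eggs": 1, "flour": 1, "tp": 1},
    {"beer": 1, "eggs": 1, "flour": 0, "tp": 1},
    {"beer": 0, "eggs": 1, "flour": 0, "tp": 1},
    {"beer": 0, "eggs": 1, "flour": 1, "tp": 1},
    {"beer": 0, "eggs": 0, "flour": 0, "tp": 1},
]

def support(pattern, rows):
    """fr(theta): relative number of rows where the pattern is true."""
    return sum(1 for r in rows if pattern(r)) / len(rows)

# theta: eggs=1 AND toilet paper=1, true in 4 of the 5 transactions
theta = lambda r: r["eggs"] == 1 and r["tp"] == 1
print(support(theta, rows))  # 0.8
```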
Pattern discovery task
• Find all “interesting” patterns in the data
• Approach: find all patterns that satisfy certain conditions
• Challenge: find the right balance between
– Pattern complexity
– Pattern understandability
– Computational complexity
Association Rule Mining
• Task:
– Find frequent patterns, associations, correlations, or causal structures among items
• Data:
– Transaction databases, relational database, or other information repositories
• Applications:
– Market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification
Association Rule
• A rule is an expression of the form θ→φ
• Association rules:
– All variables are binary {A1,...,Ak} → {B1,...,Bh}
– Probabilistic statement about the co-occurrence of certain events in the database
• Mining rules
– Number of rules grows exponentially with dimensions
– How to find patterns in an efficient manner?
– How to determine which rules are interesting?
Association Rule Example
Sequential Pattern Mining
• Task:
– Find frequent substring patterns
• Data:
– Sequence data, biological data, text collections
• Applications:
– Bioinformatics, text analysis, clustering, classification
Sequential Patterns
• Substring
• Regular expression
• Episode
– Partially ordered collection of events occurring together
– Can take time or sequential ordering into account but is insensitive to intervening events
– E.g., a headache followed by a sense of disorientation within a given time period
Graph Mining
• Task:
– Find frequent subgraph patterns
• Data:
– Graph databases or relational databases
• Applications:
– Graph indexing, similarity search, clustering, classification
Subgraph Patterns
Frequent Pattern Mining
• Define pattern space
– Association rules, sequential patterns, subgraphs, …
• Identify frequent patterns
– Search techniques – next module
• Determine “interesting”
– Thursday
Learning
Search problem
• Searching for all patterns is computationally intractable
• Consider market basket data where each row has 1000 binary variables
• How many possible patterns?
– How many unique transactions? 2^1000
– How many subsets (patterns)? 2^(2^1000)
– The size decreases if you do not consider patterns with 0s... but if you also consider rules rather than just subsets, the size increases again
Transaction ID | Beer | Eggs | Flour | Toilet Paper
1 | 0 | 1 | 1 | 1
2 | 1 | 1 | 0 | 0
3 | 0 | 1 | 0 | 1
4 | 0 | 1 | 1 | 1
5 | 0 | 0 | 0 | 1
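The counting argument can be replayed at a toy scale (an illustrative sketch; `p = 4` stands in for the slide's 1000 binary variables):

```python
# With p binary items there are 2**p distinct possible transactions, and a
# "pattern" viewed as an arbitrary subset of those possible transactions
# gives 2**(2**p) candidates -- already astronomical for tiny p.

p = 4
num_transactions = 2 ** p               # distinct binary rows
num_patterns = 2 ** num_transactions    # subsets of the row space
print(num_transactions, num_patterns)   # 16 65536
```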
Example: Lattice of Patterns
(general to specific)
Solution
• Take advantage of the nature of the score function to
prune parts of the search space and reduce run time
• What is the score function?
– Patterns that occur frequently are often of interest... thus the score function often involves frequency
• The Apriori principle:
– Any subset of a frequent itemset must be frequent
– Thus we can prune any itemset that has an infrequent subset
Association Rules
• Data
– Basket: customer transaction; items: products
– Basket: document; items: words
– Basket: web pages; items: links
• Find all rules of the form θ→φ that satisfy the following constraints:
– Support of the rule is greater than threshold s
– Confidence of the rule is greater than threshold c
– E.g., 98% of people who purchase tires and auto accessories also have automotive service done
Rule Evaluation
• Support (aka frequency)
– s(θ→φ) = fr(θ∧φ) / N
– Proportion of N items with antecedent θ and consequent φ
• Confidence (aka accuracy)
– c(θ→φ) = p(φ | θ) = fr(θ∧φ) / fr(θ)
– Proportion of items which have antecedent θ that also have consequent φ
Finding Frequent Itemsets
• Find sets of items with minimum support
• Support is monotonic
– A subset of a frequent itemset must also be frequent
– E.g., if {A,B} is a frequent itemset then both {A} and {B} are frequent itemsets as well
• Approach
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
– Prune any sets of size k that are not frequent
Example
[Figure: itemset lattice with per-itemset support counts out of 5; support threshold = 0.3]

Example
[Figure: the same lattice with itemsets below the support threshold pruned]
Materials adapted from Profs. Jennifer Neville and Dan Goldwasser
CS37300:
Data Mining & Machine Learning
Pattern Mining: Apriori Algorithm
Prof. Chris Clifton
2 April 2020
Apriori algorithm
• Input: data (D), minsup, minconf
• Output: All rules (R) with support ≥ minsup and confidence ≥ minconf
Apriori Algorithm ( D, minsup, minconf )
// Find all itemsets with support ≥ minsup
L = FrequentItemsetGeneration ( D, minsup )
// Find all rules with confidence ≥ minconf
R = RuleGeneration ( L, minconf )
Return R
Example

min support = 50%; min confidence = 50%

Transaction ID | Items Bought
1000 | A, B, C
1001 | A, C
1002 | A, D
1003 | B, E, F

Itemset | Support
{A} | 75%
{B} | 50%
{C} | 50%
{D} | 25%
{E} | 25%
{F} | 25%
{A,B} | 25%
{A,C} | 50%
{B,C} | 25%

For rule A⇒C: confidence(A⇒C) = support({A,C}) / support({A}) = 66.6%
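The slide's numbers are easy to verify (a small sketch, not course code; the `support` helper over sets of items is my own):

```python
# Support is the fraction of transactions containing an itemset;
# confidence of A => C is support({A,C}) / support({A}).

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

s_ac = support({"A", "C"}, transactions)    # 2 of 4 transactions
conf = s_ac / support({"A"}, transactions)  # 0.5 / 0.75
print(round(s_ac, 2), round(conf, 3))       # 0.5 0.667
```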
Algorithm to find frequent itemsets
FrequentItemsetGeneration ( D, minsup )
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k
  L1 = {frequent single items}
  for (k=1; Lk != ∅; k++)
    Ck+1 = CandidateItemsetGeneration ( Lk, minsup )
    for each transaction t in D
      increment the count of all candidates in Ck+1 contained in t
    Lk+1 = candidates in Ck+1 with support ≥ minsup
  Return ⋃k Lk
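A runnable sketch of this level-wise procedure (my own Python, not the course's implementation; candidate generation is folded in as a simple pairwise union plus the apriori subset check):

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup):
    """Level-wise frequent itemset generation (frozensets as itemsets)."""
    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}

    def frequent(candidates):
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= minsup}

    L, result = frequent(items), set()
    while L:
        result |= L
        k = len(next(iter(L)))
        # join pairs of frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # apriori pruning: every k-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        L = frequent(candidates)
    return result

txns = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(sorted("".join(sorted(s)) for s in apriori_frequent_itemsets(txns, 0.5)))
# ['A', 'AC', 'B', 'C']
```

On the four-transaction example from the slides this recovers exactly the itemsets with support ≥ 50%.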
Example
[Figure: level-wise itemset generation on the four-transaction example, with per-itemset support counts out of 4]
Generating candidates
• Store the items of Lk-1 in order (e.g., lexicographic)

CandidateItemsetGeneration ( Lk-1, minsup )
  // step 1: self-joining Lk-1
  insert into Ck
  select p.item1, ..., p.itemm, q.itemm
  from Lk-1 as p, Lk-1 as q
  where p.item1=q.item1, ..., p.itemm-1=q.itemm-1, p.itemm<q.itemm
  // step 2: pruning
  for all itemsets c in Ck
    for all (k-1)-subsets s of c
      if s not in Lk-1 then delete c from Ck
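In Python rather than SQL, the self-join and pruning might look like this (a sketch under the assumption that itemsets are kept as lexicographically sorted tuples; `candidate_gen` is my own name):

```python
from itertools import combinations

def candidate_gen(Lk_1):
    """Self-join frequent (k-1)-itemsets, then apriori-prune the result."""
    k_1 = len(next(iter(Lk_1)))
    # join p and q when the first k-2 items agree and p's last < q's last
    joined = {p[:-1] + (p[-1], q[-1])
              for p in Lk_1 for q in Lk_1
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # pruning: drop any candidate with an infrequent (k-1)-subset
    return {c for c in joined
            if all(s in Lk_1 for s in combinations(c, k_1))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(sorted(candidate_gen(L3)))  # [('a', 'b', 'c', 'd')] -- acde is pruned
```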
Example
• L3 = {abc, abd, acd, ace, bcd}
• Self-join: abc joins abd to give abcd; acd joins ace to give acde
• Pruning
– acde is removed because ade is not in L3
• C4 = {abcd}
The join producing abcd, written as SQL over L3:

SELECT p.item1, p.item2, p.item3, q.item3
FROM L3 AS p, L3 AS q
WHERE p.item1=q.item1 AND p.item2=q.item2 AND p.item3<q.item3
Counting support
• Issues with counting support
– Total number of candidates may be very, very large
– One transaction may contain many candidates
• Approach #1:
– Compare each transaction to every candidate itemset
• Approach #2:
– Exploit lexicographic order to efficiently enumerate all k-itemsets for a given transaction
– Increment counts of appropriate itemset (sequential scan)
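Approach #2 can be sketched with `itertools.combinations`, which yields the k-subsets of a sorted transaction in lexicographic order (illustrative code; the dictionary-of-counts representation is my own):

```python
from itertools import combinations

def count_support(transaction, candidate_counts, k):
    """Increment the count of every candidate k-itemset contained in t."""
    for itemset in combinations(sorted(transaction), k):
        if itemset in candidate_counts:
            candidate_counts[itemset] += 1

counts = {("A", "C"): 0, ("B", "C"): 0}
count_support({"A", "B", "C"}, counts, 2)  # enumerates AB, AC, BC
count_support({"A", "C"}, counts, 2)       # enumerates AC
print(counts)  # {('A', 'C'): 2, ('B', 'C'): 1}
```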
Example enumeration
Counting support
• Approach #3:
– Use approach #2 but look up candidates efficiently in hash tree
• Store candidate itemsets in a hash-tree
• Leaf node contains a list of itemsets and counts
• Interior node contains a hash table
Example hash tree
Rule generation
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → (L − f) satisfies the minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC→D, ABD→C, ACD→B, BCD→A,
A→BCD, B→ACD, C→ABD, D→ABC,
AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB
• If |L| = k then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
Efficient rule generation
• In general, confidence is not monotonic
– c(ABC→D) can be larger or smaller than c(AB→D)
• But the confidence of rules generated from the same itemset is monotonic with respect to the number of items in the consequent
– Recall that: c(θ→φ) = p(φ | θ)
– Consider frequent itemset L = {A,B,C,D}: c(ABC→D) ≥ c(AB→CD) ≥ c(A→BCD)
Pruning rules
Algorithm to find rules with high confidence
Let Rm = confident rules with m-variable consequents
Let Hm = candidate rules with m-variable consequents
RuleGeneration ( L, minconf )
  for (k=2; Lk != ∅; k++)
    H1 = candidate rules with single-variable consequents from Lk
    R1 = candidates in H1 with confidence ≥ minconf
    for (m=1; Rm != ∅ and k > m+1; m++)
      Hm+1 = generate candidate rules from Rm
      Rm+1 = candidates in Hm+1 with confidence ≥ minconf
  Return ⋃m Rm
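A runnable sketch of this procedure for a single frequent itemset (my own Python, not the course's code; `gen_rules` and the support-dictionary interface are assumptions):

```python
def gen_rules(itemset, support, minconf):
    """Grow consequents level-wise; only confident rules seed the next level.

    support: dict mapping frozensets to their support values.
    Returns a list of (antecedent, consequent) frozenset pairs.
    """
    rules = []
    s_L = support[itemset]
    H = {frozenset([i]) for i in itemset}  # single-item consequents
    while H and len(next(iter(H))) < len(itemset):
        # confidence of (itemset - h) -> h is s(itemset) / s(itemset - h)
        kept = [h for h in H if s_L / support[itemset - h] >= minconf]
        rules += [(itemset - h, h) for h in kept]
        m = len(next(iter(H))) + 1
        # merge surviving consequents into (m+1)-item candidates
        H = {a | b for a in kept for b in kept if len(a | b) == m}
    return rules

sup = {frozenset("A"): 0.75, frozenset("C"): 0.5, frozenset("AC"): 0.5}
rules = gen_rules(frozenset("AC"), sup, 0.5)
print(sorted((tuple(sorted(a)), tuple(sorted(c))) for a, c in rules))
# [(('A',), ('C',)), (('C',), ('A',))]
```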
Complexity
• Apriori bottleneck is candidate itemset generation
– 10^4 1-itemsets will produce ~10^7 candidate 2-itemsets
– To discover a frequent pattern of size 100 you need to generate 2^100 ≈ 10^30 candidates
• Solutions
– Use alternative algorithms to search itemset lattice
– Use alternative data representations
• E.g., Mine patterns without candidate generation using FP-tree structure
Association Rules
• Knowledge representation?
– If-then rules
• Score function?
– Support, confidence
• Search?
– Exhaustive search: returns all rules that exceed the support and confidence thresholds; pruning (based on the apriori principle) is used to eliminate portions of the search space
• Optimal?
– Yes. Guaranteed to find all rules that exceed the specified thresholds.
Evaluation
• Association rules algorithms usually return many, many
rules
– Many are uninteresting or redundant (e.g., ABC→D and AB→D may have same support and confidence)
• How to quantify interestingness?
– Objective: statistical measures
– Subjective: unexpected and/or actionable patterns (requires domain knowledge)
Objective Measures
• Given a rule X→Y, we can compute statistics based on contingency tables
• Contingency table for X→Y:

 | Y | ¬Y | Total
X | f11 | f10 | f1*
¬X | f01 | f00 | f0*
Total | f*1 | f*0 | |T|

– f11: support of X and Y
– f10: support of X and ¬Y
– f01: support of ¬X and Y
– f00: support of ¬X and ¬Y
• Used to define various measures: support, confidence, lift, Gini, J-measure, etc.
Drawback of Support
• Support suffers from the rare item problem (Liu et al., 1999)
– Infrequent items not meeting minimum support are ignored, which is problematic if rare items are important
– E.g., rarely sold products which account for a large part of revenue or profit
• Support falls rapidly with itemset size; a threshold on support thus favors short itemsets
Drawback of Confidence
• Association rule: Tea → Coffee
• Confidence = P(Coffee|Tea) = 0.75
– P(Coffee) = 0.9
• Although confidence is high, the rule is misleading
– P(Coffee|¬Tea) = 75/80 = 0.9375

 | Coffee | ¬Coffee | Total
Tea | 15 | 5 | 20
¬Tea | 75 | 5 | 80
Total | 90 | 10 | 100
Statistical-based Measures
• Lift = P(Y|X) / P(Y)
• Interest = P(X,Y) / ( P(X) P(Y) )
• PS = P(X,Y) − P(X) P(Y)
• φ-coefficient = ( P(X,Y) − P(X) P(Y) ) / √( P(X) (1−P(X)) P(Y) (1−P(Y)) )
Lift Example
• Association rule: Tea → Coffee
• Confidence = P(Coffee|Tea) = 0.75
– P(Coffee) = 0.9
• Lift = 0.75/0.9 ≈ 0.83 (< 1, therefore negatively associated)

 | Coffee | ¬Coffee | Total
Tea | 15 | 5 | 20
¬Tea | 75 | 5 | 80
Total | 90 | 10 | 100
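These measures can be recomputed from the Tea → Coffee contingency table above (an illustrative snippet; the variable names are my own):

```python
from math import sqrt

# contingency counts: f11 = Tea&Coffee, f10 = Tea&no-Coffee, etc.
f11, f10, f01, f00 = 15, 5, 75, 5
N = f11 + f10 + f01 + f00
p_x, p_y, p_xy = (f11 + f10) / N, (f11 + f01) / N, f11 / N

lift = (p_xy / p_x) / p_y            # P(Y|X) / P(Y)
interest = p_xy / (p_x * p_y)        # equals lift for this binary case
ps = p_xy - p_x * p_y                # Piatetsky-Shapiro measure
phi = (p_xy - p_x * p_y) / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

print(round(lift, 3), round(ps, 3), round(phi, 3))  # 0.833 -0.03 -0.25
```

All three agree with the slide: the rule is (weakly) negatively associated despite its 75% confidence.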
Unexpectedness
• Model expectation of users (based on domain knowledge)
• Combine with evidence from data
Statistical Measures:
Why don’t we use these?
• Support is monotonic
– P(Coffee ∧ Tea) ≤ min( P(Coffee), P(Tea) )
– Enables apriori principle
• Other measures are not
– We use other measures in conjunction with support
 | Coffee | ¬Coffee | Total
Tea | 15 | 5 | 20
¬Tea | 75 | 5 | 80
Total | 90 | 10 | 100