Page 1:

©Jan-20 Christopher W. Clifton

Materials adapted from Profs. Jennifer Neville and Dan Goldwasser

CS37300:

Data Mining & Machine Learning

Pattern Mining

Prof. Chris Clifton

31 March 2020

Data mining components

• Task specification: Pattern discovery

• Data representation: Homogeneous IID data

• Knowledge representation

• Learning technique

Page 2:


Pattern discovery

• Models describe the entire dataset (or a large part of it)

• Patterns characterize local aspects of the data

• Pattern: a predicate that returns "true" for the instances in the data where the pattern occurs and "false" otherwise

• Task: find descriptive associations between variables

Examples

• Supermarket transaction database

– 10% of the customers buy wine and cheese

• Telecommunications alarms database

– If alarms A and B occur within 30 seconds of each other then alarm C occurs within 60 seconds with p=0.5

• Web log dataset

– If a person visits the CNN website, there is a 60% chance the person will visit the ABC News website in the same month

Page 3:


Transactional Data

• Pattern discovery often involves transactional data

– Transactions viewed as IID

• Differences from "traditional" data

– Volume

– Dimensionality

– Sparsity

• We’ll see why these matter


Transaction ID  Beer  Eggs  Flour  Toilet Paper
1               0     1     1      1
2               1     1     0      1
3               0     1     0      1
4               0     1     1      1
5               0     0     0      1

Patterns in tabular data

• Primitive pattern: subset of all possible observations over variables X1,...,Xp

– If Xk is categorical then Xk=c is a primitive pattern

– If Xk is ordinal then Xk≤c is a primitive pattern

• Start from primitive patterns and combine using logical connectives such as AND and OR

– age<40 AND income<100,000

– chips=1 AND (beer=1 OR soda=1)
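To make the pattern-as-predicate view concrete, here is a minimal Python sketch (illustrative, not from the slides; the data values are made up): a pattern is a boolean function over rows, and its frequency is the fraction of rows where it returns true.

# A pattern is a predicate over rows; data values are invented for illustration.
rows = [
    {"age": 35, "income": 80000, "chips": 1, "beer": 0, "soda": 1},
    {"age": 52, "income": 120000, "chips": 1, "beer": 1, "soda": 0},
    {"age": 28, "income": 60000, "chips": 0, "beer": 0, "soda": 0},
]

# Primitive patterns combined with AND/OR, as in the bullets above:
p1 = lambda r: r["age"] < 40 and r["income"] < 100000
p2 = lambda r: r["chips"] == 1 and (r["beer"] == 1 or r["soda"] == 1)

def frequency(pattern, rows):
    """fr(theta): relative number of rows for which the pattern is true."""
    return sum(pattern(r) for r in rows) / len(rows)

print(frequency(p1, rows))  # 2/3: rows 1 and 3 match
print(frequency(p2, rows))  # 2/3: rows 1 and 2 match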

Page 4:


Pattern space

• Set of legal patterns; defined through a set of primitive patterns and operators to combine primitives

– Example: If variables X1,...,Xp are all binary, we can define the space of patterns to be all conjunctions of the form (Xi1=1) AND (Xi2=1) AND ... AND (Xik=1)

• Typically there is a generalization/specialization relationship between patterns

– Pattern α is more general than pattern β if, whenever β occurs, α occurs as well. This also means that pattern β is more specific than pattern α

– Examples: age<40 AND income<100,000 is more specific than age<40; chips=1 is more general than chips=1 AND (beer=1 OR soda=1)

– This property is used during search

Pattern properties

• Patterns that occur frequently are often of interest

– Frequency fr(θ) of a pattern θ is defined as the relative number of observations in the dataset about which θ is true

– Also known as support

• May also be interested in:

– Novelty

– Understandability

– Informativeness

Page 5:


Pattern discovery task

• Find all “interesting” patterns in the data

• Approach: find all patterns that satisfy certain conditions

• Challenge: find the right balance between

– Pattern complexity

– Pattern understandability

– Computational complexity

Association Rule Mining

• Task:

– Find frequent patterns, associations, correlations, or causal structures among items

• Data:

– Transaction databases, relational database, or other information repositories

• Applications:

– Market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification

Page 6:


Association Rule

• A rule is an expression of the form θ→φ

• Association rules:

– All variables are binary: {A1,...,Ak} → {B1,...,Bh}

– Probabilistic statement about the co-occurrence of certain events in the database

• Mining rules

– Number of rules grows exponentially with dimensions

– How to find patterns in an efficient manner?

– How to determine which rules are interesting?

Association Rule Example

Page 7:


Sequential Pattern Mining

• Task:

– Find frequent substring patterns

• Data:

– Sequence data, biological data, text collections

• Applications:

– Bioinformatics, text analysis, clustering, classification

Sequential Patterns

• Substring

• Regular expression

• Episode

– Partially ordered collection of events occurring together

– Can take time or sequential ordering into account but is insensitive to intervening events

– E.g., a headache followed by a sense of disorientation within a given time period

Page 8:


Graph Mining

• Task:

– Find frequent subgraph patterns

• Data:

– Graph databases or relational databases

• Applications:

– Graph indexing, similarity search, clustering, classification

Subgraph Patterns

Page 9:


Frequent Pattern Mining

• Define pattern space

– Association rules, sequential patterns, subgraphs, …

• Identify frequent patterns

– Search techniques – next module

• Determine “interesting”

– Thursday


Learning

Page 10:


Search problem

• Searching for all patterns is computationally intractable

• Consider market basket data where each row has 1000 binary variables

• How many possible patterns?

– How many unique transactions? 2^1000

– How many subsets of transactions (patterns)? 2^(2^1000)

– The size decreases if you do not consider patterns with 0s... but if you also consider rules rather than just subsets, the size increases again

Transaction ID  Beer  Eggs  Flour  Toilet Paper
1               0     1     1      1
2               1     1     0      0
3               0     1     0      1
4               0     1     1      1
5               0     0     0      1

Example: Lattice of Patterns (general to specific)

Page 11:


Solution

• Take advantage of the nature of the score function to prune parts of the search space and reduce run time

• What is the score function?

– Patterns that occur frequently are often of interest, so the score function often involves frequency

• The Apriori principle:

– Any subset of a frequent itemset must be frequent

– Thus we can prune any candidate itemset whose subsets are not all frequent (see the sketch below)
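As a small runnable sketch of the pruning test (item names are made up for illustration):

from itertools import combinations

def may_be_frequent(candidate, frequent_smaller):
    """Apriori test: every (k-1)-subset of a k-itemset must be frequent."""
    k = len(candidate)
    return all(frozenset(s) in frequent_smaller
               for s in combinations(candidate, k - 1))

# {beer, eggs, flour} is pruned because its subset {beer, flour} is not frequent.
L2 = {frozenset(p) for p in [("beer", "eggs"), ("eggs", "flour")]}
print(may_be_frequent(frozenset(["beer", "eggs", "flour"]), L2))  # False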

Association Rules

• Data

– Basket: customer transaction; items: products

– Basket: document; items: words

– Basket: web pages; items: links

• Find all rules of the form θ→φ that satisfy the following constraints:

– Support of the rule is greater than threshold s

– Confidence of the rule is greater than threshold c

– E.g., 98% of people who purchase tires and auto accessories also have automotive service done

Page 12:


Rule Evaluation

• Support (aka frequency)

– s(θ→φ) = fr(θ∧φ)

– Proportion of the N transactions that contain both the antecedent θ and the consequent φ

• Confidence (aka accuracy)

– c(θ→φ) = p(φ | θ) = fr(θ∧φ) / fr(θ)

– Proportion of transactions containing the antecedent θ that also contain the consequent φ
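A minimal sketch of these two formulas, with transactions represented as Python sets (the small dataset is the example used later in these slides):

def support(itemset, transactions):
    """s: fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """c = p(consequent | antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support({"A", "C"}, transactions))       # 0.5
print(confidence({"A"}, {"C"}, transactions))  # 0.5 / 0.75 ≈ 0.667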

Finding Frequent Itemsets

• Find sets of items with minimum support

• Support is monotonic

– A subset of a frequent itemset must also be frequent

– E.g., if {A,B} is a frequent itemset then both {A} and {B} are frequent itemsets as well

• Approach

– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)

– Prune any sets of size k that are not frequent

Page 13:


Example

[Figure: lattice of itemsets over the 5-transaction table, with each itemset's support shown as a count out of 5; support threshold = 0.3, so itemsets with support 3/5 or more are frequent]

Example

[Figure: a second view of the lattice with support counts out of 5]

Page 14:


Materials adapted from Profs. Jennifer Neville and Dan Goldwasser

CS37300:

Data Mining & Machine Learning

Pattern Mining: Apriori Algorithm

Prof. Chris Clifton

2 April 2020

Apriori algorithm

• Input: data (D), minsup, minconf

• Output: All rules (R) with support ≥ minsup and confidence ≥ minconf

Apriori Algorithm ( D, minsup, minconf )
  // Find all itemsets with support ≥ minsup
  L = FrequentItemsetGeneration ( D, minsup )
  // Find all rules with confidence ≥ minconf
  R = RuleGeneration ( L, minconf )
  return R

Page 15:


Example

Transaction ID  Items Bought
1000            A, B, C
1001            A, C
1002            A, D
1003            B, E, F

min support = 50%; min confidence = 50%

Itemset  Support
{A}      75%
{B}      50%
{C}      50%
{D}      25%
{E}      25%
{F}      25%
{A,B}    25%
{A,C}    50%
{B,C}    25%

For rule A⇒C: confidence(A⇒C) = support({A,C}) / support({A}) = 50% / 75% ≈ 66.6%

Algorithm to find frequent itemsets

FrequentItemsetGeneration ( D, minsup )
  // Ck: candidate itemsets of size k
  // Lk: frequent itemsets of size k
  L1 = {frequent single items}
  for (k = 1; Lk != ∅; k++)
    Ck+1 = CandidateItemsetGeneration ( Lk, minsup )
    for each transaction t in D
      increment the count of all candidates in Ck+1 contained in t
    Lk+1 = candidates in Ck+1 with minsup
  return ⋃k Lk
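A runnable Python sketch of this level-wise loop. It is simplified relative to the pseudocode: candidates are formed by joining any two frequent k-itemsets whose union has k+1 items, and the subset-based prune step (next slide) is omitted, since infrequent candidates are dropped by counting anyway.

def frequent_itemset_generation(transactions, minsup):
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # L1: frequent single items
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= minsup}
    frequent = set(current)
    k = 1
    while current:
        # C(k+1): join two frequent k-itemsets sharing k-1 items
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {c for c, cnt in counts.items() if cnt / n >= minsup}
        frequent |= current
        k += 1
    return frequent

transactions = [frozenset(t) for t in
                [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]]
print(frequent_itemset_generation(transactions, 0.5))
# {A}, {B}, {C}, and {A,C}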

Page 16:


Example

[Figure: itemset lattice for the 4-transaction example, with each itemset's support shown as a count out of 4]

Generating candidates

• Store the items of each itemset in Lk-1 in order (e.g., lexicographic)

CandidateItemsetGeneration ( Lk-1, minsup )
  // step 1: self-joining Lk-1 (writing m = k−1)
  insert into Ck
    select p.item1, ..., p.item_m, q.item_m
    from Lk-1 as p, Lk-1 as q
    where p.item1 = q.item1, ..., p.item_m-1 = q.item_m-1, p.item_m < q.item_m
  // step 2: pruning
  for all itemsets c in Ck
    for all (k−1)-subsets s of c
      if s not in Lk-1 then delete c from Ck
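A Python sketch of the same two steps, joining on a shared lexicographic prefix and then pruning; the function and variable names are illustrative. It reproduces the worked example on the next slide.

from itertools import combinations

def candidate_itemset_generation(L_prev):
    """Build size-k candidates from the frequent (k-1)-itemsets in L_prev."""
    prev = sorted(tuple(sorted(s)) for s in L_prev)
    # Step 1: self-join -- merge pairs sharing all but their last item.
    joined = [p + (q[-1],) for p in prev for q in prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]]
    # Step 2: prune -- drop candidates with an infrequent (k-1)-subset.
    prev_set = set(prev)
    return [c for c in joined
            if all(s in prev_set for s in combinations(c, len(c) - 1))]

L3 = ["abc", "abd", "acd", "ace", "bcd"]
print(candidate_itemset_generation(L3))
# [('a', 'b', 'c', 'd')] -- acde is generated but pruned (ade not in L3)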

Page 17:


Example

• L3 = {abc, abd, acd, ace, bcd}

• Self-join, i.e., SELECT p.item1, p.item2, p.item3, q.item3 FROM L3 AS p, L3 AS q WHERE p.item1=q.item1 AND p.item2=q.item2 AND p.item3<q.item3:

– abc joined with abd gives abcd

– acd joined with ace gives acde

• Pruning:

– acde is removed because ade is not in L3

• C4 = {abcd}

Counting support

• Issues with counting support

– Total number of candidates may be very, very large

– One transaction may contain many candidates

• Approach #1:

– Compare each transaction to every candidate itemset

• Approach #2:

– Exploit lexicographic order to efficiently enumerate all k-itemsets for a given transaction

– Increment counts of appropriate itemset (sequential scan)
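A sketch of Approach #2 with itertools: enumerate each transaction's k-subsets in lexicographic order and increment the counts of those that are candidates.

from itertools import combinations

def count_support(transactions, candidates, k):
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        # all k-subsets of the transaction, in lexicographic order
        for subset in combinations(sorted(t), k):
            s = frozenset(subset)
            if s in counts:
                counts[s] += 1
    return counts

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(count_support(transactions, [{"A", "C"}, {"A", "B"}], 2))
# {A,C}: 2, {A,B}: 1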

Page 18:


Example enumeration

Counting support

• Approach #3:

– Use approach #2 but look up candidates efficiently in hash tree

• Store candidate itemsets in a hash-tree

• Leaf node contains a list of itemsets and counts

• Interior node contains a hash table

Page 19:


Example hash tree

Rule generation

• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → (L − f) satisfies the minimum confidence requirement

– If {A,B,C,D} is a frequent itemset, the candidate rules are listed below

• If |L| = k then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)

ABC→D, ABD→C, ACD→B, BCD→A,
A→BCD, B→ACD, C→ABD, D→ABC,
AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB
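A short sketch enumerating the 2^k − 2 candidate rules of a frequent itemset:

from itertools import combinations

def candidate_rules(itemset):
    """Yield f -> (L - f) for every non-empty proper subset f of L."""
    L = frozenset(itemset)
    for r in range(1, len(L)):
        for f in combinations(sorted(L), r):
            yield frozenset(f), L - frozenset(f)

rules = list(candidate_rules({"A", "B", "C", "D"}))
print(len(rules))  # 14 == 2**4 - 2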

Page 20:


Efficient rule generation

• In general, confidence is not monotonic

– c(ABC→D) can be larger or smaller than c(AB→D)

• But the confidence of rules generated from the same itemset is monotonic with respect to the number of items in the consequent

– Recall that: c(θ→φ) = p(φ | θ) = fr(θ∧φ) / fr(θ)

– Consider the frequent itemset L = {A,B,C,D}: c(ABC→D) ≥ c(AB→CD) ≥ c(A→BCD), since every rule shares the numerator fr(ABCD) while the antecedent support in the denominator can only grow as items move into the consequent

Pruning rules

Page 21:


Algorithm to find rules with high confidence

Let Rm = confident rules with m-variable consequents
Let Hm = candidate rules with m-variable consequents

RuleGeneration ( L, minconf )
  for (k = 1; Lk != ∅; k++)
    H1 = candidate rules with single-variable consequents from Lk
    R1 = candidates in H1 with confidence ≥ minconf
    for (m = 1; Hm != ∅; m++)
      if k > m + 1:
        Hm+1 = generate candidate rules from Rm
        Rm+1 = candidates in Hm+1 with confidence ≥ minconf
  return ⋃m Rm
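A runnable sketch of this procedure for a single frequent itemset; it assumes a support dictionary mapping each subset (as a frozenset) to its support, and the names are illustrative:

def rule_generation(itemset, support, minconf):
    """Grow consequents level by level, keeping only confident rules."""
    rules = []
    consequents = [frozenset([i]) for i in itemset]  # H1
    while consequents:
        confident = []
        for h in consequents:
            antecedent = itemset - h
            if not antecedent:
                continue
            conf = support[itemset] / support[antecedent]
            if conf >= minconf:
                rules.append((antecedent, h, conf))
                confident.append(h)
        # Merge confident m-consequents into (m+1)-consequents; this is
        # valid because confidence only drops as the consequent grows.
        consequents = list({a | b for a in confident for b in confident
                            if len(a | b) == len(a) + 1})
    return rules

support = {frozenset(s): v for s, v in [("A", 0.75), ("C", 0.5), ("AC", 0.5)]}
print(rule_generation(frozenset("AC"), support, 0.5))
# C -> A with confidence 1.0, and A -> C with confidence ~0.67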

Complexity

• Apriori's bottleneck is candidate itemset generation

– 10^4 frequent 1-itemsets produce about 10^7 candidate 2-itemsets

– To discover a frequent pattern of size 100 you need to generate 2^100 ≈ 10^30 candidates

• Solutions

– Use alternative algorithms to search itemset lattice

– Use alternative data representations

• E.g., Mine patterns without candidate generation using FP-tree structure

Page 22:


Association Rules

• Knowledge representation?

– If-then rules

• Score function?

– Support, confidence

• Search?

– Exhaustive search: returns all rules that exceed the support and confidence thresholds; pruning (based on the apriori principle) is used to eliminate portions of the search space

• Optimal?

– Yes. Guaranteed to find all rules that exceed the specified thresholds.

Evaluation

• Association rule algorithms usually return many, many rules

– Many are uninteresting or redundant (e.g., ABC→D and AB→D may have same support and confidence)

• How to quantify interestingness?

– Objective: statistical measures

– Subjective: unexpected and/or actionable patterns (requires domain knowledge)

Page 23:


Objective Measures

• Given a rule X→Y, we can compute statistics based on contingency tables

• Contingency table for X→Y:

        Y     ¬Y    Total
X       f11   f10   f1*
¬X      f01   f00   f0*
Total   f*1   f*0   |T|

– f11: support count of X and Y
– f10: support count of X and ¬Y
– f01: support count of ¬X and Y
– f00: support count of ¬X and ¬Y

• Used to define various measures: support, confidence, lift, Gini, J-measure, etc.

Drawback of Support

• Support suffers from the rare item problem (Liu et al., 1999)

– Infrequent items not meeting minimum support are ignored, which is problematic if rare items are important

– E.g., rarely sold products which account for a large part of revenue or profit

• Support falls rapidly with itemset size, so a threshold on support favors short itemsets

Page 24:


Drawback of Confidence

• Association rule: Tea → Coffee

• Confidence = P(Coffee | Tea) = 15/20 = 0.75

– P(Coffee) = 90/100 = 0.9

• Although confidence is high, the rule is misleading

– P(Coffee | ¬Tea) = 75/80 = 0.9375

        Coffee  ¬Coffee  Total
Tea     15      5        20
¬Tea    75      5        80
Total   90      10       100

Statistical-based Measures

• Lift = P(Y|X) / P(Y)

• Interest = P(X,Y) / ( P(X) · P(Y) )

• PS = P(X,Y) − P(X) · P(Y)

• φ-coefficient = ( P(X,Y) − P(X) · P(Y) ) / √( P(X)(1−P(X)) · P(Y)(1−P(Y)) )
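A quick sketch computing these measures from the Tea → Coffee contingency table above:

# Counts f11, f10, f01, f00 from the Tea -> Coffee table.
f11, f10, f01, f00 = 15, 5, 75, 5
N = f11 + f10 + f01 + f00
pX = (f11 + f10) / N   # P(Tea)    = 0.20
pY = (f11 + f01) / N   # P(Coffee) = 0.90
pXY = f11 / N          # P(Tea, Coffee) = 0.15

lift = (pXY / pX) / pY                              # P(Y|X) / P(Y)
interest = pXY / (pX * pY)                          # equals lift here
ps = pXY - pX * pY                                  # Piatetsky-Shapiro
phi = ps / (pX * (1 - pX) * pY * (1 - pY)) ** 0.5

print(round(lift, 3))  # 0.833 (< 1: negative association)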

Page 25:


Lift Example

• Association rule: Tea → Coffee

• Confidence = P(Coffee | Tea) = 0.75

– P(Coffee) = 0.9

• Lift = 0.75 / 0.9 ≈ 0.83 (< 1, therefore negatively associated)

        Coffee  ¬Coffee  Total
Tea     15      5        20
¬Tea    75      5        80
Total   90      10       100

Unexpectedness

• Model expectation of users (based on domain knowledge)

• Combine with evidence from data

Page 26:


Statistical Measures: Why don't we use these?

• Support is monotonic

– P(Coffee ∧ Tea) ≤ min( P(Coffee), P(Tea) )

– This enables the apriori principle

• Other measures are not

– We use other measures in conjunction with support

        Coffee  ¬Coffee  Total
Tea     15      5        20
¬Tea    75      5        80
Total   90      10       100