1
Data Mining: Concepts and Techniques
— Chapter 3 —
Association Rules
2
Mining Frequent Patterns and Associations
• Basic concepts
• Apriori
• FP-Growth
• Mining Multi-Dimensional Association
• Exercises
3
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that
occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent
itemsets and association rule mining
• Motivation: finding inherent regularities in data
– What products were often purchased together? Cheese and chips?
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, Web log (click-stream) analysis, and DNA sequence analysis
4
Why Is Freq. Pattern Mining Important?
• Forms the foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential, structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series, and stream
data
– Classification: associative classification
– Cluster analysis: frequent pattern-based clustering
– Semantic data compression
5
Basic Concepts: Frequent Patterns and Association Rules
• Itemset X = {x1, …, xk}
• Find all the rules X ⇒ Y with minimum support and confidence
– support, s: probability that a transaction contains X ∪ Y
– confidence, c: conditional probability that a transaction containing X also contains Y
Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

[Venn diagram: customers buying cheese, customers buying chips, and customers buying both]

Let supmin = 50%, confmin = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A ⇒ D (support 60%, confidence 100%)
D ⇒ A (support 60%, confidence 75%)
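As a concrete check (not part of the original slides), here is a minimal Python sketch that recomputes support and confidence for the rule A ⇒ D over the five transactions above:

# The five transactions from the table above.
transactions = [
    {"A", "B", "D"},             # 10
    {"A", "C", "D"},             # 20
    {"A", "D", "E"},             # 30
    {"B", "E", "F"},             # 40
    {"B", "C", "D", "E", "F"},   # 50
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

s = support({"A", "D"})                   # P(A and D)
c = support({"A", "D"}) / support({"A"})  # P(D | A)
print(f"A => D: support = {s:.0%}, confidence = {c:.0%}")   # 60%, 100%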
6
Closed Patterns and Max-Patterns
• Number of sub-patterns: e.g., {a1, …, a100} contains 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns! (Why?)
• Solution: Mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
7
Closed Patterns and Max-Patterns
• Exercise: DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
• What is the set of closed itemsets?
– <a1, …, a100>: 1
– <a1, …, a50>: 2
• What is the set of max-patterns?
– <a1, …, a100>: 1
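For intuition, here is a brute-force Python sketch (my own illustration, feasible only for toy data) that checks closedness and maximality on a scaled-down version of this exercise, with {a1, …, a4} and {a1, a2} standing in for the 100- and 50-item transactions:

from itertools import combinations

# Scaled-down exercise DB: {a1..a4} and {a1, a2}; min_sup = 1.
db = [frozenset(["a1", "a2", "a3", "a4"]), frozenset(["a1", "a2"])]
min_sup = 1

items = sorted(set().union(*db))
def sup(x):
    return sum(x <= t for t in db)

# Enumerate all frequent itemsets (exponential; fine only for toy data).
freq = {frozenset(c): sup(frozenset(c))
        for r in range(1, len(items) + 1)
        for c in combinations(items, r)
        if sup(frozenset(c)) >= min_sup}

# Closed: no proper superset with the same support. Max: no frequent proper superset.
closed = [x for x in freq if not any(x < y and freq[y] == freq[x] for y in freq)]
maximal = [x for x in freq if not any(x < y for y in freq)]
print([sorted(x) for x in closed])    # [['a1','a2'], ['a1','a2','a3','a4']] (sup 2 and 1)
print([sorted(x) for x in maximal])   # [['a1','a2','a3','a4']]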
8
Scalable Methods for Mining Frequent Patterns
• The downward closure property of frequent patterns
– Any subset of a frequent itemset must be frequent
– If {cheese, chips, nuts} is frequent, so is {cheese, chips}
– i.e., every transaction having {cheese, chips, nuts} also contains {cheese, chips}
• Scalable mining methods:
– Apriori (Agrawal & Srikant@VLDB’94)
– Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
9
Apriori: A Candidate Generation-and-Test Approach
• Apriori is a classic algorithm for learning association rules
• Apriori pruning principle: If there is any itemset which is infrequent, its
superset should not be generated/tested!
• Method:
– Initially, scan DB once to get frequent 1-itemset
– Generate length (k+1) candidate itemsets from length k frequent itemsets
– Test the candidates against DB
– Terminate when no frequent or candidate set can be generated
10
The Apriori Algorithm—An Example (Supmin = 2)

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (drop {D}, sup < 2): {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
11
Apriori Algorithm - An Example
Assume minimum support = 2.
12
Apriori Algorithm - An Example
The final “frequent” itemsets are those remaining in L2 and L3. However, {2,3}, {2,5}, and {3,5} are all contained in the larger itemset {2,3,5}. Thus, the final group of itemsets reported by Apriori is {1,3} and {2,3,5}. These are the itemsets from which we will generate association rules.
13
Generating Association Rules from Frequent Itemsets
• Only strong association rules are generated
• Frequent itemsets satisfy the minimum support threshold
• Strong rules are those that also satisfy the minimum confidence threshold
• confidence(A ⇒ B) = P(B | A) = support(A ∪ B) / support(A)
For each frequent itemset f:
    generate all non-empty proper subsets of f
    for every non-empty subset s of f:
        if support(f) / support(s) ≥ min_confidence:
            output the rule s ⇒ (f − s)
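A Python rendering of this loop (a sketch of mine, not the slides' code), with the support counts of the worked example on the next slide hard-coded:

from itertools import combinations

# Support counts from the worked example (four transactions over items 1, 2, 3, 5).
support = {
    frozenset({1}): 2, frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({1, 3}): 2, frozenset({2, 3}): 2, frozenset({2, 5}): 3,
    frozenset({3, 5}): 2, frozenset({2, 3, 5}): 2,
}
min_conf = 0.75

for f in (x for x in support if len(x) >= 2):       # each frequent itemset f
    for r in range(1, len(f)):                      # each non-empty proper subset s
        for s in map(frozenset, combinations(sorted(f), r)):
            conf = support[f] / support[s]          # confidence(s => f - s)
            if conf >= min_conf:
                print(f"{set(s)} => {set(f - s)}  conf = {conf:.2f}")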
14
Generating Association Rules (Example Continued)
• Itemsets: {1,3} and {2,3,5}

Candidate rules for {1,3}:       Candidate rules for {2,3,5} (and its 2-item subsets):

Rule         Conf.               Rule            Conf.         Rule        Conf.
{1} ⇒ {3}    2/2 = 1.00          {2,3} ⇒ {5}     2/2 = 1.00    {2} ⇒ {5}   3/3 = 1.00
{3} ⇒ {1}    2/3 = 0.67          {2,5} ⇒ {3}     2/3 = 0.67    {2} ⇒ {3}   2/3 = 0.67
                                 {3,5} ⇒ {2}     2/2 = 1.00    {3} ⇒ {2}   2/3 = 0.67
                                 {2} ⇒ {3,5}     2/3 = 0.67    {3} ⇒ {5}   2/3 = 0.67
                                 {3} ⇒ {2,5}     2/3 = 0.67    {5} ⇒ {2}   3/3 = 1.00
                                 {5} ⇒ {2,3}     2/3 = 0.67    {5} ⇒ {3}   2/3 = 0.67
Assuming a min. confidence of 75%, the final set of rules reported by Apriori is: {1} ⇒ {3}, {2,3} ⇒ {5}, {3,5} ⇒ {2}, {5} ⇒ {2}, and {2} ⇒ {5}
15
The Apriori Algorithm
• Pseudo-code:
  Ck: candidate itemset of size k
  Lk: frequent itemset of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
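This pseudocode translates into a compact, if unoptimized, Python sketch (mine, not the book's implementation); run on the TDB example of slide 10, it reproduces L1 through L3:

from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori; returns {frequent itemset: support count}.
    A teaching sketch, not an optimized implementation."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # one scan for L1
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(L)
    k = 1
    while L:
        # C(k+1): self-join L(k), keep unions of size k+1, then prune any
        # candidate that has an infrequent k-subset (Apriori principle).
        cand = {a | b for a in L for b in L if len(a | b) == k + 1}
        cand = {c for c in cand
                if all(frozenset(s) in L for s in combinations(c, k))}
        counts = {c: sum(c <= t for t in transactions) for c in cand}  # one DB scan
        L = {s: n for s, n in counts.items() if n >= min_sup}
        frequent.update(L)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for s, n in sorted(apriori(tdb, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), n)   # reproduces L1, L2, L3 of the slide-10 example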
16
Important Details of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• How to count supports of candidates?
• Example of candidate generation:
– L3 = {abc, abd, acd, ace, bcd}
– Self-joining: L3 * L3
  • abcd from abc and abd
  • acde from acd and ace
– Pruning:
  • acde is removed because ade is not in L3
– C4 = {abcd}
17
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
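A Python sketch of this join-and-prune step (my illustration), treating itemsets as sorted tuples so that the SQL's item1 … itemk-1 comparisons become tuple slicing; it reproduces the L3 → C4 example of the previous slide:

from itertools import combinations

def gen_candidates(Lk_1):
    """C_k from L_{k-1}: the self-join and prune steps above, with itemsets
    represented as sorted tuples."""
    Lset = set(Lk_1)
    k_1 = len(Lk_1[0])                        # size of the input itemsets
    Ck = []
    for p in Lk_1:
        for q in Lk_1:
            # join: agree on the first k-2 items, and p's last item < q's last
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every (k-1)-subset of c must itself be in L_{k-1}
                if all(s in Lset for s in combinations(c, k_1)):
                    Ck.append(c)
    return Ck

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]
print(gen_candidates(L3))   # [('a', 'b', 'c', 'd')] -- acde pruned, ade not in L3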
18
Challenges of Frequent Pattern Mining
• Challenges
– Multiple scans of the transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
• Improving Apriori: general ideas
– Reduce the number of transaction-database scans
– Shrink the number of candidates
– Facilitate support counting of candidates
19
Construct FP-tree from a Transaction Database
TID   Items bought               (ordered) frequent items
100   {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200   {a, b, c, f, l, m, o}      {f, c, a, b, m}
300   {b, f, h, j, o, w}         {f, b}
400   {b, c, k, s, p}            {c, b, p}
500   {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

min_support = 3

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

Resulting FP-tree:
{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Sort frequent items in frequency descending order
3. Scan DB again, construct FP-tree
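The three steps above can be sketched in Python as follows (a minimal teaching version of mine, with node-links kept in a plain header dict; ties among equally frequent items are broken by first appearance, so the exact tree shape may differ from the figure while encoding the same information):

class FPNode:
    """One FP-tree node: an item, a count, a parent link, and child nodes."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    """Two-pass FP-tree construction following the three steps above."""
    # Pass 1 (step 1): count single items and keep the frequent ones.
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    # Step 2: global frequency-descending order.
    order = sorted(freq, key=lambda i: -freq[i])
    root = FPNode(None, None)
    header = {i: [] for i in order}          # node-links per item, used in mining
    # Pass 2 (step 3): insert each transaction's frequent items in that order.
    for t in transactions:
        node = root
        for item in [i for i in order if i in t]:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header = build_fp_tree(db, 3)
print({i: sum(n.count for n in header[i]) for i in header})
# -> counts per item across node-links: f:4, c:4, a:3, m:3, p:3, b:3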
20
Construct FP-tree from a Transaction Database
L = [I2:7, I1:6, I3:6, I4:2, I5:2] with min support count = 2
21
Benefits of the FP-tree Structure
• Completeness
– Preserves complete information for frequent pattern mining
– Never breaks a long pattern of any transaction
• Compactness
– Reduces irrelevant information—infrequent items are gone
– Items are in frequency-descending order: the more frequently an item occurs, the more likely its node is to be shared
– Never larger than the original database (not counting node-links and the count fields)
22
Mining Frequent Patterns With FP-trees
• Idea: frequent pattern growth
– Recursively grow frequent patterns by pattern and database partitioning
• Method
– For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty, or it contains only one path—a single path generates all the combinations of its sub-paths, each of which is a frequent pattern (see the sketch below)
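To make the single-path case concrete, here is a tiny illustration (mine, not from the slides): along a single path the counts are non-increasing, so every combination of path items is frequent, with support equal to the count of its deepest (least frequent) item:

from itertools import combinations

# Single-path FP-tree f:4 -> c:3 -> a:3 (counts non-increasing along the path).
path = [("f", 4), ("c", 3), ("a", 3)]

# Every non-empty combination of path items is a frequent pattern; its
# support is the minimum count among its items.
for r in range(1, len(path) + 1):
    for combo in combinations(path, r):
        items = {i for i, _ in combo}
        print(items, ":", min(c for _, c in combo))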
23
Scaling FP-growth by DB Projection
• FP-tree cannot fit in memory?—DB projection
• First partition a database into a set of projected DBs
• Then construct and mine FP-tree for each projected DB
24
FP-Growth vs. Apriori: Scalability With the Support Threshold
[Figure: run time (sec., 0 to 100) vs. support threshold (%, 0 to 3) on data set T25I20D10K, comparing D1 FP-growth runtime against D1 Apriori runtime. As the support threshold decreases, Apriori's runtime grows sharply while FP-growth's remains low.]
25
Multiple-Level Association Rules
• Items often form a hierarchy
– Items at the lower level are expected to have lower support
– Rules on itemsets at the appropriate level can be quite useful
– The transaction database can be encoded based on dimensions and levels

Item hierarchy:
Food
├── Milk
│   ├── Skim
│   └── 2%
└── Bread
    ├── Wheat
    └── White
26
Mining Multi-Level Associations
• A top-down, progressive-deepening approach
– First find high-level strong rules:
  • milk ⇒ bread [20%, 60%]
– Then find their lower-level "weaker" rules:
  • 2% milk ⇒ wheat bread [6%, 50%]
– With one threshold set for all levels: if the support threshold is too high, meaningful low-level associations may be missed; if it is too low, many uninteresting rules may be generated
• Different minimum support thresholds across levels lead to different algorithms (e.g., decreasing min-support at lower levels); the sketch below shows the level-wise re-encoding this relies on
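Mining high-level rules first relies on re-encoding transactions at the higher level. A minimal sketch (mine, with a hypothetical parent map matching the Food hierarchy above):

# Hypothetical hierarchy: leaf items roll up to their parent categories.
parent = {"skim milk": "milk", "2% milk": "milk",
          "wheat bread": "bread", "white bread": "bread"}

def roll_up(transaction):
    """Re-encode a transaction at the next level of the hierarchy."""
    return {parent.get(item, item) for item in transaction}

t = {"2% milk", "wheat bread"}
print(roll_up(t))   # {'milk', 'bread'} -- mined first for high-level rules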
27
Quantitative Association Rules
Handling quantitative rules may require mapping continuous variables into Boolean attributes
28
Mapping Quantitative to Boolean
• One possible solution is to map the problem to Boolean association rules:
– discretize a non-categorical attribute into intervals, e.g., Age: [20,29], [30,39], …
– categorical attributes: each value becomes one item
– non-categorical attributes: each interval becomes one item
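A small sketch of this mapping (mine; the decade intervals and the city attribute are illustrative assumptions, not from the slides):

# Discretize a non-categorical attribute (age) into interval items and turn
# a categorical attribute (city) into one Boolean item per value.
def to_items(record):
    items = set()
    lo = (record["age"] // 10) * 10          # decade intervals: [20,29], [30,39], ...
    items.add(f"age=[{lo},{lo + 9}]")
    items.add(f"city={record['city']}")      # one item per category value
    return items

print(to_items({"age": 24, "city": "Tehran"}))  # {'age=[20,29]', 'city=Tehran'}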
29
Example in Web Content Mining
• Document associations
– Find (content-based) associations among documents in a collection
– Documents correspond to items and words correspond to transactions
– Frequent itemsets are groups of docs in which many words occur in common
• Term associations
– Find associations among words based on their occurrences in documents
– Similar to the above, but with the table inverted (terms as items, docs as transactions)

Term frequencies:
          Doc 1   Doc 2   Doc 3   ...   Doc n
business  5       5       2       ...   1
capital   2       4       3       ...   5
fund      0       0       0       ...   1
...
invest    6       0       0       ...   3
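For term associations, the inversion amounts to reading the table column-wise. A small sketch (mine, using only the four visible terms and pretending there are four documents, since the "..." rows and columns are not available):

# Invert the term-frequency table into transactions for term mining
# (terms as items, documents as transactions), keeping terms with freq > 0.
table = {
    "business": [5, 5, 2, 1],
    "capital":  [2, 4, 3, 5],
    "fund":     [0, 0, 0, 1],
    "invest":   [6, 0, 0, 3],
}
n_docs = 4
transactions = [{term for term, row in table.items() if row[d] > 0}
                for d in range(n_docs)]
print(transactions)  # doc 1 -> {'business', 'capital', 'invest'}, ...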
30
Example in Web Usage Mining
• Association rules in Web transactions
– Discover affinities among sets of Web page references across user sessions
• Examples
– 60% of clients who accessed /products/ also accessed /products/software/webminer.htm
– 30% of clients who accessed /special-offer.html placed an online order in /products/software/
– Actual example from the official IBM Olympics site:
  • {Badminton, Diving} ⇒ {Table Tennis} [conf = 69.7%, sup = 0.35%]
• Applications
– Use rules to serve dynamic, customized content to users
– Prefetch the files that are most likely to be accessed
– Determine the best way to structure the Web site (site optimization)
– Targeted electronic advertising and increased cross-sales
31
Example in Web Usage Mining
• Association rules from the Cray Research Web site
• Design "suggestions"

Conf.   Supp.   Association rule
82.8    3.17    /PUBLIC/product-info/T3E
                ⇒ /PUBLIC/product-info/T3E/CRAY_T3E.html
90      0.14    /PUBLIC/product-info/J90/J90.html, /PUBLIC/product-info/T3E
                ⇒ /PUBLIC/product-info/T3E/CRAY_T3E.html
97.2    0.15    /PUBLIC/product-info/J90, /PUBLIC/product-info/T3E/CRAY_T3E.html, /PUBLIC/product-info/T90
                ⇒ /PUBLIC/product-info/T3E, /PUBLIC/sc.html
32
Assignment

1. Using the IBM dataset with the specifications below, implement the two methods Apriori and FP-Growth and compare the results (time and scalability):
   – The number of transactions = 100000, the average transaction length = 10, the average frequent pattern length …
2. Review a paper related to Web mining that uses association rules or frequent pattern mining (be prepared to present it).
3. Using the Weka software, carry out a sample experiment with the Apriori and FP-Growth methods (use the UCI datasets).