
Data Mining: Concepts and Techniques (3rd ed.)

— Chapter 6 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign & Simon Fraser University

©2013 Han, Kamber & Pei. All rights reserved.

2

Chapter 6: Mining Frequent Patterns, Associations and Correlations: Basic Concepts and Methods

Basic Concepts

Frequent Itemset Mining Methods

Which Patterns Are Interesting?—Pattern Evaluation Methods

Summary

3

What Is Frequent Pattern Analysis?

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining

Motivation: finding inherent regularities in data

What products were often purchased together?—Beer and diapers?!

What are the subsequent purchases after buying a PC?

Can we automatically classify web documents?

Applications: basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

4

Market Basket Analysis

5

Why Is Freq. Pattern Mining Important?

Freq. pattern: an intrinsic and important property of datasets

Foundation for many essential data mining tasks:

Association, correlation, and causality analysis

Sequential, structural (e.g., sub-graph) patterns

Pattern analysis in spatiotemporal, multimedia, time-series, and stream data

Classification: frequent pattern analysis

Cluster analysis: frequent pattern-based clustering

Data warehousing: iceberg cube and cube-gradient

Semantic data compression: fascicles

Broad applications

6

Basic Concepts: Frequent Patterns

itemset: a set of one or more items

k-itemset: X = {x1, …, xk}

(absolute) support, or support count, of X: frequency or number of occurrences of an itemset X

(relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)

An itemset X is frequent if X's support is no less than a minsup threshold

[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk
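A minimal sketch of these definitions in Python (the helper name and data layout are illustrative, not from the slides), computing absolute and relative support over the five-transaction table above:

```python
# Transaction database from the table above: Tid -> set of items bought.
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support_count(itemset, db):
    """Absolute support: number of transactions containing every item of itemset."""
    return sum(1 for items in db.values() if itemset <= items)

X = {"Beer", "Diaper"}
abs_sup = support_count(X, transactions)   # 3
rel_sup = abs_sup / len(transactions)      # 0.6
print(abs_sup, rel_sup)  # X is frequent for any minsup of 60% or less
```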

7

Basic Concepts: Association Rules

Find all the rules X → Y with minimum support and confidence

support, s: probability that a transaction contains X ∪ Y

confidence, c: conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%

Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

Association rules (many more!), read as (support, confidence):
Beer → Diaper (60%, 100%)
Diaper → Beer (60%, 75%)
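A small sketch of the support/confidence computation for a rule X → Y (function name is illustrative), which reproduces the two rules above:

```python
# Same five-transaction database, as a list of item sets.
db = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def rule_metrics(X, Y, db):
    sup_xy = sum(1 for t in db if X | Y <= t)  # transactions containing X and Y
    sup_x = sum(1 for t in db if X <= t)       # transactions containing X
    return sup_xy / len(db), sup_xy / sup_x    # (support s, confidence c)

print(rule_metrics({"Beer"}, {"Diaper"}, db))  # (0.6, 1.0)   Beer -> Diaper
print(rule_metrics({"Diaper"}, {"Beer"}, db))  # (0.6, 0.75)  Diaper -> Beer
```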

8

Closed Patterns and Max-Patterns

A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!

Solution: mine closed frequent itemsets and maximal frequent itemsets instead

An itemset X is closed if X is frequent and there exists no super-itemset Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)

An itemset X is maximal if X is frequent and there exists no super-itemset Y ⊃ X such that Y is frequent (proposed by Bayardo @ SIGMOD'98)

Closed patterns are a lossless compression of frequent patterns

9

Closed Patterns and Max-Patterns

Exercise: suppose a DB contains only two transactions, <a1, …, a100> and <a1, …, a50>, and let min_sup = 1.

What is the set of closed frequent itemsets?

{a1, …, a100}: 1

{a1, …, a50}: 2

What is the set of maximal frequent itemsets?

{a1, …, a100}: 1

What is the set of all frequent itemsets?

{a1}: 2, …, {a1, a2}: 2, …, {a1, a51}: 1, …, {a1, a2, …, a100}: 1

A big number: 2^100 − 1
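These definitions translate directly into code. A brute-force sketch (only practical for tiny databases; function names are illustrative) that finds closed and maximal frequent itemsets, checked on a scaled-down version of the exercise:

```python
from itertools import combinations

def frequent_itemsets(db, min_sup):
    """Brute-force enumeration of all frequent itemsets (tiny DBs only)."""
    items = sorted(set().union(*db))
    sup = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = sum(1 for t in db if set(combo) <= t)
            if s >= min_sup:
                sup[frozenset(combo)] = s
    return sup

def closed_and_maximal(sup):
    # Closed: no frequent superset with the same support.
    closed = {X for X in sup
              if not any(X < Y and sup[Y] == sup[X] for Y in sup)}
    # Maximal: no frequent superset at all.
    maximal = {X for X in sup if not any(X < Y for Y in sup)}
    return closed, maximal

# Two transactions <a1, a2> and <a1>, min_sup = 1 (the exercise in miniature):
db = [{"a1", "a2"}, {"a1"}]
sup = frequent_itemsets(db, 1)
closed, maximal = closed_and_maximal(sup)
print(closed)   # {a1} (sup 2) and {a1, a2} (sup 1) are closed
print(maximal)  # only {a1, a2} is maximal
```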

10

Chapter 6: Mining Frequent Patterns, Associations and Correlations: Basic Concepts and Methods

Basic Concepts

Frequent Itemset Mining Methods

Which Patterns Are Interesting?—Pattern Evaluation Methods

Summary

11

Scalable Frequent Itemset Mining Methods

Apriori: A Candidate Generation-and-Test Approach

Improving the Efficiency of Apriori

FPGrowth: A Frequent Pattern-Growth Approach

Frequent Pattern Mining with Vertical Data Format

12

Apriori Algorithm

Proposed by R. Agrawal and R. Srikant in 1994

The name of the algorithm reflects the fact that it uses prior knowledge of frequent itemset properties

Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets

13

Apriori Method

Method:

Initially, scan DB once to get the frequent 1-itemsets

Generate length (k+1) candidate itemsets from length k frequent itemsets

Terminate when no frequent or candidate set can be generated

To improve the efficiency of the level-wise generation of frequent itemsets, the Apriori property is used to reduce the search space:

All nonempty subsets of a frequent itemset must also be frequent

If {beer, diaper, nuts} is frequent, so is {beer, diaper}

14

The Apriori Algorithm—An Example

Supmin = 2

Database TDB:

Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:

Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1 ({D} pruned, sup < 2):

Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (candidates from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:

Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:

Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3: {B, C, E}

3rd scan → L3:

Itemset    sup
{B, C, E}  2

15

The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
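A compact Python rendering of this pseudo-code (a sketch, not the book's reference implementation; names are illustrative), which reproduces the worked example above:

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise Apriori: db is a list of item sets, min_sup an absolute count."""
    # L1: frequent 1-itemsets
    items = sorted(set().union(*db))
    Lk = {frozenset([i]) for i in items
          if sum(1 for t in db if i in t) >= min_sup}
    all_frequent = set(Lk)
    k = 1
    while Lk:
        # Generate C(k+1) by self-joining Lk ...
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # ... and prune by the Apriori property: every k-subset must be in Lk.
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count candidates contained in each transaction; keep frequent ones.
        Lk = {c for c in Ck1
              if sum(1 for t in db if c <= t) >= min_sup}
        all_frequent |= Lk
        k += 1
    return all_frequent

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
print(apriori(db, 2))  # reproduces L1, L2, and L3 = {B, C, E} from the example
```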

16

Implementation of Apriori

How to generate candidates?

Step 1: self-joining Lk

Step 2: pruning

Example of candidate generation:

L3 = {abc, abd, acd, ace, bcd}

Self-joining: L3*L3 yields abcd (from abc and abd) and acde (from acd and ace)

Pruning: acde is removed because ade is not in L3

C4 = {abcd}
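A sketch of the self-join and prune steps, with itemsets as sorted tuples (e.g., ('a','b','c') stands for abc above; the function name is illustrative):

```python
from itertools import combinations

def gen_candidates(Lk):
    k = len(next(iter(Lk)))
    # Self-join: merge two k-itemsets that share their first k-1 items.
    joined = {a[:-1] + tuple(sorted((a[-1], b[-1])))
              for a in Lk for b in Lk
              if a[:-1] == b[:-1] and a[-1] < b[-1]}
    # Prune: drop any candidate that has an infrequent k-subset.
    return {c for c in joined
            if all(s in Lk for s in combinations(c, k))}

L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(gen_candidates(L3))  # {('a','b','c','d')} -- acde pruned: ade not in L3
```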

17

Generating Association Rules from Frequent Itemsets

Method: for each frequent itemset l, generate all nonempty proper subsets of l; for every nonempty proper subset s of l, output the rule s → (l − s) if

support_count(l) / support_count(s) >= min_conf

Because the rules are generated from frequent itemsets, each one automatically satisfies the minimum support
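A sketch of this method for a single frequent itemset l (the support counts below are the Beer/Diaper values from the earlier example; names are illustrative):

```python
from itertools import combinations

def gen_rules(l, support_count, min_conf):
    """Generate strong rules s -> (l - s) from one frequent itemset l."""
    rules = []
    for k in range(1, len(l)):                      # all nonempty proper subsets
        for s in map(frozenset, combinations(l, k)):
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

l = frozenset({"Beer", "Diaper"})
support_count = {frozenset({"Beer"}): 3, frozenset({"Diaper"}): 4, l: 3}
print(gen_rules(l, support_count, 0.8))
# Only Beer -> Diaper (conf 1.0) survives; Diaper -> Beer has conf 0.75 < 0.8
```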

18

Generating Association Rules Example

Example: given the transaction table shown on the slide and the frequent itemset X = {I1, I2, I5}, generate the association rules.

The nonempty proper subsets of X are: {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}

[The resulting association rules, with their confidence values, are listed on the slide.]

If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules are output, because these are the only ones generated that are strong

19

Scalable Frequent Itemset Mining Methods

Apriori: A Candidate Generation-and-Test Approach

Improving the Efficiency of Apriori

FPGrowth: A Frequent Pattern-Growth Approach

Frequent Pattern Mining with Vertical Data Format

20

Further Improvement of the Apriori Method

Major computational challenges:

Multiple scans of the transaction database

Huge number of candidates

Tedious workload of support counting for candidates

Improving Apriori: general ideas

Reduce passes of transaction database scans

Shrink the number of candidates

Facilitate support counting of candidates

Partition: Scan Database Only Twice

Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB, so the local frequent itemsets form a complete candidate set

Scan 1: partition the database and find local frequent patterns

Scan 2: consolidate global frequent patterns

A. Savasere, E. Omiecinski, and S. Navathe, VLDB'95
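A sketch of the two-scan Partition idea (the brute-force local miner is a stand-in for any in-memory mining algorithm; names are illustrative):

```python
from itertools import combinations

def local_frequent(part, min_frac):
    """Brute-force local mining (stand-in for any in-memory miner)."""
    items = sorted(set().union(*part))
    out = set()
    for k in range(1, len(items) + 1):
        for c in map(frozenset, combinations(items, k)):
            if sum(1 for t in part if c <= t) >= min_frac * len(part):
                out.add(c)
    return out

def partition_mine(db, min_frac, n_parts=2):
    size = (len(db) + n_parts - 1) // n_parts
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Scan 1: union of local frequent itemsets = global candidate set.
    candidates = set().union(*(local_frequent(p, min_frac) for p in parts))
    # Scan 2: verify the candidates against the whole database.
    return {c for c in candidates
            if sum(1 for t in db if c <= t) >= min_frac * len(db)}

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
print(partition_mine(db, 0.5))  # same result as Apriori with min_sup = 2
```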

22

Hash-based Technique: Reduce the Number of Candidates

A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent

J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95
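A sketch of the idea for 2-itemsets (a simplified take on the Park-Chen-Yu technique; names and bucket count are illustrative): while scanning for 1-itemsets, every pair in each transaction is also hashed into a small bucket table, and a pair in a light bucket can be pruned from C2.

```python
from itertools import combinations

def hash_filter(db, min_sup, n_buckets=7):
    """Count every 2-itemset of every transaction into a small bucket table."""
    buckets = [0] * n_buckets
    for t in db:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    def may_be_frequent(pair):
        return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_sup
    return may_be_frequent

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
survives = hash_filter(db, min_sup=2)
# A light bucket definitively rules a pair out; a heavy bucket is only a
# necessary condition, since hash collisions can inflate a bucket's count.
print(survives(("A", "D")))
```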

23

Sampling for Frequent Patterns

Select a sample of the original database, then mine frequent patterns within the sample using Apriori

H. Toivonen. Sampling large databases for association rules. VLDB'96

24

Dynamic Itemset Counting: Reduce Number of Scans

The database is partitioned into blocks marked by start points

New candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only before each complete database scan

The technique uses the count-so-far as a lower bound of the actual count

If the count-so-far passes the minimum support, the itemset is added to the frequent itemset collection

S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97

25

Apriori Example

Consider the following database with five transactions. Let min_sup = 60% and min_conf = 80%.

[Transaction table and metarule shown on the slide.]

Find all the frequent itemsets using the Apriori method

List all the strong association rules matching the given metarule

26

Scalable Frequent Itemset Mining Methods

Apriori: A Candidate Generation-and-Test Approach

Improving the Efficiency of Apriori

FPGrowth: A Frequent Pattern-Growth Approach

Frequent Pattern Mining with Vertical Data Format

27

Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation

Apriori uses a generate-and-test approach: it generates candidate itemsets and tests whether they are frequent
– Generation of candidate itemsets is expensive (in both space and time)
– Support counting is expensive
  • Subset checking (computationally expensive)
  • Multiple database scans (I/O)

FP-Growth allows frequent itemset discovery without candidate itemset generation. Two-step approach:
– Step 1: Build a compact data structure called the FP-tree, using 2 passes over the data set
– Step 2: Extract frequent itemsets directly from the FP-tree

28

Step 1: FP-Tree Construction

The FP-tree is constructed using 2 passes over the data set.

Pass 1:
– Scan the data and find the support for each item
– Discard infrequent items
– Sort frequent items in decreasing order based on their support

Use this order when building the FP-tree, so common prefixes can be shared.

Step 1: FP-Tree Construction

Pass 2: nodes correspond to items and have a counter
1. FP-Growth reads one transaction at a time and maps it to a path
2. A fixed order is used, so paths can overlap when transactions share items (i.e., when they have the same prefix); in this case, counters are incremented
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (the dotted lines); the more paths overlap, the higher the compression, and the FP-tree may fit in memory
4. Frequent itemsets are extracted from the FP-tree
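A minimal sketch of the two construction passes (class and function names are illustrative; the same-item node links of step 3 are omitted for brevity):

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(db, min_sup):
    # Pass 1: item supports; keep frequent items, ordered by decreasing support.
    sup = defaultdict(int)
    for t in db:
        for i in t:
            sup[i] += 1
    order = sorted((i for i in sup if sup[i] >= min_sup), key=lambda i: -sup[i])
    rank = {i: r for r, i in enumerate(order)}
    # Pass 2: insert each transaction (frequent items only, in the fixed order).
    root = Node(None, None)
    for t in db:
        node = root
        for i in sorted((i for i in t if i in rank), key=rank.get):
            if i in node.children:
                node.children[i].count += 1   # shared prefix: bump the counter
            else:
                node.children[i] = Node(i, node)
            node = node.children[i]
    return root, order

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
root, order = build_fp_tree(db, 2)
print(order)  # e.g. ['B', 'C', 'E', 'A'] (items by decreasing support)
```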

30

Step 2: The Frequent Pattern Growth Mining Method

Method:

For each frequent item, construct its conditional pattern base, and then its conditional FP-tree

Repeat the process on each newly created conditional FP-tree

Until the resulting FP-tree is empty, or it contains only one path—a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern

31

Scalable Frequent Itemset Mining Methods

Apriori: A Candidate Generation-and-Test Approach

Improving the Efficiency of Apriori

FPGrowth: A Frequent Pattern-Growth Approach

Frequent Pattern Mining with Vertical Data Format

32

Mining by Exploring Vertical Data Format

[Example tid-list tables shown on the slide; minimum support = 2]
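In the vertical format, each item maps to the list of transaction ids (its tid-list) that contain it, and the support of an itemset is the size of the intersection of its items' tid-lists. A small sketch of this ECLAT-style idea (names are illustrative), using the TDB from the Apriori example:

```python
from functools import reduce

# Horizontal format: Tid -> items.
horizontal = {
    10: {"A", "C", "D"},
    20: {"B", "C", "E"},
    30: {"A", "B", "C", "E"},
    40: {"B", "E"},
}

# Convert to vertical format: item -> tid-list (here, a set of tids).
vertical = {}
for tid, items in horizontal.items():
    for i in items:
        vertical.setdefault(i, set()).add(tid)

def support(itemset):
    """Support = size of the intersection of the items' tid-lists."""
    return len(reduce(set.intersection, (vertical[i] for i in itemset)))

print(support({"B", "E"}))       # 3  (tids 20, 30, 40)
print(support({"B", "C", "E"}))  # 2  (tids 20, 30)
```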

33

Chapter 6: Mining Frequent Patterns, Associations and Correlations: Basic Concepts and Methods

Basic Concepts

Frequent Itemset Mining Methods

Which Patterns Are Interesting?—Pattern Evaluation Methods

Summary

34

Interestingness Measure: Correlations (Lift)

play basketball ⇒ eat cereal [40%, 66.7%]

play basketball ⇒ not eat cereal [20%, 33.3%]

Measure of dependent/correlated events: lift

lift(A, B) = P(A ∪ B) / (P(A) × P(B))

If lift = 1, the occurrence of itemset A is independent of the occurrence of itemset B; otherwise, they are dependent and correlated

            Basketball  Not basketball  Sum (row)
Cereal      2000        1750            3750
Not cereal  1000        250             1250
Sum (col.)  3000        2000            5000

lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89

lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

Since lift(B, C) < 1, playing basketball and eating cereal are negatively correlated, even though the first rule has 66.7% confidence
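A one-function sketch of the lift computation from the contingency table above (names are illustrative):

```python
n = 5000  # total number of transactions in the table above

def lift(sup_ab, sup_a, sup_b):
    """lift(A, B) = P(A u B) / (P(A) * P(B)), with absolute counts out of n."""
    return (sup_ab / n) / ((sup_a / n) * (sup_b / n))

print(round(lift(2000, 3000, 3750), 2))  # 0.89  basketball & cereal
print(round(lift(1000, 3000, 1250), 2))  # 1.33  basketball & not cereal
```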

35

Chapter 6: Mining Frequent Patterns, Associations and Correlations: Basic Concepts and Methods

Basic Concepts

Frequent Itemset Mining Methods

Which Patterns Are Interesting?—Pattern Evaluation Methods

Summary

36

Summary

Basic concepts: association rules, support-confidence framework, closed and max-patterns

Scalable frequent pattern mining methods:

Apriori (candidate generation and test)

FP-growth (pattern growth, no candidate generation)

Vertical format approach (ECLAT, CHARM, …)

Which patterns are interesting? Pattern evaluation: lift