Lecture 15. Huffman Codes and Association Rules (II)

Prof. Sin-Min Lee

Department of Computer Science

Page 2: Huffman Codes and Association Rules (II)

Huffman Code Example

• Given the symbols and their frequencies:

Symbol     A  B  C  D  E
Frequency  3  1  2  4  6

Sorting the symbols by increasing frequency gives:

Symbol     B  C  A  D  E
Frequency  1  2  3  4  6

Page 3: Huffman Codes and Association Rules (II)

Huffman Code Example – Step 1

• Because B and C have the two lowest frequencies, they are merged into one node, BC. Its frequency is the sum 1 + 2 = 3.

Node       BC
Frequency  3

Page 4: Huffman Codes and Association Rules (II)

Huffman Code Example – Step 2

• Re-sort the remaining nodes by increasing frequency. This gives:

Node       BC  A  D  E
Frequency  3   3  4  6

Page 5: Huffman Codes and Association Rules (II)

Huffman Code Example – Step 3

• Merging the two lowest nodes again, A (3) and BC (3), gives a node of frequency 6:

[Tree: root 6 with children A (3) and BC (3)]

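Page 6: Huffman Codes and Association Rules (II)

Huffman Code Example – Step 4

• Replacing BC and A with the merged node (written ABC or BCA) leaves D = 4, E = 6 and the merged node = 6. Because E and the merged node tie at 6, the slide lists four equivalent orderings:

D  E  ABC
4  6  6

D  E  BCA
4  6  6

D  ABC  E
4  6    6

D  BCA  E
4  6    6

[Figure: the partial trees for these orderings]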

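Page 7: Huffman Codes and Association Rules (II)

Huffman Code Example – Step 5

• Continuing from the previous step, D (4) is merged with the node ABC (6), also written BCA, to form a node of frequency 10:

D E BCA (4 6 6) becomes E DBCA (6 10)

D E ABC (4 6 6) becomes DABC E (10 6)

[Figure: the partial trees after this merge]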

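Page 8: Huffman Codes and Association Rules (II)

Huffman Code Example – Step 6

• Continuing from the previous step, the same merge for the other orderings gives:

BCA D E (6 4 6) becomes E DBCA (6 10)

ABC D E (6 4 6) becomes E DABC (6 10)

• The last two nodes are then merged into the root, with frequency 6 + 10 = 16.

[Figure: the resulting complete trees]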

Page 9: Huffman Codes and Association Rules (II)

Huffman Code Example – Step 7

• After the previous step, assign a 0 to each left branch and a 1 to each right branch. Depending on how ties were broken and how children were ordered, the resulting codes are:

Tree 1: E = 0, D = 10, B = 1100, C = 1101, A = 111

Tree 2: E = 0, D = 10, A = 110, B = 1110, C = 1111

Tree 3: A = 00, B = 010, C = 011, D = 10, E = 11

Tree 4: B = 000, C = 001, A = 01, D = 10, E = 11

[Figure: the four code trees]
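The merge-and-relabel procedure from Steps 1 to 7 can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name `huffman_codes`, the use of a binary heap, and the counter-based tie-breaking are all assumptions made for the example. Different tie-breaking yields different but equally short codes.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {symbol: bitstring}; assumes at least two symbols (hypothetical helper)."""
    tiebreak = count()  # unique sequence number so the heap never compares the dicts
    heap = [(weight, next(tiebreak), {symbol: ""}) for symbol, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Take the two lightest nodes, as in Steps 1 and 3.
        w_left, _, left = heapq.heappop(heap)
        w_right, _, right = heapq.heappop(heap)
        # Left branch gets a leading 0, right branch a leading 1 (Step 7).
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (w_left + w_right, next(tiebreak), merged))
    return heap[0][2]


freqs = {"A": 3, "B": 1, "C": 2, "D": 4, "E": 6}
codes = huffman_codes(freqs)
# Total encoded length: sum of frequency times code length over all symbols.
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print(codes, total_bits)
```

With this particular tie-breaking, D and E are merged with each other rather than with the A/BC node. Any valid Huffman tree for these frequencies gives the same total of 35 bits.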

Page 10: Huffman Codes and Association Rules (II)

Example

• Items = {milk, coke, pepsi, beer, juice}.

• Support threshold = 3 baskets.

B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b}

B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j}

B7 = {c, b, j} B8 = {b, c}

• Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
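As a quick check of the list above, the following sketch counts the support of every possible itemset by brute force. It is only meant for this eight-basket example. The names `support` and `frequent` are illustrative, and enumerating all subsets does not scale to real data.

```python
from itertools import combinations

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
min_support = 3  # support threshold from the slide


def support(itemset, baskets):
    """Number of baskets that contain every item of itemset."""
    return sum(1 for basket in baskets if itemset.issubset(basket))


items = sorted(set().union(*baskets))
frequent = {}
# Try every non-empty subset of the item universe (feasible only for tiny examples).
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        n = support(set(combo), baskets)
        if n >= min_support:
            frequent[frozenset(combo)] = n

for itemset, n in frequent.items():
    print(sorted(itemset), n)
```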

Page 11: Huffman Codes and Association Rules (II)

Association Rules

• Association rule R : Itemset1 => Itemset2

– Itemset1 and Itemset2 are disjoint, and Itemset2 is non-empty

– Meaning: if a transaction includes Itemset1, then it also has Itemset2

• Examples

– A, B => E, C

– A => B, C

Page 12: Huffman Codes and Association Rules (II)

Example

B1 = {m, c, b} B2 = {m, p, j}

B3 = {m, b} B4 = {c, j}

B5 = {m, p, b} B6 = {m, c, b, j}

B7 = {c, b, j} B8 = {b, c}

• An association rule: {m, b} → c.

– Confidence = 2/4 = 50%: of the four baskets containing {m, b} (B1, B3, B5, B6), two (B1 and B6) also contain c.

Page 17: Huffman Codes and Association Rules (II)

From Frequent Itemsets to Association Rules

• Q: Given frequent set {A,B,E}, what are possible association rules?

– A => B, E

– A, B => E

– A, E => B

– B => A, E

– B, E => A

– E => A, B

– __ => A,B,E (empty rule), or true => A,B,E

Page 18: Huffman Codes and Association Rules (II)

Classification vs Association Rules

Classification Rules

• Focus on one target field

• Specify class in all cases

• Measures: Accuracy

Association Rules

• Many target fields

• Applicable in some cases

• Measures: Support, Confidence, Lift

Page 19: Huffman Codes and Association Rules (II)

Rule Support and Confidence

• Suppose R : I => J is an association rule

– sup(R) = sup(I ∪ J) is the support count of R

  • the support of the itemset I ∪ J (transactions containing all items of I and J)

– conf(R) = sup(R) / sup(I) is the confidence of R

  • the fraction of transactions containing I that also contain J

• Association rules that meet both minimum support and minimum confidence are sometimes called “strong” rules

Page 20: Huffman Codes and Association Rules (II)

Association Rules Example:

• Q: Given frequent set {A,B,E}, what association rules have minsup = 2 and minconf = 50%?

Qualify:

A, B => E : conf = 2/4 = 50%

A, E => B : conf = 2/2 = 100%

B, E => A : conf = 2/2 = 100%

E => A, B : conf = 2/2 = 100%

Don’t qualify:

A => B, E : conf = 2/6 = 33% < 50%

B => A, E : conf = 2/7 = 29% < 50%

__ => A,B,E : conf = 2/9 = 22% < 50%

TID  List of items
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C
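The confidence values on this slide can be reproduced with a short sketch. The helper names `support_count` and `confidence` are assumptions for illustration. Confidence is computed as the support of the whole rule divided by the support of its left-hand side, and an empty left-hand side matches every transaction.

```python
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]


def support_count(itemset):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset.issubset(t))


def confidence(lhs, rhs):
    """conf(lhs => rhs) = sup(lhs ∪ rhs) / sup(lhs)."""
    return support_count(lhs | rhs) / support_count(lhs)


rules = [
    ({"A", "B"}, {"E"}), ({"A", "E"}, {"B"}), ({"B", "E"}, {"A"}),
    ({"E"}, {"A", "B"}), ({"A"}, {"B", "E"}), ({"B"}, {"A", "E"}),
    (set(), {"A", "B", "E"}),
]
min_conf = 0.5
qualifying = [(lhs, rhs) for lhs, rhs in rules if confidence(lhs, rhs) >= min_conf]
for lhs, rhs in rules:
    print(sorted(lhs), "=>", sorted(rhs), round(confidence(lhs, rhs), 2))
```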

Page 21: Huffman Codes and Association Rules (II)

Find Strong Association Rules

• A strong rule satisfies the thresholds minsup and minconf:

– sup(R) >= minsup and conf(R) >= minconf

• Problem:

– Find all association rules with given minsup and minconf

• First, find all frequent itemsets

Page 22: Huffman Codes and Association Rules (II)

Finding Frequent Itemsets

• Start by finding one-item sets (easy)

• Q: How?

• A: Simply count the frequencies of all items

Page 23: Huffman Codes and Association Rules (II)

Finding itemsets: next level

• Apriori algorithm (Agrawal & Srikant)

• Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, …

– If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!

– In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent

• Compute candidate k-item sets by merging (k-1)-item sets
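The merging step can be sketched as a join followed by a prune. The function name `apriori_gen` and the union-based join are assumptions made for this illustration. The original paper joins sets that share their first k-2 items, which produces the same candidates once pruning is applied. The sample input uses the frequent pairs from the four-transaction example later in the lecture.

```python
from itertools import combinations


def apriori_gen(frequent_prev, k):
    """Build candidate k-item sets from frequent (k-1)-item sets (hypothetical helper)."""
    prev = set(frequent_prev)
    # Join: the union of two (k-1)-item sets that differ in exactly one item has k items.
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune: drop a candidate if any of its (k-1)-item subsets is not frequent.
    return {
        cand for cand in joined
        if all(frozenset(sub) in prev for sub in combinations(cand, k - 1))
    }


L2 = {frozenset(pair) for pair in [(1, 3), (2, 3), (2, 5), (3, 5)]}
C3 = apriori_gen(L2, 3)
print(C3)  # only {2, 3, 5} survives; {1, 2, 3} and {1, 3, 5} are pruned
```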

Page 25: Huffman Codes and Association Rules (II)

Finding Association Rules

• A typical question: “find all association rules with support ≥ s and confidence ≥ c.”

– Note: “support” of an association rule is the support of the set of items it mentions.

• Hard part: finding the high-support (frequent) itemsets.

– Checking the confidence of association rules involving those sets is relatively easy.

Page 26: Huffman Codes and Association Rules (II)

Naïve Algorithm

• A simple way to find frequent pairs is:

– Read file once, counting in main memory the occurrences of each pair.

  • Expand each basket of n items into its n(n-1)/2 pairs.

• Fails if #items-squared exceeds main memory.
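A minimal sketch of the naïve pair count, reusing the eight baskets from the earlier milk/coke example. The name `count_pairs` is illustrative. The counter keeps one entry per distinct pair seen, which is why memory grows roughly with the square of the number of items.

```python
from collections import Counter
from itertools import combinations


def count_pairs(baskets):
    """One pass over the data, expanding each basket of n items into n(n-1)/2 pairs."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("b", "m") and ("m", "b") the same key.
        counts.update(combinations(sorted(basket), 2))
    return counts


baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
pair_counts = count_pairs(baskets)
frequent_pairs = {pair for pair, n in pair_counts.items() if n >= 3}
print(frequent_pairs)
```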

Page 28: Huffman Codes and Association Rules (II)

[Figure: Apriori pipeline. C1 → Filter (first pass) → L1 → Construct → C2 → Filter (second pass) → L2 → Construct → C3]

Page 29: Huffman Codes and Association Rules (II)

Fast Algorithms for Mining Association Rules, by Rakesh Agrawal and Ramakrishnan Srikant, IBM Almaden Research Center

[Agrawal, Srikant 94]

Page 39: Huffman Codes and Association Rules (II)

Database

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

C^1

TID  Set-of-itemsets
100  { {1}, {3}, {4} }
200  { {2}, {3}, {5} }
300  { {1}, {2}, {3}, {5} }
400  { {2}, {5} }

L1

Itemset  Support
{1}      2
{2}      3
{3}      3
{5}      3

C2

Itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

C^2

TID  Set-of-itemsets
100  { {1 3} }
200  { {2 3}, {2 5}, {3 5} }
300  { {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5} }
400  { {2 5} }

L2

Itemset  Support
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3

Itemset
{2 3 5}

C^3

TID  Set-of-itemsets
200  { {2 3 5} }
300  { {2 3 5} }

L3

Itemset  Support
{2 3 5}  2

Page 43: Huffman Codes and Association Rules (II)

Dynamic Programming Approach

• Want proof of principle of optimality and overlapping subproblems

• Principle of Optimality

– The optimal solution to Lk includes the optimal solution of Lk-1

– Proof by contradiction

• Overlapping Subproblems

– Lemma: every subset of a frequent item set is a frequent item set

– Proof by contradiction

Page 56: Huffman Codes and Association Rules (II)

The Apriori Algorithm: Example

• Consider a database, D, consisting of 9 transactions.

• Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 = 22%).

• Let the minimum confidence required be 70%.

• We first have to find the frequent itemsets using the Apriori algorithm.

• Then, association rules will be generated using the minimum support and minimum confidence.

TID   List of Items
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3

Page 57: Huffman Codes and Association Rules (II)

Step 1: Generating 1-itemset Frequent Pattern

Scan D for count of each candidate, giving C1:

Itemset  Sup.Count
{I1}     6
{I2}     7
{I3}     6
{I4}     2
{I5}     2

Compare candidate support count with minimum support count, giving L1:

Itemset  Sup.Count
{I1}     6
{I2}     7
{I3}     6
{I4}     2
{I5}     2

• In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.

• The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.

Page 58: Huffman Codes and Association Rules (II)

Step 2: Generating 2-itemset Frequent Pattern

Generate C2 candidates from L1:

Itemset
{I1, I2}
{I1, I3}
{I1, I4}
{I1, I5}
{I2, I3}
{I2, I4}
{I2, I5}
{I3, I4}
{I3, I5}
{I4, I5}

Scan D for count of each candidate, giving C2 with counts:

Itemset   Sup.Count
{I1, I2}  4
{I1, I3}  4
{I1, I4}  1
{I1, I5}  2
{I2, I3}  4
{I2, I4}  2
{I2, I5}  2
{I3, I4}  0
{I3, I5}  1
{I4, I5}  0

Compare candidate support count with minimum support count, giving L2:

Itemset   Sup.Count
{I1, I2}  4
{I1, I3}  4
{I1, I5}  2
{I2, I3}  4
{I2, I4}  2
{I2, I5}  2

Page 59: Huffman Codes and Association Rules (II)

Step 2: Generating 2-itemset Frequent Pattern [Cont.]

• To discover the set of frequent 2-itemsets, L2 , the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2.

• Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the middle table).

• The set of frequent 2-itemsets, L2 , is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

• Note: We haven’t used Apriori Property yet.

Page 60: Huffman Codes and Association Rules (II)

Step 3: Generating 3-itemset Frequent Pattern

C3 (candidates):

Itemset
{I1, I2, I3}
{I1, I2, I5}

Scan D for count of each candidate, giving C3 with counts:

Itemset       Sup.Count
{I1, I2, I3}  2
{I1, I2, I5}  2

Compare candidate support count with min support count, giving L3:

Itemset       Sup.Count
{I1, I2, I3}  2
{I1, I2, I5}  2

• The generation of the set of candidate 3-itemsets, C3 , involves use of the Apriori Property.

• In order to find C3, we compute L2 Join L2.

• C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.

• Now, Join step is complete and Prune step will be used to reduce the size of C3. Prune step helps to avoid heavy computation due to large Ck.

Page 61: Huffman Codes and Association Rules (II)

Step 3: Generating 3-itemset Frequent Pattern [Cont.]

• Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the latter four candidates cannot possibly be frequent. How?

• For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.

• Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}.

• But {I3, I5} is not a member of L2, so it is not frequent, and {I2, I3, I5} would violate the Apriori property. Thus we have to remove {I2, I3, I5} from C3.

• Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking every member of the join result for pruning.

• Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.

Page 62: Huffman Codes and Association Rules (II)

Step 4: Generating 4-itemset Frequent Pattern

• The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.

• Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori algorithm.

• What’s next? These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).
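Steps 1 to 4 can be put together in a compact level-wise sketch. It is a simplified illustration under stated assumptions, not the optimized algorithm from the paper. The function name `apriori` is hypothetical, candidates are joined by set union rather than by shared prefix, and every level rescans the whole transaction list.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset (illustrative)."""
    def frequent_among(candidates):
        # Scan the database once per level and keep candidates meeting min_count.
        counts = {c: sum(1 for t in transactions if c.issubset(t)) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = set().union(*transactions)
    level = frequent_among({frozenset([i]) for i in items})  # L1
    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        prev = set(level)
        # Join step, then prune step using the Apriori property.
        joined = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent_among(candidates)
    return frequent


D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
frequent = apriori(D, min_count=2)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

On this database the sketch finds five 1-itemsets, six 2-itemsets and two 3-itemsets, and stops when the 4-itemset candidate set is empty.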

Page 63: Huffman Codes and Association Rules (II)

Step 5: Generating Association Rules from Frequent Itemsets

• Procedure:

– For each frequent itemset l, generate all nonempty proper subsets of l.

– For every such subset s of l, output the rule “s => (l - s)” if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.

• Back to the example: we had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.

– Let's take l = {I1,I2,I5}.

– Its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.

Page 64: Huffman Codes and Association Rules (II)

Step 5: Generating Association Rules from Frequent Itemsets [Cont.]

• Let the minimum confidence threshold be, say, 70%.

• The resulting association rules are shown below, each listed with its confidence.

– R1: I1 ^ I2 => I5

  • Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%

  • R1 is rejected.

– R2: I1 ^ I5 => I2

  • Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%

  • R2 is selected.

– R3: I2 ^ I5 => I1

  • Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%

  • R3 is selected.

Page 65: Huffman Codes and Association Rules (II)

Step 5: Generating Association Rules from Frequent Itemsets [Cont.]

– R4: I1 => I2 ^ I5

  • Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%

  • R4 is rejected.

– R5: I2 => I1 ^ I5

  • Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%

  • R5 is rejected.

– R6: I5 => I1 ^ I2

  • Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%

  • R6 is selected.

In this way, we have found three strong association rules.
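The rule-generation procedure for a single frequent itemset can be sketched as below. It works from a table of support counts alone, without rescanning the database. The function name `strong_rules` is an assumption, and the table holds only the counts needed for l = {I1, I2, I5}.

```python
from itertools import combinations

# Support counts taken from the L1, L2 and L3 tables of the example.
support_count = {
    frozenset(["I1"]): 6,
    frozenset(["I2"]): 7,
    frozenset(["I5"]): 2,
    frozenset(["I1", "I2"]): 4,
    frozenset(["I1", "I5"]): 2,
    frozenset(["I2", "I5"]): 2,
    frozenset(["I1", "I2", "I5"]): 2,
}


def strong_rules(itemset, support_count, min_conf):
    """Return (s, l - s, confidence) for each nonempty proper subset s that qualifies."""
    l = frozenset(itemset)
    rules = []
    for size in range(1, len(l)):
        for subset in combinations(sorted(l), size):
            s = frozenset(subset)
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((s, l - s, conf))
    return rules


rules = strong_rules({"I1", "I2", "I5"}, support_count, min_conf=0.7)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), conf)
```

Running the same function over every frequent itemset with at least two items would give the full set of strong rules for the database.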

Page 66: Huffman Codes and Association Rules (II)

[Figure: Example of rule generation from the large itemset ABCDE, showing the rules with minsup and the candidate rules examined by the simple algorithm and by the fast algorithm]