Lecture 8-9 Association Rule Mining.ppt
muhammad-usman -
7/28/2019 Lecture 8-9 Association Rule Mining.ppt
Data Mining
Association Rules Mining
Frequent Itemset Mining
Support and Confidence
Apriori Approach
-
Initial Definition of Association Rules (ARs) Mining
Association rules define relationships of the form:
A -> B
Read as "A implies B", where A and B are sets of binary valued attributes represented in a data set.
Association Rule Mining (ARM) is then the process of finding all the ARs in a given DB.
-
Association Rule: Basic Concepts
Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
Find: all rules that correlate the presence of one set of items with that of another set of items
E.g., 98% of students who study Databases and C++ also study Algorithms
Applications
Home Electronics -> * (What other products should the store stock up on?)
Attached mailing in direct marketing
Web page navigation in search engines (first page a -> page b)
Text mining (if IT companies -> Microsoft)
-
Some Notation
D = A data set comprising n records and m binary valued attributes.
I = The set of m attributes, {i1, i2, ..., im}, represented in D.
Itemset = Some subset of I. Each record in D is an itemset.
-
Example DB
I = {a,b,c,d,e},
D = {{a,b,c},{a,b,d},{a,b,e},{a,c,d},{a,c,e},{a,d,e},{b,c,d},{b,c,e},{b,d,e},{c,d,e}}
TID Atts
1 a b c
2 a b d
3 a b e
4 a c d
5 a c e
6 a d e
7 b c d
8 b c e
9 b d e
10 c d e
Given attributes which are not binary valued (i.e. either nominal or ranged), the attributes can be discretised so that they are represented by a number of binary valued attributes.
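Discretisation can be sketched as follows. This is only an illustration: the record, the attribute names (colour, age) and the bin edges are hypothetical and do not come from the lecture.

```python
def binarise(record, nominal_values, age_bins):
    """Turn one record into a set of binary attributes (an itemset).

    The attribute names "colour" and "age" are hypothetical examples.
    """
    items = set()
    # Nominal attribute: one binary attribute per possible value.
    for value in nominal_values:
        if record["colour"] == value:
            items.add(f"colour={value}")
    # Ranged attribute: one binary attribute per interval [low, high].
    for low, high in age_bins:
        if low <= record["age"] <= high:
            items.add(f"age={low}..{high}")
    return items


record = {"colour": "red", "age": 34}  # hypothetical record
itemset = binarise(record, ["red", "green", "blue"],
                   [(20, 29), (30, 39), (40, 49)])
print(sorted(itemset))  # ['age=30..39', 'colour=red']
```

Each record then becomes an itemset over the new binary attributes, so it can be mined like the example DB.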
-
In-depth Definition of ARs Mining
Association rules define relationships of the form:
A -> B
Read as "A implies B", such that A ⊂ I, B ⊂ I, A ∩ B = ∅ (A and B are disjoint) and A ∪ B ⊆ I.
In other words an AR is made up of an itemset of cardinality 2 or more.
-
ARM Problem Definition (1)
Given a database D we wish to find (mine) all the itemsets of cardinality 2 or more contained in D, and then use these itemsets to create association rules of the form A -> B.
The number of potential itemsets of cardinality 2 or more is:
2^m - m - 1
If m=5, #potential itemsets = 26
If m=20, #potential itemsets = 1048555
So now we do not want to find all the itemsets of cardinality 2 or more contained in D; we only want to find the interesting itemsets of cardinality 2 or more contained in D.
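The count can be checked with a short sketch. The function name potential_itemsets is illustrative. The formula takes all 2^m subsets of I and discards the empty set and the m single-attribute sets.

```python
from math import comb


def potential_itemsets(m):
    # All subsets (2^m) minus the empty set (1) and the m singletons.
    return 2**m - m - 1


for m in (5, 20):
    # Cross-check: sum the number of k-subsets for k = 2..m.
    assert potential_itemsets(m) == sum(comb(m, k) for k in range(2, m + 1))
    print(m, potential_itemsets(m))
# prints:
# 5 26
# 20 1048555
```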
-
Association Rules Measurement
The most commonly used interestingness measures are:
1. Support
2. Confidence
-
Itemset Support
Support: A measure of the frequency with which an itemset occurs in a DB.
supp(A) = (# records that contain A) / n
where n is the number of records in D.
If an itemset has support higher than some specified threshold we say that the itemset is supported or frequent (some authors use the term large).
The support threshold is normally set reasonably low (say, 1%).
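A minimal sketch of the support measure, applied to the ten-record example DB from the earlier slide. The function name supp follows the slide. Dividing by the number of records is an assumption here, made so that support can be read as a fraction and compared with a percentage threshold.

```python
# Example DB: ten records over I = {a, b, c, d, e}.
D = [set(r) for r in ("abc", "abd", "abe", "acd", "ace",
                      "ade", "bcd", "bce", "bde", "cde")]


def supp(itemset, db):
    """Fraction of records in db that contain every item of itemset."""
    itemset = set(itemset)
    count = sum(1 for record in db if itemset <= record)
    return count / len(db)


print(supp("a", D))    # 0.6
print(supp("ab", D))   # 0.3
print(supp("abc", D))  # 0.1
```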
-
Confidence
Confidence: A measure, expressed as a ratio, of the support for an AR compared to the support of its antecedent.
conf(A -> B) = supp(A ∪ B) / supp(A)
We say that we are confident in a rule if its confidence exceeds some threshold (normally set reasonably high, say, 80%).
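Confidence can be sketched the same way on the example DB. The helper names count and conf are illustrative. Raw record counts are used instead of fractions, which gives the same ratio because the number of records cancels.

```python
# Example DB: ten records over I = {a, b, c, d, e}.
D = [set(r) for r in ("abc", "abd", "abe", "acd", "ace",
                      "ade", "bcd", "bce", "bde", "cde")]


def count(itemset, db):
    """Number of records in db that contain every item of itemset."""
    itemset = set(itemset)
    return sum(1 for record in db if itemset <= record)


def conf(antecedent, consequent, db):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, db) / count(antecedent, db)


print(conf("a", "b", D))  # 3/6 = 0.5
print(conf("b", "a", D))  # 3/6 = 0.5
```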
-
Rule Measures: Support and Confidence
Find all the rules X & Y -> Z with minimum confidence and support
support, s: probability that a transaction contains {X, Y, Z}
confidence, c: conditional probability that a transaction having {X, Y} also contains Z
Transaction ID Items Bought
2000 A,B,C
1000 A,C
4000 A,D
5000 B,E,F
Let minimum support be 50% and minimum confidence be 50%; we have
A -> C (50%, 66.6%)
C -> A (50%, 100%)
[Figure: Venn diagram of customers who buy Bread, customers who buy Butter, and customers who buy both]
-
ARM Problem Definition (2)
Given a database D we wish to find all the frequent itemsets (F) and then use this knowledge to produce high confidence association rules.
Note: Finding F is the most computationally expensive part; once we have the frequent sets, generating ARs is straightforward.
-
BRUTE FORCE
List all possible combinations in an array.
For each record:
1. Find all combinations.
2. For each combination index into array and increment support by 1.
Then generate rules
a 6
b 6
ab 3
c 6
ac 3
bc 3
abc 1
d 6
ad 3
bd 3
abd 1
cd 3
acd 1
bcd 1
abcd 0
e 6
ae 3
be 3
abe 1
ce 3
ace 1
bce 1
abce 0
de 3
ade 1
bde 1
abde 0
cde 1
acde 0
bcde 0
abcde 0
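The brute-force procedure can be sketched as below, run on the ten-record example DB. A dict keyed by the itemset string stands in for the slide's array; that choice and the variable names are assumptions made for illustration.

```python
from itertools import combinations

I = "abcde"
D = ["abc", "abd", "abe", "acd", "ace", "ade", "bcd", "bce", "bde", "cde"]

# List all possible (non-empty) combinations of I, each with support 0.
support = {"".join(c): 0
           for k in range(1, len(I) + 1)
           for c in combinations(I, k)}

for record in D:
    # 1. Find all combinations of the items in this record.
    for k in range(1, len(record) + 1):
        for combo in combinations(record, k):
            # 2. Index into the array and increment support by 1.
            support["".join(combo)] += 1

print(len(support))                                   # 31 combinations
print(support["a"], support["ad"], support["abc"], support["abcd"])  # 6 3 1 0
```

The array has 2^m - 1 entries whatever the data looks like, so this only scales to small numbers of attributes.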
-
Support threshold = 5%
(count of 1.55)
Frequent Sets (F):
ab(3) ac(3) bc(3)
ad(3) bd(3) cd(3)
ae(3) be(3) ce(3)
de(3)
Rules:
a -> b conf = 3/6 = 50%
b -> a conf = 3/6 = 50%
Etc.
-
Advantages:
1) Very efficient for data sets with small numbers of attributes (
-
Association Rule Mining: A Road Map
Boolean vs. quantitative associations (based on the types of values handled)
buys(x, SQLServer) ^ buys(x, DMBook) -> buys(x, DBMiner) [0.2%, 60%]
age(x, 30..39) ^ income(x, 42..48K) -> buys(x, PC) [1%, 75%]
-
Mining Association Rules: An Example
Transaction ID Items Bought
2000 A,B,C
1000 A,C
4000 A,D
5000 B,E,F
Min. support 50%
Min. confidence 50%
Frequent Itemset Support
{A} 75%
{B} 50%
{C} 50%
{A,C} 50%
For rule A -> C:
support = support({A,C}) = 50%
confidence = support({A,C})/support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
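Once a frequent itemset is known, rule generation can be sketched as follows: try every split into antecedent and consequent, and keep the rules that meet both thresholds. The function name rules_from and the tuple layout are illustrative; the transactions and thresholds are the ones on this slide.

```python
from itertools import combinations

transactions = {2000: {"A", "B", "C"}, 1000: {"A", "C"},
                4000: {"A", "D"}, 5000: {"B", "E", "F"}}
MIN_SUPP, MIN_CONF = 0.5, 0.5


def supp(itemset):
    """Fraction of transactions that contain every item of itemset."""
    hits = sum(1 for t in transactions.values() if itemset <= t)
    return hits / len(transactions)


def rules_from(itemset):
    """Yield (antecedent, consequent, support, confidence) for each
    split of itemset that meets both thresholds."""
    itemset = frozenset(itemset)
    s = supp(itemset)
    for k in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), k):
            a = frozenset(antecedent)
            c = s / supp(a)
            if s >= MIN_SUPP and c >= MIN_CONF:
                yield a, itemset - a, s, c


rules = list(rules_from({"A", "C"}))
for a, b, s, c in rules:
    print(sorted(a), "->", sorted(b), f"support={s:.0%} confidence={c:.1%}")
# ['A'] -> ['C'] support=50% confidence=66.7%
# ['C'] -> ['A'] support=50% confidence=100.0%
```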
-
Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have minimum support
A subset of a frequent itemset must also be a frequent itemset
i.e., if {A,B} is a frequent itemset, both {A} and {B} must be frequent itemsets
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
Use the frequent itemsets to generate association rules.
-
The Apriori Algorithm: Example
Database D
TID Items
100 1 3 4
200 2 3 5
300 1 2 3 5
400 2 5
Scan D -> C1
itemset sup.
{1} 2
{2} 3
{3} 3
{4} 1
{5} 3
L1
itemset sup.
{1} 2
{2} 3
{3} 3
{5} 3
C2 (generated from L1)
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}
Scan D -> C2 (with counts)
itemset sup
{1 2} 1
{1 3} 2
{1 5} 1
{2 3} 2
{2 5} 3
{3 5} 2
L2
itemset sup
{1 3} 2
{2 3} 2
{2 5} 3
{3 5} 2
C3 (generated from L2)
itemset
{2 3 5}
Scan D -> L3
itemset sup
{2 3 5} 2
-
The Apriori Algorithm
Pseudo-code:
Ck: Candidate itemset of size k
Lk: frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
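The pseudo-code can be sketched in Python as below. This is one possible reading, not a definitive implementation. Candidates are formed by taking unions of pairs of frequent k-itemsets (a simplification of the self-join), and a candidate is kept only if all of its k-subsets are frequent. The minimum support count of 2 is an assumption inferred from the four-transaction example, where {4} with count 1 is dropped.

```python
from itertools import combinations


def apriori(db, min_count):
    """Return {frozenset: count} for all itemsets with count >= min_count."""
    db = [frozenset(t) for t in db]
    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        # C(k+1): unions of two frequent k-itemsets that have size k+1
        # and whose k-subsets are all frequent (Apriori pruning).
        candidates = set()
        for a, b in combinations(level, 2):
            u = a | b
            if len(u) == k + 1 and all(
                    frozenset(s) in level for s in combinations(u, k)):
                candidates.add(u)
        # Scan the database to count each candidate.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        # L(k+1): candidates with minimum support.
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent


D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
F = apriori(D, min_count=2)
for itemset in sorted(F, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), F[itemset])
```

On this database the largest frequent itemset found is {2, 3, 5} with count 2.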
-
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of Candidate-generation
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
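The two candidate-generation steps can be sketched as follows. It assumes itemsets are written as strings with items in sorted order, as on the slide; the function name generate_candidates is illustrative.

```python
from itertools import combinations


def generate_candidates(Lk):
    """Lk: frequent k-itemsets as sorted strings, e.g. 'abc'.

    Returns (joined, pruned): candidates after the self-join, and the
    candidates that survive pruning.
    """
    Lk = sorted(Lk)
    if not Lk:
        return [], []
    frequent = set(Lk)
    k = len(Lk[0])
    # Step 1: self-join. Join two itemsets that share their first k-1 items.
    joined = [p + q[-1] for p, q in combinations(Lk, 2) if p[:-1] == q[:-1]]
    # Step 2: prune. Keep a candidate only if every k-subset is in Lk.
    pruned = [c for c in joined
              if all("".join(s) in frequent for s in combinations(c, k))]
    return joined, pruned


joined, C4 = generate_candidates(["abc", "abd", "acd", "ace", "bcd"])
print(joined)  # ['abcd', 'acde']
print(C4)      # ['abcd']
```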