FINDING FUZZY SETS FOR QUANTITATIVE ATTRIBUTES FOR MINING OF FUZZY ASSOCIATE RULES
1
FINDING FUZZY SETS FOR QUANTITATIVE ATTRIBUTES FOR MINING
OF FUZZY ASSOCIATE RULES
By H.N.A. Pham, T.W. Liao, and E. Triantaphyllou
Department of Industrial Engineering
3128 CEBA Building
Louisiana State University
Baton Rouge, LA 70803-6409
Email: [email protected], [email protected], and
2
Outline
• Introduction
• Background
• A fuzzy approach for mining association rules
• Experimental evaluation
• Conclusions
3
Introduction
• Association analysis is a new and attractive research area in data mining
• The Apriori algorithm (R. Agrawal, IBM, 1993) is a key technique for association analysis
• Although the Apriori principle considerably reduces the search space, the technique still requires heavy computation, particularly for large databases
• This research proposes an approach that finds fuzzy sets for the quantitative attributes in a database by using clustering techniques, and then employs techniques for mining fuzzy association rules
4
Outline
• Introduction
• Background
  - Association rules and the Apriori algorithm
  - Necessity to find fuzzy sets for quantitative attributes
• A fuzzy approach for fuzzy mining of association rules
• Experimental evaluation
• Conclusions
5
Association rules: Market basket analysis
• Analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets” (rules take the form X => Y, where X and Y are sets of items)
• I = {I1=beer, I2=cake, I3=onigiri}
• A transactional database:
  TID1: {I1, I2, I3}
  TID2: {I1, I2}
  TID3: {I2, I3}
  TID4: {I2}
  TID5: {I1, I2}
• An association rule: {I1} => {I3}
How often do people buy beer and onigiri together?
6
Rule measures: Support and Confidence
Association rule: X => Y
support s = probability that a transaction contains both X and Y
confidence c = conditional probability that a transaction containing X also contains Y

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

A => C (s=50%, c=66.6%)
C => A (s=50%, c=100%)
[Figure: Venn diagram of customers who buy onigiri, customers who buy beer, and customers who buy both]
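The two measures can be checked directly against the four-transaction table. The following is a minimal sketch; the helper names `support` and `confidence` are illustrative and not taken from the slides.

```python
# The four transactions from the table on this slide
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(x, y):
    """Conditional probability that a transaction containing x also contains y."""
    return support(x | y) / support(x)

s_ac = support({"A", "C"})
c_ac = confidence({"A"}, {"C"})
c_ca = confidence({"C"}, {"A"})
print(f"A => C: s={s_ac:.3f}, c={c_ac:.3f}")
print(f"C => A: s={s_ac:.3f}, c={c_ca:.3f}")
```

Both rules share the same support because support depends only on the union of the two sides; only the confidence changes with the direction of the rule.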
7
Association mining: the Apriori algorithm
It is composed of two steps:
1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence
(Agrawal, 1993)
8
Association mining: the Apriori principle
Min. support 50%, min. confidence 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A => C:
support = support({A and C}) = 50%
confidence = support({A and C})/support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent (if an itemset is not frequent, neither are its supersets)
9
The Apriori algorithm: Finding frequent itemsets using candidate generation
1. Find the frequent itemsets: the sets of items whose support is at least the minimum support
   • A subset of a frequent itemset must also be a frequent itemset, i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets
   • Iteratively find the frequent itemsets Lk with cardinality from 1 to k (k-itemsets) from the candidate itemsets Ck (Lk ⊆ Ck):
     C1 → … → Li-1 → Ci → Li → Ci+1 → … → Lk
2. Use the frequent itemsets to generate association rules.
10
Example (min_sup_count = 2)
Transactional data D:
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Scan D for the count of each candidate to obtain C1:
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Compare each candidate support count with the minimum support count to obtain L1:
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2
11
Example (min_sup_count = 2)
Generate candidates C2 from L1 by using the Apriori principle, then scan D for the count of each candidate:
C2
Itemset    Sup. count
{I1, I2}   4
{I1, I3}   4
{I1, I4}   1
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
{I3, I4}   0
{I3, I5}   1
{I4, I5}   0

Compare each candidate support count with the minimum support count:
L2
Itemset    Sup. count
{I1, I2}   4
{I1, I3}   4
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2

Generate candidates C3 from L2 by using the Apriori principle, then scan D for the count of each candidate:
C3
Itemset        Sup. count
{I1, I2, I3}   2
{I1, I2, I5}   2

Compare each candidate support count with the minimum support count:
L3
Itemset        Sup. count
{I1, I2, I3}   2
{I1, I2, I5}   2
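The level-wise search in this example can be sketched in a few lines. This is an illustrative implementation under simple assumptions (in-memory transactions, a brute-force join step); the function name `apriori` and its structure are not taken from the slides.

```python
from itertools import combinations

def apriori(transactions, min_sup_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the data: count the transactions containing each candidate
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # C1 -> L1
    items = {item for t in transactions for item in t}
    singles = count({frozenset([i]) for i in items})
    current = {c: n for c, n in singles.items() if n >= min_sup_count}
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        counted = count(candidates)
        current = {c: n for c, n in counted.items() if n >= min_sup_count}
        frequent.update(current)
        k += 1
    return frequent

D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
frequent = apriori(D, min_sup_count=2)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

The prune step is where the Apriori principle saves work: a candidate such as {I1, I3, I5} is discarded without scanning the data because its subset {I3, I5} is not in L2.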
12
Necessity to find fuzzy sets for quantitative attributes
Transaction ID Age Married NumCars
100 33 Yes 2
200 39 Yes 2
300 35 No 1
400 20 No 0
A quantitative association rule with min_sup = min_conf = 50%:
(Age = 33 or 39) and (Married = Yes) -> (NumCars = 2)
A quantitative association rule with min_sup = min_conf = 50%:
(Age = 33..39) and (Married = Yes) -> (NumCars = 2)
A fuzzy association rule with min_sup = min_conf = 50%:
(Age = middle-aged) and (Married = Yes) -> (NumCars = 2)
13
Solution: Sharp boundary intervals
It is composed of two steps:
1. Partition the attribute domains into small intervals and combine adjacent intervals into larger ones so that the combined intervals have enough support
2. Replace each original attribute by its attribute-interval pairs, so that the quantitative problem is transformed into a Boolean one
(Srikant and Agrawal, 1996)
14
Example: Sharp boundary intervals
Transaction ID   Age   Married   NumCars
100              33    Yes       2
200              39    Yes       2
300              35    No        1
400              20    No        0

After mapping to attribute-interval pairs:
Transaction ID   Age: 18-30   Age: 31-39   Married   NumCars: 0-1   NumCars: 2-3
100              No           Yes          Yes       No             Yes
200              No           Yes          Yes       No             Yes
300              No           Yes          No        Yes            No
400              Yes          No           No        Yes            No

• Algorithms either ignore or over-emphasize the elements near the interval boundaries during mining
• Sharp boundary intervals are also not intuitive with respect to human perception
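The mapping from quantitative values to Boolean attribute-interval columns can be sketched as follows. The interval boundaries are taken from the column headers of the example; the names `intervals` and `to_boolean` are illustrative only.

```python
records = [
    {"tid": 100, "Age": 33, "Married": "Yes", "NumCars": 2},
    {"tid": 200, "Age": 39, "Married": "Yes", "NumCars": 2},
    {"tid": 300, "Age": 35, "Married": "No", "NumCars": 1},
    {"tid": 400, "Age": 20, "Married": "No", "NumCars": 0},
]

# Interval boundaries as they appear in the example's column headers
intervals = {"Age": [(18, 30), (31, 39)], "NumCars": [(0, 1), (2, 3)]}

def to_boolean(record):
    """Replace each quantitative attribute by one Boolean column per interval."""
    row = {"tid": record["tid"], "Married": record["Married"] == "Yes"}
    for attr, ranges in intervals.items():
        for lo, hi in ranges:
            row[f"{attr}: {lo}-{hi}"] = lo <= record[attr] <= hi
    return row

boolean_table = [to_boolean(r) for r in records]
for row in boolean_table:
    print(row)
```

With these boundaries an age of 30 and an age of 31 fall into different columns although they differ by a single year, which is the boundary effect the bullets describe.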
15
Solution: Experts
• A user or expert must supply the algorithm with the required fuzzy sets of the quantitative attributes and their corresponding membership functions
• Fuzzy sets and membership functions provided by experts may not be suitable for mining fuzzy association rules in the database
16
Solution: Fuzzy sets for quantitative attributes
It is composed of three steps:
Step 1: Transform the original database into one with positive integers
Step 2: For each attribute i
    Cluster the values of the i-th attribute into k medoids
    Classify the i-th attribute into k fuzzy sets
    Generate membership functions for each fuzzy set
End for
Step 3: Transform the database based on the fuzzy sets
(Ada, 1998)
• Loses the association between attributes in the mining approach
17
Outline
• Introduction
• Background
• A fuzzy approach for fuzzy mining of association rules
  - Fuzzy approach
  - Fuzzy mining of association rules
• Experimental evaluation
• Conclusions
18
Fuzzy approach
It is composed of five steps:
Step 1: Transform the original database into one with positive integers
Step 2: Cluster the values of the attributes into k medoids
Step 3: Classify the attributes into k fuzzy sets
Step 4: Generate membership functions for each fuzzy set
Step 5: Transform the database based on the fuzzy sets
19
Fuzzy approach: Step 2
Clustering:
• The clustering method treats the search space of a database with n attributes as an n-dimensional space
• Uses the Matlab fuzzy toolbox
• Does not lose the association between attributes in the mining approach
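The slides state that the clustering was done with the Matlab fuzzy toolbox; the following is not that tool but a stand-in sketch of clustering whole records into k medoids in the n-dimensional attribute space. It assumes a plain alternating k-medoids with Euclidean distance and a naive initialisation (the first k records); `k_medoids` is a hypothetical helper name.

```python
import math

def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids: assign each point to its nearest medoid,
    then re-pick each medoid as the member with the smallest total distance."""
    medoids = list(points[:k])  # assumption: naive deterministic initialisation
    for _ in range(max_iter):
        # Assignment step
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, medoids[i]))
            clusters[nearest].append(p)
        # Update step
        new_medoids = []
        for i, members in enumerate(clusters):
            if not members:
                new_medoids.append(medoids[i])  # keep the old medoid for an empty cluster
                continue
            best = min(members, key=lambda c: sum(math.dist(c, q) for q in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids

# (Salary, IQ) records: each record is one point in a 2-dimensional space
records = [(10000, 120), (7000, 100), (30000, 183), (9000, 110),
           (15000, 140), (20000, 165), (5000, 85)]
medoids = k_medoids(records, k=3)
print(medoids)
```

Because each medoid is an actual record, its coordinates give one mid-point per attribute while keeping the attribute values of a record together. In this unscaled sketch the Salary axis dominates the distance, so a practical version would likely rescale the attributes first.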
20
Fuzzy approach: Step 3
Classify:
• Let {m1, m2, …, mk} be the k medoids found in Step 2, where mi = {ai1, ai2, …, ain} is the i-th medoid:
  m1:  a11  …  a1j  …  a1n
  …    …       …       …
  mk:  ak1  …  akj  …  akn
• Let the j-th attribute have the range [minj, maxj], and let {a1j, a2j, …, akj} be the set of mid-points of the j-th attribute. The k fuzzy sets of this attribute then span the ranges [minj, a2j], [a1j, a3j], …, [a(i-1)j, a(i+1)j], …, and [a(k-1)j, maxj]
[Figure: the i-th fuzzy set of attribute j on the axis from minj to maxj, spanning a(i-1)j to a(i+1)j with mid-point aij]
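The range rule above can be sketched as a small helper. The function name `fuzzy_ranges` and the numbers in the usage lines are hypothetical and chosen only to show the overlap pattern.

```python
def fuzzy_ranges(midpoints, lo, hi):
    """Ranges of the k fuzzy sets of one attribute.

    The i-th set runs from the previous mid-point (or the attribute minimum)
    to the next mid-point (or the attribute maximum)."""
    a = sorted(midpoints)
    bounds = [lo] + a + [hi]
    return [(bounds[i], bounds[i + 2]) for i in range(len(a))]

# Hypothetical attribute with domain [0, 100] and three mid-points
ranges = fuzzy_ranges([20, 40, 60], lo=0, hi=100)
print(ranges)
```

Each range overlaps its neighbours, so a value between two mid-points belongs to two fuzzy sets at once, with different membership degrees.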
21
Fuzzy approach: Step 4
Generate membership functions (triangular function):
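\[
f_{kj}(x : \min_j, a_{kj}, \max_j) =
\begin{cases}
1, & \text{if } x = a_{kj} \\
\dfrac{x - \min_j}{a_{kj} - \min_j}, & \text{if } \min_j \le x < a_{kj} \\
\dfrac{\max_j - x}{\max_j - a_{kj}}, & \text{if } a_{kj} < x \le \max_j \\
0, & \text{otherwise}
\end{cases}
\]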
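A triangular membership function can be sketched as below, assuming the standard piecewise-linear form (1 at the mid-point, falling linearly to 0 at both ends of the range). The name `triangular` and its parameter names are illustrative; the usage lines borrow the Low_S range and mid-point from the Salary example in this deck.

```python
def triangular(x, lo, mid, hi):
    """Triangular membership: 1 at `mid`, linear down to 0 at `lo` and `hi`."""
    if x == mid:
        return 1.0
    if lo <= x < mid:
        return (x - lo) / (mid - lo)
    if mid < x <= hi:
        return (hi - x) / (hi - mid)
    return 0.0  # outside the range of the fuzzy set

# Low_S from the Salary example: range 4000 to 10000, mid-point 7000
for salary in (4000, 5500, 7000, 8500, 12000):
    print(salary, triangular(salary, 4000, 7000, 10000))
```

The mid-point does not have to sit in the centre of the range, so the two slopes of the triangle may differ.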
22
Fuzzy approach: Step 5
Transform the database based on fuzzy sets:
• Let Tij be the value of the i-th transaction at the j-th attribute
• Tij is replaced by the label of the l-th fuzzy set of attribute j, where flj(Tij) = max over k of fkj(Tij)
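The labelling rule amounts to an argmax over the membership functions of one attribute. The sketch below assumes a standard triangular membership function and uses the Salary ranges and mid-points listed in this deck's example; `fuzzy_label` is a hypothetical helper, and the sketch makes no attempt to reproduce the membership values shown in the example tables.

```python
def triangular(x, lo, mid, hi):
    """Assumed standard triangular membership: 1 at `mid`, 0 outside [lo, hi]."""
    if x == mid:
        return 1.0
    if lo <= x < mid:
        return (x - lo) / (mid - lo)
    if mid < x <= hi:
        return (hi - x) / (hi - mid)
    return 0.0

# (range start, mid-point, range end) for each Salary fuzzy set in the example
salary_sets = {
    "Low_S": (4000, 7000, 10000),
    "Medium_S": (7000, 15000, 20000),
    "High_S": (15000, 30000, 32000),
}

def fuzzy_label(value, fuzzy_sets):
    """Return (label, membership) of the fuzzy set with the highest membership."""
    memberships = {label: triangular(value, *params)
                   for label, params in fuzzy_sets.items()}
    best = max(memberships, key=memberships.get)
    return best, memberships[best]

for salary in (5000, 7000, 15000, 30000):
    print(salary, fuzzy_label(salary, salary_sets))
```

Keeping the membership degree next to the label matters later, because the fuzzy support of an itemset is built from these degrees rather than from plain counts.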
23
Example of fuzzy approach
ID   Salary   IQ
1    10000    120
2    7000     100
3    30000    183
4    9000     110
5    15000    140
6    20000    165
7    5000     85

Step 2 (fuzzy sets found for each attribute):
Fuzzy label   Range            Mid-point
Low_S         4000 – 10000     7000
Medium_S      7000 – 20000     15000
High_S        15000 – 32000    30000

Fuzzy label   Range        Mid-point
Low_I         50 – 120     100
Medium_I      100 – 165    140
High_I        140 – 200    183

Steps 3, 4, 5 (transformed database):
ID   Salary     IQ
1    Low_S      Low_I
2    Low_S      Low_I
3    High_S     High_I
4    Low_S      Low_I
5    Medium_S   Medium_I
6    Medium_S   Medium_I
7    Low_S      Low_I

ID   Salary’s membership   IQ’s membership
1    0.71                  0.8
2    0.71                  0.83
3    0.37                  0.67
4    0.86                  0.86
5    0.83                  0.74
6    0.56                  0.74
7    0.14                  0.31
24
Fuzzy mining of association rules
(Attilia, 2000)
It is composed of two steps:
1. Find all itemsets whose fuzzy support (FS<X,A>) is above the user-specified minimum support. These itemsets are called frequent itemsets.
2. Use the frequent itemsets to generate the desired rules. Let X and Y be frequent itemsets. The rule X => Y holds if its fuzzy confidence FC<<X,A>,<Y,B>> is larger than the user-specified minimum confidence.
25
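Fuzzy mining of association rules - cont
\[
FS_{\langle X,A \rangle} = \frac{\sum_{t_i \in D} \prod_{x_j \in X} m_{a_j \in A}(t_i.x_j)}{|D|}
\]
\[
FC_{\langle\langle X,A \rangle, \langle Y,B \rangle\rangle} = \frac{\sum_{t_i \in D} \prod_{z_j \in Z} m_{c_j \in C}(t_i.z_j)}{\sum_{t_i \in D} \prod_{x_j \in X} m_{a_j \in A}(t_i.x_j)}
\]
• D = {t1, t2, …, tn}: the transactions
• <X,A>: X is a set of attributes and A is the set of corresponding fuzzy sets for X
• Z = X ∪ Y, C = A ∪ B
• m(ti.xj) is the membership degree, in fuzzy set aj, of the value of attribute xj in transaction ti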
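Fuzzy support and fuzzy confidence can be sketched as a sum of products of membership degrees. The sketch assumes each transaction is stored as a mapping from (attribute, fuzzy set) pairs to membership degrees and applies no membership threshold; the data and the names `fuzzy_support` and `fuzzy_confidence` are hypothetical.

```python
import math

def fuzzy_support(db, itemset):
    """Sum over transactions of the product of membership degrees, divided by |D|."""
    total = sum(math.prod(t.get(item, 0.0) for item in itemset) for t in db)
    return total / len(db)

def fuzzy_confidence(db, x, y):
    """FS of the combined itemset Z = X ∪ Y divided by FS of the antecedent X."""
    return fuzzy_support(db, x | y) / fuzzy_support(db, x)

# Hypothetical transformed database: (attribute, fuzzy set) -> membership degree
db = [
    {("Salary", "Low"): 0.5, ("IQ", "Low"): 0.8},
    {("Salary", "Low"): 1.0, ("IQ", "Low"): 0.5},
    {("Salary", "High"): 0.9, ("IQ", "High"): 0.7},
    {("Salary", "Low"): 0.0, ("IQ", "High"): 1.0},
]
X = {("Salary", "Low")}
Y = {("IQ", "Low")}
print(fuzzy_support(db, X), fuzzy_support(db, X | Y), fuzzy_confidence(db, X, Y))
```

The |D| terms cancel in the confidence, which is why the formula for FC contains only the two sums. A rule whose antecedent has zero fuzzy support would divide by zero here, so a fuller version would filter by minimum support first.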
26
Outline
• Introduction
• Background
• A fuzzy approach for fuzzy mining of association rules
• Experimental evaluation
• Conclusions
27
Experiments: Synthetic datasets
• Using synthetic datasets of varying sizes:
Name        |D|    |T|   Size (MB)
D100k.T10   100K   10    3
D100k.T20   100K   20    6
D320k.T30   320K   30    18
|D| = number of transactions
|T| = average number of items per transaction
28
Experiment environment
• Software
  - Database: Microsoft Access 2003
  - Language: C++, Visual Basic, and Matlab
  - Platform: Windows
• Hardware
  - PC Pentium IV 2.66 GHz, 1 GB RAM
29
Evaluate meaning of rules
From the Salary and IQ database, the approach yields the following rules with minimum support = 43% and minimum confidence = 50%:
Rule 1: If the 1st variable is low, approximately 7000 [4000, 10000], then the 2nd variable is low, approximately 100 [50, 120]
Rule 2: If the 1st variable is medium, approximately 15000 [7000, 20000], then the 2nd variable is medium, approximately 140 [100, 165]

Frequent itemsets at minimum support = 43%:
The Apriori algorithm:
  No frequent itemsets
Mining quantitative algorithm with fuzzy approach:
  Frequent itemset 1: 1st variable is low, approximately 7000 [4000, 10000]; 2nd variable is low, approximately 100 [50, 120]
  Frequent itemset 2: 1st variable is medium, approximately 15000 [7000, 20000]; 2nd variable is medium, approximately 140 [100, 165]
30
Evaluate meaning of rules - cont
Frequent itemsets at minimum support = 15%:
The Apriori algorithm:
  Frequent itemset 1: 1st variable is 5000, 2nd variable is 85
  Frequent itemset 2: 1st variable is 7000, 2nd variable is 100
  Frequent itemset 3: 1st variable is 9000, 2nd variable is 110
  Frequent itemset 4: 1st variable is 10000, 2nd variable is 120
  Frequent itemset 5: 1st variable is 15000, 2nd variable is 140
  Frequent itemset 6: 1st variable is 20000, 2nd variable is 165
  Frequent itemset 7: 1st variable is 30000, 2nd variable is 183
Mining quantitative algorithm:
  Frequent itemset 1: 1st variable is low, approximately 7000 [4000, 10000]; 2nd variable is low, approximately 100 [50, 120]
  Frequent itemset 2: 1st variable is high, approximately 30000 [15000, 32000]; 2nd variable is high, approximately 183 [140, 200]
  Frequent itemset 3: 1st variable is medium, approximately 15000 [7000, 20000]; 2nd variable is medium, approximately 140 [100, 165]
31
Evaluate fuzziness
Membership values under Ada’s approach and the new approach:
ID   Ada: Salary   Ada: IQ   New approach: Salary   New approach: IQ
1    0.71          0.8       0.74                   0.85
2    0.71          0.83      0.91                   0.93
3    0.37          0.67      0.57                   0.67
4    0.86          0.86      0.9                    0.9
5    0.83          0.74      0.83                   0.84
6    0.56          0.74      0.66                   0.84
7    0.14          0.31      0.34                   0.51

Using Yager’s fuzziness with p = 1:
\[
f_p(\tilde{A}) = 1 - \frac{D_p(\tilde{A}, \neg\tilde{A})}{|Supp(\tilde{A})|^{1/p}},
\qquad
D_p(\tilde{A}, \neg\tilde{A}) = \left[ \sum_{i=1}^{n} \left| \mu_{\tilde{A}}(X_i) - \mu_{\neg\tilde{A}}(X_i) \right|^p \right]^{1/p}
\]
• Ada_fuzziness_Salary ≈ 0.357 ≤ NewApproach_fuzziness_Salary ≈ 0.425
• Ada_fuzziness_IQ ≈ 0.51 ≤ NewApproach_fuzziness_IQ ≈ 0.59
The new approach is fuzzier than Ada’s
32
Evaluate fuzziness - cont
Frequent itemsets at minimum support = 15%:
Ada’s approach:
  Frequent itemset 1: 1st variable is low, approximately 5000 [4000, 10000]; 2nd variable is low, approximately 85 [50, 120]
  Frequent itemset 2: 1st variable is high, approximately 20000 [15000, 32000]; 2nd variable is high, approximately 165 [140, 200]
  Frequent itemset 3: 1st variable is medium, approximately 10000 [7000, 20000]; 2nd variable is medium, approximately 120 [100, 165]
New approach:
  Frequent itemset 1: 1st variable is low, approximately 7000 [4000, 10000]; 2nd variable is low, approximately 100 [50, 120]
  Frequent itemset 2: 1st variable is high, approximately 30000 [15000, 32000]; 2nd variable is high, approximately 183 [140, 200]
  Frequent itemset 3: 1st variable is medium, approximately 15000 [7000, 20000]; 2nd variable is medium, approximately 140 [100, 165]
In Ada’s approach, the mid-points of the ranges are shifted away from the centre values, which changes the meaning of the frequent itemsets.
33
Execution time (sec.) with different minimum support thresholds
             Min_sup = 35%        Min_sup = 40%        Min_sup = 50%
Name         Apriori   Fuzzy*     Apriori   Fuzzy*     Apriori   Fuzzy*
D100k.T30    80860     42558      4158      1980       485       244
D100k.T20    155440    77720      30005     15792      27012     13506
D320k.T30    329532    147673     69011     28425      52322     20259
*: does not include the transfer time

Name         Time to transfer the database into fuzzy sets
D100k.T30    95
D100k.T20    5062
D320k.T30    9112
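Since the Fuzzy columns exclude the transfer time, a fair comparison adds it back. The sketch below does that arithmetic on the figures from the two tables, assuming the transfer times are in seconds like the mining times; the variable and function names are illustrative.

```python
# dataset -> {min_sup (%): (Apriori seconds, Fuzzy mining seconds)}
mining = {
    "D100k.T30": {35: (80860, 42558), 40: (4158, 1980), 50: (485, 244)},
    "D100k.T20": {35: (155440, 77720), 40: (30005, 15792), 50: (27012, 13506)},
    "D320k.T30": {35: (329532, 147673), 40: (69011, 28425), 50: (52322, 20259)},
}
# dataset -> time to transfer the database into fuzzy sets (assumed seconds)
transfer = {"D100k.T30": 95, "D100k.T20": 5062, "D320k.T30": 9112}

def fuzzy_total(name, min_sup):
    """Fuzzy mining time plus the one-off transfer time for that dataset."""
    return mining[name][min_sup][1] + transfer[name]

for name, runs in mining.items():
    for min_sup, (apriori_sec, _) in runs.items():
        total = fuzzy_total(name, min_sup)
        print(f"{name} @ {min_sup}%: Apriori {apriori_sec}, "
              f"Fuzzy total {total}, ratio {apriori_sec / total:.2f}")
```

On these figures the fuzzy total stays below the Apriori time in all nine dataset and threshold combinations.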
34
Execution time (sec.) with different minimum support thresholds - cont
[Figure: three bar charts comparing the execution time of Apriori and Fuzzy on datasets 1, 2, and 3, one chart each for Min_sup = 35%, 40%, and 50%]
• The execution time (transfer + mining time) of the fuzzy method is better than that of Apriori
• Moreover, the meaning of the rules is more “understandable”
35
Conclusions
• Proposed an approach to find fuzzy sets for quantitative attributes for mining association rules
• An experimental evaluation shows that, when mining association rules, the fuzzy approach yields rules with better meaning and a better execution time than the other algorithms
• Future work:
  - Improve the fuzzy mining approach
  - Develop incremental algorithms for association analysis using Support Vector Machines
36
THANK YOU