1
A distributed method for mining association rules
Pham Nguyen Anh Huy*
Department of Information Technology
Vietnam National University of HoChiMinh city
presented by Ho Tu Bao
School of Knowledge Science
Japan Advanced Institute of Science and Technology
(*work done during the author's three-month JSPS fellowship at JAIST)
2
Outline
Introduction
Background
A distributed Apriori algorithm using mobile agents
Experimental evaluation
Conclusion
3
Introduction
Association analysis is a new and attractive research area in data mining.
The Apriori algorithm (R. Agrawal, IBM 1993) is a key technique for association analysis.
Although the Apriori principle considerably reduces the search space, the technique still requires a large amount of computation, particularly for large databases.
This research proposes a distributed version of the Apriori algorithm using mobile agents. The experiments show that computation time can be reduced by running on several computers in a distributed computing environment.
4
Outline
Introduction
Background
    Association rules and Apriori algorithm
    Mobile agents and Aglets
A distributed Apriori algorithm using mobile agents
Experimental evaluation
Conclusion
5
Association rules: Market basket analysis
Analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets” (in the form X → Y, where X and Y are sets of items)
I = {I1=beer, I2=cake, I3=onigiri}
Transactional database
TID1: {I1, I2, I3}
TID2: {I1, I2}
TID3: {I2, I3}
TID4: {I2}
TID5: {I1, I2}
An association rule: {I1} → {I3}
How often do people buy onigiri and beer together?
6
Rule measures: Support and Confidence
Association rule X → Y
support s = probability that a transaction contains both X and Y
confidence c = conditional probability that a transaction containing X also contains Y
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
A → C (s=50%, c=66.6%)
C → A (s=50%, c=100%)
[Figure: Venn diagram of customers who buy onigiri, customers who buy beer, and customers who buy both]
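The two measures can be checked against the four-transaction table with a minimal Python sketch. The names `transactions`, `support`, and `confidence` are illustrative and do not come from the slides.

```python
# Transactions from the slide's table (transaction ID -> items bought).
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(x, y):
    """Estimate of P(Y | X), computed as support(X union Y) / support(X)."""
    return support(set(x) | set(y)) / support(x)


print(f"A -> C: s={support({'A', 'C'}):.1%}, c={confidence({'A'}, {'C'}):.1%}")
print(f"C -> A: s={support({'A', 'C'}):.1%}, c={confidence({'C'}, {'A'}):.1%}")
```

Both rules share the same support because support depends only on the combined itemset {A, C}; the confidences differ because the denominators support({A}) and support({C}) differ.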
7
Association mining: Apriori algorithm
It is composed of two steps:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence
(Agrawal, R., 1993)
8
Association mining: Apriori principle
For rule A → C:
support = support({A and C}) = 50%
confidence = support({A and C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent (if an itemset is not frequent, its supersets are not frequent either)
Min. support 50%
Min. confidence 50%
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%
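As a rough illustration of the second Apriori step (rule generation), the sketch below derives the strong rules from the frequent-itemset table above with a 50% minimum confidence. The function name `strong_rules` and the tuple layout are assumptions made for this example.

```python
from itertools import combinations

# Frequent itemsets and their supports, copied from the slide's table.
frequent = {
    frozenset({"A"}): 0.75,
    frozenset({"B"}): 0.50,
    frozenset({"C"}): 0.50,
    frozenset({"A", "C"}): 0.50,
}


def strong_rules(frequent, min_conf):
    """Return (lhs, rhs, support, confidence) for every rule meeting min_conf."""
    rules = []
    for itemset, sup in frequent.items():
        # Try every non-empty proper subset of the itemset as the left-hand side.
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # By the Apriori principle the subset is itself frequent,
                # so its support is already in the table.
                conf = sup / frequent[lhs]
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), sup, conf))
    return rules


for lhs, rhs, sup, conf in strong_rules(frequent, min_conf=0.5):
    print(sorted(lhs), "->", sorted(rhs), f"support={sup:.0%} confidence={conf:.1%}")
```

Only {A,C} has more than one item, so it is the only itemset that yields rules: A → C and C → A.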
9
The Apriori algorithm: Finding frequent itemsets using candidate generation
1. Find the frequent itemsets: the sets of items whose support is at least the minimum support
A subset of a frequent itemset must also be a frequent itemset, i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets
Iteratively find the frequent itemsets Lk with cardinality from 1 to k (k-itemsets) from the candidate itemsets Ck (Lk ⊆ Ck)
2. Use the frequent itemsets to generate association rules.
C1 → … → Li-1 → Ci → Li → Ci+1 → … → Lk
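The step from Li-1 to Ci can be sketched as a join followed by a prune. This is a simplified set-union join rather than the prefix-based join of the original Apriori paper, and the input `L2` is a hypothetical example, not data from the slides.

```python
from itertools import combinations


def apriori_gen(prev_frequent):
    """Build candidate k-itemsets C_k from the frequent (k-1)-itemsets L_(k-1)."""
    prev = {frozenset(s) for s in prev_frequent}
    if not prev:
        return set()
    k = len(next(iter(prev))) + 1
    # Join step: union pairs of (k-1)-itemsets that differ in exactly one item.
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune step (Apriori principle): every (k-1)-subset must itself be frequent.
    return {
        c for c in joined
        if all(frozenset(s) in prev for s in combinations(c, k - 1))
    }


# Hypothetical frequent 2-itemsets.
L2 = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"B", "D"}]
C3 = apriori_gen(L2)
print([sorted(c) for c in C3])
```

Here the join produces {A,B,C}, {A,B,D}, and {B,C,D}, but the last two are pruned because {A,D} and {C,D} are not in L2.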
10
Example (min_sup_count = 2)
Transactional data D
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
Scan D for count of each candidate
C1
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2
Compare candidate support count with minimum support count
L1
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2
11
Example (min_sup_count = 2)
Generate candidates C2 from L1 using Apriori principle
C2
Itemset
{I1, I2}
{I1, I3}
{I1, I4}
{I1, I5}
{I2, I3}
{I2, I4}
{I2, I5}
{I3, I4}
{I3, I5}
{I4, I5}
Scan D for count of each candidate
C2
Itemset    Sup. count
{I1, I2}   4
{I1, I3}   4
{I1, I4}   1
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
{I3, I4}   0
{I3, I5}   1
{I4, I5}   0
Compare candidate support count with minimum support count
L2
Itemset    Sup. count
{I1, I2}   4
{I1, I3}   4
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
Generate candidates C3 from L2 using Apriori principle
C3
Itemset
{I1, I2, I3}
{I1, I2, I5}
Scan D for count of each candidate
C3
Itemset        Sup. count
{I1, I2, I3}   2
{I1, I2, I5}   2
Compare candidate support count with minimum support count
L3
Itemset        Sup. count
{I1, I2, I3}   2
{I1, I2, I5}   2
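The whole example can be replayed with a short single-machine Python sketch. It uses the nine transactions and min_sup_count = 2 from the slides; the function names and the simplified set-union join are assumptions of this sketch.

```python
from itertools import combinations

# Transactional data D from the example.
D = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},
    "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},
    "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}
MIN_SUP_COUNT = 2


def scan(candidates):
    """Scan D and count the transactions containing each candidate."""
    return {c: sum(1 for items in D.values() if c <= items) for c in candidates}


def apriori():
    """Return [L1, L2, ...], each a dict mapping itemset -> support count."""
    levels = []
    candidates = {frozenset([item]) for items in D.values() for item in items}
    while candidates:
        counts = scan(candidates)
        frequent = {c: n for c, n in counts.items() if n >= MIN_SUP_COUNT}
        if not frequent:
            break
        levels.append(frequent)
        k = len(levels) + 1
        # Join, then prune candidates that have an infrequent (k-1)-subset.
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
    return levels


levels = apriori()
for k, level in enumerate(levels, start=1):
    print(f"L{k}:", sorted((sorted(c), n) for c, n in level.items()))
```

The loop stops after L3 because the only possible 4-itemset, {I1, I2, I3, I5}, contains the infrequent subset {I1, I3, I5} and is pruned before any scan.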
12
Agents and Mobile agents
An agent is a computational entity that:
acts on behalf of other entities in an autonomous fashion
performs its actions with some level of pro-activity and reactiveness
exhibits some level of the key attributes of co-operation
Mobile network agents are programs that:
can migrate from system to system within a network environment
perform some processing at each host
decide when and where to move next
How does it move?
Save state
Transport saved state to next system
Resume execution of saved state
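The save, transport, and resume cycle can be imitated in a few lines of Python. This is only a toy sketch and not the Aglets API: the class `CountingAgent`, the `hosts` dictionary, and the JSON encoding are all hypothetical, the "transport" stays inside one process, and only data moves (a real mobile agent also carries its code).

```python
import json


class CountingAgent:
    """Hypothetical agent that visits hosts and sums the numbers found there."""

    def __init__(self, itinerary, total=0, visited=None):
        self.itinerary = list(itinerary)
        self.total = total
        self.visited = list(visited or [])

    def save_state(self):
        """Serialize the agent's state to bytes that could be sent over a network."""
        state = {
            "itinerary": self.itinerary,
            "total": self.total,
            "visited": self.visited,
        }
        return json.dumps(state).encode("utf-8")

    @classmethod
    def resume(cls, payload):
        """Rebuild an agent from saved state on the receiving side."""
        return cls(**json.loads(payload.decode("utf-8")))

    def run_on(self, host_name, host_data):
        """Do some processing at the current host."""
        self.total += sum(host_data)
        self.visited.append(host_name)
        self.itinerary.remove(host_name)


# Hypothetical hosts, each holding some local data.
hosts = {"host-a": [1, 2, 3], "host-b": [10, 20]}

agent = CountingAgent(itinerary=["host-a", "host-b"])
while agent.itinerary:
    next_host = agent.itinerary[0]         # the agent decides where to move next
    payload = agent.save_state()           # save state
    agent = CountingAgent.resume(payload)  # transport (simulated) and resume
    agent.run_on(next_host, hosts[next_host])

print(agent.visited, agent.total)
```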
13
Distributed Computing using Mobile Programs
14
Mobile agent tools
No. | Name | Description | Developer | Language | Application
1 | Concordia | Framework for agent development | Mitsubishi E.I.T. | Java | Mobile computing, Data base
2 | Aglet | Java Class libraries | IBM, Tokyo | Java | Internet
3 | Agent Tcl | Transportable agent system | R. Gray, U Dart. | Tcl Tk | Information management
4 | Odyssey | Set of Java Class libraries | General Magic | Telescript | Electronic commerce
5 | OAA | Open Agent Architecture | SRI International, AI | C, C-Lisp, Java, VB | General purpose
6 | Ara | Agent for Remote Action | U Kaiserslautern | C/C++, Tcl, Java | Partially connected c. D.D.B.
7 | Tacoma | Tromsø and Cornell Moving Agents | Norway & Cornell | C, UNIX-based | Client/Server model issues / OS support
8 | Voyager | Platform for distributed applic. | ObjectSpace | Java | Support for agent systems
9 | AgentSpace | Agent building platform | Ichiro Sato, O. U. | Java | General purpose
10 | Mole | First Java-Based MA system | Stuttgart U. Germany | Java, UNIX-based | General purpose
11 | MOA | Mobile Object and Agents | OpenGroup, UK | Java | General purpose
12 | Kali Scheme | Distributed impl. of Scheme | NEC Research I. | Scheme | Distributed data mining, load balancing
13 | The Tube | Mobile code system | David Halls, UK | Scheme | Remote execution of Scheme
14 | Ajanta | Network mobile object | Minnesota U. | Java | General purpose
15 | Knowbots | Research infrastructure of MA | CNRI | Python | Distributed systems / Internet
16 | AgentSpace | Mobile agent framework | Alberto Sylva | Java | Support for dynamic and dist. Appl.
17 | Plangent | Intelligent Agent system | Toshiba Corporation | Java | Intelligent tasks
18 | JATlite | Java Agent framework dev / KQML | Stanford U. | Java | Information retrieval, Interface agent
19 | Kafka | Multiagent libraries for Java | Fujitsu Lab. Japan | Java, UNIX-based | General purpose
20 | Messengers | Autonomous messages | UCI | C (Messenger-C) | General purpose
15
What are Aglets?
Aglets (Agile Applets) are Java objects that can move from one host on the Internet to another and perform arbitrary operations within the security limits.
When an Aglet moves, it takes along its program code as well as its data.
The Aglets framework is implemented by the Aglets Software Development Kit (ASDK) from IBM. It is an environment for programming mobile Internet agents in Java.
16
Aglets at Runtime
Currently aglets use the Agent Transfer Protocol (ATP) as the default implementation of the communication layer (ATP is modeled after HTTP)
ATP is used on the Tahiti aglet server
Use the Aglets Server Interface to write applications capable of hosting, receiving and dispatching aglets
18
A distributed Apriori algorithm
(1) spawn n slave processes; (2) divide the database into partitions; (3) distribute the partitions to the slave processes
Master process
1. send global candidate (k-1)-itemsets Ck-1 to each slave process
4. wait for and receive the local supports, then count global supports for global candidate (k-1)-itemsets Ck-1
5. compute frequent (k-1)-itemsets Lk-1, and send clusters of frequent (k-1)-itemsets Lk-1 to slave processes
8. wait for and receive local candidate k-itemsets from slave processes
9. take the union of the local candidate k-itemsets and prune it to form the global candidate k-itemsets
Slave processes
2. receive the global candidate (k-1)-itemsets Ck-1
3. count local supports for global candidate (k-1)-itemsets Ck-1, and send the local supports to the master process
6. receive frequent (k-1)-itemsets Lk-1 from the master process
7. generate local candidate k-itemsets and send these local candidate k-itemsets to the master process
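The nine numbered steps can be simulated in a single process to see how the pieces fit together. The sketch below is not the authors' Aglets implementation: slaves are plain function calls, the partitioning is round-robin, and the way each slave builds local candidates (joining its segment of Lk-1 with all of Lk-1) is one assumed reading of "clusters of frequent itemsets". All function names are hypothetical.

```python
from itertools import combinations


def count_local_supports(partition, candidates):
    """Slave steps 2-3: count each global candidate in this slave's partition."""
    return {c: sum(1 for t in partition if c <= t) for c in candidates}


def join_segment(segment, frequent, k):
    """Slave steps 6-7 (assumed): join one segment of L(k-1) with all of L(k-1)."""
    return {a | b for a in segment for b in frequent if len(a | b) == k}


def distributed_apriori(db, n_slaves, min_count):
    # Setup: divide the database into one partition per slave (round-robin).
    partitions = [db[i::n_slaves] for i in range(n_slaves)]
    candidates = {frozenset([item]) for t in db for item in t}
    found, k = {}, 2
    while candidates:
        # Master step 1, slave steps 2-3: send candidates, get local supports.
        local = [count_local_supports(p, candidates) for p in partitions]
        # Master step 4: global support = sum of the local supports.
        counts = {c: sum(sup[c] for sup in local) for c in candidates}
        # Master step 5: frequent itemsets, split into one segment per slave.
        frequent = sorted(
            (c for c in candidates if counts[c] >= min_count), key=sorted
        )
        found.update((c, counts[c]) for c in frequent)
        segments = [frequent[i::n_slaves] for i in range(n_slaves)]
        # Slave steps 6-7, master step 8: collect the local candidates.
        received = [join_segment(seg, frequent, k) for seg in segments]
        # Master step 9: union, then prune with the Apriori principle.
        known = set(frequent)
        candidates = {
            c for c in set().union(*received)
            if all(frozenset(s) in known for s in combinations(c, k - 1))
        }
        k += 1
    return found


db = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
result = distributed_apriori(db, n_slaves=3, min_count=2)
print(len(result), "frequent itemsets")
```

Because global supports are plain sums of local supports, the result does not depend on how the database is partitioned; running with one slave gives the same itemsets and counts.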
19
A distributed Apriori algorithm
[Figure: message flow alternating between the master and the slaves (each slave holding one partition DB1, DB2, ..., DBn); example itemset shown: {AB}]
Master: SEND global candidate (k-1)-itemsets Ck-1
Slaves: COUNT and SEND local supports for global candidate (k-1)-itemsets (counting support Aglets)
Master: COUNT global supports for global candidate (k-1)-itemsets Ck-1
Master: FIND and SEND frequent (k-1)-itemsets Lk-1
Slaves: JOIN and SEND local candidate k-itemsets (Aprio_gen Aglet)
Master: UNIONIZE local candidate k-itemsets and PRUNE to form global candidate k-itemsets Ck
20
Global support count & Global candidate itemsets
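If X is a candidate itemset, the global support count of X is the sum of its local support counts over the n slaves:
$$X.GSupp_k = \sum_{i=1}^{n} X.LSupp_k^{i}$$
The set of global candidate k-itemsets GCk is formed by the union of the local candidate k-itemsets:
$$GC_k = \bigcup_{i=1}^{n} LC_k^{i}$$
GLk is formed by Apriori-gen with ID segment (p, q) of GLk-1:
$$GL_k^{i} = \text{Apriori-gen}\big(GL_{k-1}(p, q)\big)$$
$$GL_k = \{\, GC_k \mid GC_k.\text{G-Supp} \ge \text{G-Min-Supp} \,\}$$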
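As a small numeric illustration of the global support count and the G-Min-Supp filter, the sketch below sums hypothetical local counts from three slaves. The counts, the threshold, and the variable names are invented for this example.

```python
from collections import Counter

# Hypothetical local support counts reported by n = 3 slaves.
local_supports = [
    Counter({("A", "B"): 4, ("A", "C"): 1, ("B", "C"): 0}),
    Counter({("A", "B"): 2, ("A", "C"): 3, ("B", "C"): 1}),
    Counter({("A", "B"): 5, ("A", "C"): 2, ("B", "C"): 0}),
]
G_MIN_SUPP = 5  # hypothetical global minimum support count

# Global support of X = sum over slaves of X's local support.
# Counter.update adds counts and, unlike the + operator, keeps zero totals.
global_supports = Counter()
for local in local_supports:
    global_supports.update(local)

# Keep the candidates whose global support reaches the global minimum.
globally_frequent = sorted(
    itemset for itemset, count in global_supports.items() if count >= G_MIN_SUPP
)

print(dict(global_supports))
print(globally_frequent)
```

Note that ("A", "C") is below the threshold on every individual slave yet frequent globally, which is why the master has to aggregate counts before filtering.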
22
Experiments: Synthetic datasets
Using synthetic datasets of varying sizes:
Name         |D|    |T|   Size
D100k.T30    100K   30    3 MB
D100k.T100   100K   100   10 MB
D320k.T150   320K   150   48 MB
|D|: number of transactions
|T|: average number of items per transaction
23
Experiment environment
Software
Database: Oracle server
Language: Java (Sun JDK 1.3)
Mobile agents: Aglets (IBM)
Protocol: ATP (Agent Transfer Protocol)
Platform: Windows
Hardware
PC: Pentium III, 300 MHz, 128 MB RAM
15 machines (at Knowledge Science Center, JAIST)
24
Execution time (sec.) with different minimum support thresholds
min_sup = 35%
Name         1 slave   5 slaves   10 slaves   15 slaves
D100k.T30    80,860    42,558     22,079      15,838
D100k.T100   155,440   77,720     41,080      28,062
D320k.T150   329,532   147,673    76,432      53,318
min_sup = 40%
Name         1 slave   5 slaves   10 slaves   15 slaves
D100k.T30    4,158     1,980      1,149       988
D100k.T100   30,005    15,792     7,978       5,843
D320k.T150   69,011    28,425     18,349      12,854
min_sup = 50%
Name         1 slave   5 slaves   10 slaves   15 slaves
D100k.T30    485       244        218         95
D100k.T100   27,012    13,506     7,047       5,062
D320k.T150   52,322    20,259     11,979      9,112
25
Execution time with min_sup 35%
[Figure: chart of execution time (sec.) against number of slaves (1, 5, 10, 15) for D100k.T30, D100k.T100, and D320k.T150 with min_sup = 35%]
26
Execution time with min_sup 40%
[Figure: chart of execution time (sec.) against number of slaves (1, 5, 10, 15) for D100k.T30, D100k.T100, and D320k.T150 with min_sup = 40%]
27
Execution time with min_sup 50%
[Figure: chart of execution time (sec.) against number of slaves (1, 5, 10, 15) for D100k.T30, D100k.T100, and D320k.T150 with min_sup = 50%]
28
Rate of execution time
Nb of slaves   minsup 35%   minsup 40%   minsup 50%   average
1              1            1            1            1
5              1.9          2.1          2            2
10             3.6          3.6          2.2          3.1
15             5.1          4.2          5.1          4.8
[Figure: chart of the average rate for 1, 5, 10, and 15 slaves]
The rate (execution time with 1 slave divided by execution time with n slaves) grows nearly linearly with the number of slaves
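The rates can be recomputed from the execution-time tables as the 1-slave time divided by the n-slave time. The sketch below uses the D100k.T30 rows and assumes that the table whose 1-slave time is 80,860 s is the 35% run, 4,158 s the 40% run, and 485 s the 50% run; this pairing is inferred from the chart scales and the reported rates, not stated in the slides. The function names are illustrative.

```python
# Execution times (sec.) for D100k.T30, keyed by min_sup and number of slaves.
# The min_sup labels are an inferred pairing (see the note above).
times = {
    "35%": {1: 80860, 5: 42558, 10: 22079, 15: 15838},
    "40%": {1: 4158, 5: 1980, 10: 1149, 15: 988},
    "50%": {1: 485, 5: 244, 10: 218, 15: 95},
}


def speedup(min_sup, n_slaves):
    """Rate = execution time with 1 slave / execution time with n slaves."""
    row = times[min_sup]
    return row[1] / row[n_slaves]


def average_speedup(n_slaves):
    """Mean of the rates over the three minimum-support settings."""
    return sum(speedup(m, n_slaves) for m in times) / len(times)


for n in (1, 5, 10, 15):
    per_setting = "  ".join(f"{m}: {speedup(m, n):.2f}" for m in times)
    print(f"{n:>2} slaves  {per_setting}  avg: {average_speedup(n):.2f}")
```

Dividing the average rate by the number of slaves gives the parallel efficiency, which is well below 1 here, so "nearly linear" describes the trend rather than ideal scaling.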
29
Conclusion
Proposed a distributed Apriori algorithm for mining association rules
Experimental evaluation shows that the execution time decreases nearly linearly as the number of slaves increases
Future work:
Segment both the master and GLk for support counts
Develop incremental algorithms for association analysis using mobile agent (MA) technology