
A distributed method for mining association rules

Pham Nguyen Anh Huy*
Department of Information Technology
Vietnam National University of Ho Chi Minh City

presented by Ho Tu Bao
School of Knowledge Science
Japan Advanced Institute of Science and Technology

(*work done during the author's three-month JSPS fellowship at JAIST)


Outline

Introduction

Background

A distributed Apriori algorithm using mobile agents

Experimental evaluation

Conclusion


Introduction

Association analysis is a new and attractive research area in data mining.

The Apriori algorithm (R. Agrawal, IBM, 1993) is a key technique for association analysis.

Although the Apriori principle considerably reduces the search space, the technique still requires a huge amount of computation, particularly for large databases.

This research proposes a distributed version of the Apriori algorithm using mobile agents. The experiments show that computation time can be reduced by using several computers in a distributed computing environment.


Outline

Introduction

Background
  Association rules and Apriori algorithm
  Mobile agents and Aglets

A distributed Apriori algorithm using mobile agents

Experimental evaluation

Conclusion


Association rules: Market basket analysis

Analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets” (rules of the form X → Y, where X and Y are sets of items)

I = {I1=beer, I2=cake, I3=onigiri}

Transactional database
TID1: {I1, I2, I3}
TID2: {I1, I2}
TID3: {I2, I3}
TID4: {I2}
TID5: {I1, I2}

An association rule: {I1} → {I3}

How often do people buy onigiri and beer together?


Rule measures: Support and Confidence

Association rule X → Y

support s = probability that a transaction contains both X and Y

confidence c = conditional probability that a transaction containing X also contains Y

Transaction ID   Items bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

A → C (s=50%, c=66.6%)

C → A (s=50%, c=100%)

[Figure: Venn diagram of customers who buy onigiri, customers who buy beer, and customers who buy both]
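As a minimal sketch, the two measures can be computed directly from the four-transaction table on this slide. The function names `support` and `confidence` are illustrative choices, not part of the original deck.

```python
# Four transactions from the slide (IDs 2000, 1000, 4000, 5000).
DB = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]


def support(db, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in db if itemset <= t) / len(db)


def confidence(db, x, y):
    """Conditional probability that a transaction containing X also contains Y."""
    return support(db, set(x) | set(y)) / support(db, x)


s_ac = support(DB, {"A", "C"})
c_ac = confidence(DB, {"A"}, {"C"})
c_ca = confidence(DB, {"C"}, {"A"})
print(f"A -> C: s={s_ac:.1%}, c={c_ac:.1%}")
print(f"C -> A: s={s_ac:.1%}, c={c_ca:.1%}")
```

Both rules share the same support because support depends only on the union of X and Y; only the confidence depends on the direction of the rule.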


Association mining: Apriori algorithm

It is composed of two steps:

1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count

2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence

(Agrawal, R., 1993)


Association mining: Apriori principle

For rule A → C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent (equivalently, if an itemset is not frequent, none of its supersets is frequent)

Min. support 50%, min. confidence 50%

Transaction ID   Items bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%
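The frequent-itemset table on this slide can be checked by brute force. The sketch below enumerates every itemset over the four transactions (feasible only because the example is tiny) and then verifies the Apriori principle on the result. Variable names are illustrative.

```python
from itertools import combinations

DB = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]
MIN_SUP = 0.5  # minimum support of 50%, as on the slide

items = sorted(set().union(*DB))


def support(itemset):
    """Fraction of transactions containing all items of `itemset`."""
    return sum(1 for t in DB if set(itemset) <= t) / len(DB)


# Brute force: test every non-empty itemset (exponential, for illustration only).
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(combo)
        if s >= MIN_SUP:
            frequent[combo] = s

# Apriori principle: every (k-1)-subset of a frequent k-itemset is frequent.
closed_downward = all(
    sub in frequent
    for itemset in frequent
    if len(itemset) > 1
    for sub in combinations(itemset, len(itemset) - 1)
)

print(frequent)
print("downward closed:", closed_downward)
```

The Apriori algorithm exploits this property in the other direction: an itemset with an infrequent subset never needs to be counted.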


The Apriori algorithm: Finding frequent itemsets using candidate generation

1. Find the frequent itemsets: the sets of items whose support is at least the minimum support

A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets

Iteratively find the frequent itemsets L_k with cardinality from 1 to k (k-itemsets) from the candidate itemsets C_k (L_k ⊆ C_k)

2. Use the frequent itemsets to generate association rules.

C_1 → … → L_{i-1} → C_i → L_i → C_{i+1} → … → L_k
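The step from L_{k-1} to C_k is usually described as a join followed by a prune. The following is a minimal sketch of that step under the assumption that itemsets are stored as sorted tuples; the function name `apriori_gen` and the sample input are illustrative and not taken from the slides.

```python
from itertools import combinations


def apriori_gen(prev_frequent):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets.

    Join: merge two (k-1)-itemsets that agree on their first k-2 items.
    Prune: drop a candidate if any of its (k-1)-subsets is not frequent.
    """
    prev = sorted(prev_frequent)
    if not prev:
        return set()
    prev_set = set(prev)
    candidates = set()
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                subsets = combinations(cand, len(cand) - 1)
                if all(s in prev_set for s in subsets):
                    candidates.add(cand)
    return candidates


# Made-up L2 for illustration: {A,B}, {A,C}, {B,C}, {B,D}
L2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
C3 = apriori_gen(L2)
print(C3)
```

In this made-up input the join also produces {B, C, D}, but it is pruned because {C, D} is not in L2, so only {A, B, C} has to be counted against the database.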


Example (min_sup_count = 2)

Transactional data D

TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Scan D for the count of each candidate to obtain C1:

Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Compare each candidate support count with the minimum support count to obtain L1:

Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2


Example (min_sup_count = 2)

Generate candidates C2 from L1 using the Apriori principle:

C2: {I1, I2}, {I1, I3}, {I1, I4}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}, {I3, I4}, {I3, I5}, {I4, I5}

Scan D for the count of each candidate in C2:

Itemset    Sup. count
{I1, I2}   4
{I1, I3}   4
{I1, I4}   1
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
{I3, I4}   0
{I3, I5}   1
{I4, I5}   0

Compare each candidate support count with the minimum support count to obtain L2:

Itemset    Sup. count
{I1, I2}   4
{I1, I3}   4
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2

Generate candidates C3 from L2 using the Apriori principle:

C3: {I1, I2, I3}, {I1, I2, I5}

Scan D for the count of each candidate in C3:

Itemset        Sup. count
{I1, I2, I3}   2
{I1, I2, I5}   2

Compare each candidate support count with the minimum support count to obtain L3:

Itemset        Sup. count
{I1, I2, I3}   2
{I1, I2, I5}   2
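The whole example can be replayed with a short, single-machine sketch of the level-wise loop. This is an illustrative implementation under the assumption that itemsets are sorted tuples; the helper names (`count_supports`, `apriori_gen`, `apriori`) are not from the slides.

```python
from itertools import combinations

# The nine transactions T100..T900 from the example.
DB = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP_COUNT = 2


def count_supports(candidates, db):
    """Scan the database once and count each candidate itemset."""
    return {c: sum(1 for t in db if set(c) <= t) for c in candidates}


def apriori_gen(prev_frequent):
    """Join (k-1)-itemsets sharing a prefix, then prune by the Apriori principle."""
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = set()
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                if all(s in prev_set
                       for s in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return candidates


def apriori(db, min_count):
    """Return [L1, L2, ...], each a dict mapping itemset to support count."""
    candidates = {(item,) for item in set().union(*db)}
    levels = []
    while candidates:
        counts = count_supports(candidates, db)
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        if not frequent:
            break
        levels.append(frequent)
        candidates = apriori_gen(frequent)
    return levels


levels = apriori(DB, MIN_SUP_COUNT)
for k, frequent in enumerate(levels, start=1):
    print(f"L{k}:", sorted(frequent.items()))
```

The loop stops after L3 because the only joined 4-itemset, {I1, I2, I3, I5}, is pruned: its subset {I1, I3, I5} is not in L3.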


Agents and Mobile agents

An agent is a computational entity that:
  acts on behalf of other entities in an autonomous fashion
  performs its actions with some level of proactivity and reactivity
  exhibits some level of the key attributes of cooperation

Mobile network agents are programs that:
  can migrate from system to system within a network environment
  perform some processing at each host
  decide when and where to move next

How does an agent move?
  1. Save state
  2. Transport the saved state to the next system
  3. Resume execution from the saved state
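The save, transport, resume cycle can be illustrated with a small in-process simulation. This is only a sketch: the class `CountingAgent`, the host names, and the JSON payload are hypothetical, the itinerary is fixed in advance for simplicity, and no real network transfer takes place.

```python
import json


class CountingAgent:
    """Hypothetical agent that counts transactions containing a target itemset."""

    def __init__(self, target, itinerary, count=0, visited=None):
        self.target = set(target)
        self.itinerary = list(itinerary)  # hosts still to visit
        self.count = count
        self.visited = list(visited or [])

    def save_state(self):
        """Save state: serialize everything the agent needs to continue."""
        return json.dumps({
            "target": sorted(self.target),
            "itinerary": self.itinerary,
            "count": self.count,
            "visited": self.visited,
        })

    @classmethod
    def resume(cls, payload):
        """Resume: rebuild the agent on the receiving side from the payload."""
        return cls(**json.loads(payload))

    def work(self, host_name, transactions):
        """Process the data held by the current host."""
        self.count += sum(1 for t in transactions if self.target <= set(t))
        self.visited.append(host_name)


# Simulated hosts, each holding part of the market-basket example.
HOSTS = {
    "host1": [["I1", "I2", "I3"], ["I1", "I2"], ["I2", "I3"]],
    "host2": [["I2"], ["I1", "I2"]],
}

agent = CountingAgent(target={"I1", "I2"}, itinerary=["host1", "host2"])
while agent.itinerary:
    next_host = agent.itinerary.pop(0)     # choose where to move next
    payload = agent.save_state()           # 1. save state
    agent = CountingAgent.resume(payload)  # 2.-3. transport (simulated), resume
    agent.work(next_host, HOSTS[next_host])

print(agent.count, agent.visited)
```

The point of the design is that the data stays on each host and only the small agent state travels, which is the property the distributed algorithm later in the deck relies on.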


Distributed Computing using Mobile Programs


Mobile agent tools

No.  Name         Description                       Developer              Language             Application
1    Concordia    Framework for agent development   Mitsubishi E.I.T.      Java                 Mobile computing, database
2    Aglet        Java Class libraries              IBM, Tokyo             Java                 Internet
3    Agent Tcl    Transportable agent system        R. Gray, U Dart.       Tcl/Tk               Information management
4    Odyssey      Set of Java Class libraries       General Magic          Telescript           Electronic commerce
5    OAA          Open Agent Architecture           SRI International, AI  C, C-Lisp, Java, VB  General purpose
6    Ara          Agent for Remote Action           U Kaiserslautern       C/C++, Tcl, Java     Partially connected c. D.D.B.
7    Tacoma       Tromso and Cornell Moving Agent   Norway & Cornell       C, UNIX-based        Client/Server model issues / OS support
8    Voyager      Platform for distributed applic.  ObjectSpace            Java                 Support for agent systems
9    AgentSpace   Agent building platform           Ichiro Sato, O. U.     Java                 General purpose
10   Mole         First Java-Based MA system        Stuttgart U. Germany   Java, UNIX-based     General purpose
11   MOA          Mobile Object and Agents          OpenGroup, UK          Java                 General purpose
12   Kali Scheme  Distributed impl. of Scheme       NEC Research I.        Scheme               Distributed data mining, load balancing
13   The Tube     Mobile code system                David Halls, UK        Scheme               Remote execution of Scheme
14   Ajanta       Network mobile object             Minnesota U.           Java                 General purpose
15   Knowbots     Research infrastructure of MA     CNRI                   Python               Distributed systems / Internet
16   AgentSpace   Mobile agent framework            Alberto Sylva          Java                 Support for dynamic and dist. appl.
17   Plangent     Intelligent Agent system          Toshiba Corporation    Java                 Intelligent tasks
18   JATlite      Java Agent framework dev /KQML    Stanford U.            Java                 Information retrieval, interface agent
19   Kafka        Multiagent libraries for Java     Fujitsu Lab. Japan     Java, UNIX-based     General purpose
20   Messengers   Autonomous messages               UCI                    C (Messenger-C)      General purpose


What are Aglets?

Aglets (Agile Applets) are Java objects that can move from one host on the Internet to another and perform arbitrary operations within the security limits.

When an Aglet moves, it takes along its program code as well as its data.

The Aglets framework is implemented by the Aglets Software Development Kit (ASDK) from IBM. It is an environment for programming mobile Internet agents in Java.


Aglets at Runtime

Currently, aglets use the Agent Transfer Protocol (ATP) as the default implementation of the communication layer (ATP is modeled after HTTP)

Used on the Tahiti aglet server

Use the Aglets Server Interface to write applications capable of hosting, receiving, and dispatching aglets


Outline

Introduction

Background

A distributed Apriori algorithm using mobile agents

Experimental evaluation

Conclusion


A distributed Apriori algorithm

Setup: (1) spawn n slave processes; (2) divide the database into partitions; (3) distribute the partitions to the slave processes

Master process

1. send the global candidate (k-1)-itemsets C_{k-1} to each slave process

4. wait for and receive the local supports, then count the global supports for the global candidate (k-1)-itemsets C_{k-1}

5. compute the frequent (k-1)-itemsets L_{k-1}, and send clusters of frequent (k-1)-itemsets L_{k-1} to the slave processes

8. wait for and receive the local candidate k-itemsets from the slave processes

9. take the union of the local candidate k-itemsets and prune it to form the global candidate k-itemsets

Slave processes

2. receive the global candidate (k-1)-itemsets C_{k-1}

3. count the local supports for the global candidate (k-1)-itemsets C_{k-1}, and send the local supports to the master process

6. receive the frequent (k-1)-itemsets L_{k-1} from the master process

7. generate local candidate k-itemsets and send them to the master process
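Steps 1 to 5 can be simulated in a single process to see why summing local supports gives the same result as one scan of the whole database. The sketch below is an assumption-laden illustration: the slaves are plain function calls rather than Aglets, the round-robin partitioning and all function names are made up, and it uses the nine-transaction example from the background section with candidate 2-itemsets.

```python
from itertools import combinations

DB = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP_COUNT = 2
N_SLAVES = 3


def partition(db, n):
    """Setup: split the database into n partitions (round-robin, an assumption)."""
    return [db[i::n] for i in range(n)]


def count_local_supports(part, candidates):
    """Slave, steps 2-3: count each candidate in the local partition only."""
    return {c: sum(1 for t in part if c <= t) for c in candidates}


def count_global_supports(local_supports):
    """Master, step 4: add up the local supports reported by the slaves."""
    total = {}
    for local in local_supports:
        for c, n in local.items():
            total[c] = total.get(c, 0) + n
    return total


partitions = partition(DB, N_SLAVES)
items = sorted(set().union(*DB))

# Step 1: the master sends the global candidates (here all 2-itemsets).
C2 = [frozenset(pair) for pair in combinations(items, 2)]

# Steps 2-3: each simulated slave counts supports on its own partition.
local_supports = [count_local_supports(p, C2) for p in partitions]

# Steps 4-5: the master sums the counts and keeps the frequent itemsets.
global_supports = count_global_supports(local_supports)
L2 = {c for c, n in global_supports.items() if n >= MIN_SUP_COUNT}

print(sorted(sorted(c) for c in L2))
```

Because every transaction lives in exactly one partition, the summed counts are identical to a centralized scan, while each slave only reads its own share of the data.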


A distributed Apriori algorithm

[Figure: message flow between the master and the slaves holding the partitions DB1, DB2, ..., DBn, in the order master → slaves → master → slaves → master:
  master: SEND global candidate (k-1)-itemsets C_{k-1}
  slaves: COUNT and SEND local supports for global candidate (k-1)-itemsets (counting support Aglets)
  master: COUNT global supports for global candidate (k-1)-itemsets C_{k-1}; FIND and SEND frequent (k-1)-itemsets L_{k-1}, e.g., {AB}
  slaves: JOIN and SEND local candidate k-itemsets (Aprio_gen Aglet)
  master: UNION local candidate k-itemsets and PRUNE to form global candidate k-itemsets C_k]


Global support count & Global candidate itemsets

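X_k is a candidate k-itemset; its global support count is the sum of its local support counts over the n slaves:

$$X_k.GSupp = \sum_{i=1}^{n} X_k.LSupp_i$$

The set of global candidate k-itemsets GC_k is formed from the local candidate k-itemsets LC_k^{i} of the n slaves:

$$GC_k = \bigcup_{i=1}^{n} LC_k^{i}$$

GL_k^{i} is formed by Apriori-gen with ID segment (p, q) of GL_{k-1}:

$$GL_k^{i} = \text{Apriori-gen}\bigl(GL_{k-1}(p, q)\bigr)$$

$$GL_k = \{\, X_k \in GC_k \mid X_k.GSupp \ge G\text{-}Min\text{-}Supp \,\}$$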
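The slides do not spell out how a slave uses its ID segment (p, q). One plausible reading, sketched below as an assumption, is that the sorted list GL_{k-1} is cut into n contiguous index ranges and each slave joins the itemsets in its own range with every later itemset in the list; the master then takes the union and prunes. All function names are hypothetical, and the input is L2 from the nine-transaction example.

```python
from itertools import combinations


def segments(m, n):
    """Cut indices 0..m-1 into n contiguous ranges [p, q)."""
    return [(i * m // n, (i + 1) * m // n) for i in range(n)]


def local_candidates(prev, p, q):
    """Slave: join itemsets at positions p..q-1 with all later itemsets."""
    out = set()
    for idx in range(p, q):
        a = prev[idx]
        for b in prev[idx + 1:]:
            if a[:-1] == b[:-1]:
                out.add(a + (b[-1],))
    return out


def union_and_prune(local_sets, prev):
    """Master: union the local candidates, then apply the Apriori prune."""
    prev_set = set(prev)
    union = set().union(*local_sets)
    return {
        c for c in union
        if all(s in prev_set for s in combinations(c, len(c) - 1))
    }


# Frequent 2-itemsets from the example, kept in sorted order.
GL2 = sorted([
    ("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
    ("I2", "I3"), ("I2", "I4"), ("I2", "I5"),
])

local_sets = [local_candidates(GL2, p, q) for p, q in segments(len(GL2), 3)]
GC3 = union_and_prune(local_sets, GL2)

for i, lc in enumerate(local_sets, start=1):
    print(f"slave {i}:", sorted(lc))
print("GC3:", sorted(GC3))
```

Under this reading the join work is spread over the slaves, and the prune stays on the master because it needs the complete GL_{k-1}.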


Outline

Introduction

Background

A distributed Apriori algorithm using mobile agents

Experimental evaluation

Conclusion


Experiments: Synthetic datasets

Using synthetic datasets of varying sizes:

Name         |D|    |T|   Size (MB)
D100k.T30    100K   30    3
D100k.T100   100K   100   10
D320k.T150   320K   150   48

|D|: number of transactions
|T|: average number of items per transaction


Experiment environment

Software
  Database: Oracle server
  Language: Java (Sun JDK 1.3)
  Mobile agents: Aglets (IBM)
  Protocol: ATP (Agent Transfer Protocol)
  Platform: Windows

Hardware
  PCs: Pentium III 300 MHz, 128 MB RAM
  15 machines (at Knowledge Science Center, JAIST)


Execution time (sec.) with different minimum support thresholds

min_sup = 35%

Name         1 slave   5 slaves   10 slaves   15 slaves
D100k.T30    80,860    42,558     22,079      15,838
D100k.T100   155,440   77,720     41,080      28,062
D320k.T150   329,532   147,673    76,432      53,318

min_sup = 40%

Name         1 slave   5 slaves   10 slaves   15 slaves
D100k.T30    4,158     1,980      1,149       988
D100k.T100   30,005    15,792     7,978       5,843
D320k.T150   69,011    28,425     18,349      12,854

min_sup = 50%

Name         1 slave   5 slaves   10 slaves   15 slaves
D100k.T30    485       244        218         95
D100k.T100   27,012    13,506     7,047       5,062
D320k.T150   52,322    20,259     11,979      9,112


Execution time with min_sup 35%

[Figure: execution time (sec.) with min_sup = 35% for 1, 5, 10, and 15 slaves; series D100k.T30, D100k.T100, D320k.T150; vertical axis from 0 to 350,000]


Execution time with min_sup 40%

[Figure: execution time (sec.) with min_sup = 40% for 1, 5, 10, and 15 slaves; series D100k.T30, D100k.T100, D320k.T150; vertical axis from 0 to 80,000]


Execution time with min_sup 50%

[Figure: execution time (sec.) with min_sup = 50% for 1, 5, 10, and 15 slaves; series D100k.T30, D100k.T100, D320k.T150; vertical axis from 0 to 60,000]


Rate of execution time

Nb of slaves   minsup 35%   minsup 40%   minsup 50%   average
1              1            1            1            1
5              1.9          2.1          2            2
10             3.6          3.6          2.2          3.1
15             5.1          4.2          5.1          4.8

[Figure: average rate for 1, 5, 10, and 15 slaves; vertical axis from 0 to 6]

The rate (execution time with 1 slave divided by execution time with n slaves) grows nearly linearly with the number of slaves, reaching 4.8 on average with 15 slaves
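The rates in this table appear to be derived from the D100k.T30 timings: dividing the 1-slave time by the n-slave time reproduces each column to within rounding. The sketch below shows that calculation under this assumption, with the three timing rows matched to thresholds as the chart axis ranges suggest. The names `TIMES`, `speedup`, and `efficiency` are illustrative.

```python
# D100k.T30 execution times in seconds, keyed by minimum support (%)
# and then by number of slaves. The threshold assignment is inferred.
TIMES = {
    35: {1: 80860, 5: 42558, 10: 22079, 15: 15838},
    40: {1: 4158, 5: 1980, 10: 1149, 15: 988},
    50: {1: 485, 5: 244, 10: 218, 15: 95},
}


def speedup(times):
    """Rate = time with 1 slave / time with n slaves."""
    base = times[1]
    return {n: base / t for n, t in times.items()}


rates = {min_sup: speedup(t) for min_sup, t in TIMES.items()}

# Average rate over the three thresholds for each slave count.
average = {
    n: sum(rates[min_sup][n] for min_sup in TIMES) / len(TIMES)
    for n in (1, 5, 10, 15)
}

# Parallel efficiency: how much of the ideal n-fold speedup is reached.
efficiency = {n: average[n] / n for n in average}

for n in average:
    print(f"{n:>2} slaves: rate {average[n]:.2f}, efficiency {efficiency[n]:.0%}")
```

Read this way, 15 slaves give roughly a 4.8-fold speedup, about a third of the ideal 15-fold, so the growth is steady with the number of slaves but well below one-to-one.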


Conclusion

Proposed a distributed Apriori algorithm for mining association rules

The experimental evaluation shows that the execution time decreases nearly linearly as the number of slaves increases

Future work:
  Segment both the master and GL_k for support counts
  Develop incremental algorithms for association analysis using mobile agent (MA) technology