Apriori data mining in the cloud

42
APRIORI DATA MINING IN THE CLOUD Case study: d60 Raptor smartAdvisor Jan Neerbek Alexandra Institute

description

 

Transcript of Apriori data mining in the cloud

Page 1: Apriori data mining in the cloud

APRIORI DATA MINING IN THE CLOUD

Case study: d60 Raptor smartAdvisor

Jan NeerbekAlexandra Institute

Page 2: Apriori data mining in the cloud

2

Agenda

· d60: A cloud/data mining case

· Cloud

· Data Mining

· Market Basket Analysis

· Large data sets

· Our solution

Page 3: Apriori data mining in the cloud

3

Alexandra Institute

The Alexandra Institute is a non-profit company that works with application-oriented IT research.

Focus is pervasive computing, and we activate the business potential of our members and customers through research-based userdriven innovation.

Page 4: Apriori data mining in the cloud

4

The case: d60

· Danish company

· A similar products recommendation engine

· d60 was outgrowing their servers (late 2010)

· They saw a potential in moving to Azure

Page 5: Apriori data mining in the cloud

5

The setup

InternetWebshops

Log shoppingpatterns

Do data mining

ProductRecommendations

Page 6: Apriori data mining in the cloud

6

The cloud potential

· Elasticity

· No upfront server cost

· Cheaper licenses

· Faster calculations

Page 7: Apriori data mining in the cloud

7

Challenges

· No SQL Server Analysis Services (SSAS)

· Small compute nodes

· Partioned database (50GB)

· SQL server ingress/outgress access is slow

Page 8: Apriori data mining in the cloud

8

The cloud

Node

Node

Node

Node

Node

Node Node

Page 9: Apriori data mining in the cloud

9

The cloud and services

Node

Node

Node

Node

Node

Node Node

Data layerservice

MessagingService

Page 10: Apriori data mining in the cloud

10

Data layer service

· Application specific (schema/layout)

· SQL, table or other

· Easy a bottleneck

· Can be difficult to scale

Data layerservice

Page 11: Apriori data mining in the cloud

11

Messaging serviceTask Queues

· Standard data structure

· Build-in ordering (FIFO)

· Can be scaled

· Good for asynchronous messages

MessagingService

Page 12: Apriori data mining in the cloud

12

DATA MINING

Page 13: Apriori data mining in the cloud

13

Data mining

Data mining is the use of automated data analysis techniques to uncover relationships among data items

Market basket analysis is a data mining technique that discovers co-occurrence relationships among activities performed by specific individuals

[about.com/wikipedia.org]

Page 14: Apriori data mining in the cloud

14

Market basket analysis

Customer1

Avocado

Milk

Butter

Potatoes

Customer2

Milk

Diapers

Avocado

Beer

Customer3

Beef

Lemons

Beer

Chips

Customer4

Cereal

Beer

Beef

Diapers

Page 15: Apriori data mining in the cloud

15

Market basket analysis

Customer1

Avocado

Milk

Butter

Potatoes

Customer2

Milk

Diapers

Avocado

Beer

Customer3

Beef

Lemons

Beer

Chips

Customer4

Cereal

Beer

Beef

Diapers

Itemset (Diapers, Beer) occur 50%

Frequency threshold parameterFind as many frequent itemsets as possible

Page 16: Apriori data mining in the cloud

16

Market basket analysis

Popular effective algorithm: FP-growth Based on data structure FP-tree

Requires all data in near-memory Most research in distributed models has been for

cluster setups

Page 17: Apriori data mining in the cloud

17

Building the FP-tree(extends the prefix-tree structure)

Avocado

Butter

Milk

Potatoes

Customer1

Avocado

Milk

Butter

Potatoes

Page 18: Apriori data mining in the cloud

18

Building the FP-tree

Customer2

Milk

Diapers

Avocado

Beer

Avocado

Butter

Milk

Potatoes

Page 19: Apriori data mining in the cloud

19

Building the FP-tree

Customer2

Milk

Diapers

Avocado

Beer

Avocado

Butter

Milk

Potatoes

Beer

Diapers

Milk

Page 20: Apriori data mining in the cloud

20

Building the FP-tree

Customer2

Milk

Diapers

Avocado

Beer

Avocado

Butter

Milk

Potatoes

Beer

Diapers

Milk

Page 21: Apriori data mining in the cloud

21

Building the FP-tree

Avocado

Butter

Milk

Potatoes

Beer

Diapers

Milk

Beef

Beer

Chips

Lemon

Cereal

Diapers

Page 22: Apriori data mining in the cloud

22

FP-growth

Grows the frequent itemsets, recusively

FP-growth(FP-tree tree){ … for-each (item in tree) count =CountOccur(tree,item); if (IsFrequent(count)) { OutputSet(item); sub = tree.GetTree(tree, item); FP-growth(sub); }

Page 23: Apriori data mining in the cloud

23

FP-growth algorithmDivide and Conquer

Traverse tree

Avocado

Butter

Milk

Potatoes

Beer

Diapers

Milk

Beef

Beer

Chips

Lemon

Cereal

Diapers

Page 24: Apriori data mining in the cloud

24

FP-growth algorithmDivide and Conquer

Generate sub-trees

Avocado

Butter

Milk

Potatoes

Beer

Diapers

Milk

Beef

Beer

Chips

Lemon

Cereal

Diapers

Page 25: Apriori data mining in the cloud

25

FP-growth algorithmDivide and Conquer

Call recursively

Avocado

Butter

Milk

Potatoes

Beer

Diapers

Milk

Beef

Beer

Chips

Lemon

Cereal

Diapers

Avocado

Butter Beer

Diapers

Page 26: Apriori data mining in the cloud

26

FP-growth algorithmMemory usage

The FP-tree does not fit in local memory; what to do?

· Emulate Distributed Shared Memory

Page 27: Apriori data mining in the cloud

27

Distributed Shared Memory?

· To add nodes is to add memory

· Works best in tightly coubled setups, with low-lantency, high-speed networks

CPU

Memory

CPU

Memory

CPU

Memory

CPU

Memory

CPU

Memory

Shared Memory

Network

Page 28: Apriori data mining in the cloud

28

FP-growth algorithmMemory usage

The FP-tree does not fit in local memory; what to do?

· Emulate Distributed Shared Memory

· Optimize your data structures

· Buy more RAM

· Get a good idea

Page 29: Apriori data mining in the cloud

29

Get a good idea

· Database scans are serial and can be distributed

· The list of items used in the recursive calls uniquely determines what part of data we are looking at

Page 30: Apriori data mining in the cloud

30

Get a good idea

Avocado

Butter

Milk

Potatoes

Beer

Diapers

Milk

Beef

Beer

Chips

Lemon

Cereal

Diapers

Avocado

Butter Beer

Diapers

Page 31: Apriori data mining in the cloud

31

Get a good idea

Milk

Butter, MilkAvocado

Butter Beer

Diapers

Avocado

Avocado

Beer

Diapers,MilkThese are postfix paths

Page 32: Apriori data mining in the cloud

32

BRINGING IT ALL TOGETHER

Page 33: Apriori data mining in the cloud

33

Buckets

· Use postfix paths for messaging

· Working with buckets

Transactions

Items

Page 34: Apriori data mining in the cloud

34

FP-growth revisited

FP-growth(FP-tree tree){ … for-each (item in tree) count =CountOccur(tree,item); if (IsFrequent(count)) { OutputSet(item); sub = tree.GetTree(tree, item); FP-growth(sub); }

Replaced with postfix

Done in parallel

Done in parallel

Done in parallel

Page 35: Apriori data mining in the cloud

35

Communication

Node Node

Node Node

Data layer

Page 36: Apriori data mining in the cloud

36

Revised Communication

Node Node

Node Node

Data layerMQ

Page 37: Apriori data mining in the cloud

37

Running FP-growth

Distribute buckets

Count items(with postfix size=n)

Collect counts(per postfix)Call recursive

Standard FP-growth

Page 38: Apriori data mining in the cloud

38

Running FP-growth

Distribute buckets

Count items(with postfix size=n)

Collect counts(per postfix)Call recursive

Standard FP-growth

Page 39: Apriori data mining in the cloud

39

Collecting what we have learned

· Message-driven work, using message-queue

· Peer-to-peer for intermediate results

· Distribute data for scalability (buckets)

· Small messages (list of items)

· Allow us to distribute FP-growth

Page 40: Apriori data mining in the cloud

40

Advantages

· Configurable work sizes

· Good distribution of work

· Robust against computer failure

· Fast!

Page 41: Apriori data mining in the cloud

41

So what about performance?

1 2 4 800:00:00

00:29:59

00:59:59

01:29:59

01:59:59

02:29:59

02:59:59

03:29:59

03:59:59

04:29:59

Message-driven FP-growth

FP-growth

Total node time

Page 42: Apriori data mining in the cloud

42

Thank you!

Questions?