Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

38
Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1

Transcript of Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Page 1: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Learning with Hadoop – A case study on MapReduce

based Data MiningEvan Xiang, HKUST

1

Page 2: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

2

Page 3: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Introduction to Hadoop

Hadoop Map/Reduce is

a java based software framework for easily writing applications

which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware

in a reliable, fault-tolerant manner.

3

Page 4: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Job submission node

Slave node

TaskTracker DataNode

HDFS master

JobTracker NameNode

Slave node

TaskTracker DataNode

Slave node

TaskTracker DataNode

Client

Hadoop Cluster Architecture

4From Jimmy Lin’s slides

Page 5: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Hadoop HDFS

5

Page 6: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Hadoop Cluster Rack Awareness

6

Page 7: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Hadoop Development Cycle

Hadoop ClusterYou

1. Scp data to cluster2. Move data into HDFS

3. Develop code locally

4. Submit MapReduce job4a. Go back to Step 3

5. Move data out of HDFS6. Scp data from cluster

7From Jimmy Lin’s slides

Page 8: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Divide and Conquer

“Work”

w1 w2 w3

r1 r2 r3

“Result”

“worker” “worker” “worker”

Partition

Combine

8From Jimmy Lin’s slides

Page 9: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

High-level MapReduce pipeline

9

Page 10: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Detailed Hadoop MapReduce data flow

10

Page 11: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

11

Page 12: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Word Count with MapReduce

1one 1

1two 1

1fish 2

one fish, two fishDoc 1

2red 1

2blue 1

2fish 2

red fish, blue fishDoc 2

3cat 1

3hat 1

cat in the hatDoc 3

1fish 4

1one 11two 1

2red 1

3cat 12blue 1

3hat 1

Shuffle and Sort: aggregate values by keys

Map

Reduce

12From Jimmy Lin’s slides

Page 13: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

13

Page 14: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

14

Calculating document pairwise similarity

Trivial Solution

load each vector o(N) times load each term o(dft

2) times

scalable and efficient solutionfor large collections

Goal

From Jimmy Lin’s slides

Page 15: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

15

Better Solution

Load weights for each term onceEach term contributes o(dft

2) partial scores

Each term contributes only if appears in

From Jimmy Lin’s slides

Page 16: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

16

reduce

Decomposition

Load weights for each term onceEach term contributes o(dft

2) partial scores

Each term contributes only if appears in

map

From Jimmy Lin’s slides

Page 17: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

17

Standard Indexing

tokenize

tokenize

tokenize

tokenize

combine

combine

combine

doc

doc

doc

doc

posting list

posting list

posting list

Shuffling

group values by: terms

(a) Map (b) Shuffle (c) Reduce

From Jimmy Lin’s slides

Page 18: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Inverted Indexing with MapReduce

1one 1

1two 1

1fish 2

one fish, two fishDoc 1

2red 1

2blue 1

2fish 2

red fish, blue fishDoc 2

3cat 1

3hat 1

cat in the hatDoc 3

1fish 2 2 2

1one 11two 1

2red 1

3cat 12blue 1

3hat 1

Shuffle and Sort: aggregate values by keys

Map

Reduce

18From Jimmy Lin’s slides

Page 19: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

19

Indexing (3-doc toy collection)Clinton

Barack

Cheney

Obama

Indexing

2

1

1

1

1

ClintonObamaClinton 1

1

ClintonCheney

ClintonBarackObama

ClintonObamaClinton

ClintonCheney

ClintonBarackObama

From Jimmy Lin’s slides

Page 20: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

20

Pairwise Similarity(a) Generate pairs (b) Group pairs (c) Sum pairs

Clinton

Barack

Cheney

Obama

2

1

1

1

1

1

1

2

2

1

11

2

2 2

2

1

13

1

How to deal with the long list?

From Jimmy Lin’s slides

Page 21: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

21

Page 22: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

PageRank

PageRank – an information propagation model

Intensive access of neighborhood list

22

Page 23: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

PageRank with MapReduce

n5 [n1, n2, n3]n1 [n2, n4] n2 [n3, n5] n3 [n4] n4 [n5]

n2 n4 n3 n5 n1 n2 n3n4 n5

n2 n4n3 n5n1 n2 n3 n4 n5

n5 [n1, n2, n3]n1 [n2, n4] n2 [n3, n5] n3 [n4] n4 [n5]

Map

Reduce

How to maintain the graph structure?

From Jimmy Lin’s slides

Page 24: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

24

Page 25: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

K-Means Clustering

25

Page 26: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

K-Means Clustering with MapReduce

26

Mapper_iMapper_i-1 Mapper_i+1

Reducer_i Reducer_i+1Reducer_i-1

How to set the initial centroids is very important!Usually we set the centroids using Canopy Clustering.

Each Mapper loads a set of data samples,

and assign each sample to a nearest centroid

Each Mapper needs to keep a copy of

centroids

1 3 42

1 32

1 42

1 3 42

4

3

23 4

3 2 4

[McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000]

Page 27: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

27

Page 28: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Matrix Factorization for Link Prediction

In this task, we observe a sparse matrix X R∈ m×n with entries xij. Let R = {(i,j,r): r = xij, where xij ≠0} denote the set of observed links in the system. In order to predict the unobserved links in X, we model the users and the items by a user factor matrix U R∈ k×m and an item factor matrix V R∈ k×n. The goal is to approximate the link matrix X via multiplying the factor matrix U and V, which can be learnt by minimizing:

28

Page 29: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Given X and V, updating U:

Similarly, given X and U, we can alternatively update V

Solving Matrix Factorization via Alternative Least Squares

29

Xm

n

ui

Vk

n

A

k k

k

k

k

k

b

k

Page 30: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

MapReduce for ALS

30

Mapper_i

Reducer_i

Mapper_i

Reducer_i

Group rating data in X using

for item j

Group features in V

using for item j

Align ratings and features for item j, and make a copy of Vj for

each observe xij

Stage 1 Stage 2

Rating for item j

Features for item j

i-1 Vj

i Vj

i+1 Vj

Group rating data in X using

for user i

i Vj

i Vj+2

i+1 Vj

Standard ALS:Calculate A and b,

and update Ui

Page 31: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

31

Page 32: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Cluster Coefficient

32

In graph mining, a clustering coefficient is a measure of degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex in a graph quantifies how close its neighbors are to being a clique (complete graph), which is used to determine whether a graph is a small-world network.

How to maintain the Tier-2 neighbors?

[D. J. Watts and Steven Strogatz (June 1998).

"Collective dynamics of 'small-world' networks".

Nature 393 (6684): 440–442]

Page 33: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Cluster Coefficient with MapReduce

33

Mapper_i

Reducer_i

Mapper_i

Reducer_i

Stage 1 Stage 2

Calculate the cluster coefficient

BFS based method need three stages, but actually we only need two!

Page 34: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Resource Entries to ML labsMahout

Apache’s scalable machine learning librariesJimmy Lin’s Lab

iSchool at the University of MarylandJimeng Sun & Yan Rong’s Collections

IBM TJ Watson Research CenterEdward Chang & Yi Wang

Google Beijing

34

Page 35: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Advanced Topics in Machine Learning with MapReduce

35

Probabilistic Graphical models

Gradient based optimization methods

Graph Mining

Others…

Page 36: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Some Advanced TipsDesign your algorithm with a divide and

conquer manner

Make your functional units loosely dependent

Carefully manage your memory and disk storage

Discussions…

36

Page 37: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

37

Page 38: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.

Q&A

Why not MPI? Hadoop is Cheap in everything…D.P.T.H…

What’s the advantages of Hadoop? Scalability!

How do you guarantee the model equivalence? Guarantee equivalent/comparable function logics

How can you beat “large memory” solution? Clever use of Sequential Disk Access

38