Introduction MapReduce Our results Algorithm Implementation Experiments Conclusions
Experimental Evaluation of Multi-Round Matrix Multiplication on MapReduce

Matteo Ceccarello, Francesco Silvestri
Dip. di Ingegneria dell'Informazione, Università di Padova, Italy
5/1/2015 - ALENEX 2015
Why matrix multiplication?
- MapReduce is a common choice when dealing with Big Data
- The computation is organized in rounds:
  Map → Shuffle → Reduce

Monolithic algorithms: execute in a constant number of rounds.
Multi-round algorithms: the number of rounds depends on the input size.
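As a toy illustration of this structure, a single round can be simulated in plain Python (a minimal sketch; `mapreduce_round`, `map_fn`, and `reduce_fn` are made-up names for this sketch, not the Hadoop API):

```python
from collections import defaultdict

def mapreduce_round(pairs, map_fn, reduce_fn):
    """One round: Map -> Shuffle (group by key) -> Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in map_fn(key, value):   # Map
            groups[out_key].append(out_value)           # Shuffle
    result = []
    for out_key, values in groups.items():
        result.extend(reduce_fn(out_key, values))       # Reduce
    return result

# Toy word count over two "documents".
docs = [(0, "a b a"), (1, "b c")]
counts = mapreduce_round(
    docs,
    map_fn=lambda _, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda w, ones: [(w, sum(ones))],
)
```

A monolithic algorithm invokes such a round a constant number of times; a multi-round algorithm chains a number of them that grows with the input, carrying state between rounds.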
Why matrix multiplication?
Not a performance benchmark
We are interested in investigating the behaviour of multi-round algorithms w.r.t. monolithic ones.
Why multi-round?

- Service market: start/stop computations while waiting for better prices
- Resource requirements: monolithic algorithms may require too many resources
- Fault tolerance: easier checkpointing
Outline
1. MapReduce
2. Our results
3. Algorithm
4. Implementation
   - Map and Reduce
   - Partitioner
5. Experiments
   - In-house cluster
   - Amazon EMR
6. Conclusions
Previous work
- Original work
  - "MapReduce: Simplified Data Processing on Large Clusters" [Dean, Ghemawat 2004]
- Models
  - "A model of computation for MapReduce" [Karloff, Suri, Vassilvitskii 2010]
  - "Sorting, searching, and simulation in the MapReduce framework" [Goodrich, Sitchinava, Zhang 2011]
  - "Space-round tradeoffs for MapReduce computations" [Pietracaprina, Pucci, Riondato, Silvestri, Upfal 2012]
- Experimental work
  - "HAMA: An Efficient Matrix Computation with the MapReduce Framework" [Sangwon, Yoon, Jaehong, Seongwook, Jin-Soo, Seungryoul]
MapReduce - Model
MR(m, ρ) [Pietracaprina+ 2012]
- m, local memory: constrains the amount of local work
- ρ, replication: constrains the communication in a single round
- The complexity is measured in the number of rounds
MapReduce - Round
Our results
- M3 - Matrix Multiplication on MapReduce: a Hadoop library for dense and sparse matrix multiplication
- Multi-round algorithms have performance comparable with monolithic ones
- Concentrating only on the round number for performance evaluation may not be the best strategy
Algorithm: 3D dense
Data representation

- Input matrices are √n × √n, divided into blocks of size √m × √m
- The input is represented as a collection of pairs ⟨(i, ℓ, j); A_ij⟩ for 0 ≤ i, j < √(n/m)

Algorithm

- C_ij = Σ_{h=0}^{√(n/m)−1} A_ih · B_hj
- There are (n/m)^{3/2} block products, partitioned into √(n/m) groups
- Round r processes the groups G_ℓ with rρ ≤ ℓ < (r+1)ρ
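A sequential sketch of this round structure, assuming NumPy (this simulates how block products are grouped into rounds, not the Hadoop implementation; the theorem's extra "+1" round is not modeled here):

```python
import numpy as np

def mm3d(A, B, sqrt_m, rho):
    """Blocked multiplication, processing rho groups G_ell per round."""
    q = A.shape[0] // sqrt_m          # q = sqrt(n/m) blocks per dimension
    s = sqrt_m
    C = np.zeros_like(A)
    rounds = 0
    for r in range(-(-q // rho)):     # ceil(q / rho) rounds of products
        rounds += 1
        for ell in range(r * rho, min((r + 1) * rho, q)):
            # Group G_ell: all q^2 block products A[i, ell] @ B[ell, j].
            for i in range(q):
                for j in range(q):
                    C[i*s:(i+1)*s, j*s:(j+1)*s] += (
                        A[i*s:(i+1)*s, ell*s:(ell+1)*s]
                        @ B[ell*s:(ell+1)*s, j*s:(j+1)*s]
                    )
    return C, rounds

rng = np.random.default_rng(0)
A = rng.random((8, 8))
B = rng.random((8, 8))
C, R = mm3d(A, B, sqrt_m=2, rho=1)   # q = 4 groups, one group per round
```

With ρ = q the same call degenerates into a single (monolithic) round of products.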
Algorithm: 3D dense
- C_ij = Σ_{h=0}^{√(n/m)−1} A_ih · B_hj
- (n/m)^{3/2} products ⇒ √(n/m) groups
- Round r: groups ℓ ∈ [rρ, (r+1)ρ)
Complexity
Theorem 2 [Pietracaprina+ 2012]
The above MR(m, ρ) algorithm multiplies two √n × √n dense matrices in

    R = √n / (ρ√m) + 1

rounds.

ρ = √(n/m) ⇒ monolithic algorithm
ρ = 1 ⇒ multi-round algorithm
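In code, with made-up sizes (chosen so that √(n/m) = 16), the two extremes of the bound look like this:

```python
import math

def rounds(n, m, rho):
    # R = sqrt(n) / (rho * sqrt(m)) + 1 = sqrt(n/m) / rho + 1
    return math.isqrt(n // m) // rho + 1  # assumes rho divides sqrt(n/m)

n, m = 1 << 20, 1 << 12              # hypothetical sizes, sqrt(n/m) = 16
monolithic = rounds(n, m, rho=16)    # rho = sqrt(n/m): constant rounds
multi = rounds(n, m, rho=1)          # rho = 1: rounds grow with n/m
```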
Algorithm: 2D dense (HAMA)
- C_ij = A_i · B_j (a full row of blocks of A times a full column of blocks of B)
- The round complexity is higher:

    R = n / (ρm)   vs.   √n / (ρ√m) + 1
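Evaluating both bounds for made-up sizes (again √(n/m) = 16) makes the gap concrete:

```python
import math

n, m = 1 << 20, 1 << 12     # hypothetical sizes
for rho in (1, 4, 16):
    r2d = n // (rho * m)                    # 2D (HAMA): n / (rho m)
    r3d = math.isqrt(n // m) // rho + 1     # 3D: sqrt(n)/(rho sqrt(m)) + 1
    print(f"rho={rho}: 2D={r2d} rounds, 3D={r3d} rounds")
```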
Algorithm: 3D sparse
- Applies to Erdős–Rényi matrices with density δ
- Adaptation of the 3D dense algorithm
- Can be extended to generic sparse matrices

Theorem
The above algorithm requires

    R = δn^{3/4} / (ρ√m) + 1

rounds; the expected shuffle size is 3ρδ²n^{3/2} and the expected reducer size is 3m.
Implementation - Partitioner
[Figure: count of keys per reducer ID (0-31), comparing the 3D partitioner with the naive one]

(i, h, j) → z = iρ√(n/m) + jρ + (h mod ρ)    where ρ = replication, m = subproblem size
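Read as code, with q = √(n/m) blocks per dimension (the radical appears lost in the extracted formula, so this reconstruction is an assumption), the partitioner maps each block-product triple to one of ρ·q² reducers:

```python
def reducer_id(i, h, j, rho, q):
    """3D partitioner sketch. q = sqrt(n/m) blocks per matrix dimension
    (assumed reconstruction). IDs range over 0 .. rho*q*q - 1."""
    return i * rho * q + j * rho + (h % rho)

# With q = 4 and rho = 2, all 32 reducer IDs of the plot are hit,
# each receiving the same number of block products.
ids = [reducer_id(i, h, j, rho=2, q=4)
       for i in range(4) for h in range(4) for j in range(4)]
```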
Questions
Q1: How does the subproblem size affect the performance?
Q2: Does replication affect performance?
Q3: Which are the major factors affecting the running time?
Q4: Does the algorithm scale efficiently?
Q5: Is 3D better than 2D?
Q6: Does the sparse algorithm efficiently exploit the input sparsity?
Experimental setup
In-house cluster
- 16 machines
- 4-core Intel i7 Nehalem @ 3.07 GHz
- 12 GB RAM
- 6 disks of 1 TB at 7200 RPM in RAID0

Amazon EMR

- c3.8xlarge: compute-optimized instances
- i2.xlarge: storage-optimized instances
Q1: Time vs. subproblem size
Increasing the subproblem size reduces the time
[Figure "Local memory size": time (s) vs. local memory size (1000-4000); series: max 16000, max 32000, min 16000, min 32000]
Q2: Time vs. replication
Multi-round algorithms are comparable with monolithic ones
[Figure "Replication vs. time (32k x 32k matrices)": time (s) vs. replication (1, 2, 4, 8)]
Q3: Components cost
The most expensive component is communication
[Figure "Components (32k x 32k matrices)": time (s) vs. replication (the inverse of the number of rounds); components: communication, computation, infrastructure]
Q4: Scalability
The algorithm scales well with the number of machines
[Figure "Scalability (16k x 16k matrices)": time (s) vs. number of machines (4-16); one series per replication value (1, 2, 4)]
Q5: 3D vs. 2D
The 3D algorithm is faster than the 2D approach, as predicted by the analysis
[Figure "Comparison between 2D and 3D (16k x 16k matrices)": time (s) vs. replication (0-16); series: 2D, 3D]
Q6: Sparse
We can effectively leverage the sparsity of input matrices
[Figure "Sparse matrices": time (s) vs. replication (0-16); one series per dimension: 2^20, 2^22, 2^24]
A sneak peek at the future
Changing the platform makes space/round tradeoffs even more evident
[Figure "Multiplication of 64k x 64k matrices on Spark": time (s) vs. replication (1-16)]
Conclusions
Results
- Multi-round algorithms can be comparable with monolithic ones
- Focusing only on the round number may not lead to the best performance if it implies a large amount of communication
- Open-source implementation: M3 - Matrix Multiplication on MapReduce, http://www.dei.unipd.it/m3
Ongoing work
- Test the algorithms on other MapReduce implementations (Spark)