Introduction MapReduce Our results Algorithm Implementation Experiments Conclusions
Experimental Evaluation of Multi-Round Matrix Multiplication on MapReduce

Matteo Ceccarello, Francesco Silvestri
Dip. di Ingegneria dell'Informazione, Università di Padova, Italy
5/1/2015 - ALENEX 2015
Why matrix multiplication?
- MapReduce is a common choice when dealing with Big Data
- The computation is organized in rounds:
  Map → Shuffle → Reduce

Monolithic algorithms: execute in a constant number of rounds.
Multi-round algorithms: the number of rounds depends on the input size.
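As a toy illustration of this structure, a single round can be simulated in plain Python (a minimal sketch; `mapreduce_round`, `map_fn`, and `reduce_fn` are made-up names for this sketch, not the Hadoop API):

```python
from collections import defaultdict

def mapreduce_round(pairs, map_fn, reduce_fn):
    """One round: Map -> Shuffle (group by key) -> Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in map_fn(key, value):   # Map
            groups[out_key].append(out_value)           # Shuffle
    result = []
    for out_key, values in groups.items():
        result.extend(reduce_fn(out_key, values))       # Reduce
    return result

# Toy word count over two "documents".
docs = [(0, "a b a"), (1, "b c")]
counts = mapreduce_round(
    docs,
    map_fn=lambda _, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda w, ones: [(w, sum(ones))],
)
```

A monolithic algorithm invokes such a round a constant number of times; a multi-round algorithm chains a number of them that grows with the input, carrying state between rounds.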
Why matrix multiplication?
Not a performance benchmark
We are interested in investigating the behaviour of multi-round algorithms w.r.t. monolithic ones.
Why multi-round?

- Service market: start/stop computations while waiting for better prices
- Resource requirements: monolithic algorithms may require too many resources
- Fault tolerance: easier checkpointing
Outline
1. MapReduce
2. Our results
3. Algorithm
4. Implementation
   - Map and Reduce
   - Partitioner
5. Experiments
   - In-house cluster
   - Amazon EMR
6. Conclusions
Previous work
- Original work
  - "MapReduce: Simplified Data Processing on Large Clusters" [Dean, Ghemawat 2004]
- Models
  - "A model of computation for MapReduce" [Karloff, Suri, Vassilvitskii 2010]
  - "Sorting, searching, and simulation in the MapReduce framework" [Goodrich, Sitchinava, Zhang 2011]
  - "Space-round tradeoffs for MapReduce computations" [Pietracaprina, Pucci, Riondato, Silvestri, Upfal 2012]
- Experimental work
  - "HAMA: An Efficient Matrix Computation with the MapReduce Framework" [Sangwon, Yoon, Jaehong, Seongwook, Jin-Soo, Seungryoul]
MapReduce - Model
MR(m, ρ) [Pietracaprina+ 2012]
- m, local memory: constrains the amount of local work
- ρ, replication: constrains the communication in a single round
- The complexity is measured in the number of rounds
MapReduce - Round
Our results
- M3 - Matrix Multiplication on MapReduce: a Hadoop library for dense and sparse matrix multiplication
- Multi-round algorithms have performance comparable with monolithic ones
- Concentrating only on the round number for performance evaluation may not be the best strategy
Algorithm: 3D dense
Data representation

- Input matrices are √n × √n, divided into blocks of size √m × √m
- The input is represented as a collection of pairs ⟨(i, ℓ, j); A_ij⟩ for 0 ≤ i, j < √(n/m)

Algorithm

- C_ij = Σ_{h=0}^{√(n/m)−1} A_ih · B_hj
- There are (n/m)^{3/2} block products, partitioned into √(n/m) groups
- Round r processes the groups G_ℓ with rρ ≤ ℓ < (r+1)ρ
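A sequential sketch of this round structure, assuming NumPy (this simulates how block products are grouped into rounds, not the Hadoop implementation; the theorem's extra "+1" round is not modeled here):

```python
import numpy as np

def mm3d(A, B, sqrt_m, rho):
    """Blocked multiplication, processing rho groups G_ell per round."""
    q = A.shape[0] // sqrt_m          # q = sqrt(n/m) blocks per dimension
    s = sqrt_m
    C = np.zeros_like(A)
    rounds = 0
    for r in range(-(-q // rho)):     # ceil(q / rho) rounds of products
        rounds += 1
        for ell in range(r * rho, min((r + 1) * rho, q)):
            # Group G_ell: all q^2 block products A[i, ell] @ B[ell, j].
            for i in range(q):
                for j in range(q):
                    C[i*s:(i+1)*s, j*s:(j+1)*s] += (
                        A[i*s:(i+1)*s, ell*s:(ell+1)*s]
                        @ B[ell*s:(ell+1)*s, j*s:(j+1)*s]
                    )
    return C, rounds

rng = np.random.default_rng(0)
A = rng.random((8, 8))
B = rng.random((8, 8))
C, R = mm3d(A, B, sqrt_m=2, rho=1)   # q = 4 groups, one group per round
```

With ρ = q the same call degenerates into a single (monolithic) round of products.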
Algorithm: 3D dense
- C_ij = Σ_{h=0}^{√(n/m)−1} A_ih · B_hj
- (n/m)^{3/2} products ⇒ √(n/m) groups
- Round r: groups ℓ ∈ [rρ, (r+1)ρ)
Complexity
Theorem 2 [Pietracaprina+ 2012]
The above MR(m, ρ) algorithm multiplies two √n × √n dense matrices in

    R = √n / (ρ√m) + 1

rounds.

ρ = √(n/m) ⇒ monolithic algorithm
ρ = 1 ⇒ multi-round algorithm
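In code, with made-up sizes (chosen so that √(n/m) = 16), the two extremes of the bound look like this:

```python
import math

def rounds(n, m, rho):
    # R = sqrt(n) / (rho * sqrt(m)) + 1 = sqrt(n/m) / rho + 1
    return math.isqrt(n // m) // rho + 1  # assumes rho divides sqrt(n/m)

n, m = 1 << 20, 1 << 12              # hypothetical sizes, sqrt(n/m) = 16
monolithic = rounds(n, m, rho=16)    # rho = sqrt(n/m): constant rounds
multi = rounds(n, m, rho=1)          # rho = 1: rounds grow with n/m
```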
Algorithm: 2D dense (HAMA)
- C_ij = A_i · B_j (a full row of blocks of A times a full column of blocks of B)
- The round complexity is higher:

    R = n / (ρm)   vs.   √n / (ρ√m) + 1
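Evaluating both bounds for made-up sizes (again √(n/m) = 16) makes the gap concrete:

```python
import math

n, m = 1 << 20, 1 << 12     # hypothetical sizes
for rho in (1, 4, 16):
    r2d = n // (rho * m)                    # 2D (HAMA): n / (rho m)
    r3d = math.isqrt(n // m) // rho + 1     # 3D: sqrt(n)/(rho sqrt(m)) + 1
    print(f"rho={rho}: 2D={r2d} rounds, 3D={r3d} rounds")
```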
Algorithm: 3D sparse
- Applies to Erdős–Rényi matrices with density δ
- Adaptation of the 3D dense algorithm
- Can be extended to generic sparse matrices

Theorem
The above algorithm requires

    R = δn^{3/4} / (ρ√m) + 1

rounds; the expected shuffle size is 3ρδ²n^{3/2} and the expected reducer size is 3m.
Implementation - Partitioner
[Figure: count of keys per reducer ID (0-31), comparing the 3D partitioner with the naive one]

(i, h, j) → z = iρ√(n/m) + jρ + (h mod ρ)    where ρ = replication, m = subproblem size
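Read as code, with q = √(n/m) blocks per dimension (the radical appears lost in the extracted formula, so this reconstruction is an assumption), the partitioner maps each block-product triple to one of ρ·q² reducers:

```python
def reducer_id(i, h, j, rho, q):
    """3D partitioner sketch. q = sqrt(n/m) blocks per matrix dimension
    (assumed reconstruction). IDs range over 0 .. rho*q*q - 1."""
    return i * rho * q + j * rho + (h % rho)

# With q = 4 and rho = 2, all 32 reducer IDs of the plot are hit,
# each receiving the same number of block products.
ids = [reducer_id(i, h, j, rho=2, q=4)
       for i in range(4) for h in range(4) for j in range(4)]
```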
Questions
Q1: How does the subproblem size affect the performance?
Q2: Does replication affect performance?
Q3: Which are the major factors affecting the running time?
Q4: Does the algorithm scale efficiently?
Q5: Is 3D better than 2D?
Q6: Does the sparse algorithm efficiently exploit the input sparsity?
Experimental setup
In-house cluster
- 16 machines
- 4-core Intel i7 Nehalem @ 3.07 GHz
- 12 GB RAM
- 6 disks of 1 TB at 7200 RPM in RAID0

Amazon EMR

- c3.8xlarge: compute-optimized instances
- i2.xlarge: storage-optimized instances
Q1: Time vs. subproblem size
Increasing the subproblem size reduces the time
[Figure "Local memory size": time (s) vs. local memory size (1000-4000); series: max 16000, max 32000, min 16000, min 32000]
Q2: Time vs. replication
Multi-round algorithms are comparable with monolithic ones
[Figure "Replication vs. time (32k x 32k matrices)": time (s) vs. replication (1, 2, 4, 8)]
Q3: Components cost
The most expensive component is communication
[Figure "Components (32k x 32k matrices)": time (s) vs. replication (the inverse of the number of rounds); components: communication, computation, infrastructure]
Q4: Scalability
The algorithm scales well with the number of machines
[Figure "Scalability (16k x 16k matrices)": time (s) vs. number of machines (4-16); one series per replication value (1, 2, 4)]
Q5: 3D vs. 2D
The 3D algorithm is faster than the 2D approach, as predicted by the analysis
[Figure "Comparison between 2D and 3D (16k x 16k matrices)": time (s) vs. replication (0-16); series: 2D, 3D]
Q6: Sparse
We can effectively leverage the sparsity of input matrices
[Figure "Sparse matrices": time (s) vs. replication (0-16); one series per dimension: 2^20, 2^22, 2^24]
A sneak peek at the future
Changing the platform makes space/round tradeoffs even more evident
[Figure "Multiplication of 64k x 64k matrices on Spark": time (s) vs. replication (1-16)]
Conclusions
Results
- Multi-round algorithms can be comparable with monolithic ones
- Focusing only on the round number may not lead to the best performance if it implies a large amount of communication
- Open-source implementation: M3 - Matrix Multiplication on MapReduce, http://www.dei.unipd.it/m3
Ongoing work
- Test the algorithms on other MapReduce implementations (Spark)