Background
Motivation and Objective
Design and implementation
Performance test
Conclusion and future work
Implementing Generate-Test-and-Aggregate Algorithms on Hadoop
Yu Liu (1), Sebastian Fischer (2), Kento Emoto (3), and Zhenjiang Hu (4)
(1) The Graduate University for Advanced Studies
(2, 4) National Institute of Informatics
(3) University of Tokyo
September 28, 2011
MapReduce
Computation in three phases: map, shuffle and reduce
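As a rough illustration of the three phases, the following single-process Java sketch counts words. It does not use the Hadoop API; the class and method names (ThreePhases, map, shuffle, reduce, wordCount) are chosen here for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration; this is not Hadoop API code.
class ThreePhases {
    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all values that share the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    // Reduce phase: fold the values of one key into a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases one after another on in-memory data.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data between them; here everything runs in one loop purely to show the data flow.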
Programming with MapReduce
Programmers need to implement the following classes (Hadoop)
The main difficulties of MapReduce programming:
Nontrivial problems are usually difficult to compute in a divide-and-conquer fashion
Efficient parallel algorithms are difficult to obtain
Generate Test and Aggregate Algorithm
The Generate-Test-and-Aggregate (GTA for short) algorithm consists of three steps:
generate produces all possible solution candidates.
test filters the intermediate data, keeping only the valid candidates.
aggregate computes a summary of the valid intermediate data.
GTA is a very useful and common strategy for a large class of problems
An Example: Knapsack Problem
Fill a knapsack with items, each of certain value and weight, such that
the total value of packed items is maximal while adhering to a weight
restriction of the knapsack.
(Figure: illustration of the knapsack problem; picture from Wikipedia)
A knapsack program (GTA algorithm):
knapsack = maxvalue ◦ filter ◦ sublists
E.g., suppose there are 3 items: (1kg, $1), (1kg, $2), (2kg, $2)
sublists [(1kg, $1), (1kg, $2), (2kg, $2)]
  = ⟅ [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(1kg, $1), (1kg, $2), (2kg, $2)],
      [(1kg, $1), (2kg, $2)], [(1kg, $2)], [(1kg, $2), (2kg, $2)], [(2kg, $2)] ⟆
Suppose the capacity of the knapsack is 2 kg
filter ⟅ [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(1kg, $1), (1kg, $2), (2kg, $2)],
         [(1kg, $1), (2kg, $2)], [(1kg, $2)], [(1kg, $2), (2kg, $2)], [(2kg, $2)] ⟆
  = ⟅ [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(2kg, $2)], [(1kg, $2)] ⟆
maxvalue ⟅ [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(2kg, $2)], [(1kg, $2)] ⟆ = $3
This program is simple but inefficient because it generates an exponential amount of intermediate data (2^n sublists for n items).
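The naive composition can be written down directly. The following Java sketch is only an illustration of knapsack = maxvalue ◦ filter ◦ sublists and is not taken from the library; items are assumed to be int arrays of the form {weight, value}, and the class name NaiveKnapsack is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the naive GTA program; an item is {weight, value}.
class NaiveKnapsack {
    // generate: all 2^n sublists of the input list.
    static List<List<int[]>> sublists(List<int[]> items) {
        List<List<int[]>> result = new ArrayList<>();
        result.add(new ArrayList<>());
        for (int[] item : items) {
            int size = result.size();
            for (int i = 0; i < size; i++) {
                List<int[]> extended = new ArrayList<>(result.get(i));
                extended.add(item);
                result.add(extended);
            }
        }
        return result;
    }

    // test: keep only the candidates whose total weight fits into the knapsack.
    static List<List<int[]>> filter(List<List<int[]>> candidates, int capacity) {
        List<List<int[]>> valid = new ArrayList<>();
        for (List<int[]> candidate : candidates) {
            int weight = 0;
            for (int[] item : candidate) {
                weight += item[0];
            }
            if (weight <= capacity) {
                valid.add(candidate);
            }
        }
        return valid;
    }

    // aggregate: the maximum total value among the remaining candidates.
    static int maxvalue(List<List<int[]>> candidates) {
        int best = Integer.MIN_VALUE;
        for (List<int[]> candidate : candidates) {
            int value = 0;
            for (int[] item : candidate) {
                value += item[1];
            }
            best = Math.max(best, value);
        }
        return best;
    }

    static int knapsack(List<int[]> items, int capacity) {
        return maxvalue(filter(sublists(items), capacity));
    }

    public static void main(String[] args) {
        List<int[]> items = List.of(new int[] {1, 1}, new int[] {1, 2}, new int[] {2, 2});
        System.out.println(knapsack(items, 2)); // prints 3, as in the example above
    }
}
```

The list returned by sublists doubles in size with every item, which is exactly the cost the fusion approach is meant to avoid.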
Theorems for Generating Efficient Parallel GTA Programs
Efficient parallel programs can be derived from users' naive but correct programs written in terms of a generate, a test, and an aggregate function [Emoto et al., 2011]
aggregate ◦ test ◦ generate ⇒ list homomorphism
List homomorphisms are a class of recursive functions that match the divide-and-conquer paradigm very well [Bird, 87; Cole, 95].
Emoto's theorem holds under the following assumptions:
aggregate is a semiring homomorphism.
test is (almost) a list homomorphism: a homomorphism into a finite monoid followed by a predicate.
generate is polymorphic over semiring structures.
Motivation and Objective
Emoto's fusion theorem shows a possible way to systematically implement efficient parallel programs with the GTA algorithm
We need to evaluate this approach by implementing a practical library, which should
have an easy-to-use programming interface that helps users design GTA algorithms
be able to generate efficient parallel programs on MapReduce (Hadoop)
System Overview
Implementation on Hadoop
We implement the following classes:
MapReducer is an interface for list homomorphisms
h [ ] = id⊕
h [a] = f a
h (x ++ y) = h x ⊕ h y
public interface MapReducer<Elem, Val, Res> {
    public Val identity();                     // id⊕, the value of h [ ]
    public Val element(Elem elem);             // f, the value of h [a]
    public Val combine(Val left, Val right);   // the associative operator ⊕
    public Res postprocess(Val val);           // converts the final Val into the result
}
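To make the roles of the four methods concrete, here is a small Java sketch that implements the interface for the mean of a list and evaluates it by divide and conquer. The Mean class and the fold and run helpers are hypothetical and not part of the library; the interface is repeated so that the example is self-contained.

```java
import java.util.List;

// Hypothetical example; only the MapReducer interface comes from the slides.
class MapReducerDemo {
    interface MapReducer<Elem, Val, Res> {
        Val identity();
        Val element(Elem elem);
        Val combine(Val left, Val right);
        Res postprocess(Val val);
    }

    // The mean is not a list homomorphism by itself, but the pair {sum, count} is.
    // postprocess turns that pair into the final result.
    static class Mean implements MapReducer<Integer, long[], Double> {
        public long[] identity() {
            return new long[] {0, 0};
        }

        public long[] element(Integer elem) {
            return new long[] {elem, 1};
        }

        public long[] combine(long[] left, long[] right) {
            return new long[] {left[0] + right[0], left[1] + right[1]};
        }

        public Double postprocess(long[] val) {
            return (double) val[0] / val[1]; // NaN for the empty list
        }
    }

    // Divide-and-conquer evaluation: because combine is associative,
    // any split point yields the same value.
    static <E, V, R> V fold(MapReducer<E, V, R> h, List<E> xs) {
        if (xs.isEmpty()) {
            return h.identity();
        }
        if (xs.size() == 1) {
            return h.element(xs.get(0));
        }
        int mid = xs.size() / 2;
        return h.combine(fold(h, xs.subList(0, mid)), fold(h, xs.subList(mid, xs.size())));
    }

    static <E, V, R> R run(MapReducer<E, V, R> h, List<E> xs) {
        return h.postprocess(fold(h, xs));
    }

    public static void main(String[] args) {
        System.out.println(run(new Mean(), List.of(1, 2, 3, 4))); // prints 2.5
    }
}
```

The two recursive calls in fold are independent of each other, which is what allows the halves to be evaluated on different nodes.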
Aggregator defines a semiring homomorphism
(A, ⊕, ⊗) → (S, ⊕′, ⊗′)
public interface Aggregator<A, S> {
    public S zero();                     // identity of ⊕′
    public S one();                      // identity of ⊗′
    public S singleton(A a);             // image of a single element
    public S plus(S left, S right);      // ⊕′
    public S times(S left, S right);     // ⊗′
}
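As an illustration, the sketch below gives two possible Aggregator instances and evaluates the same sublists computation over both of them. MaxValue, Count, and the sublists helper are hypothetical names introduced here; the helper merely stands in for a generator that is polymorphic over the semiring and is not the library's Generator class. Items are assumed to be int arrays of the form {weight, value}.

```java
import java.util.List;

// Hypothetical example; only the Aggregator interface comes from the slides.
class SemiringDemo {
    interface Aggregator<A, S> {
        S zero();
        S one();
        S singleton(A a);
        S plus(S left, S right);
        S times(S left, S right);
    }

    static final long NEG_INF = Long.MIN_VALUE; // stands for minus infinity

    // The (max, +) semiring: the best total value among the candidates.
    static class MaxValue implements Aggregator<int[], Long> {
        public Long zero() { return NEG_INF; }
        public Long one() { return 0L; }
        public Long singleton(int[] item) { return (long) item[1]; }
        public Long plus(Long left, Long right) { return Math.max(left, right); }
        public Long times(Long left, Long right) {
            // minus infinity is a zero of times
            return (left == NEG_INF || right == NEG_INF) ? NEG_INF : left + right;
        }
    }

    // The (+, ×) semiring: the number of candidates.
    static class Count implements Aggregator<int[], Long> {
        public Long zero() { return 0L; }
        public Long one() { return 1L; }
        public Long singleton(int[] item) { return 1L; }
        public Long plus(Long left, Long right) { return left + right; }
        public Long times(Long left, Long right) { return left * right; }
    }

    // All sublists, expressed only with semiring operations:
    // each element is either left out (one) or taken (singleton).
    static <A, S> S sublists(Aggregator<A, S> agg, List<A> items) {
        S acc = agg.one();
        for (A item : items) {
            acc = agg.times(acc, agg.plus(agg.one(), agg.singleton(item)));
        }
        return acc;
    }

    public static void main(String[] args) {
        List<int[]> items = List.of(new int[] {1, 1}, new int[] {1, 2}, new int[] {2, 2});
        System.out.println(sublists(new Count(), items));    // prints 8
        System.out.println(sublists(new MaxValue(), items)); // prints 5
    }
}
```

Neither run materializes the 8 sublists; swapping the Aggregator changes what is computed about them without changing the generator.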
Test is almost a list homomorphism; it extends MapReducer
public interface Test<Elem, Key> extends MapReducer<Elem, Key, Boolean> {}
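A weight limit is a typical Test. The sketch below is a hypothetical instance, not library code: the key is the total weight, capped at capacity + 1 so that only finitely many keys exist, and postprocess is the final Boolean check. It assumes non-negative weights and items given as int arrays of the form {weight, value}.

```java
import java.util.List;

// Hypothetical example; only the two interfaces come from the slides.
class WeightLimitDemo {
    interface MapReducer<Elem, Val, Res> {
        Val identity();
        Val element(Elem elem);
        Val combine(Val left, Val right);
        Res postprocess(Val val);
    }

    interface Test<Elem, Key> extends MapReducer<Elem, Key, Boolean> {}

    // Key = total weight, capped at capacity + 1 (assumes non-negative weights).
    static class WeightLimit implements Test<int[], Integer> {
        private final int capacity;

        WeightLimit(int capacity) {
            this.capacity = capacity;
        }

        public Integer identity() {
            return 0;
        }

        public Integer element(int[] item) {
            return Math.min(item[0], capacity + 1);
        }

        public Integer combine(Integer left, Integer right) {
            return Math.min(left + right, capacity + 1);
        }

        public Boolean postprocess(Integer key) {
            return key <= capacity;
        }
    }

    // Applies a Test to one candidate list with a simple left-to-right fold.
    static <E, K> boolean accepts(Test<E, K> test, List<E> candidate) {
        K key = test.identity();
        for (E item : candidate) {
            key = test.combine(key, test.element(item));
        }
        return test.postprocess(key);
    }

    public static void main(String[] args) {
        Test<int[], Integer> limit = new WeightLimit(2);
        System.out.println(accepts(limit, List.of(new int[] {1, 1}, new int[] {1, 2}))); // prints true
        System.out.println(accepts(limit, List.of(new int[] {1, 1}, new int[] {2, 2}))); // prints false
    }
}
```

Capping the key keeps the number of distinct keys at capacity + 2, which bounds the size of the per-key tables used after filter embedding.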
Generator implements a MapReducer
polymorphic over semirings: the constructor takes an Aggregator
filter embedding: the embed function returns a new Generator
public abstract class Generator<Elem, Single, Val, Res>
        implements MapReducer<Elem, Val, Res> {
    // The constructor takes an instance of Aggregator
    public Generator(Aggregator<Single, Val> aggregator) { ... }

    // Takes an instance of Test and returns a new instance of Generator
    public <Key> Generator<Elem, Single, WritableMap<Key, Val>, Res>
            embed(final Test<Single, Key> test) {
        final Generator<Elem, Single, Val, Res> base = this;
        return new Generator<Elem, Single, WritableMap<Key, Val>, Res>
                (new Aggregator<Single, WritableMap<Key, Val>>() { ... }) { ... };
    }
    public Val process(List<Elem> list) { ... }
    ...
}
1 Users write their own Generator, Test, and Aggregator by extending or implementing the ones provided by the library. (The library also provides commonly used Generators and Aggregators.)
2 An instance of Generator, which is also an efficient list homomorphism, is created at run time on each worker node
3 This list homomorphism instance can be executed by Hadoop in parallel
Java Code
Let’s have a look at the actual implementation of GTA Knapsack...
Performance Evaluation
Environment: hardware
We configured clusters with 2, 4, 8, 16, and 32 nodes (virtual machines). Each computing/data node has one CPU (VM, [email protected], 1 core) and 3 GB of memory.
Test data
10^2 × 2^20 (≈ 10^8) knapsack items (3.2 GB)
Each item's weight is between 0 and 10, and the capacity of the knapsack is 100.
Evaluation on Hadoop
The knapsack program scales well as the number of nodes in the cluster increases
Conclusion
The implementation of the GTA library on Hadoop can
hide the technical details of MapReduce (Hadoop)
perform parallelization and optimization automatically
generate MapReduce programs that scale well
make coding, testing, and code reuse much simpler
Future Work
Optimization of current framework to gain better performance
Extension of current framework
Other approaches of systematic parallel programming
Thanks
Questions?
The project is hosted on http://screwdriver.googlecode.com
Appendix: The Computation on Semiring
Definition (Semiring)
Given a set S and two binary operations ⊕ and ⊗, the triple (S, ⊕, ⊗) is called a semiring if and only if
(S, ⊕) is a commutative monoid with identity element id⊕
(S, ⊗) is a monoid with identity element id⊗
⊗ is associative and distributes over ⊕
id⊕ is a zero of ⊗: id⊕ ⊗ a = a ⊗ id⊕ = id⊕
(Int, +, ×) is a semiring; (Int ∪ {−∞}, max, +) is another semiring
Definition (Semiring homomorphism)
Given two semirings (S, ⊕, ⊗) and (S′, ⊕′, ⊗′), a function hom : S → S′ is a semiring homomorphism from (S, ⊕, ⊗) to (S′, ⊕′, ⊗′) iff it is a monoid homomorphism from (S, ⊕) to (S′, ⊕′) and also a monoid homomorphism from (S, ⊗) to (S′, ⊗′).
Theorem (Filter-Embedding Fusion)
Given a set A, a finite monoid (M, ⊙), a monoid homomorphism hom from ([A], ++) to (M, ⊙), a semiring (S, ⊕, ⊗), a semiring homomorphism aggregate from (⟅[A]⟆, ⊎, ×++) to (S, ⊕, ⊗), a function ok : M → Bool, and a polymorphic semiring generator generate, the following equation holds:
aggregate ◦ filter (ok ◦ hom) ◦ generate_{⊎, ×++} (λx → ⟅[x]⟆)
  = postprocess_M ok ◦ generate_{⊕_M, ⊗_M} (λx → aggregate_M ⟅[x]⟆)
The result of the fusion is an efficient algorithm in the form of a list homomorphism.
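For the knapsack example, the right-hand side of the theorem can be sketched by hand as follows. This is an illustrative reading of the fusion result and not code produced by the library: the monoid M is taken to be the total weight capped at capacity + 1, and the lifted semiring value is a table that maps each capped weight to the best value reachable with that weight. The class name FusedKnapsack and its method names are hypothetical, and non-negative weights and values are assumed. Items are int arrays of the form {weight, value}.

```java
import java.util.Arrays;

// Hypothetical hand-written sketch of the fused knapsack; an item is {weight, value}.
class FusedKnapsack {
    static final long NEG_INF = Long.MIN_VALUE; // "no candidate has this weight"

    // Value for the empty list: only the empty selection, with weight 0 and value 0.
    static long[] identity(int capacity) {
        long[] table = new long[capacity + 2];
        Arrays.fill(table, NEG_INF);
        table[0] = 0;
        return table;
    }

    // Value for a one-element list: either skip the item or take it.
    static long[] element(int[] item, int capacity) {
        long[] table = identity(capacity);
        int key = Math.min(item[0], capacity + 1);
        table[key] = Math.max(table[key], item[1]);
        return table;
    }

    // Associative combination: add weights (capped), add values, keep the best per weight.
    static long[] combine(long[] left, long[] right) {
        int overweight = left.length - 1; // the key capacity + 1
        long[] out = new long[left.length];
        Arrays.fill(out, NEG_INF);
        for (int i = 0; i < left.length; i++) {
            if (left[i] == NEG_INF) continue;
            for (int j = 0; j < right.length; j++) {
                if (right[j] == NEG_INF) continue;
                int key = Math.min(i + j, overweight);
                out[key] = Math.max(out[key], left[i] + right[j]);
            }
        }
        return out;
    }

    // Final step: the best value among all weights that pass the test.
    static long postprocess(long[] table, int capacity) {
        long best = NEG_INF;
        for (int key = 0; key <= capacity; key++) {
            best = Math.max(best, table[key]);
        }
        return best;
    }

    // Divide-and-conquer evaluation over items[from..to).
    static long[] fold(int[][] items, int from, int to, int capacity) {
        if (from == to) return identity(capacity);
        if (to - from == 1) return element(items[from], capacity);
        int mid = (from + to) / 2;
        return combine(fold(items, from, mid, capacity), fold(items, mid, to, capacity));
    }

    static long knapsack(int[][] items, int capacity) {
        return postprocess(fold(items, 0, items.length, capacity), capacity);
    }

    public static void main(String[] args) {
        int[][] items = {{1, 1}, {1, 2}, {2, 2}};
        System.out.println(knapsack(items, 2)); // prints 3
    }
}
```

Each table has capacity + 2 entries regardless of the number of items, so the work grows linearly with the input instead of exponentially.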
List Homomorphism
List homomorphisms [Bird, 87; Cole, 95] are a class of recursive functions.
Definition of List Homomorphism
A function h is a list homomorphism if there is an associative operator ⊙ such that, for any lists x and y,
h (x ++ y) = h(x) ⊙ h(y),
where ++ is list concatenation, h [a] = f a, h(x) ⊙ id⊙ = h(x), and id⊙ is an identity element of ⊙.
Instance of a list homomorphism
sum [a] = a
sum (x ++ y) = sum x + sum y
A list homomorphism can be automatically parallelized by MapReduce [Yu et al., EuroPar11].
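The same idea can be tried on a single machine. The sketch below uses Java parallel streams as a stand-in for MapReduce, so it illustrates the principle rather than the Hadoop implementation; the class name ParallelSum is hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Illustration with java.util.stream; this is not Hadoop code.
class ParallelSum {
    // sum as a list homomorphism: f maps an element to a long,
    // the operator is +, and the identity is 0.
    static long sum(List<Integer> xs) {
        return xs.parallelStream()
                 .mapToLong(Integer::longValue)
                 .reduce(0L, Long::sum);
    }

    public static void main(String[] args) {
        List<Integer> xs = IntStream.rangeClosed(1, 100).boxed().collect(Collectors.toList());
        System.out.println(sum(xs)); // prints 5050
    }
}
```

The runtime is free to split the list at arbitrary points because + is associative and 0 is its identity, which are the two properties the definition of a list homomorphism requires.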
Evaluation on Hadoop
We test 3.2 GB of data on clusters of {2, 4, 8, 16, 32} nodes and 32 GB of data on clusters of {32, 64} nodes

              2 nodes   4 nodes   8 nodes   16 nodes   32 nodes   64 nodes
time (sec.)   1602      882       482       317        961        511
speedup       –         ×1.82     ×1.83     ×1.52      –          ×1.88