
Background
Motivation and Objective
Design and implementation
Performance test
Conclusion and future work

Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Yu Liu1, Sebastian Fischer2, Kento Emoto3, and Zhenjiang Hu4

1The Graduate University for Advanced Studies
2,4National Institute of Informatics
3University of Tokyo

September 28, 2011






MapReduce

Computation in three phases: map, shuffle and reduce
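The three phases can be illustrated without a cluster. The following is a minimal single-process sketch in plain Java, not Hadoop code: the class `MapReducePhases`, its method names, and the word-count task are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of map, shuffle, and reduce.
class MapReducePhases {
    // map phase: emit a (word, 1) pair for every word in one input line
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // shuffle phase: group all emitted values by their key
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // reduce phase: fold the values of one key into a single result
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Runs the three phases in order on a list of input lines
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : shuffle(pairs).entrySet()) {
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

On a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; the sketch only shows the data flow.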






Programming with MapReduce

Programmers need to implement the following classes (Hadoop)






Programming with MapReduce

The main difficulties of MapReduce programming:

Nontrivial problems are usually difficult to solve in a divide-and-conquer fashion

Efficient parallel algorithms are difficult to obtain






Generate Test and Aggregate Algorithm

The Generate-Test-and-Aggregate (GTA for short) algorithm consists of three steps:

generate produces all possible solution candidates.

test filters out the invalid candidates.

aggregate computes a summary of the valid candidates.







GTA is a very useful and common strategy for a large class of problems






An Example: Knapsack Problem

Fill a knapsack with items, each of certain value and weight, such that

the total value of packed items is maximal while adhering to a weight

restriction of the knapsack.

[Figure: knapsack illustration; picture from Wikipedia]







A knapsack program (GTA algorithm):

knapsack = maxvalue ◦ filter ◦ sublists









E.g., there are 3 items: (1kg, $1), (1kg, $2), (2kg, $2)

sublists [(1kg, $1), (1kg, $2), (2kg, $2)]
  = ⟅ [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(1kg, $1), (1kg, $2), (2kg, $2)],
      [(1kg, $1), (2kg, $2)], [(1kg, $2)], [(1kg, $2), (2kg, $2)], [(2kg, $2)] ⟆









Suppose the capacity of the knapsack is 2 kg

filter ⟅ [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(1kg, $1), (1kg, $2), (2kg, $2)],
         [(1kg, $1), (2kg, $2)], [(1kg, $2)], [(1kg, $2), (2kg, $2)], [(2kg, $2)] ⟆
  = ⟅ [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(2kg, $2)], [(1kg, $2)] ⟆









maxvalue ⟅ [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(2kg, $2)], [(1kg, $2)] ⟆ = $3









This program is simple but inefficient because it generates exponentially many intermediate candidates (2^n for n items).
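The naive composition can be written down directly. The following Java sketch follows knapsack = maxvalue ◦ filter ◦ sublists literally; the class name `NaiveKnapsack` and the encoding of an item as an `int[]{weight, value}` are assumptions made for illustration, not part of the library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical naive GTA knapsack; each item is an int[]{weight, value}.
class NaiveKnapsack {
    // generate: build all 2^n sublists of the item list
    static List<List<int[]>> sublists(List<int[]> items) {
        List<List<int[]>> result = new ArrayList<>();
        result.add(new ArrayList<>());
        for (int[] item : items) {
            List<List<int[]>> extended = new ArrayList<>();
            for (List<int[]> s : result) {
                List<int[]> withItem = new ArrayList<>(s);
                withItem.add(item);
                extended.add(withItem);
            }
            result.addAll(extended); // doubles the number of candidates
        }
        return result;
    }

    // Sum of one field (0 = weight, 1 = value) over a candidate
    static int total(List<int[]> candidate, int field) {
        int sum = 0;
        for (int[] item : candidate) sum += item[field];
        return sum;
    }

    // test: keep only the candidates whose total weight fits the capacity
    static List<List<int[]>> filter(List<List<int[]>> candidates, int capacity) {
        List<List<int[]>> valid = new ArrayList<>();
        for (List<int[]> c : candidates) {
            if (total(c, 0) <= capacity) valid.add(c);
        }
        return valid;
    }

    // aggregate: maximum total value among the remaining candidates
    static int maxvalue(List<List<int[]>> candidates) {
        int best = Integer.MIN_VALUE;
        for (List<int[]> c : candidates) best = Math.max(best, total(c, 1));
        return best;
    }

    static int knapsack(List<int[]> items, int capacity) {
        return maxvalue(filter(sublists(items), capacity));
    }

    public static void main(String[] args) {
        List<int[]> items = List.of(new int[]{1, 1}, new int[]{1, 2}, new int[]{2, 2});
        System.out.println(knapsack(items, 2)); // prints 3
    }
}
```

For the three example items the generate step already yields 8 candidates, of which 5 pass the test; each additional item doubles the candidate count.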






Theorems for Generating Efficient Parallel GTA Programs

Efficient parallel programs can be derived from users’ naive but correct programs written in terms of a generate, a test, and an aggregate function [Emoto et al., 2011]

aggregate ◦ test ◦ generate ⇒ list homomorphism

List homomorphisms are a class of recursive functions that match the divide-and-conquer paradigm very well [Bird, 87; Cole, 95].






Emoto’s theorem holds under the following assumptions:

aggregate is a semiring homomorphism.

test is a list homomorphism.

generate is polymorphic over semiring structures.





Motivation and Objective

Emoto’s fusion theorem shows a possible way to systematically implement efficient parallel programs with GTA algorithms






We need to evaluate this approach by implementing a practical library, which should

have an easy-to-use programming interface that helps users design GTA algorithms

be able to generate efficient parallel programs on MapReduce (Hadoop)





System Overview


Implementation on Hadoop

We implement the following classes:





MapReducer is an interface for list homomorphisms

h [ ] = id⊕
h [a] = f a
h (x ++ y) = h x ⊕ h y

public interface MapReducer<Elem, Val, Res> {
    public Val identity();
    public Val element(Elem elem);
    public Val combine(Val left, Val right);
    public Res postprocess(Val val);
}
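As a small illustration of how such an interface might be used, the sketch below implements the sum homomorphism and evaluates it over a list that has been split into chunks. The `MapReducer` interface is copied from the slide; the enclosing class `ListHomDemo`, the `Sum` class, and the sequential `run` driver are assumptions that stand in for the Hadoop machinery.

```java
import java.util.List;

// Hypothetical demo; only the MapReducer interface comes from the slides.
class ListHomDemo {
    interface MapReducer<Elem, Val, Res> {
        Val identity();
        Val element(Elem elem);
        Val combine(Val left, Val right);
        Res postprocess(Val val);
    }

    // sum as a list homomorphism: id = 0, f a = a, combine = +
    static class Sum implements MapReducer<Integer, Integer, Integer> {
        public Integer identity() { return 0; }
        public Integer element(Integer elem) { return elem; }
        public Integer combine(Integer left, Integer right) { return left + right; }
        public Integer postprocess(Integer val) { return val; }
    }

    // Sequential stand-in for a parallel run: fold each chunk independently,
    // then combine the partial results in chunk order.
    static <E, V, R> R run(MapReducer<E, V, R> h, List<List<E>> chunks) {
        V total = h.identity();
        for (List<E> chunk : chunks) {
            V partial = h.identity();
            for (E e : chunk) partial = h.combine(partial, h.element(e));
            total = h.combine(total, partial);
        }
        return h.postprocess(total);
    }

    public static void main(String[] args) {
        List<List<Integer>> chunks =
            List.of(List.of(1, 2), List.of(3), List.<Integer>of(), List.of(4, 5));
        System.out.println(run(new Sum(), chunks)); // prints 15
    }
}
```

Because combine is associative with identity as its unit, the result does not depend on where the list is cut into chunks, which is what makes the per-chunk work independent.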







Aggregator defines a semiring homomorphism

(A, ⊕, ⊗) → (S, ⊕′, ⊗′)

public interface Aggregator<A, S> {
    public S zero();
    public S one();
    public S singleton(A a);
    public S plus(S left, S right);
    public S times(S left, S right);
}
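To make the five operations concrete, here is a hedged sketch of a maximum-value aggregator for the knapsack example, using max as plus and + as times. The `Aggregator` interface is copied from the slide; the class `MaxSumDemo`, the `MaxValue` implementation, the `int[]{weight, value}` item encoding, and the `sublists` helper (which expresses the sublists generator as a product of `one ⊕ singleton(item)` factors) are assumptions for illustration.

```java
import java.util.List;

// Hypothetical demo; only the Aggregator interface comes from the slides.
class MaxSumDemo {
    interface Aggregator<A, S> {
        S zero();
        S one();
        S singleton(A a);
        S plus(S left, S right);
        S times(S left, S right);
    }

    // Sentinel standing in for minus infinity (the value of an empty bag)
    static final long NEG_INF = Long.MIN_VALUE;

    // Maximum total value; items are int[]{weight, value}
    static class MaxValue implements Aggregator<int[], Long> {
        public Long zero() { return NEG_INF; }                      // no candidate at all
        public Long one() { return 0L; }                            // only the empty list
        public Long singleton(int[] item) { return (long) item[1]; }
        public Long plus(Long l, Long r) { return Math.max(l, r); } // choose the better one
        public Long times(Long l, Long r) {                         // concatenate candidates
            return (l == NEG_INF || r == NEG_INF) ? NEG_INF : l + r;
        }
    }

    // Sublists generator written against an arbitrary semiring:
    // the product over all items of (one plus singleton(item))
    static <A, S> S sublists(Aggregator<A, S> agg, List<A> items) {
        S acc = agg.one();
        for (A a : items) acc = agg.times(acc, agg.plus(agg.one(), agg.singleton(a)));
        return acc;
    }

    public static void main(String[] args) {
        List<int[]> items = List.of(new int[]{1, 1}, new int[]{1, 2}, new int[]{2, 2});
        // maxvalue of all sublists, with no weight test: every item is taken
        System.out.println(sublists(new MaxValue(), items)); // prints 5
    }
}
```

Evaluating the generator directly in the target semiring takes one pass over the items and never builds the 2^n candidates.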







Test is almost a list homomorphism: it extends MapReducer, with Boolean as the result type

public interface Test<Elem, Key> extends MapReducer<Elem, Key, Boolean> {}
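A weight limit is a typical test that fits this shape. In the sketch below the key is the total weight, capped at capacity + 1 so that only finitely many keys can occur, and postprocess decides acceptance. The two interfaces are copied from the slides; the class `WeightTestDemo`, the `WeightLimit` implementation, the `accepts` helper, and the `int[]{weight, value}` item encoding are assumptions for illustration.

```java
import java.util.List;

// Hypothetical demo; only MapReducer and Test come from the slides.
class WeightTestDemo {
    interface MapReducer<Elem, Val, Res> {
        Val identity();
        Val element(Elem elem);
        Val combine(Val left, Val right);
        Res postprocess(Val val);
    }

    interface Test<Elem, Key> extends MapReducer<Elem, Key, Boolean> {}

    // Accepts item lists whose total weight is at most the capacity.
    // Keys range over 0..capacity+1, so the key monoid is finite.
    static class WeightLimit implements Test<int[], Integer> {
        private final int capacity;
        WeightLimit(int capacity) { this.capacity = capacity; }

        private int cap(int w) { return Math.min(w, capacity + 1); }

        public Integer identity() { return 0; }
        public Integer element(int[] item) { return cap(item[0]); }
        public Integer combine(Integer left, Integer right) { return cap(left + right); }
        public Boolean postprocess(Integer key) { return key <= capacity; }
    }

    // Folds the test over one candidate list and reports the verdict
    static <E, K> boolean accepts(Test<E, K> test, List<E> candidate) {
        K key = test.identity();
        for (E e : candidate) key = test.combine(key, test.element(e));
        return test.postprocess(key);
    }

    public static void main(String[] args) {
        Test<int[], Integer> limit = new WeightLimit(2);
        System.out.println(accepts(limit, List.of(new int[]{1, 1}, new int[]{1, 2}))); // prints true
        System.out.println(accepts(limit, List.of(new int[]{1, 1}, new int[]{2, 2}))); // prints false
    }
}
```

Capping the weight loses no information for the test, since every weight above the capacity is rejected in the same way.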






Generator implements a MapReducer

polymorphic over semirings: the constructor takes an Aggregator

filter embedding: the embed function returns a new Generator

public abstract class Generator<Elem, Single, Val, Res>
        implements MapReducer<Elem, Val, Res> {
    // The constructor takes an instance of Aggregator
    public Generator(Aggregator<Single, Val> aggregator) { ... }

    // Takes an instance of Test and returns a new instance of Generator
    public <Key> Generator<Elem, Single, WritableMap<Key, Val>, Res>
            embed(final Test<Single, Key> test) {
        final Generator<Elem, Single, Val, Res> base = this;
        return new Generator<Elem, Single, WritableMap<Key, Val>, Res>
            (new Aggregator<Single, WritableMap<Key, Val>>() { ... }) { ... };
    }

    public Val process(List<Elem> list) { ... }
    ...
}
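The idea behind embed can be sketched on the knapsack example: the embedded aggregator works on maps from test keys to aggregated values, so the test is evaluated while aggregating and no candidate list is ever materialised. The code below is an illustrative specialisation, not the library's implementation: it uses a plain `TreeMap` in place of `WritableMap`, hard-wires the capped-weight test and the max-value aggregator, and every name in it is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of filter embedding for knapsack.
// A semiring element maps a capped total weight (the test key) to the
// best total value among all candidates that have this key.
class FilterEmbeddingDemo {
    // Multiplicative unit: only the empty selection (weight 0, value 0)
    static Map<Integer, Long> one() {
        Map<Integer, Long> m = new TreeMap<>();
        m.put(0, 0L);
        return m;
    }

    // One candidate consisting of a single item int[]{weight, value}
    static Map<Integer, Long> singleton(int[] item, int capacity) {
        Map<Integer, Long> m = new TreeMap<>();
        m.put(Math.min(item[0], capacity + 1), (long) item[1]);
        return m;
    }

    // plus: union of candidates; values under the same key are merged with max
    static Map<Integer, Long> plus(Map<Integer, Long> a, Map<Integer, Long> b) {
        Map<Integer, Long> m = new TreeMap<>(a);
        b.forEach((k, v) -> m.merge(k, v, Long::max));
        return m;
    }

    // times: combine every pair of entries; keys add (capped), values add
    static Map<Integer, Long> times(Map<Integer, Long> a, Map<Integer, Long> b, int capacity) {
        Map<Integer, Long> m = new TreeMap<>();
        for (Map.Entry<Integer, Long> x : a.entrySet()) {
            for (Map.Entry<Integer, Long> y : b.entrySet()) {
                int key = Math.min(x.getKey() + y.getKey(), capacity + 1);
                m.merge(key, x.getValue() + y.getValue(), Long::max);
            }
        }
        return m;
    }

    static long knapsack(List<int[]> items, int capacity) {
        // Sublists generator evaluated in the embedded semiring
        Map<Integer, Long> acc = one();
        for (int[] item : items) {
            acc = times(acc, plus(one(), singleton(item, capacity)), capacity);
        }
        // Postprocess: keep only the keys accepted by the test, then take the max
        long best = Long.MIN_VALUE;
        for (Map.Entry<Integer, Long> e : acc.entrySet()) {
            if (e.getKey() <= capacity) best = Math.max(best, e.getValue());
        }
        return best;
    }

    public static void main(String[] args) {
        List<int[]> items = List.of(new int[]{1, 1}, new int[]{1, 2}, new int[]{2, 2});
        System.out.println(knapsack(items, 2)); // prints 3
    }
}
```

Each map holds at most capacity + 2 entries, so the work per item is bounded by the square of that number rather than by the number of candidates.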






1. Users need to write their own Generator, Test, and Aggregator by extending/implementing the ones provided by the library¹

2. An instance of Generator, which is also an efficient list homomorphism, will be created at run-time on each worker node

3. This list homomorphism instance can be executed by Hadoop in parallel

¹ Our library provides commonly used Generators and Aggregators.




Java Code

Let’s have a look at the actual implementation of GTA Knapsack...





Performance Evaluation

Environment: hardware

We configured clusters with 2, 4, 8, 16, and 32 nodes (virtual machines). Each computing/data node has one CPU (VM, [email protected], 1 core) and 3 GB of memory.

Test data

10^2 × 2^20 (≈ 10^8) knapsack items (3.2 GB)

Each item’s weight is between 0 and 10, and the capacity of the knapsack is 100.





Evaluation on Hadoop

The Knapsack program scales well as the number of cluster nodes increases





Conclusion

The implementation of the GTA library on Hadoop can

hide the technical details of MapReduce (Hadoop)

parallelize and optimize programs automatically

generate MapReduce programs that scale well

make coding, testing, and code reuse much simpler





Future Work

Optimization of current framework to gain better performance

Extension of current framework

Other approaches to systematic parallel programming





Thanks

Questions?

The project is hosted on http://screwdriver.googlecode.com





Appendix: The Computation on Semiring

Definition (Semiring)

Given a set S and two binary operations ⊕ and ⊗, the triple (S, ⊕, ⊗) is called a semiring if and only if

(S, ⊕) is a commutative monoid with identity element id⊕

(S, ⊗) is a monoid with identity element id⊗

⊗ is associative and distributes over ⊕

id⊕ is a zero of ⊗: id⊕ ⊗ a = a ⊗ id⊕ = id⊕

(Int, +, ×) is a semiring; (Int ∪ {−∞}, max, +) is another semiring

Definition (Semiring homomorphism)

Given two semirings (S, ⊕, ⊗) and (S′, ⊕′, ⊗′), a function hom : S → S′ is a semiring homomorphism from (S, ⊕, ⊗) to (S′, ⊕′, ⊗′) iff it is a monoid homomorphism from (S, ⊕) to (S′, ⊕′) and also a monoid homomorphism from (S, ⊗) to (S′, ⊗′).





Theorem (Filter-Embedding Fusion)

Given a set A, a finite monoid (M, ⊙), a monoid homomorphism hom from ([A], ++) to (M, ⊙), a semiring (S, ⊕, ⊗), a semiring homomorphism aggregate from (⟅[A]⟆, ⊎, ×++) to (S, ⊕, ⊗), a function ok : M → Bool, and a polymorphic semiring generator generate, the following equation holds:

aggregate ◦ filter (ok ◦ hom) ◦ generate_{⊎, ×++} (λx → ⟅[x]⟆)
  = postprocess_M ok ◦ generate_{⊕M, ⊗M} (λx → aggregate_M ⟅[x]⟆)

The result of the fusion is an efficient algorithm in the form of a list homomorphism.





List Homomorphism

List homomorphisms [Bird, 87; Cole, 95] are a class of recursive functions.

Definition of List Homomorphism

A function h is a list homomorphism if there is an associative operator ⊙ such that, for any lists x and y,

h (x ++ y) = h(x) ⊙ h(y),

where ++ is list concatenation, h [a] = f a, and h(x) ⊙ id⊙ = h(x), with id⊙ the identity element of ⊙.







Instance of a list homomorphism

sum [a] = a
sum (x ++ y) = sum x + sum y







A list homomorphism can be automatically parallelized by MapReduce [Yu et al., EuroPar11].
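The same property can be observed on a single machine. The sketch below is not MapReduce: it uses Java parallel streams as a local stand-in, where the three-argument `Stream.reduce` takes exactly the ingredients of a list homomorphism (an identity, a per-element step, and an associative combiner). The class `ParallelHomDemo` and the sum-of-squares example are assumptions for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical local analogue of parallel list-homomorphism evaluation.
class ParallelHomDemo {
    // h [ ] = 0, h [a] = a * a, h (x ++ y) = h x + h y
    static long sumOfSquares(List<Integer> xs) {
        return xs.parallelStream().reduce(
            0L,                                                // identity of the combiner
            (acc, a) -> acc + a.longValue() * a.longValue(),   // fold one element into a partial result
            Long::sum);                                        // combine partial results of two sublists
    }

    public static void main(String[] args) {
        List<Integer> xs = IntStream.rangeClosed(1, 1000).boxed().collect(Collectors.toList());
        System.out.println(sumOfSquares(xs));
    }
}
```

The runtime is free to split the list wherever it likes, and associativity of the combiner guarantees the same result as a sequential fold.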





Evaluation on Hadoop

We test 3.2 GB of data on clusters of {2, 4, 8, 16, 32} nodes and 32 GB of data on clusters of {32, 64} nodes

              2 nodes   4 nodes   8 nodes   16 nodes   32 nodes   64 nodes
time (sec.)      1602       882       482        317        961        511
speedup             –    × 1.82    × 1.83     × 1.52          –     × 1.88

Each speedup is relative to the column to its left.