Using Derivation-Free Optimization in the Hadoop Cluster with Terasort

Renato dos Santos Alves & Sarosh Farjam

Design of Experiments (Projeto de Experimentos) ~ 03.7.2014

Sequence
• Abstract
• Introduction
• Workload Analysis of Search Engines
• Benchmarking Methodology and Decisions
• Scalable Data Generation Tool
• Case Studies
• Conclusions

Introduction
• Implementation of the MapReduce cluster benchmark TeraSort driven by a DFO method.
• Each iteration of the DFO method produces new values for the Hadoop configuration parameters.
• Because these parameters are set within the framework, we need a tool that applies them across the cluster so that the TeraSort application runs correctly.
• Chef server and Chef client

TeraSort Benchmark
TeraSort includes 3 MapReduce applications:
● TeraGen: generates the data.
● TeraSort: samples the input data and uses the samples with MapReduce to sort the data.
● TeraValidate: validates that the output is sorted.

DFO Method
• Derivative-free optimization is a branch of mathematical optimization.
• It refers to problems for which derivative information is unavailable, or to methods that do not use derivatives.
• The derivative of a function of a real variable measures the sensitivity to change of a quantity (the dependent variable) that is determined by another quantity (the independent variable). For example, the derivative of the position of a moving object with respect to time is the object's velocity.

Algorithm BOBYQA
• BOBYQA (Bound Optimization BY Quadratic Approximation) is a numerical optimization algorithm by Michael J. D. Powell.
• It is also the name of Powell's Fortran 77 implementation of the algorithm.
• BOBYQA solves bound-constrained optimization problems without using derivatives of the objective function, which makes it a derivative-free algorithm.
• The algorithm uses a trust-region method that forms quadratic models by interpolation. One new point is computed on each iteration, usually by solving a trust-region subproblem subject to the bound constraints.

Algorithm COBYLA
• Constrained Optimization BY Linear Approximation (COBYLA) is a numerical optimization method for constrained problems where the derivative of the objective function is not known.
• It was invented by Michael J. D. Powell while he was working for Westland Helicopters.
• COBYLA proceeds by iteratively approximating the actual constrained optimization problem with linear programming problems.

Hadoop Environment
• A physical cluster with 29 nodes was used:
• A master Hadoop server (running the JobTracker and NameNode services)
• 28 Hadoop slaves (dedicated to the TaskTracker and DataNode services)
• 2 Gigabit Ethernet provides the connectivity between the 29 nodes

Hadoop Environment
• A front-end server provides access to the cluster. It is configured as a Chef Server and is also used to orchestrate the DFO TeraSort executions: it synchronizes the DFO runs and updates the Hadoop parameter settings after each iteration of the DFO method.

Experiment Execution
• Nemesis, a server that is not part of the cluster, is used as the front end for running the TeraSort application, running the DFO method, and updating the Hadoop settings based on its output.
• The synchronization of the TeraSort executions and the Hadoop updates with the output of the DFO method is performed by the dfo_hadoop_terasort application, which runs on the front-end server.
• dfo_hadoop_terasort is supplied with an input file that contains the initial values of the Hadoop configuration parameters, constraints that keep these values from reaching undesired ranges for the objective function, a tolerance for the constraints, and the maximum number of iterations. By processing the input file and interacting with the Hadoop cluster, it discovers which parameter values have the greatest impact on the execution speed of TeraSort, and it writes an output file with the best configuration parameters found.
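As a rough illustration of the measurement step in the loop described above, the sketch below times one run of a shell command in whole seconds. It is not the dfo_hadoop_terasort source, which the slides do not show: the function name `timed_run_seconds`, the use of `system()`, and the convention of returning -1 on failure are all assumptions. Writing the candidate parameter values to the cluster (done through Chef in the experiment) is outside this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/*
 * Hypothetical helper: runs `command` through the system shell and returns
 * the elapsed wall-clock time in seconds (one-second resolution).
 * Returns -1.0 if the command exits with a non-zero status or cannot run.
 */
double timed_run_seconds(const char *command)
{
    assert(command != NULL);

    time_t start = time(NULL);
    int status = system(command);   /* blocks until the command finishes */
    time_t end = time(NULL);

    if (status != 0)
        return -1.0;                /* treat any failure as "no measurement" */
    return difftime(end, start);
}
```

A DFO driver could call such a helper once per iteration, after the new parameter values have been applied, and hand the returned time to the optimizer as the value of the objective function.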

Experiment Execution
• The cluster was composed of 28 slave servers, each with two processors, for a total of 56 available processing slots. It was decided to keep 10% of this total free for tasks that have to be executed more than once because of failures. Therefore, we used about 100 gigabytes of data generated by Hadoop TeraGen.
• To compare the optimization of job execution time, two DFO methods, BOBYQA and COBYLA, were run, aiming to identify which method best suits the TeraSort application provided by Hadoop.
• Two runs were made with each algorithm, with 50 iterations, to identify at which point the executions converge to a better runtime.

Switching COBYLA, BOBYQA
• /* COBYLA algorithm: Constrained Optimization BY Linear Approximations */
• opt = nlopt_create(NLOPT_LN_COBYLA, N);
• //opt = nlopt_create(NLOPT_LN_BOBYQA, N);
• nlopt_set_lower_bounds(opt, lb);
• nlopt_set_upper_bounds(opt, ub);
• nlopt_set_max_objective(opt, objetivo, NULL);

Experiment Execution

Commands
• First: generate the TeraSort input data
• TeraGen will generate approximately 100 GB (100 000 179 688 bytes)
• $ hadoop jar $HADOOP_HOME/hadoop-*examples*.jar teragen <number of 100-byte rows> <output dir>
• $ hadoop jar hadoop-examples-1.0.4.jar teragen 1000000000 terasort-input
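The row count passed to teragen follows from the fixed 100-byte row size shown in the usage line. The sketch below spells out that arithmetic. The function names are hypothetical, and it covers only the row payload; it does not try to account for the slightly larger byte count quoted on the slide.

```c
#include <assert.h>
#include <stdint.h>

/* TeraGen writes fixed-size rows of 100 bytes each. */
#define TERAGEN_ROW_BYTES 100ULL

/* Rows to request from teragen for a target size in bytes.
   Integer division rounds down, so the payload never exceeds the target. */
uint64_t teragen_rows_for_bytes(uint64_t target_bytes)
{
    return target_bytes / TERAGEN_ROW_BYTES;
}

/* Payload size in bytes produced by a given number of rows. */
uint64_t teragen_bytes_for_rows(uint64_t rows)
{
    assert(rows <= UINT64_MAX / TERAGEN_ROW_BYTES);  /* guard against overflow */
    return rows * TERAGEN_ROW_BYTES;
}
```

With these helpers, a target of 100 000 000 000 bytes gives the 1000000000 rows used in the command above.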

Commands
• Second: run the dfo_hadoop_terasort application

• [hadoop@nemesis otimizacao]$ nohup sudo time ./dfo_hadoop_terasort < entrada > log_execucao_terasort &

Results
• We used the DFO method with the BOBYQA and COBYLA algorithms.
• The main difference between them is the variation in the execution time of the jobs at each iteration of the dfo_hadoop_terasort application.
• This difference stems mainly from how each algorithm approximates the objective function from the sampled points: in quadratic form (BOBYQA) or in linear form (COBYLA).

TeraSort 50 A
Iteration   Time F (s)
1           1491
2           1501
3           1447
4           1889
5           2076
6           1466
7           1470
8           1319
9           1897
10          1611
11          1440
12          1588
13          1321
14          1897
15          1289
16          1704
17          1294
18          1313
19          1728
20          1971
21          1842

[Figure: TeraSort 50 A using BOBYQA, execution progress. x-axis: number of iterations (1 to 21); y-axis: time in seconds (1000 to 2400); labeled point: 1289.]
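The value labeled in the chart, 1289, is the smallest time in the TeraSort 50 A table. A minimal sketch of how the best iteration can be picked out of such a run log is shown below; the function and array names are hypothetical, and the data is copied from the TeraSort 50 A table.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the zero-based index of the smallest value in times[0..n-1].
   On ties, the first occurrence wins. */
size_t best_iteration(const int *times, size_t n)
{
    assert(times != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (times[i] < times[best])
            best = i;
    }
    return best;
}

/* Times in seconds for run "TeraSort 50 A" (iterations 1 to 21). */
static const int run_a_times[] = {
    1491, 1501, 1447, 1889, 2076, 1466, 1470,
    1319, 1897, 1611, 1440, 1588, 1321, 1897,
    1289, 1704, 1294, 1313, 1728, 1971, 1842
};
```

Calling `best_iteration(run_a_times, 21)` yields index 14, which is iteration 15 in the one-based numbering of the table.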

TeraSort 50 C
Iteration   Time F (s)
1           1587
2           1721
3           1473
4           1669
5           1801
6           1833
7           1486
8           1709
9           1510
10          1962
11          1988
12          1934
13          1898
14          2277
15          1933
16          1516
17          1601
18          1561
19          1639
20          1515
21          1507
22          2205
23          1838
24          2419
25          1744
26          1566
27          1619
28          1890
29          1988
30          1875
31          1620
32          1780
33          1607
34          1536
35          1621
36          1580
37          1626
38          1675
39          2065

[Figure: TeraSort 50 C using BOBYQA, execution progress. x-axis: number of iterations (1 to 38); y-axis: time in seconds (1000 to 3000); labeled point: 1473.]

TeraSort 50 B
Iteration   Time F (s)
1           1437
2           1343
3           1335
4           1228
5           1213
6           1240
7           1198
8           1203
9           1231
10          1178
11          1174
12          1187
13          1186
14          1204
15          1128
16          1150
17          1190
18          1165
19          1190
20          1208
21          1204
22          1113
23          1171
24          1185
25          1190
26          1170
27          1155
28          1211
29          1159
30          1198
31          1206
32          1144
33          1177
34          1179
35          1232
36          1157
37          1201
38          1150
39          1195
40          1178
41          1237
42          1196
43          1233
44          1356
45          1400
46          1674
47          1424
48          1365
49          1366
50          1320

[Figure: TeraSort 50 B using COBYLA, execution progress. x-axis: number of iterations (1 to 50); y-axis: time in seconds (1000 to 1800); labeled point: 1113.]

TeraSort 50 New
Iteration   Time F (s)
1           1442
2           1298
3           1285
4           1285
5           1274
6           1329
7           1343
8           1314
9           1289
10          1308
11          1304
12          1322
13          1345
14          1421
15          1369
16          1336
17          1348
18          1335
19          1333
20          1307
21          1367
22          1369
23          1352
24          1362
25          1390
26          1350
27          1324
28          1382
29          1347
30          1339

[Figure: TeraSort 50 New using COBYLA, execution progress. x-axis: number of iterations (1 to 30); y-axis: time in seconds (1000 to 1800); labeled point: 1274.]

[Figures: execution progress (time in seconds against number of iterations) for four runs: TeraSort 50 New using COBYLA, TeraSort 50 B using COBYLA, TeraSort 50 A using BOBYQA, and TeraSort 50 C using BOBYQA.]

With the DFO method run under the BOBYQA and COBYLA algorithms, the main difference observed is the variation in the execution time of the jobs at each iteration of the dfo_hadoop_terasort application. It stems mainly from how each algorithm approximates the objective function from the sampled points: in quadratic or linear form, respectively.

[Figure: Difference between Algorithms COBYLA and BOBYQA. Series: TeraSort 50 New/COBYLA, TeraSort 50 B/COBYLA, TeraSort 50 A/BOBYQA, TeraSort 50 C/BOBYQA. x-axis: iterations (1 to 30); y-axis: time in seconds (1000 to 2400).]

Conclusion
• The convergence of the total time proves to be more stable with COBYLA, without many fluctuations, when compared to the BOBYQA algorithm.
• The speedup of the BOBYQA algorithm in the execution of the TeraSort application is 12% on average.
• The results reported by the COBYLA algorithm in the execution of the TeraSort application show a speedup of 21.15% on average over the initial Hadoop settings, a greater optimization than that of the BOBYQA algorithm.
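The slides do not state how the speedup percentages were computed. One common reading, assumed here, is the relative reduction in runtime against a baseline. The sketch below implements that definition together with a plain average; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumed definition of speedup as a percentage:
 *   100 * (baseline - tuned) / baseline
 * A positive result means the tuned run was faster than the baseline.
 */
double speedup_percent(double baseline_seconds, double tuned_seconds)
{
    assert(baseline_seconds > 0.0);
    return 100.0 * (baseline_seconds - tuned_seconds) / baseline_seconds;
}

/* Arithmetic mean of n values, e.g. to average speedups across runs. */
double mean(const double *values, size_t n)
{
    assert(values != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += values[i];
    return sum / (double)n;
}
```

For example, `speedup_percent(1437, 1113)` compares the first and the best iteration of run TeraSort 50 B. Such inputs are only illustrative: since the slides do not say which baselines or runs entered the 12% and 21.15% averages, this sketch is not claimed to reproduce them.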

