
Transcript of WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Page 1: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Benchmarking Virtualized Hadoop Clusters

Todor Ivanov, Roberto V. Zicari Big Data Lab, Goethe University Frankfurt

Alejandro Buchmann Database and Distributed Systems, TU Darmstadt

5th Workshop on Big Data Benchmarking, 2014

Page 2: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Outline

• Virtualizing Hadoop

• Measuring Performance
  – Iterative Experimental Approach
  – Platform Setup
  – Experiments
  – Summary of Results

• Lessons Learned

• Next Steps


Page 3: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Virtualizing Hadoop

• Motivation

– Hadoop-as-a-service (e.g. Amazon Elastic MapReduce)

– Automated deployment and cost-effective management

– Dynamically scalable cluster size (e.g. # of nodes, resource allocation)

• Challenges

– I/O overhead

– Network overhead (message communication and data transfer)

• Related Work: virtualized vs. physical Hadoop. Virtualized Hadoop has an estimated overhead of 2-10% (reported in [1], [2], [3]).


[1] Buell, J.: A Benchmarking Case Study of Virtualized Hadoop Performance on VMware vSphere 5. Tech. White Paper, VMware Inc. (2011).
[2] Buell, J.: Virtualized Hadoop Performance with VMware vSphere 5.1. Tech. White Paper, VMware Inc. (2013).
[3] Microsoft: Performance of Hadoop on Windows in Hyper-V Environments. Tech. White Paper, Microsoft (2013).

Page 4: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Objectives of Our Research

Investigate and compare the performance between standard and separated data-compute cluster configurations.

• How does the application performance change on a data-compute cluster?

• What types of applications are more suitable for data-compute clusters?


(Figure: Standard cluster vs. Data-Compute cluster architecture.)

Page 5: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Methodology: Iterative Experimental Approach

I. Choose a Big Data Benchmark

II. Configure Hadoop Cluster

III. Perform Experiments

IV. Evaluate Results
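As an illustration only, steps I-IV can be driven by a simple loop over cluster configurations, workloads, data sizes and repetitions. The sketch below is hypothetical: the run_workload() placeholder and the results.csv output are assumptions, not the tooling used in this work.

```python
# Hypothetical driver for the iterative approach: for every cluster configuration,
# workload and input size, run the benchmark several times and record the results
# for the evaluation step. run_workload() is a placeholder for whatever actually
# provisions the cluster (e.g. via Serengeti) and launches the HiBench run.
import csv

CLUSTERS      = ["Standard1", "Standard2", "Data-Comp1", "Data-Comp2", "Data-Comp3"]
WORKLOADS     = ["WordCount", "TestDFSIOEnh-read", "TestDFSIOEnh-write"]
DATA_SIZES_GB = [10, 20, 50]
REPETITIONS   = 3

def run_workload(cluster: str, workload: str, size_gb: int) -> float:
    """Placeholder: run one benchmark and return its wall-clock time in seconds."""
    raise NotImplementedError("hook up the real cluster and benchmark commands here")

def main() -> None:
    with open("results.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["cluster", "workload", "size_gb", "run", "seconds"])
        for cluster in CLUSTERS:
            for workload in WORKLOADS:
                for size_gb in DATA_SIZES_GB:
                    for run in range(1, REPETITIONS + 1):
                        writer.writerow([cluster, workload, size_gb, run,
                                         run_workload(cluster, workload, size_gb)])

if __name__ == "__main__":
    main()
```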


Page 6: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Step I: Intel HiBench

• Benchmark suite for Hadoop (developed by Intel in 2010) (Huang et al. [4])

• 4 categories, 10 workloads & 3 types

• Metrics: Time (Sec) & Throughput (Bytes/Sec) (illustrated in the sketch following the table)

Category         | No | Workload                 | Tools         | Type
Micro Benchmarks | 1  | Sort                     | MapReduce     | I/O Bound
Micro Benchmarks | 2  | WordCount                | MapReduce     | CPU Bound
Micro Benchmarks | 3  | TeraSort                 | MapReduce     | Mixed
Micro Benchmarks | 4  | TestDFSIOEnhanced        | MapReduce     | I/O Bound
Web Search       | 5  | Nutch Indexing           | Nutch, Lucene | Mixed
Web Search       | 6  | PageRank                 | Pegasus       | Mixed
Machine Learning | 7  | Bayesian Classification  | Mahout        | Mixed
Machine Learning | 8  | K-means Clustering       | Mahout        | Mixed
Analytical Query | 9  | Join                     | Hive          | Mixed
Analytical Query | 10 | Aggregation              | Hive          | Mixed


[4] Huang, S. et al.: The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. Data Engineering Workshops (ICDEW), 2010
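As a small illustration of the two metrics, a run's time and throughput relate as sketched below. This is not HiBench code; the launch callable and the measure() helper are made up for the example.

```python
# Illustrative only: HiBench reports Time (sec) and Throughput (bytes/sec) per run.
import time
from typing import Callable

def measure(launch: Callable[[], None], input_bytes: int) -> tuple[float, float]:
    """Run one workload and return (duration_seconds, throughput_bytes_per_second)."""
    start = time.perf_counter()
    launch()  # run the workload to completion
    duration = time.perf_counter() - start
    return duration, input_bytes / duration

if __name__ == "__main__":
    # Dummy 100 MB "run" that just sleeps for 0.1 s.
    secs, bps = measure(lambda: time.sleep(0.1), 100 * 1024**2)
    print(f"{secs:.2f} s, {bps / 1024**2:.0f} MB/s")
```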

Page 7: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Step II: Platform Setup

• Platform layer (Hadoop Cluster)
  – vSphere Big Data Extensions integrating Serengeti Server (version 1.0)
  – VM template hosting CentOS
  – Apache Hadoop (version 1.2.1) with default parameters (see the sketch at the end of this page):
    • 200 MB Java heap size
    • 64 MB block size
    • replication factor of 3

• Management layer (Virtualization)
  – VMware vSphere 5.1
  – ESXi and vCenter Servers

• Hardware layer: Dell PowerEdge T420 server
  – 2 x Intel Xeon E5-2420 (1.9 GHz) 6-core CPUs
  – 32 GB RAM
  – 4 x 1 TB WD SATA disks

(Figure: layered setup; the HiBench benchmark (Application) runs on the Hadoop cluster (Platform), on top of the virtualization layer (Management), on top of the server hardware with its CPUs, memory and storage.)
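As a sketch only, and assuming the standard Hadoop 1.x property names, the listed defaults correspond to configuration entries like the following. This is not the authors' deployment code; Serengeti generates these files itself.

```python
# A minimal sketch: the Hadoop 1.x properties matching the defaults listed above,
# written out as the *-site.xml files a provisioned node would carry.
from xml.etree import ElementTree as ET

DEFAULTS = {
    "hdfs-site.xml": {
        "dfs.block.size": str(64 * 1024 * 1024),  # 64 MB HDFS block size
        "dfs.replication": "3",                    # replication factor of 3
    },
    "mapred-site.xml": {
        "mapred.child.java.opts": "-Xmx200m",      # 200 MB task JVM heap
    },
}

def write_site_file(path: str, props: dict) -> None:
    """Write a Hadoop-style <configuration> file with the given properties."""
    conf = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.ElementTree(conf).write(path, xml_declaration=True, encoding="utf-8")

if __name__ == "__main__":
    for fname, props in DEFAULTS.items():
        write_site_file(fname, props)
```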


Page 8: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

(Known) Limitations

• Single physical server (no physical network)

• VMware ESXi server hypervisor

• Testing with default configurations (Serengeti & Hadoop)

• Time constraints:
  – Input data sizes: 10/20/50 GB
  – 3 test repetitions


Page 9: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Step II: Comparison Factors

The number of utilized VMs in the compared clusters should be equal.

• Each additional VM increases the hypervisor overhead (reported in [2], [5], [6])

• Utilizing more VMs may improve the overall system performance [2]

The hardware resources utilized by the compared clusters should be equal.


[2] Buell, J.: Virtualized Hadoop Performance with VMware vSphere 5.1. Tech. White Paper, VMware Inc. (2013).
[5] Li, J. et al.: Performance Overhead Among Three Hypervisors: An Experimental Study using Hadoop Benchmarks. Big Data (BigData Congress), 2013.
[6] Ye, K. et al.: vHadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration. Cluster Computing Workshops (CLUSTER WORKSHOPS), 2012.
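A minimal sketch of how these two comparison factors could be checked between candidate configurations; the NodeSpec type and the node sizes below are purely illustrative, not the actual allocations used in the experiments.

```python
# Illustrative check of the comparison factors: equal total hardware resources,
# and (optionally) an equal number of VMs, since each extra VM adds hypervisor overhead.
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeSpec:
    vcpus: int
    ram_gb: int
    disk_gb: int

def totals(cluster: list[NodeSpec]) -> tuple[int, int, int]:
    """Sum the hardware resources allocated to all VMs of a cluster."""
    return (sum(n.vcpus for n in cluster),
            sum(n.ram_gb for n in cluster),
            sum(n.disk_gb for n in cluster))

def comparable(a: list[NodeSpec], b: list[NodeSpec], same_vm_count: bool) -> bool:
    """Equal total resources; optionally also an equal number of VMs."""
    if same_vm_count and len(a) != len(b):
        return False
    return totals(a) == totals(b)

# Example: a 3-VM standard cluster vs. a 2+1 data-compute cluster with the same totals.
standard = [NodeSpec(2, 4, 100)] * 3
data_compute = [NodeSpec(2, 4, 50)] * 2 + [NodeSpec(2, 4, 200)]
print(comparable(standard, data_compute, same_vm_count=True))  # True
```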

Page 10: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Step II: Comparison Standard1/Data-Compute1

(Figure: Standard1 vs. Data-Compute1 cluster, compared with equal 1) utilized hardware resources and 2) number of utilized VMs; ∆ denotes the difference in performance.)


Page 11: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Step II: Comparison Standard2/Data-Compute3

(Figure: Standard2 vs. Data-Compute3 cluster, compared with equal 1) utilized hardware resources and 2) number of utilized VMs; ∆ denotes the difference in performance.)


Page 12: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Step II: Comparison Data-Compute1/2/3

(Figure: comparison of the Data-Compute1/2/3 clusters with equal 1) utilized hardware resources; ∆ denotes the difference in performance.)


Page 13: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Step II: All Cluster Configurations
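The overview figure on this slide is not recoverable from the transcript. As a reading aid, the sketch below lists the five configurations as they can be inferred from the later result slides (e.g. "2CW & 1DW", "3 VMs", "6 VMs"); mapping compute workers to TaskTracker-only VMs and data workers to DataNode-only VMs is an assumption based on Hadoop 1.x / Serengeti conventions.

```python
# Cluster configurations as inferred from the result slides (worker VMs only).
CLUSTER_CONFIGS = {
    # Standard clusters: each worker VM runs both a DataNode and a TaskTracker.
    "Standard1":  {"combined_workers": 3},
    "Standard2":  {"combined_workers": 6},
    # Data-Compute clusters: compute workers (TaskTracker only) and
    # data workers (DataNode only) run in separate VMs.
    "Data-Comp1": {"compute_workers": 2, "data_workers": 1},
    "Data-Comp2": {"compute_workers": 2, "data_workers": 2},
    "Data-Comp3": {"compute_workers": 3, "data_workers": 3},
}

def worker_vm_count(name: str) -> int:
    """Total number of worker VMs in a configuration."""
    return sum(CLUSTER_CONFIGS[name].values())

for name in CLUSTER_CONFIGS:
    print(name, worker_vm_count(name), "worker VMs")
```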


Page 14: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Step III & IV: CPU Bound - WordCount

• Configuration: 4 map/1 reduce tasks, 10/20/50 GB input data sizes

• Times normalized with respect to baseline Standard1

• 38-47% better performance for Data-Compute cluster

• Data-Compute1 (2CW & 1DW) ≈ Data-Compute2 (2CW & 2DW)

Equal number of VMs (3 VMs: Standard1 vs. Data-Comp1; 6 VMs: Standard2 vs. Data-Comp3):

Data Size (GB) | Diff. (%) Standard1/Data-Comp1 | Diff. (%) Standard2/Data-Comp3
10             | -40                            | -38
20             | -41                            | -42
50             | -43                            | -47


WordCount running times as ratio to Standard1 (lower is better):

Data Size (GB) | Standard1 | Standard2 | Data-Comp1 | Data-Comp2 | Data-Comp3
10             | 1.00      | 1.75      | 0.71       | 0.71       | 1.26
20             | 1.00      | 1.74      | 0.71       | 0.71       | 1.22
50             | 1.00      | 1.74      | 0.70       | 0.70       | 1.19
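The Diff. (%) values in this and the following result slides appear to be computed from the two normalized times as sketched below; the formula is inferred from the reported numbers, not stated on the slides.

```python
# Inferred, not stated on the slides: for a comparison "A/B" the tables appear to
# report (time_B - time_A) / time_B * 100, so negative values mean B is faster.
def diff_pct(time_a: float, time_b: float) -> float:
    return (time_b - time_a) / time_b * 100.0

# Example with the 50 GB WordCount ratios: Standard1 = 1.00, Data-Comp1 = 0.70.
print(round(diff_pct(1.00, 0.70)))  # -43, matching the table above
```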

Page 15: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Step III & IV: Read I/O Bound – TestDFSIOEnh (1)

• Configuration: 100MB file size, 10/20/50 GB input data sizes

• Read times normalized with respect to baseline Standard1

• Standard1 (Standard Cluster) performs best

Equal number of VMs (3 VMs: Standard1 vs. Data-Comp1; 6 VMs: Standard2 vs. Data-Comp3):

Data Size (GB) | Diff. (%) Standard1/Data-Comp1 | Diff. (%) Standard2/Data-Comp3
10             | 68                             | -18
20             | 71                             | -30
50             | 73                             | -46


TestDFSIOEnh read times as ratio to Standard1 (lower is better):

Data Size (GB) | Standard1 | Standard2 | Data-Comp1 | Data-Comp2 | Data-Comp3
10             | 1.00      | 1.83      | 3.08       | 1.51       | 1.55
20             | 1.00      | 1.93      | 3.39       | 1.71       | 1.48
50             | 1.00      | 1.87      | 3.66       | 1.78       | 1.28

Page 16: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Step III & IV: Read I/O Bound – TestDFSIOEnh (2)

• Configuration: 100MB file size, 10/20/50 GB input data sizes

• Read times normalized with respect to baseline Standard1

• Read times: Data-Comp1 (2CW & 1DW) > Data-Comp2 (2CW & 2DW) > Data-Comp3 (3CW & 3DW)

More data nodes improve read performance in a Data-Compute cluster.

Different number of VMs (Data-Comp1: 3 VMs vs. Data-Comp2: 4 VMs; Data-Comp2: 4 VMs vs. Data-Comp3: 6 VMs):

Data Size (GB) | Diff. (%) Data-Comp1/2 | Diff. (%) Data-Comp2/3
10             | -104                   | 3
20             | -99                    | -15
50             | -106                   | -39


(Chart: TestDFSIOEnh read times as ratio to Standard1; same data as the chart on Page 15.)

Page 17: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Step III & IV: Write I/O Bound – TestDFSIOEnh (1)

• Configuration: 100MB file size, 10/20/50 GB input data sizes

• Write times normalized with respect to baseline Standard1

• Data-Compute cluster (Data-Comp1, Data-Comp3) performs better

Equal number of VMs (3 VMs: Standard1 vs. Data-Comp1; 6 VMs: Standard2 vs. Data-Comp3):

Data Size (GB) | Diff. (%) Standard1/Data-Comp1 | Diff. (%) Standard2/Data-Comp3
10             | -10                            | 4
20             | -21                            | -14
50             | -24                            | -1


TestDFSIOEnh write times as ratio to Standard1 (lower is better):

Data Size (GB) | Standard1 | Standard2 | Data-Comp1 | Data-Comp2 | Data-Comp3
10             | 1.00      | 0.84      | 0.91       | 0.73       | 0.87
20             | 1.00      | 1.08      | 0.83       | 0.86       | 0.95
50             | 1.00      | 1.00      | 0.81       | 0.95       | 0.99

Page 18: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Step III & IV: Write I/O Bound – TestDFSIOEnh (2)

• Configuration: 100MB file size, 10/20/50 GB input data sizes

• Write times normalized with respect to baseline Standard1

• Data-Comp1 (2CW & 1DW) < Data-Comp3 (3CW & 3DW): the two extra data worker nodes increase the write overhead by up to 19% in a Data-Compute cluster.

• Data-Comp3 (6 VMs) outperforms Standard1 (3 VMs)

Different number of VMs (Data-Comp1: 3 VMs vs. Data-Comp3: 6 VMs; Standard1: 3 VMs vs. Data-Comp3: 6 VMs):

Data Size (GB) | Diff. (%) Data-Comp1/3 | Diff. (%) Standard1/Data-Comp3
10             | -4                     | -15
20             | 13                     | -6
50             | 19                     | -1


(Chart: TestDFSIOEnh write times as ratio to Standard1; same data as the chart on Page 17.)

Page 19: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Summary of Results

• Compute-intensive (i.e. CPU bound) workloads are suitable for Data-Compute clusters. (up to 47% faster)

• Read-intensive (i.e. read I/O bound) workloads are suitable for Standard clusters.

– For Data-Compute clusters, adding more data nodes improves the read performance (up to 39% better, e.g. Data-Compute2/Data-Compute3).

• Write-intensive (i.e. write I/O bound) workloads are suitable for Data-Compute clusters (up to 15% faster, e.g. Standard1/Data-Compute3).

– A lower number of data nodes results in better write performance.


Page 20: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Lessons Learned

• Factors influencing cluster performance*:

– Overall number of virtual nodes (VMs) in a cluster

– Choosing cluster type (Standard or Data-Compute Hadoop cluster)

– Number of nodes for each type (compute and data nodes) in a Data-Compute cluster

* Note: subject to the known limitations (slide 8).


Page 21: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Next Steps

• Repeat the experiments on a virtualized multi-node cluster

• Evaluate virtualized performance with other workloads

• Experiments with larger data sets

• Repeat the experiments using other hypervisors and cloud platforms (e.g. OpenStack)


Page 22: WBDB 2014 Benchmarking Virtualized Hadoop Clusters

Thank you!

Questions & Feedback are very welcome!

Contact info:

Todor Ivanov [email protected] http://www.bigdata.uni-frankfurt.de/
