Apache Hadoop India Summit 2011 talk "Hadoop Simulation and Performance" by Ranjit Mathew

Hadoop Simulation and Performance

Apache Hadoop India Summit 2011

Ranjit Mathew, Yahoo! R & D India

Copyright © 2011 Yahoo! All rights reserved.

Overview

Introduction

GridMix3

PigMix2

Tips

Plans

Q & A

Introduction

Why?

Capacity Planning

Benchmarking

Comparative evaluation of releases

Basis for improvements

Debugging

Performance Evaluation Techniques

Analytical Modeling

› Use statistics, queuing theory, etc. to model system

› Use models to predict behavior

Simulation

› Simulate work-load based on representation or traces

› Benchmarking used to compare variants

Measurement

› Use metrics gathered from tools and logs

› Measure under peak, regular and light work-loads

Ref.: “The Art of Computer Systems Performance Analysis”, Raj K. Jain (Wiley, 1991)
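As an illustration of analytical modeling, here is a minimal sketch that treats a single service point as an M/M/1 queue and predicts utilization and mean response time. The M/M/1 model and the function name `mm1_metrics` are assumptions chosen for illustration; the talk does not prescribe a particular model.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state metrics of an M/M/1 queue.

    Rates are in jobs per unit time. This is a textbook model,
    not something taken from the talk itself.
    """
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("arrival_rate must be >= 0 and service_rate > 0")
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    utilization = arrival_rate / service_rate
    mean_jobs_in_system = utilization / (1 - utilization)
    mean_response_time = 1 / (service_rate - arrival_rate)
    return utilization, mean_jobs_in_system, mean_response_time


if __name__ == "__main__":
    # Hypothetical numbers: 8 jobs/hour arriving, 10 jobs/hour capacity.
    rho, n, r = mm1_metrics(arrival_rate=8.0, service_rate=10.0)
    print(f"utilization={rho:.2f} jobs_in_system={n:.2f} response_time={r:.2f}h")
```

Such a model is cheap to evaluate but only as good as its assumptions, which is one reason simulation and measurement are listed alongside it.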

Hadoop Performance Evaluation Tools

GridMix3

PigMix2

TeraSort / GraySort

DFSIO, NNBench, S-Live

HiBench

etc.

GridMix3

GridMix Evolution

GridMix1 (HADOOP-2369):

• Representative mix of Jobs

• mapreduce/src/benchmarks/gridmix

GridMix2 (HADOOP-3770):

• More configurable; uses JobControl

• mapreduce/src/benchmarks/gridmix2

GridMix3 (MAPREDUCE-776):

• Trace-based; better emulation-accuracy

• mapreduce/src/contrib/gridmix

Rumen (MAPREDUCE-751):

• Supporting tool for GridMix3 et al.

• mapreduce/src/tools/org/apache/hadoop/tools/rumen

GridMix3

Macro benchmark for Hadoop

Trace-based submission of synthetic Jobs

Traces based on production clusters

Traces generated by Rumen

No access to original Job’s code or data

Emulates I/O and other aspects

Highly configurable

Rumen

Comprises:

› TraceBuilder - Job Traces from Job History and Configuration

› Folder - Scales Job Traces to a given time-window

Job Traces are in JSON format

Insulation for release-to-release changes in format and contents

Statistical information on Jobs in Trace

Provides API to access Job Traces
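To make the idea of scaling a Job Trace to a time-window concrete, here is a simplified sketch that linearly rescales job submission times into a target window. It is not Rumen's Folder algorithm; the field names `jobID` and `submitTime` and the function `scale_trace` are illustrative assumptions and may not match Rumen's actual schema or API.

```python
import json


def scale_trace(jobs, window_ms):
    """Linearly rescale submit times so the trace spans [0, window_ms].

    `jobs` is a list of dicts with a numeric "submitTime" field
    (an assumed field name, in milliseconds).
    """
    if not jobs:
        return []
    ordered = sorted(jobs, key=lambda job: job["submitTime"])
    start = ordered[0]["submitTime"]
    span = ordered[-1]["submitTime"] - start
    scaled = []
    for job in ordered:
        copy = dict(job)  # leave the input trace untouched
        offset = job["submitTime"] - start
        copy["submitTime"] = round(offset * window_ms / span) if span > 0 else 0
        scaled.append(copy)
    return scaled


if __name__ == "__main__":
    # Hypothetical three-job trace held in memory as JSON.
    sample = json.loads(
        '[{"jobID": "job_a", "submitTime": 1000},'
        ' {"jobID": "job_b", "submitTime": 4000},'
        ' {"jobID": "job_c", "submitTime": 7000}]'
    )
    for job in scale_trace(sample, window_ms=600):
        print(job["jobID"], job["submitTime"])
```

A linear rescale keeps the relative spacing of jobs; a real tool may instead crop or overlay portions of the trace.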

GridMix3 Flow

[Diagram: Job Histories & Configuration from the Production Cluster are processed by Rumen into a Job Trace; GridMix3 (Data Generator and Job Submitter) uses the Job Trace to drive the Benchmark Cluster.]

GridMix3 Architecture

[Diagram: GridMix3 components and related elements: Rumen, JobStory, JobFactory, GridmixJob, JobSubmitter, JobMonitor, MapReduce Job, JobTracker, Status.]

GridMix3 Emulation-Accuracy

Submission Policies and Job Types

Submission policy determines when Jobs are submitted:

› STRESS - Keep the cluster under stress (but do not overwhelm it)

› REPLAY - Faithful emulation of inter-job submission times

› SERIAL - Submit a Job only after the previous one finishes

Types of synthetic Jobs:

› LOADJOB - Emulates work-load from Job Trace

› SLEEPJOB - Does no work; sleeps for the periods given in the Job Trace
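The difference between REPLAY and SERIAL can be sketched as two ways of deriving submission times from the same trace. This is a toy model, not GridMix3 code: the dict keys `name`, `submit_time` and `run_time` and both function names are hypothetical, and STRESS is omitted because it depends on live cluster load.

```python
def replay_schedule(jobs):
    """REPLAY-style: keep the original gaps between submissions, starting at t=0."""
    ordered = sorted(jobs, key=lambda job: job["submit_time"])
    if not ordered:
        return []
    base = ordered[0]["submit_time"]
    return [(job["name"], job["submit_time"] - base) for job in ordered]


def serial_schedule(jobs):
    """SERIAL-style: submit each job only when the previous one has finished."""
    ordered = sorted(jobs, key=lambda job: job["submit_time"])
    schedule = []
    clock = 0.0
    for job in ordered:
        schedule.append((job["name"], clock))
        clock += job["run_time"]  # assumed run time on the benchmark cluster
    return schedule


if __name__ == "__main__":
    # Hypothetical trace: times are in seconds.
    trace = [
        {"name": "a", "submit_time": 100.0, "run_time": 30.0},
        {"name": "b", "submit_time": 110.0, "run_time": 50.0},
        {"name": "c", "submit_time": 160.0, "run_time": 10.0},
    ]
    print("REPLAY:", replay_schedule(trace))
    print("SERIAL:", serial_schedule(trace))
```

In this toy trace, job b starts 10 seconds after job a under REPLAY (overlapping with it), while under SERIAL it waits the full 30 seconds for job a to finish.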

PigMix2

PigMix Evolution

PigMix1:

• Representative mix of 12 Pig scripts and Java programs

• http://wiki.apache.org/pig/PigMix

• http://wiki.apache.org/pig/DataGeneratorHadoop

PigMix2 (PIG-200):

• Added 5 Pig scripts and Java programs

• Re-factored data-generation

PigMix2

Benchmark for Pig

Representative mix of 17 Pig scripts

Corresponding native MapReduce Java programs

Specifications-based input-data generator
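Because each Pig script has a corresponding native MapReduce program, one natural way to read the results is as a per-script runtime ratio. The sketch below assumes runtimes have already been collected; the function name `pig_to_mapreduce_ratios` and all timings are hypothetical, not measured PigMix2 results.

```python
def pig_to_mapreduce_ratios(pig_seconds, mapreduce_seconds):
    """Return Pig runtime divided by native MapReduce runtime, per script.

    Scripts missing from either input, or with a non-positive
    MapReduce runtime, are skipped rather than guessed at.
    """
    ratios = {}
    for script, pig_time in pig_seconds.items():
        mr_time = mapreduce_seconds.get(script)
        if mr_time is None or mr_time <= 0:
            continue
        ratios[script] = pig_time / mr_time
    return ratios


if __name__ == "__main__":
    # Hypothetical timings in seconds for two scripts.
    pig = {"L1": 120.0, "L2": 90.0}
    native = {"L1": 100.0, "L2": 60.0}
    for script, ratio in sorted(pig_to_mapreduce_ratios(pig, native).items()):
        print(f"{script}: Pig takes {ratio:.2f}x the native MapReduce time")
```

A ratio near 1.0 would mean the Pig script performs about as well as the hand-written program for that query.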

PigMix2 Flow

[Diagram: the Data Generator produces Input Data, which PigMix2 runs against on the Benchmark Cluster.]

Tips

Minimize Variance

Check hardware, especially for failing hard-drives

Use large data-sets to minimize effects of overheads

Beware of speculative execution

Set ipc.ping.interval to 5000 (HADOOP-5380)

Use appropriate PARALLEL clause in PigMix2 Pig scripts

Several runs needed for proper analysis
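To judge whether variance is under control across several runs, one simple check is the coefficient of variation of the measured runtimes. This is a minimal sketch using Python's standard `statistics` module; the function name `summarize_runs` and the sample runtimes are hypothetical.

```python
import statistics


def summarize_runs(runtimes_s):
    """Summarize repeated benchmark runs (runtimes in seconds)."""
    if len(runtimes_s) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(runtimes_s)
    stdev = statistics.stdev(runtimes_s)  # sample standard deviation
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


if __name__ == "__main__":
    # Hypothetical runtimes from three runs of the same benchmark.
    summary = summarize_runs([100.0, 110.0, 90.0])
    print(f"mean={summary['mean']:.1f}s stdev={summary['stdev']:.1f}s cv={summary['cv']:.1%}")
```

If the coefficient of variation is large compared with the difference being measured between two releases, more runs or a quieter cluster are probably needed before drawing conclusions.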

Apples to Apples Comparison

Benchmarking versus Production Cluster:

› Same hardware

› Same software stack

› Same configuration

› Similar networking

› Same size (might not be feasible)

Extrapolating results can be tricky

Plans

Future Work

Greater emulation-accuracy in GridMix3:

› Distributed Cache

› Compression

› CPU usage

› Memory usage

More comprehensive Job Traces from Rumen

Integration of PigMix2 with Pig Statistics

Q & A

Ranjit Mathew

Senior Principal Engineer
