Apache Hadoop India Summit 2011 talk "Hadoop Simulation and Performance" by Ranjit Mathew

Hadoop Simulation and Performance

Apache Hadoop India Summit 2011

Ranjit Mathew, Yahoo! R & D India

Copyright © 2011 Yahoo! All rights reserved.

Overview

Introduction

GridMix3

PigMix2

Tips

Plans

Q & A

Introduction

Why?

Capacity Planning

Benchmarking

Comparative evaluation of releases

Basis for improvements

Debugging

Performance Evaluation Techniques

Analytical Modeling

› Use statistics, queuing theory, etc. to model system

› Use models to predict behavior

Simulation

› Simulate work-load based on representation or traces

› Benchmarking used to compare variants

Measurement

› Use metrics gathered from tools and logs

› Measure under peak, regular and light work-loads

Ref.: “The Art of Computer Systems Performance Analysis”, Raj K. Jain (Wiley, 1991)
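As a hedged illustration of the analytical-modeling approach, the sketch below treats a single service point as an M/M/1 queue. This is a textbook queuing model chosen for the example; the talk does not say which models were used, and the arrival and service rates are made-up numbers rather than measurements from any cluster.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Return (utilization, mean jobs in system, mean response time) for an M/M/1 queue.

    Both rates are in jobs per unit time. The model assumes Poisson arrivals,
    exponential service times and a single server.
    """
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    utilization = arrival_rate / service_rate
    mean_jobs = utilization / (1.0 - utilization)
    mean_response = 1.0 / (service_rate - arrival_rate)
    return utilization, mean_jobs, mean_response


if __name__ == "__main__":
    # Hypothetical rates: 8 jobs/hour arriving, 10 jobs/hour served.
    rho, jobs, response = mm1_metrics(arrival_rate=8.0, service_rate=10.0)
    print(f"utilization={rho:.2f} jobs_in_system={jobs:.2f} response_time={response:.2f}h")
```

The point of such a model is that it predicts behavior (here, how response time grows as utilization approaches 1) without running any workload, which is what separates it from simulation and measurement.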

Hadoop Performance Evaluation Tools

GridMix3

PigMix2

TeraSort / GraySort

DFSIO, NNBench, S-Live

HiBench

etc.

GridMix3

GridMix Evolution

GridMix1 (HADOOP-2369):

• Representative mix of Jobs

• mapreduce/src/benchmarks/gridmix

GridMix2 (HADOOP-3770):

• More configurable; uses JobControl

• mapreduce/src/benchmarks/gridmix2

GridMix3 (MAPREDUCE-776):

• Trace-based; better emulation-accuracy

• mapreduce/src/contrib/gridmix

Rumen (MAPREDUCE-751):

• Supporting tool for GridMix3 et al.

• mapreduce/src/tools/org/apache/hadoop/tools/rumen

GridMix3

Macro benchmark for Hadoop

Trace-based submission of synthetic Jobs

Traces based on production clusters

Traces generated by Rumen

No access to original Job’s code or data

Emulates I/O and other aspects

Highly configurable

Rumen

Comprises:

› TraceBuilder - Job Traces from Job History and Configuration

› Folder - Scales Job Traces to a given time-window

Job Traces are in JSON format

Insulation for release-to-release changes in format and contents

Statistical information on Jobs in Trace

Provides API to access Job Traces
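The two ideas above (JSON job traces, and scaling a trace to a time-window) can be sketched roughly as follows. This is not Rumen's API or the Folder's actual algorithm: the sample trace, its field names (`jobID`, `submitTime`, `finishTime`) and the simple linear rescaling are all assumptions made for illustration.

```python
import json

# Hypothetical, heavily simplified trace; real Rumen traces carry far more detail.
SAMPLE_TRACE = """
[
  {"jobID": "job_0001", "submitTime": 1000,   "finishTime": 61000},
  {"jobID": "job_0002", "submitTime": 31000,  "finishTime": 121000},
  {"jobID": "job_0003", "submitTime": 121000, "finishTime": 151000}
]
"""


def scale_to_window(jobs, window_ms):
    """Linearly rescale submit offsets so the last submission lands at window_ms."""
    first = min(job["submitTime"] for job in jobs)
    span = max(job["submitTime"] for job in jobs) - first
    factor = window_ms / span if span else 0.0
    return [(job["jobID"], round((job["submitTime"] - first) * factor)) for job in jobs]


def runtime_stats(jobs):
    """Return (min, max, mean) of submit-to-finish times in milliseconds."""
    runtimes = [job["finishTime"] - job["submitTime"] for job in jobs]
    return min(runtimes), max(runtimes), sum(runtimes) / len(runtimes)


if __name__ == "__main__":
    jobs = json.loads(SAMPLE_TRACE)
    print(scale_to_window(jobs, window_ms=60000))
    print(runtime_stats(jobs))
```

Keeping the trace in a plain, self-describing format is what lets a consumer like this stay unaffected when the underlying Job History format changes between releases.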

GridMix3 Flow

[Figure: flow diagram. Labeled elements: Production Cluster, Job Histories & Configuration, Rumen, Job Trace, GridMix3, Data Generator, Job Submitter, Benchmark Cluster.]

GridMix3 Architecture

[Figure: architecture diagram. Labeled elements: GridMix3, Rumen, JobStory, JobFactory, GridmixJob, JobSubmitter, MapReduceJob, JobTracker, JobMonitor, Job, Status.]

GridMix3 Emulation-Accuracy

Submission Policies and Job Types

Submission policy determines when Jobs are submitted:

› STRESS - Keep the cluster under stress (without overwhelming it)

› REPLAY - Faithful emulation of inter-job submission times

› SERIAL - Submit a Job only after the previous one finishes

Types of synthetic Jobs:

› LOADJOB - Emulates work-load from Job Trace

› SLEEPJOB - Do nothing for periods from Job Trace
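The difference between the REPLAY and SERIAL policies can be sketched as a small scheduling calculation. This is an illustrative model, not GridMix3 code: the job tuples are invented, and SERIAL is computed under the assumption that each job runs on the benchmark cluster for exactly its traced runtime. STRESS is left out because it depends on live cluster load rather than on the trace alone.

```python
# Each hypothetical trace entry: (job name, submit time in s, runtime in s).
TRACE = [("a", 100, 50), ("b", 110, 30), ("c", 200, 20)]


def replay_schedule(jobs):
    """REPLAY: preserve the inter-job submission gaps recorded in the trace."""
    first_submit = jobs[0][1]
    return [(name, submit - first_submit) for name, submit, _ in jobs]


def serial_schedule(jobs):
    """SERIAL: submit each job only after the previous one has finished."""
    schedule, clock = [], 0
    for name, _, runtime in jobs:
        schedule.append((name, clock))
        clock += runtime  # assumes the emulated job takes its traced runtime
    return schedule


if __name__ == "__main__":
    print("REPLAY:", replay_schedule(TRACE))
    print("SERIAL:", serial_schedule(TRACE))
```

In this toy trace, REPLAY submits job "b" ten seconds after "a" even though "a" is still running, while SERIAL holds "b" back until "a" completes.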

PigMix2

PigMix Evolution

PigMix1:

• Representative mix of 12 Pig scripts and Java programs

• http://wiki.apache.org/pig/PigMix

• http://wiki.apache.org/pig/DataGeneratorHadoop

PigMix2 (PIG-200):

• Added 5 Pig scripts and Java programs

• Re-factored data-generation

PigMix2

Benchmark for Pig

Representative mix of 17 Pig scripts

Corresponding native MapReduce Java programs

Specifications-based input-data generator

PigMix2 Flow

[Figure: flow diagram. Labeled elements: Data Generator, Input Data, PigMix2, Benchmark Cluster.]

Tips

Minimize Variance

Check hardware, especially for failing hard-drives

Use large data-sets to minimize effects of overheads

Beware of speculative execution

Set ipc.ping.interval to 5000 ms (HADOOP-5380)

Use appropriate PARALLEL clause in PigMix2 Pig scripts

Several runs needed for proper analysis
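One way to judge whether enough runs have been made is to look at the spread across them. The sketch below summarizes repeated benchmark runtimes with the mean, sample standard deviation and coefficient of variation; the runtimes are invented numbers, and any threshold for "acceptable" variation is a judgment call the talk does not specify.

```python
import statistics


def summarize_runs(runtimes_s):
    """Return (mean, sample standard deviation, coefficient of variation).

    Needs at least two runs; statistics.stdev raises StatisticsError otherwise.
    """
    mean = statistics.mean(runtimes_s)
    stdev = statistics.stdev(runtimes_s)
    return mean, stdev, stdev / mean


if __name__ == "__main__":
    # Hypothetical wall-clock times (seconds) from five runs of the same benchmark.
    runs = [100.0, 104.0, 98.0, 102.0, 96.0]
    mean, stdev, cv = summarize_runs(runs)
    print(f"mean={mean:.1f}s stdev={stdev:.2f}s cv={cv:.1%}")
```

If two releases differ by less than the run-to-run spread, the comparison says little until the variance is reduced or more runs are collected.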

Apples to Apples Comparison

Benchmarking versus Production Cluster:

› Same hardware

› Same software stack

› Same configuration

› Similar networking

› Same size (might not be feasible)

Extrapolating results can be tricky

Plans

Future Work

Greater emulation-accuracy in GridMix3:

› Distributed Cache

› Compression

› CPU usage

› Memory usage

More comprehensive Job Traces from Rumen

Integration of PigMix2 with Pig Statistics

Q & A
