Apache Hadoop India Summit 2011 talk "Hadoop Simulation and Performance" by Ranjit Mathew
Hadoop Simulation and Performance
Apache Hadoop India Summit 2011
Ranjit Mathew, Yahoo! R & D India
Copyright © 2011 Yahoo! All rights reserved.
Overview
Introduction
GridMix3
PigMix2
Tips
Plans
Q & A
Introduction
Why?
Capacity Planning
Benchmarking
Comparative evaluation of releases
Basis for improvements
Debugging
Performance Evaluation Techniques
Analytical Modeling
› Use statistics, queuing theory, etc. to model system
› Use models to predict behavior
Simulation
› Simulate work-load based on representation or traces
› Benchmarking used to compare variants
Measurement
› Use metrics gathered from tools and logs
› Measure under peak, regular and light work-loads
Ref.: “The Art of Computer Systems Performance Analysis”, Raj K. Jain (Wiley, 1991)
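As a small illustration of the analytical-modeling approach, the sketch below evaluates the textbook M/M/1 queue (one server, Poisson arrivals, exponential service times). Treating a cluster as a single M/M/1 server is an assumption made purely for illustration, and the function name and sample rates are hypothetical.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state metrics for an M/M/1 queue.

    arrival_rate: mean jobs arriving per unit time (lambda)
    service_rate: mean jobs the server completes per unit time (mu)
    """
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")

    utilization = arrival_rate / service_rate              # rho = lambda / mu
    mean_jobs = utilization / (1.0 - utilization)          # L = rho / (1 - rho)
    mean_response = 1.0 / (service_rate - arrival_rate)    # R = 1 / (mu - lambda)
    return {
        "utilization": utilization,
        "mean_jobs_in_system": mean_jobs,
        "mean_response_time": mean_response,
    }


if __name__ == "__main__":
    # Hypothetical rates: 8 jobs/hour arriving, 10 jobs/hour of capacity.
    print(mm1_metrics(8.0, 10.0))
```

A model like this predicts behavior without running anything, which is why it is usually cross-checked against simulation or measurement.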
Hadoop Performance Evaluation Tools
GridMix3
PigMix2
TeraSort / GraySort
DFSIO, NNBench, S-Live
HiBench
etc.
GridMix3
GridMix Evolution
GridMix1 (HADOOP-2369):
• Representative mix of Jobs
• mapreduce/src/benchmarks/gridmix
GridMix2 (HADOOP-3770):
• More configurable; uses JobControl
• mapreduce/src/benchmarks/gridmix2
GridMix3 (MAPREDUCE-776):
• Trace-based; better emulation-accuracy
• mapreduce/src/contrib/gridmix
Rumen (MAPREDUCE-751):
• Supporting tool for GridMix3 et al.
• mapreduce/src/tools/org/apache/hadoop/tools/rumen
GridMix3
Macro benchmark for Hadoop
Trace-based submission of synthetic Jobs
Traces based on production clusters
Traces generated by Rumen
No access to original Job’s code or data
Emulates I/O and other aspects
Highly configurable
Rumen
Comprises:
› TraceBuilder - Job Traces from Job History and Configuration
› Folder - Scales Job Traces to a given time-window
Job Traces are in JSON format
Insulation for release-to-release changes in format and contents
Statistical information on Jobs in Trace
Provides API to access Job Traces
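To make the idea of scaling a JSON Job Trace to a time-window concrete, here is a minimal sketch. It is not Rumen's Folder algorithm: it simply rescales submission times linearly. The trace layout (a JSON list of objects) and the field names `jobID` and `submitTime` are assumptions chosen for the example, as is the function name.

```python
import json

# Hypothetical, heavily simplified trace; real Rumen traces carry far more detail.
SAMPLE_TRACE = """
[
  {"jobID": "job_a", "submitTime": 1000000},
  {"jobID": "job_b", "submitTime": 1060000},
  {"jobID": "job_c", "submitTime": 1120000}
]
"""


def scale_trace(jobs, window_ms):
    """Linearly rescale submit times so the trace spans window_ms milliseconds.

    Returns new job dicts whose submitTime is an offset from the first job.
    """
    if not jobs:
        return []
    start = min(job["submitTime"] for job in jobs)
    span = max(job["submitTime"] for job in jobs) - start
    factor = window_ms / span if span > 0 else 0.0

    scaled = []
    for job in sorted(jobs, key=lambda j: j["submitTime"]):
        new_job = dict(job)  # copy so the input trace is left untouched
        new_job["submitTime"] = round((job["submitTime"] - start) * factor)
        scaled.append(new_job)
    return scaled


if __name__ == "__main__":
    trace = json.loads(SAMPLE_TRACE)
    for job in scale_trace(trace, window_ms=30000):
        print(job["jobID"], job["submitTime"])
```

Compressing a trace this way raises the submission rate, so a short benchmark run can stand in for a longer production period.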
GridMix3 Flow
[Diagram: on the Production Cluster, Job Histories & Configuration are fed to Rumen, which produces a Job Trace; GridMix3 (Data Generator and Job Submitter) uses the Job Trace to drive the Benchmark Cluster]
GridMix3 Architecture
[Diagram: labels shown are GridMix3, Rumen, JobStory, JobFactory, GridmixJob, JobSubmitter, JobMonitor, JobTracker, MapReduce Job, Job and Status]
GridMix3 Emulation-Accuracy
Submission Policies and Job Types
Submission policy determines when Jobs are submitted:
› STRESS - Keep the cluster under stress (without overwhelming it)
› REPLAY - Faithful emulation of inter-job submission times
› SERIAL - Submit a Job only after the previous one finishes
Types of synthetic Jobs:
› LOADJOB - Emulates work-load from Job Trace
› SLEEPJOB - Does nothing (sleeps) for the periods given in the Job Trace
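The difference between the three submission policies can be sketched with a toy model. This is not GridMix3's implementation: the STRESS branch approximates "under stress" with a hypothetical cap on concurrently running jobs (`max_running`), and every job is assumed to run for exactly the time recorded in the trace.

```python
import heapq


def submission_times(policy, jobs, max_running=2):
    """Return emulated submit times (relative to the start of the run).

    jobs: list of (trace_submit_time, run_time) tuples, sorted by submit time.
    max_running: hypothetical concurrency cap used only by the STRESS model.
    """
    if not jobs:
        return []

    if policy == "REPLAY":
        # Preserve the inter-job gaps observed in the trace.
        first = jobs[0][0]
        return [submit - first for submit, _ in jobs]

    if policy == "SERIAL":
        # Each job is submitted only when the previous one has finished.
        times, clock = [], 0
        for _, run_time in jobs:
            times.append(clock)
            clock += run_time
        return times

    if policy == "STRESS":
        # Toy model: submit as soon as fewer than max_running jobs are in flight.
        finish_times, times, clock = [], [], 0
        for _, run_time in jobs:
            if len(finish_times) >= max_running:
                clock = max(clock, heapq.heappop(finish_times))
            times.append(clock)
            heapq.heappush(finish_times, clock + run_time)
        return times

    raise ValueError("unknown policy: %s" % policy)


if __name__ == "__main__":
    trace = [(0, 10), (5, 10), (100, 10), (105, 4)]  # made-up sample
    for policy in ("REPLAY", "SERIAL", "STRESS"):
        print(policy, submission_times(policy, trace))
```

In this model REPLAY keeps the idle gap before the third job, SERIAL ignores trace timing entirely, and STRESS removes idle time while bounding the load.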
PigMix2
PigMix Evolution
PigMix1:
• Representative mix of 12 Pig scripts and Java programs
• http://wiki.apache.org/pig/PigMix
• http://wiki.apache.org/pig/DataGeneratorHadoop
PigMix2 (PIG-200):
• Added 5 Pig scripts and Java programs
• Re-factored data-generation
PigMix2
Benchmark for Pig
Representative mix of 17 Pig scripts
Corresponding native MapReduce Java programs
Specifications-based input-data generator
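Since each Pig script has a corresponding native MapReduce program, one plausible way to summarize a run is the ratio of Pig time to MapReduce time per script. The sketch below assumes that reading; the function name, the script labels and all timings are made up for illustration.

```python
def pig_to_mapreduce_ratios(pig_seconds, mr_seconds):
    """Return {script: pig_time / mapreduce_time} for scripts timed both ways.

    Scripts with a missing or zero MapReduce time are skipped rather than guessed.
    """
    ratios = {}
    for script, pig_time in pig_seconds.items():
        mr_time = mr_seconds.get(script)
        if mr_time:  # skips None and 0
            ratios[script] = pig_time / mr_time
    return ratios


if __name__ == "__main__":
    # Made-up timings in seconds; labels are illustrative only.
    pig = {"L1": 120.0, "L2": 90.0, "L3": 300.0}
    mapreduce = {"L1": 100.0, "L2": 100.0}
    for script, ratio in sorted(pig_to_mapreduce_ratios(pig, mapreduce).items()):
        print("%s: Pig takes %.2fx the MapReduce time" % (script, ratio))
```

A ratio above 1 means the Pig script was slower than its hand-written counterpart on that run.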
PigMix2 Flow
[Diagram: the Data Generator produces Input Data, which PigMix2 uses on the Benchmark Cluster]
Tips
Minimize Variance
Check hardware, especially for failing hard-drives
Use large data-sets to minimize effects of overheads
Beware of speculative execution
Set ipc.ping.interval to 5000 (HADOOP-5380)
Use appropriate PARALLEL clause in PigMix2 Pig scripts
Several runs needed for proper analysis
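A minimal sketch of why several runs matter: summarizing repeated timings of the same benchmark with the mean, the sample standard deviation and the coefficient of variation. The function name and the sample timings are hypothetical.

```python
import statistics


def summarize_runs(run_seconds):
    """Summarize repeated timings of one benchmark (needs at least two runs)."""
    if len(run_seconds) < 2:
        raise ValueError("at least two runs are needed to estimate variance")
    mean = statistics.mean(run_seconds)
    stdev = statistics.stdev(run_seconds)  # sample standard deviation
    return {
        "runs": len(run_seconds),
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean,  # coefficient of variation, unitless
    }


if __name__ == "__main__":
    timings = [100.0, 110.0, 90.0]  # made-up wall-clock times in seconds
    print(summarize_runs(timings))
```

A high coefficient of variation suggests the runs are not yet comparable, so differences between releases or configurations should not be read off a single run.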
Apples to Apples Comparison
Benchmark Cluster versus Production Cluster:
› Same hardware
› Same software stack
› Same configuration
› Similar networking
› Same size (might not be feasible)
Extrapolating results can be tricky
Plans
Future Work
Greater emulation-accuracy in GridMix3:
› Distributed Cache
› Compression
› CPU usage
› Memory usage
More comprehensive Job Traces from Rumen
Integration of PigMix2 with Pig Statistics
Q & A