sudoers: Benchmarking Hadoop with ALOJA
Benchmarking Hadoop with ALOJA
Oct 6, 2015
by Nicolas Poggi @ni_po
sudoers Barcelona
About Nicolas Poggi (@ni_po): work, education, and community involvement
Agenda
Intro to Hadoop
Current scenario and its problems
ALOJA project
Open source tools
Benchmarking DEMO
Results
DEMO results online
Open questions and comments
Intro: Hadoop design and ecosystem
Hadoop design
Hadoop is designed to process complex data, both structured and unstructured
With [close to] linear scalability
It simplifies the programming model compared to MPI, OpenMP, CUDA, …
It operates as a black box for data analysts
Image source: Hadoop, the definitive guide
Hadoop parameters: 100+ tunable parameters, obscure and interrelated
Examples, default (tuned):
mapred.map/reduce.tasks.speculative.execution
io.sort.mb 100 (300)
io.sort.record.percent 5% (15%)
io.sort.spill.percent 80% (95-100%)
Number of mappers and reducers: rule of thumb, 0.5-2 per CPU core
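As a rough illustration of that rule of thumb, a starting range for task slots can be derived from the node's core count (the parameter name shown is the Hadoop 1.x one; the core count is a made-up example):

```shell
# Derive a starting range for map/reduce task slots from the core count,
# per the 0.5-2 tasks-per-CPU-core rule of thumb. Values are illustrative.
cores=8                     # e.g. cores=$(nproc) on the target node
min_tasks=$(( cores / 2 ))
max_tasks=$(( cores * 2 ))
echo "mapred.tasktracker.map.tasks.maximum: try ${min_tasks}-${max_tasks}"
```

From there, ALOJA's approach is to benchmark the candidate values rather than trust the heuristic.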
Hadoop stack for tuning
Image source: Intel® Distribution for Apache Hadoop
Hadoop is highly scalable, but… not a high-performance solution out of the box!
It requires design: cluster sizing and topology
Setup: OS and Hadoop configuration
Tuning: an iterative, time-consuming approach
And extensive benchmarking!
Hadoop ecosystem
Large and spread
Dominated by big players
Custom patches
Default values not ideal
Product claims
Cloud vs. On-premise
IaaS
PaaS
EMR, HDInsight
Needs standardization and auditing!
Too many choices?
Chart: hardware and deployment options on cost and performance axes: remote volumes vs. rotational HDDs, JBODs vs. RAID, large vs. small VMs, Gb Ethernet vs. InfiniBand, on-premise vs. cloud, plus high availability and replication
And where is my system configuration positioned on each of these axes?
Project ALOJA
Open initiative to produce mechanisms for an automated characterization of cost-effectiveness
of Big Data deployments
It results from a growing need in the community to understand job execution details and create transparency
It explores different configuration and deployment options and their trade-offs, both software and hardware, cloud services and on-premise
It seeks to provide knowledge, tools, and an online service with which users can make better-informed decisions and reduce the TCO of their Big Data infrastructures
And to guide the future development and deployment of Big Data clusters and applications
Challenges, options, and implementation
Challenges (circa end 2013)
Test different cluster architectures
On-premise: commodity, high-end, appliance, low-power
Cloud IaaS: 32 different VM types in Azure, similar in other providers
Cloud PaaS: HDInsight, EMR, CloudBigData
Different access levels: full admin, user-only, request-to-install, everything ready, queuing systems (SGE)
Different versions: Hadoop, JVM, Spark, Hive, etc.
Dev environments and testing: Big Data usually requires a cluster to develop and test
Benchmarking vs. production environments: we need to compare different executions, not monitor how the systems are doing now; this is the main difference from production products
Data does not change (non-OLTP): temporary benchmark data vs. important production data
Fast iteration vs. reliability: iterating over configurations vs. a fixed config; many fast, experimental changes
Security can be relaxed
Vendor lock-in and lack of systems support (Azure, on-premise, low-power): Hadoop is our use case, but not the only one
Leave no traces on the benchmarked system
Available options (circa end 2013)
Deployment: jclouds, Foreman, Puppet, Ambari
Config and deploy: Ambari (Hadoop only), or use Configuration Management (CM): Puppet, Chef, Ansible…
Monitoring: Ganglia, Zabbix, Ambari, Cloudera Manager, Kibana, GraphD…
Problems: all of these systems are designed for production, not for comparison; no Azure support; many different packages; no one-size-fits-all solution
Solution: a custom implementation based on simple components, wrapping commands
ALOJA Platform main components
1 Big Data Benchmarking
•Deploy & provision
•Conf management
•Parameter selection & queuing
•Perf counters
•Low-level instrumentation
•App logs
2 Online Repository
•Explore results
•Execution details
•Cluster details
•Costs
•Data sharing
3 Web Analytics
•Data views and evaluations
•Aggregates
•Abstracted metrics
•Job characterization
•Machine learning
•Predictions and clustering
Implemented with: BASH, Unix tools, and CLIs; NGINX, PHP, and MySQL; R, SQL, and JS
Workflow in ALOJA
Cluster(s) definition
• VM sizes
• # nodes
• OS, disks
• Capabilities
Execution plan
• Start cluster
• Exec benchmarks
• Gather results
• Cleanup
Import data
• Convert perf metrics
• Parse logs
• Import into DB
Evaluate data
• Data views in Vagrant VM
• Or http://hadoop.bsc.es
PA and KD
• Predictive analytics
• Knowledge discovery
Historic repo (in progress)
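The execution-plan steps above can be sketched as a small bash driver; the function names are illustrative stand-ins, not ALOJA's actual API:

```shell
# Sketch of the execution-plan loop; stubs stand in for the real
# provisioning and benchmarking commands.
start_cluster()   { echo "start $1"; }
exec_benchmark()  { echo "run $2 on $1"; }
gather_results()  { echo "gather $1"; }
cleanup_cluster() { echo "cleanup $1"; }

run_execution_plan() {
  start_cluster "$1"                     # deploy and boot the defined cluster
  for bench in terasort dfsioe_read dfsioe_write; do
    exec_benchmark "$1" "$bench"         # run and time each benchmark
  done
  gather_results "$1"                    # fetch perf counters and app logs
  cleanup_cluster "$1"                   # leave no traces on the system
}

run_execution_plan "al-08"
```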
Cluster and node definitions
Clusters (Azure example):

#load AZURE defaults
source "$CONF_DIR/azure_defaults.conf"
clusterName="al-08"
numberOfNodes="8"
vmSize="Large"
#details
vmCores="4"
vmRAM="7" #in GB
#costs
clusterCostHour="1.584" #0.176 * 9
clusterType="IaaS"
clusterDescription="A3 type VMs"

Node (Web in Rackspace):

#load node defaults
source "$CONF_DIR/node_defaults.conf"
defaultProvider="rackspace"
vm_name="aloja-web"
vmSize='io1-30'
attachedVolumes="2"
diskSize="1023"
# Node roles (install functions)
extraLocalCommands="
vm_install_webserver;
vm_install_repo 'provider/rackspace';
install_ganglia_gmond;
config_ganglia_gmond 'aloja-web-rackspace' 'aloja-web';
install_percona /scratch/attached/2/mysql;"
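Because these definitions are plain bash variable assignments, a deploy script can simply source them. A minimal sketch (the file path is made up; the values mirror the Azure example above):

```shell
# Cluster definitions are plain bash variable files: write a minimal one
# and source it, as a deploy script would. Path and values are examples.
cat > /tmp/al-08.conf <<'EOF'
clusterName="al-08"
numberOfNodes="8"
vmSize="Large"
clusterCostHour="1.584"
EOF
source /tmp/al-08.conf
echo "Deploying ${numberOfNodes} x ${vmSize} nodes as ${clusterName}"
```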
Commands and providers
Provisioning commands:
Connect: node and cluster; uses SSH proxies automatically
Deploy
Start, stop, delete: nodes and clusters
Providers:
On-premise: custom settings for clusters, multiple disk types, different architectures
Cloud IaaS: Azure, OpenStack, Rackspace, AWS (testing)
Cloud PaaS: HDInsight, CloudBigData, EMR soon
Code at: https://github.com/Aloja/aloja/tree/master/aloja-deploy
Running benchmarks in ALOJA Example of submitting a job to run:
https://github.com/Aloja/aloja/blob/master/aloja-bench/run_benchs.sh
To queue jobs and control results: https://github.com/Aloja/aloja/blob/master/shell/exeq.sh
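A parameter sweep like the ones ALOJA runs can be enumerated with nested loops; the -b/-m/-c flags below are made up for this sketch, not run_benchs.sh's actual options:

```shell
# Enumerate one invocation per mappers x compression combination.
# The -b/-m/-c flags are illustrative only.
plan=$(
  for maps in 4 6 8 10; do
    for comp in 0 1 2 3; do          # 0=none 1=zlib 2=bzip2 3=snappy
      echo "run_benchs.sh -b terasort -m ${maps} -c ${comp}"
    done
  done
)
echo "$plan"
```

Four mapper settings times four codecs yields 16 runs, which would then be queued and tracked by the queueing script.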
Benchmarking results
ALOJA Online Benchmark Repository: entry point to explore the results collected from the executions
Index of executions: quick glance at executions; searchable and sortable
Execution details: performance charts and histograms, Hadoop counters, job and task details
Data management of benchmark executions: data importing from different clusters, execution validation, data management and backup
Cluster definitions: cluster capabilities (resources) and cluster costs
Sharing results: download executions, add external executions
Documentation and references: papers, links, and feature documentation
Available at: http://aloja.bsc.es
Impact of SW configurations on speedup (4-node clusters)
Chart: speedup (higher is better) by number of mappers (4m, 6m, 8m, 10m) and compression algorithm (no compression, ZLIB, BZIP2, snappy)
Results using: http://hadoop.bsc.es/configimprovement Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
Impact of HW configurations on speedup
Chart: speedup (higher is better) for disks and network (HDD-ETH, HDD-IB, SSD-ETH, SSD-IB) and cloud remote volumes (local only; 1, 2, or 3 remotes; and 1-3 remotes with /tmp local)
Results using: http://hadoop.bsc.es/configimprovement Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
Speedup: all disk configurations, SSD vs. JBOD, for DFSIOE read, DFSIOE write, and Terasort
URL: http://hadoop.bsc.es/configimprovement?datefrom=&dateto=&benchs%5B%5D=dfsioe_read&benchs%5B%5D=dfsioe_write&benchs%5B%5D=terasort&id_clusters%5B%5D=21&nets%5B%5D=None&disks%5B%5D=HD2&disks%5B%5D=HD3&disks%5B%5D=HD4&disks%5B%5D=HD5&disks%5B%5D=HDD&disks%5B%5D=HS5&disks%5B%5D=RL1&disks%5B%5D=RL2&disks%5B%5D=RL3&disks%5B%5D=RL4&disks%5B%5D=RL5&disks%5B%5D=RL6&disks%5B%5D=RR1&disks%5B%5D=SS2&disks%5B%5D=SSD&mapss%5B%5D=None&comps%5B%5D=None&replications%5B%5D=None&blk_sizes%5B%5D=None&iosfs%5B%5D=None&iofilebufs%5B%5D=None&datanodess%5B%5D=None&bench_types%5B%5D=HDI&bench_types%5B%5D=HiBench&vm_sizes%5B%5D=None&vm_coress%5B%5D=None&vm_RAMs%5B%5D=None&hadoop_versions%5B%5D=None&types%5B%5D=None&filters%5B%5D=valid&filters%5B%5D=filters&allunchecked=
Chart: speedup by disk configuration (1-5 SATA; 5 SATA + 1 SSD /tmp; 1 SSD; 2 SSDs; higher is better), marking the fastest config, the high-capacity-and-fast option, and the high-capacity-but-slow options
Speedup by disk configuration in the Cloud (higher is better)
URL
http://104.130.159.92/configimprovement?benchs%5B%5D=terasort&disks%5B%5D=HDD&disks%5B%5D=RL1&disks%5B%5D=RL2&disks%5B%5D=RL3&disks%5B%5D=RR1&disks%5B%5D=RR2 &disks%5B%5D=RR3&disks%5B%5D=RR4&disks%5B%5D=RR5&disks%5B%5D=RR6&disks%5B%5D=RS1&disks%5B%5D=RS6&disks%5B%5D=SSD&bench_types%5B%5D =HiBench&filters%5B%5D=valid&filters%5B%5D=filters&allunchecked=&selected-groups=disk&datefrom=&dateto=&minexetime=150&maxexetime=1500
Chart: 1-6 remote volumes; 1 and 6 remotes with /tmp on SSD; and SSD only (higher is better)
VM size comparison (Azure); lower is better
Preview: Cost/Performance Scalability
This shows a sample of a new screen (with sample data) to find the most cost-effective cluster size
X axis: number of datanodes (cluster size)
Left Y: execution time (lower is better)
Right Y: execution cost
The chart marks the recommended size
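The execution-cost axis follows directly from runtime and the cluster's hourly price. A minimal sketch, reusing the clusterCostHour value from the earlier Azure cluster example (the runtime is a made-up sample):

```shell
# Execution cost = runtime in hours x hourly cluster price.
exec_seconds=1800        # sample 30-minute run (illustrative)
cost_hour="1.584"        # clusterCostHour from the A3 cluster example
cost=$(awk -v s="$exec_seconds" -v c="$cost_hour" 'BEGIN { printf "%.3f", s / 3600 * c }')
echo "Execution cost: \$${cost}"
```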
Cost-effectiveness (on-premise vs. cloud)
Chart: price vs. performance for InfiniBand + SSD (local); GbE + SSD (local); cloud (local disk for /tmp and HDFS); cloud (/tmp on local disk, HDFS in Blob storage, 1-3 devices); cloud (/tmp and HDFS in Blob storage, 1-3 devices); InfiniBand + SATA disks (local); and GbE + SATA disks (local)
Details at: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
Open questions: is BASH good enough?
PROs:
Simple and fast; well known (the basics at least); easy to hack
Most of the work consists of running system commands
CONs and alternatives:
Problems of a custom implementation; missing support for some systems
Too simple; missing objects, inheritance, types, data structures, and testing
Python? Perl? Puppet? Ansible?
We'll stick to bash for now…
What’s missing for incubating in Apache?
More info: ALOJA Benchmarking platform and online repository
http://aloja.bsc.es
Benchmarking Big Data by Nicolas Poggi http://www.slideshare.net/ni_po/benchmarking-hadoop
Big Data Benchmarking Community (BDBC) mailing list (~200 members from ~80 organizations) http://clds.sdsc.edu/bdbc/community
Workshop Big Data Benchmarking (WBDB) Next: http://clds.sdsc.edu/wbdb2015.ca
SPEC Research Big Data working group http://research.spec.org/working-groups/big-data-working-group.html
Slides and video: Michael Frank on Big Data benchmarking
http://www.tele-task.de/archive/podcast/20430/
Tilmann Rabl Big Data Benchmarking Tutorial http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl
@BDOOP_BCN
More info: http://aloja.bsc.es
or join the BDOOP group: http://www.meetup.com/Barcelona-BigData-Perfomance-and-Operations