The state of SQL-on-Hadoop in the Cloud
Transcript of The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
By Nicolas Poggi
Lead researcher – Big Data Frameworks
Data Centric Computing (DCC) Research Group
Hadoop Summit Melbourne – August 2016
Agenda
• Intro on BSC and ALOJA
  • Motivation
• PaaS services overview
  • Instances comparison
  • SW and HW specs
• SQL benchmark
  • Test methodology
• Evaluations
  • Execution times
  • Data size scalability
  • Price / performance
• PaaS evolution over time
  • SW and HW improvements
• Summary
  • Lessons learned
  • Conclusions & future work
Barcelona Supercomputing Center (BSC)
• Spanish national supercomputing center with a 22-year history in computer architecture, networking, and distributed systems research
  • Based at BarcelonaTech University (UPC)
  • Led by Mateo Valero: ACM Fellow; Eckert-Mauchly 2007, Google 2009, and Seymour Cray 2015 awards
• Large ongoing life-science computational projects, with industry and academia
  • Active research staff with 1000+ publications
• Prominent body of research activity around Hadoop
  • 2008-2013: SLA adaptive scheduler, accelerators, locality awareness, performance management; 7+ publications
  • 2013-present: cost-efficient upcoming Big Data architectures (ALOJA)
• Open model focus: no patents, public IP, publications (5+), and open source
ALOJA: towards cost-effective Big Data
• Open research project for automating the characterization and optimization of Big Data deployments
• Open-source benchmarking-to-insights platform and tools
• Largest public benchmarking repository
  • Over 80,000 job runs and 100+ HW configs tested (2014-2016)
• Community collaboration with industry and academia
• Preliminary to this study:
  • Big Data Benchmark Compendium (TPC-TC '15)
  • The Benefits of Hadoop as PaaS (Hadoop Summit EU '16)
http://aloja.bsc.es
[Diagram: Big Data benchmarking → online repository → Web / ML analytics]
Motivation of SQL-on-Hadoop study
• Extend the ALOJA platform to survey popular PaaS SQL Big Data cloud solutions, using Hive [to begin]
• First approach to the services, from an end-user's perspective
  • Using the public cloud (and pricing), online docs, and resources
  • Medium-size test deployments and data (8 data nodes, up to 1TB)
• Evaluate and compare out-of-the-box behavior (default VMs and config)
  • Architectural differences, readiness, competitive advantages
  • Scalability, price, and performance
Disclaimer: snapshot of the out-of-the-box price and performance during March-July 2016. Performance and especially costs change often. We use non-discounted pricing. I/O costs are complex to estimate for a single benchmark.
Platform-as-a-Service Big Data
• Cloud-based managed Hadoop services
  • Ready-to-use Hive, Spark, ...
  • Simplified management
• Deploys in minutes, on-demand, elastic
  • You select the instance type and the number of processing nodes
• Pay-as-you-go, pay-what-you-process models
• Optimized for general purpose
  • Fine-tuned to the cloud provider's architecture
Surveyed Hadoop/Hive PaaS services
• Amazon Elastic MapReduce (EMR)
  • Released: Apr 2009
  • OS: Amazon Linux AMI 4.4 (RHEL-like)
  • SW stack: EMR (custom, 4.7*)
  • Instances: m3.xlarge and m4.xlarge
• Google Cloud Dataproc (CDP)
  • Released: Feb 2016
  • OS: Debian GNU/Linux 8.4
  • SW stack: custom, v1
  • Instances: n1-standard-4 and n1-standard-8
• Azure HDInsight (HDI)
  • Released: Oct 2013
  • OS: Windows Server and Ubuntu 14.04.5 LTS
  • SW stack: HDP-based (v2.3 and 2.4)
  • Instances: A3s, D3s v1-2, and D4s v1-2
• Rackspace Cloud Big Data (CBD)
  • Released: ~Oct 2013
  • OS: CentOS 7
  • SW stack: HDP (2.3)
  • API: OpenStack (+ Lava)
  • Instances: Hadoop 1-7, 1-15, 1-30, OnMetal 40

We selected defaults and general-purpose VMs, with on-premise results as a baseline.
* EMR v5 released in August 2016
Systems Under Test (SUTs): VM/instance specs, elasticity, perf characterization
Focus: 8 data nodes, up to 1TB data size
SUTs: Tech specs and costs
* Estimate based on a 3-year lifetime including support and maintenance (see refs.)

Notes:
• Default cloud SKUs have 4 cores and ~15GB of RAM in all providers (~4GB of RAM / core)
• Prices vary greatly
• Rackspace defaults to the high-end OnMetal instance

| Provider | Instance type | Default? | Cores/Node | RAM/Node (GB) | RAM/core (GB) | Data nodes | Cost/Hour | Cluster shared |
|---|---|---|---|---|---|---|---|---|
| Amazon EMR (us-east-1) | m3.xlarge | Yes | 4 | 15 | 3.8 | 8 | USD 3.36 | Yes |
| | m4.xlarge | | 4 | 16 | 4 | 8 | USD 2.99 | Yes |
| Google CDP (europe-west1-b) | n1-standard-4 | Yes | 4 | 15 | 3.8 | 8 | USD 1.81 | Yes |
| | n1-standard-4 1 SSD | | 4 | 15 | 3.8 | 8 | USD 1.92 | Yes |
| | n1-standard-8 | | 8 | 30 | 7.5 | 8 | USD 3.61 | Yes |
| Azure HDI (South Central US) | A3 (Large) | (old def.) | 4 | 7 | 1.8 | 8 | USD 2.70 | Yes |
| | D3 v1 and v2 | Yes | 4 | 14 | 3.5 | 8 | USD 5.25 | Yes |
| | D4 v1 and v2 | | 4 | 14 | 3.5 | 8 | USD 10.48 | Yes |
| Rackspace CBD (Northern Virginia (IAD)) | hadoop1-7 | | 2 | 7 | 3.5 | 8 | USD 2.72 | Yes |
| | hadoop1-15 | (2nd) | 4 | 15 | 3.8 | 8 | USD 5.44 | Yes |
| | hadoop1-30 | | 8 | 30 | 3.8 | 8 | USD 10.88 | Yes |
| | OnMetal 40 | Yes | 40 | 128 | 3.2 | 4 | USD 11.80 | No |
| On-premise | 2012 (12 cores/64GB) | | 12 | 64 | 5.3 | 8 | USD 3.50* | No |
SUTs: Elasticity and I/O
* Tests need 5TB of raw HDFS storage; this cost is used. ** Supports up to 4 SSD drives.

| Provider | Instance type | Elasticity | Storage | Includes I/O costs | Cost/5TB/hr* | Deploy time |
|---|---|---|---|---|---|---|
| Amazon EMR | m3.xlarge | Compute (and EBS option) | 2x40GB local SSD / node | Yes / No with EBS | USD 0.07 | ~10 mins |
| | m4.xlarge | Compute and EBS (fixed size) | EBS size defined on deploy | No | | ~10 mins |
| Google CDP | n1-standard-4 | Compute and GCS (fixed size) | GCS size defined on deploy | No | USD 0.18 | ~1 min |
| | n1-standard-4 1 SSD | | 1x375GB SSD** + GCS | No | | ~1 min |
| | n1-standard-8 | | GCS size defined on deploy | No | | ~1 min |
| Azure HDI | A3 (Large) | Compute and storage | Elastic (WASB) | No | USD 0.17 | ~25 mins |
| | D3 v1 and v2 | | Elastic (WASB) + 200GB local SSD | No | | ~25 mins |
| | D4 v1 and v2 | | Elastic (WASB) + 400GB local SSD | No | | ~25 mins |
| Rackspace CBD | hadoop1-7 | Compute (Cloud Files option) | 1.5TB SATA / node | Yes (local) | USD 0.00 local / USD 0.07 cloud | ~25 mins |
| | hadoop1-15 | | 2.5TB SATA / node | Yes | | ~25 mins |
| | hadoop1-30 | | 5TB SATA / node | Yes | | ~25 mins |
| | OnMetal 40 | | 2x1.5TB SSD / node | Yes | | ~25 mins |
| On-premise | 2012 (12 cores/64GB) | No | 1TB SATA x6 / node | Yes | USD 0.00 | N/A |
SUTs: Perf characterization summary
• Ran CPU, memory bandwidth, network, I/O to 1 data disk, and DFSIO benchmarks
• CPU (not all cores are born the same) and memory bandwidth:
  • Best performing: OnMetal, then
  • CDP n1-std-8 similar to HDI D4v2s (and on-premise)
  • CDP n1-std-4 similar to HDI D3v2s and EMR m4.xlarge
  • Then EMR m3.xlarge, HDI A3s, and CBD cloud-based respectively (but similar)
• Network (Gbps):
  • EMR < 40, CDP < 8 (some variance), CBD < 5, on-prem 1 Gbps
  • HDI VM-dependent, < 6 Gbps (A3 1, D3 2, D4-D3v2 ~3, D4 6)
• I/O MB/s (write to 1 data disk):
  • Most between 100-150; n1-std-4 w/ SSD 400 (symmetrical); D4v2 and OnMetal > 1000? MB/s
• DFSIO R/W (whole cluster) MB/s:
  • Most below 50 read / 35 write; n1-std-4 w/ SSD 400/200, D4v2 60/50, OnMetal 615/315 MB/s
SQL-on-Hadoop benchmarking
Methodology and evaluations
Benchmark suite: TPC-H (derived)
• DB industry standard for decision support
  • Well-understood and accepted benchmark (since '99)
  • Audited results available online
• 22 "real world" business queries
  • Complex joins, grouping, nested queries
• Defines scale factors for data
• DDLs and queries from the D2F-Bench project:
  • Includes a Hive adaptation with ORC tables
  • Repo: https://github.com/Aloja/D2F-Bench
  • Based on https://github.com/hortonworks/hive-testbench; changes make it HDP-agnostic
  • Supports other engines: Spark, Pig, Impala, Drill, ...
TPC-H 8-tables schema
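To illustrate the kind of query in the suite, here is a sketch of TPC-H Q6 (one of the scan queries used below) run against a toy `lineitem` table. The sketch uses Python's built-in sqlite3 instead of Hive, and the handful of rows are made up for the example:

```python
import sqlite3

# In-memory table with a minimal subset of TPC-H lineitem columns.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lineitem (
        l_shipdate      TEXT,
        l_discount      REAL,
        l_quantity      REAL,
        l_extendedprice REAL
    )
""")
rows = [
    ("1994-03-15", 0.06, 10, 1000.0),  # qualifies
    ("1994-06-01", 0.05, 20, 2000.0),  # qualifies
    ("1995-01-01", 0.06, 10, 1000.0),  # shipped too late
    ("1994-03-15", 0.02, 10, 1000.0),  # discount out of range
    ("1994-03-15", 0.06, 30, 1000.0),  # quantity too high
]
conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?, ?)", rows)

# TPC-H Q6: revenue that would be gained by dropping small discounts.
q6 = """
    SELECT SUM(l_extendedprice * l_discount) AS revenue
    FROM lineitem
    WHERE l_shipdate >= '1994-01-01'
      AND l_shipdate <  '1995-01-01'
      AND l_discount BETWEEN 0.05 AND 0.07
      AND l_quantity < 24
"""
revenue = conn.execute(q6).fetchone()[0]
print(revenue)  # ≈ 160.0 (1000*0.06 + 2000*0.05)
```

Q1 and Q6 stress scans; Q2 and Q16 add the multi-table joins the drill-down slides compare.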
Test methodology
• ALOJA-BENCH as the driver
• Test methodology
  • Queries run from 1 to 22, sequentially, to try to avoid caches
  • [At least] 3 repetitions
  • Query ALL (Q ALL) as a full run
  • Power runs (no concurrency)
• Data sizes: 1GB, 10GB, 100GB, 500GB*, 1TB
• Metric: execution time
• Comparisons
  • Q ALL (full run)
  • Scans: Q1 and Q6
  • Joins: Q2 and Q16 (Q16 the most "complete" single query)
• Process and settings
  • TPC-H datagen CSVs converted to Hive ORC tables
  • Each system uses its own hive.settings (on-prem from the repo)

* 500GB is not a standard size, but 300GB is.
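The methodology above can be sketched as a small driver loop; `run_query` below is a hypothetical stand-in for submitting a TPC-H query to Hive and waiting for it to finish:

```python
import time

def run_query(query_id):
    """Hypothetical stand-in for submitting TPC-H query N to Hive
    (e.g. via something like `hive -f qN.sql`) and blocking until done."""
    time.sleep(0.001 * query_id)  # placeholder workload

def benchmark(repetitions=3):
    """Run queries 1-22 sequentially, several times, recording wall-clock
    execution time per query (power runs: no concurrency)."""
    timings = {q: [] for q in range(1, 23)}
    for _ in range(repetitions):
        for q in range(1, 23):  # sequential order helps avoid cache reuse
            start = time.time()
            run_query(q)
            timings[q].append(time.time() - start)
    return timings

timings = benchmark()
# "Q ALL" metric: one full pass over all 22 queries (sum of best times here)
q_all = sum(min(t) for t in timings.values())
```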
SUTs performance and scalability
Execution times
Scalability to data size
Query drill-down
Latency test
Exec times by SUT: 8dn 100GB Q ALL
Notes:
• Results show execution times for full TPC-H on SKUs with 8 data nodes at 100GB, except CBD OnMetal, which has 4.
• CBD: OnMetal is fast; cloud instances scale with SKU size.
• CDP: the SSD version is only marginally faster than regular; n1-std-8 only 30% faster than n1-std-4.
• EMR: m4.xlarge (local SSD + EBS) 18% faster than m3.xlarge (EBS only).
• HDI: scales with SKU size; fastest result is D4v2 (D3v2 fastest among defaults).
• M100 (on-prem): poor results.
• A3s and CBD cloud instances show high variability.
[Chart: Q ALL execution times by SUT for CBD, CDP, EMR, HDI, and on-prem.]
Exec times by SKU: 8dn 1TB Q ALL
Notes:
• Results show execution times for full TPC-H on SKUs with 8 data nodes at 1TB, except CBD OnMetal, which has 4.
• At 1TB, lower-end systems obtain poorer performance (similar, but poor results).
• CBD: OnMetal fastest among defaults (2nd fastest overall); cloud 1-7 cannot process 1TB; 1-15 and 1-30 give similar results.
• CDP: the SSD version is slightly slower than regular; n1-std-8 2x faster than n1-std-4 (as expected).
• EMR: m4.xlarge 15% faster than m3.xlarge.
• HDI: scales with SKU size; fastest result is D4v2.
• M100 (on-prem): improves its relative results.
[Chart: Q ALL execution times by SKU for CBD, CDP, EMR, and HDI at 1TB.]
Data size scalability of defaults: up to 1TB (Q ALL)
Notes:
• Chart shows the data scale factor from 10GB to 1TB for the different SUTs with 8 data nodes, except CBD OnMetal, which has 4.
• Comparing default instances, CDP has the poorest scalability, then EMR.
• On-prem scales linearly up to 1TB.
• HDI and OnMetal can scale to larger sizes.
Data size scalability up to 1TB (Q ALL)
Notes:
• Chart shows the data scale factor from 10GB to 1TB for the different SUTs with 8 data nodes, except CBD OnMetal, which has 4.
• CBD hadoop1-7 cannot process more than 100GB.
• After that, HDI A3s scale the poorest (old-generation system).
• EMR and CDP sit in the middle.
• HDI D4s have the best scalability and times, followed by the CBD OnMetal system.
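The scale factor these charts compare can be expressed as runtime growth relative to data growth; the runtimes below are hypothetical, purely for illustration:

```python
def scalability_factor(time_small, time_large, size_small, size_large):
    """Ratio of runtime growth to data growth: 1.0 means perfectly linear
    scaling; larger values mean the system degrades as data grows."""
    return (time_large / time_small) / (size_large / size_small)

# Hypothetical Q ALL runtimes (seconds) at 100 GB and 1 TB:
linear = scalability_factor(3600, 36000, 100, 1000)     # 10x data, 10x time
sublinear = scalability_factor(3600, 72000, 100, 1000)  # 10x data, 20x time
print(linear, sublinear)  # 1.0 2.0
```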
Exec times of defaults: Scans vs. Joins at 1TB
Scans (parallelizable: Q1 CPU-bound, Q6 I/O-bound) vs. Joins (less parallelizable: Q2, Q16)
Notes: Q1 (I/O + CPU) is slow on the CDP and EMR systems; same for Q16. OnMetal is fastest for I/O and joins, then HDI D3v2. Both charts compare defaults with 4 cores.
Perf details: Q16 at 1TB on default VMs
[Charts: CPU % (user/system/iowait), disk R/W kB/s, and NET R/W kB/s over time for HDI-D3v2-8, CDP-N1std4-8, EMR-M3.xlarge-8, and CBD-OnMetal-4 (NET data N/A for OnMetal). Annotations: highest I/O wait; highest disk throughput; highest NET utilization; a different pattern with the lowest CPU utilization; lowest disk utilization.]
Configurations
Notes: CDP and CBD run Java 1.8; all use OpenJDK. HDI is the only one to fine-tune Hive (enabling Tez and performance options).

| Category | Config | EMR | CDP | HDI | CBD (OnMetal) | On-prem |
|---|---|---|---|---|---|---|
| System | Java version | OpenJDK 1.7.0_111 | OpenJDK 1.8.0_91 | OpenJDK 1.7.0_101 | OpenJDK 1.8.0_71 | JDK 1.7 |
| HDFS | File system | EBS / S3 | GCS (hadoop v.) | WASB | Local + Swift + S3 | Local |
| | Replication | 3 | 2 | 3 | 2 | 3 |
| | Block size | 128MB | 128MB | 128MB | 256MB | 128MB |
| | File buffer size | 4KB | 64KB | 128KB | 256KB | 64KB |
| M/R | Output compression | SNAPPY | False | False | SNAPPY | False |
| | IO factor / MB | 48 / 200 | 10 / 100 | 100 / 614 | 100 / 358 | 10 / 100 |
| | Memory MB | 1536 | 3072 | 1536 | 2048 | 1536 |
| Hive | Engine | MR | MR | Tez | MR | MR |
| | ORC config | Defaults | Defaults | Defaults | Defaults | Defaults |
| | Vectorized exec | False | False | Enabled | False | Enabled |
| | Cost-based opt. | False | Enabled | Enabled | Enabled | Enabled |
| | Enforce bucketing | False | False | True | False | True |
| | Optimize bucket map join | False | False | True | False | True |
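The Hive options in the table map to standard Hive configuration properties; an illustrative hive.settings fragment mirroring the HDI column (not one of the providers' actual files) might look like:

```sql
-- Illustrative fragment; property names are standard Hive options,
-- values mirror the HDI column of the table above
SET hive.execution.engine=tez;
SET hive.vectorized.execution.enabled=true;
SET hive.cbo.enable=true;
SET hive.enforce.bucketing=true;
SET hive.optimize.bucketmapjoin=true;
```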
Latency test: Exec time by SKU 8dn 1GB Q 16
Notes:
• Results show execution times for query 16 at 1GB, except CBD OnMetal, which has 4 data nodes.
• HDI D3v2 and D4v2 have the lowest times ("lowest latency"), then the CDP systems.
Price / Performance
Price and execution times assume:
• only the cost of running the benchmark, or full 24/7 utilization
• no provisioning time or idle time
• by-the-second billing
Price/Performance 100GB (Q ALL)
Notes:
• Shows the price/performance ratio by SUT; lower in price and time is better.
• Chart is zoomed to differentiate clusters; it highlights the cheapest run, the fastest run, and the most cost-effective SUT.
Price assumptions: measures only the cost of running the benchmark in seconds; cluster setup time is ignored.

| Rank | Cluster | Best cost | Best time |
|---|---|---|---|
| 1 | CDP-n1std4-8 | USD 6.37 | 3:11:57 |
| 2 | CDP-n1std4-1SSD-8 | USD 6.55 | 3:06:44 |
| 3 | EMR-m4.xlarge-8 | USD 8.18 | 2:40:24 |
| 4 | HDI-D3v2-HDP24-8 | USD 8.74 | 1:36:45 |
| 5 | CDP-n1std8-8 | USD 9.35 | 2:27:57 |
| 6 | HDI-D4v2-HDP24-8 | USD 10.20 | 0:57:29 |
| 7 | EMR-m3.xlarge-8 | USD 10.79 | 3:08:49 |
| 8 | HDI-A3-8 | USD 11.96 | 4:10:04 |
| 9 | M100-8n | USD 13.10 | 3:32:29 |
| 10 | HDI-D4-8 | USD 15.08 | 1:24:59 |
| 11 | CBD-hadoop1-7-8 | USD 19.16 | 7:02:33 |
| 12 | CBD-OnMetal40-4 | USD 19.31 | 1:38:12 |
| 13 | CBD-hadoop1-15-8 | USD 26.45 | 4:51:41 |
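The cost metric behind this ranking can be sketched as the hourly cluster price prorated to the benchmark's wall-clock time, under the stated assumptions (by-the-second billing, no setup or idle time). The prices and runtimes below are hypothetical:

```python
def run_cost(price_per_hour_usd, runtime_seconds):
    """Benchmark cost under by-the-second billing: hourly cluster price
    prorated to the run's wall-clock time (setup and idle time ignored)."""
    return price_per_hour_usd * runtime_seconds / 3600.0

# Hypothetical: a USD 2.00/hour cluster finishing Q ALL in 3 hours
# vs a USD 6.00/hour cluster finishing in 1.5 hours.
cheap = run_cost(2.00, 3 * 3600)    # slower but cheaper run
fast = run_cost(6.00, 1.5 * 3600)   # faster but pricier run
print(cheap, fast)  # 6.0 9.0
```

A cheaper-per-hour cluster can still lose on price/performance if it runs long enough, which is exactly the trade-off the chart visualizes.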
Price/Performance 1TB (Q ALL)
Notes:
• Shows the price/performance ratio by SUT; lower in price and time is better.
• Chart is zoomed to differentiate clusters; it highlights the cheapest run, the fastest run, and the most cost-effective SUT.
Price assumptions: measures only the cost of running the benchmark in seconds; cluster setup time is ignored.

| Rank | Cluster | Best cost | Best time |
|---|---|---|---|
| 1 | HDI-D3v2-HDP24-8 | USD 39.63 | 7:18:42 |
| 2 | HDI-D4v2-HDP24-8 | USD 42.02 | 3:56:45 |
| 3 | M100-8n | USD 42.85 | 11:34:50 |
| 4 | CDP-n1std8-8 | USD 44.91 | 11:50:46 |
| 5 | CDP-n1std4-8 | USD 46.49 | 23:21:05 |
| 6 | CDP-n1std4-1SSD-8 | USD 50.53 | 24:00:52 |
| 7 | EMR-m4.xlarge-8 | USD 54.26 | 17:44:01 |
| 8 | HDI-D4-8 | USD 62.75 | 5:53:32 |
| 9 | CBD-OnMetal40-4 | USD 67.77 | 5:44:36 |
| 10 | EMR-m3.xlarge-8 | USD 69.92 | 20:23:01 |
| 11 | HDI-A3-8 | USD 74.83 | 27:42:56 |
| 12 | CBD-hadoop1-15-8 | USD 128.44 | 23:36:37 |
SW and HW improvements
PaaS provider improvements over time (tests on 4 data nodes)
SW: HDP version 2.3 to 2.4 improvement on HDI D3 v1, 4 nodes, Q ALL, 100GB
Notes:
• Test compares the migration to HDP 2.4: D3s improved ~35% and can now run 1TB on 4 data nodes without modifications, with no more namenode swapping. Larger nodes show smaller improvements.
[Charts: run time at 100GB (D3s 35% improvement); scalability from 1GB to 1TB (D3s can now scale to 1TB).]
SW: EMR version 4.7 to 5.0 improvement on m4.xlarge, 4 nodes, Q ALL
Notes:
• Test compares performance improvements on EMR 5.0 (Hive 2.1, Tez by default, Spark 2.0).
• EMR 5.0 gets a 2x improvement at 4 nodes.
[Charts: run time at 1TB; scalability from 1GB to 1TB.]
HDI default HW improvement: 4 nodes Q ALL
Notes:
• Test compares performance improvements across HDI's default VM instances, from A3 to D3 and D3v2 (30% faster CPU, same price), on HDP 2.3.
[Charts: run time at 1TB (note the A3 variability); scalability from 1GB to 1TB.]
Summary
Lessons learned, findings, conclusions, references
Remarks / Findings
• Setting up and fine-tuning Big Data stacks is complex and requires an iterative process.
• Cloud services continuously optimize their PaaS for general-purpose workloads:
  • All tune M/R, YARN, and their custom file storage.
  • They update HW (and prices) over time; you might need to re-deploy to get the benefits.
• Room for improvement:
  • Only HDI fine-tunes Hive; what about other new services (Spark, Storm, R, HBase)?
  • All are updating to Hive and Spark v2 (and enabling Tez, tuning ORC).
  • CBD is upgrading its HDP version.
• Beware: commodity VMs != commodity bare metal for Big Data.
  • Errors: originally this was to be a 4-node comparison.
  • Variability is an issue for low-end, old-generation VMs; beware of scalability and reliability too. Less of an issue on newer VMs.
  • Network throttling: not apparent at an 8-data-node cluster, but for larger clusters...
Summary:
Similarities
• Similar defaults for the cloud-based instances: 4 cores, ~16GB RAM, local SSDs
  • ~4GB RAM / core; good enough for Hadoop / Hive
• Elasticity: all allow on-demand scaling up, with a mixed mode of local + remote storage
• Fast networking, especially EMR (HDI depends on VM size); required for networked storage
• Most deploy in < 25 mins

Differences
• CBD offers OnMetal as the default: a high-end, non-shared system
  • What about in-memory systems (Spark, graph processing)?
• Elasticity: not all allow down-scaling / stopping (delete); HDI does completely (local storage only for temp data)
• Pricing is very different: EMR, CBD, and HDI bill per hour; CDP per minute; yet overall price/performance is similar
• CDP deploys in about a minute
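The billing-granularity difference can be illustrated with a short sketch (the hourly price is hypothetical; per-hour billing as EMR, CBD, and HDI used at the time, per-minute as CDP):

```python
import math

def billed_cost(price_per_hour, runtime_minutes, granularity_minutes):
    """Round runtime up to the provider's billing increment:
    granularity 60 = per-hour billing, 1 = per-minute billing."""
    billed = math.ceil(runtime_minutes / granularity_minutes) * granularity_minutes
    return price_per_hour * billed / 60.0

# Hypothetical USD 3.00/hour cluster running a 65-minute job:
hourly = billed_cost(3.00, 65, 60)   # billed as 120 min -> 6.00 USD
minutely = billed_cost(3.00, 65, 1)  # billed as 65 min  -> 3.25 USD
```

For short runs just past an hour boundary, coarse billing can nearly double the cost even at the same hourly price.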
The state of SQL-on-Hadoop in the Cloud
• Providers have successfully integrated on-demand Big Data services.
  • Most are on the path to offering pay-what-you-process models, completely disaggregating storage from compute.
  • This gives more elasticity to your data and needs: multiple clusters, pay only for what you use, planning-free, governance.
• What about performance and reliability?
  • Providers are upgrading and defaulting to newer-generation VMs: faster CPUs, SSDs (local and remote), the end of rotational disks?, fast networks.
  • They also keep the SW up to date: newer versions, security and performance patches, tuned for their infrastructure.
• Is it price-performant?
  • Yes, at least for the medium-sized. The cost is in compute, so you pay for what you use!
• For ALOJA, this work is the basis for future research.
Benchmarking with ALOJA
Local dev ENV
1. Install prerequisites: git, vagrant, VirtualBox
2. git clone https://github.com/Aloja/aloja.git
3. cd aloja
4. vagrant up
5. Open your browser at: http://localhost:8080
6. Optional: start the benchmarking cluster with `vagrant up /.*/`

Repeat / reproduce results
1. (Read the docs... or write us)
2. Set up your cloud credentials (or test on-prem)
3. Deploy a cluster: aloja/aloja-deploy.sh HDI-D3v2-8
4. Run the benchmark: aloja/aloja-bench/run_benchs.sh -b D2F-Hive-Bench
5. (also cluster-bench and sysbench)
More info:
• Upcoming publication: The state of SQL-on-Hadoop
  • Data release and more in-depth tech analysis
• ALOJA benchmarking platform and online repository
  • http://aloja.bsc.es http://aloja.bsc.es/publications
• BDOOP meetup group in Barcelona
• Workshop on Big Data Benchmarking (WBDB)
  • Next in Barcelona
• SPEC Research Big Data working group
  • http://research.spec.org/working-groups/big-data-working-group.html
• Slides and video:
  • Benchmarking Big Data on different architectures:
    • FOSDEM '16: https://archive.fosdem.org/2016/schedule/event/hpc_bigdata_automating_big_data_benchmarking/
    • http://www.slideshare.net/ni_po/benchmarking-hadoop
  • Michael Frank on Big Data benchmarking: http://www.tele-task.de/archive/podcast/20430/
  • Tilmann Rabl, Big Data benchmarking tutorial: http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl
Thanks, questions?
Follow up / feedback : [email protected]
Twitter: @ni_po
The state of SQL-on-Hadoop in the Cloud