The state of SQL-on-Hadoop in the Cloud
Transcript of The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
By Nicolas Poggi
Lead researcher – Big Data Frameworks
Data Centric Computing (DCC) Research Group
Hadoop Summit Melbourne – August 2016
Agenda
• Intro on BSC and ALOJA
  • Motivation
• PaaS services overview
  • Instances comparison
  • SW and HW specs
• SQL benchmark
  • Test methodology
• Evaluations
  • Execution times
  • Data size scalability
  • Price / performance
• PaaS evolution over time
  • SW and HW improvements
• Summary
  • Lessons learned
  • Conclusions & future work
Barcelona Supercomputing Center (BSC)
• Spanish national supercomputing center with a 22-year history in computer architecture, networking, and distributed systems research
  • Based at BarcelonaTech University (UPC)
  • Led by Mateo Valero: ACM Fellow; Eckert-Mauchly 2007, Google 2009, and Seymour Cray 2015 awards
• Large ongoing life-science computational projects, with industry and academia
  • Active research staff with 1000+ publications
• Prominent body of research activity around Hadoop
  • 2008-2013: SLA adaptive scheduler, accelerators, locality awareness, performance management; 7+ publications
  • 2013-present: cost-efficient upcoming Big Data architectures (ALOJA)
• Open model focus: no patents, public IP, publications (5+), and open source
ALOJA: towards cost-effective Big Data
• Open research project for automating the characterization and optimization of Big Data deployments
• Open-source benchmarking-to-insights platform and tools
• Largest public benchmarking repository
  • Over 80,000 job runs and 100+ HW configs tested (2014-2016)
• Community collaboration with industry and academia
• Preliminary to this study:
  • Big Data Benchmark Compendium (TPC-TC '15)
  • The Benefits of Hadoop as PaaS (Hadoop Summit EU '16)
http://aloja.bsc.es
[Diagram: Big Data benchmarking → online repository → Web / ML analytics]
Motivation of SQL-on-Hadoop study
• Extend the ALOJA platform to survey popular PaaS SQL Big Data cloud solutions, using Hive [to begin]
• First approach to the services, from an end-user's perspective
  • Using the public cloud (and pricing), online docs, and resources
  • Medium-size test deployments and data (8 data nodes, up to 1TB)
• Evaluate and compare out-of-the-box behavior (default VMs and config)
  • Architectural differences, readiness, competitive advantages
  • Scalability, price, and performance
Disclaimer: snapshot of the out-of-the-box price and performance during March-July 2016. Performance and especially costs change often. We use non-discounted pricing. I/O costs are complex to estimate for a single benchmark.
Platform-as-a-Service Big Data
• Cloud-based managed Hadoop services
  • Ready-to-use Hive, Spark, ...
  • Simplified management
• Deploys in minutes, on-demand, elastic
  • You select the instance type and the number of processing nodes
• Pay-as-you-go, pay-what-you-process models
• Optimized for general purpose
  • Fine-tuned to the cloud provider's architecture
Surveyed Hadoop/Hive PaaS services
• Amazon Elastic MapReduce (EMR)
  • Released: Apr 2009
  • OS: Amazon Linux AMI 4.4 (RHEL-like)
  • SW stack: EMR (custom, 4.7*)
  • Instances: m3.xlarge and m4.xlarge
• Google Cloud Dataproc (CDP)
  • Released: Feb 2016
  • OS: Debian GNU/Linux 8.4
  • SW stack: custom, v1
  • Instances: n1-standard-4 and n1-standard-8
• Azure HDInsight (HDI)
  • Released: Oct 2013
  • OS: Windows Server and Ubuntu 14.04.5 LTS
  • SW stack: HDP-based (v2.3 and 2.4)
  • Instances: A3s, D3s v1-2, and D4s v1-2
• Rackspace Cloud Big Data (CBD)
  • Released: ~Oct 2013
  • OS: CentOS 7
  • SW stack: HDP (2.3)
  • API: OpenStack (+ Lava)
  • Instances: Hadoop 1-7, 1-15, 1-30, OnMetal 40

We selected defaults and general-purpose VMs, with on-premise results as a baseline.
* EMR v5 released in August 2016
Systems Under Test (SUTs): VM/instance specs, elasticity, perf characterization
Focus: 8 data nodes, up to 1TB data size
SUTs: Tech specs and costs
* Estimate based on a 3-year lifetime including support and maintenance (see refs.)

Notes:
• Default cloud SKUs have 4 cores and ~15GB of RAM in all providers (~4GB of RAM / core)
• Prices vary greatly
• Rackspace defaults to the high-end OnMetal instance

| Provider | Instance type | Default? | Cores/Node | RAM/Node (GB) | RAM/core (GB) | Data nodes | Cost/Hour | Cluster shared |
|---|---|---|---|---|---|---|---|---|
| Amazon EMR (us-east-1) | m3.xlarge | Yes | 4 | 15 | 3.8 | 8 | USD 3.36 | Yes |
| | m4.xlarge | | 4 | 16 | 4 | 8 | USD 2.99 | Yes |
| Google CDP (europe-west1-b) | n1-standard-4 | Yes | 4 | 15 | 3.8 | 8 | USD 1.81 | Yes |
| | n1-standard-4 1 SSD | | 4 | 15 | 3.8 | 8 | USD 1.92 | Yes |
| | n1-standard-8 | | 8 | 30 | 7.5 | 8 | USD 3.61 | Yes |
| Azure HDI (South Central US) | A3 (Large) | (old def.) | 4 | 7 | 1.8 | 8 | USD 2.70 | Yes |
| | D3 v1 and v2 | Yes | 4 | 14 | 3.5 | 8 | USD 5.25 | Yes |
| | D4 v1 and v2 | | 4 | 14 | 3.5 | 8 | USD 10.48 | Yes |
| Rackspace CBD (Northern Virginia (IAD)) | hadoop1-7 | | 2 | 7 | 3.5 | 8 | USD 2.72 | Yes |
| | hadoop1-15 | (2nd) | 4 | 15 | 3.8 | 8 | USD 5.44 | Yes |
| | hadoop1-30 | | 8 | 30 | 3.8 | 8 | USD 10.88 | Yes |
| | OnMetal 40 | Yes | 40 | 128 | 3.2 | 4 | USD 11.80 | No |
| On-premise | 2012 (12 cores/64GB) | | 12 | 64 | 5.3 | 8 | USD 3.50* | No |
SUTs: Elasticity and I/O
* Tests need 5TB of raw HDFS storage; this cost is used. ** Supports up to 4 SSD drives.

| Provider | Instance type | Elasticity | Storage | Includes I/O costs | Cost/5TB/hr* | Deploy time |
|---|---|---|---|---|---|---|
| Amazon EMR | m3.xlarge | Compute (and EBS option) | 2x40GB local SSD / node | Yes / No with EBS | USD 0.07 | ~10 mins |
| | m4.xlarge | Compute and EBS (fixed size) | EBS size defined on deploy | No | | ~10 mins |
| Google CDP | n1-standard-4 | Compute and GCS (fixed size) | GCS size defined on deploy | No | USD 0.18 | ~1 min |
| | n1-standard-4 1 SSD | | 1x375GB SSD** + GCS | No | | ~1 min |
| | n1-standard-8 | | GCS size defined on deploy | No | | ~1 min |
| Azure HDI | A3 (Large) | Compute and storage | Elastic (WASB) | No | USD 0.17 | ~25 mins |
| | D3 v1 and v2 | | Elastic (WASB) + 200GB local SSD | No | | ~25 mins |
| | D4 v1 and v2 | | Elastic (WASB) + 400GB local SSD | No | | ~25 mins |
| Rackspace CBD | hadoop1-7 | Compute (Cloud Files option) | 1.5TB SATA / node | Yes (local) | USD 0.00 local / USD 0.07 cloud | ~25 mins |
| | hadoop1-15 | | 2.5TB SATA / node | Yes | | ~25 mins |
| | hadoop1-30 | | 5TB SATA / node | Yes | | ~25 mins |
| | OnMetal 40 | | 2x1.5TB SSD / node | Yes | | ~25 mins |
| On-premise | 2012 (12 cores/64GB) | No | 1TB SATA x6 / node | Yes | USD 0.00 | N/A |
SUTs: Perf characterization summary
• Ran CPU, memory bandwidth, network, I/O to 1 data disk, and DFSIO benchmarks
• CPU (not all cores are born the same) and memory bandwidth:
  • Best performing: OnMetal, then
  • CDP n1-std-8 similar to HDI D4v2s (and on-premise)
  • CDP n1-std-4 similar to HDI D3v2s and EMR m4.xlarge
  • Then EMR m3.xlarge, HDI A3s, and CBD cloud-based respectively (but similar)
• Network (Gbps):
  • EMR < 40, CDP < 8 (some variance), CBD < 5, on-prem 1 Gbps
  • HDI VM-dependent, < 6 Gbps (A3 1, D3 2, D4-D3v2 ~3, D4 6)
• I/O MB/s (write to 1 data disk):
  • Most between 100-150; n1-std-4 w/ SSD 400 (symmetrical); D4v2 and OnMetal > 1000? MB/s
• DFSIO R/W (whole cluster) MB/s:
  • Most below 50 read / 35 write; n1-std-4 w/ SSD 400/200, D4v2 60/50, OnMetal 615/315 MB/s
SQL-on-Hadoop benchmarking
Methodology and evaluations
Benchmark suite: TPC-H (derived)
• DB industry standard for decision support
  • Well-understood and accepted benchmark (since '99)
  • Audited results available online
• 22 "real world" business queries
  • Complex joins, grouping, nested queries
• Defines scale factors for data
• DDLs and queries from the D2F-Bench project:
  • Includes a Hive adaptation with ORC tables
  • Repo: https://github.com/Aloja/D2F-Bench
  • Based on https://github.com/hortonworks/hive-testbench; changes make it HDP-agnostic
  • Supports other engines: Spark, Pig, Impala, Drill, ...
TPC-H 8-tables schema
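To illustrate the kind of query in the suite, here is a sketch of TPC-H Q6 (one of the scan queries used below) run against a toy `lineitem` table. The sketch uses Python's built-in sqlite3 instead of Hive, and the handful of rows are made up for the example:

```python
import sqlite3

# In-memory table with a minimal subset of TPC-H lineitem columns.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lineitem (
        l_shipdate      TEXT,
        l_discount      REAL,
        l_quantity      REAL,
        l_extendedprice REAL
    )
""")
rows = [
    ("1994-03-15", 0.06, 10, 1000.0),  # qualifies
    ("1994-06-01", 0.05, 20, 2000.0),  # qualifies
    ("1995-01-01", 0.06, 10, 1000.0),  # shipped too late
    ("1994-03-15", 0.02, 10, 1000.0),  # discount out of range
    ("1994-03-15", 0.06, 30, 1000.0),  # quantity too high
]
conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?, ?)", rows)

# TPC-H Q6: revenue that would be gained by dropping small discounts.
q6 = """
    SELECT SUM(l_extendedprice * l_discount) AS revenue
    FROM lineitem
    WHERE l_shipdate >= '1994-01-01'
      AND l_shipdate <  '1995-01-01'
      AND l_discount BETWEEN 0.05 AND 0.07
      AND l_quantity < 24
"""
revenue = conn.execute(q6).fetchone()[0]
print(revenue)  # ≈ 160.0 (1000*0.06 + 2000*0.05)
```

Q1 and Q6 stress scans; Q2 and Q16 add the multi-table joins the drill-down slides compare.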
Test methodology
• ALOJA-BENCH as the driver
• Test methodology
  • Queries run from 1 to 22, sequentially, to try to avoid caches
  • [At least] 3 repetitions
  • Query ALL (Q ALL) as a full run
  • Power runs (no concurrency)
• Data sizes: 1GB, 10GB, 100GB, 500GB*, 1TB
• Metric: execution time
• Comparisons
  • Q ALL (full run)
  • Scans: Q1 and Q6
  • Joins: Q2 and Q16 (Q16 the most "complete" single query)
• Process and settings
  • TPC-H datagen CSVs converted to Hive ORC tables
  • Each system uses its own hive.settings (on-prem from the repo)

* 500GB is not a standard size, but 300GB is.
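The methodology above can be sketched as a small driver loop; `run_query` below is a hypothetical stand-in for submitting a TPC-H query to Hive and waiting for it to finish:

```python
import time

def run_query(query_id):
    """Hypothetical stand-in for submitting TPC-H query N to Hive
    (e.g. via something like `hive -f qN.sql`) and blocking until done."""
    time.sleep(0.001 * query_id)  # placeholder workload

def benchmark(repetitions=3):
    """Run queries 1-22 sequentially, several times, recording wall-clock
    execution time per query (power runs: no concurrency)."""
    timings = {q: [] for q in range(1, 23)}
    for _ in range(repetitions):
        for q in range(1, 23):  # sequential order helps avoid cache reuse
            start = time.time()
            run_query(q)
            timings[q].append(time.time() - start)
    return timings

timings = benchmark()
# "Q ALL" metric: one full pass over all 22 queries (sum of best times here)
q_all = sum(min(t) for t in timings.values())
```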
SUTs performance and scalability
Execution times
Scalability to data size
Query drill-down
Latency test
Exec times by SUT: 8dn 100GB Q ALL
Notes:
• Results show execution times for full TPC-H on SKUs with 8 data nodes at 100GB, except CBD OnMetal, which has 4.
• CBD: OnMetal is fast; cloud instances scale with SKU size.
• CDP: the SSD version is only marginally faster than regular; n1-std-8 only 30% faster than n1-std-4.
• EMR: m4.xlarge (local SSD + EBS) 18% faster than m3.xlarge (EBS only).
• HDI: scales with SKU size; fastest result is D4v2 (D3v2 fastest among defaults).
• M100 (on-prem): poor results.
• A3s and CBD cloud instances show high variability.
[Chart: Q ALL execution times by SUT for CBD, CDP, EMR, HDI, and on-prem.]
Exec times by SKU: 8dn 1TB Q ALL
Notes:
• Results show execution times for full TPC-H on SKUs with 8 data nodes at 1TB, except CBD OnMetal, which has 4.
• At 1TB, lower-end systems obtain poorer performance (similar, but poor results).
• CBD: OnMetal fastest among defaults (2nd fastest overall); cloud 1-7 cannot process 1TB; 1-15 and 1-30 give similar results.
• CDP: the SSD version is slightly slower than regular; n1-std-8 2x faster than n1-std-4 (as expected).
• EMR: m4.xlarge 15% faster than m3.xlarge.
• HDI: scales with SKU size; fastest result is D4v2.
• M100 (on-prem): improves its relative results.
[Chart: Q ALL execution times by SKU for CBD, CDP, EMR, and HDI at 1TB.]
Data size scalability of defaults: up to 1TB (Q ALL)
Notes:
• Chart shows the data scale factor from 10GB to 1TB for the different SUTs with 8 data nodes, except CBD OnMetal, which has 4.
• Comparing default instances, CDP has the poorest scalability, then EMR.
• On-prem scales linearly up to 1TB.
• HDI and OnMetal can scale to larger sizes.
Data size scalability up to 1TB (Q ALL)
Notes:
• Chart shows the data scale factor from 10GB to 1TB for the different SUTs with 8 data nodes, except CBD OnMetal, which has 4.
• CBD hadoop1-7 cannot process more than 100GB.
• After that, HDI A3s scale the poorest (old-generation system).
• EMR and CDP sit in the middle.
• HDI D4s have the best scalability and times, followed by the CBD OnMetal system.
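The scale factor these charts compare can be expressed as runtime growth relative to data growth; the runtimes below are hypothetical, purely for illustration:

```python
def scalability_factor(time_small, time_large, size_small, size_large):
    """Ratio of runtime growth to data growth: 1.0 means perfectly linear
    scaling; larger values mean the system degrades as data grows."""
    return (time_large / time_small) / (size_large / size_small)

# Hypothetical Q ALL runtimes (seconds) at 100 GB and 1 TB:
linear = scalability_factor(3600, 36000, 100, 1000)     # 10x data, 10x time
sublinear = scalability_factor(3600, 72000, 100, 1000)  # 10x data, 20x time
print(linear, sublinear)  # 1.0 2.0
```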
Exec times of defaults: Scans vs. Joins at 1TB
Scans (parallelizable: Q1 CPU-bound, Q6 I/O-bound) vs. Joins (less parallelizable: Q2, Q16)
Notes: Q1 (I/O + CPU) is slow on the CDP and EMR systems; same for Q16. OnMetal is fastest for I/O and joins, then HDI D3v2. Both charts compare defaults with 4 cores.
Perf details: Q16 at 1TB on default VMs
[Charts: CPU % (user/system/iowait), disk R/W kB/s, and NET R/W kB/s over time for HDI-D3v2-8, CDP-N1std4-8, EMR-M3.xlarge-8, and CBD-OnMetal-4 (NET data N/A for OnMetal). Annotations: highest I/O wait; highest disk throughput; highest NET utilization; a different pattern with the lowest CPU utilization; lowest disk utilization.]
Configurations
Notes: CDP and CBD run Java 1.8; all use OpenJDK. HDI is the only one to fine-tune Hive (enabling Tez and performance options).

| Category | Config | EMR | CDP | HDI | CBD (OnMetal) | On-prem |
|---|---|---|---|---|---|---|
| System | Java version | OpenJDK 1.7.0_111 | OpenJDK 1.8.0_91 | OpenJDK 1.7.0_101 | OpenJDK 1.8.0_71 | JDK 1.7 |
| HDFS | File system | EBS / S3 | GCS (hadoop v.) | WASB | Local + Swift + S3 | Local |
| | Replication | 3 | 2 | 3 | 2 | 3 |
| | Block size | 128MB | 128MB | 128MB | 256MB | 128MB |
| | File buffer size | 4KB | 64KB | 128KB | 256KB | 64KB |
| M/R | Output compression | SNAPPY | False | False | SNAPPY | False |
| | IO factor / MB | 48 / 200 | 10 / 100 | 100 / 614 | 100 / 358 | 10 / 100 |
| | Memory MB | 1536 | 3072 | 1536 | 2048 | 1536 |
| Hive | Engine | MR | MR | Tez | MR | MR |
| | ORC config | Defaults | Defaults | Defaults | Defaults | Defaults |
| | Vectorized exec | False | False | Enabled | False | Enabled |
| | Cost-based opt. | False | Enabled | Enabled | Enabled | Enabled |
| | Enforce bucketing | False | False | True | False | True |
| | Optimize bucket map join | False | False | True | False | True |
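The Hive options in the table map to standard Hive configuration properties; an illustrative hive.settings fragment mirroring the HDI column (not one of the providers' actual files) might look like:

```sql
-- Illustrative fragment; property names are standard Hive options,
-- values mirror the HDI column of the table above
SET hive.execution.engine=tez;
SET hive.vectorized.execution.enabled=true;
SET hive.cbo.enable=true;
SET hive.enforce.bucketing=true;
SET hive.optimize.bucketmapjoin=true;
```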
Latency test: Exec time by SKU 8dn 1GB Q 16
Notes:
• Results show execution times for query 16 at 1GB, except CBD OnMetal, which has 4 data nodes.
• HDI D3v2 and D4v2 have the lowest times ("lowest latency"), then the CDP systems.
Price / Performance
Price and execution times assume:
• only the cost of running the benchmark, or full 24/7 utilization
• no provisioning time or idle time
• by-the-second billing
Price/Performance 100GB (Q ALL)
Notes:
• Shows the price/performance ratio by SUT; lower in price and time is better.
• Chart is zoomed to differentiate clusters; it highlights the cheapest run, the fastest run, and the most cost-effective SUT.
Price assumptions: measures only the cost of running the benchmark in seconds; cluster setup time is ignored.

| Rank | Cluster | Best cost | Best time |
|---|---|---|---|
| 1 | CDP-n1std4-8 | USD 6.37 | 3:11:57 |
| 2 | CDP-n1std4-1SSD-8 | USD 6.55 | 3:06:44 |
| 3 | EMR-m4.xlarge-8 | USD 8.18 | 2:40:24 |
| 4 | HDI-D3v2-HDP24-8 | USD 8.74 | 1:36:45 |
| 5 | CDP-n1std8-8 | USD 9.35 | 2:27:57 |
| 6 | HDI-D4v2-HDP24-8 | USD 10.20 | 0:57:29 |
| 7 | EMR-m3.xlarge-8 | USD 10.79 | 3:08:49 |
| 8 | HDI-A3-8 | USD 11.96 | 4:10:04 |
| 9 | M100-8n | USD 13.10 | 3:32:29 |
| 10 | HDI-D4-8 | USD 15.08 | 1:24:59 |
| 11 | CBD-hadoop1-7-8 | USD 19.16 | 7:02:33 |
| 12 | CBD-OnMetal40-4 | USD 19.31 | 1:38:12 |
| 13 | CBD-hadoop1-15-8 | USD 26.45 | 4:51:41 |
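The cost metric behind this ranking can be sketched as the hourly cluster price prorated to the benchmark's wall-clock time, under the stated assumptions (by-the-second billing, no setup or idle time). The prices and runtimes below are hypothetical:

```python
def run_cost(price_per_hour_usd, runtime_seconds):
    """Benchmark cost under by-the-second billing: hourly cluster price
    prorated to the run's wall-clock time (setup and idle time ignored)."""
    return price_per_hour_usd * runtime_seconds / 3600.0

# Hypothetical: a USD 2.00/hour cluster finishing Q ALL in 3 hours
# vs a USD 6.00/hour cluster finishing in 1.5 hours.
cheap = run_cost(2.00, 3 * 3600)    # slower but cheaper run
fast = run_cost(6.00, 1.5 * 3600)   # faster but pricier run
print(cheap, fast)  # 6.0 9.0
```

A cheaper-per-hour cluster can still lose on price/performance if it runs long enough, which is exactly the trade-off the chart visualizes.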
Price/Performance 1TB (Q ALL)
Notes:
• Shows the price/performance ratio by SUT; lower in price and time is better.
• Chart is zoomed to differentiate clusters; it highlights the cheapest run, the fastest run, and the most cost-effective SUT.
Price assumptions: measures only the cost of running the benchmark in seconds; cluster setup time is ignored.

| Rank | Cluster | Best cost | Best time |
|---|---|---|---|
| 1 | HDI-D3v2-HDP24-8 | USD 39.63 | 7:18:42 |
| 2 | HDI-D4v2-HDP24-8 | USD 42.02 | 3:56:45 |
| 3 | M100-8n | USD 42.85 | 11:34:50 |
| 4 | CDP-n1std8-8 | USD 44.91 | 11:50:46 |
| 5 | CDP-n1std4-8 | USD 46.49 | 23:21:05 |
| 6 | CDP-n1std4-1SSD-8 | USD 50.53 | 24:00:52 |
| 7 | EMR-m4.xlarge-8 | USD 54.26 | 17:44:01 |
| 8 | HDI-D4-8 | USD 62.75 | 5:53:32 |
| 9 | CBD-OnMetal40-4 | USD 67.77 | 5:44:36 |
| 10 | EMR-m3.xlarge-8 | USD 69.92 | 20:23:01 |
| 11 | HDI-A3-8 | USD 74.83 | 27:42:56 |
| 12 | CBD-hadoop1-15-8 | USD 128.44 | 23:36:37 |
SW and HW improvements
PaaS provider improvements over time (tests on 4 data nodes)
SW: HDP version 2.3 to 2.4 improvement on HDI D3 v1, 4 nodes, Q ALL, 100GB
Notes:
• Test compares the migration to HDP 2.4: D3s improved ~35% and can now run 1TB on 4 data nodes without modifications, with no more namenode swapping. Larger nodes show smaller improvements.
[Charts: run time at 100GB (D3s 35% improvement); scalability from 1GB to 1TB (D3s can now scale to 1TB).]
SW: EMR version 4.7 to 5.0 improvement on m4.xlarge, 4 nodes, Q ALL
Notes:
• Test compares performance improvements on EMR 5.0 (Hive 2.1, Tez by default, Spark 2.0).
• EMR 5.0 gets a 2x improvement at 4 nodes.
[Charts: run time at 1TB; scalability from 1GB to 1TB.]
HDI default HW improvement: 4 nodes Q ALL
Notes:
• Test compares performance improvements across HDI's default VM instances, from A3 to D3 and D3v2 (30% faster CPU, same price), on HDP 2.3.
[Charts: run time at 1TB (note the A3 variability); scalability from 1GB to 1TB.]
Summary
Lessons learned, findings, conclusions, references
Remarks / Findings
• Setting up and fine-tuning Big Data stacks is complex and requires an iterative process.
• Cloud services continuously optimize their PaaS for general-purpose workloads:
  • All tune M/R, YARN, and their custom file storage.
  • They update HW (and prices) over time; you might need to re-deploy to get the benefits.
• Room for improvement:
  • Only HDI fine-tunes Hive; what about other new services (Spark, Storm, R, HBase)?
  • All are updating to Hive and Spark v2 (and enabling Tez, tuning ORC).
  • CBD is upgrading its HDP version.
• Beware: commodity VMs != commodity bare metal for Big Data.
  • Errors: originally this was to be a 4-node comparison.
  • Variability is an issue for low-end, old-generation VMs; beware of scalability and reliability too. Less of an issue on newer VMs.
  • Network throttling: not apparent at an 8-data-node cluster, but for larger clusters...
Summary:
Similarities
• Similar defaults for the cloud-based instances: 4 cores, ~16GB RAM, local SSDs
  • ~4GB RAM / core; good enough for Hadoop / Hive
• Elasticity: all allow on-demand scaling up, with a mixed mode of local + remote storage
• Fast networking, especially EMR (HDI depends on VM size); required for networked storage
• Most deploy in < 25 mins

Differences
• CBD offers OnMetal as the default: a high-end, non-shared system
  • What about in-memory systems (Spark, graph processing)?
• Elasticity: not all allow down-scaling / stopping (delete); HDI does completely (local storage only for temp data)
• Pricing is very different: EMR, CBD, and HDI bill per hour; CDP per minute; yet overall price/performance is similar
• CDP deploys in about a minute
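The billing-granularity difference can be illustrated with a short sketch (the hourly price is hypothetical; per-hour billing as EMR, CBD, and HDI used at the time, per-minute as CDP):

```python
import math

def billed_cost(price_per_hour, runtime_minutes, granularity_minutes):
    """Round runtime up to the provider's billing increment:
    granularity 60 = per-hour billing, 1 = per-minute billing."""
    billed = math.ceil(runtime_minutes / granularity_minutes) * granularity_minutes
    return price_per_hour * billed / 60.0

# Hypothetical USD 3.00/hour cluster running a 65-minute job:
hourly = billed_cost(3.00, 65, 60)   # billed as 120 min -> 6.00 USD
minutely = billed_cost(3.00, 65, 1)  # billed as 65 min  -> 3.25 USD
```

For short runs just past an hour boundary, coarse billing can nearly double the cost even at the same hourly price.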
The state of SQL-on-Hadoop in the Cloud
• Providers have successfully integrated on-demand Big Data services.
  • Most are on the path to offering pay-what-you-process models, completely disaggregating storage from compute.
  • This gives more elasticity to your data and needs: multiple clusters, pay only for what you use, planning-free, governance.
• What about performance and reliability?
  • Providers are upgrading and defaulting to newer-generation VMs: faster CPUs, SSDs (local and remote), the end of rotational disks?, fast networks.
  • They also keep the SW up to date: newer versions, security and performance patches, tuned for their infrastructure.
• Is it price-performant?
  • Yes, at least for the medium-sized. The cost is in compute, so you pay for what you use!
• For ALOJA, this work is the basis for future research.
Benchmarking with ALOJA
Local dev ENV
1. Install prerequisites: git, vagrant, VirtualBox
2. git clone https://github.com/Aloja/aloja.git
3. cd aloja
4. vagrant up
5. Open your browser at: http://localhost:8080
6. Optional: start the benchmarking cluster with `vagrant up /.*/`

Repeat / reproduce results
1. (Read the docs... or write us)
2. Set up your cloud credentials (or test on-prem)
3. Deploy a cluster: aloja/aloja-deploy.sh HDI-D3v2-8
4. Run the benchmark: aloja/aloja-bench/run_benchs.sh -b D2F-Hive-Bench
5. (also cluster-bench and sysbench)
More info:
• Upcoming publication: The state of SQL-on-Hadoop
  • Data release and more in-depth tech analysis
• ALOJA benchmarking platform and online repository
  • http://aloja.bsc.es http://aloja.bsc.es/publications
• BDOOP meetup group in Barcelona
• Workshop on Big Data Benchmarking (WBDB)
  • Next in Barcelona
• SPEC Research Big Data working group
  • http://research.spec.org/working-groups/big-data-working-group.html
• Slides and video:
  • Benchmarking Big Data on different architectures:
    • FOSDEM '16: https://archive.fosdem.org/2016/schedule/event/hpc_bigdata_automating_big_data_benchmarking/
    • http://www.slideshare.net/ni_po/benchmarking-hadoop
  • Michael Frank on Big Data benchmarking: http://www.tele-task.de/archive/podcast/20430/
  • Tilmann Rabl, Big Data benchmarking tutorial: http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl
Thanks, questions?
Follow up / feedback : [email protected]
Twitter: @ni_po
The state of SQL-on-Hadoop in the Cloud