Boost Performance with Scala – Learn From Those Who’ve Done It!

Transcript of Boost Performance with Scala – Learn From Those Who’ve Done It!

Page 1: Boost Performance with Scala – Learn From Those Who’ve Done It!

© Hortonworks Inc. 2011 – 2014. All Rights Reserved

Boost Performance with Scala – Learn From Those Who’ve Done It!

We do Hadoop.

Page 2: Boost Performance with Scala – Learn From Those Who’ve Done It!

Your speakers…

Dhruv Kumar, Partner Solutions Engineer, Hortonworks

Cyrille Chépélov, R&D Director, Transparency Rights Management

Page 3: Boost Performance with Scala – Learn From Those Who’ve Done It!

Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP

Customer Momentum
• 437+ customers (as of March 31, 2015)

Hortonworks Data Platform
• Completely open multi-tenant platform for any app & any data.
• A centralized architecture of consistent enterprise services for resource management, security, operations, and governance.

Partner for Customer Success
• Open source community leadership focus on enterprise needs
• Unrivaled world class support

• Founded in 2011
• Original 24 architects, developers, operators of Hadoop from Yahoo!
• 600+ Employees
• 1,000+ Ecosystem Partners

Page 4: Boost Performance with Scala – Learn From Those Who’ve Done It!

Traditional systems under pressure

Challenges
• Constrains data to app
• Can’t manage new data
• Costly to scale

[Figure: business value of traditional data (ERP, CRM, SCM) versus new data (clickstream, geolocation, web data, Internet of Things, docs and emails, server logs). Data volume: 2.8 zettabytes in 2012, 40 zettabytes in 2020. Laggards versus industry leaders.]

Page 5: Boost Performance with Scala – Learn From Those Who’ve Done It!

Hadoop emerged as foundation of new data architecture

Apache Hadoop is an open source data platform for managing large volumes of high velocity and variety of data

• Built by Yahoo! to be the heartbeat of its ad & search business

• Donated to Apache Software Foundation in 2005 with rapid adoption by large web properties & early adopter enterprises

• Incredibly disruptive to current platform economics

Traditional Hadoop Advantages
• Manages new data paradigm
• Handles data at scale
• Cost effective
• Open source

Traditional Hadoop Had Limitations
• Batch-only architecture
• Single purpose clusters, specific data sets
• Difficult to integrate with existing investments
• Not enterprise-grade

[Figure: traditional Hadoop stack: Application, Batch Processing (MapReduce), Storage (HDFS)]

Page 6: Boost Performance with Scala – Learn From Those Who’ve Done It!

Modern Data Architecture emerges to unify data & processing

Modern Data Architecture

• Enable applications to have access to all your enterprise data through an efficient centralized platform

• Supported with a centralized approach to governance, security and operations

• Versatile to handle any applications and datasets no matter the size or type

[Figure: Modern Data Architecture. Sources: existing systems (ERP, CRM, SCM), clickstream, web & social, geolocation, sensor & machine, server logs, unstructured data. Data system: HDFS (Hadoop Distributed File System) and YARN (Data Operating System) running batch, interactive, real-time and partner ISV workloads, alongside EDW and MPP systems. Analytics and applications: data marts, business analytics, visualization & dashboards.]

Page 7: Boost Performance with Scala – Learn From Those Who’ve Done It!

Hortonworks & Concurrent

Hortonworks and Concurrent Advance Enterprise Data Application Development on Hadoop

HDP integrates and delivers the Cascading SDK
• Collection of tools, documentation, libraries, tutorials and example projects
• Simplifies SQL integration and enables Scala development for Hadoop

Hortonworks provides level 1 & 2 support for Cascading SDK

Cascading is the proven application development platform for building data applications on Hadoop

Page 8: Boost Performance with Scala – Learn From Those Who’ve Done It!

Hortonworks & Concurrent: Partnership Benefits

• SDK empowers developers to quickly build rich data-centric enterprise applications on Hadoop

• Leverage existing Java or Scala based skill sets to develop complex applications

• Combines the robustness and simplicity of Cascading with the reliability and stability of HDP

• Apps built on Cascading such as Scalding can easily take advantage of YARN and Tez

Page 9: Boost Performance with Scala – Learn From Those Who’ve Done It!

Cascading SDK: Overview

• The most widely used application development framework for building Big Data applications

• Enables improved Developer Productivity for enterprises using HDP

Page 10: Boost Performance with Scala – Learn From Those Who’ve Done It!

HDP Integration of Cascading SDK

• SDKs that enable the rapid development of batch and interactive data-driven applications

• Integration with data processing layer allows Cascading to take advantage of advances in interactive applications

[Figure: HDP stack. Presentation & application: enable both existing and new applications to provide value to the organization. SDK layer, shown on top of each processing engine: Cascading (Java), Scalding (Scala), Lingual (SQL), Pattern (ML). Data processing: Tez (interactive) and MapReduce (batch). Efficient cluster resource management & shared services: YARN.]

Page 11: Boost Performance with Scala – Learn From Those Who’ve Done It!

Your Trusted Third Party in the Digital Age™

Scalding on Tez

Page 12: Boost Performance with Scala – Learn From Those Who’ve Done It!

Copyright © 2015 Transparency Rights Management. All rights reserved

HOW DID WE CHOOSE SCALDING?

Page 13: Boost Performance with Scala – Learn From Those Who’ve Done It!

Why Scalding?
Transparency Rights Management:

• A Trusted Third Party
  – Data escrow, controlled execution
  – Independent re-computation
  – Privacy & Personal Data compliance assessment

• Big Data Services for Entertainment
  – Metadata enrichment
  – IP use certification
  – Dataset analysis as a service

Page 14: Boost Performance with Scala – Learn From Those Who’ve Done It!

Why Scalding?
« Big Data Services for Entertainment »: a Use Case

• Digital Service Provider Report
• Copyright Owners / Collective Management Organizations

Page 15: Boost Performance with Scala – Learn From Those Who’ve Done It!

Why Scalding?
« Big Data Services for Entertainment »: a Use Case

• Digital Service Provider Report
• Copyright Owners / Collective Management Organizations
• Data Improvement
• Automatic Data Feed (« in your format »)
• Independent Report
• Conformance Report

Page 16: Boost Performance with Scala – Learn From Those Who’ve Done It!

Why Scalding?
Dataset analysis (from YouTube monthly reports)

• September 2013: SQL Server overheats
• October 2013: using Lingual
  – 12 SQL steps + bash scripts
• September 2014: Cascading + Java
• September 28th: tried out Scalding
• November 2014: delivered first results on Scalding
• April 2015: first success on Scalding+Tez

Page 17: Boost Performance with Scala – Learn From Those Who’ve Done It!

Anatomy of a Scalding app

• Your app (in Scala): you
• Scalding: @TwitterOSS
• Cascading: Concurrent, Inc.
• Hadoop + Tez platform libraries: Apache

Page 18: Boost Performance with Scala – Learn From Those Who’ve Done It!

SCALDING ON TEZ, THE MINI-HOWTO

Page 19: Boost Performance with Scala – Learn From Those Who’ve Done It!

Scalding on Tez, the mini-HOWTO

• Step 0: Prerequisites:
  – A YARN cluster
  – Cascading 3.0
  – TEZ runtime lib in HDFS (0.6.2-SNAPSHOT)
  – A version of Scalding with fabric selection (0.13.1 + PR1220)

Page 20: Boost Performance with Scala – Learn From Those Who’ve Done It!

Scalding on Tez, the mini-HOWTO
• Step 1: build.sbt

https://github.com/cchepelov/wcplus/blob/master/build.sbt

Page 21: Boost Performance with Scala – Learn From Those Who’ve Done It!

Scalding on Tez, the mini-HOWTO
• Step 1: build.sbt (redux)

1. Regain control over which libraries are included
2. Exclude some « long transitive » dependencies that pull in junk
3. Put in the desired fabric, in a configurable way:
   sbt -DCASCADING_FABRIC=hadoop clean assembly
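As an illustration of item 3, a build.sbt fragment could read the fabric name from a JVM system property and derive the Cascading artifact from it. This is only a sketch and not the contents of the linked build.sbt: the default fabric, the version number and the resolver line are assumptions, and the excludes from item 2 are left out.

```scala
// build.sbt fragment (sketch, not the wcplus build file).
// Assumption: the fabric is passed as -DCASCADING_FABRIC=<name>, where <name>
// matches the suffix of a Cascading platform artifact, for example
// "hadoop", "hadoop2-mr1" or "hadoop2-tez".
val cascadingVersion = "3.0.1" // assumed version, adjust to your setup
val cascadingFabric  = sys.props.getOrElse("CASCADING_FABRIC", "hadoop2-tez")

// Assumption: Cascading artifacts are resolved from the Conjars repository.
resolvers += "Conjars" at "http://conjars.org/repo"

libraryDependencies ++= Seq(
  "cascading" % "cascading-core" % cascadingVersion,
  // Only one fabric jar is pulled in, since the fabric jars cannot be loaded together.
  "cascading" % s"cascading-$cascadingFabric" % cascadingVersion
)
```

With such a fragment, switching between MapReduce and Tez would only require changing the property on the sbt command line before running the assembly task.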

Page 22: Boost Performance with Scala – Learn From Those Who’ve Done It!

Scalding on Tez, the mini-HOWTO
• Step 1bis: assembly.sbt

We’re using fatjars to simplify deployment.

Because of jar hell, we « need » a complicated assembly.sbt

https://github.com/cchepelov/wcplus/blob/master/assembly.sbt 

Page 23: Boost Performance with Scala – Learn From Those Who’ve Done It!

Scalding on Tez, the mini-HOWTO
• Step 2: a few job flags

https://github.com/cchepelov/wcplus/blob/master/src/main/scala/com/transparencyrights/demo/wcplus/CommonJob.scala

Page 24: Boost Performance with Scala – Learn From Those Who’ve Done It!

• Step 2: a few job flags (continued)

• tez.task.resource.memory.mb
  – As large as you can afford to give, per CPU per node
  – The more memory, the less Tez needs to spill intermediates to disk

• tez.container.max.java.heap.fraction
  – Defaults (1024 MiB * 0.8) assume the JVM’s native memory requirements don’t exceed 208 MiB
  – Scalding + the Scala runtime + Cascading on top of Tez seems to require more. YARN kills offenders swiftly!
  – The 460 MiB figure we’re using, (1024+512)*(1-0.7), may be a bit wasteful
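The arithmetic behind these two flags can be spelled out in a few lines. The sketch below assumes that the container size is what tez.task.resource.memory.mb sets and that the heap fraction is applied to it directly; the object and method names are made up for illustration. With a 1536 MiB container and a fraction of 0.7, the native headroom comes out at about 460.8 MiB, the figure quoted on the slide. The same computation for the defaults gives about 204.8 MiB, close to the 208 MiB the slide mentions.

```scala
object TezMemorySketch {
  /** Splits a container of `containerMiB` into (JVM heap, native headroom), both in MiB. */
  def split(containerMiB: Double, heapFraction: Double): (Double, Double) = {
    val heap = containerMiB * heapFraction
    (heap, containerMiB - heap)
  }

  def main(args: Array[String]): Unit = {
    // Defaults quoted on the slide: 1024 MiB container, heap fraction 0.8.
    val (defaultHeap, defaultNative) = split(1024, 0.8)
    // Values used by the speakers: (1024 + 512) MiB container, heap fraction 0.7.
    val (tunedHeap, tunedNative) = split(1024 + 512, 0.7)
    println(f"default: heap=$defaultHeap%.1f MiB, native headroom=$defaultNative%.1f MiB")
    println(f"tuned:   heap=$tunedHeap%.1f MiB, native headroom=$tunedNative%.1f MiB")
  }
}
```

If YARN keeps killing containers, the headroom (second value) is the number to raise, either by lowering the fraction or by growing the container.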

Page 25: Boost Performance with Scala – Learn From Those Who’ve Done It!

THAT’S IT.

(ALMOST)

Page 26: Boost Performance with Scala – Learn From Those Who’ve Done It!

IN PRACTICE…

Page 27: Boost Performance with Scala – Learn From Those Who’ve Done It!

« A VERSION OF SCALDING WITH FABRIC SELECTION »

WAIT, WHAT?

Page 28: Boost Performance with Scala – Learn From Those Who’ve Done It!

In practice
« A version of scalding with fabric selection » Wait, what?

Scalding’s traditional --local and --hdfs flags:
  – Use either LocalFlowConnector or HadoopFlowConnector
  – Types are hard-coded

Cascading 2.5 introduced a new fabric concept. You can run either with cascading-hadoop or with cascading-hadoop2-mr1. But:
  – Incompatible jars (can’t load both)
  – Main types visible to Scalding are different

Page 29: Boost Performance with Scala – Learn From Those Who’ve Done It!

In practice
« A version of scalding with fabric selection » Wait, what?

PR1220:
• No longer hardcodes « either Local or Hadoop 1.X »
• Enables supplying any flow connector implementation, as long as the jar’s around.
• --hdfs to be deprecated as an alias to --hadoop1
• Still built against Cascading 2.6
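One way to picture « any flow connector implementation, as long as the jar’s around » is to look the connector class up by name at run time instead of referring to it at compile time. The sketch below is not the code from PR1220: the short fabric keys and the `resolve` helper are hypothetical, and only the fully qualified class names are taken from Cascading.

```scala
import scala.util.Try

object FabricSelectionSketch {
  // Flow connector classes shipped by the various Cascading fabrics.
  // The short keys on the left are hypothetical names chosen for this sketch.
  val connectors: Map[String, String] = Map(
    "local"       -> "cascading.flow.local.LocalFlowConnector",
    "hadoop"      -> "cascading.flow.hadoop.HadoopFlowConnector",
    "hadoop2-mr1" -> "cascading.flow.hadoop2.Hadoop2MR1FlowConnector",
    "hadoop2-tez" -> "cascading.flow.tez.Hadoop2TezFlowConnector"
  )

  /** Returns the connector class if its jar is on the classpath, or an error message. */
  def resolve(fabric: String): Either[String, Class[_]] =
    connectors.get(fabric) match {
      case None => Left(s"unknown fabric: $fabric")
      case Some(className) =>
        // Class.forName fails with ClassNotFoundException when the jar is absent.
        Try(Class.forName(className)).toOption
          .toRight(s"$className is not on the classpath")
    }

  def main(args: Array[String]): Unit =
    println(resolve(args.headOption.getOrElse("hadoop2-tez")))
}
```

Because nothing refers to the connector types statically, the same application jar can be assembled against whichever single fabric the build selected.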

Page 30: Boost Performance with Scala – Learn From Those Who’ve Done It!

« STILL BUILT ON CASCADING 2.6 »

WHY?

Page 31: Boost Performance with Scala – Learn From Those Who’ve Done It!

In practice
« Still built on Cascading 2.6 »

Cascading 3.0 has carefully updated some argument types to prepare for the future. In Java, this is source- and binary-compatible:

[Figure: a library and its consumer; Library V2 with the same consumer]

Scala enforces generic type safety, and the Cascading 3.0 upgrades are not legal with scalac. But they still are with the JVM…

Page 32: Boost Performance with Scala – Learn From Those Who’ve Done It!

In practice… Going to native Cascading 3.0?

Scalding will require some adjustment to become compatible with the Java-level source upgrades.

Can this happen without breaking Scalding application source code?

Page 33: Boost Performance with Scala – Learn From Those Who’ve Done It!

GUAVA

Page 34: Boost Performance with Scala – Learn From Those Who’ve Done It!

GUAVA GUAVA

Page 35: Boost Performance with Scala – Learn From Those Who’ve Done It!

In practice… Guava

• Guava is a nice library… of little use in Scala (?)

• In a Scalding/Cascading/Tez JVM, multiple versions of Guava are required. Each layer depends on its own version: about every single version from 11.0 to 16.0.2

• There have been breaking changes (method renames & removals) in Guava 13

• These happen on really mundane objects (Closeable, Stopwatch), but they’re major troublemakers

Page 36: Boost Performance with Scala – Learn From Those Who’ve Done It!

In practice… Guava

• Discussions and actions are in progress to remove the pain

• In the meantime, we are using a patched version, « frankenguava », which provides both the older and the newer interfaces to keep all consumers happy across the stack.

Page 37: Boost Performance with Scala – Learn From Those Who’ve Done It!

CASCADING’S TEZ*REGISTRY

Page 38: Boost Performance with Scala – Learn From Those Who’ve Done It!

In practice… Cascading’s Tez*Registry

• Cascading 3.0 uses a set of mapping registries to convert Cascading patterns into the back-end API. The Tez registries are new, and distinct from the MR registries.

• The Tez registries are hardened against Concurrent’s extensive test library, which is built on years of MR experience. Tez has its own trouble spots. Beware of hash joins.

• It works fine now, but getting the Scalding test library onboard will go a long way.

Page 39: Boost Performance with Scala – Learn From Those Who’ve Done It!

In practice… Cascading’s Tez*Registry

• It works mostly fine now, but getting the Scalding test library onboard will go a long way.

Last-minute update: .filterWithValue / .mapWithValue currently crash the Cascading planner (as of 3.0.1); their implementation uses a HashJoin.

Page 40: Boost Performance with Scala – Learn From Those Who’ve Done It!

AN EXAMPLE

Page 41: Boost Performance with Scala – Learn From Those Who’ve Done It!

A small test:

Page 42: Boost Performance with Scala – Learn From Those Who’ve Done It!

A small test: « wc plus »

Input: 70 books, 1.1M lines, 10M words, 56M bytes

ComputeFrequencies (ignoring things that are more frequent than 80% of the max word frequency)

Outputs:
• Word, relative frequency, deviation from median relative freq
• Two Words, relative frequency, deviation from median relative freq
• Ten Words, relative frequency, deviation from median relative freq
• All Expressions (1-W to 10-W), relative frequency, deviation from median relative freq
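To make the computed quantities concrete, here is a small in-memory sketch using plain Scala collections. It is not the Scalding job from the wcplus repository: the names are hypothetical, « deviation » is assumed to mean the relative frequency minus the median relative frequency, and the 80% cut-off is left out for brevity.

```scala
object WcPlusSketch {
  /** Median of a non-empty sequence of doubles. */
  def median(xs: Seq[Double]): Double = {
    val sorted = xs.sorted
    val n = sorted.length
    if (n % 2 == 1) sorted(n / 2) else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }

  /** For each n-word expression: (relative frequency, deviation from the median relative frequency). */
  def frequencies(words: Seq[String], n: Int): Map[String, (Double, Double)] = {
    // Sliding windows of exactly n words, joined into one expression each.
    val grams = words.sliding(n).filter(_.length == n).map(_.mkString(" ")).toSeq
    val total = grams.length.toDouble
    val relative = grams.groupBy(identity).map { case (gram, occurrences) =>
      (gram, occurrences.length / total)
    }
    if (relative.isEmpty) Map.empty
    else {
      val med = median(relative.values.toSeq)
      relative.map { case (gram, freq) => (gram, (freq, freq - med)) }
    }
  }

  def main(args: Array[String]): Unit = {
    val words = "the cat and the dog and the bird".split(" ").toSeq
    // Single words, most frequent first.
    frequencies(words, 1).toSeq.sortBy(-_._2._1).foreach(println)
  }
}
```

Calling `frequencies(words, 2)` or `frequencies(words, 10)` gives the two-word and ten-word variants listed above; the « all expressions » output would be the union over n from 1 to 10.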

Page 43: Boost Performance with Scala – Learn From Those Who’ve Done It!

A small test: « wc plus »

Input: 70 books, 1.1M lines, 10M words, 56M bytes

ComputeFrequencies (ignoring things that are more frequent than 80% of the max word frequency)

Outputs:
• Word, relative frequency, deviation from median relative freq
• Two Words, relative frequency, deviation from median relative freq
• Ten Words, relative frequency, deviation from median relative freq
• All Expressions (1-W to 10-W), relative frequency, deviation from median relative freq

No .filterWithValue / .mapWithValue for now

[Figure: the pipeline diagram, annotated with four « count » steps. Image credit: Roulex45 / Wikipedia]

Page 44: Boost Performance with Scala – Learn From Those Who’ve Done It!

A small test: « wc plus »

https://github.com/cchepelov/wcplus 

Page 45: Boost Performance with Scala – Learn From Those Who’ve Done It!

TIPS & TRICKS

Page 46: Boost Performance with Scala – Learn From Those Who’ve Done It!

Tips & Tricks
0000-step-node-sub-graph.dot

Run your job with -Dcascading.planner.plan.path=/tmp/path/to/plan.lst

The planner will output a lot of useful files. One of them is …/$(Job)/4-final-flow-steps/0000-step-node-sub-graph.dot

Run that file through graphviz:
  dot -O -Tpdf 0000-step-node-sub-graph.dot

or, if the PDF is illegible, Firefox is great at zooming into SVG files:
  dot -O -Tsvg 0000-step-node-sub-graph.dot

Page 47: Boost Performance with Scala – Learn From Those Who’ve Done It!

Tips & Tricks
0000-step-node-sub-graph.dot

This is how TEZ names our stuff!

Page 48: Boost Performance with Scala – Learn From Those Who’ve Done It!

Tips & Tricks
Major differences between how a Cascading job gets mapped to MR and to TEZ:

MR
  – One flow, many (MANY) independent steps
  – One or more operators per step
  – Step-to-step communications involve disk (HDFS)
  – Each step is independent as far as MR is concerned
  – Step scheduling managed from outside the cluster, by Cascading

TEZ
  – One flow, one DAG. A DAG includes several nodes.
  – One or more operators per node
  – Node-to-node communications managed by TEZ. Memory, direct network or disk as necessary
  – YARN sees one « Application » per flow
  – Node scheduling managed by the TEZ DAG AppMaster

Page 49: Boost Performance with Scala – Learn From Those Who’ve Done It!

Tips & Tricks
yarn-swimlanes.sh

• A tool included in the Tez source distribution, in tez-tools/swimlanes (bash + python)

• Requires YARN ATS to work: « yarn logs -applicationId application_1345431315_1511 » must work

• Reports, in a Gantt chart, the per-container occupation

Page 50: Boost Performance with Scala – Learn From Those Who’ve Done It!

Tips & Tricks
yarn-swimlanes.sh (2)

application_1435150225179_0474.svg

Page 51: Boost Performance with Scala – Learn From Those Who’ve Done It!

Tips & Tricks
yarn-swimlanes.sh (3)

[Figure: swimlane chart with axes « time » and « containers »]

Page 52: Boost Performance with Scala – Learn From Those Who’ve Done It!

Tips & Tricks
Consider using .forceToDisk to ensure work is balanced within the DAG

[Figure: two swimlane charts, labelled « 890 seconds » and « 160 seconds »]

Page 53: Boost Performance with Scala – Learn From Those Who’ve Done It!

Tips & Tricks
Consider using .forceToDisk to ensure work is balanced within the DAG

[Figure: two swimlane charts side by side, labelled « 890 seconds » and « 160 seconds »]

Page 54: Boost Performance with Scala – Learn From Those Who’ve Done It!

Tips & Tricks
• Consider using .forceToDisk to ensure work is balanced within the DAG

• .forceToDisk really means « don’t merge those two TEZ nodes », which implies « manage appropriate data transmission between these two nodes »

• TextFile & other FixedPathSource friends don’t seem to automatically spread out work as well as they used to (huh?)

• YMMV, WIP.

Page 55: Boost Performance with Scala – Learn From Those Who’ve Done It!

PERFORMANCE

Page 56: Boost Performance with Scala – Learn From Those Who’ve Done It!

Performance
MR vs TEZ

Page 57: Boost Performance with Scala – Learn From Those Who’ve Done It!

Performance
MR vs TEZ; to scale

Page 58: Boost Performance with Scala – Learn From Those Who’ve Done It!

Performance
MR vs TEZ; TO SCALE!!!

MR run time: 14:22 (wall), 12:49 (cluster time), 5:43:26 (total CPU)

TEZ run time: 4:03 (wall), 2:50 (cluster time), 1:25:35 (total CPU)
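The run times above translate into speedup ratios with a little arithmetic. The helper names below are made up for this sketch, and the durations are copied from the slide; the result is roughly 3.5x on wall-clock time, 4.5x on cluster time and 4.0x on total CPU.

```scala
object SpeedupSketch {
  /** Parses "h:mm:ss" or "m:ss" into a number of seconds. */
  def seconds(t: String): Int =
    t.split(":").map(_.toInt).foldLeft(0)((acc, part) => acc * 60 + part)

  /** Ratio of the MapReduce duration to the Tez duration. */
  def speedup(mr: String, tez: String): Double =
    seconds(mr).toDouble / seconds(tez)

  def main(args: Array[String]): Unit = {
    // (label, MR duration, Tez duration), as reported on the slide.
    val runs = Seq(
      ("wall", "14:22", "4:03"),
      ("cluster time", "12:49", "2:50"),
      ("total CPU", "5:43:26", "1:25:35")
    )
    for ((label, mr, tez) <- runs)
      println(f"$label%-12s ${speedup(mr, tez)}%.2fx")
  }
}
```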

Page 59: Boost Performance with Scala – Learn From Those Who’ve Done It!

CONCLUSION

Page 60: Boost Performance with Scala – Learn From Those Who’ve Done It!

Conclusion

Apache Tez enables very significant performance gains compared to traditional MapReduce applications, on the same cluster and alongside the legacy.

The new Tez back-end, built by Concurrent, enables these exciting performance gains for existing Cascading and Scalding applications.

In the next few months, taking advantage of these performance gains should become as easy as upgrading and setting a few configuration switches.

Page 61: Boost Performance with Scala – Learn From Those Who’ve Done It!

Next Steps…

Download the Hortonworks Sandbox

Learn Hadoop

Build Your Analytic App

Try Hadoop 2

More about Concurrent & Hortonworks: http://hortonworks.com/partner/concurrent

More about Transparency Rights Management: http://www.transparencyrights.com/

Contact us: [email protected]

Page 62: Boost Performance with Scala – Learn From Those Who’ve Done It!

© Hortonworks Inc. 2011 – 2015. All Rights Reserved

Q&A