Boost Performance with Scala – Learn From Those Who’ve Done It!
Transcript of Boost Performance with Scala – Learn From Those Who’ve Done It!
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Boost Performance with Scala: Learn From Those Who’ve Done It!
We do Hadoop.
Your speakers…
Dhruv Kumar, Partner Solutions Engineer, Hortonworks
Cyrille Chépélov, R&D Director, Transparency Rights Management
Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP
Customer Momentum
• 437+ customers (as of March 31, 2015)
Hortonworks Data Platform
• Completely open multi-tenant platform for any app & any data
• A centralized architecture of consistent enterprise services for resource management, security, operations, and governance
Partner for Customer Success
• Open source community leadership focused on enterprise needs
• Unrivaled world-class support
• Founded in 2011
• Original 24 architects, developers, and operators of Hadoop from Yahoo!
• 600+ employees
• 1,000+ ecosystem partners
Traditional systems under pressure
Challenges
• Constrains data to app
• Can’t manage new data
• Costly to scale
[Chart: business value of new data sources (clickstream, geolocation, web data, Internet of Things, docs & emails, server logs) versus traditional sources (ERP, CRM, SCM); industry leaders capture the new data while laggards do not. Worldwide data grew from 2.8 zettabytes in 2012 toward a projected 40 zettabytes in 2020.]
Hadoop emerged as foundation of new data architecture
Apache Hadoop is an open source data platform for managing large volumes of high-velocity, high-variety data
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to Apache Software Foundation in 2005 with rapid adoption by large web properties & early adopter enterprises
• Incredibly disruptive to current platform economics
Traditional Hadoop Advantages
• Manages the new data paradigm
• Handles data at scale
• Cost effective
• Open source
Traditional Hadoop Had Limitations
• Batch-only architecture
• Single-purpose clusters, specific data sets
• Difficult to integrate with existing investments
• Not enterprise-grade
[Diagram: Application → Batch Processing (MapReduce) → Storage (HDFS)]
Modern Data Architecture emerges to unify data & processing
Modern Data Architecture
• Enables applications to access all your enterprise data through an efficient centralized platform
• Supported by a centralized approach to governance, security, and operations
• Versatile enough to handle any application and dataset, no matter the size or type
[Diagram: sources (clickstream, web & social, geolocation, sensor & machine, server logs, unstructured) and existing systems (ERP, CRM, SCM) feed HDFS (Hadoop Distributed File System) under YARN, the data operating system, which runs batch (MapReduce), interactive (Tez), real-time, partner ISV, and MPP/EDW workloads; analytics consumers include data marts, business analytics applications, and visualization & dashboards.]
Hortonworks & Concurrent
Hortonworks and Concurrent Advance Enterprise Data Application Development on Hadoop
HDP integrates and delivers the Cascading SDK
• Collection of tools, documentation, libraries, tutorials, and example projects
• Simplifies SQL integration and enables Scala development for Hadoop
• Hortonworks provides level 1 & 2 support for the Cascading SDK
Cascading is the proven application development platform for building data applications on Hadoop
Hortonworks & Concurrent: Partnership Benefits
• SDK empowers developers to quickly build rich data-centric enterprise applications on Hadoop
• Leverage existing Java or Scala based skill sets to develop complex applications
• Combines the robustness and simplicity of Cascading with the reliability and stability of HDP
• Frameworks built on Cascading, such as Scalding, can easily take advantage of YARN and Tez
Cascading SDK: Overview
• The most widely used application development framework for building Big Data applications
• Enables improved Developer Productivity for enterprises using HDP
HDP Integration of Cascading SDK
• An SDK that enables the rapid development of batch and interactive data-driven applications
• Integration with the data processing layer allows Cascading to take advantage of advances in interactive applications
[Diagram: YARN provides efficient cluster resource management & shared services; on top of it run interactive data processing (Tez) and batch data processing (MapReduce); above those sit Cascading (Java), Scalding (Scala), Lingual (SQL), and Pattern (ML), serving the presentation & application layer. This enables both existing and new applications to provide value to the organization.]
Your Trusted Third Party in the Digital Age™
Scalding on Tez
Copyright © 2015 Transparency Rights Management. All rights reserved.
HOW DID WE CHOOSE SCALDING?
Why Scalding? Transparency Rights Management:
• A Trusted Third Party
– Data escrow, controlled execution
– Independent re-computation
– Privacy & personal data compliance assessment
• Big Data Services for Entertainment
– Metadata enrichment
– IP use certification
– Dataset analysis as a service
Why Scalding? “Big Data Services for Entertainment”: a use case
Digital Service Provider Report → Copyright Owners / Collective Management Organizations
Why Scalding? “Big Data Services for Entertainment”: a use case
Digital Service Provider Report → Copyright Owners / Collective Management Organizations
• Data improvement: automatic data feed (“in your format”)
• Independent report / conformance report
Why Scalding? Dataset analysis (from YouTube monthly reports)
• September 2013: SQL Server overheats
• October 2013: using Lingual (12 SQL steps + bash scripts)
• September 2014: Cascading + Java
• September 28th: tried out Scalding
• November 2014: delivered first results on Scalding
• April 2015: first success on Scalding+Tez
Anatomy of a Scalding app
[Stack, top to bottom: Your App (in Scala), by you; scalding, by @TwitterOSS; cascading, by Concurrent, Inc.; Hadoop + Tez platform libraries, by Apache]
SCALDING ON TEZ, THE MINI-HOWTO
Scalding on Tez, the mini-HOWTO
• Step 0: Prerequisites
– A YARN cluster
– Cascading 3.0
– Tez runtime lib (0.6.2-SNAPSHOT) in HDFS
– A version of Scalding with fabric selection (0.13.1 + PR1220)
Scalding on Tez, the mini-HOWTO
• Step 1: build.sbt
https://github.com/cchepelov/wcplus/blob/master/build.sbt
Scalding on Tez, the mini-HOWTO
• Step 1: build.sbt (redux)
1. Regain control over which libraries are included
2. Exclude some “long transitive” dependencies that pull in junk
3. Put in the desired fabric, in a configurable way:
   sbt -DCASCADING_FABRIC=hadoop clean assembly
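The three points above can be sketched as an sbt fragment. This is a hypothetical reconstruction, not the actual file (which is at the wcplus URL on the previous slide); the property name `CASCADING_FABRIC`, the version strings, and the excluded module are illustrative:

```scala
// build.sbt sketch: pick the Cascading fabric from a JVM property so one build
// can target either MapReduce or Tez, e.g.
//   sbt -DCASCADING_FABRIC=hadoop2-tez clean assembly
val fabric = sys.props.getOrElse("CASCADING_FABRIC", "hadoop2-mr1")

libraryDependencies ++= Seq(
  "com.twitter" %% "scalding-core"       % "0.13.1",   // illustrative versions
  "cascading"   %  s"cascading-$fabric"  % "3.0.0"
)

// Point 2: drop a "long transitive" dependency that drags in conflicting jars
// (module chosen as a placeholder example)
excludeDependencies += "org.slf4j" % "slf4j-log4j12"
```

Only one fabric's jars end up on the classpath, which matters because, as discussed later, the Cascading fabric jars cannot be loaded side by side.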
Scalding on Tez, the mini-HOWTO
• Step 1bis: assembly.sbt
We’re using fat jars to simplify deployment. Because of jar hell, we “need” a complicated assembly.sbt:
https://github.com/cchepelov/wcplus/blob/master/assembly.sbt
Scalding on Tez, the mini-HOWTO
• Step 2: a few job flags
https://github.com/cchepelov/wcplus/blob/master/src/main/scala/com/transparencyrights/demo/wcplus/CommonJob.scala
• Step 2: a few job flags (continued)
• tez.task.resource.memory.mb
– As large as you can afford to give, per CPU per node
– The more memory, the less Tez needs to spill intermediates to disk
• tez.container.max.java.heap.fraction
– The defaults (1024 MiB * 0.8) assume the JVM’s native memory requirements don’t exceed 208 MiB
– Scalding + the Scala runtime + Cascading on top of Tez seem to require more. YARN kills offenders swiftly!
– The 460 MiB figure we’re using, (1024+512)*(1-0.7), may be a bit wasteful
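The arithmetic behind these two flags can be made explicit. This is our own sketch of the relationship (the object and method names are ours, not a Tez API); integer truncation makes the headroom come out a few MiB under the slide's round figures:

```scala
// container = tez.task.resource.memory.mb
// fraction  = tez.container.max.java.heap.fraction
object TezMemoryMath {
  /** JVM heap Tez requests inside the container, in MiB. */
  def heapMiB(containerMiB: Int, heapFraction: Double): Int =
    (containerMiB * heapFraction).toInt

  /** What remains for native memory (threads, NIO buffers, metaspace...).
    * If the JVM's native use exceeds this, YARN kills the container. */
  def nativeHeadroomMiB(containerMiB: Int, heapFraction: Double): Int =
    containerMiB - heapMiB(containerMiB, heapFraction)
}
```

With the defaults, `nativeHeadroomMiB(1024, 0.8)` leaves roughly 205 MiB of native headroom; the (1024+512) container at fraction 0.7 leaves roughly 461 MiB, matching the "460 MiB, may be a bit wasteful" remark.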
THAT’S IT.
(ALMOST)
IN PRACTICE…
“A VERSION OF SCALDING WITH FABRIC SELECTION”
WAIT, WHAT?
In practice: “A version of Scalding with fabric selection”? Wait, what?
Scalding’s traditional --local and --hdfs flags:
– Use either LocalFlowConnector or HadoopFlowConnector
– The types are hard-coded
Cascading 2.5 introduced a new fabric concept. You can run either with cascading-hadoop or with cascading-hadoop2-mr1. But:
– Incompatible jars (can’t load both)
– The main types visible to Scalding are different
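To make "fabric selection" concrete, here is a small illustration of the idea: map a command-line flag to the FlowConnector class to instantiate. The class names are Cascading's real ones, but this mapping function is our sketch, not Scalding's actual code:

```scala
// Sketch: resolve a fabric flag to the Cascading FlowConnector class name.
// With hard-coded types (pre-PR1220), only the first two cases could exist.
object FabricSelection {
  def connectorClassName(flag: String): Option[String] = flag match {
    case "--local"              => Some("cascading.flow.local.LocalFlowConnector")
    case "--hadoop1" | "--hdfs" => Some("cascading.flow.hadoop.HadoopFlowConnector")
    case "--hadoop2-mr1"        => Some("cascading.flow.hadoop2.Hadoop2MR1FlowConnector")
    case "--tez"                => Some("cascading.flow.tez.Hadoop2TezFlowConnector")
    case _                      => None
  }
}
```

The point of the PR is that the connector is looked up by name at runtime, so any fabric works as long as its jar is on the classpath, and only that fabric's jar needs to be there.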
In practice: “A version of Scalding with fabric selection”? Wait, what?
PR1220:
– No longer hardcodes “either Local or Hadoop 1.x”
– Enables supplying any FlowConnector implementation, as long as the jar’s around
– --hdfs to be deprecated as an alias to --hadoop1
– Still built against Cascading 2.6
“STILL BUILT ON CASCADING 2.6”
WHY?
In practice: “Still built on Cascading 2.6”
Cascading 3.0 has carefully updated some argument types to prepare for the future. In Java this is source- and binary-compatible. Scala, however, enforces generic type safety: the Cascading 3.0 upgrades are not legal for scalac, even though they still are for the JVM…
[Diagram: the same library consumer, compiled against Library V1, keeps working against Library V2 (in Java)]
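Why does the JVM accept what scalac rejects? Because the JVM erases generics, widening a generic parameter does not change the erased method signature, so already-compiled callers keep linking. A small demonstration (the method and types are ours, standing in for a Cascading signature change):

```scala
// Imagine a library method whose parameter was widened between versions,
// e.g. from java.util.List[String] to java.util.List[_ <: CharSequence].
// After erasure, both versions have the same JVM signature: (Ljava/util/List;)I
object ErasureDemo {
  def sizeOf(xs: java.util.List[_ <: CharSequence]): Int = xs.size
}
```

Reflection confirms the erased parameter type is plain `java.util.List`; that is why old bytecode keeps working against the new jar, while scalac, which re-checks the full generic types at compile time, refuses the new source signatures.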
In practice… going to native Cascading 3.0?
Scalding will require some adjustment to become compatible with the Java-level source upgrades. Can this happen without breaking Scalding application source code?
GUAVA
In practice… Guava
• Guava is a nice library… of little use in Scala (?)
• In a Scalding/Cascading/Tez JVM, multiple versions of Guava are required; each layer depends on its own version, about every single version from 11.0 to 16.0.2
• There have been breaking changes (method renames & removals) in Guava 13
• These happen on really mundane objects (Closeable, Stopwatch), but they’re major troublemakers
In practice… Guava
• Discussions and actions are in progress to remove the pain
• In the meantime, we use a patched version (“frankenguava”) that provides both the older and the newer interfaces, keeping all consumers across the stack happy
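A related, commonly used mitigation (hedged: this is not necessarily what "frankenguava" does, just a standard alternative) is to shade your own application's Guava with sbt-assembly, so your copy cannot clash with whatever versions Tez and Cascading bring along. The renamed package prefix is a placeholder:

```scala
// assembly.sbt sketch (sbt-assembly plugin): rewrite our own references to Guava
// into a private namespace, leaving the platform's Guava jars untouched.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "myapp.shaded.guava.@1")
    .inProject // rewrite only this project's classes, not the stack's jars
)
```

Shading only solves clashes between your code and the stack; it does not help when two platform layers disagree with each other, which is the case the frankenguava patch addresses.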
CASCADING’S TEZ*REGISTRY
In practice… Cascading’s Tez*Registry
• Cascading 3.0 uses a set of mapping registries to convert Cascading patterns into the back-end API. The Tez registries are new, and distinct from the MR registries.
• The registries are hardened against Concurrent’s extensive test library, which is built on years of MR experience; Tez has its own trouble spots. Beware of hash joins.
• It works fine now, but getting the Scalding test library on board will go a long way.
In practice… Cascading’s Tez*Registry
• It works mostly fine now, but getting the Scalding test library on board will go a long way.
Last-minute update: .filterWithValue / .mapWithValue currently crash the Cascading planner (as of 3.0.1); the implementation uses a HashJoin.
AN EXAMPLE
A small test:
A small test: “wc plus”
Input: 70 books, 1.1M lines, 10M words, 56M bytes
ComputeFrequencies produces, for each n-gram size:
• Word: relative frequency, deviation from median relative frequency
• Two words: relative frequency, deviation from median relative frequency
• …
• Ten words: relative frequency, deviation from median relative frequency
• All expressions (1-W to 10-W): relative frequency, deviation from median relative frequency
Ignoring things that are more frequent than 80% of the max word frequency.
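The stages described above can be modeled in plain Scala (the real job uses Scalding's TypedPipe on a cluster; these names and this in-memory form are ours): tokenize into n-grams, count, convert counts to relative frequencies, and drop anything above 80% of the maximum frequency.

```scala
// In-memory model of the "wc plus" pipeline stages.
object WcPlusModel {
  /** n-grams as space-joined strings, e.g. n=2 gives "two words". */
  def ngrams(words: Seq[String], n: Int): Seq[String] =
    words.sliding(n).map(_.mkString(" ")).toList

  /** token -> count / total tokens */
  def relativeFreq(tokens: Seq[String]): Map[String, Double] = {
    val total = tokens.size.toDouble
    tokens.groupBy(identity).map { case (t, ts) => t -> ts.size / total }
  }

  /** "Ignoring things that are more frequent than 80% of the max frequency". */
  def dropTooFrequent(freqs: Map[String, Double]): Map[String, Double] = {
    val cutoff = 0.8 * freqs.values.max
    freqs.filter { case (_, f) => f <= cutoff }
  }
}
```

On the real corpus each of these stages becomes a counting step in the flow, and it is the fan-out from one input to the 1-word through 10-word counts that gives Tez's DAG scheduling its advantage over step-by-step MapReduce.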
A small test: “wc plus” (same pipeline, annotated)
• Each n-gram stage is a count step
• No .filterWithValue / .mapWithValue for now
(Illustration: Roulex45 / Wikipedia)
A small test: “wc plus”
https://github.com/cchepelov/wcplus
TIPS & TRICKS
Tips & Tricks: 0000-step-node-sub-graph.dot
Run your job with -Dcascading.planner.plan.path=/tmp/path/to/plan.lst
The planner will output a lot of useful files. One of them is …/$(Job)/4-final-flow-steps/0000-step-node-sub-graph.dot
Run that file through Graphviz:
dot -O -Tpdf 0000-step-node-sub-graph.dot
or, if the PDF is illegible, Firefox is great at zooming into SVG files:
dot -O -Tsvg 0000-step-node-sub-graph.dot
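Put together, the steps above look roughly like this command sequence (a sketch: the jar name, job class, and output path are example placeholders, and the plan directory layout follows the slide):

```shell
# 1. Run the job with the planner told to dump its plan files
hadoop jar wcplus-assembly.jar com.example.WcPlusJob --tez \
  -Dcascading.planner.plan.path=/tmp/path/to/plan.lst

# 2. Render the final step graph; -O writes the output next to the input file
dot -O -Tsvg /tmp/path/to/plan.lst/WcPlusJob/4-final-flow-steps/0000-step-node-sub-graph.dot
```

The rendered graph is the fastest way to see which Tez nodes the planner fused together, which matters for the .forceToDisk balancing trick later in this deck.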
Tips & Tricks: 0000-step-node-sub-graph.dot
This is how Tez names our stuff!
Tips & Tricks: major differences between how a Cascading job gets mapped to MR and to Tez
MR:
– One flow, many (MANY) independent steps
– One or more operators per step
– Step-to-step communications involve disk (HDFS)
– Each step is independent as far as MR is concerned
– Step scheduling managed from outside the cluster, by Cascading
Tez:
– One flow, one DAG; a DAG includes several nodes
– One or more operators per node
– Node-to-node communications managed by Tez: memory, direct network, or disk as necessary
– YARN sees one “Application” per flow
– Node scheduling managed by the Tez DAG AppMaster
Tips & Tricks: yarn-swimlanes.sh
• A tool included in the Tez source distribution, in tez-tools/swimlanes (bash + python)
• Requires YARN ATS to work: “yarn logs -applicationId application_1345431315_1511” must work
• Reports per-container occupation in a Gantt chart
Tips & Tricks: yarn-swimlanes.sh (2)
application_1435150225179_0474.svg
Tips & Tricks: yarn-swimlanes.sh (3)
[Gantt chart: containers vs. time]
Tips & Tricks: consider using .forceToDisk to ensure work is balanced within the DAG
890 seconds → 160 seconds
Tips & Tricks: consider using .forceToDisk to ensure work is balanced within the DAG
• .forceToDisk really means “don’t merge those two Tez nodes”, which implies “manage appropriate data transmission between these two nodes”
• TextFile & other FixedPathSource friends don’t seem to automatically spread out work as well as they used to (huh?)
• YMMV, WIP.
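A sketch of where the barrier goes in a Scalding job (assumes Scalding on the classpath; the job, paths, and pipeline are our illustration, not the wcplus code). `.forceToDisk` is the real TypedPipe operator; placed between a cheap read and an expensive stage, it stops the planner from fusing them into one Tez node, so the second stage can fan out over more containers:

```scala
import com.twitter.scalding._

class BalancedJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(_.split("""\s+""").toSeq)
    .forceToDisk                 // barrier: keep these two Tez nodes separate
    .map(w => (w, 1L))
    .sumByKey
    .write(TypedTsv[(String, Long)](args("output")))
}
```

Whether the barrier helps depends on the plan; checking the planner's .dot output before and after (as shown earlier in this deck) is the way to verify it, which is how a run like the one above can go from minutes-long to a fraction of that.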
PERFORMANCE
Performance: MR vs Tez
Performance: MR vs Tez; to scale
Performance: MR vs Tez; TO SCALE!!!
MR run time: 14:22 (wall), 12:49 (cluster time), 5:43:26 (total CPU)
Tez run time: 4:03 (wall), 2:50 (cluster time), 1:25:35 (total CPU)
CONCLUSION
Conclusion
Apache Tez enables very significant performance gains compared to traditional MapReduce applications, on the same cluster and alongside the legacy workloads.
The new Tez back-end built by Concurrent enables these exciting performance gains for existing Cascading and Scalding applications.
Taking advantage of these gains should become as easy as upgrading and setting a few configuration switches in the next few months.
Next Steps…
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
More about Concurrent & Hortonworks: http://hortonworks.com/partner/concurrent
More about Transparency Rights Management: http://www.transparencyrights.com/
Contact us: [email protected]
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Q&A