Boost Performance with Scala – Learn From Those Who’ve Done It!
Transcript of Boost Performance with Scala – Learn From Those Who’ve Done It!
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Boost Performance with Scala: Learn From Those Who’ve Done It!
We do Hadoop.
Your speakers…
Dhruv Kumar, Partner Solutions Engineer, Hortonworks
Cyrille Chépélov, R&D Director, Transparency Rights Management
Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP
Customer Momentum
• 437+ customers (as of March 31, 2015)
Hortonworks Data Platform
• Completely open multi-tenant platform for any app & any data
• A centralized architecture of consistent enterprise services for resource management, security, operations, and governance
Partner for Customer Success
• Open source community leadership focused on enterprise needs
• Unrivaled world-class support
• Founded in 2011
• Original 24 architects, developers, and operators of Hadoop from Yahoo!
• 600+ employees
• 1,000+ ecosystem partners
Traditional systems under pressure
Challenges
• Constrains data to app
• Can’t manage new data
• Costly to scale
[Chart: business value of new data sources (clickstream, geolocation, web data, Internet of Things, docs & emails, server logs) versus traditional sources (ERP, CRM, SCM); industry leaders capture the new data while laggards do not. Worldwide data grew from 2.8 zettabytes in 2012 toward a projected 40 zettabytes in 2020.]
Hadoop emerged as foundation of new data architecture
Apache Hadoop is an open source data platform for managing large volumes of high-velocity, high-variety data
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to Apache Software Foundation in 2005 with rapid adoption by large web properties & early adopter enterprises
• Incredibly disruptive to current platform economics
Traditional Hadoop Advantages
• Manages the new data paradigm
• Handles data at scale
• Cost effective
• Open source
Traditional Hadoop Had Limitations
• Batch-only architecture
• Single-purpose clusters, specific data sets
• Difficult to integrate with existing investments
• Not enterprise-grade
[Diagram: Application → Batch Processing (MapReduce) → Storage (HDFS)]
Modern Data Architecture emerges to unify data & processing
Modern Data Architecture
• Enables applications to access all your enterprise data through an efficient centralized platform
• Supported by a centralized approach to governance, security, and operations
• Versatile enough to handle any application and dataset, no matter the size or type
[Diagram: sources (clickstream, web & social, geolocation, sensor & machine, server logs, unstructured) and existing systems (ERP, CRM, SCM) feed HDFS (Hadoop Distributed File System) under YARN, the data operating system, which runs batch (MapReduce), interactive (Tez), real-time, partner ISV, and MPP/EDW workloads; analytics consumers include data marts, business analytics applications, and visualization & dashboards.]
Hortonworks & Concurrent
Hortonworks and Concurrent Advance Enterprise Data Application Development on Hadoop
HDP integrates and delivers the Cascading SDK
• Collection of tools, documentation, libraries, tutorials, and example projects
• Simplifies SQL integration and enables Scala development for Hadoop
• Hortonworks provides level 1 & 2 support for the Cascading SDK
Cascading is the proven application development platform for building data applications on Hadoop
Hortonworks & Concurrent: Partnership Benefits
• SDK empowers developers to quickly build rich data-centric enterprise applications on Hadoop
• Leverage existing Java or Scala based skill sets to develop complex applications
• Combines the robustness and simplicity of Cascading with the reliability and stability of HDP
• Frameworks built on Cascading, such as Scalding, can easily take advantage of YARN and Tez
Cascading SDK: Overview
• The most widely used application development framework for building Big Data applications
• Enables improved Developer Productivity for enterprises using HDP
HDP Integration of Cascading SDK
• An SDK that enables the rapid development of batch and interactive data-driven applications
• Integration with the data processing layer allows Cascading to take advantage of advances in interactive applications
[Diagram: YARN provides efficient cluster resource management & shared services; on top of it run interactive data processing (Tez) and batch data processing (MapReduce); above those sit Cascading (Java), Scalding (Scala), Lingual (SQL), and Pattern (ML), serving the presentation & application layer. This enables both existing and new applications to provide value to the organization.]
Your Trusted Third Party in the Digital Age™
Scalding on Tez
Copyright © 2015 Transparency Rights Management. All rights reserved.
HOW DID WE CHOOSE SCALDING?
Why Scalding? Transparency Rights Management:
• A Trusted Third Party
– Data escrow, controlled execution
– Independent re-computation
– Privacy & personal data compliance assessment
• Big Data Services for Entertainment
– Metadata enrichment
– IP use certification
– Dataset analysis as a service
Why Scalding? “Big Data Services for Entertainment”: a use case
Digital Service Provider Report → Copyright Owners / Collective Management Organizations
Why Scalding? “Big Data Services for Entertainment”: a use case
Digital Service Provider Report → Copyright Owners / Collective Management Organizations
• Data improvement: automatic data feed (“in your format”)
• Independent report / conformance report
Why Scalding? Dataset analysis (from YouTube monthly reports)
• September 2013: SQL Server overheats
• October 2013: using Lingual (12 SQL steps + bash scripts)
• September 2014: Cascading + Java
• September 28th: tried out Scalding
• November 2014: delivered first results on Scalding
• April 2015: first success on Scalding+Tez
Anatomy of a Scalding app
[Stack, top to bottom: Your App (in Scala), by you; scalding, by @TwitterOSS; cascading, by Concurrent, Inc.; Hadoop + Tez platform libraries, by Apache]
SCALDING ON TEZ, THE MINI-HOWTO
Scalding on Tez, the mini-HOWTO
• Step 0: Prerequisites
– A YARN cluster
– Cascading 3.0
– Tez runtime lib (0.6.2-SNAPSHOT) in HDFS
– A version of Scalding with fabric selection (0.13.1 + PR1220)
Scalding on Tez, the mini-HOWTO
• Step 1: build.sbt
https://github.com/cchepelov/wcplus/blob/master/build.sbt
Scalding on Tez, the mini-HOWTO
• Step 1: build.sbt (redux)
1. Regain control over which libraries are included
2. Exclude some “long transitive” dependencies that pull in junk
3. Put in the desired fabric, in a configurable way:
   sbt -DCASCADING_FABRIC=hadoop clean assembly
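The three points above can be sketched as an sbt fragment. This is a hypothetical reconstruction, not the actual file (which is at the wcplus URL on the previous slide); the property name `CASCADING_FABRIC`, the version strings, and the excluded module are illustrative:

```scala
// build.sbt sketch: pick the Cascading fabric from a JVM property so one build
// can target either MapReduce or Tez, e.g.
//   sbt -DCASCADING_FABRIC=hadoop2-tez clean assembly
val fabric = sys.props.getOrElse("CASCADING_FABRIC", "hadoop2-mr1")

libraryDependencies ++= Seq(
  "com.twitter" %% "scalding-core"       % "0.13.1",   // illustrative versions
  "cascading"   %  s"cascading-$fabric"  % "3.0.0"
)

// Point 2: drop a "long transitive" dependency that drags in conflicting jars
// (module chosen as a placeholder example)
excludeDependencies += "org.slf4j" % "slf4j-log4j12"
```

Only one fabric's jars end up on the classpath, which matters because, as discussed later, the Cascading fabric jars cannot be loaded side by side.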
Scalding on Tez, the mini-HOWTO
• Step 1bis: assembly.sbt
We’re using fat jars to simplify deployment. Because of jar hell, we “need” a complicated assembly.sbt:
https://github.com/cchepelov/wcplus/blob/master/assembly.sbt
Scalding on Tez, the mini-HOWTO
• Step 2: a few job flags
https://github.com/cchepelov/wcplus/blob/master/src/main/scala/com/transparencyrights/demo/wcplus/CommonJob.scala
• Step 2: a few job flags (continued)
• tez.task.resource.memory.mb
– As large as you can afford to give, per CPU per node
– The more memory, the less Tez needs to spill intermediates to disk
• tez.container.max.java.heap.fraction
– The defaults (1024 MiB * 0.8) assume the JVM’s native memory requirements don’t exceed 208 MiB
– Scalding + the Scala runtime + Cascading on top of Tez seem to require more. YARN kills offenders swiftly!
– The 460 MiB figure we’re using, (1024+512)*(1-0.7), may be a bit wasteful
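The arithmetic behind these two flags can be made explicit. This is our own sketch of the relationship (the object and method names are ours, not a Tez API); integer truncation makes the headroom come out a few MiB under the slide's round figures:

```scala
// container = tez.task.resource.memory.mb
// fraction  = tez.container.max.java.heap.fraction
object TezMemoryMath {
  /** JVM heap Tez requests inside the container, in MiB. */
  def heapMiB(containerMiB: Int, heapFraction: Double): Int =
    (containerMiB * heapFraction).toInt

  /** What remains for native memory (threads, NIO buffers, metaspace...).
    * If the JVM's native use exceeds this, YARN kills the container. */
  def nativeHeadroomMiB(containerMiB: Int, heapFraction: Double): Int =
    containerMiB - heapMiB(containerMiB, heapFraction)
}
```

With the defaults, `nativeHeadroomMiB(1024, 0.8)` leaves roughly 205 MiB of native headroom; the (1024+512) container at fraction 0.7 leaves roughly 461 MiB, matching the "460 MiB, may be a bit wasteful" remark.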
THAT’S IT.
(ALMOST)
IN PRACTICE…
“A VERSION OF SCALDING WITH FABRIC SELECTION”
WAIT, WHAT?
In practice: “A version of Scalding with fabric selection”? Wait, what?
Scalding’s traditional --local and --hdfs flags:
– Use either LocalFlowConnector or HadoopFlowConnector
– The types are hard-coded
Cascading 2.5 introduced a new fabric concept. You can run either with cascading-hadoop or with cascading-hadoop2-mr1. But:
– Incompatible jars (can’t load both)
– The main types visible to Scalding are different
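To make "fabric selection" concrete, here is a small illustration of the idea: map a command-line flag to the FlowConnector class to instantiate. The class names are Cascading's real ones, but this mapping function is our sketch, not Scalding's actual code:

```scala
// Sketch: resolve a fabric flag to the Cascading FlowConnector class name.
// With hard-coded types (pre-PR1220), only the first two cases could exist.
object FabricSelection {
  def connectorClassName(flag: String): Option[String] = flag match {
    case "--local"              => Some("cascading.flow.local.LocalFlowConnector")
    case "--hadoop1" | "--hdfs" => Some("cascading.flow.hadoop.HadoopFlowConnector")
    case "--hadoop2-mr1"        => Some("cascading.flow.hadoop2.Hadoop2MR1FlowConnector")
    case "--tez"                => Some("cascading.flow.tez.Hadoop2TezFlowConnector")
    case _                      => None
  }
}
```

The point of the PR is that the connector is looked up by name at runtime, so any fabric works as long as its jar is on the classpath, and only that fabric's jar needs to be there.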
In practice: “A version of Scalding with fabric selection”? Wait, what?
PR1220:
– No longer hardcodes “either Local or Hadoop 1.x”
– Enables supplying any FlowConnector implementation, as long as the jar’s around
– --hdfs to be deprecated as an alias to --hadoop1
– Still built against Cascading 2.6
“STILL BUILT ON CASCADING 2.6”
WHY?
In practice: “Still built on Cascading 2.6”
Cascading 3.0 has carefully updated some argument types to prepare for the future. In Java this is source- and binary-compatible. Scala, however, enforces generic type safety: the Cascading 3.0 upgrades are not legal for scalac, even though they still are for the JVM…
[Diagram: the same library consumer, compiled against Library V1, keeps working against Library V2 (in Java)]
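Why does the JVM accept what scalac rejects? Because the JVM erases generics, widening a generic parameter does not change the erased method signature, so already-compiled callers keep linking. A small demonstration (the method and types are ours, standing in for a Cascading signature change):

```scala
// Imagine a library method whose parameter was widened between versions,
// e.g. from java.util.List[String] to java.util.List[_ <: CharSequence].
// After erasure, both versions have the same JVM signature: (Ljava/util/List;)I
object ErasureDemo {
  def sizeOf(xs: java.util.List[_ <: CharSequence]): Int = xs.size
}
```

Reflection confirms the erased parameter type is plain `java.util.List`; that is why old bytecode keeps working against the new jar, while scalac, which re-checks the full generic types at compile time, refuses the new source signatures.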
In practice… going to native Cascading 3.0?
Scalding will require some adjustment to become compatible with the Java-level source upgrades. Can this happen without breaking Scalding application source code?
GUAVA
In practice… Guava
• Guava is a nice library… of little use in Scala (?)
• In a Scalding/Cascading/Tez JVM, multiple versions of Guava are required; each layer depends on its own version, about every single version from 11.0 to 16.0.2
• There have been breaking changes (method renames & removals) in Guava 13
• These happen on really mundane objects (Closeable, Stopwatch), but they’re major troublemakers
In practice… Guava
• Discussions and actions are in progress to remove the pain
• In the meantime, we use a patched version (“frankenguava”) that provides both the older and the newer interfaces, keeping all consumers across the stack happy
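A related, commonly used mitigation (hedged: this is not necessarily what "frankenguava" does, just a standard alternative) is to shade your own application's Guava with sbt-assembly, so your copy cannot clash with whatever versions Tez and Cascading bring along. The renamed package prefix is a placeholder:

```scala
// assembly.sbt sketch (sbt-assembly plugin): rewrite our own references to Guava
// into a private namespace, leaving the platform's Guava jars untouched.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "myapp.shaded.guava.@1")
    .inProject // rewrite only this project's classes, not the stack's jars
)
```

Shading only solves clashes between your code and the stack; it does not help when two platform layers disagree with each other, which is the case the frankenguava patch addresses.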
CASCADING’S TEZ*REGISTRY
In practice… Cascading’s Tez*Registry
• Cascading 3.0 uses a set of mapping registries to convert Cascading patterns into the back-end API. The Tez registries are new, and distinct from the MR registries.
• The registries are hardened against Concurrent’s extensive test library, which is built on years of MR experience; Tez has its own trouble spots. Beware of hash joins.
• It works fine now, but getting the Scalding test library on board will go a long way.
In practice… Cascading’s Tez*Registry
• It works mostly fine now, but getting the Scalding test library on board will go a long way.
Last-minute update: .filterWithValue / .mapWithValue currently crash the Cascading planner (as of 3.0.1); the implementation uses a HashJoin.
AN EXAMPLE
A small test:
A small test: “wc plus”
Input: 70 books, 1.1M lines, 10M words, 56M bytes
ComputeFrequencies produces, for each n-gram size:
• Word: relative frequency, deviation from median relative frequency
• Two words: relative frequency, deviation from median relative frequency
• …
• Ten words: relative frequency, deviation from median relative frequency
• All expressions (1-W to 10-W): relative frequency, deviation from median relative frequency
Ignoring things that are more frequent than 80% of the max word frequency.
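The stages described above can be modeled in plain Scala (the real job uses Scalding's TypedPipe on a cluster; these names and this in-memory form are ours): tokenize into n-grams, count, convert counts to relative frequencies, and drop anything above 80% of the maximum frequency.

```scala
// In-memory model of the "wc plus" pipeline stages.
object WcPlusModel {
  /** n-grams as space-joined strings, e.g. n=2 gives "two words". */
  def ngrams(words: Seq[String], n: Int): Seq[String] =
    words.sliding(n).map(_.mkString(" ")).toList

  /** token -> count / total tokens */
  def relativeFreq(tokens: Seq[String]): Map[String, Double] = {
    val total = tokens.size.toDouble
    tokens.groupBy(identity).map { case (t, ts) => t -> ts.size / total }
  }

  /** "Ignoring things that are more frequent than 80% of the max frequency". */
  def dropTooFrequent(freqs: Map[String, Double]): Map[String, Double] = {
    val cutoff = 0.8 * freqs.values.max
    freqs.filter { case (_, f) => f <= cutoff }
  }
}
```

On the real corpus each of these stages becomes a counting step in the flow, and it is the fan-out from one input to the 1-word through 10-word counts that gives Tez's DAG scheduling its advantage over step-by-step MapReduce.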
A small test: “wc plus” (same pipeline, annotated)
• Each n-gram stage is a count step
• No .filterWithValue / .mapWithValue for now
(Illustration: Roulex45 / Wikipedia)
A small test: “wc plus”
https://github.com/cchepelov/wcplus
TIPS & TRICKS
Tips & Tricks: 0000-step-node-sub-graph.dot
Run your job with -Dcascading.planner.plan.path=/tmp/path/to/plan.lst
The planner will output a lot of useful files. One of them is …/$(Job)/4-final-flow-steps/0000-step-node-sub-graph.dot
Run that file through Graphviz:
dot -O -Tpdf 0000-step-node-sub-graph.dot
or, if the PDF is illegible, Firefox is great at zooming into SVG files:
dot -O -Tsvg 0000-step-node-sub-graph.dot
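Put together, the steps above look roughly like this command sequence (a sketch: the jar name, job class, and output path are example placeholders, and the plan directory layout follows the slide):

```shell
# 1. Run the job with the planner told to dump its plan files
hadoop jar wcplus-assembly.jar com.example.WcPlusJob --tez \
  -Dcascading.planner.plan.path=/tmp/path/to/plan.lst

# 2. Render the final step graph; -O writes the output next to the input file
dot -O -Tsvg /tmp/path/to/plan.lst/WcPlusJob/4-final-flow-steps/0000-step-node-sub-graph.dot
```

The rendered graph is the fastest way to see which Tez nodes the planner fused together, which matters for the .forceToDisk balancing trick later in this deck.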
Tips & Tricks: 0000-step-node-sub-graph.dot
This is how Tez names our stuff!
Tips & Tricks: major differences between how a Cascading job gets mapped to MR and to Tez
MR:
– One flow, many (MANY) independent steps
– One or more operators per step
– Step-to-step communications involve disk (HDFS)
– Each step is independent as far as MR is concerned
– Step scheduling managed from outside the cluster, by Cascading
Tez:
– One flow, one DAG; a DAG includes several nodes
– One or more operators per node
– Node-to-node communications managed by Tez: memory, direct network, or disk as necessary
– YARN sees one “Application” per flow
– Node scheduling managed by the Tez DAG AppMaster
Tips & Tricks: yarn-swimlanes.sh
• A tool included in the Tez source distribution, in tez-tools/swimlanes (bash + python)
• Requires YARN ATS to work: “yarn logs -applicationId application_1345431315_1511” must work
• Reports per-container occupation in a Gantt chart
Tips & Tricks: yarn-swimlanes.sh (2)
application_1435150225179_0474.svg
Tips & Tricks: yarn-swimlanes.sh (3)
[Gantt chart: containers vs. time]
Tips & Tricks: consider using .forceToDisk to ensure work is balanced within the DAG
890 seconds → 160 seconds
Tips & Tricks: consider using .forceToDisk to ensure work is balanced within the DAG
• .forceToDisk really means “don’t merge those two Tez nodes”, which implies “manage appropriate data transmission between these two nodes”
• TextFile & other FixedPathSource friends don’t seem to automatically spread out work as well as they used to (huh?)
• YMMV, WIP.
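A sketch of where the barrier goes in a Scalding job (assumes Scalding on the classpath; the job, paths, and pipeline are our illustration, not the wcplus code). `.forceToDisk` is the real TypedPipe operator; placed between a cheap read and an expensive stage, it stops the planner from fusing them into one Tez node, so the second stage can fan out over more containers:

```scala
import com.twitter.scalding._

class BalancedJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(_.split("""\s+""").toSeq)
    .forceToDisk                 // barrier: keep these two Tez nodes separate
    .map(w => (w, 1L))
    .sumByKey
    .write(TypedTsv[(String, Long)](args("output")))
}
```

Whether the barrier helps depends on the plan; checking the planner's .dot output before and after (as shown earlier in this deck) is the way to verify it, which is how a run like the one above can go from minutes-long to a fraction of that.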
PERFORMANCE
Performance: MR vs Tez
Performance: MR vs Tez; to scale
Performance: MR vs Tez; TO SCALE!!!
MR run time: 14:22 (wall), 12:49 (cluster time), 5:43:26 (total CPU)
Tez run time: 4:03 (wall), 2:50 (cluster time), 1:25:35 (total CPU)
CONCLUSION
Conclusion
Apache Tez enables very significant performance gains compared to traditional MapReduce applications, on the same cluster and alongside the legacy workloads.
The new Tez back-end built by Concurrent enables these exciting performance gains for existing Cascading and Scalding applications.
Taking advantage of these gains should become as easy as upgrading and setting a few configuration switches in the next few months.
Next Steps…
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
More about Concurrent & Hortonworks: http://hortonworks.com/partner/concurrent
More about Transparency Rights Management: http://www.transparencyrights.com/
Contact us: [email protected]
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Q&A