Post on 05-Dec-2014
Page 1 © Hortonworks Inc. 2014
Scalding YARN Webinar Series
September 18, 2014
Ajay Singh, Director - Hortonworks
Jonathan Coveney, Senior Software Engineer - Twitter
Agenda
Introduction: Ajay Singh, Hortonworks
Modern Data Architecture and how Cascading and Scalding fit in
Scalding: Jonathan Coveney, Twitter
Why Scalding?
Core Concepts and Limitations
Scalding at Twitter
Resources
Speakers
Ajay Singh is Hortonworks Director of Technical Channels and leads the strategic alliances with partners from a technology standpoint such as driving alignment on roadmaps, product certifications and demos. Ajay is dedicated to building, scaling and delivering exceptional go-to-market solutions with partners.
Jonathan Coveney currently works at Twitter, where he has spent a lot of time maintaining and updating Scalding; in the past, he has worked extensively on Apache Pig. He is deeply interested in functional programming, as well as developing usable, scalable APIs for data processing at scale.
A Modern Data Architecture
[Diagram: a modern data architecture. Sources: existing (CRM, ERP, clickstream, logs) and emerging (sensor, sentiment, geo, unstructured). Data system: repositories (RDBMS, EDW, MPP) alongside Enterprise Hadoop (governance & integration, data access, data management, security, operations). Applications: business analytics, custom applications, packaged applications. Supporting tools: operational tools (manage & monitor), dev & data tools (build & test).]
HDP 2.1: Enterprise Hadoop
[Diagram: HDP 2.1, Hortonworks Data Platform.
Governance & integration: data workflow, lifecycle & governance (Falcon, Sqoop, Flume, NFS, WebHDFS).
Data access: batch (MapReduce), script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), stream (Storm), search (Solr), others (in-memory analytics, ISV engines), and Cascading.
Data management: YARN (data operating system) on HDFS (Hadoop Distributed File System).
Security: authentication, authorization, accounting, data protection, across storage (HDFS), resources (YARN), access (Hive, …), pipeline (Falcon), cluster (Knox).
Operations: provision, manage & monitor (Ambari, Zookeeper); scheduling (Oozie).
Deployment choice: Linux, Windows, on-premise, cloud.]
Cascading SDK
HDP integrates and delivers the Cascading SDK
• Collection of tools, documentation, libraries, tutorials and example projects
• Key benefits
  • Simplified development
  • Multi-language support
  • Reuse existing skills and tools
  • Native YARN integration
Hortonworks delivers enterprise support
• Backed by Concurrent
Hortonworks and Concurrent Advance Enterprise Data Application Development on Hadoop
HDP Integration of Cascading SDK
• Write once and deploy on your fabric of choice
• Integration with data processing layer allows Cascading to take advantage of advances in interactive applications
• Sep 17th - Cascading 3.0 WIP Now Supports Apache Tez: http://www.cascading.org/2014/09/17/cascading-3-0-wip-now-supports-apache-tez/
• Enable both existing and new applications to provide value to the organization
[Diagram: presentation & application layer (Java: Cascading, Scala: Scalding, SQL: Lingual, ML: Pattern) running on batch data processing (MapReduce, current) and interactive data processing (Tez, WIP), over efficient cluster resource management & shared services (YARN).]
Cascading.org Scalding Resources
Scalding Resources on Cascading.org
• Videos and Tutorials
• Mailing List
• Newsletter
Cascading 3.0 WIP With Tez Support
• https://github.com/cwensel/cascading/tree/wip-3.0/cascading-hadoop2-tez
Scalding Training Debuts This Fall
• In-person, 1-day class with labs
• Email: info@cascading.io
Jonathan Coveney Twitter
@jco
Why Scalding?
Writing raw MapReduce is difficult!
● Scalding is
  o Less verbose
  o Less error prone (type checking!)
  o Easier to evolve
  o Performant enough
But what about Hive and Pig?
● Really good for certain things
  o Excellent for quick, ad-hoc work
  o Easy to understand
  o Can leverage existing knowledge (e.g. SQL)
● Not always the best for maintainability
  o Composition isn’t great
  o Testing is difficult
  o Type safety is lacking
So… Cascading?
● Still pretty verbose!
● But you can use normal Java tools
  o Maven
  o JUnit
  o IDEs
● Handles the low-level details for you
● A good target for higher-level languages
Scalding
● Concise, expressive syntax
● Testable
● Abstractable
● Composable
Because it’s in a full-featured, functional language!
But Scala is scary!
● Scalding doesn’t force you to use more complicated features
● Can just write less-verbose Java if desired
● Functional programming is an important paradigm, especially for big data
Learning new things is good for your brain :)
Example Scalding job
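import com.twitter.scalding._

class Webinar(args: Args) extends Job(args) {
  import TDsl._

  TextLine(args("input"))
    .flatMap { _.split("\\s+") }  // split each line into words
    .map { w => (w, 1L) }         // pair every word with a count of 1
    .group                        // group by word
    .sum                          // add up the counts per word
    .write(TypedTsv[(String, Long)](args("output")))
}

“Hadoop is a system for counting words” -Oscar Boykin, @posco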
Core concepts
● Source
  o How to read or write data
● TypedPipe[T]
  o A distributed list of T
  o Kind of like a Seq[T] in Scala’s collections library
● Grouped[K, T]
  o A grouping on K
  o Represents transition to reduce phase
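Since a TypedPipe[T] is described as being kind of like a Seq[T], the word count pipeline can be sketched on plain Scala collections. This is only an in-memory analogy, not Scalding code: `LocalWordCount` and `wordCount` are hypothetical names, `groupBy` stands in for `.group`, and summing each group's values stands in for `.sum`.

```scala
// In-memory analogy only: a Seq plays the role of TypedPipe,
// and the Map produced by groupBy plays the role of Grouped.
object LocalWordCount {
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(line => line.split("\\s+").toSeq) // one line -> zero or more words
      .filter(_.nonEmpty)                        // drop empty tokens (e.g. from blank lines)
      .map(w => (w, 1L))                         // one word -> one (word, 1) pair
      .groupBy(_._1)                             // like .group: collect the pairs per word
      .map { case (w, pairs) => (w, pairs.map(_._2).sum) } // like .sum: add counts per word
}
```

On a cluster, the grouping step is where data would be shuffled to reducers; in this local version it is just a hash map.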
Word Co-Occurrence
TextLine(args("input"))
  .flatMap { line =>
    val words = line.split("\\s+")
    // emit (w1, Map(w2 -> 1)) for every ordered pair of distinct words on the line
    for (w1 <- words; w2 <- words if (w1 != w2)) yield (w1, Map(w2 -> 1L))
  }
  .group[String, Map[String, Long]]
  .sum  // merges the maps for each word, adding counts key by key
  .flatMap { case (word, wordMap) =>
    wordMap.map { case (otherWord, count) => (word, otherWord, count) }
  }
  .write(TypedTsv[(String, String, Long)](args("output")))
Important concepts
Scalding leverages a lot of Scala idioms, as well as concepts from functional programming
● map
  o a 1 to 1 mapping for every piece of data
● flatMap
  o a 1 to 0 or more mapping for every piece of data
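The difference between the two can be seen on an ordinary Scala List, which is assumed here as a stand-in for a distributed collection. `MapVsFlatMap` and its fields are illustrative names only.

```scala
object MapVsFlatMap {
  val lines: List[String] = List("hello world", "", "scalding")

  // map: exactly one output element per input element
  val lengths: List[Int] = lines.map(_.length)

  // flatMap: each input contributes zero or more output elements;
  // the empty line contributes nothing once empty tokens are filtered out
  val words: List[String] =
    lines.flatMap(line => line.split("\\s+").toList.filter(_.nonEmpty))
}
```

Here `lengths` has three elements (one per line), while `words` has three elements drawn from only two of the lines.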
Important concepts (continued)
● Typeclasses
  o The separation of computation from data types
  o Think Java’s Comparator (but way more powerful)
  o These are what power .sum
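A minimal sketch of the idea, assuming a hand-rolled `Semigroup` trait rather than the real one from Algebird (which is richer than this). The names `Semigroup`, `plus`, `SumDemo` and `sumAll` are illustrative. The point is that one generic summing function works for both Long and Map[String, Long], because the combining rule is supplied per type and found by the compiler.

```scala
// Hypothetical, simplified typeclass: how to combine two values of type T.
trait Semigroup[T] {
  def plus(a: T, b: T): T
}

object Semigroup {
  // Longs combine by addition.
  implicit val longSemigroup: Semigroup[Long] = new Semigroup[Long] {
    def plus(a: Long, b: Long): Long = a + b
  }

  // Maps combine key by key, using the Semigroup of their value type.
  implicit def mapSemigroup[K, V](implicit sv: Semigroup[V]): Semigroup[Map[K, V]] =
    new Semigroup[Map[K, V]] {
      def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
        b.foldLeft(a) { case (acc, (k, v)) =>
          acc.updated(k, acc.get(k).map(existing => sv.plus(existing, v)).getOrElse(v))
        }
    }
}

object SumDemo {
  // One generic sum: the element type selects the combining rule.
  def sumAll[T](items: Seq[T])(implicit sg: Semigroup[T]): Option[T] =
    items.reduceOption((a, b) => sg.plus(a, b))
}
```

Summing maps this way is the same kind of operation the word co-occurrence job relies on when it calls .sum on values of type Map[String, Long].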
Limitations
Scalding’s limitations are MapReduce’s limitations
● Bad at iterative jobs
● Lots of checkpointing, serialization, sorting
However...
● Cascading on Tez could help!
  o in progress as part of Cascading 3.0
● So could Cascading on Spark!
The cutting edge
● REPL support
● Executor[T]
  o Decoupling TypedPipes from specifics of the execution engine
  o Makes iterative algorithms much easier to express
● Macros
  o Allowing easier use of case classes
  o Closure analysis?
Scalding at Twitter
● Thousands of users
  o Engineers AND data scientists
● Many thousands of jobs every day
  o ETL
  o Recommendations
  o Email
  o Time series analysis
When you use Twitter, you’re using features powered by Scalding!
Useful practices
● A standardized “Job” subclass with company-specific information
  o Want the common case to be as simple as possible
  o Especially should configure serialization for users
● Separate data from functions on data
  o At Twitter, this means Thrift for data, and various Scala functions operating on that data
  o Decouples the specification of some data from the derived data people want based on it
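As a rough illustration of separating data from functions on data, the sketch below uses a case class as a stand-in for a Thrift-generated record. `Tweet`, its fields, and `TweetFunctions` are hypothetical and are not taken from Twitter's code.

```scala
// Data: a plain record with no behaviour (stand-in for a Thrift struct; fields are hypothetical).
final case class Tweet(userId: Long, text: String)

// Functions on the data live apart from the record definition,
// so new derived values can be added without touching the schema.
object TweetFunctions {
  def wordCount(t: Tweet): Int = t.text.split("\\s+").count(_.nonEmpty)

  def isReply(t: Tweet): Boolean = t.text.startsWith("@")
}
```

A job would then read records of the data type and pass functions like these to map or filter.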
Q&A
Contribute!
● Scalding
● Algebird
  o Math inspired aggregators (.sum uses it)
● Bijection
  o Conversion and serialization made fun
● Summingbird
  o Abstraction for batch and online map/reduce (see resources for more)
More resources
Scalding/Algebird
• Oscar Boykin: Algebra for Scalable Analytics
• Avi Bryant: Add ALL the Things
• Oscar Boykin, Argyris Zymnis: Scalding: Powerful & Concise MapReduce
You might also be interested in…
• Summingbird! Streaming real-time and batch analytics, unified and made beautiful
• Oscar Boykin: Introduction to Summingbird
• Oscar Boykin, Sam Ritchie, Ian O’Connell, Jimmy Lin: Summingbird, A Framework for Integrating Batch and Online MapReduce Computations
Next Webinar – Oct 2 - Spark
Writing applications to Hadoop and YARN using Spark
• October 2nd at 9am Pacific Time
• Register
Find all webinars
• Hortonworks.com/webinars
Find past recorded webinars
• Hortonworks.com/webinars/#library
Thank you!