Using Scalding for Data Driven Product Development at LinkedIn
by mario-pastorelli

Transcript of Scalding
What is Scalding
Scalding is a Scala library, built on top of Cascading, that makes it easy to define MapReduce programs.
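As a taste of what "easy" means here, a sketch of a complete word count in Scalding's typed API (class and argument names are ours; the fully annotated version appears later in this deck):

import com.twitter.scalding._

// A complete word count in the typed API.
class WordCountSketch(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))           // read lines
    .flatMap(_.toLowerCase.split("\\s+"))           // one record per word
    .groupBy(identity)                              // shard by word
    .size                                           // count each group
    .toTypedPipe
    .write(TypedTsv[(String, Long)](args("output")))
}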
2/21
Summary
Hadoop MapReduce Programming Model
Cascading
Scalding
3/21
Map and Reduce
At a high level, a MapReduce job is described by two functions operating over lists of key/value pairs.

▶ Map: a function from an input key/value pair to a list of intermediate key/value pairs

  map : (key_input, value_input) → list((key_map, value_map))

▶ Reduce: a function from an intermediate key and its list of values to a list of output key/value pairs

  reduce : (key_map, list(value_map)) → list((key_reduce, value_reduce))
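To make the signatures concrete, here is a sketch of word count expressed as exactly these two functions, in plain Scala with no Hadoop involved (function and variable names are ours):

object MapReduceSketch {
  // map : (key_input, value_input) -> list((key_map, value_map))
  // Here the input key is a byte offset and the value is a line of text.
  def map(offset: Long, line: String): List[(String, Int)] =
    line.toLowerCase.split("\\s+").toList.map(word => (word, 1))

  // reduce : (key_map, list(value_map)) -> list((key_reduce, value_reduce))
  // All values for one intermediate key arrive together and are summed.
  def reduce(word: String, counts: List[Int]): List[(String, Int)] =
    List((word, counts.sum))
}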
5/21
Hadoop Programming Model
The Hadoop MapReduce programming model lets the programmer control all of the job's workflow components. The components are divided into two phases:
▶ The Map phase:

  [Diagram: DataSource → InputReader → Mapper → Combiner → Partitioner → Sorter. Input pairs (K_i, V_i) become intermediate pairs (K_m, V_m); the combiner partially aggregates values sharing a key, e.g. combine(V_m1, V_m5) = V_m6; the partitioner assigns pairs to partitions P1 and P2, which are then sorted.]

▶ The Reduce phase:

  [Diagram: Shuffle → Sorter → Grouper → Reducer → OutputWriter → DataDest. Intermediate pairs are shuffled to the reducers, sorted, grouped by key into groups G1 and G2, and reduced to output pairs (K_r, V_r).]
6/21
Example: Word Count 1/2
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final Text word = new Text();
  private final IntWritable one = new IntWritable(1);

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one); // emit (word, 1) for every token
    }
  }
}

class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values)
      sum += val.get();
    context.write(key, new IntWritable(sum)); // emit (word, total count)
  }
}
7/21
Example: Word Count 2/2
public class WordCount {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setMapperClass(TokenizerMapper.class);

    // must be set explicitly: Hadoop cannot infer that the reducer
    // is safe to reuse as a combiner
    job.setCombinerClass(IntSumReducer.class);

    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
▶ Sending the integer 1 for each instance of a word is very inefficient (1 TB of input produces more than 1 TB of intermediate data)
▶ Hadoop doesn't know whether it can use the reducer as a combiner; it must be set manually
8/21
Hadoop weaknesses
▶ The reducer cannot always be used as a combiner; Hadoop relies on an explicit combiner specification or on manual partial aggregation inside the mapper instance's life cycle (the in-mapper combiner; a sketch follows this list)
▶ Combiners are limited to associative and commutative functions (like sum); partial aggregation is more general and powerful
▶ The programming model is limited to the map/reduce phases, so multi-job programs are often difficult and counter-intuitive (think of iterative algorithms like PageRank)
▶ Joins can be difficult; many techniques must be implemented from scratch
▶ More generally, MapReduce is indeed simple, but many optimizations feel more like hacks than natural parts of the model
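The first point deserves a sketch. Below is one way to write the in-mapper combiner pattern, here in Scala against the Hadoop API (the class name and buffer are ours; production code would also bound the buffer's memory):

import scala.collection.mutable
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Partial aggregation inside the mapper's life cycle: buffer counts
// across map() calls, emit them once in cleanup().
class InMapperTokenizer extends Mapper[Object, Text, Text, IntWritable] {
  private val counts = mutable.Map.empty[String, Int].withDefaultValue(0)

  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
    for (w <- value.toString.toLowerCase.split("\\s+")) counts(w) += 1

  override def cleanup(context: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
    for ((w, n) <- counts) context.write(new Text(w), new IntWritable(n))
}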
9/21
Summary
Hadoop MapReduce Programming Model
Cascading
Scalding
10/21
Cascading
▶ Open source project developed @Concurrent
▶ A Java application framework on top of Hadoop, designed to be extensible, providing:
  ▷ a Processing API, to develop complex data flows
  ▷ an Integration API, with integration tests supported by the framework, to avoid putting unstable software in production
  ▷ a Scheduling API, to schedule units of work from any third-party application
▶ It replaces the MapReduce programming model with a more generic, data-flow-oriented model
▶ Cascading includes a data flow optimizer that converts user data flows into optimized ones
11/21
Cascading Programming Model
▶ A Cascading program is composed of flows
▶ A flow is composed of a source tap, a sink tap, and pipes that connect them
▶ A pipe holds a particular transformation over its input data flow
▶ Pipes can be combined to create more complex programs
12/21
Example: Word Count
▶ MapReduce word count concept:

  [Diagram: DataSource → TextLine → Map (tokenize text and emit 1 for each token) → Shuffle → Reduce (count values and emit the result) → TextLine → DataDest]

▶ Cascading word count concept:

  [Diagram: TextLine → tokenize each line → group by tokens → count values in every group → TextLine]
13/21
Example: Word Count
public class WordCount {
  public static void main( String[] args ) {
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), args[0] );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), args[1] );

    RegexSplitGenerator splitter = new RegexSplitGenerator(
        new Fields( "token" ),
        "[ \\[\\]\\(\\),.]" );
    Pipe docPipe = new Each( "token", new Fields( "text" ), splitter,
        Fields.RESULTS ); // text -> token

    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps and pipes to create a flow definition
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

    // run the flow on Hadoop
    new HadoopFlowConnector().connect( flowDef ).complete();
  }
}
14/21
Summary
Hadoop MapReduce Programming Model
Cascading
Scalding
15/21
Scalding
▶ Open source project developed @Twitter
▶ Two APIs:
  ▷ Field based
    - Primary API: stable
    - Uses Cascading Fields: dynamic, with errors at runtime
  ▷ Type safe
    - Secondary API: experimental
    - Uses Scala types: static, with errors at compile time
▶ The two APIs can work together, using pipe.typed and TypedPipe.from
▶ This presentation is about the type-safe API ^^
16/21
Why Scalding
▶ The high-level idea of MapReduce comes from LISP: it works with functions (map/reduce) and function composition
▶ Cascading works with objects representing functions and uses constructors to compose pipes:

Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(),
                    Fields.ALL );

▶ Functional programming can naturally describe data flows: every pipe can be seen as a function, and pipes can be combined using functional composition. The code above could then be written as:

docPipe.groupBy( new Fields( "token" ) )
       .every( Fields.ALL, new Count(), Fields.ALL )
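For comparison, the same data flow reads almost identically over plain Scala collections, which is exactly the style the Scalding example on the next slide adopts (a sketch; the sample data is ours):

// Word count over plain Scala collections: the style Scalding adopts
// for distributed pipes. The sample data is illustrative.
val lines = List("a rose is a rose", "to be or not to be")
val wordCounts: Map[String, Int] =
  lines
    .flatMap(_.split("\\s+"))                // tokenize each line
    .groupBy(identity)                       // group by token
    .map { case (w, ws) => (w, ws.size) }    // count each group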
17/21
Example: Word Count
import com.twitter.scalding._

class WordCount(args: Args) extends Job(args) {

  /* TextLine reads each line of the given file */
  val input = TypedPipe.from(TextLine(args("input")))

  /* tokenize every line and flatten the result into a pipe of words */
  val words = input.flatMap { tokenize(_) }

  /* group by word; size counts the elements of each group */
  val wordGroups = words.groupBy { identity(_) }.size

  /* write each (word, count) pair as a line */
  wordGroups.toTypedPipe.write(TypedTsv[(String, Long)](args("output")))

  /* Split a piece of text into individual words */
  def tokenize(text: String): Array[String] = {
    // Lowercase the text and remove punctuation before splitting.
    text.trim.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
      .split("\\s+")
  }
}
18/21
Scalding TypeSafe API
Two main concepts:

▶ TypedPipe[T]: a class whose instances are distributed objects that wrap a Cascading Pipe object and hold the transformations done up to that point. Its interface is similar to Scala's Iterator[T] (map, flatMap, groupBy, filter, . . . )
▶ KeyedList[K,V]: a trait that represents a sharded list of items, with two implementations (see the sketch after this list):
  ▷ Grouped[K,V]: represents a grouping on keys of type K
  ▷ CoGrouped2[K,V,W,Result]: represents a cogroup over two grouped pipes; used for joins
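As a sketch of how these pieces fit together, here is a join expressed through the typed API. The case classes, sample data, and job name are ours, and the exact cogroup class names vary across Scalding versions:

import com.twitter.scalding._

// Illustrative record types; not from the talk.
case class Visit(memberId: Long, page: String)
case class Member(memberId: Long, country: String)

class PagesByCountry(args: Args) extends Job(args) {
  val visits  = TypedPipe.from(List(Visit(1L, "/home"), Visit(2L, "/jobs")))
  val members = TypedPipe.from(List(Member(1L, "CH"), Member(2L, "IT")))

  // groupBy produces a Grouped[K, V]; join cogroups two grouped pipes
  // and yields one (key, (left, right)) pair per match.
  visits.groupBy(_.memberId)
    .join(members.groupBy(_.memberId))
    .toTypedPipe
    .map { case (_, (visit, member)) => (member.country, visit.page) }
    .write(TypedTsv[(String, String)](args("output")))
}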
19/21
Conclusions
▶ The MapReduce API is powerful but limited
▶ The Cascading API is as simple as the MapReduce API, but more generic and powerful
▶ Scalding combines Cascading and Scala to describe distributed programs easily. Its major strengths:
  ▷ Functional programming naturally describes data flows. Scalding resembles the Scala standard library: if you know Scala, you already know how to use Scalding
  ▷ Statically typed (the type-safe API): no type errors at runtime
  ▷ Scala is standard and runs on the JVM
  ▷ Scala libraries and tools can be used in production: IDEs, debuggers, test frameworks, build systems, and everything else
20/21
Thank you for listening
21/21