Using Scalding for Data Driven Product Development at LinkedIn
by mario-pastorelli

Transcript of Scalding
What is Scalding
Scalding is a Scala library, built on top of Cascading, that makes it easy to define MapReduce programs.
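As a taste of what "easy" means here, a sketch of a complete word count in Scalding's typed API (class and argument names are ours; the fully annotated version appears later in this deck):

import com.twitter.scalding._

// A complete word count in the typed API.
class WordCountSketch(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))           // read lines
    .flatMap(_.toLowerCase.split("\\s+"))           // one record per word
    .groupBy(identity)                              // shard by word
    .size                                           // count each group
    .toTypedPipe
    .write(TypedTsv[(String, Long)](args("output")))
}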
2/21
Summary
Hadoop MapReduce Programming Model
Cascading
Scalding
3/21
Map and Reduce
At a high level, a MapReduce job is described by two functions operating over lists of key/value pairs.

▶ Map: a function from an input key/value pair to a list of intermediate key/value pairs

  map : (key_input, value_input) → list((key_map, value_map))

▶ Reduce: a function from an intermediate key and its list of values to a list of output key/value pairs

  reduce : (key_map, list(value_map)) → list((key_reduce, value_reduce))
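To make the signatures concrete, here is a sketch of word count expressed as exactly these two functions, in plain Scala with no Hadoop involved (function and variable names are ours):

object MapReduceSketch {
  // map : (key_input, value_input) -> list((key_map, value_map))
  // Here the input key is a byte offset and the value is a line of text.
  def map(offset: Long, line: String): List[(String, Int)] =
    line.toLowerCase.split("\\s+").toList.map(word => (word, 1))

  // reduce : (key_map, list(value_map)) -> list((key_reduce, value_reduce))
  // All values for one intermediate key arrive together and are summed.
  def reduce(word: String, counts: List[Int]): List[(String, Int)] =
    List((word, counts.sum))
}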
5/21
Hadoop Programming Model
The Hadoop MapReduce programming model lets the programmer control all of the job's workflow components. The components are divided into two phases:
▶ The Map phase:

  [Diagram: DataSource → InputReader → Mapper → Combiner → Partitioner → Sorter. Input pairs (K_i, V_i) become intermediate pairs (K_m, V_m); the combiner partially aggregates values sharing a key, e.g. combine(V_m1, V_m5) = V_m6; the partitioner assigns pairs to partitions P1 and P2, which are then sorted.]

▶ The Reduce phase:

  [Diagram: Shuffle → Sorter → Grouper → Reducer → OutputWriter → DataDest. Intermediate pairs are shuffled to the reducers, sorted, grouped by key into groups G1 and G2, and reduced to output pairs (K_r, V_r).]
6/21
Example: Word Count 1/2
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final Text word = new Text();
  private final IntWritable one = new IntWritable(1);

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one); // emit (word, 1) for every token
    }
  }
}

class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values)
      sum += val.get();
    context.write(key, new IntWritable(sum)); // emit (word, total count)
  }
}
7/21
Example: Word Count 2/2
public class WordCount {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setMapperClass(TokenizerMapper.class);

    // must be set explicitly: Hadoop cannot infer that the reducer
    // is safe to reuse as a combiner
    job.setCombinerClass(IntSumReducer.class);

    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
▶ Sending the integer 1 for each instance of a word is very inefficient (1 TB of input produces more than 1 TB of intermediate data)
▶ Hadoop doesn't know whether it can use the reducer as a combiner; it must be set manually
8/21
Hadoop weaknesses
▶ The reducer cannot always be used as a combiner; Hadoop relies on an explicit combiner specification or on manual partial aggregation inside the mapper instance's life cycle (the in-mapper combiner; a sketch follows this list)
▶ Combiners are limited to associative and commutative functions (like sum); partial aggregation is more general and powerful
▶ The programming model is limited to the map/reduce phases, so multi-job programs are often difficult and counter-intuitive (think of iterative algorithms like PageRank)
▶ Joins can be difficult; many techniques must be implemented from scratch
▶ More generally, MapReduce is indeed simple, but many optimizations feel more like hacks than natural parts of the model
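The first point deserves a sketch. Below is one way to write the in-mapper combiner pattern, here in Scala against the Hadoop API (the class name and buffer are ours; production code would also bound the buffer's memory):

import scala.collection.mutable
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Partial aggregation inside the mapper's life cycle: buffer counts
// across map() calls, emit them once in cleanup().
class InMapperTokenizer extends Mapper[Object, Text, Text, IntWritable] {
  private val counts = mutable.Map.empty[String, Int].withDefaultValue(0)

  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
    for (w <- value.toString.toLowerCase.split("\\s+")) counts(w) += 1

  override def cleanup(context: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
    for ((w, n) <- counts) context.write(new Text(w), new IntWritable(n))
}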
9/21
Summary
Hadoop MapReduce Programming Model
Cascading
Scalding
10/21
Cascading
▶ Open source project developed @Concurrent
▶ A Java application framework on top of Hadoop, designed to be extensible, providing:
  ▷ a Processing API, to develop complex data flows
  ▷ an Integration API, with integration tests supported by the framework, to avoid putting unstable software in production
  ▷ a Scheduling API, to schedule units of work from any third-party application
▶ It replaces the MapReduce programming model with a more generic, data-flow-oriented model
▶ Cascading includes a data flow optimizer that converts user data flows into optimized ones
11/21
Cascading Programming Model
▶ A Cascading program is composed of flows
▶ A flow is composed of a source tap, a sink tap, and pipes that connect them
▶ A pipe holds a particular transformation over its input data flow
▶ Pipes can be combined to create more complex programs
12/21
Example: Word Count
▶ MapReduce word count concept:

  [Diagram: DataSource → TextLine → Map (tokenize text and emit 1 for each token) → Shuffle → Reduce (count values and emit the result) → TextLine → DataDest]

▶ Cascading word count concept:

  [Diagram: TextLine → tokenize each line → group by tokens → count values in every group → TextLine]
13/21
Example: Word Count
public class WordCount {
  public static void main( String[] args ) {
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), args[0] );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), args[1] );

    RegexSplitGenerator splitter = new RegexSplitGenerator(
        new Fields( "token" ),
        "[ \\[\\]\\(\\),.]" );
    Pipe docPipe = new Each( "token", new Fields( "text" ), splitter,
        Fields.RESULTS ); // text -> token

    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps and pipes to create a flow definition
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

    // run the flow on Hadoop
    new HadoopFlowConnector().connect( flowDef ).complete();
  }
}
14/21
Summary
Hadoop MapReduce Programming Model
Cascading
Scalding
15/21
Scalding
▶ Open source project developed @Twitter
▶ Two APIs:
  ▷ Field based
    - Primary API: stable
    - Uses Cascading Fields: dynamic, with errors at runtime
  ▷ Type safe
    - Secondary API: experimental
    - Uses Scala types: static, with errors at compile time
▶ The two APIs can work together, using pipe.typed and TypedPipe.from
▶ This presentation is about the type-safe API ^^
16/21
Why Scalding
▶ The high-level idea of MapReduce comes from LISP: it works with functions (map/reduce) and function composition
▶ Cascading works with objects representing functions and uses constructors to compose pipes:

Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(),
                    Fields.ALL );

▶ Functional programming can naturally describe data flows: every pipe can be seen as a function, and pipes can be combined using functional composition. The code above could then be written as:

docPipe.groupBy( new Fields( "token" ) )
       .every( Fields.ALL, new Count(), Fields.ALL )
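For comparison, the same data flow reads almost identically over plain Scala collections, which is exactly the style the Scalding example on the next slide adopts (a sketch; the sample data is ours):

// Word count over plain Scala collections: the style Scalding adopts
// for distributed pipes. The sample data is illustrative.
val lines = List("a rose is a rose", "to be or not to be")
val wordCounts: Map[String, Int] =
  lines
    .flatMap(_.split("\\s+"))                // tokenize each line
    .groupBy(identity)                       // group by token
    .map { case (w, ws) => (w, ws.size) }    // count each group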
17/21
Example: Word Count
import com.twitter.scalding._

class WordCount(args: Args) extends Job(args) {

  /* TextLine reads each line of the given file */
  val input = TypedPipe.from(TextLine(args("input")))

  /* tokenize every line and flatten the result into a pipe of words */
  val words = input.flatMap { tokenize(_) }

  /* group by word; size counts the elements of each group */
  val wordGroups = words.groupBy { identity(_) }.size

  /* write each (word, count) pair as a line */
  wordGroups.toTypedPipe.write(TypedTsv[(String, Long)](args("output")))

  /* Split a piece of text into individual words */
  def tokenize(text: String): Array[String] = {
    // Lowercase the text and remove punctuation before splitting.
    text.trim.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
      .split("\\s+")
  }
}
18/21
Scalding TypeSafe API
Two main concepts:

▶ TypedPipe[T]: a class whose instances are distributed objects that wrap a Cascading Pipe object and hold the transformations done up to that point. Its interface is similar to Scala's Iterator[T] (map, flatMap, groupBy, filter, . . . )
▶ KeyedList[K,V]: a trait that represents a sharded list of items, with two implementations (see the sketch after this list):
  ▷ Grouped[K,V]: represents a grouping on keys of type K
  ▷ CoGrouped2[K,V,W,Result]: represents a cogroup over two grouped pipes; used for joins
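As a sketch of how these pieces fit together, here is a join expressed through the typed API. The case classes, sample data, and job name are ours, and the exact cogroup class names vary across Scalding versions:

import com.twitter.scalding._

// Illustrative record types; not from the talk.
case class Visit(memberId: Long, page: String)
case class Member(memberId: Long, country: String)

class PagesByCountry(args: Args) extends Job(args) {
  val visits  = TypedPipe.from(List(Visit(1L, "/home"), Visit(2L, "/jobs")))
  val members = TypedPipe.from(List(Member(1L, "CH"), Member(2L, "IT")))

  // groupBy produces a Grouped[K, V]; join cogroups two grouped pipes
  // and yields one (key, (left, right)) pair per match.
  visits.groupBy(_.memberId)
    .join(members.groupBy(_.memberId))
    .toTypedPipe
    .map { case (_, (visit, member)) => (member.country, visit.page) }
    .write(TypedTsv[(String, String)](args("output")))
}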
19/21
Conclusions
▶ The MapReduce API is powerful but limited
▶ The Cascading API is as simple as the MapReduce API, but more generic and powerful
▶ Scalding combines Cascading and Scala to describe distributed programs easily. Its major strengths:
  ▷ Functional programming naturally describes data flows. Scalding resembles the Scala standard library: if you know Scala, you already know how to use Scalding
  ▷ Statically typed (the type-safe API): no type errors at runtime
  ▷ Scala is standard and runs on the JVM
  ▷ Scala libraries and tools can be used in production: IDEs, debuggers, test frameworks, build systems, and everything else
20/21
Thank you for listening
21/21