Scalding - the not-so-basics @ ScalaDays 2014
Scalding: the not-so-basics
Konrad 'ktoso' Malawski Scala Days 2014 @ Berlin
Konrad `@ktosopl` Malawski
typesafe.com geecon.org
Java.pl / KrakowScala.pl sckrk.com / meetup.com/Paper-Cup @ London
GDGKrakow.pl meetup.com/Lambda-Lounge-Krakow
hAkker @
http://hadoop.apache.org/
http://research.google.com/archive/mapreduce.html
How old is this guy?
Google MapReduce, paper: 2004 Hadoop (Yahoo impl): 2005
the Big Landscape
Hadoop
https://github.com/twitter/scalding
Scalding is “on top of” Hadoop
https://github.com/twitter/scalding
Scalding is “on top of” Cascading, which is “on top of” Hadoop
http://www.cascading.org/
https://github.com/twitter/scalding
Summingbird is “on top of” Scalding, which is “on top of” Cascading, which is “on top of” Hadoop
http://www.cascading.org/
https://github.com/twitter/summingbird
https://github.com/twitter/scalding
Summingbird is “on top of” Scalding or Storm; Scalding is “on top of” Cascading, which is “on top of” Hadoop;
Spark is a bit “separate” currently: HDFS yes, MapReduce no (possibly soon?!).
http://www.cascading.org/
https://github.com/twitter/summingbird
http://storm.incubator.apache.org/
http://spark.apache.org/
this talk
Why?
Stuff > Memory
Scala collections... fun, but memory bound!

val text = "so many words... waaah! ..."

text
  .split(" ")
  .map(a => (a, 1))
  .groupBy(_._1)
  .map(a => (a._1, a._2.map(_._2).sum))

Every one of these steps materializes its result in memory.
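As a runnable sketch of the slide's point: the same word count with plain Scala collections, where every intermediate step (the split array, the pairs, the groups) is a fully materialized in-memory collection. The object and method names here are illustrative.

```scala
// In-memory word count, as on the slide; fine until the input outgrows RAM.
object InMemoryWordCount {
  def count(text: String): Map[String, Int] =
    text
      .split(" ")                                                // Array[String], in memory
      .map(word => (word, 1))                                    // Array[(String, Int)], in memory
      .groupBy(_._1)                                             // Map[String, Array[(String, Int)]], in memory
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // final counts, in memory
}
```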
Why Scalding? Word Count in Hadoop MR

package org.myorg;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
The “Fields API”
map

Scala:
val data = 1 :: 2 :: 3 :: Nil

val doubled = data map { _ * 2 }   // Int => Int

Scalding:
IterableSource(data)
  .map('number -> 'doubled) { n: Int => n * 2 }   // Int => Int

'number is available in the Pipe and stays in the Pipe; 'doubled is added.
You must choose the type explicitly!
mapTo

Scala:
var data = 1 :: 2 :: 3 :: Nil

val doubled = data map { _ * 2 }
data = null   // “release reference”

Scalding:
IterableSource(data)
  .mapTo('doubled) { n: Int => n * 2 }

Only 'doubled stays in the Pipe; 'number is removed.
flatMap

Scala:
val data = "1" :: "2,2" :: "3,3,3" :: Nil   // List[String]

val numbers = data flatMap { line =>   // String
  line.split(",")                      // Array[String]
} map { _.toInt }                      // List[Int]

Scalding:
TextLine(data)                                    // like List[String]
  .flatMap('line -> 'word) { _.split(",") }       // like List[String]
  .map('word -> 'number) { _.toInt }              // like List[Int]

Or, fused into one step:

Scala:
val numbers = data flatMap { line =>   // String
  line.split(",").map(_.toInt)         // Array[Int]
}

Scalding:
TextLine(data)                                              // like List[String]
  .flatMap('line -> 'word) { _.split(",").map(_.toInt) }    // like List[Int]
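For comparison, a self-contained collections-only version of the fused pipeline (the Scalding field names have no counterpart here; the object name is illustrative):

```scala
// Split comma-separated lines and parse the pieces, in one flatMap step.
object FlatMapDemo {
  val data: List[String] = "1" :: "2,2" :: "3,3,3" :: Nil

  def numbers: List[Int] =
    data.flatMap { line =>
      line.split(",").map(_.toInt)   // Array[Int], flattened into the List
    }
}
```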
groupBy

Scala:
val data = 1 :: 2 :: 30 :: 42 :: Nil   // List[Int]

val groups = data groupBy { _ < 10 }
groups                                 // Map[Boolean, List[Int]]

Scalding:
IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.size }

groupBy groups all entries with an equal value; here the group sizes become the counts.

Summing each group instead:

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.sum('total) }

'total = [3, 72]
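The collections analogue of grouping by the predicate and then summing each group, runnable as-is (names are illustrative):

```scala
// Group by a predicate, then sum each group -- the plain-Scala analogue
// of the Scalding groupBy + sum above.
object GroupByDemo {
  val data: List[Int] = 1 :: 2 :: 30 :: 42 :: Nil

  // groupBy yields Map[Boolean, List[Int]]
  def groups: Map[Boolean, List[Int]] = data.groupBy(_ < 10)

  // summing each group yields Map[Boolean, Int]
  def totals: Map[Boolean, Int] =
    groups.map { case (key, values) => (key, values.sum) }
}
```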
Main Class - "Runner"

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding

object ScaldingJobRunner extends App {

  ToolRunner.run(new Configuration, new scalding.Tool, args)

}

args comes from App.
Word Count in Scalding

class WordCountJob(args: Args) extends Job(args) {

  val inputFile = args("input")
  val outputFile = args("output")

  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(outputFile))

  def tokenize(text: String): Array[String] = implemented
}
1 day in the life of a guy implementing Scalding jobs
“How much are my shops selling?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
1       107
2       144
3       16
…       …
“Which are the top selling shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll { _.sortBy('totalSoldItems).reverse }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
2       144
1       107
3       16
…       …
“What’s the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
2       144
1       107
3       16

SLOW! Instead do sortWithTake!
“What’s the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll {
    _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3)
  }
  .write(Tsv(output, writeHeader = true))

x
List((5,146), (2,142), (3,32))

WAT!? Emits scala.collection.List[_]
“What’s the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll {
    _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) {
      (l: (Long, Long), r: (Long, Long)) =>
        l._2 < r._2
    }
  }
  .flatMapTo('x -> ('shopId, 'totalSoldItems)) {
    x: List[(Long, Long)] => x
  }
  .write(Tsv(output, writeHeader = true))

Provide the Ordering explicitly, because the implicit Ordering is not enough for Tuple2 here.

shopId  totalSoldItems
2       144
1       107
3       16

MUCH faster Job = Happier me.
Reduce, these Monoids

interface:
trait Monoid[T] {
  def zero: T
  def +(a: T, b: T): T
}

+ 3 laws:

Closure: (T, T) => T
∀ a,b ∈ T: a·b ∈ T

Associativity:
∀ a,b,c ∈ T: (a·b)·c = a·(b·c)
(a + b) + c == a + (b + c)

Identity element:
∃ z ∈ T: ∀ a ∈ T: z·a = a·z = a
z + a == a + z == a

Summing:
object IntSum extends Monoid[Int] {
  def zero = 0
  def +(a: Int, b: Int) = a + b
}
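The interface and the sum instance above are small enough to run directly. A sketch, with the laws checked on sample values (the trait is renamed here to avoid clashing with Algebird's, which spells the operation `plus`):

```scala
// Minimal monoid, as on the slide (Algebird's real trait is richer).
trait SimpleMonoid[T] {
  def zero: T
  def plus(a: T, b: T): T
}

object IntSum extends SimpleMonoid[Int] {
  def zero = 0
  def plus(a: Int, b: Int) = a + b
}

object MonoidLaws {
  // Associativity: (a + b) + c == a + (b + c)
  def associative[T](m: SimpleMonoid[T], a: T, b: T, c: T): Boolean =
    m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c))

  // Identity: z + a == a + z == a
  def identity[T](m: SimpleMonoid[T], a: T): Boolean =
    m.plus(m.zero, a) == a && m.plus(a, m.zero) == a
}
```

Closure is already guaranteed by the types: `plus` can only return a `T`.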
Monoid ops can start “Map-side”

bear, 2
car, 3
deer, 2
river, 2

Monoid ops can already start being computed map-side!

Examples: average(), sum(), sortWithTake(), histogram()
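A runnable miniature of why associativity buys map-side combining: each “mapper” pre-reduces its own partition, and merging the partials gives the same answer as counting everything in one place. The partitioning and names here are illustrative, not a real Hadoop shuffle.

```scala
object MapSideCombine {
  // Each mapper can pre-aggregate its own partition...
  def partialCounts(partition: Seq[String]): Map[String, Int] =
    partition.groupBy(identity).map { case (word, ws) => (word, ws.size) }

  // ...and the reducer only merges the partial results (the monoid "plus").
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    (a.keySet ++ b.keySet).iterator
      .map(k => k -> (a.getOrElse(k, 0) + b.getOrElse(k, 0)))
      .toMap
}
```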
Obligatory: “Go check out Algebird, NOW!” slide
https://github.com/twitter/algebird
ALGE-birds
BloomFilterMonoid
https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL

val NUM_HASHES = 6
val WIDTH = 32
val SEED = 1
val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)

val bf1 = bfMonoid.create("1", "2", "3", "4", "100")
val bf2 = bfMonoid.create("12", "45")
val bf = bf1 ++ bf2
// bf: com.twitter.algebird.BF = ...

val approxBool = bf.contains("1")
// approxBool: com.twitter.algebird.ApproximateBoolean = ApproximateBoolean(true,0.9290349745708529)

val res = approxBool.isTrue
// res: Boolean = true
Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemId: String) => bf + itemId ! }! }! .map(‘itemBloom -> 'hasSoldBeer) { b: BF => b.contains(“beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true))
BloomFilterMonoid
Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemId: String) => bf + itemId ! }! }! .map(‘itemBloom -> 'hasSoldBeer) { b: BF => b.contains(“beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true))
shopId! hasSoldBeer!hasSoldWurst! 1!! ! ! false!! ! ! true! 2!! ! ! false!! ! ! true! 3!! ! ! false!! ! ! true! 4!! ! ! true! ! ! ! false! 5!! ! ! true! ! ! ! false!
BloomFilterMonoid
Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemId: String) => bf + itemId ! }! }! .map(‘itemBloom -> 'hasSoldBeer) { b: BF => b.contains(“beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true))
shopId! hasSoldBeer!hasSoldWurst! 1!! ! ! false!! ! ! true! 2!! ! ! false!! ! ! true! 3!! ! ! false!! ! ! true! 4!! ! ! true! ! ! ! false! 5!! ! ! true! ! ! ! false!
Why not Set[String]? It would OutOfMemory.
BloomFilterMonoid
Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemId: String) => bf + itemId ! }! }! .map(‘itemBloom -> 'hasSoldBeer) { b: BF => b.contains(“beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true))
shopId! hasSoldBeer!hasSoldWurst! 1!! ! ! false!! ! ! true! 2!! ! ! false!! ! ! true! 3!! ! ! false!! ! ! true! 4!! ! ! true! ! ! ! false! 5!! ! ! true! ! ! ! false!
ApproximateBoolean(true,0.9999580954658956)
Why not Set[String]? It would OutOfMemory.
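To make the idea concrete, here is a toy, self-contained Bloom filter, not Algebird's implementation; the hashing scheme is made up for illustration. Membership may report false positives but never false negatives, and two filters combine with `++` exactly like a monoid plus.

```scala
// Toy Bloom filter: fixed-width bit positions, numHashes hash functions
// derived from the JVM hashCode with different seeds. Illustrative only.
final case class TinyBloom(width: Int, numHashes: Int, bits: Set[Int] = Set.empty) {
  private def hashes(item: String): Seq[Int] =
    (0 until numHashes).map(seed => Math.floorMod((item, seed).hashCode, width))

  def +(item: String): TinyBloom = copy(bits = bits ++ hashes(item))

  // The monoid plus: union of the bit sets.
  def ++(other: TinyBloom): TinyBloom = copy(bits = bits ++ other.bits)

  // May report false positives, never false negatives.
  def mightContain(item: String): Boolean = hashes(item).forall(bits.contains)
}
```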
Joins

that.joinWithLarger('id1 -> 'id2, other)
that.joinWithSmaller('id1 -> 'id2, other)

that.joinWithTiny('id1 -> 'id2, other)

joinWithTiny is appropriate when you know that # of rows in the bigger pipe > mappers * # of rows in the smaller pipe, where mappers is the number of mappers in the job.
The “usual”
Joins

val people = IterableSource(
  (1, "hans") ::
  (2, "bob") ::
  (3, "hermut") ::
  (4, "heinz") ::
  (5, "klemens") :: … :: Nil,
  ('id, 'name))

val cars = IterableSource(
  (99, 1, "bmw") ::
  (123, 2, "mercedes") ::
  (240, 11, "other") :: Nil,
  ('carId, 'ownerId, 'carName))

import com.twitter.scalding.FunctionImplicits._

people.joinWithLarger('id -> 'ownerId, cars)
  .map(('name, 'carName) -> 'sentence) {
    (name: String, car: String) =>
      s"Hello $name, your $car is really nice"
  }
  .project('sentence)
  .write(output)

Hello hans, your bmw is really nice
Hello bob, your mercedes is really nice
“map-side” join

that.joinWithTiny('id1 -> 'id2, tinyPipe)

Choose this when:
Left > max(mappers, reducers) * Right
or: when the Left side is 3 orders of magnitude larger.
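What joinWithTiny does in spirit, as a runnable sketch: ship the tiny side everywhere as an in-memory lookup table and stream the big side past it, so the big side never needs to be shuffled. This is an inner join; the object and parameter names are illustrative, not Scalding's internals.

```scala
object ReplicatedJoin {
  // `tiny` becomes an in-memory map (one copy per mapper in the real thing);
  // each row of `big` is then joined locally against it.
  def join[K, A, B](big: Seq[(K, A)], tiny: Seq[(K, B)]): Seq[(K, A, B)] = {
    val lookup: Map[K, B] = tiny.toMap
    big.flatMap { case (k, a) => lookup.get(k).map(b => (k, a, b)) }
  }
}
```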
Skew Joins

val sampleRate = 0.001
val reducers = 10
val replicationFactor = 1
val replicator = SkewReplicationA(replicationFactor)

val genders: RichPipe = …
val followers: RichPipe = …

followers
  .skewJoinWithSmaller('y1 -> 'y2, genders, sampleRate, reducers, replicator)
  .project('x1, 'y1, 's1, 'x2, 'y2, 's2)
  .write(Tsv("output"))

1. Sample from the left and right pipes with some small probability, in order to determine approximately how often each join key appears in each pipe.
2. Use these estimated counts to replicate the join keys, according to the given replication strategy.
3. Join the replicated pipes together.
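Step 1 above (estimating key frequencies from a small sample) can be sketched as below: keep each row with probability `sampleRate`, count keys in the sample, and scale the counts back up by 1/sampleRate. Function and parameter names are illustrative, not Scalding's internals.

```scala
object SkewSampling {
  // Estimate how often each join key appears, from a random sample of rows.
  def estimateKeyCounts[K](rows: Seq[K], sampleRate: Double, seed: Long): Map[K, Long] = {
    val rnd = new scala.util.Random(seed)
    rows
      .filter(_ => rnd.nextDouble() < sampleRate)   // keep with prob. sampleRate
      .groupBy(identity)
      .map { case (k, ks) => k -> math.round(ks.size / sampleRate) } // scale back up
  }
}
```

With a sampleRate of 1.0 every row is kept and the estimate is exact; with small rates it is only approximate, which is all the replication strategy needs.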
Where did my type-safety go?!
Where did my type-safety go?!Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out))!
Where did my type-safety go?!Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out))!
Caused by: cascading.flow.FlowException: local step failed
  at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:219)
  at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
  at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
  at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:744)
Caused by: cascading.pipe.OperatorException: [com.twitter.scalding.C...][com.twitter.scalding.RichPipe.filter(RichPipe.scala:325)] operator Each failed executing operation
  at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:81)
  at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:34)
  at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
  at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
  at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.NumberFormatException: For input string: "bob"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Long.parseLong(Long.java:589)
  at java.lang.Long.parseLong(Long.java:631)
  at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:50)
  at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:29)
“oh, right… We changed that file to be user names, not ids…”
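The failure is easy to reproduce in isolation. The untyped field API coerces cell values at run time, so a non-numeric cell only fails at the moment something finally asks for it as a Long:

```scala
import scala.util.Try

// "bob" reads fine as a String; it only blows up once the filter
// demands a Long, deep inside the running job.
val parsed = Try("bob".toLong)
// parsed is a Failure wrapping a NumberFormatException
```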
Trap it!

Tsv(in, ('userId1, 'userId2, 'rel))
  .addTrap(Tsv("errors")) // add a trap
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))
This solves the “dirty data” problem, but it doesn't help with maintenance.
Typed API
Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { rel: Long => rel == 1337 }
  .write(Tsv(out))
import TDsl._

TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))
Must give a Type to each Field
// Tuple arity: 2, but three fields are declared
TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))

// Tuple arity: 3, matching the three fields
TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))
The arity-2 version fails immediately:

Caused by: java.lang.IllegalArgumentException: num of types must equal number of fields: [{3}:'user1', 'user2', 'rel'], found: 2
  at cascading.scheme.util.DelimitedParser.reset(DelimitedParser.java:176)
a “planning-time” exception
// … with Relationships {
  import TDsl._

  userRelationships(date)
    .filter { _._1 == "bob" }
    .write(TypedTsv(out))
}
Easier to reuse schemas now
Not coupled to Field names, but still too magic for reuse… “_1”?
// … with Relationships {
  import TDsl._

  userRelationships(date)
    .filter { p: Person => p.name == "bob" }
    .write(TypedTsv(out))
}
TypedPipe[Person]
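A plain-Scala analogue of the TypedPipe[Person] version above (Person here is a stand-in case class invented for the sketch, not from the talk's codebase):

```scala
// Once rows are case classes, filtering is ordinary, compiler-checked
// field access: rename `name` and the build breaks, not the nightly job.
case class Person(name: String, age: Int)

val people = List(Person("bob", 30), Person("alice", 25))
val bobs   = people.filter(p => p.name == "bob")
```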
Typed Joins
case class UserName(id: Long, handle: String)
case class UserFavs(byUser: Long, favs: List[Long])
case class UserTweets(byUser: Long, tweets: List[Long])

def users: TypedSource[UserName]
def favs: TypedSource[UserFavs]
def tweets: TypedSource[UserTweets]

def output: TypedSink[(UserName, UserFavs, UserTweets)]

users.groupBy(_.id)
  .join(favs.groupBy(_.byUser))
  .join(tweets.groupBy(_.byUser))
  .map { case (uid, ((user, favs), tweets)) =>
    (user, favs, tweets)
  }
  .write(output)
3-way-merge in 1 MR step
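Why the pattern match nests as ((user, favs), tweets) can be seen on in-memory Maps. This is a toy analogue of the join shape, not Scalding's CoGroup machinery:

```scala
case class UserName(id: Long, handle: String)
case class UserFavs(byUser: Long, favs: List[Long])
case class UserTweets(byUser: Long, tweets: List[Long])

val users  = Map(1L -> UserName(1L, "bob"))
val favs   = Map(1L -> UserFavs(1L, List(10L)))
val tweets = Map(1L -> UserTweets(1L, List(20L, 21L)))

// Each successive join pairs values under the shared key, nesting to the
// left, which is exactly why the map step unpacks ((user, favs), tweets).
val joined: Map[Long, ((UserName, UserFavs), UserTweets)] =
  users.map { case (k, u) => k -> ((u, favs(k)), tweets(k)) }

val out = joined.map { case (_, ((user, f), t)) => (user, f, t) }
```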
> run pl.project13.oculus.job.WordCountJob \
    --local --tool.graph --input in --output out

writing DOT:
  pl.project13.oculus.job.WordCountJob0.dot

writing Steps DOT:
  pl.project13.oculus.job.WordCountJob0_steps.dot
Do the DOT
pl.project13.oculus.job.WordCountJob0.dot
pl.project13.oculus.job.WordCountJob0_steps.dot

> dot -Tpng pl.project13.oculus.job.WordCountJob0.dot
[Rendered graph: the generated PNG visualizes the job's flow, annotated with its MAP and REDUCE phases.]
<3 Testing
class WordCountJobTest extends FlatSpec
  with ShouldMatchers with TupleConversions {

  "WordCountJob" should "count words" in {
    JobTest(new WordCountJob(_))
      .arg("input", "inFile")
      .arg("output", "outFile")
      .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))
      .sink[(String, Int)](Tsv("outFile")) { out =>
        out.toList should contain ("kapi" -> 2)
      }
      .run
      .finish
  }
}
The same test can run through Hadoop's local runner instead, by swapping .run for .runHadoop:

      .runHadoop
      .finish
run || runHadoop
“Parallelize all the batches!”

Feels much like Scala collections
Local Mode thanks to Cascading
Easy to add custom Taps
Type Safe, when you want to
Pure Scala
Testing friendly
Matrix API
Efficient columnar storage (Parquet)
Scalding Re-Cap
TextLine(inputFile)
  .flatMap('line -> 'word) { line: String => tokenize(line) }
  .groupBy('word) { _.size }
  .write(Tsv(outputFile))

(just 4 lines)
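The recap really does read like collections code. Here is the same word count over an in-memory Seq (tokenize is a helper defined for this sketch, standing in for the one assumed on the slide):

```scala
// Word count on ordinary Scala collections; the Scalding pipeline above
// is the same shape, with TextLine as the source and 'word as the key.
def tokenize(line: String): Seq[String] =
  line.toLowerCase.split("\\s+").toSeq.filter(_.nonEmpty)

val lines = Seq("kapi kapi pi", "pu kapi")

val counts: Map[String, Int] =
  lines.flatMap(tokenize)
    .groupBy(identity)
    .map { case (word, occurrences) => word -> occurrences.size }
```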
$ activator new activator-scalding
Try it!
http://typesafe.com/activator/template/activator-scalding
Template by Dean Wampler
Loads Of Links
1. http://parleys.com/play/51c2e0f3e4b0ed877035684f/chapter0/about
2. https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/ReduceOperations.scala
3. http://www.slideshare.net/johnynek/scalding?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=4
4. http://www.slideshare.net/Hadoop_Summit/severs-june26-255pmroom210av2?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=3
5. http://www.slideshare.net/LivePersonDev/scalding-reaching-efficient-mapreduce?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=2
6. http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
7. http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/
8. https://engineering.twitter.com/university/videos/why-scalding-is-important-for-data-science
9. https://github.com/parquet/parquet-format
10. http://www.slideshare.net/ktoso/scalding-hadoop-word-count-in-less-than-60-lines-of-code
11. https://github.com/scalaz/scalaz
12. http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
Danke! Dzięki! Thanks! Gracias!
ありがとう!
ktoso @ typesafe.com t: ktosopl / g: ktoso blog: project13.pl