Scalding by Adform Research, Alex Gryzlov
![Page 1: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/1.jpg)
Quick Guide
![Page 2: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/2.jpg)
What is Scalding?
• Scala wrapper for Cascading
![Page 3: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/3.jpg)
What is Cascading?
Tap / Pipe / Sink abstraction over Map / Reduce in Java
![Page 4: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/4.jpg)
What is Scalding?
• Scala wrapper for Cascading
• Just like working with in-memory collections!
TextLine( args("input") )
.flatMap('line -> 'word) { line : String => tokenize(line) }
.groupBy('word) { _.size }
.write( Tsv( args("output") ) )
• No more scripting and UDFs!
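To make the comparison with in-memory collections concrete, here is a minimal plain-Scala sketch of the same word count over an ordinary Seq. It does not use Scalding at all; the `tokenize` helper is an assumption (the slide calls it without defining it), and `InMemoryWordCount` is an illustrative name.

```scala
object InMemoryWordCount {
  // Hypothetical tokenizer: the slide calls tokenize() but never defines it.
  // Lower-cases the line and splits on runs of non-word characters.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Same shape as the pipeline on the slide: flatMap to words, group, count.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or not to be", "To do"))
    // Print one "word<TAB>count" row per word, sorted by word.
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

The Scalding version on the slide keeps the flatMap / groupBy / size structure but reads from and writes to taps instead of local collections.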
![Page 5: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/5.jpg)
Hands on
• Clone the skeleton repository
• Get IntelliJ IDEA and the Scala plugin
• Open the project
• Compile, wait for dependencies to download
• Create a run configuration …
• Create a specs2 configuration for tests
![Page 6: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/6.jpg)
Run the WordCountJob in local mode with the given input and output
![Page 7: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/7.jpg)
Building and Deploying
• Get sbt
• sbt assembly produces a jar file in target/scala-2.10
• sbt s3-upload builds the jar and uploads it to S3
• Configure TeamCity
![Page 8: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/8.jpg)
Running on EMR
• hadoop fs -get s3://dev-adform-temp-results/wordcount-job.jar job.jar
• hadoop jar job.jar \
    com.twitter.scalding.Tool \
    com.adform.dspr.WordCountJob \
    --hdfs \
    --input s3://adform-dsp-metadata/countries/countries.txt \
    --output s3://dev-adform-temp-results/wordcount
• Arguments, in order: entry class (com.twitter.scalding.Tool), Scalding job class (com.adform.dspr.WordCountJob), --hdfs to run in HDFS mode, then the job parameters --input and --output
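The `--input` and `--output` parameters are what the job reads through `args("input")` and `args("output")`. The sketch below shows, in plain Scala, one way `--key value` tokens could be grouped into a lookup table. It is an illustration only, not Scalding's own argument handling; `ArgsSketch` and `parse` are hypothetical names.

```scala
object ArgsSketch {
  // Illustrative parser only; this is NOT Scalding's Args implementation.
  // Groups tokens as: "--key" followed by zero or more values.
  def parse(tokens: Seq[String]): Map[String, List[String]] =
    tokens.foldLeft(List.empty[(String, List[String])]) {
      // A new flag starts a new (key, values) entry.
      case (acc, tok) if tok.startsWith("--") =>
        (tok.drop(2), List.empty[String]) :: acc
      // A plain token is appended to the most recent flag.
      case ((key, values) :: rest, tok) =>
        (key, values :+ tok) :: rest
      // Plain tokens before the first flag are ignored in this sketch.
      case (Nil, _) =>
        Nil
    }.toMap

  def main(args: Array[String]): Unit = {
    val parsed = parse(Seq("--hdfs", "--input", "in.txt", "--output", "out"))
    println(parsed("input").head)  // value passed after --input
    println(parsed.contains("hdfs")) // flags without values are still recorded
  }
}
```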
![Page 9: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/9.jpg)
Under the covers
• sbt run-main \
com.twitter.scalding.Tool \
com.adform.dspr.WordCountJob \
--hdfs \
--tool.graph \
--input dummy --output dummy
• dot -Tpng com.adform.dspr.WordCountJob0.dot -o logical_plan.png
• dot -Tpng com.adform.dspr.WordCountJob0_steps.dot -o mr_plan.png
![Page 10: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/10.jpg)
![Page 11: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/11.jpg)
Development
• Different APIs:
  • Fields – everything is a string
  • Typed – working with classes, e.g. Request/Transaction
![Page 12: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/12.jpg)
Development
• Fields:
  • No need to parse columns
  • Redundant
  • No IDE support like auto-completion
• Typed:
  • All benefits of types
  • More manual work with parsing
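The "manual work with parsing" in the typed style can be sketched in plain Scala as follows. The `Request` fields (`id`, `country`, `price`) and the three-column tab-separated layout are assumptions made for illustration; the slides mention Request/Transaction classes but do not show them. The aggregation runs over a local Seq, not a Scalding pipe.

```scala
import scala.util.Try

// Hypothetical record type; the real Request class is not shown in the slides.
final case class Request(id: Long, country: String, price: Double)

object Request {
  // The manual part: split a tab-separated line and convert each column.
  // Returns None for a wrong column count or an unparseable number.
  def parse(line: String): Option[Request] =
    line.split("\t", -1) match {
      case Array(id, country, price) =>
        Try(Request(id.toLong, country, price.toDouble)).toOption
      case _ => None
    }
}

object TypedSketch {
  // Once rows are parsed, the compiler checks field names and types,
  // and the IDE can auto-complete _.country and _.price.
  def revenueByCountry(lines: Seq[String]): Map[String, Double] =
    lines
      .flatMap(line => Request.parse(line))
      .groupBy(_.country)
      .map { case (country, rs) => country -> rs.map(_.price).sum }
}
```

The trade-off matches the bullets above: the parsing code is boilerplate, but a typo in a field name becomes a compile error instead of a runtime failure.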
![Page 13: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/13.jpg)
Resources
• https://github.com/twitter/scalding
• https://github.com/twitter/scalding/tree/develop/tutorial
• https://github.com/twitter/scalding/wiki
• http://www.slideshare.net/AntwnisChalkiopoulos/scalding-presentation
• http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-2014
• https://gitz.adform.com/dspr/data-processing/tree/develop/jobs/process-logs-rtb
![Page 14: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/14.jpg)
My Experience
• Running the job locally is a HUGE time saver
• Programming in Scala is amazing (no more UDFs)
• Type safety, IDE support!
• Debugging !!!!111
• More optimal job plans
![Page 15: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/15.jpg)
My Experience
• A lot of configuring and googling random issues
• Scarce documentation, had to read source code
• IntelliJ is slow
• Boilerplate code for parsing data
![Page 16: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/16.jpg)
Use cases
• Easy jobs → Hive
• Non-trivial jobs → Scalding
• Optional: Scalding is nice for matrix calculations; Twitter also provides many monoids (algorithms) for good approximations, e.g. HyperLogLog, CountMinSketch, etc. (see Algebird).
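As a rough illustration of why monoids fit this kind of job, the plain-Scala sketch below defines a minimal monoid and uses it to merge partial per-key counts. The `Monoid` trait here is a local stand-in written for this example, not Algebird's API, and exact counts are used in place of an approximate structure such as HyperLogLog.

```scala
// Minimal monoid: an associative combine (plus) with an identity element (zero).
// Local stand-in for illustration; not the Algebird trait.
trait Monoid[A] {
  def zero: A
  def plus(x: A, y: A): A
}

object MonoidSketch {
  // Per-key counts merge associatively, so partial results computed on
  // different mappers can be combined in any grouping.
  val countMonoid: Monoid[Map[String, Long]] = new Monoid[Map[String, Long]] {
    val zero: Map[String, Long] = Map.empty
    def plus(x: Map[String, Long], y: Map[String, Long]): Map[String, Long] =
      y.foldLeft(x) { case (acc, (k, v)) =>
        acc.updated(k, acc.getOrElse(k, 0L) + v)
      }
  }

  // Fold any number of partial aggregates into one, starting from zero.
  def combineAll[A](parts: Seq[A], m: Monoid[A]): A =
    parts.foldLeft(m.zero)(m.plus)
}
```

Approximate structures follow the same pattern: each partition builds a small summary, and the summaries are merged with an associative `plus`.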
![Page 17: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/17.jpg)
process-logs-rtb
• Had to hack Scalding:
  • WritableMultiSinkTap
  • Records
  • CompressedTsv
  • ModelKryoInstantiator
• Uses typed API
• Helpers like FluentJob
![Page 18: Scalding by Adform Research, Alex Gryzlov](https://reader033.fdocuments.in/reader033/viewer/2022051414/55a68d5b1a28abbe7d8b4721/html5/thumbnails/18.jpg)