Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Summit EU talk by Emlyn Whittick
-
Upload
spark-summit -
Category
Data & Analytics
-
view
262 -
download
2
Transcript of Spark Summit EU talk by Emlyn Whittick
![Page 1: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/1.jpg)
SPARK SUMMIT EUROPE 2016
TALK DATA TO ME:SPARKING INSIGHTS AT ELSEVIER
Emlyn WhittickPrincipal Software Engineer, Elsevier
![Page 2: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/2.jpg)
Elsevier• Founded in 1880 (136 years old)• Started life as a traditional
publisher• Specialised in scientific and
medical publishing• Now an information solutions
provider
![Page 3: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/3.jpg)
LEAD THE WAY IN ADVANCING SCIENCE, TECHNOLOGY AND HEALTHElsevier’s Mission
![Page 4: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/4.jpg)
A Wealth of Data
Article abstracts
Article citation
data
Full text articles Profile data
Institutional data
Funding data
Usage data Editorial data
Social networks
![Page 5: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/5.jpg)
The Challenge
![Page 6: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/6.jpg)
The Challenge
![Page 7: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/7.jpg)
COLLECTING THE INGREDIENTSPart One
![Page 8: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/8.jpg)
Let’s Go Shopping
![Page 9: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/9.jpg)
Collecting IngredientsStandardised, automated
collection and delivery
Proprietary, mostly automated mechanisms
Manual, ad-hoc collection and delivery
![Page 10: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/10.jpg)
Getting Access
![Page 11: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/11.jpg)
ALONG CAME A SPARKPart Two
![Page 12: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/12.jpg)
An Initial Spark• Use Case: Application of NLP on Elsevier’s
entire body of published content• Databricks selected for:
– Mounting of data– Minimal operational overhead– Presentation of results
![Page 13: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/13.jpg)
An Initial Spark• Databricks used by team of 15
people for content analytics• Spark adopted as the main
processing engine for Elsevier’s big data platform
![Page 14: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/14.jpg)
An Initial Spark• Spark 2.0 and Spark 1.6• RDD to DataFrame, and now to
DataSet• Production workflows in Scala• Data Science and Analytics
workflows may be in Python or R
![Page 15: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/15.jpg)
A Recipe for Insights• Empower the master chefs• Prepare the ingredients• Share pre-prepared and cooked dishes• Serve delicious dishes
![Page 16: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/16.jpg)
PREPARING OUR INGREDIENTSStep Three
![Page 17: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/17.jpg)
CSV Potato
• Old Classic• All shapes and sizes• Often dirty• Sometimes it’s a
sweet potato
![Page 18: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/18.jpg)
XML Onion
• Lots of nested layers• Sometimes it makes
you cry• Pretty much everyone
has some somewhere
![Page 19: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/19.jpg)
Delivering the Groceries• Large suppliers have well established delivery
mechanisms• The local store may be less reliable• The local farm even less so
![Page 20: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/20.jpg)
Sparking Data Preparation• 200 million article abstracts• Stored in Amazon S3• XML files named by ID• Data delivery notifications via SNS• Variable file sizes (kB to MB)
![Page 21: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/21.jpg)
Sparking Data Preparation• Skewed key distribution made processing
difficult• Data pre-processing using Spark Streaming• Data processing in batch using Spark and
Parquet as an output format
![Page 22: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/22.jpg)
LET’S GET COOKINGStep Four
![Page 23: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/23.jpg)
Cooking our Ingredients• Focus on sharing cooked ingredients• Cooking in batches has its advantages
– Verifying consistency– Testing and releasing batches– Recoverability
![Page 24: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/24.jpg)
Cooking with Spark• Our data processing is exclusively done in Spark• Cooking is mainly done in batch but we’re
looking at more and more streaming• Most synthesized batch datasets stored as
Parquet
![Page 25: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/25.jpg)
Cooking Citations• 200 million article abstracts in XML format• Calculation of article citations by article and
author• Transformed to key value JSON for REST API• Mounted and shared in Databricks
![Page 26: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/26.jpg)
Self Service with Databricks• Data from the data lake
mounted in DBFS• Single shared cluster
provided for general use• Bespoke clusters for
heavy workloads0
10
20
30
40
50
60
70
80
Databricks Users at Elsevier
![Page 27: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/27.jpg)
![Page 28: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/28.jpg)
Databricks Use Cases• Author relationships and graphs• Author disambiguation research• Article recommendations• Data profiling and exploration• Learning
![Page 29: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/29.jpg)
Databricks Use Cases
![Page 30: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/30.jpg)
HAUTE CUISINENext Up
![Page 31: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/31.jpg)
Our Road Ahead• Fine-grained data access, privacy and security• Data discovery and provenance• Data cleansing and classification• Enhanced operational support
![Page 33: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/33.jpg)
![Page 34: Spark Summit EU talk by Emlyn Whittick](https://reader031.fdocuments.in/reader031/viewer/2022021813/586f75821a28ab10258b60a7/html5/thumbnails/34.jpg)
HPCC• Developed for processing public records• Limited flexibility• Limited support and community• High barrier to entry• Difficult to get back to the original data• Not tolerant to faults