Spark tutorial
Learning PySpark: A Tutorial
By: Maria Mestre (@mariarmestre), Sahan Bulathwela (@in4maniac), Erik Pazos (@zerophewl)
This tutorial
Skimlinks | Spark… A view from the trenches !!
● Some key Spark concepts (a 2-minute crash course)
● First part: Spark core
  ○ Notebook: basic operations
  ○ The Spark execution model
● Second part: DataFrames and Spark SQL
  ○ Notebook: using DataFrames and Spark SQL
  ○ The DataFrames execution model
● A final note on Spark configs, and useful areas to explore from here
How to set up the tutorial
● Directions and resources for setting up the tutorial in your local environment can be found in the blog post below:
https://in4maniac.wordpress.com/2016/10/09/spark-tutorial/
The datasets
● Data extracted from the Amazon dataset:
  ○ Image-based recommendations on styles and substitutes, J. McAuley, C. Targett, Q. Shi, A. van den Hengel, SIGIR, 2015
  ○ Inferring networks of substitutable and complementary products, J. McAuley, R. Pandey, J. Leskovec, Knowledge Discovery and Data Mining, 2015
● A sample of Amazon product reviews:
  ○ fashion.json, electronics.json, sports.json
  ○ fields: ASIN, review text, reviewer name, …
● A sample of product metadata:
  ○ sample_metadata.json
  ○ fields: ASIN, price, category, ...
Some Spark definitions (1)
● An RDD is a distributed dataset
● The dataset is divided into partitions
● It is possible to cache data in memory
Some Spark definitions (2)
● A cluster = a master node and slave nodes
● Transformations go through the Spark context
● Only the master node has access to the Spark context
● Actions and transformations
Notebook - Spark core parts 1-3
Why understand Spark internals?
● Essential for understanding failures and improving performance
This section is a condensed version of: https://spark-summit.org/2014/talk/a-deeper-understanding-of-spark-internals
From code to computations

import json

# textFile yields raw lines, so each JSON review is parsed first
rd = sc.textFile('product_reviews.txt').map(json.loads)
(rd.map(lambda x: (x['asin'], x['overall']))
   .groupByKey()
   .filter(lambda x: len(x[1]) > 1)
   .count())
From code to computations
1. You write code using RDDs
2. Spark creates a graph of RDDs
Execution model
3. Spark figures out a logical execution plan for each computation, splitting the work into stages (Stage 1, Stage 2, …)
4. Spark schedules and executes the individual tasks
If your shuffle fails...
● Shuffles are usually the bottleneck:
  ○ very large tasks ⇒ memory pressure
  ○ too many tasks ⇒ network overhead
  ○ too few tasks ⇒ suboptimal cluster utilisation
● Best practices:
  ○ always tune the number of partitions!
  ○ between 100 and 10,000 partitions
  ○ lower bound: at least ~2x the number of cores
  ○ upper bound: each task should take at least 100 ms
● https://spark.apache.org/docs/latest/tuning.html
Other things failing...
● I’m trying to save a file but it keeps failing...
  ○ Turn speculation off!
● I get a “no space left on device” error!
  ○ Make sure SPARK_LOCAL_DIRS uses the right disk partition on the slaves
● I keep losing my executors
  ○ It could be a memory problem: increase executor memory, or reduce the number of cores
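The corresponding knobs can be set at submit time; a hedged sketch (the values and `my_job.py` are placeholders):

```shell
# turn speculation off, and give executors more memory / fewer cores
# (SPARK_LOCAL_DIRS is an environment variable, set in conf/spark-env.sh on each slave)
spark-submit \
  --conf spark.speculation=false \
  --conf spark.executor.memory=8g \
  --conf spark.executor.cores=2 \
  my_job.py
```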
Notebook - Spark core part 4
Apache Spark (diagram slide)
DataFrames API (diagram slides)
DataFrames and Spark SQL
A DataFrame is a collection of data organized into named columns.
● The API is very similar to Pandas/R DataFrames
Spark SQL lets you query DataFrames with an SQL-like language.
● Powered by the Catalyst SQL engine
● The Hive context opens up most of HQL functionality on DataFrames
![Page 20: Spark tutorial](https://reader036.fdocuments.in/reader036/viewer/2022062522/586f7a461a28ab10258b72b3/html5/thumbnails/20.jpg)
RDDs and DataFrames
RDD:
● Data is stored as independent objects in partitions
● Optimizations happen at the RDD level
● More focus on “HOW” to obtain the required data
DataFrame:
● Data carries higher-level column information in addition to partitioning
● Optimizations use the schematic structure
● More focus on “WHAT” data is required
The two are transformable into each other.
Notebook - Spark DataFrames
How do DataFrames work?
● Why DataFrames?
● Overview
This section is inspired by: http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science
Main Considerations
Chart extracted from: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
Fundamentals
Catalyst planning pipeline (diagram):
● Input arrives either as SQL (SELECT cols FROM tables WHERE cond) through Spark SQL, or as DataFrame code
● Unresolved Logical Plan → Logical Plan (resolved against the Catalog)
● Logical Plan → Optimized Logical Plan
● Optimized Logical Plan → Physical Plans → Efficient Physical Plan
● The chosen plan is executed as RDD operations
Notebook - Spark SQL
New stuff: Data Source APIs
● Schema evolution
  ○ In Parquet, you can start from a basic schema and keep adding new fields.
● Run SQL directly on the file
  ○ With Parquet files, you can run SQL on the file itself, since Parquet carries its own structure.
Data Source APIs
● Partition discovery
  ○ Table partitioning is used in systems like Hive
  ○ The data is normally stored in different directories
spark-sklearn
● Parameter tuning is the problem:
  ○ the dataset is small
  ○ the grid search is BIG
More info: https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html
New stuff: Dataset API
● Spark’s goal: complex analyses with minimal programming effort
● Run Spark applications faster
  ○ Closely knit to the Catalyst engine and the Tungsten engine
● An extension of the DataFrame API: a type-safe, object-oriented programming interface
More info: https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
Spark 2.0
● API changes
● A lot of work on the Tungsten execution engine
● Support for the Dataset API
● Unification of the DataFrame & Dataset APIs
More info: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Important Links
● Amazon dataset: https://snap.stanford.edu/data/web-Amazon.html
● Spark DataFrames: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
● More resources about Apache Spark:
  ○ http://www.slideshare.net/databricks
  ○ https://www.youtube.com/channel/UC3q8O3Bh2Le8Rj1-Q-_UUbA
● Spark SQL programming guide for 1.6.1: https://spark.apache.org/docs/latest/sql-programming-guide.html
● Using Apache Spark in real-world applications: http://files.meetup.com/13722842/Spark%20Meetup.pdf
● Tungsten: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
● Further questions:
  ○ Maria: @mariarmestre
  ○ Erik: @zerophewl
  ○ Sahan: @in4maniac
Skimlinks is hiring Data Scientists and Senior Software Engineers!
● Machine Learning
● Apache Spark and Big Data
Get in touch with:
● Sahan: [email protected]
● Erik: [email protected]