Apache: Big Data - Starting with Apache Spark, Best Practices
Starting with Apache Spark: Best Practices and Learning from the Field
Felix Cheung, Principal Engineer + Spark Committer
Spark @ Microsoft
Best Practices - Enterprise Solutions
Resilient - Fault tolerant
19,500+ commits
Tungsten

AMPLab becoming RISELab
- Drizzle – low-latency execution, 3.5x lower latency than Spark Streaming
- Ernest – performance prediction; automatically choose the optimal resource config on the cloud
Spark Core
- Deployment
- Scheduler
- Resource Manager (aka Cluster Manager)
- Spark History Server, Spark UI
Parallelization, Partition, Transformation, Action, Shuffle
Parallelization: doing multiple things at the same time
Partition: a unit of parallelization
Transformation: manipulating data (immutable); "narrow" vs "wide"
Shuffle
- Processing: sorting, serialize/deserialize, compression
- Transfer: disk IO, network bandwidth/latency
- Takes up memory, or spills to disk for intermediate results ("shuffle file")
Action
- Materialize results
- Execute the chain of transformations that leads to output – lazy evaluation
- count, collect -> take, write
SQL
- DataFrame
- Dataset
- Data source
- Execution engine – Catalyst
Execution Plan, Predicate Pushdown
Strong typing; optimized execution
A DataFrame is a Dataset[Row]
Partition = a set of Rows
"format" - Parquet, CSV, JSON, or Cassandra, HBase
Predicate Pushdown
The ability to process expressions as early in the plan as possible
```scala
spark.read.jdbc(jdbcUrl, "food", connectionProperties)

// with pushdown
spark.read.jdbc(jdbcUrl, "food", connectionProperties)
  .select("hotdog", "pizza", "sushi")
```
Streaming
- Discretized Streams (DStreams)
- Receiver DStream
- Direct DStream
- Basic and Advanced Sources
Source Reliability
- Receiver + Write Ahead Log (WAL)
- Checkpointing
https://databricks.com/wp-content/uploads/2015/01/blog-ha-52.png
Direct DStream
- Only for reliable messaging sources that support reading from a position
- Stronger fault-tolerance, exactly-once*
- No receiver/WAL – fewer resources, lower overhead
Checkpointing
Saving to reliable storage to recover from failure
1. Metadata checkpointing – StreamingContext.checkpoint()
2. Data checkpointing – dstream.checkpoint()
Machine Learning
- ML Pipeline
- Transformer
- Estimator
- Evaluator
ML Pipeline
- DataFrame-based – leverages optimizations and supports transformations
- A sequence of algorithms – PipelineStages
DataFrame -> Transformer -> Transformer -> Estimator
(Feature engineering -> Modeling)
Feature transformer – takes a DataFrame and its Columns and appends one or more new Columns
StopWordsRemover, Binarizer, SQLTransformer, VectorAssembler
Estimators
- An algorithm: DataFrame -> Model
- A Model is a Transformer
- LinearRegression, KMeans
Evaluator
Metric to measure Model performance on held-out test data
Evaluator
MulticlassClassificationEvaluator, BinaryClassificationEvaluator, RegressionEvaluator
MLWriter/MLReader
Pipeline persistence – includes transformers, estimators, Params
Graph
- Pregel
- Graph Algorithms
- Graph Queries
Directed multigraph with user properties on edges and vertices
(example vertices: SEA, NYC, LAX)
PageRank, ConnectedComponents

```python
ranks = tripGraph.pageRank(resetProbability=0.15, maxIter=5)
```
GraphFrames
- DataFrame-based
- Simplify loading graph data, wrangling
- Support Graph Queries
Pattern matching – mix patterns with SQL syntax

```python
motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a); !(c)-[]->(a)").filter("a.id = 'MIA'")
```
Structured Streaming
- Structured Streaming Model
- Source
- Sink
- StreamingQuery
Extends the same DataFrame API to include incremental execution over unbounded input
Reliability, correctness/exactly-once – checkpointing (2.1: JSON format)
Stream as Unbounded Input
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Watermark (2.1) – handling of late data
Streaming ETL, joining static data, partitioning, windowing
Sources: FileStreamSource, KafkaSource, MemoryStream (not for production), TextSocketSource, MQTT
Sinks: FileStreamSink (new formats in 2.1), ConsoleSink, ForeachSink (Scala only), MemorySink – as Temp View
```python
staticDF = (spark
  .read
  .schema(jsonSchema)
  .json(inputPath))
```
```python
# Take a list of files as a stream
streamingDF = (spark
  .readStream
  .schema(jsonSchema)
  .option("maxFilesPerTrigger", 1)
  .json(inputPath))
```
```python
streamingCountsDF = (streamingDF
  .groupBy(streamingDF.word, window(streamingDF.time, "1 hour"))
  .count())
```
```python
query = (streamingCountsDF
  .writeStream
  .format("memory")
  .queryName("word_counts")
  .outputMode("complete")
  .start())

spark.sql("select count from word_counts order by time")
```
How much data goes in affects how much work it's going to take
Size does matter!
- CSV or JSON is "simple" but also tends to be big
- JSON -> Parquet (compressed): 7x faster
Format also matters
Recommended format – Parquet
- Default data source/format
- VectorizedReader
- Better dictionary decoding
Parquet Columnar Format
- Column chunks co-located
- Metadata and headers for skipping
Recommend Parquet
Compression is a factor
- gzip <100MB/s vs snappy ~500MB/s
- Tradeoff: faster or smaller?
- Spark 2.0+ defaults to snappy
Sidenote: Table Partitioning
- Stores data into groups by partitioning columns
- Encoded path structure matches Hive: table/event_date=2017-02-01
Spark UI – Timeline view
https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
Spark UI – DAG view
https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
Executor tab
SQL tab
Understanding Queries
explain() is your friend, but it can be hard to understand at times:

```
== Parsed Logical Plan ==
Aggregate [count(1) AS count#79L]
+- Sort [speed_y#49 ASC], true
   +- Join Inner, (speed_x#48 = speed_y#49)
      :- Project [speed#2 AS speed_x#48, dist#3]
      :  +- LogicalRDD [speed#2, dist#3]
      +- Project [speed#18 AS speed_y#49, dist#19]
         +- LogicalRDD [speed#18, dist#19]
```
```
== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count#79L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#83L])
      +- *Project
         +- *Sort [speed_y#49 ASC], true, 0
            +- Exchange rangepartitioning(speed_y#49 ASC, 200)
               +- *Project [speed_y#49]
                  +- *SortMergeJoin [speed_x#48], [speed_y#49], Inner
                     :- *Sort [speed_x#48 ASC], false, 0
                     :  +- Exchange hashpartitioning(speed_x#48, 200)
                     :     +- *Project [speed#2 AS speed_x#48]
                     :        +- *Filter isnotnull(speed#2)
                     :           +- Scan ExistingRDD[speed#2,dist#3]
                     +- *Sort [speed_y#49 ASC], false, 0
                        +- Exchange hashpartitioning(speed_y#49, 200)
                           +- *Project [speed#18 AS speed_y#49]
```
UDF
- Write your own custom transforms
- But... Catalyst can't see through it (yet?!)
- Always prefer builtin transforms as much as possible
UDF vs Builtin Example
Remember Predicate Pushdown?

```scala
val isSeattle = udf { (s: String) => s == "Seattle" }
cities.where(isSeattle('name))
```

```
*Filter UDF(name#2)
+- *FileScan parquet [id#128L,name#2] Batched: true, Format: ParquetFormat, InputPaths: file:/Users/b/cities.parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint,name:string>
```
UDF vs Builtin Example

```scala
cities.where('name === "Seattle")
```

```
*Project [id#128L, name#2]
+- *Filter (isnotnull(name#2) && (name#2 = Seattle))
   +- *FileScan parquet [id#128L,name#2] Batched: true, Format: ParquetFormat, InputPaths: file:/Users/b/cities.parquet, PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,Seattle)], ReadSchema: struct<id:bigint,name:string>
```
UDF in Python
Avoid! Why? Pickling, transfer, extra memory to run the Python interpreter – and hard-to-debug errors!

```python
from pyspark.sql.types import IntegerType
sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType())
sqlContext.sql("SELECT stringLengthInt('test')").take(1)
```
Going for Performance
- Store in compressed Parquet
- Partitioned table
- Predicate Pushdown
- Avoid UDFs
Shuffling for Join
Can be very expensive
Optimizing for Join
Partition! The join becomes a narrow transform if left and right are partitioned with the same scheme
Optimizing for Join – Broadcast Join (aka Map-Side Join in Hadoop)
- Joins a smaller table against a large table – avoids shuffling the large table
- Default: 10MB auto broadcast
BroadcastHashJoin

```scala
left.join(right, Seq("id"), "leftanti").explain
```

```
== Physical Plan ==
*BroadcastHashJoin [id#50], [id#60], LeftAnti, BuildRight
:- LocalTableScan [id#50, left#51]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   +- LocalTableScan [id#60]
```
Repartition
- To numPartitions or by Columns
- Increases parallelism – will shuffle
- coalesce() – combines partitions in place
Cache
- cache() or persist()
- Flushes least-recently-used (LRU) – make sure there is enough memory!
- MEMORY_AND_DISK to avoid expensive recompute (but spill to disk is slow)
Streaming
- Use Structured Streaming (2.1+)
- If not... and if using reliable messaging (Kafka), use Direct DStream
Metadata checkpointing stores:
- Config
- Position from streaming source (aka offset) – could get duplicates! (at-least-once)
- Pending batches
Data checkpointing:
- Persists stateful transformations – data lost if not saved
- Cuts short execution chains that could grow indefinitely
Direct DStream
- Checkpoint also stores offsets
- Turn off auto commit – commit when in a good state, for exactly-once
Checkpointing
- Stream/ML/Graph/SQL – more efficient for indefinite/iterative workloads; enables recovery
- Generally not versioning-safe
- Use a reliable distributed file system (caution on "object stores")
[Architecture diagram: Hadoop – WebLog ingested hourly into HDFS/Hive (Hive Metastore); Spark SQL with an External Data Source; BI Tools and Front End on top]
[Architecture diagram: Kafka -> Spark Streaming + Spark ML -> HDFS and Front End; near-real-time (end-to-end roundtrip: 8-20 sec) plus offline analysis]
[Architecture diagram: Spark SQL over RDBMS, Hive, and a SQL Appliance; BI Tools on top]
[Architecture diagram: Message Bus/Kafka -> Spark Streaming + Spark ML -> Storage/Data Lake (Hive Metastore); Spark SQL with an External Data Source; Visualization and BI Tools]
[Architecture diagram: Flume/Kafka -> Spark Streaming -> HDFS/Hive; Spark SQL and Presto for SQL access; Visualization, BI Tools, Data Science Notebook]
[Architecture diagram: Message Bus -> Spark Streaming -> Storage/SQL; Spark SQL; Data Factory orchestration]
Moving
https://www.linkedin.com/in/felix-cheung-b4067510
https://github.com/felixcheung