Cassandra Summit 2015
Real-time Advanced Analytics with Spark and Cassandra
Chris Fregly, Principal Data Solutions Engineer
IBM Spark Technology Center
Sept 24, 2015
Power of data. Simplicity of design. Speed of innovation.
Who am I?
Streaming Platform Engineer (Not a Photographer or Model)
Streaming Data Engineer (Netflix Open Source Committer)
Data Solutions Engineer (Apache Contributor)
Principal Data Solutions Engineer (IBM Spark Technology Center)
Advanced Apache Spark Meetup (Organizer)
Total Spark Experts: ~1000
Mean RSVPs per Meetup: ~300
Mean Attendance: ~50-60% of RSVPs
I'm lucky to work for a company/boss that lets me do this full-time!
Come work with me! We'll kick ass!
Recent and Future Meetups
Spark-Cassandra Connector w/ Russell Spitzer (DataStax) & Me (Sept 21st, 2015) <-- Great turnout and interesting questions!
Project Tungsten Data Structs+Algos: CPU & Memory Optimizations (Nov 12th, 2015)
Text-based Advanced Analytics and Machine Learning (Jan 14th, 2016)
ElasticSearch-Spark Connector w/ Costin Leau (Elastic.co) & Me (Feb 16th, 2016)
Spark Internals Deep Dive (Mar 24th, 2016)
Spark SQL Catalyst Optimizer Deep Dive (Apr 21st, 2016)
Topics of this Talk
① Recommendations
② Live, Interactive Demo!
③ DataFrames
④ Catalyst Optimizer and Query Plans
⑤ Data Sources API
⑥ Creating and Contributing a Custom Data Source
⑦ Partitions, Pruning, Pushdowns
⑧ Native + Third-Party Data Source Impls
⑨ Spark SQL Performance Tuning
New features of Spark 1.5!!
Audience Participation Required!!
Live, Interactive Demo!
Spark After Dark
Generating High-Quality Dating Recommendations
Real-time Advanced Analytics
Machine Learning, Graph Processing
Recommendations
Non-Personalized
  "Cold Start" Problem
  Top K
  PageRank
Personalized
  User-User, User-Item, Item-Item
  Collaborative Filtering (see the ALS sketch below)
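A hedged, minimal sketch (not from the talk) of personalized recommendations via MLlib's ALS collaborative filtering. The toy ratings follow the deck's RATINGS shape (UserID, ProfileID, Rating); the rank/iterations/lambda values are illustrative, not tuned.

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Toy in-memory ratings: Rating(user, profile, rating)
val ratings = sc.parallelize(Seq(
  Rating(1, 100, 9.0), Rating(1, 101, 2.0),
  Rating(2, 100, 8.0), Rating(2, 102, 7.0)))

val model = ALS.train(ratings, 10, 10, 0.01)    // matrix factorization
model.recommendProducts(1, 3).foreach(println)  // top-3 profiles for user 1
```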
Types of User Feedback
Explicit: ratings, likes
Implicit: searches, clicks, hovers, views, scrolls, pauses
Used to train models for future recommendations
Similarity
① Euclidean: linear measure; suffers from magnitude bias
② Cosine: angle measure; adjusts for magnitude bias
③ Jaccard: set intersection / union; suffers from popularity bias
④ Log Likelihood: adjusts for popularity bias
A tiny sketch of ② and ③ follows the matrix below.
Example user-item matrix (1 = expressed preference):

           Ali  Matei  Reynold  Patrick  Andy
Kimberly    1     1       1        1
Leslie      1                              1
Meredith    1     1       1
Lisa        1     1       1
Holden      1     1       1        1       1
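A tiny sketch (not from the talk) of the cosine and Jaccard measures over binary preference sets like the rows of the matrix above.

```scala
// Cosine over binary sets: |a ∩ b| / (sqrt(|a|) * sqrt(|b|))
def cosine(a: Set[String], b: Set[String]): Double =
  (a intersect b).size / (math.sqrt(a.size) * math.sqrt(b.size))

// Jaccard: |a ∩ b| / |a ∪ b|
def jaccard(a: Set[String], b: Set[String]): Double =
  (a intersect b).size.toDouble / (a union b).size

val kimberly = Set("Ali", "Matei", "Reynold", "Patrick")
val holden   = Set("Ali", "Matei", "Reynold", "Patrick", "Andy")

println(cosine(kimberly, holden))   // ~0.89: angle adjusts for Holden liking more people
println(jaccard(kimberly, holden))  // 0.8 = 4 / 5
```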
Comparing Similarity
All-pairs Similarity, aka pair-wise similarity or similarity join
Naïve implementation: O(m * n^2) shuffle; m = rows, n = cols
Clever approximations (see the RowMatrix sketch below):
  Reduce m: sampling and bucketing
  Reduce n: sparse matrix; factor out frequent values (0?)
  Locality Sensitive Hashing
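A hedged sketch (not from the talk) of the naive-vs-approximate tradeoff using MLlib's RowMatrix; its columnSimilarities(threshold) variant implements the sampling-based DIMSUM algorithm.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(            // each row = one user, cols = items
  Vectors.dense(1.0, 1.0, 1.0, 1.0, 0.0),
  Vectors.dense(1.0, 0.0, 0.0, 0.0, 1.0),
  Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0)))
val mat = new RowMatrix(rows)

val exact   = mat.columnSimilarities()    // brute force: O(m * n^2) shuffle
val sampled = mat.columnSimilarities(0.1) // DIMSUM: may drop pairs below 0.1
sampled.entries.take(5).foreach(println)  // MatrixEntry(i, j, cosine similarity)
```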
Audience Participation Required!!
Instructions for you:
① Navigate to sparkafterdark.com
② Click on 3 actors & 3 actresses
github.com/fluxcapacitor/pipeline
hub.docker.com/r/fluxcapacitor/pipeline/
DataFrames
Inspired by R and pandas DataFrames
Cross-language support: SQL, Python, Scala, Java, R
Levels the performance of Python, Scala, Java, and R
  Generates JVM bytecode instead of serializing/pickling objects to Python
A DataFrame is a container for a logical plan
  Transformations are lazy and represented as a tree
  The Catalyst Optimizer creates the physical plan
DataFrame.rdd returns the underlying RDD if needed
Custom UDFs via registerFunction() (see the sketch below)
New, experimental UDAF support
Use DataFrames instead of RDDs!!
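A short sketch of a custom UDF in Spark 1.5 Scala (registerFunction is the Python spelling; the Scala equivalent is sqlContext.udf.register). The gendersDF/genders table is the demo dataset loaded later in this deck; "initial" is a hypothetical function name.

```scala
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

// Register for use from SQL...
sqlContext.udf.register("initial", (gender: String) => gender.take(1))
sqlContext.sql("SELECT id, initial(gender) FROM genders").show()

// ...or wrap the same function for the DataFrame API.
val initial = udf((gender: String) => gender.take(1))
gendersDF.select($"id", initial($"gender")).show()
```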
Catalyst Optimizer
Converts the logical plan to a physical plan
Manipulates & optimizes the DataFrame transformation tree
  Subquery elimination: use aliases to collapse subqueries
  Constant folding: replace expressions with constants
  Simplify filters: remove unnecessary filters
  Predicate/filter pushdowns: avoid unnecessary data loads
  Projection collapsing: avoid unnecessary projections
Hooks for custom rules (see the sketch below)
  Rules = Scala case classes implementing o.a.s.sql.catalyst.rules.Rule
  val newPlan = MyFilterRule(analyzedPlan)
  Apply to any plan stage
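A hedged sketch of a custom rule (MyFilterRule is hypothetical): a rule is just a plan-to-plan function, applied via pattern matching on the plan tree.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

case object MyFilterRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Collapse a filter nested directly inside an identical filter.
    case Filter(cond, Filter(innerCond, child)) if cond == innerCond =>
      Filter(cond, child)
  }
}

// val newPlan = MyFilterRule(df.queryExecution.analyzed)
```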
Plan Debugging
gendersCsvDF.select($"id", $"gender")
  .filter("gender != 'F'")
  .filter("gender != 'M'")
  .explain(true)
Requires explain(true)
DataFrame.queryExecution.logical
DataFrame.queryExecution.analyzed
DataFrame.queryExecution.optimizedPlan
DataFrame.queryExecution.executedPlan
Plan Visualization & Join/Aggregation Metrics (Spark UI screenshot)
Callouts from the screenshot:
  Effectiveness of filter
  Cost-based optimization is applied
  Peak memory for joins and aggs
  Optimized CPU-cache-aware binary format minimizes GC & improves join perf (Project Tungsten)
New in Spark 1.5!
Data Sources API
Execution (o.a.s.sql.execution.commands.scala)
  RunnableCommand (trait/interface)
  ExplainCommand (impl: case class)
  CacheTableCommand (impl: case class)
Relations (o.a.s.sql.sources.interfaces.scala)
  BaseRelation (abstract class)
  TableScan (impl: returns all rows)
  PrunedFilteredScan (impl: column pruning and predicate pushdown)
  InsertableRelation (impl: insert or overwrite data using SaveMode)
Filters (o.a.s.sql.sources.filters.scala)
  Filter (abstract class for all filter pushdowns for this data source)
  EqualTo, GreaterThan, StringStartsWith
Creating a Custom Data Source
Study existing native and third-party data source impls
Native: JDBC (o.a.s.sql.execution.datasources.jdbc)
  class JDBCRelation extends BaseRelation
    with PrunedFilteredScan with InsertableRelation
Third-Party: Cassandra (o.a.s.sql.cassandra)
  class CassandraSourceRelation extends BaseRelation
    with PrunedFilteredScan with InsertableRelation
A minimal TableScan example follows.
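A minimal sketch of a custom data source for Spark 1.5 (the package and class names are hypothetical). Spark resolves format("com.example.ones") to the DefaultSource class in that package via the Data Sources API.

```scala
package com.example.ones

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new RangeRelation(sqlContext, parameters("rows").toInt)
}

// TableScan = the simplest contract: return every row, no pruning or pushdown.
class RangeRelation(val sqlContext: SQLContext, numRows: Int)
    extends BaseRelation with TableScan {
  override def schema: StructType = StructType(StructField("id", IntegerType) :: Nil)
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1 to numRows).map(Row(_))
}

// Usage: sqlContext.read.format("com.example.ones").option("rows", "10").load()
```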
Contributing a Custom Data Source
spark-packages.org
  Managed by Databricks
  Contains links to externally-managed GitHub projects
  Ratings and comments
  Spark version requirements of each package
Examples:
  https://github.com/databricks/spark-csv
  https://github.com/databricks/spark-avro
  https://github.com/databricks/spark-redshift
Partitions, Pruning, Pushdowns
Demo Dataset (from previous Spark After Dark talks)

RATINGS
========
UserID,ProfileID,Rating (1-10)

GENDERS
========
UserID,Gender (M,F,U)

<-- Totally Anonymous -->
Partitions
Partition based on data usage patterns:
  /genders.parquet/gender=M/…
                  /gender=F/…   <-- Use case: access users by gender
                  /gender=U/…
Partition Discovery
  On read, infer partitions from the organization of the data (i.e. gender=F)
Dynamic Partitions
  Upon insert, dynamically create partitions
  Specify the field to use for each partition (i.e. gender)
  SQL: INSERT TABLE genders PARTITION (gender) SELECT …
  DF:  gendersDF.write.format("parquet").partitionBy("gender").save(…)
Pruning
Partition Pruning
  Filter out entire partitions of rows on partitioned data
  SELECT id, gender FROM genders WHERE gender = 'U'
Column Pruning
  Filter out entire columns for all rows if not required
  Extremely useful for columnar storage formats: Parquet, ORC
  SELECT id, gender FROM genders
Pushdowns
Predicate (aka Filter) Pushdowns
  A predicate returns {true, false} for a given function/condition
  Filters rows as deep into the data source as possible
  The data source must implement PrunedFilteredScan (sketched below)
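A hedged sketch of the pushdown hook (MyRelation and scan() are hypothetical). Spark passes only the columns the query needs (pruning) and the predicates it would like evaluated at the source (pushdown); any filter the source ignores is simply re-applied by Spark afterwards.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}

abstract class MyRelation extends BaseRelation with PrunedFilteredScan {
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // Keep only the predicates this source can evaluate natively.
    val pushed = filters.collect { case eq: EqualTo => eq }
    scan(requiredColumns, pushed)
  }

  // Source-specific scan that reads only `columns` and applies `pushed`.
  protected def scan(columns: Array[String], pushed: Seq[EqualTo]): RDD[Row]
}
```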
Cassandra Pushdown Rules
Determines which filter predicates can be pushed down to Cassandra:
1. Only push down non-partition-key column predicates with =, >, <, >=, <=.
2. Only push down primary key column predicates with = or IN.
3. If there are regular columns in the pushdown predicates, they must have at least one EQ expression on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
5. For clustering column predicates, only the last predicate can be a non-EQ predicate (including IN), and the preceding column predicates must be EQ predicates. If there is only one clustering column predicate, it can be any non-IN predicate.
6. No predicates are pushed down if there is any OR condition or NOT IN condition.
7. Multiple predicates on the same column cannot be pushed down if any of them is an equality or IN predicate.

spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala
Native Spark SQL Data Sources
Spark SQL Native Data Sources - Source Code
JSON Data Source
DataFrame
val ratingsDF = sqlContext.read.format("json")
  .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or, using the convenience method --
val ratingsDF = sqlContext.read.json(
  "file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
CREATE TABLE genders USING json
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")
JDBC Data Source
Add the driver to the Spark JVM system classpath:
$ export SPARK_CLASSPATH=<jdbc-driver.jar>
DataFrame
val jdbcConfig = Map(
  "driver"  -> "org.postgresql.Driver",
  "url"     -> "jdbc:postgresql://hostname:port/database",
  "dbtable" -> "schema.tablename")
val df = sqlContext.read.format("jdbc").options(jdbcConfig).load()
SQL
CREATE TABLE genders USING jdbc
OPTIONS (url, dbtable, driver, …)
Parquet Data Source
Configuration
spark.sql.parquet.filterPushdown=true
spark.sql.parquet.mergeSchema=true
spark.sql.parquet.cacheMetadata=true
spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames
val gendersDF = sqlContext.read.format("parquet")
  .load("file:/root/pipeline/datasets/dating/genders.parquet")
gendersDF.write.format("parquet").partitionBy("gender")
  .save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL
CREATE TABLE genders USING parquet
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
ORC Data Source
Configuration
spark.sql.orc.filterPushdown=true
DataFrames
val gendersDF = sqlContext.read.format("orc")
  .load("file:/root/pipeline/datasets/dating/genders")
gendersDF.write.format("orc").partitionBy("gender")
  .save("file:/root/pipeline/datasets/dating/genders")
SQL
CREATE TABLE genders USING orc
OPTIONS (path "file:/root/pipeline/datasets/dating/genders")
Third-Party Data Sources
spark-packages.org
CSV Data Source (Databricks)
Github
https://github.com/databricks/spark-csv
Maven
com.databricks:spark-csv_2.10:1.2.0
Code
val gendersCsvDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
  .toDF("id", "gender")   // toDF() defines the column names
Avro Data Source (Databricks)
Github
https://github.com/databricks/spark-avro
Maven
com.databricks:spark-avro_2.10:2.0.1
Code
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("file:/root/pipeline/datasets/dating/gender.avro")
Redshift Data Source (Databricks)
Github
https://github.com/databricks/spark-redshift
Maven
com.databricks:spark-redshift:0.5.0
Code
val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
  .option("query", "select x, count(*) from my_table group by x")
  .option("tempdir", "s3n://tmpdir")
  .load()
Copies to S3 for fast, parallel reads vs the single Redshift master bottleneck
ElasticSearch Data Source (Elastic.co)
Github
https://github.com/elastic/elasticsearch-hadoop
Maven
org.elasticsearch:elasticsearch-spark_2.10:2.1.0
Code
val esConfig = Map("pushdown" -> "true",
                   "es.nodes" -> "<hostname>",
                   "es.port"  -> "<port>")
df.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Overwrite)
  .options(esConfig)
  .save("<index>/<document>")
Cassandra Data Source (DataStax)
Github
https://github.com/datastax/spark-cassandra-connector
Maven
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
ratingsDF.write.format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("keyspace" -> "dating", "table" -> "ratings"))
  .save()
A read-side sketch follows.
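A hedged read-side sketch (options assumed symmetric with the write above); explain(true) shows whether a filter was pushed down to Cassandra under the rules listed earlier in this deck.

```scala
val ratingsDF = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "dating", "table" -> "ratings"))
  .load()

// Whether this predicate is pushed down depends on the table's key/index layout.
ratingsDF.filter("rating > 8").explain(true)
```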
REST Data Source (Databricks)
Coming Soon!
https://github.com/databricks/spark-rest?
Michael Armbrust, Spark SQL Lead @ Databricks
DynamoDB Data Source (IBM Spark Tech Center)
Coming Soon!
https://github.com/cfregly/spark-dynamodb
Spark SQL Performance Tuning (o.a.s.sql.SQLConf)
spark.sql.inMemoryColumnarStorage.compressed=true
  Automatically selects a column codec based on the data
spark.sql.inMemoryColumnarStorage.batchSize
  Increase as much as possible without OOMing; improves compression and GC
spark.sql.inMemoryPartitionPruning=true
  Enables partition pruning for in-memory partitions
spark.sql.tungsten.enabled=true
  Code gen for CPU and memory optimizations (Tungsten, aka Unsafe mode)
spark.sql.shuffle.partitions
  Increase from the default of 200 for large joins and aggregations
spark.sql.autoBroadcastJoinThreshold
  Increase to tune this cost-based physical-plan optimization
spark.sql.hive.metastorePartitionPruning
  Predicate pushdown into the metastore to prune partitions early
spark.sql.planner.sortMergeJoin
  Prefer sort-merge join (vs hash join) for large joins
spark.sql.sources.partitionDiscovery.enabled & spark.sql.sources.parallelPartitionDiscovery.threshold
  Enable automatic partition discovery when loading data
A sketch of setting these at runtime follows.
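A minimal sketch of setting these properties at runtime (the values are illustrative, not recommendations); they can also be set in spark-defaults.conf or via SQL SET.

```scala
sqlContext.setConf("spark.sql.shuffle.partitions", "400")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=20971520")  // 20 MB
```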
Freg-a-palooza Upcoming World Tour
① New York Strata (Sept 29th – Oct 1st)
② London Spark Meetup (Oct 12th)
③ Scotland Data Science Meetup (Oct 13th)
④ Dublin Spark Meetup (Oct 15th)
⑤ Barcelona Spark Meetup (Oct 20th)
⑥ Madrid Spark Meetup (Oct 22nd)
⑦ Amsterdam Spark Summit (Oct 27th – Oct 29th)
⑧ Delft Dutch Data Science Meetup (Oct 29th)
⑨ Brussels Spark Meetup (Oct 30th)
⑩ Zurich Big Data Developers Meetup (Nov 2nd)
High probability I'll end up in jail
IBM Spark Tech Center is Hiring!
Only Fun, Collaborative People - No Erlichs!
Sign up for our newsletter at spark.tc

Thank You!
Chris Fregly, @cfregly
IBM Spark Technology Center
Power of data. Simplicity of design. Speed of innovation.