Cassandra + Spark (You’ve got the lighter, let’s start a fire)
-
Upload
robert-stupp -
Category
Software
-
view
369 -
download
4
Transcript of Cassandra + Spark (You’ve got the lighter, let’s start a fire)
You’ve got the lighter, let’s start a fire:Spark and CassandraRobert StuppSolutions Architect @ DataStax, Committer to Apache Cassandra@snazy
A journey throughDistributed Computing
© 2016 DataStax, All Rights Reserved. 2
KillrPowr Application
© 2016 DataStax, All Rights Reserved. 3
Show current power consumption in Germany
in real time
• Power consumption of all cities on a map
• Graph of data for one city
One word…
© 2016 DataStax, All Rights Reserved. 4
Apache CassandraDistributed & resilient database
Cassandra Use-Cases
© 2016 DataStax, All Rights Reserved. 6
• Time Series• Personalization• (or a mix of both)
Cassandra?
© 2016 DataStax, All Rights Reserved. 7
• Joining data … hmm• Analyzing the whole data set … hmm• Ad-Hoc Queries (SQL) … hmm
Apache SparkDistributed & resilient analytics
“Spark lets you run programs up to 100x faster in memory,or 10x faster on disk, than Hadoop.”
Spark : Simplify the…
© 2016 DataStax, All Rights Reserved. 9
challenging and compute-intensive task of processinghigh volumes of real-time or archived data,
both structured and unstructured,
seamlessly integrating relevantcomplex capabilities
such as machine learning andgraph algorithms.
Source: http://www.toptal.com/spark/introduction-to-apache-spark
Spark Use Cases
© 2016 DataStax, All Rights Reserved. 10
• Event detection• Behavior / anomalies detection• Fraud & intrusion detection• Targeted advertising• E-commerce transaction processing• Recommendations
Spark Use Cases
© 2016 DataStax, All Rights Reserved. 11
• Data migration• Ad hoc queries• Business Intelligence
Spark
© 2016 DataStax, All Rights Reserved. 12
Spark Architecture
© 2016 DataStax, All Rights Reserved. 13
Spark RDDs
© 2016 DataStax, All Rights Reserved. 14
Resilient Distributed Datasets
• Fault-tolerant collection of elements• Can be operated on in parallel• Immutable• In-Memory (as much as possible)
Spark RDD Partitioning
© 2016 DataStax, All Rights Reserved. 15
Spark RDD Resiliency
© 2016 DataStax, All Rights Reserved. 16
Spark Transformations
© 2016 DataStax, All Rights Reserved. 17
Create a new data set from an existing one
Spark Actions
© 2016 DataStax, All Rights Reserved. 18
Return a value after running a computation
Spark DAG
© 2016 DataStax, All Rights Reserved. 19
Spark Example (word count)
© 2016 DataStax, All Rights Reserved. 20
Given is a text file with movies and their categories.We want to know the occurences of each genre.
Pirates of the Carribbean,Adventure,Action,FantasyStar Trek I,SciFi,Adventure,ActionDie Hard,Action,Thriller
Why Scala
© 2016 DataStax, All Rights Reserved. 21
Scala is a functional language
Analytics is a functional “problem“
So – use a functional language
Spark Example (word count)
© 2016 DataStax, All Rights Reserved. 22
sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)
Spark Example (word count)
© 2016 DataStax, All Rights Reserved. 23
sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)Anonymous Scala Function
Spark Example (word count)
© 2016 DataStax, All Rights Reserved. 24
sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)
Flat the collection
Spark Example (word count)
© 2016 DataStax, All Rights Reserved. 25
sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)Pair RDD
Spark Example (word count)
© 2016 DataStax, All Rights Reserved. 26
sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)
Values of TWO Pair RDDs
Spark Example (word count)
© 2016 DataStax, All Rights Reserved. 27
sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)
Spark Example (word count)
© 2016 DataStax, All Rights Reserved. 28
sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)
Cassandra-Spark-Driver
Spark-Cassandra-Connector
© 2016 DataStax, All Rights Reserved. 30
• Scala, Java• Open Source (ASF2)• Developed by DataStax and community guys
https://github.com/datastax/spark-cassandra-connector
Spark Batch Processing
© 2016 DataStax, All Rights Reserved. 31
Other REALLY useful stuff like
• Re-partition data & processing• Allows efficient use of Cassandra
SQL and Data Frames
Spark SQL
© 2016 DataStax, All Rights Reserved. 33
csc.sql(" SELECT COUNT(*) AS total " +" FROM killr_video.movies_by_actor " +" WHERE actor = 'Johnny Depp' ")
.show
+-----+|total|+-----+| 54|+-----+
Spark Data Frames
© 2016 DataStax, All Rights Reserved. 34
val movies = csc.read.format("org.apache.spark.sql.cassandra").options(Map( "keyspace" -> "killr_video",
"table" -> "movies_by_actor" )).load
movies.filter("actor = 'Johnny Depp'").agg(Map("*" -> "count")).withColumnRenamed("COUNT(1)", "total").show
Same as:
SELECT COUNT(*) AS totalFROM killr_video.movies_by_actorWHERE actor = 'Johnny Depp'
Spark Data Frames
© 2016 DataStax, All Rights Reserved. 35
Data frame is RDD that has named columns
• Has schema• Has rows• Has columns• Has data types
Spark Straming
Spark Streaming
© 2016 DataStax, All Rights Reserved. 37
Continuous Micro Batching
• Collect data of time slides• Compute on that data
Spark Streaming
© 2016 DataStax, All Rights Reserved. 38
Spark Streaming Architecture
Confid© 2016 DataStax, All Rights Reserved.ential
39
Spark Streaming
© 2016 DataStax, All Rights Reserved. 40
Spark Streaming Windows
© 2016 DataStax, All Rights Reserved. 41
time
Slide Slide Slide Slide Slide Slide Slide Slide Slide Slide Slide
00:00:00 00:00:01 00:00:02 00:00:03 00:00:04 00:00:05 00:00:06 00:00:07 00:00:08 00:00:09 00:00:10
Window
Window
Window
Window
Window
Window
Window
Window
Window
Process data for this slide –i.e. 1 second in this example
Process data for this slide –i.e. 1 second in this example
Process data for this slide –i.e. 1 second in this example
Time window ~=Grouping data by time
Apache KafkaDistributed & resilient messaging
Kafka
© 2016 DataStax, All Rights Reserved. 43
Publish-subscribe messagingrethought as a distributed commit log
Kafka
© 2016 DataStax, All Rights Reserved. 44
Kafka ClusterProducer
Producer
Producer
Consumer
Consumer
Kafka Node
Kafka Node
Kafka Node
Kafka Use-Cases
© 2016 DataStax, All Rights Reserved. 45
• Messaging• Website activity tracking• Metrics• Log aggregation• Event sourcing• Stream processing
Common Characteristics
© 2016 DataStax, All Rights Reserved. 46
Kafka, Spark & Cassandra are
• Scalable• Distributed• Partitioned• Replicated
KillrPowr
© 2016 DataStax, All Rights Reserved. 47
Show current power consumption in Germanyin real time
• Power consumption of all cities on a map• Graph of data for a selected city
KillrPowr – Data modeling 101
© 2016 DataStax, All Rights Reserved. 48
1. Collect UI Requirements2. Model data (flow) per web UI request3. Build Spark Streaming code to generate that data4. Connect sensors and Spark Streaming using Kafka
KillrPowr – Tech Stack
© 2016 DataStax, All Rights Reserved. 49
Power Sensor
Power Sensor
Kafka Spark Streaming
Cassandra
Web Site
Queue dataAggregate data
and savefor web UI
KillrPowr – Tech Stack
© 2016 DataStax, All Rights Reserved. 50
Techs used in the demo
• Spray (HTTP for Akka Actors)• AngularJS• Kafka• Spark Streaming• Cassandra
Tips
© 2016 DataStax, All Rights Reserved. 51
• Push down predicates to Cassandra
• Partition your data “right”• Think distributed
Cassandra + Spark in DSE
© 2016 DataStax, All Rights Reserved. 52
DSE integratesSpark + Cassandra
https://academy.datastax.com/courses/getting-started-apache-spark
KillrPowr Live Demo
© 2016 DataStax, All Rights Reserved. 53
You’ve got the lighter, let’s start a fire:Spark and CassandraRobert StuppSolutions Architect @ DataStax, Committer to Apache Cassandra@snazy
Thank You
© 2016 DataStax, All Rights Reserved. 55
BACKUP SLIDES
Confidential 56
BACKUP - DDL
© 2014 DataStax, All Rights Reserved. Company Confidential 57
BACKUP - DDL
© 2014 DataStax, All Rights Reserved. Company Confidential 58
BACKUP - DDL
© 2014 DataStax, All Rights Reserved. Company Confidential 59
BACKUP - DDL
© 2014 DataStax, All Rights Reserved. Company Confidential 60
BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 61
BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 62
BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 63
BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 64
BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 65
BACKUP – Spark Streaming
© 2014 DataStax, All Rights Reserved. Company Confidential 66
© 2014 DataStax, All Rights Reserved. Company Confidential 67