Feeding Cassandra with Spark-Streaming and Kafka
![Page 1: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/1.jpg)
Feeding Cassandra with Spark Streaming & Kafka
Cary Bourgeois
Solutions Engineer, DataStax, Central Region
![Page 2: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/2.jpg)
Who Am I
• DataStax < 2 Years
• Not a “developer”
• Legacy BI/Database
  • Business Objects
  • SAP
• Demo Development
  • R
  • Java (If I have to)
  • Scala (Someday)
2
![Page 3: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/3.jpg)
3
Cassandra Summit 2015 September 22-24, Santa Clara Convention Center
7,000 Attendees
![Page 4: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/4.jpg)
Last Week - Mission Impossible?
A stretch, but possible.
4
![Page 5: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/5.jpg)
Sunday Afternoon - I’m getting my A$$ kicked
5
![Page 6: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/6.jpg)
Monday Afternoon - Arghhhhh!
6
![Page 7: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/7.jpg)
Monday Night - I got this!
7
![Page 8: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/8.jpg)
8
Capture Raw Data
Analyze & ∑ummarize
![Page 9: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/9.jpg)
Why Mess with Success?
• Spark 1.3+
  • New/Improved Kafka Support
  • DataFrames
• DataStax Enterprise 4.8
  • Spark 1.4 support
9
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
![Page 10: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/10.jpg)
Why Mess with Success?
• Spark 1.3+
  • New/Improved Kafka Support
  • DataFrames
• DataStax Enterprise 4.8
  • Spark 1.4 support
10
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
![Page 11: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/11.jpg)
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
Fast
A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
Scalable
Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers.
Durable
Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Distributed by Design
Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
11
![Page 12: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/12.jpg)
• Producers
• Consumers
• Persistence
• Topics
• Partitions
• Replication
12
http://kafka.apache.org/documentation.html
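To make the vocabulary above concrete, here is a toy in-memory model of a topic split into partitions, each an append-only log read by offset. It is only a sketch: `ToyTopic`, `send`, and `read` are hypothetical names, the key-hash routing is a simplifying assumption, and nothing here is Kafka's actual API or implementation. Replication and persistence to disk are not modeled.

```scala
import scala.collection.mutable

// Toy model for illustration only; NOT Kafka's implementation or API.
object ToyTopic {
  final class Topic(val name: String, numPartitions: Int) {
    require(numPartitions > 0, "a topic needs at least one partition")

    // Each partition is an append-only log; a message's offset is its index.
    private val partitions =
      Vector.fill(numPartitions)(mutable.ArrayBuffer.empty[String])

    // Assumption: a keyed message is routed by hash(key) mod partition count.
    def partitionFor(key: String): Int =
      Math.floorMod(key.hashCode, numPartitions)

    // Producer side: append the value and return (partition, offset).
    def send(key: String, value: String): (Int, Long) = {
      val p = partitionFor(key)
      partitions(p) += value
      (p, partitions(p).size - 1L)
    }

    // Consumer side: read a partition from a given offset onward.
    def read(partition: Int, fromOffset: Long): Seq[String] =
      partitions(partition).drop(fromOffset.toInt).toList
  }
}
```

Messages with the same key land in the same partition and keep their order there, which is why a consumer only needs to remember one offset per partition.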
![Page 13: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/13.jpg)
• Create a Kafka topic
  bin/kafka-topics.sh --zookeeper localhost:2181 --create --replication-factor 1 --partitions 1 --topic stream_ts
• List all topics
  bin/kafka-topics.sh --zookeeper localhost:2181 --list
• Monitor a topic
  bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic stream_ts --from-beginning
13
![Page 14: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/14.jpg)
Confidential
Kafka and the Producer
14
![Page 15: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/15.jpg)
The Producer App
• Lots of Options
• I chose
  • Scala
  • Not steep enough
  • Akka
• Producing this message
15
Edge 1;1;401843;2015-11-04 06:23:49.001;64.44286233060423;82.79653847181152
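The sample message looks like six semicolon-separated fields. A minimal sketch of encoding and decoding that format follows. The field names are an assumption borrowed from the columns of the demo.data table shown later in the deck, and `SensorMessage`, `Reading`, `encode`, and `decode` are hypothetical names, not taken from the demo code.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object SensorMessage {
  // Assumed field order: edge_id;sensor;epoch_hr;ts;depth;value
  final case class Reading(edgeId: String, sensor: String, epochHr: String,
                           ts: LocalDateTime, depth: Double, value: Double)

  // Timestamp layout as it appears in the sample message.
  private val tsFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

  // Producer side: join the fields with ';'.
  def encode(r: Reading): String =
    Seq(r.edgeId, r.sensor, r.epochHr, r.ts.format(tsFormat),
        r.depth.toString, r.value.toString).mkString(";")

  // Consumer side: None when the line is malformed instead of throwing.
  def decode(line: String): Option[Reading] = line.split(";") match {
    case Array(e, s, h, t, d, v) =>
      Try(Reading(e, s, h, LocalDateTime.parse(t, tsFormat),
                  d.toDouble, v.toDouble)).toOption
    case _ => None
  }
}
```

Returning an `Option` from `decode` lets a stream job drop bad records without failing the whole batch.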
![Page 16: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/16.jpg)
Destination - Cassandra Tables
16
CREATE TABLE demo.data (
  edge_id text,
  sensor text,
  epoch_hr text,
  ts timestamp,
  depth double,
  value double,
  PRIMARY KEY ((edge_id, sensor, epoch_hr), ts)
);

CREATE TABLE demo.last (
  edge_id text,
  sensor text,
  ts timestamp,
  depth double,
  value double,
  PRIMARY KEY ((edge_id, sensor))
);

CREATE TABLE demo.count (
  pk int,
  ts timestamp,
  count bigint,
  count_ma double,
  PRIMARY KEY (pk, ts)
);
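The epoch_hr column in the partition key of demo.data looks like an hourly time bucket that keeps each partition to one hour of readings per sensor. The sketch below assumes the bucket is the number of whole hours since the Unix epoch; that reading is an inference from the sample message, not something the slides state, and `EpochHour.bucket` is a hypothetical helper.

```scala
import java.time.{LocalDateTime, ZoneOffset}

object EpochHour {
  // Assumption: epoch_hr = whole hours since 1970-01-01T00:00Z, stored as text.
  // The offset of the producer's clock is not given in the slides, so it is a parameter.
  def bucket(ts: LocalDateTime, offset: ZoneOffset = ZoneOffset.UTC): String =
    (ts.toEpochSecond(offset) / 3600).toString
}
```

Under this assumption, the sample timestamp 2015-11-04 06:23:49.001 falls in bucket 401838 when read as UTC and in bucket 401843 when read with an offset of UTC-5, which matches the value in the sample message.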
![Page 17: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/17.jpg)
DSE Analytics => Spark
• No ETL
• Spark 1.4.1 certification
• Simplified map and reduce
• Very developer friendly
• SparkSQL
• Spark Streaming
• Machine Learning
• DSE Analytics and Search Integration
• Cassandra benefits (scaling, availability)
17
“I want to do processing on data before it hits Cassandra.”
“I need my sums, avgs, group by’s, etc.”
“I want to run real-time analytics on my Cassandra data.”
![Page 18: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/18.jpg)
Processing the Stream
• Simple Scala Job
• Deal with the raw flow
  • Capture the raw data
  • Capture the latest sensor reading
• Summarize and Analyze
  • Windowing the Stream
  • Count records every x seconds
  • Calculate a moving average of the x-second counts over a number of periods
18
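The windowing step can be illustrated without Spark. The sketch below uses plain Scala collections, not the Spark Streaming API, and is not the demo's code: it counts events per fixed window of `windowMs` milliseconds and averages the counts over the last `periods` windows. `WindowedCounts`, `WindowCount`, and `summarize` are hypothetical names. Averaging over fewer windows at the start of the stream, and counting empty windows as zero, are assumptions.

```scala
object WindowedCounts {
  final case class WindowCount(windowStartMs: Long, count: Long, movingAvg: Double)

  // eventTimesMs: event timestamps in milliseconds
  // windowMs:     length of each counting window (the "x seconds")
  // periods:      number of most recent windows in the moving average
  def summarize(eventTimesMs: Seq[Long], windowMs: Long, periods: Int): Seq[WindowCount] = {
    require(windowMs > 0 && periods > 0, "windowMs and periods must be positive")
    if (eventTimesMs.isEmpty) Seq.empty
    else {
      // Assign each event to the start of its window and count per window.
      val counts: Map[Long, Long] = eventTimesMs
        .groupBy(t => Math.floorDiv(t, windowMs) * windowMs)
        .map { case (start, events) => start -> events.size.toLong }

      // Include empty windows so the average covers every period.
      val starts = counts.keys.min to counts.keys.max by windowMs
      val series = starts.map(s => s -> counts.getOrElse(s, 0L))

      series.indices.map { i =>
        // Early windows average over however many windows exist so far.
        val recent = series.slice(math.max(0, i - periods + 1), i + 1).map(_._2)
        WindowCount(series(i)._1, series(i)._2, recent.sum.toDouble / recent.size)
      }
    }
  }
}
```

Each result row carries a window start, a count, and a moving average, which lines up with the ts, count, and count_ma columns of demo.count.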
![Page 19: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/19.jpg)
Full Demo
19
![Page 20: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/20.jpg)
Next Steps
• SparkR
• MLlib workflows
• Notebooks
  • Spark
  • Jupyter
20
![Page 21: Feeding Cassandra with Spark-Streaming and Kafka](https://reader031.fdocuments.in/reader031/viewer/2022020213/58eca19a1a28ab072a8b4601/html5/thumbnails/21.jpg)
If you would like the code:
21
https://github.com/CaryBourgeois/KafkaSparkCassandraDemo