Spark and MapR Streams: A Motivating Example
© 2017 MapR Technologies 1
Spark and MapR Streams: A Motivating Example
© 2017 MapR Technologies 2
Abstract
• Businesses are discovering the untapped potential of large datasets and data streams through the use of technologies for big data processing and storage. By leveraging these assets they're creating a new generation of applications that derive value from data they used to throw away.
• In this presentation we'll discuss how to build operational environments for these types of applications with the MapR Converged Data Platform, and we'll walk through an example of a next-generation application that uses Java APIs for MapR Streams, Apache Spark, Apache Hive, and MapR-DB.
• We'll see how these technologies can be used to join and transform unbounded datasets to find signals and derive new data streams for a financial scenario involving real-time algorithmic trading and historical analysis using SQL.
• We'll also discuss how MapR enables you to run real-time data applications with the speed, reliability, and security you need for a production environment.
• Keywords: MapR, Spark, Kafka, NoSQL, JSON, Zeppelin, Hive, streaming
© 2017 MapR Technologies 3
Contact Info
Ian Downard, Technical Evangelist at MapR
Email: [email protected]
Personal Blog: http://bigendiandata.com
Twitter: @iandownard
© 2017 MapR Technologies 4
Learning Goals
1. Appreciate the opportunity of the time we're in.
2. Become familiar with MapR.
3. Become familiar with Spark.
4. Feel empowered.
© 2017 MapR Technologies 5
Why Now?
• Moore's law has applied for a long time.
• Why is data exploding now?
• Why not 10 years ago?
• Why not 20?
© 2017 MapR Technologies 6
Because data wasn’t available?• If it were just availability of data then existing big companies
would adopt big data technology first
© 2017 MapR Technologies 7
Because data wasn’t available?• If it were just availability of data then existing big companies
would adopt big data technology first
They didn’t
© 2017 MapR Technologies 8
Because processing it was too expensive?
• If it were just a matter of net positive value, then finance companies would have adopted first, because they have a higher opportunity value per byte.
• They didn't.
© 2017 MapR Technologies 10
Backwards adoption
• Under almost any argument, startups would not have adopted big data technology first.
• They did.
© 2017 MapR Technologies 12
Everywhere at Once?
• Something very strange is happening:
– Big data is being applied at many different scales
– By large companies and small
• Why?
© 2017 MapR Technologies 14
Data Analytics Scaling Laws
• Analytics scaling is all about:
– Big gains for little initial effort
– Rapidly diminishing returns
• The key to net value is how costs scale:
– Old school: exponential scaling
– Big data: linear scaling, low constant
• Cost/performance has radically changed:
– Cluster computing, commodity hardware, data science frameworks…
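The cost-scaling argument above can be made concrete with a toy model. Every curve shape and constant below is an illustrative assumption, not a MapR figure: data gets a diminishing-returns value curve, and we compare net value under "old school" exponential cost scaling versus linear cost scaling with a low constant.

```java
// Toy model of the scaling argument (illustrative assumptions only).
// Value of data shows diminishing returns; cost is either exponential
// ("old school") or linear with a low constant ("big data").
public class ScalingModel {

    // Diminishing-returns value curve: rises quickly, then flattens toward 1.
    static double value(double scale) {
        return scale / (scale + 100.0);
    }

    // "Old school" cost: exponential in scale.
    static double oldCost(double scale) {
        return 0.001 * Math.exp(scale / 150.0);
    }

    // "Big data" cost: linear with a low constant.
    static double newCost(double scale) {
        return 0.0002 * scale;
    }

    static double netOld(double scale) { return value(scale) - oldCost(scale); }
    static double netNew(double scale) { return value(scale) - newCost(scale); }

    public static void main(String[] args) {
        for (double s : new double[]{100, 500, 1000, 2000}) {
            System.out.printf("scale=%5.0f  netOld=%+.3f  netNew=%+.3f%n",
                    s, netOld(s), netNew(s));
        }
    }
}
```

Under exponential costs the net value peaks early and then collapses; under linear costs it keeps growing with scale, which is the tipping point the charts that follow describe.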
© 2017 MapR Technologies 15
[Chart: Value vs. Scale (0–2,000). Most data isn't worth much in isolation: the first data is valuable, later data is dregs.]
© 2017 MapR Technologies 16
[Chart: Value vs. Scale. The first data is valuable and later data is dregs, but the later data has high aggregate value: suddenly it's worth processing.]
© 2017 MapR Technologies 17
[Chart: Value vs. Scale. The aggregate value is really big, if we can handle the scale.]
© 2017 MapR Technologies 18
So what makes that possible?
© 2017 MapR Technologies 20
[Chart: Value vs. Scale. Net value optimum has a sharp peak well before maximum effort.]
© 2017 MapR Technologies 21
[Chart: Value vs. Scale. But scaling laws are changing both slope and shape.]
© 2017 MapR Technologies 22
[Chart: Value vs. Scale. More than just a little.]
© 2017 MapR Technologies 23
[Chart: Value vs. Scale. They are changing a LOT!]
© 2017 MapR Technologies 28
[Chart: Value vs. Scale. Initially, linear cost scaling actually makes things worse; then a tipping point is reached and things change radically…]
© 2017 MapR Technologies 30
MapR Overview
© 2017 MapR Technologies 31
How do you persist data?
© 2017 MapR Technologies 32
All major persistence abstractions are one of these: Files, Tables, and Streams.
© 2017 MapR Technologies 33
"Classic" streaming involves single-purpose clusters:
[Diagram: source data → stream processing & storage (Kafka clusters + Spark) → final output storage (Cassandra / MongoDB, HDFS)]
© 2017 MapR Technologies 34
MapR converges the data layer into a single cluster:
[Diagram: source data → stream processing & storage (MapR Streams + Spark) → final output storage (MapR-DB, MapR-FS)]
© 2017 MapR Technologies 35
What is MapR?
A Converged Data Platform
© 2017 MapR Technologies 36
MapR Converged Data Platform
• Open Source Engines & Tools; Commercial Engines & Applications; Custom Apps; Cloud and Managed Services
• Enterprise-Grade Platform Services: High Availability, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace
• Data processing: MapR-FS (web-scale storage), MapR-DB (database), MapR Streams (event streaming), search and others
• Unified Management and Monitoring
• Standard APIs: HDFS API, POSIX/NFS, HBase API, JSON API, Kafka API
© 2017 MapR Technologies 37
"Convergence" means…
• One cluster that does it all: Files + Tables + Streams
• Standard APIs for everything
• A distributed file system that looks "normal" (POSIX)
• Unified Management
• Global Namespace
• Mirroring, Replication, and Snapshots
– Synchronize files, tables, and streams across datacenters
– True failover for your applications
© 2017 MapR Technologies 38
How do I use MapR?
• Installs on Linux (e.g. Ubuntu, Red Hat), typically to a block device, and typically to a cluster of 3 or more nodes.
• Packaged as a scriptable / web-based installer, cloud marketplace offers, and Docker containers.
• Sandbox VMs for your laptop.
© 2017 MapR Technologies 39
MapR In Action
© 2017 MapR Technologies 40
Apply MapR as a data layer for containers.
[Diagram: Browser, HTTP Log, Producer container, Servlet Engine container]
© 2017 MapR Technologies 41
Procedure
1. Download Sandbox
– Configure for Host-only Adapter
2. Download GitHub repo
3. Compile code
4. Build Docker images
5. Create the MapR Stream topics
6. Run the Docker containers
© 2017 MapR Technologies 43
Docker / MapR demo commands
1. git clone https://github.com/mapr-demos/mapr-pacc-sample
2. maprcli stream create -path /apps/sensors -produceperm p -consumeperm p -topicperm p
3. maprcli stream topic create -path /apps/sensors -topic computer
4. /opt/mapr/kafka/kafka-0.9.0/bin/kafka-console-consumer.sh --new-consumer --bootstrap-server this.will.be.ignored:9092 --topic /apps/sensors:computer
5. docker run -it -e MAPR_CLDB_HOSTS=192.168.99.3 -e MAPR_CLUSTER=demo.cluster.com -e MAPR_CONTAINER_USER=mapr --name producer -i -t mapr-sensor-producer
6. docker run -it --privileged --cap-add SYS_ADMIN --cap-add SYS_RESOURCE --device /dev/fuse -e MAPR_CLDB_HOSTS=192.168.99.3 -e MAPR_CLUSTER=demo.cluster.com -e MAPR_CONTAINER_USER=mapr -e MAPR_MOUNT_PATH=/mapr -p 8080:8080 --name web -i -t mapr-web-consumer
7. Open http://localhost:8080
8. Open http://192.168.99.3:8443
© 2017 MapR Technologies 44
References
• MapR Sandbox: http://maprdocs.mapr.com/home/SandboxHadoop/t_install_sandbox_vbox.html
• MapR sample application: https://mapr.com/blog/getting-started-mapr-client-container/
• MapR Tutorials: https://mapr.com/developercentral/code/
© 2017 MapR Technologies 45
Apache Spark
© 2017 MapR Technologies 46
https://databricks.com/spark/about
© 2017 MapR Technologies 47
Resilient Distributed Datasets (RDDs)
• RDDs let programmers perform in-memory computations on large distributed datasets in a fault-tolerant manner.
• An RDD is a representation of data that may or may not be on your local machine; it's partitioned across the cluster (like a distributed Java Collection).
• An RDD is immutable:
– JavaRDD<String> lines = sc.textFile("/path/to/data.log");
• When you read data, nothing gets loaded; you're not even opening the file. You first declare the operations you're going to perform, and the data is only loaded and operated upon when you perform an action that materializes it.
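You don't need a Spark cluster to see this declare-then-materialize pattern; Java's own Stream API is lazy in the same way, so a plain-Java sketch (an analogy, not Spark code) can illustrate it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Plain-Java analogy for Spark's lazy evaluation: intermediate
// operations (filter/map) only *describe* work; nothing runs until a
// terminal operation (count) materializes the result, just as Spark
// transformations wait for an action.
public class LazyDemo {

    static final List<String> touched = new ArrayList<>();

    static long countErrors(List<String> lines) {
        Stream<String> errors = lines.stream()
                .peek(touched::add)               // records when a line is actually read
                .filter(l -> l.contains("ERROR")) // declared, not yet executed
                .map(String::toUpperCase);        // declared, not yet executed

        // Nothing has been touched yet: the pipeline is only a plan.
        if (!touched.isEmpty()) throw new IllegalStateException("eager!");

        return errors.count();                    // terminal op: now the data flows
    }

    public static void main(String[] args) {
        long n = countErrors(List.of("ERROR disk", "INFO ok", "ERROR net"));
        System.out.println(n);              // prints 2
        System.out.println(touched.size()); // all 3 lines were read during count()
    }
}
```

Replace `stream()` with an RDD and `count()` with a Spark action and the mental model carries over: declaring the pipeline is cheap; the action pays for it.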
© 2017 MapR Technologies 48
Resilient Distributed Datasets (RDDs)
1. Start by reading from files, a DB, etc. to create a top-level RDD.
2. Lazy transformations: .filter(), .map(), .sample(), and other operations (some of which shuffle data).
3. Actions (retrieval of the data) trigger execution and pull results into the JVM: .count(), .collect(), .saveToCassandra().
4. Once you have an RDD you'd like to keep working with, call .cache() on it so you don't have to derive it again. By default, cache() stores the RDD in memory; use persist() with a different storage level to spill to disk.
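The idea behind .cache() — compute an expensive derivation once, then let later "actions" reuse it — can be sketched in plain Java (an analogy, not the Spark API) with a memoizing supplier:

```java
import java.util.function.Supplier;

// Plain-Java analogy for RDD.cache(): wrap an expensive computation so
// it runs at most once; subsequent calls reuse the stored result
// instead of re-deriving it.
public class CacheDemo {

    static <T> Supplier<T> cached(Supplier<T> expensive) {
        return new Supplier<T>() {
            private T value;
            private boolean computed = false;
            @Override public synchronized T get() {
                if (!computed) { value = expensive.get(); computed = true; }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        int[] derivations = {0}; // counts how often the expensive work ran
        Supplier<Integer> derived = cached(() -> { derivations[0]++; return 40 + 2; });
        System.out.println(derived.get());  // first "action": computes, prints 42
        System.out.println(derived.get());  // second "action": reuses, prints 42
        System.out.println(derivations[0]); // prints 1
    }
}
```

Without the wrapper, every action re-runs the whole lineage, which is exactly what re-deriving an uncached RDD costs in Spark.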
© 2017 MapR Technologies 49
Resilient Distributed Datasets (RDDs)
• The RDD is the building block of Spark.
– DataFrame, Dataset, DStream, etc. are all abstractions over RDDs.
• Immutable.
• Operated on by lambda functions.
• Lazily evaluated.
• Kick off parallel execution with actions like collect(), count(), etc.
© 2017 MapR Technologies 50
What is Spark Streaming?
• Enables scalable, high-throughput, fault-tolerant stream processing of live data.
• Run continuous SQL queries on data pushed into Kafka.
[Diagram: data sources → Spark Streaming → data sinks]
© 2017 MapR Technologies 51
[Diagram: a "tail -f" style source feeds MapR Streams, which store and expose stream data for processing, ending in an output action.]
© 2017 MapR Technologies 52
Spark Streaming Architecture
• Processed results are pushed out in batches.
[Diagram: an input data stream enters Spark Streaming, which slices it by batch interval into "data from time 0 to 1", "1 to 2", "2 to 3", yielding RDD @ time 1, 2, 3; Spark then emits batches of processed results.]
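The micro-batch model in the diagram can be sketched in plain Java (a toy model, not the DStream API): timestamped events are bucketed by batch interval, and each bucket is processed as one unit, the way Spark Streaming turns "data from time t to t+1" into one RDD.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of Spark Streaming's micro-batching: each event carries a
// timestamp; events are bucketed by batch interval, and each bucket is
// processed as one unit (here: counted), like one RDD per interval.
public class MicroBatcher {

    record Event(long timeMs, String payload) {}

    // Bucket events into consecutive batch intervals and count each batch.
    static Map<Long, Long> countPerBatch(List<Event> events, long batchIntervalMs) {
        Map<Long, Long> counts = new TreeMap<>();
        for (Event e : events) {
            long batchStart = (e.timeMs / batchIntervalMs) * batchIntervalMs;
            counts.merge(batchStart, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event(100, "a"), new Event(900, "b"),    // batch [0, 1000)
                new Event(1500, "c"),                        // batch [1000, 2000)
                new Event(2100, "d"), new Event(2999, "e")); // batch [2000, 3000)
        System.out.println(countPerBatch(events, 1000));
        // {0=2, 1000=1, 2000=2}
    }
}
```

In the real system the per-batch computation is a full Spark job over that interval's RDD rather than a simple count, but the slicing-by-interval logic is the same.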
© 2017 MapR Technologies 53
Spark In Action
© 2017 MapR Technologies 54
Spark In Action• Spark Shell• Spark SQL in Zeppelin• Spark SQL Databricks Notebook• Spark Streaming Java API• Debugging Spark with IntelliJ
© 2017 MapR Technologies 55
Databricks Cloud
• Spark notebook in the cloud:
– https://community.cloud.databricks.com/
• Sample notebooks:
– https://databricks.com/resources/type/example-notebooks
© 2017 MapR Technologies 56
Spark Shell (aka REPL)
• If you install Spark locally, you get this.
• Evaluates commands immediately as you type them and shows you the output.
• A fantastic way to experiment, with tab completion.
© 2017 MapR Technologies 57
Apache Zeppelin
© 2017 MapR Technologies 58
Debugging Spark with IntelliJ
export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000"
© 2017 MapR Technologies 59
Monitoring
http://[hostname]:4040/jobs/
© 2017 MapR Technologies 60
Spark Streaming + ML on MapR
• Predict the location and time of taxi requests.
– https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1
– https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2
[Diagram: a streaming topic of taxi-request locations and times feeds classification models (Spark ML, built with k-means clustering on the Uber dataset); outputs are predicted and actual pickup locations and times, plus ridership analytics in Zeppelin.]
© 2017 MapR Technologies 61
Streaming + ML demo procedure
1. Create topics:
maprcli stream create -path /user/mapr/stream -produceperm p -consumeperm p -topicperm p
maprcli stream topic create -path /user/mapr/stream -topic ubers -partitions 3
maprcli stream topic create -path /user/mapr/stream -topic uberp -partitions 3
2. Create and save the k-means model to /mapr/my.cluster.com/user/mapr/data/savemodel:
/opt/mapr/spark/spark-2.0.1/bin/spark-submit --class com.sparkml.uber.ClusterUber --master local[2] /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0.jar
3. Send the test dataset to a stream (just to illustrate using a stream):
java -cp /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0.jar:`mapr classpath` com.streamskafka.uber.MsgProducer /user/mapr/stream:ubers /mapr/my.cluster.com/user/mapr/data/uber.csv
4. Monitor the test dataset (optional, on nodeb):
java -cp /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0.jar:`mapr classpath` com.streamskafka.uber.MsgConsumer /user/mapr/stream:ubers
5. Use the model to predict the cluster for incoming taxi telemetry, and output predictions to a topic:
/opt/mapr/spark/spark-2.0.1/bin/spark-submit --class com.sparkkafka.uber.SparkKafkaConsumerProducer --master local[2] /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar /user/mapr/data/savemodel /user/mapr/stream:ubers /user/mapr/stream:uberp
6. Read the predictions topic and put it into a format that we can ad hoc analyze in SQL:
/opt/mapr/spark/spark-2.0.1/bin/spark-submit --class com.sparkkafka.uber.SparkKafkaConsumer --master local[2] /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar /user/mapr/stream:uberp
7. Open http://nodea:4040
© 2017 MapR Technologies 62
Real-Time Stock Market Analysis
https://mapr.com/appblueprint
https://github.com/mapr-demos/finserv-application-blueprint
© 2017 MapR Technologies 63
Advanced Concept: Look Back for n Seconds on a Topic
• Maintain two topics: a Data Topic and an Offset Topic.
• Offset Topic: key = time t, value = offset of the Data Topic at t.
Time:   t₀    t₁    t₂    t₃    t₄    t₅
Offset: 3253  3347  3467  3608  3798  3913
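A minimal sketch of the technique, in plain Java rather than the Streams API (the TreeMap stands in for the offset topic, and the class and method names are hypothetical): periodically record (time → offset) pairs; to replay the last n seconds, look up the newest recorded time at or before now − n and seek the consumer to that offset.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the look-back technique: an "offset topic" maps timestamps
// to data-topic offsets. To look back n seconds, find the latest entry
// at or before (now - n) and resume reading from its offset. In a real
// deployment the TreeMap would be a (possibly compacted) topic and the
// returned offset would be passed to the consumer's seek call.
public class OffsetIndex {

    private final TreeMap<Long, Long> timeToOffset = new TreeMap<>();

    // Called periodically by a producer-side indexer.
    public void record(long timeMs, long offset) {
        timeToOffset.put(timeMs, offset);
    }

    // Offset to seek to in order to replay the last n seconds.
    public long lookBack(long nowMs, long nSeconds) {
        Map.Entry<Long, Long> e = timeToOffset.floorEntry(nowMs - nSeconds * 1000);
        if (e == null) {
            // Asked further back than the index covers: start at the oldest entry.
            return timeToOffset.firstEntry().getValue();
        }
        return e.getValue();
    }

    public static void main(String[] args) {
        OffsetIndex idx = new OffsetIndex();
        long[] times   = {0, 1000, 2000, 3000, 4000, 5000};    // t0..t5
        long[] offsets = {3253, 3347, 3467, 3608, 3798, 3913}; // from the slide
        for (int i = 0; i < times.length; i++) idx.record(times[i], offsets[i]);

        // At t = 5s, look back 2 seconds -> resume from the offset recorded at t3.
        System.out.println(idx.lookBack(5000, 2)); // prints 3608
    }
}
```

The floor lookup is why the index is keyed by time: any look-back window lands on the newest checkpoint that does not overshoot it.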
© 2017 MapR Technologies 64
MapR Streams vs Kafka
© 2017 MapR Technologies 65
Call To Action
© 2017 MapR Technologies 66
Call To Action
• You can foster innovation just by making data available.
• Seeking career advancement?
– Take Coursera classes on data science, ML, Spark, etc.
– Be a polyglot.
– Enable data science from development to production.
• You can apply those skills in ANY industry.
• Don't be afraid of not knowing much.
• 87% of career builders attribute career benefit to completing online courses (Harvard Business Review, Coursera).
– Be better equipped for your current job, find a new job, or change careers.
© 2017 MapR Technologies 67
All Industries
• ETL / DW optimization
• Mainframe optimization
• Real-time application & network monitoring
• Security information & event management
Web 2.0
• Recommendation engines & targeting
• Customer 360
• Click-stream analysis
• Social media analysis
• Ad optimization
Healthcare
• Patient system of record
• Smart hospitals
• Biometrics
• Patient vital monitoring
• Fraud detection
Telecom
• Crowd-based antenna optimization
• Charging & billing
• Equipment monitoring & preventative maintenance
• Smart meter analysis
Oil & Gas
• Pump monitoring & alerting
• Seismic trace identification
• Equipment maintenance
• Safety & environment
• Security
Financial Services
• Real-time fraud/risk monitoring
• Mobile notifications of transactions
Retail
• Real-time supply chain optimization
• Customer location optimization
• Real-time coupons
Ad Tech
• Ad targeting & optimization
• Global campaign dashboards
Have an interesting use case? Let's talk!