Introduction to streaming and messaging flume,kafka,SQS,kinesis
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
-
Upload
chris-fregly -
Category
Documents
-
view
1.319 -
download
5
description
Transcript of East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
![Page 1: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/1.jpg)
Spark and Friends:Spark Streaming,
Machine Learning, Graph Processing,
Lambda Architecture, Approximations
Chris Fregly
East Bay Java User GroupOct 2014
Kinesis
Streaming
![Page 2: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/2.jpg)
Who am I?Former Netflix’er
(netflix.github.io)
Spark Contributor
(github.com/apache/spark)
Founder
(fluxcapacitor.com)
Author
(effectivespark.com,
sparkinaction.com)
![Page 3: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/3.jpg)
Spark Streaming-Kinesis Jira
![Page 4: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/4.jpg)
Not-so-Quick Poll• Spark, Spark Streaming?
• Hadoop, Hive, Pig?
• Parquet, Avro, RCFile, ORCFile?
• EMR, DynamoDB, Redshift?
• Flume, Kafka, Kinesis, Storm?
• Lambda Architecture?
• PageRank, Collaborative Filtering, Recommendations?
• Probabilistic Data Structs, Bloom Filters, HyperLogLog?
![Page 5: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/5.jpg)
Quick Poll
Raiders or 49’ers?
![Page 6: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/6.jpg)
“Streaming”
Kinesis
Streaming
Video
Streaming
Piping
Big Data
Streaming
![Page 7: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/7.jpg)
Agenda• Spark, Spark Streaming
• Use Cases
• API and Libraries
• Machine Learning
• Graph Processing
• Execution Model
• Fault Tolerance
• Cluster Deployment
• Monitoring
• Scaling and Tuning
• Lambda Architecture
• Probabilistic Data Structures
• Approximations
![Page 8: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/8.jpg)
Timeline of Spark Evolution
![Page 9: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/9.jpg)
Spark and Berkeley AMP Lab
Berkeley Data Analytics Stack (BDAS)
![Page 10: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/10.jpg)
Spark Overview
• Based on 2007 Microsoft Dryad paper
• Written in Scala
• Supports Java, Python, SQL, and R
• Data fits in memory when possible, but not
required
• Improved efficiency over MapReduce
– 100x in-memory, 2-10x on-disk
• Compatible with Hadoop
– File formats, SerDes, and UDFs
![Page 11: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/11.jpg)
Spark Use Cases
• Ad hoc, exploratory, interactive analytics
• Real-time + Batch Analytics
– Lambda Architecture
• Real-time Machine Learning
• Real-time Graph Processing
• Approximate, Time-bound Queries
![Page 12: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/12.jpg)
Explosion of Specialized Systems
![Page 13: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/13.jpg)
Unified High-level Spark Libraries
• Spark SQL (Data Processing)
• Spark Streaming (Streaming)
• MLlib (Machine Learning)
• GraphX (Graph Processing)
• BlinkDB (Approximate Queries)
• Statistics (Correlations, Sampling, etc)
• Others
– Shark (Hive on Spark)
– Spork (Pig on Spark)
![Page 14: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/14.jpg)
Benefits of Unified Libraries
• Advancements in higher-level libraries are
pushed down into core and vice-versa
• Examples
– Spark Streaming
• GC and memory management improvements
– Spark GraphX
• IndexedRDD for random, hashed-based access within
a partition versus scanning entire partition
– Spark Core
• Sort-based Shuffle
![Page 15: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/15.jpg)
Resilient Distributed Dataset
(RDD)
![Page 16: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/16.jpg)
RDD Overview
• Core Spark abstraction
• Represents partitions
across the cluster nodes
• Enables parallel processing
on data sets
• Partitions can be in-memory
or on-disk
• Immutable, recomputable,
fault tolerant
• Contains transformation history (“lineage”) for
whole data set
![Page 17: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/17.jpg)
Types of RDDs
• Spark Out-of-the-Box
– HadoopRDD
– FilteredRDD
– MappedRDD
– PairRDD
– ShuffledRDD
– UnionRDD
– DoubleRDD
– JdbcRDD
– JsonRDD
– SchemaRDD
– VertexRDD
– EdgeRDD
• External
– CassandraRDD (DataStax)
– GeoRDD (Esri)
– EsSpark (ElasticSearch)
![Page 18: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/18.jpg)
RDD Partitions
![Page 19: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/19.jpg)
RDD Lineage Examples
![Page 20: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/20.jpg)
Demo!
• Load user data
• Load gender data
• Join user data
with gender data
• Analyze lineage
![Page 21: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/21.jpg)
Join Optimizations• When joining large dataset with small dataset (reference
data)
• Broadcast small dataset to each node/partition of large
dataset (one broadcast per node)
![Page 22: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/22.jpg)
Spark API
![Page 23: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/23.jpg)
Spark API Overview
• Richer, more expressive than MapReduce
• Native support for Java, Scala, Python,
SQL, and R (mostly)
• Unified API across all libraries
• Operations
– Transformations (lazy evaluation)
– Actions (execute transformations)
![Page 24: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/24.jpg)
Transformations
![Page 25: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/25.jpg)
Actions
![Page 26: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/26.jpg)
Spark Execution Model
![Page 27: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/27.jpg)
Job Scheduling
• Job– Contains many stages
• Contains many tasks
• FIFO– Long-running jobs may starve resources for other
jobs
• Fair– Round-robin to prevent resource starvation
• Fair Scheduler Pools– High-priority pool for important jobs
– Separate user pools
– FIFO within the pool
– Modeled after Hadoop Fair Scheduler
![Page 28: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/28.jpg)
Spark Execution Model Overview
• Parallel, distributed
• DAG-based
• Lazy evaluation
• Allows optimizations– Reduce disk I/O
– Reduce shuffle I/O
– Single pass through dataset
– Parallel execution
– Task pipelining
• Data locality and rack awareness
• Worker node fault tolerance using RDD lineage graphs per partition
![Page 29: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/29.jpg)
Execution Optimizations
![Page 30: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/30.jpg)
Task Pipelining
![Page 31: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/31.jpg)
Spark Cluster Deployment
![Page 32: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/32.jpg)
Types of Cluster Deployments
• Spark Standalone (default)
• YARN
– Allows hierarchies of resources
– Kerberos integration
– Multiple workloads from disparate execution frameworks
• Hive, Pig, Spark, MapReduce, Cascading, etc
• Mesos
– Coarse-grained
• Single, long-running Mesos tasks runs Spark mini tasks
– Fine-grained
• New Mesos task for each Spark task
• Higher overhead
• Not good for long-running Spark jobs like Spark Streaming
![Page 33: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/33.jpg)
Spark Standalone Deployment
![Page 34: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/34.jpg)
Spark YARN Deployment
![Page 35: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/35.jpg)
Master High Availability
• Multiple Master Nodes
• ZooKeeper maintains current Master
• Existing applications and workers will be
notified of new Master election
• New applications and workers need to
explicitly specify current Master
• Alternatives (Not recommended)
– Local filesystem
– NFS Mount
![Page 36: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/36.jpg)
Spark After Dark
(sparkafterdark.com)
![Page 37: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/37.jpg)
Spark After Dark ~ Tinder
![Page 38: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/38.jpg)
Mutual Match!
![Page 39: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/39.jpg)
Goal: Increase Matches (1/2)
• Top Influencers – PageRank
– Most desirable people
• People You May Know– Shortest Path
– Facebook-enabled
• Recommendations– Alternating Least
Squares (ALS)
![Page 40: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/40.jpg)
Goal: Increase Matches (2/2)
• Cluster on Interests,Height, Religion, etc
– K-Means
– Nearest Neighbor
• Textual Analysisof Profile Text
– TF/IDF
– N-gram
– LDA Topic Extraction
![Page 41: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/41.jpg)
Demo!
![Page 42: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/42.jpg)
Spark Streaming
![Page 43: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/43.jpg)
Spark Streaming Overview
• Low latency, high throughput, fault-tolerance (mostly)
• Long-running Spark application
• Supports Flume, Kafka, Twitter, Kinesis, Socket, File, etc.
• Graceful shutdown, in-flight message draining
• Uses Spark Core, DAG Execution Model, and Fault Tolerance
• Submits micro-batch jobs to the cluster
![Page 44: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/44.jpg)
Spark Streaming Overview
![Page 45: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/45.jpg)
Spark Streaming Use Cases• ETL and enrichment
of streaming data
on injest
• Operational
dashboards
• Lambda
Architecture
![Page 46: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/46.jpg)
Discretized Stream (DStream)• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDD – operates on underlying RDDs
![Page 47: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/47.jpg)
Spark Streaming API
![Page 48: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/48.jpg)
Spark Streaming API Overview
• Rich, expressive API similar to core
• Operations
– Transformations (lazy)
– Actions (execute transformations)
• Window and State Operations
• Requires checkpointing to snip long-
running DStream lineage
• Register DStream as a Spark SQL table for querying?! Wow.
![Page 49: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/49.jpg)
DStream Transformations
![Page 50: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/50.jpg)
DStream Actions
![Page 51: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/51.jpg)
Window and State DStream Operations
![Page 52: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/52.jpg)
DStream Window and State Example
![Page 53: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/53.jpg)
Spark Streaming Cluster
Deployment
![Page 54: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/54.jpg)
Spark Streaming Cluster Deployment
![Page 55: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/55.jpg)
Adding Receivers
![Page 56: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/56.jpg)
Adding Workers
![Page 57: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/57.jpg)
Spark Streaming
+
Kinesis
![Page 58: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/58.jpg)
Spark Streaming + Kinesis Architecture
![Page 59: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/59.jpg)
Throughput and Pricing
![Page 60: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/60.jpg)
Demo!
Kinesis
Streaming
Scala: …/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scalaJava: …/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/…
![Page 61: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/61.jpg)
Spark Streaming
Fault Tolerance
![Page 62: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/62.jpg)
Characteristics of Sources
• Buffered
– Flume, Kafka, Kinesis
– Allows replay and back pressure
• Batched
– Flume, Kafka, Kinesis
– Improves throughput at expense of duplication on failure
• Checkpointed
– Kafka, Kinesis
– Allows replay from specific checkpoint
![Page 63: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/63.jpg)
Message Delivery Guarantees• Exactly once [1]
– No loss
– No redeliver
– Perfect delivery
– Incurs higher latency for transactional semantics
– Spark default per batch using DStream lineage
– Degrades to less guarantees depending on source
• At least once [1..n]– No loss
– Possible redeliver
• At most once [0,1]– Possible loss
– No redeliver
– *Best configuration if some data loss is acceptable
• Ordered– Per partition: Kafka, Kinesis
– Global across all partitions: Hard to scale
![Page 64: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/64.jpg)
Types of Checkpoints
Spark
1. Spark checkpointing of StreamingContextDStreams and metadata
2. Lineage of state and window DStreamoperations
Kinesis
3. Kinesis Client Library (KCL) checkpoints current position within shard
– Checkpoint info is stored in DynamoDB per Kinesis application keyed by shard
![Page 65: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/65.jpg)
Fault Tolerance
• Points of Failure
– Receiver
– Driver
– Worker/Processor
• Possible Solutions
– Use HDFS File Source for durability
– Data Replication
– Secondary/Backup Nodes
– Checkpoints
• Stream, Window, and State info
![Page 66: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/66.jpg)
Streaming Receiver Failure
• Use a backup receiver
• Use multiple receivers pulling from multiple shards– Use checkpoint-enabled, sharded streaming
source (ie. Kafka and Kinesis)
• Data is replicated to 2 nodes immediately upon ingestion– Will spill to disk if doesn’t fit in memory
• Possible loss of most-recent batch
• Possible at-least once delivery of batch
• Use buffered sources for replay– Kafka and Kinesis
![Page 67: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/67.jpg)
Streaming Driver Failure
• Use a backup Driver
– Use DStream metadata checkpoint info to recover
• Single point of failure – interrupts stream processing
• Streaming Driver is a long-running Spark application
– Schedules long-running stream receivers
• State and Window RDD checkpoints to HDFS to help avoid data loss (mostly)
![Page 68: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/68.jpg)
Stream Worker/Processor Failure• No problem!
• DStream RDD partitions will be recalculated from lineage
• Causes blip in processing during node failover
![Page 69: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/69.jpg)
Spark Streaming
Monitoring and Tuning
![Page 70: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/70.jpg)
Monitoring
• Monitor driver, receiver, worker nodes, and
streams
• Alert upon failure or unusually high latency
• Spark Web UI
– Streaming tab
• Ganglia, CloudWatch
• StreamingListener callback
![Page 71: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/71.jpg)
Spark Web UI
![Page 72: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/72.jpg)
Tuning• Batch interval
– High: reduce overhead of submitting new tasks for each batch
– Low: keeps latencies low
– Sweet spot: DStream job time (scheduling + processing) is
steady and less than batch interval
• Checkpoint interval
– High: reduce load on checkpoint overhead
– Low: reduce amount of data loss on failure
– Recommendation: 5-10x sliding window interval
• Use DStream.repartition() to increase parallelism of processing
DStream jobs across cluster
• Use spark.streaming.unpersist=true to let the Streaming Framework
figure out when to unpersist
• Use CMS GC for consistent processing times
![Page 73: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/73.jpg)
Lambda Architecture
![Page 74: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/74.jpg)
Lambda Architecture Overview
• Batch Layer– Immutable,
Batch read,Append-only write
– Source of truth
– ie. HDFS
• Speed Layer– Mutable,
Random read/write
– Most complex
– Recent data only
– ie. Cassandra
• Serving Layer– Immutable,
Random read, Batch write
– ie. ElephantDB
![Page 75: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/75.jpg)
Spark + AWS + Lambda
![Page 76: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/76.jpg)
Spark + Lambda + GraphX + MLlib
![Page 77: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/77.jpg)
Demo!
• Load JSON
• Convert to Parquet file
• Save Parquet file to disk
• Query Parquet file directly
![Page 78: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/78.jpg)
Approximations
![Page 79: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/79.jpg)
Approximation Overview
• Required for scaling
• Speed up analysis of large datasets
• Reduce size of working dataset
• Data is messy
• Collection of data is messy
• Exact isn’t always necessary
• “Approximate is the new Exact”
![Page 80: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/80.jpg)
Some Approximation Methods
• Approximate time-bound queries
– BlinkDB
• Bernouilli and Poisson Sampling
– RDD: sample(), takeSample()
• HyperLogLog
PairRDD: countApproxDistinctByKey()
• Count-min Sketch
• Spark Streaming and Twitter Algebird
• Bloom Filters
– Everywhere!
![Page 81: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/81.jpg)
Approximations In Action
Figure: Memory Savings with Approximation Techniques(http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
![Page 82: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/82.jpg)
Spark Statistics Library
• Correlations
– Dependence between 2 random variables
– Pearson, Spearman
• Hypothesis Testing
– Measure of statistical significance
– Chi-squared test
• Stratified Sampling
– Sample separately from different sub-populations
– Bernoulli and Poisson sampling
– With and without replacement
• Random data generator
– Uniform, standard normal, and Poisson distribution
![Page 83: East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning](https://reader033.fdocuments.in/reader033/viewer/2022060121/5594530f1a28abe14f8b473c/html5/thumbnails/83.jpg)
Summary
• Spark, Spark Streaming Overview
• Use Cases
• API and Libraries
• Machine Learning
• Graph Processing
• Execution Model
• Fault Tolerance
• Cluster Deployment
• Monitoring
• Scaling and Tuning
• Lambda Architecture
• Probabilistic Data Structures
• Approximations
http://effectivespark.com http://sparkinaction.com
Thanks!!
Chris Fregly
@cfregly