Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming...
Transcript of Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming...
![Page 1: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/1.jpg)
Evolution of an Apache Spark Architecture for Processing
Game Data
Nick Afshartous
WB Analytics Platform
May 17th 2017
![Page 2: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/2.jpg)
About Me
• WB Analytics Core Platform Lead
• Contributor to Reactive Kafka
• Based in Turbine Game Studio (Needham, MA)
• Hobbies• Sailing
• Chess
![Page 3: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/3.jpg)
WB Games
![Page 4: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/4.jpg)
• Intro
• Ingestion pipeline
• Redesigned ingestion pipeline
• Summary and lessons learned
![Page 5: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/5.jpg)
Problem Statement
• How to evolve Spark Streaming architecture to address challenges in streaming data into Amazon Redshift
![Page 6: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/6.jpg)
Tech Stack
![Page 7: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/7.jpg)
Tech Stack – Data Warehouse
![Page 8: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/8.jpg)
Partitions
Producers Consumers
(Key, Value) at Partition, Offset
Kafka
Topics
• Message is a (key, value)
• Optional key used to assign message to partition
• Consumers can start processing from earliest, latest, or from specific offsets
![Page 9: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/9.jpg)
Game Data
• Game (mobile and console) instrumented to send event data
• Volume varies up to 100,000 events per second per game
• Games have up to ~ 70 event types
• Data use-cases• Development
• Reporting
• Decreasing player churn
• Increase revenue
![Page 10: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/10.jpg)
Ingestion Pipelines• Batch Pipeline
• Input JSON
• Processing Hadoop Map Reduce
• Storage Vertica
• Spark / Redshift Real-time Pipeline • Input Avro
• Processing Spark Streaming
• Storage Redshift
![Page 11: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/11.jpg)
Spark Versions
• In process of upgrading to Spark 2.0.2 from 1.5.2
• Load tested Spark 2.1• Blocked by deadlock issue
• SPARK-19300 Executor is waiting for lock
![Page 12: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/12.jpg)
• Intro
• Ingestion pipeline
• Re-designed ingestion pipeline
• Summary and lessons learned
![Page 13: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/13.jpg)
Process for Game Sending Events
Avro Data Event IngestionSchema
Hash
Avro
SchemaSchema Registry
Schema
Hash
Returned hash based on schema fields/types
Registration triggers Redshift table create/alter statements
![Page 14: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/14.jpg)
Ingestion Pipeline
Event Ingestion Service
HTTPS
Event Avro
Schema Hash
Micro Batch
Spark Streaming
S3
Data flow
Invocation
Kafka
Data topic
Run COPY
![Page 15: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/15.jpg)
Redshift Copy Command
Table
person.txt
1|john doe2|sarah smith
create table if not exists public.person (
id integer,
name varchar
)
• Redshift optimized for loading from S3
• COPY is a SQL statement executed by Redshift
• Example COPY
copy public.person from 's3://mybucket/person.txt'
![Page 16: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/16.jpg)
Ingestion Pipeline
Event Ingestion Service
HTTPS
Event Avro
Schema Hash
Micro Batch
Spark Streaming
S3
Data flow
Invocation
KafkaData topic
![Page 17: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/17.jpg)
Challenges• Redshift designed for loading large data files
• Not for highly concurrent workloads (single-threaded commit queue)
• Redshift latency can destabilize Spark streaming• Data loading competes with user queries and reporting workloads
• Weekly maintenance
![Page 18: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/18.jpg)
• Intro
• Ingestion pipeline
• Redesigned ingestion pipeline
• Summary and lessons learned
![Page 19: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/19.jpg)
Redesign The Pipeline
• Goals• De-couple Spark streaming job from Redshift
• Tolerate Redshift unavailability
• High-level Solution• Spark only writes to S3
• Spark sends copy tasks to Kafka topic consumed by (new) Redshift loader
• Design Redshift loader to be fault-tolerant w.r.t. Redshift
![Page 20: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/20.jpg)
Technical Design Options
• Options considered for building Redshift loader• 2nd Spark streaming
• Build a lightweight consumer
![Page 21: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/21.jpg)
Redshift Loader
• Redshift loader built using Reactive Kafka
• API’s for Scala and Java
• Reactive Kafka• High-level Kafka consumer API
• Leverages Akka streams and Akka
Reactive Kafka
Akka Streams
Akka
![Page 22: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/22.jpg)
Akka
• Akka is an implementation of Actors• Actors: a model of concurrent computation in distributed systems, Gul Agha, 1986
• Actors • Single-threaded entities with an asynchronous message queue (mailbox) • No shared memory
• Features• Location transparency
• Actors can be distributed over a cluster
• Fault-tolerance• Actors restarted on failure
ActorQueue
http://akka.io
![Page 23: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/23.jpg)
Akka Streams
• Hard to implement stream processing considering• Back pressure – slow down rate to that of slowest part of stream
• Not dropping messages
• Akka Streams is a domain specific language for stream processing• Stream executed by Akka
![Page 24: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/24.jpg)
Akka Streams DSL
• Source generates stream elements
• Flow is a transformer (input and output)
• Sink is stream endpoint
Source SinkFlow
![Page 25: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/25.jpg)
Akka Streams Example
• Run stream to process two elements
val s = Source(1 to 2)
s.map(x => println("Hello: " + x))
.runWith(Sink.ignore)
Hello: 1Hello: 2
Output
Not executed by calling thread
Nothing happens until run method is invoked
![Page 26: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/26.jpg)
Reactive Kafka
• Reactive Kafka stream is a type of Akka Stream
• Supported version is from Kafka 0.10+• 0.8 branch is unsupported and less stable
https://github.com/akka/reactive-kafka
![Page 27: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/27.jpg)
Reactive Kafka – Example• Create consumer config
implicit val system = ActorSystem("Example")
val consumerSettings = ConsumerSettings(system,
new ByteArrayDeserializer,
new StringDeserializer)
.withBootstrapServers("localhost:9092")
.withGroupId("group1")
Consumer.plainSource(consumerSettings, Subscriptions.topics("topic.name"))
.map { message => println("message: " + message.value()) }
.runWith(Sink.ignore)
Creates Source that streams elements from Kafka
Deserializersfor key, value
Consumer group
• Create and run stream
Kafka endpoint
message has type ConsumerRecord(Kafka API)
![Page 28: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/28.jpg)
Backpressure
• Slows consumption when message rate is faster than consumer can process
• Asynchronous operations inside map bypass backpressure mechanism• Use mapAsync instead of map for asynchronous operations (futures)
![Page 29: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/29.jpg)
Revised Architecture
S3
Copy Tasks
Data flow
Invocation
Game Clients
Event Ingestion Service
HTTPS
Event
Avro
Spark Streaming
Redshift Loader
Kafka
Data topic
COPY topic
![Page 30: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/30.jpg)
Goals
• De-couple Spark streaming job from Redshift
• Tolerate Redshift unavailability
![Page 31: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/31.jpg)
Redshift Cluster Status
• Cluster status displayed on AWS console
• Can be obtained programmatically via AWS SDK
![Page 32: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/32.jpg)
Redshift Fault Tolerance
• Loader Checks health of Redshift using AWS SDK
• Start consuming when Redshift available
• Shut down consumer when Redshift not availableConsumer.Control.shutdown()
• Run test query to validate database connections• Don’t rely on JDBC driver’s Connection.isClosed() method
![Page 33: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/33.jpg)
Transactions
• With auto-commit enabled each COPY is a transaction• Commit queue limits throughput
• Better throughput by executing multiple COPY’s in a single transaction
• Run several concurrent transactions per job
![Page 34: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/34.jpg)
Deadlock
• Concurrent transactions create potential for deadlock since COPY statements lock tables
• Redshift will detect and return deadlock exception
![Page 35: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/35.jpg)
Deadlock
Copy table A Copy table B A, B locked
Copy table B Copy table A Wait for lock
Deadlock
Transaction 1 Transaction 2
Time
![Page 36: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/36.jpg)
Deadlock Avoidance
• Master hashes to worker based on Redshift table name• Ensures that all COPY’s for the same table are dispatched to the same worker
• Alternatively could order COPY’s within transaction to avoid deadlock
![Page 37: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/37.jpg)
• Intro
• Ingestion pipeline
• Redesigned ingestion pipeline
• Summary and lessons learned
![Page 38: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/38.jpg)
Lessons (Re) learned• Distinguish post-processing versus data processing
• Use Spark for data processing
• Assume dependencies will fail
• Load testing • Don’t focus exclusively on load volume
• Number of event types was a factor
• Load test often • Facilitated by automation
![Page 39: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/39.jpg)
Monitoring
• Monitor individual components• Redshift loader sends a heartbeat via CloudWatch metric API
• Monitor flow through the entire pipeline• Send test data and verify successful processing
• Catches all failures
![Page 40: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/40.jpg)
Possible Future Directions
• Explore using Akka Persistence
• Implement using more of Reactive Kafka• Use Source.groupedWithin for batching COPY’s instead of Akka
![Page 41: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/41.jpg)
Related Note
• Kafka Streams released as part of Kafka 0.10+• Provides streams API similar to Reactive Kafka
• Reactive Kafka has API integration with Akka
![Page 42: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/42.jpg)
Questions ?
![Page 43: Evolution of an Apache Spark Architecture for Processing ...€¢How to evolve Spark Streaming architecture to address challenges in ... •Data loading competes with user queries](https://reader034.fdocuments.in/reader034/viewer/2022051801/5add6b427f8b9a595f8cd352/html5/thumbnails/43.jpg)
Imports for Code Examples
import org.apache.kafka.clients.consumer.ConsumerRecord
import akka.actor.{ActorRef, ActorSystem}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Sink, Source}
import scala.util.{Success, Failure}
import scala.concurrent.ExecutionContext.Implicits.global
import akka.kafka.ConsumerSettings
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.{StringDeserializer, ByteArrayDeserializer}
import akka.kafka.Subscriptions
import akka.kafka.ConsumerMessage.{CommittableOffsetBatch, CommittableMessage}
import akka.kafka.scaladsl.Consumer