From stream to recommendation using Apache Beam with Cloud Pub/Sub and Cloud Dataflow
Speakers: Igor Maravić & Neville Li, Spotify
From stream to recommendation with Cloud Pub/Sub and Cloud Dataflow
DATA & ANALYTICS
Current Event Delivery System
[Diagram: Client (x4), Gateway, Syslog, Syslog Producer, Groupers, Realtime Brokers, ACK Brokers, Brokers, Syslog Consumer, Liveness Monitor, Service Discovery, ETL job, Checkpoint Monitor, Hadoop; locations: any data centre, Hadoop data centre]
Complex
Stateless
Delivered data growth
[Chart: x-axis years 2007 to 2015]
Redesigning Event Delivery
[Diagram: Client (x4), Gateway, Syslog, File Tailer, Event Delivery Service, Reliable Persistent Queue, ETL, Hadoop; location: any data centre]
Same API
Persistence
Keep it simple
Build it!
Choosing reliable persistent queue
Kafka 0.8
Proven technology
Strong community
Reliable persistent queue
Event delivery with Kafka 0.8
[Diagram: Client (x4), Gateway, Syslog, File Tailer, Event Delivery Service, Brokers, MirrorMakers, Brokers, Camus (ETL), Hadoop; locations: any data centre, Hadoop data centre]
Cloud Pub/Sub
Retains undelivered data
At least once delivery
Globally available
Simple REST API
No operational responsibility*
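At-least-once delivery, listed above, means a subscriber may see the same message more than once. The slides do not show how consumers cope with that; the following is only a minimal sketch of one common approach, de-duplicating on a message ID. The class name, the method names and the in-memory ID set are all hypothetical and are not taken from the talk or from the Cloud Pub/Sub client API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical consumer-side de-duplication for an at-least-once queue.
final class DedupingConsumer {
    // IDs of messages already processed. Unbounded here for brevity;
    // a long-running consumer would need to expire old IDs.
    private final Set<String> seenIds = new HashSet<>();
    private final List<String> processed = new ArrayList<>();

    // Returns true if the payload was processed, false if it was a redelivery.
    boolean onMessage(String messageId, String payload) {
        if (!seenIds.add(messageId)) {
            return false; // duplicate delivery: safe to acknowledge and skip
        }
        processed.add(payload);
        return true;
    }

    List<String> processed() {
        return processed;
    }

    public static void main(String[] args) {
        DedupingConsumer consumer = new DedupingConsumer();
        consumer.onMessage("m1", "play");
        consumer.onMessage("m2", "pause");
        consumer.onMessage("m1", "play"); // redelivered copy of m1
        System.out.println(consumer.processed());
    }
}
```

The point of the sketch is that processing has to be idempotent per message ID; where the set of seen IDs lives is a separate design decision.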
SHUT UP AND TAKE MY MONEY!
Caution advised!
Building up trust in Cloud Pub/Sub
Delivered data growth
[Chart: x-axis years 2007 to 2015]
Demo time!
2M events per second.
Cloud Pub/Sub, Spotify chooses You!
Event delivery with Cloud Pub/Sub
[Diagram: Client (x4), Gateway, Syslog, File Tailer, Event Delivery Service, Cloud Pub/Sub, Dataflow, Cloud Storage, Hadoop; location: any data centre]
ETL using Cloud Dataflow
Streaming ETL job with Cloud Dataflow
Dataflow SDK is a framework
Cloud Dataflow is a managed service
ETL job
Single Cloud Pub/Sub subscription
[Dataflow job graph: Consume step, status Running]
GCS and HDFS in parallel.
Event time based hourly buckets
[Figure: timeline of hourly buckets from 2016-03-21 23H to 2016-03-22 04H, shown on each of the following slides]
Incremental bucket fill
Bucket completeness
Late data handling
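The bucket requirements above can be made concrete with a small sketch. It assigns an event to an hourly bucket by its event time, treats a bucket as complete once a watermark has passed the end of that hour, and flags events that arrive for an already complete bucket as late. The class and method names are hypothetical, and using UTC for the bucket labels is an assumption; the slides show only the label format.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical helper illustrating event-time hourly bucketing.
final class HourlyBuckets {
    // Label format as on the slides, e.g. "2016-03-22 03H" (UTC is assumed).
    private static final DateTimeFormatter LABEL =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH'H'").withZone(ZoneOffset.UTC);

    // Start of the hour that contains the event time.
    static Instant bucketStart(Instant eventTime) {
        return eventTime.truncatedTo(ChronoUnit.HOURS);
    }

    static String bucketLabel(Instant eventTime) {
        return LABEL.format(bucketStart(eventTime));
    }

    // A bucket is complete once the watermark has reached the end of its hour.
    static boolean isComplete(Instant bucketStart, Instant watermark) {
        return !watermark.isBefore(bucketStart.plus(Duration.ofHours(1)));
    }

    // An event is late if its bucket was already complete when it arrived.
    static boolean isLate(Instant eventTime, Instant watermark) {
        return isComplete(bucketStart(eventTime), watermark);
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2016-03-22T03:17:45Z");
        System.out.println(bucketLabel(event));
        System.out.println(isLate(event, Instant.parse("2016-03-22T04:00:00Z")));
    }
}
```

In the actual job these decisions are delegated to Dataflow windowing and triggers; the sketch only spells out the bookkeeping they imply.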
Windowing
[Dataflow job graph steps: Consume (Running), Shard (4,061 elements/s), Window (4,061 elements/s), Write to HDFS (Running), Write to GCS (Running)]
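@Override
public PCollection<KV<String, Iterable<EventMessage>>> apply(
    final PCollection<KV<String, EventMessage>> shardedEvents) {
  return shardedEvents
      .apply("Assign Hourly Windows",
          // Bucket events into fixed one-hour windows by event time.
          Window.<KV<String, EventMessage>>into(FixedWindows.of(ONE_HOUR))
              // Accept events that arrive up to one day after the window closes.
              .withAllowedLateness(ONE_DAY)
              .triggering(
                  AfterWatermark.pastEndOfWindow()
                      // Before the watermark passes: emit a pane once enough events arrive.
                      .withEarlyFirings(AfterPane.elementCountAtLeast(maxEventsInFile))
                      // After the watermark passes: emit late events on count
                      // or ten seconds after the first one, whichever is first.
                      .withLateFirings(AfterFirst.of(
                          AfterPane.elementCountAtLeast(maxEventsInFile),
                          AfterProcessingTime.pastFirstElementInPane()
                              .plusDelayOf(TEN_SECONDS))))
              // Each pane holds only the events not emitted in an earlier pane.
              .discardingFiredPanes())
      .apply("Aggregate Events", GroupByKey.create());
}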
Streaming
Where are we right now?
Preliminary results
[Chart: Watermark Lag, in minutes]
Scio
Scala API for Google Cloud Dataflow
Origin story
Scalding and Spark popular for ML, recommendations, analytics @ Spotify
50+ users, 400+ unique jobs
Early 2015 - Dataflow Scala hack project
Why not Scalding on GCE
Pros
● Big community - Twitter, eBay, Etsy, Stripe, LinkedIn, SoundCloud
● Stable and proven
Cons
● Hadoop cluster operations
● Multi-tenancy, resource contention and utilization
● No streaming mode
Why not Spark on GCE
Pros
● Batch, streaming, interactive and SQL
● MLlib, GraphX
● Scala, Python, and R support
Cons
● Hard to tune and scale
● Cluster lifecycle management
Why Dataflow with Scala
Dataflow
● Hosted solution, no operations
● Ecosystem: GCS, BigQuery, Pub/Sub, Datastore, Bigtable
● Simple unified model for batch and streaming
Scala
● High level DSL, easy transition for developers
● Reusable and composable code via functional programming
● Numerical libraries: Breeze, Algebird
Cloud Storage, Pub/Sub, Datastore, Bigtable, BigQuery
Batch, Streaming, Interactive REPL
Scio Scala API
Dataflow Java SDK, Scala Libraries
Extra features
Scio
Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i ̯o]
Verb: I can, know, understand, have knowledge.
Core API similar to spark-core, some ideas from scalding
github.com/spotify/scio
WordCount
Almost identical to Spark version
val sc = ScioContext()
sc.textFile("shakespeare.txt")
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue()
  .saveAsTextFile("wordcount.txt")
PageRank in 13 lines
def pageRank(in: SCollection[(String, String)]) = {
  val links = in.groupByKey()
  var ranks = links.mapValues(_ => 1.0)
  for (i <- 1 to 10) {
    val contribs = links.join(ranks).values
      .flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map((_, rank / size))
      }
    ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
  }
  ranks
}
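Read as a formula, each of the ten iterations in the snippet above applies the following update to every page u that receives at least one contribution. The symbols are introduced here for illustration and do not appear in the slides: d is the damping factor (0.85 in the code), links(p) is the collection of pages that p links to, and r_k is the rank after k iterations.

```latex
r_{k+1}(u) = (1 - d) + d \sum_{p \,:\, u \in \mathrm{links}(p)} \frac{r_k(p)}{|\mathrm{links}(p)|},
\qquad d = 0.85, \quad r_0(p) = 1, \quad k = 0, \dots, 9
```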
SQL and Big Data Pipelines
SQL is easier to write than data pipelines, but
Hive with TSV or Avro
● Row based storage, inefficient full scan
● No integration with other frameworks
Parquet
● Inspired by Google Dremel which powers BigQuery
● Immature Hive integration, hard to scale with Spark SQL
● Poor impedance matching with Scalding, Avro, etc.
BigQuery and Scio
BigQuery
● Slicing and dicing, aggregation, etc.
● Scaling independently
● Web UI, Tableau, QlikView etc.
Scio
● Custom logic hard to express in SQL
● Seamless integration with BigQuery IO
● Scala macros for type safety
JSON vs Type Safe BigQuery
JSON approach, a.k.a. everything is Object
sc.bigQuerySelect("...").map { r =>
  (r.get("track").asInstanceOf[TableRow]
     .get("name").asInstanceOf[String],
   r.get("audio").asInstanceOf[TableRow]
     .get("tempo").toString.toInt)
}
Compile → Run job → Wait → NullPointerException or ClassCastException → Repeat
Type safe approach
@BigQueryType.fromQuery("...")
class TrackTempo

sc.typedBigQuery[TrackTempo]().map { t =>
  (t.track.name, t.audio.tempo.getOrElse(-1))
}
Compile → Run → Profit
Spotify Running
60 million tracks
30 million users * 10 tempo buckets * 25 personalized tracks = 7.5 billion recommendations
Audio: tempo, energy, time signature ...
Metadata: genres, categories
Latent vectors from collaborative filtering
Rapid prototyping with BigQuery
Spotify Running
[Pipeline diagram labels]
SELECT user_id, vector FROM UserEntity WHERE ...
SELECT track_id, audio.tempo ... FROM TrackEntity WHERE ...
most popular per recording
top N tracks per artist
bucket by tempo
vector LSH per bucket
GBK, GBK, GBK, RBK
top tracks per user + bucket side input
Cloud Datastore
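Two of the pipeline steps named above, "bucket by tempo" and "top N tracks per artist", can be sketched in plain Java. Everything here is hypothetical: the Track fields, the use of a play count as the popularity measure, and the 140 to 190 BPM range split into equal-width buckets are assumptions made for illustration. Only the number of buckets (10) comes from the slides, and the real job expresses these steps as Scio transforms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory version of two steps of the Spotify Running pipeline.
final class RunningSketch {
    // Illustrative track record; field names are assumptions.
    static final class Track {
        final String id;
        final String artist;
        final double tempo; // beats per minute
        final long plays;   // assumed popularity measure

        Track(String id, String artist, double tempo, long plays) {
            this.id = id;
            this.artist = artist;
            this.tempo = tempo;
            this.plays = plays;
        }
    }

    static final double MIN_BPM = 140.0; // assumed lower bound
    static final double MAX_BPM = 190.0; // assumed upper bound
    static final int BUCKETS = 10;       // "10 tempo buckets" from the slides

    // Maps a tempo to an equal-width bucket index in [0, BUCKETS - 1].
    static int tempoBucket(double bpm) {
        double clamped = Math.max(MIN_BPM, Math.min(bpm, MAX_BPM));
        int index = (int) ((clamped - MIN_BPM) / (MAX_BPM - MIN_BPM) * BUCKETS);
        return Math.min(index, BUCKETS - 1); // MAX_BPM falls in the last bucket
    }

    // Groups tracks by artist and keeps the n most played tracks of each.
    static Map<String, List<Track>> topNPerArtist(List<Track> tracks, int n) {
        Map<String, List<Track>> byArtist = new HashMap<>();
        for (Track t : tracks) {
            byArtist.computeIfAbsent(t.artist, k -> new ArrayList<>()).add(t);
        }
        for (List<Track> group : byArtist.values()) {
            group.sort(Comparator.comparingLong((Track t) -> t.plays).reversed());
            if (group.size() > n) {
                group.subList(n, group.size()).clear();
            }
        }
        return byArtist;
    }

    public static void main(String[] args) {
        List<Track> tracks = Arrays.asList(
            new Track("t1", "a", 150, 5),
            new Track("t2", "a", 167, 9),
            new Track("t3", "a", 180, 1));
        System.out.println(tempoBucket(167));
        System.out.println(topNPerArtist(tracks, 2).get("a").get(0).id);
    }
}
```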
[Dataflow monitoring UI screenshot: typedBigQuery steps; statuses Succeeded and Running; 4,788 elements/s]
What’s the catch?
Early stage, some rough edges
No interactive mode → Scio REPL (WIP), BigQuery + Datalab
No machine learning → TensorFlow
Licensed under Apache 2, contributions welcome!
Learnings?
Blog posts @ labs.spotify.com
Spotify’s Event Delivery - The Road To The Cloud: Part I, Part II, Part III
Thank You
Igor Maravić
Neville Li