@helenaedelson #kafkasummit 1
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign Management for Globally Distributed Data Flows
Helena Edelson @helenaedelson Kafka Summit 2016
VP of Engineering, Tuplejump
Previously: Sr Cloud / Big Data / Analytics Engineer: DataStax, CrowdStrike, VMware, SpringSource...
Event-Driven systems, Analytics, Machine Learning, Scala
Committer: Kafka Connect Cassandra, Spark Cassandra Connector
Contributor: Akka, previously: Spring Integration
Speaker: Kafka Summit, Spark Summit, Strata, QCon, Scala Days, Scala World, Philly ETE
twitter.com/helenaedelson github.com/helena
slideshare.net/helenaedelson
The Real Topic
http://www.slideshare.net/palvaro/ricon-keynote-outwards-from-the-middle-of-the-maze/42
Chaos Of Distribution
One of the more fascinating problems is that of solving the chaos of distributed systems, regardless of the domain.
Approaching this within the use case of:
High-Level Landscape
Platform & Infrastructure
Strategies and Patterns
Four-Letter Acronyms
Can't Touch This
Architecture
The Landscape
The Digital Ad Industry
An RTB Drive-By
Real-time auction for ad spaces, all devices
High throughput, low latency (similar to FinTech, but not quite)
OpenRTB API spec, but not everyone uses it
OpenRTB: open protocol for automated trading of digital media across platforms, devices, and advertising solutions
In A Nutshell
[Diagram: a user hits a Publisher's page; Advertisers send Bid Requests; the highest bid is accepted; the ad is delivered to the user]
RTB Auction for Impressions
[Diagram of the auction participants and the messages exchanged between them]
Site: ad-supported content
User Devices
Publisher: seller has ad spaces to sell to the highest bidders
Real Time Exchange & Auction (SSP): OpenRTB server used to bid
Bidder Service (DSP): OpenRTB client
Advertiser: buyer wants ad impressions, uses bidders to bid on its behalf
Messages: ad request, bid request, bid response, win notice & settlement price, insert orders, winning ad
Time Is Money
RTB: Maximum response latency of 100 ms
Time Is Money
Assume some network latency!
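A rough way to read the two "Time Is Money" slides: whatever the network consumes comes out of the 100 ms deadline, and the bidder only gets the remainder. The sketch below is an illustration only; the function name and the round-trip figure in the usage note are assumptions, not numbers from the talk.

```scala
object LatencyBudget {
  /** Time left for computing a bid once the network round trip is
    * subtracted from the response deadline. Never negative. */
  def computeBudgetMs(deadlineMs: Int, networkRoundTripMs: Int): Int =
    math.max(0, deadlineMs - networkRoundTripMs)
}
```

For example, with a hypothetical 40 ms round trip, `LatencyBudget.computeBudgetMs(100, 40)` leaves 60 ms for the bidder.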
Sampling of RTB Events
Ad Request
Bid Request - JSON 100 bytes
Compute optimal bid for advertiser
Bid Response - JSON 1000 bytes (may include ad metadata)
Win Notification (may or may not exist) with settlement price
Ad Impression - when the ad is viewed
Ad Click
Ad Conversion
Event Streams
Auctions: auction data + bid requests
Ad Impressions: which ad ids were shown
Ad Clicks: which auction ids resulted in a click
Ad Conversions: streams joined on auction id
Analytics Aggregations & ML to derive hundreds of metrics and dimensions
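Joining streams on auction id can be sketched with plain collections. This is a minimal in-memory illustration, not the streaming implementation from the talk; the `Impression` and `Click` shapes and the `clickedAds` name are hypothetical.

```scala
object AuctionJoin {
  // Hypothetical, simplified event shapes; real RTB events carry many more fields.
  final case class Impression(auctionId: String, adId: String)
  final case class Click(auctionId: String)

  /** Inner join of impressions and clicks on auction id:
    * returns (auctionId, adId) for every impression that was clicked. */
  def clickedAds(impressions: Seq[Impression], clicks: Seq[Click]): Seq[(String, String)] = {
    val clickedAuctions: Set[String] = clicks.map(_.auctionId).toSet
    impressions.collect {
      case Impression(auctionId, adId) if clickedAuctions.contains(auctionId) =>
        (auctionId, adId)
    }
  }
}
```

In a real pipeline the same join would run over unbounded streams and need a time window; the batch version only shows the keying.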
Real Time: just means event-driven, or processing events as they arrive. Does not automatically equal sub-second latency requirements.
Seen / Ingestion Time: when an event is ingested into the system.
Event Time: when an event is created, e.g. on a device.
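The difference between event time and seen / ingestion time is easiest to see when both timestamps travel with the event. A minimal sketch, assuming a hypothetical `Event` envelope and an `allowedLateness` threshold that the talk does not define:

```scala
import java.time.{Duration, Instant}

object EventTimes {
  // Hypothetical envelope carrying both timestamps.
  final case class Event(id: String, eventTime: Instant, seenTime: Instant)

  /** Ingestion lag: how long after creation the event reached the system. */
  def ingestionLag(e: Event): Duration =
    Duration.between(e.eventTime, e.seenTime)

  /** True when the event arrived later than the allowed lateness. */
  def isLate(e: Event, allowedLateness: Duration): Boolean =
    ingestionLag(e).compareTo(allowedLateness) > 0
}
```

Windowed aggregations would normally key on `eventTime`, while `seenTime` is what the ingestion path can actually observe.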
The Platform
Platform Requirements
24 / 7 Uptime
Brokerage model: DSPs only make $ on successful ad deliveries, so uptime is critical
Security
Enable service across the globe
Handle thousands of concurrent requests per second
Scale to traffic of 700TB per day
Manage 700TB per day of data
Derive Metrics
Business Requirements
Support SLAs for bid transactions
Legal constraints - user data crossing borders
The critical path must be fast to win
No data loss on ingestion path
Bid & Campaign Optimization
Frequency Capping
Management UI for Publishers & Advertisers
Questions To Answer
% Writes on ingestion, analytics pre-aggregation, etc.
% Reads of raw data by analytics, aggregated views by customer management UI
How much in memory on RTB app nodes?
Dimensions of data in analytics queries
Optimization Algos
What needs real-time feedback loops, and what does not
Which data flows are low-latency / high-frequency, and which are not
Where are the potential bottlenecks
Constraints
Resources - I need to build highly functioning teams that are psyched about the work and working together
Budget
Cloud Resources
JDK Version (What?!)
Existing infrastructure & technologies that will be replaced later but you have to deal with now :(
Pro Tip: Pay well, allow people to grow & be creative
Strategies To Avoid
Beware of the C word
Consistency?
Convergence?
http://www.slideshare.net/palvaro/ricon-keynote-outwards-from-the-middle-of-the-maze/39
he went there
@palvaro
Complexity
Can't Ops your way out of that
Occam's razor: simpler theories are preferable to more complex ones
Strategies
Approaches
Eventual/Tunable consistency
Time & Clocks in globally-distributed systems
Location Transparency
Asynchrony
Pub-Sub
Design for scale
Design for Failure
Kafka as Platform Fabric
From MVP to Scalable with Kafka
Microservices
Does One Thing, Knows One Thing
Separate low-latency hot path
Separate deploy artifacts
Separate data mgmt clusters by concern: analytics, timeseries, etc.
CQRS: Separate Read and Write paths
Scalpel...
Separate The Monolith
Immutable events stream to Kafka, partitioned by event type, time, etc.
Subscribers & Publishers
RTB microservices - receives raw, receives
Analytics cluster - receives raw, publishes aggregates
Management / Reporting nodes
Services communicate indirectly via Kafka
CQRS: Command Query Responsibility Segregation
Decouple Write streams from Read streams
Different schemas / data structures
Writers (Publishers) publish without any awareness of who needs to receive the data or how to reach them (location, protocol...)
Readers (Subscribers) should be able to subscribe and asynchronously receive from topics of interest
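The decoupling described above can be illustrated with a tiny topic bus. This is a synchronous, in-memory stand-in written for illustration, not Kafka and not code from the talk; `TopicBus` and its methods are hypothetical names, and delivery here is immediate rather than asynchronous.

```scala
import scala.collection.mutable

/** Synchronous, in-memory stand-in for a topic-based bus (not Kafka itself). */
final class TopicBus {
  private val handlers = mutable.Map.empty[String, Vector[String => Unit]]

  /** Readers register interest in a topic; they never learn who publishes. */
  def subscribe(topic: String)(handler: String => Unit): Unit =
    handlers.update(topic, handlers.getOrElse(topic, Vector.empty) :+ handler)

  /** Writers publish to a topic; they never learn who, if anyone, receives. */
  def publish(topic: String, event: String): Unit =
    handlers.getOrElse(topic, Vector.empty).foreach(handler => handler(event))
}
```

The only thing writers and readers share is the topic name, which is the property that lets the write side and the read side evolve their schemas and deployments independently.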
Eventually Consistent Across DCs
[Architecture diagram. Regions: US-East-1 and EU-west-1, connected by MirrorMaker. Kafka Cluster Per Region, each with ZK. Publishers and Subscribers: RTB microservices and Mgmt microservices. Layers: Query Layer and Compute Layer. Clusters: Analytics & ML Cluster and Timeseries Cluster, each running Spark Streaming & ML with Cassandra (Cross DC Replication, Topology Aware)]
Eventually Consistent Across DCs
[Architecture diagram, variant with Cassandra (C*) in each region. Regions: US-East-1 and EU-west-1, connected by MirrorMaker. Kafka Cluster Per Region. Publishers and Subscribers: RTB microservices and Mgmt microservices. Layers: Compute Layer and Query Layer. Clusters: Analytics & ML Cluster and Timeseries Cluster, each running Spark Streaming & ML with Cassandra (Cross DC Replication, Topology Aware)]
Kafka Cross Datacenter Mirroring
bin/kafka-run-class.sh kafka.tools.MirrorMaker \
  --consumer.config config/consumer_source_cluster.properties \
  --producer.config config/producer_target_cluster.properties \
  --whitelist bidrequests \
  --num.producers 2 \
  --num.streams 4
Publish messages from various datacenters around the world
Cassandra Cross DC Replication
It's out of the box. Multi-region live backups for free: [ NetworkTopologyStrategy ]
Users in the US and UK connect to DCs in their geo region for lower latency
Both DCs are part of the same cluster for X-DC Replication
Configure LB policies to prefer the local DC
LOCAL_QUORUM reads
Data is available cluster-wide for backup, analytics, and to account for user travel across regions
Cassandra Cross DC Replication
Keep EU User Data in the EU
CREATE KEYSPACE rtb WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'eu-east-dc': '3', 'eu-west-dc': '3'
};
Cassandra Time Windowed Buckets with TTL
CREATE TABLE rtb.fu_events (
  id int,
  seen_time timeuuid,
  event_time timestamp,
  -- The slide listed (id, date), but no date column is defined;
  -- event_time is used so the clustering order below is valid.
  PRIMARY KEY (id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
  AND compaction = {
    'compaction_window_unit': 'DAY',
    'compaction_window_size': '3',
    'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy'
  }
  AND compression = {
    'crc_check_chance': '0.5',
    'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'
  }
  AND bloom_filter_fp_chance = 0.01
  AND caching = '{"keys":"ALL", "rows_per_partition":"100"}'
  AND dclocal_read_repair_chance = 0.0
  AND default_time_to_live = 60
  AND gc_grace_seconds = 0
  AND max_index_interval = 2048
  AND memtable_flush_period_in_ms = 0
  AND min_index_interval = 128
  AND read_repair_chance = 0.0
  AND speculative_retry = '99.0PERCENTILE';
3 DAY buckets: larger SSTables on disk minimize bootstrapping issues when adding nodes to a cluster
Other bucket sizes: 3 MINUTE buckets, 1 HOUR buckets, 1 DAY buckets
MICROSECOND resolution:
My Nerdy Chart v2.0
Want | Can Or Currently Use | Status | But
Kafka Security | Kafka Security: TLS, Kerberos, SASL, Auth, Encryption, Authentication | v0.9.0, Thanks Jun! |
Integrated Streaming | Kafka Streams: processing inside Kafka, no alternate cluster setup or ops | v0.10, Thanks Guozhang! | It's java :(
Cassandra CDC | Cassandra CDC. Triggers? | The Epic JIRA: https://issues.apache.org/jira/browse/CASSANDRA-8844 | Triggers are a pre-commit hook :( no comment
And... | Kafka Streams & Kafka Connect Integration | ..wait for it.. | no comment
Always on, X-DC Replication, Flexible Topologies | Kafka, Cassandra | OOTB |
Fault Tolerance | Kafka, Spark, Mesos, Cassandra, Akka | Baked In |
Location Transparency | Kafka, Cassandra, Akka | Check! |
Asynchrony | Kafka, Cassandra, Akka | Check! |
Decoupling | Kafka, Akka | Check! |
Pub-Sub | Kafka, Cassandra, Akka | Check! |
Immutability | Kafka, Akka, Scala | Check! |
Kafka Streams in v 0.10
// Slide pseudocode: K, V, ser, des and props are left undefined, as on the slide.
val builder = new KStreamBuilder()
val stream: KStream[K, V] = builder.stream(des, des, "raw.data.topic")
  .flatMapValues(value => Arrays.asList(value.toLowerCase.split(" "): _*))
  .map((k, v) => new KeyValue(k, v))
  .countByKey(ser, ser, des, des, "kTable")
  .toStream
stream.to("results.topic", ...)
val streams = new KafkaStreams(builder, props)
streams.start()
Kafka Streams & Kafka Connect?
// Wished-for integration, written as pseudocode.
val builder = new KStreamBuilder()
val stream: KStream[K, V] = builder.stream(new CassandraConnect(configs))
  .flatMapValues(..)
  .map((k, v) => new KeyValue(k, v))
  .countByKey(ser, ser, des, des, "kTable")
  .toStream
stream.to("results.topic", ...)
val streams = new KafkaStreams(builder, props)
streams.start()
YES
/** Writes records from Kafka to Cassandra asynchronously and non-blocking. */
override def put(records: JCollection[SinkRecord]): Unit
/** Returns a list of records when available by polling for new records. */
override def poll: JList[SourceRecord]
https://github.com/tuplejump/kafka-connect-cassandra
Frequency Capping
1. Count the number of times user X has seen ad Y from Advertiser A's Campaign C
2. Limit the max number of impressions of an ad within T1...T2
Use Case:
Continuously count impressions grouped by campaign across DCs
low-latency reads & writes
Must scale
Cross DC Counters
Translation: Distributed Counters
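The two frequency-capping rules can be sketched on a single node before worrying about distribution. This is a minimal, single-process illustration; `CapKey`, `Impressions` and `mayServe` are hypothetical names, and the cross-DC counting that the slide asks for is deliberately out of scope here.

```scala
object FrequencyCap {
  /** One user / ad / campaign combination (hypothetical shape). */
  final case class CapKey(userId: String, adId: String, campaignId: String)

  /** Immutable impression log: key -> impression timestamps in epoch millis. */
  final case class Impressions(seen: Map[CapKey, Vector[Long]] = Map.empty) {
    /** Rule 1: record that the user has seen the ad once more. */
    def record(key: CapKey, atMillis: Long): Impressions =
      copy(seen = seen.updated(key, seen.getOrElse(key, Vector.empty) :+ atMillis))

    /** Number of impressions for `key` with t1 <= timestamp <= t2. */
    def count(key: CapKey, t1: Long, t2: Long): Int =
      seen.getOrElse(key, Vector.empty).count(t => t >= t1 && t <= t2)

    /** Rule 2: serve only while the count within [t1, t2] is below the cap. */
    def mayServe(key: CapKey, t1: Long, t2: Long, maxImpressions: Int): Boolean =
      count(key, t1, t2) < maxImpressions
  }
}
```

Keeping raw timestamps makes the window check trivial but grows without bound; a production service would keep bucketed counts with expiry instead.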
Redis? Broke under the load
Aerospike? Great candidate
Eventuate? Interesting, much lighter
Kafka streams when it's out? Interesting, already in the infra
Flink? Very interesting but...
Cassandra Counters - not applicable for this
Frequency Capping
As a distributed counting microservice
As a key-value store for in-memory caching
Fast reads - Very read heavy
99% reads are < 1 ms latency (sweet)
30,000 writes per second
350,000 reads per second on 7 nodes
Replication factor 2:
Cross datacenter replication (XDC), SSD-backed
A few excellent posts by Dag, Tapad's CTO, on in-memory infrastructure + Ad Tech (see resources slide)
Aerospike
CRDT: Conflict Free Replicated Data Type
State-based: objects require only eventual communication between pairs of replicas
Operation-based: replication requires reliable broadcast communication with delivery in a well-defined delivery order
Both are guaranteed to converge towards a common, correct state
Keeping replicas available for writes during a network partition requires resolving conflicting writes when the partition heals
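As an illustration of the state-based flavor, here is a minimal grow-only counter (G-Counter). It is a textbook sketch written for this transcript, not code from the talk or from any library, and the replica ids in the usage note are made up.

```scala
object GCounterSketch {
  /** State-based grow-only counter: one non-negative slot per replica id. */
  final case class GCounter(slots: Map[String, Long] = Map.empty) {
    /** Local increment; only the owning replica's slot changes. */
    def increment(replicaId: String, delta: Long = 1L): GCounter = {
      require(delta >= 0, "a G-Counter can only grow")
      copy(slots = slots.updated(replicaId, slots.getOrElse(replicaId, 0L) + delta))
    }

    /** The counter value is the sum over all replica slots. */
    def value: Long = slots.values.sum

    /** Merge takes the per-replica maximum, which is commutative,
      * associative and idempotent, so replicas converge in any order. */
    def merge(other: GCounter): GCounter =
      GCounter((slots.keySet ++ other.slots.keySet).map { id =>
        id -> math.max(slots.getOrElse(id, 0L), other.slots.getOrElse(id, 0L))
      }.toMap)
  }
}
```

Two replicas, say "us-east" and "eu-west", can each keep incrementing during a partition; merging their states in either order afterwards yields the same total, with no conflict to resolve.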
Eventuate
A toolkit for building distributed, HA & partition-tolerant event-sourced applications. Developed by Martin Krasser (@mrt1nz) for Red Bull Media (open source)
Interactive, automated conflict resolution (via op-based CRDTs)
Separates command side of an app from its query side (CQRS)
Primary Goals: preserving causality, idempotency & event ordering guarantees even under chaotic conditions
AP of CAP - conflicts cannot be prevented & must be resolved.
Causality - tracked with Vector Clocks
Adapters provide connectivity to other stream processing solutions
Can currently choose Cassandra if desired
Kafka coming soon!
Replication of application state through async event replication across locations
Locations consume replicated events to re-construct application state locally
Multiple locations concurrently update as multi-master
Eventuate as Distributed CRDT Microservice
Applications can continue writing to a local replica during a network partition
Pass To Pipeline: -> To Cassandra, -> To Kafka (soon)
import scala.concurrent.Future
import akka.actor.{ActorRef, ActorSystem}
import com.rbmhtechnology.eventuate.crdt.{CRDTServiceOps, Counter, CounterService}

class CappingService(val id: String, override val log: ActorRef)
    (implicit val system: ActorSystem,
     val integral: Integral[Int],
     override val ops: CRDTServiceOps[Counter[Int], Int])
  extends CounterService[Int](id, log) {

  /** Increment only op: adds `delta` to the counter identified by `id`
    * and returns the updated counter value. */
  def increment(id: String, delta: Int): Future[Int] = value(id) flatMap {
    case v if v >= 0 && (delta > 0 || delta > v) => update(id, delta)
    case v => Future.successful(v)
  }

  start()
}

val a = new CappingService(id1, eventLog)
a.increment(id1, 3)   // Future(3) 3 impressions
a.value(id1)          // Future(3) 3 impressions
a.increment(id1, -2)  // increments only, idempotent.

val b = new CappingService(id2, eventLog)
b.value(id1)          // Future(a.value(id1))

Knows the same count over n-instances, all geo-locations, for the same id

class CounterService[A : Integral](val replicaId: String, val log: ActorRef) {
  def value(id: String): Future[A] = { ... }
  def update(id: String, delta: A): Future[A] = { ... }
}
Eventuate
Eventuate Takeaway
It's just a jar!
OOTB async internal component messaging and fault tolerance
Integrate with relevant microservices
No store/cache cluster to deploy, just keep monitoring your apps
Written in Scala
Built on Akka - a toolkit for building highly concurrent, distributed, and resilient event-driven applications on the JVM
Analytics & ML
Refresher: Sampling of RTB Events
Ad Request
Bid Request - JSON 100 bytes
Compute optimal bid for advertiser
Bid Response - JSON 1000 bytes (may include ad metadata)
Win Notification (may or may not exist) with settlement price
Ad Impression - when the ad is viewed
Ad Click
Ad Conversion
OpenRTB: objects in the Bid Request model
Top-K highest-performing campaigns
Number of views served in the last 7 days, by country, by city
What determined successful ad conversions
Age distribution per campaign
Streaming Analytics
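A Top-K query of this kind reduces to count, sort, take. The sketch below works on an in-memory batch and assumes that "high performing" means "most conversions", which is an assumption made for illustration; in the pipeline described in the talk this would be a windowed aggregation over a stream.

```scala
object TopKCampaigns {
  /** Returns the k campaign ids with the most conversions, highest first.
    * `conversions` holds one campaign id per conversion event. */
  def topK(conversions: Seq[String], k: Int): Seq[(String, Int)] =
    conversions
      .groupBy(identity)
      .map { case (campaign, hits) => campaign -> hits.size }
      .toSeq
      .sortBy { case (campaign, n) => (-n, campaign) } // ties broken by id for determinism
      .take(k)
}
```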
Spark Streaming Kafka
class KafkaStreamingActor(ssc: StreamingContext) extends MyAggregationActor {
  val stream = KafkaUtils.createDirectStream(...).map(RawData(_))

  // Persist the raw events to FiloDB
  stream.foreachRDD(_.toDF.write.format("filodb.spark")
    .option("dataset", "rawdata")
    .save())

  /* Pre-aggregate data in the stream for fast querying and aggregation later */
  stream.map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
    .saveToCassandra(timeseriesKeyspace, dailyPrecipTable)
}
Can write to Cassandra, FiloDB...
Machine Learning
Train on 1+ week of data for
Recommendations
Bid Optimization
Campaign Optimization
Consumer Profiling
...and much more
Machine Learning
The probability of an ad, from a specific ISP, OS, website, demographic, etc. resulting in a conversion
Which attributes of impressions are good predictors of better ad performance?
Bid Optimization & Predictive Models
Which impressions should an Advertiser bid for?
Per campaign, per country it may run in..?
What is the best bid for each impression?
Machine Learning
[Diagram labels:]
Train the model
Train on every bid request attribute
Score bid requests
Determine value of bid request
Compute optimal bid price
Based on Campaign Objectives
Against Budget
Send bid decision to bidder
Spark Streaming, MLLib & FiloDB
val ssc = new StreamingContext(sparkConf, Seconds(5))
val kafkaStream = KafkaUtils.createDirectStream[..](..)
  .map(transformFunc)
  .map(LabeledPoint.parse)
kafkaStream.foreachRDD(_.toDF.write.format("filodb.spark")
  .option("dataset", "training").save())
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(weights))
// trainOn returns Unit, so it is called on the model rather than chained into the val
model.trainOn(dataStream.join(historicalEvents))
model.predictOnValues(dataStream.map(lp => (lp.label, lp.features)))
  .insertIntoFilo("predictions")
700 Queries Per Second: Spark Streaming & FiloDB
Even for datasets with 15 million rows! Using FiloDB's InMemoryColumnStore
Single host / MBP
5GB RAM
SQL to DataFrame caching
https://github.com/tuplejump/FiloDB
Evan Chan's (@velvia) blog post
NoLambda: A new architecture combining streaming, ad hoc, machine-learning, and batch analytics
Eventually Consistent Across DCs
[Architecture diagram, repeated. Regions: US-East-1 and EU-west-1, connected by MirrorMaker. Kafka Cluster Per Region, each with ZK. Publishers and Subscribers: RTB microservices and Mgmt microservices. Layers: Query Layer and Compute Layer. Clusters: Analytics & ML Cluster and Timeseries Cluster, each running Spark Streaming & ML with Cassandra (Cross DC Replication, Topology Aware)]
Self-Healing Systems
Massive event spikes & bursty traffic
Fast producers / slow consumers
Network partitioning & out of sync systems
DC down
Not DDOS'ing ourselves from fast streams
No data loss when auto-scaling down
Byzantine Fault Tolerance?
Looks like I'll miss standup
Everything fails, all the time
Monitor Everything
Resources
Non-Monotonic Snapshot Isolation: scalable and strong consistency for geo-replicated transactional systems
Conflict-free Replicated Data Types
Implementing operation-based CRDTs
http://codebetter.com/gregyoung/2010/02/16/cqrs-task-based-uis-event-sourcing-agh
http://martinfowler.com/bliki/CQRS.html
http://github.com/openrtb/OpenRTB
http://akka.io
http://rbmhtechnology.github.io/eventuate
https://github.com/RBMHTechnology/eventuate
http://rbmhtechnology.github.io/eventuate/user-guide.html#commutative-replicated-data-types
http://www.planetcassandra.org/data-replication-in-nosql-databases-explained
http://wikibon.org/wiki/v/Optimizing_Infrastructure_for_Analytics-Driven_Real-Time_Decision_Making
twitter.com/helenaedelson
github.com/helena
slideshare.net/helenaedelson
Thanks!