SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
Speed-of-thought Analytics for Ad Impressions using Spark 2.0 and SnappyData
www.snappydata.io
Jags Ramnarayan – CTO, Co-founder @ SnappyData
Our Pedigree
SnappyData (spin-out)
● New Spark-based open source project started by Pivotal GemFire founders+engineers
● Decades of in-memory data management experience
● Focus on real-time, operational analytics: Spark inside an OLTP+OLAP database
Funded by Pivotal, GE, GTD Capital
Mixed workloads are common – Why is Lambda the answer?
Lambda Architecture is Complex
• Complexity: learn and master multiple products – data models, disparate APIs, configs
• Slower
• Wasted resources
Source: Highscalability.com
Can we simplify & optimize?
Perhaps a single clustered DB that can manage stream state, transactional data and run OLAP
queries?
Ad Analytics using Lambda
Ad Impression Analytics: Ad Network Architecture – Analyze log impressions in real time
Ad Impression Analytics
Ref - https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/
Bottlenecks in the write path
- Stream micro-batches in parallel from Kafka to each Spark executor
- Filter and collect events for 1 minute; reduce to 1 event per Publisher, Geo every minute
- Execute GROUP BY … expensive Spark shuffle …
- Shuffle again in the DB cluster … data format changes … serialization costs
val input: DataFrame = sqlContext.read
  .options(kafkaOptions).format("kafkaSource")
  .stream()

val result: DataFrame = input
  .where("geo != 'UnknownGeo'")
  .groupBy(window(col("event-time"), "1 minute"), col("Publisher"), col("geo"))
  .agg(avg("bid"), count("geo"), countDistinct("cookie"))

val query = result.write.format("org.apache.spark.sql.cassandra")
  .outputMode("append")
  .startStream("dest-path")
Bottlenecks in the Write Path
• Aggregations – GroupBy, MapReduce
• Joins with other streams, reference data
• Shuffle costs (copying, serialization)
• Excessive copying in Java-based scale-out stores
• Impedance mismatch with KV stores
We want to run interactive “scan”-intensive queries:
- Find total uniques for Ads grouped on geography
- Impression trends for Advertisers (group-by query)
Two alternatives: row-stores vs. column stores
Goal – Localize processing
Row stores: fast key-based lookup, but too slow to run aggregations and scan-based interactive queries
Column stores: fast scans and aggregations, but updates and random writes are very difficult, and they consume too much memory
Why columnar storage in-memory?
Source: MonetDB
How did Spark 2.0 do?
- Spark 2.0, MacBook Pro, 4-core 2.8 GHz Intel i7, enough RAM
- Data set: 105 million records
Query: select AVG(bid) from AdImpressions
- Parquet files in OS buffer cache: ~3 seconds
- Managed in Spark memory: ~600 milliseconds
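A minimal sketch of how these two timings can be reproduced, assuming a local Spark 2.x session and an illustrative Parquet path (not from the slides):

// Sketch: timing the same aggregate over Parquet vs. Spark's in-memory cache
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[4]").appName("avg-bid").getOrCreate()
val df = spark.read.parquet("/data/adImpressions")      // assumed path, ~105M records
df.createOrReplaceTempView("AdImpressions")
spark.sql("select AVG(bid) from AdImpressions").show()  // cold: Parquet scanned via the OS buffer cache (~3 s on the slide)
df.cache().count()                                      // materialize into Spark's columnar cache
spark.sql("select AVG(bid) from AdImpressions").show()  // warm: in-memory columnar scan (~600 ms on the slide)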
Spark 2.0 query plan – what is different?
Scan over 105 million Integers
Shuffle results from each partition so we can compute Avg across all partitions - is cheap in this case … only 11 partitions
select AVG(ArrDelay) from airline
Whole Stage Code Generation
Typical query engine design:
- Each operator implemented using functions
- And functions imply chasing pointers … expensive
Code generation in Spark:
- Remove virtual function calls
- Arrays and variables instead of objects
- Capitalize on modern CPU cache
Operators in the plan: Aggregate, Filter, Scan, Project
How to remove complexity? Add a layer. How to improve perf? Remove a layer.
Filter() {
  getNextRow() {
    row = scan.getNextRow()      // pull from the child operator
    if (filter condition is true) return row
  }
}
Scan() {
  getNextRow() {
    return row from fileInputStream
  }
}
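A hedged illustration in plain Scala (not Spark internals) of why removing the layer helps: the first version walks a chain of operator objects with one virtual call per row, the second is the kind of single fused loop that whole-stage code generation emits.

// Iterator ("Volcano") style: one virtual call per operator per row
trait Operator { def next(): Option[Long] }
class Scan(data: Array[Long]) extends Operator {
  private var i = 0
  def next(): Option[Long] = if (i < data.length) { i += 1; Some(data(i - 1)) } else None
}
class Filter(child: Operator, p: Long => Boolean) extends Operator {
  def next(): Option[Long] = {
    var row = child.next()
    while (row.isDefined && !p(row.get)) row = child.next()
    row
  }
}

// Fused style: roughly what generated code collapses the Scan -> Filter -> Aggregate pipeline into
def fusedCount(data: Array[Long], p: Long => Boolean): Long = {
  var count = 0L
  var i = 0
  while (i < data.length) { if (p(data(i))) count += 1; i += 1 }
  count
}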
Good enough? Hitting the CPU Wall?
select count(*), advertiser
from history t1, current t2, adimpress t3
where <t1 join t2 join t3>
group by geo
order by count desc limit 8
Distributed Joins can be very expensive
[Chart: response time in seconds (0–200) vs. concurrency (1 and 10 concurrent queries) for the distributed join above]
Challenges with In-memory Analytics
- DRAM is still relatively expensive for the deluge of data
- Analytics in the cloud requires fluid data movement – how do you move large volumes to/from clouds?
• Most apps happy to tradeoff 1% accuracy for 200x speedup!
• Can usually get a 99.9% accurate answer by only looking at a tiny fraction of data!
• Often can make perfectly accurate decisions without having perfectly accurate answers!
• A/B testing, visualization, ...
• The data itself is usually noisy
• Processing entire data doesn’t necessarily mean exact answers!
• Inference is probabilistic anyway
Use statistical techniques to shrink data?
SnappyData
A Hybrid Open source system for Transactions, Analytics, Streaming
(https://github.com/SnappyDataInc/snappydata)
SnappyData – In-memory Hybrid DB with Spark
A Single Unified Cluster: OLTP + OLAP + Streaming
for real-time analytics
Spark: batch design, high throughput – rapidly maturing
Snappy store: real-time design – low latency, HA, concurrency – matured over 13 years
Vision: drastically reduce the cost and complexity in modern big data
SnappyData fuses speed + serving layer
[Diagram: Snappy Data Server – Spark Executor + Store]
- Process, store streams from Kafka
- Batch compute; reference data; lazy write, fetch on demand to/from RDB and HDFS
- In-memory compute and state: current operational data, history data, synopses data
- External data: S3, RDB, XML, …
- Spark API ++ – Java, Scala, Python, R, REST
- Interactive analytic queries
Realizing ‘speed-of-thought’ Analytics
[Diagram: Snappy Data Server – Spark Executor + Store]
- Stream processing: Kafka queue (partition) → process with a Spark or SQL program
- Hybrid store: rows, columnar, index, synopses; random writes; overflow to local persist
- Batch compute over RDB (reference data), HDFS, MPP DB
- In-memory compute, state
- Spark API ++ – Java, Scala, Python, R, REST
- Interactive analytic queries
Fast, fewer resources, flexible
• Fast
  - Stream/ingested data colocated on a shared key
  - Tables colocated on a shared key
  - Far less copying, serialization
  - Improvements to vectorization (20X faster than Spark)
• Use less memory, CPU
  - Maintain only “hot/active” data in RAM
  - Summarize all data using synopses
• Flexible
  - Spark. Enough said.
Features
- Deeply integrated database for Spark
  - 100% compatible with Spark
  - Extensions for transactions (updates), SQL stream processing
  - Extensions for high availability
  - Approximate query processing for interactive OLAP
- OLTP+OLAP store
  - Replicated and partitioned tables
  - Tables can be row- or column-oriented (in-memory & on-disk)
  - SQL extensions for compatibility with the SQL standard – create table, view, indexes, constraints, etc.
TPC-H: 10X-20X faster than Spark 2.0
Interactive-Speed Analytic Queries – Exact or Approximate
Select avg(Bid), Advertiser from T1 group by Advertiser
Select avg(Bid), Advertiser from T1 group by Advertiser with error 0.1
[Chart: Speed/accuracy tradeoff – Error (%) vs. execution time; executing on the entire dataset takes ~30 mins, interactive queries target ~2 secs; e.g. 100 secs exact vs. 2 secs with 1% error]
Query execution with accuracy guarantee
1. Parse query
2. Can the query be executed on the cache (recent time window, computable from samples, within error constraints)?
   - Yes: in-memory execution with error bar → response
   - No (point query on history, outlier query, very complex query): execute on the base table in parallel → response
Synopses Data Engine Features
• Support for uniform sampling
• Support for stratified sampling
  - Solutions exist for stored data (BlinkDB)
  - SnappyData works for infinite streams of data too
• Support for exponentially decaying windows over time
• Support for synopses – Top-K queries, heavy hitters, outliers, ...
• [future] Support for joins
• Workload mining (http://CliffGuard.org)
Uniform (Random) Sampling
Original Table:
ID  Advertiser  Geo  Bid
1   adv10       NY   0.0001
2   adv10       VT   0.0005
3   adv20       NY   0.0002
4   adv10       NY   0.0003
5   adv20       NY   0.0001
6   adv30       VT   0.0001

Uniform Sample:
ID  Advertiser  Geo  Bid     Sampling Rate
3   adv20       NY   0.0002  1/3
5   adv20       NY   0.0001  1/3

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'
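A small sketch of the same idea with the plain Spark DataFrame API (the adImpressions DataFrame and seed are assumptions, not from the slides); note that a small uniform sample can miss the rare VT rows entirely, which is what motivates stratified sampling below.

import org.apache.spark.sql.functions.avg
// adImpressions: DataFrame with columns id, advertiser, geo, bid (assumed)
val sample = adImpressions.sample(withReplacement = false, fraction = 1.0 / 3, seed = 42L)
sample.where("geo = 'VT'").agg(avg("bid")).show()   // may return null if no VT row was sampled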
Uniform (Random) Sampling
Original Table (same as above):
ID  Advertiser  Geo  Bid
1   adv10       NY   0.0001
2   adv10       VT   0.0005
3   adv20       NY   0.0002
4   adv10       NY   0.0003
5   adv20       NY   0.0001
6   adv30       VT   0.0001

Larger Uniform Sample:
ID  Advertiser  Geo  Bid     Sampling Rate
3   adv20       NY   0.0002  2/3
5   adv20       NY   0.0001  2/3
1   adv10       NY   0.0001  2/3
2   adv10       VT   0.0005  2/3

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'
Stratified Sampling
Original Table:
ID  Advertiser  Geo  Bid
1   adv10       NY   0.0001
2   adv10       VT   0.0005
3   adv20       NY   0.0002
4   adv10       NY   0.0003
5   adv20       NY   0.0001
6   adv30       VT   0.0001

Stratified Sample on Geo:
ID  Advertiser  Geo  Bid     Sampling Rate
3   adv20       NY   0.0002  1/4
2   adv10       VT   0.0005  1/2

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'
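For comparison, stock Spark can build a per-stratum sample with sampleBy (the fractions here are illustrative); SnappyData's synopses engine maintains such stratified samples declaratively and keeps them up to date on streams, but the sketch shows the underlying idea.

import org.apache.spark.sql.functions.avg
// Keep 1/4 of the common NY stratum and 1/2 of the rare VT stratum
val fractions = Map("NY" -> 0.25, "VT" -> 0.5)
val stratified = adImpressions.stat.sampleBy("geo", fractions, seed = 42L)
stratified.where("geo = 'VT'").agg(avg("bid")).show()   // VT stays represented in the sample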
Sketching techniques
● Sampling is not effective for outlier detection (MAX/MIN etc.)
● Other probabilistic structures: CMS (count-min sketch), heavy hitters, etc.
● SnappyData implements Hokusai – capturing item frequencies in time series
● Design permits TopK queries over arbitrary time intervals (Top-100 popular URLs):
SELECT pageURL, count(*) frequency FROM Table
WHERE …
GROUP BY …
ORDER BY frequency DESC
LIMIT 100
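SnappyData's Hokusai-based Top-K structures are built in; a hedged approximation of the same heavy-hitters idea with stock Spark is a Count-Min Sketch over the URL column (the pageViews DataFrame and the eps/confidence parameters are illustrative assumptions).

// Approximate per-URL frequencies without storing an exact count for every URL
val cms = pageViews.stat.countMinSketch("pageURL", eps = 0.001, confidence = 0.99, seed = 42)
val hits = cms.estimateCount("http://example.com/landing")   // point estimate; may overcount, never undercounts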
Synopses Data Engine Demo
[Diagram: Zeppelin Server → Zeppelin Spark Interpreter (Driver) → Spark Executor JVMs, each with a row cache and a compressed columnar store]
Unified OLAP/OLTP streaming w/ Spark
● Far fewer resources: a TB problem becomes GB
  ○ CPU contention drops
● Far less complex
  ○ a single cluster for stream ingestion, continuous queries, interactive queries and machine learning
● Much faster
  ○ compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
www.snappydata.io
SnappyData is Open Source
● Ad Analytics example/benchmark: https://github.com/SnappyDataInc/snappy-poc
● https://github.com/SnappyDataInc/snappydata
● Learn more: www.snappydata.io/blog
● Connect:
  ○ twitter: www.twitter.com/snappydata
  ○ facebook: www.facebook.com/snappydata
  ○ slack: http://snappydata-slackin.herokuapp.com
EXTRAS
Use Case Patterns
1. Operational analytics DB
   - Caching for analytics over disparate sources
   - Federate queries between samples and the backend
2. Stream analytics for Spark
   - Process streams, transform, real-time scoring, store, query
3. In-memory transactional store
   - Highly concurrent apps, SQL cache, OLTP + OLAP
How SnappyData Extends Spark
Snappy Spark cluster deployment topologies:
• Unified cluster – Snappy store and Spark Executor share the JVM memory; reference-based access – zero copy
• Split cluster – SnappyStore is isolated but uses the same column format as Spark for high throughput
Simple API – Spark Compatible
● Access tables as DataFrames; the catalog is automatically recovered
● Store RDD[T]/DataFrame in SnappyData tables
● Access from remote SQL clients
● Additional API for updates, inserts, deletes
// Save a DataFrame using the Snappy or Spark context
context.createExternalTable("T1", "ROW", myDataFrame.schema, props)

// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")

val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
// … now use any of the DataFrame APIs …
Extends Spark:
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name ( <column definition> )
USING 'JDBC | ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name',                // default none
  PARTITION_BY 'PRIMARY KEY | column name',  // will be a replicated table by default
  REDUNDANCY '1',                            // manage HA
  PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",  // empty string maps to the default disk store
  OFFHEAP "true | false",
  EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
  …
) [AS select_statement];
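A hedged usage sketch of this DDL (table and column names are illustrative, and snappyContext is assumed to be a SnappyContext/SnappySession): colocating the impressions column table with its campaign reference table so joins on the shared key stay node-local.

snappyContext.sql("""CREATE TABLE campaigns (campaign_id INT PRIMARY KEY, advertiser STRING)
  USING row OPTIONS (PARTITION_BY 'campaign_id', REDUNDANCY '1')""")
snappyContext.sql("""CREATE TABLE adImpressions (campaign_id INT, geo STRING, bid DOUBLE)
  USING column OPTIONS (PARTITION_BY 'campaign_id', COLOCATE_WITH 'campaigns', REDUNDANCY '1')""")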
Simple to Ingest Streams using SQL
Consume from stream
Transform raw data
Continuous Analytics
Ingest into in-memory Store
Overflow table to HDFS
create stream table AdImpressionLog (<columns>) using directkafka_stream options (
  <socket endpoints>,
  "topics 'adnetwork-topic'",
  "rowConverter 'AdImpressionLogAvroDecoder'")

// Register a continuous query and write each result batch to the column table
streamingContext.registerCQ(
  "select publisher, geo, avg(bid) as avg_bid, count(*) imps, count(distinct(cookie)) uniques " +
  "from AdImpressionLog window (duration '2' seconds, slide '2' seconds) " +
  "where geo != 'unknown' group by publisher, geo")
  .foreachDataFrame(df => {
    df.write.format("column").mode(SaveMode.Append)
      .saveAsTable("adImpressions")
  })
Unified Cluster Architecture
How do we extend Spark for real time?
• Spark Executors are long-running; a driver failure doesn't shut down the Executors
• Driver HA – drivers run “managed” with a standby secondary
• Data HA – consensus-based clustering integrated for eager replication
How do we extend Spark for real time? (continued)
• Bypass the scheduler for low-latency SQL
• Deep integration with Spark Catalyst (SQL) – colocation optimizations, index use, etc.
• Full SQL support – persistent catalog, transactions, DML
AdImpression Demo – Spark, SQL code walkthrough, interactive SQL
Concurrent Ingest + Query Performance
• AWS: 4 c4.2xlarge instances – 8 cores, 15 GB memory each
• Each node ingests the stream from Kafka in parallel
• Parallel batch writes to the store (32 partitions)
• Only a few cores are used for stream writes, as most of the CPU is reserved for OLAP queries
[Chart: stream ingestion rate (per-second throughput, 0–700,000) on 4 nodes with a cap on CPU to allow for queries, comparing Spark-Cassandra, Spark-InMemoryDB and SnappyData]
https://github.com/SnappyDataInc/snappy-poc
2X – 45X faster (vs Cassandra, MemSQL)
Concurrent Ingest + Query Performance
Sample “scan”-oriented OLAP query (Spark SQL) performance, executed while ingesting data:
select count(*) AS adCount, geo from adImpressions group by geo order by adCount desc limit 20;

Q1 response time (millis) at 30M / 60M / 90M ingested rows:
  Spark-Cassandra:    20346 / 65061 / 93960
  Spark-InMemoryDB:    3649 /  5801 /  7295
  SnappyData:          1056 /  1571 /  2144

https://github.com/SnappyDataInc/snappy-poc
2X – 45X faster