Meet Druid!
Real-time analytics with Druid at Appsflyer
Appsflyer Flow
[Diagram: the attribution flow from Publisher through Click and Install back to the Advertiser]
Appsflyer as a Marketing Platform
● Fraud detection
● Statistics
● Attribution
● Lifetime value
● Retargeting
● Prediction
● A/B testing
Appsflyer Technology
● ~8B events / day
● Hundreds of machines in Amazon
● Tens of micro-services
[Diagram: events flow through Apache Kafka into tens of micro-services, which persist data to Amazon S3, MongoDB, Redshift, and Druid]
Realtime
● New buzzword
● Ingestion latency - seconds
● Query latency - seconds
Analytics
● Roll-up
○ Summarizing over a dimension (e.g., rolling hourly counts up to daily totals)
● Drill-down
○ Focusing (zooming in)
● Slicing and dicing
○ Reducing dimensions (slice)
○ Picking values of specific dimensions (dice)
● Pivoting
○ Rotating the multi-dimensional cube
Analytics in 3D
We tried...
● MongoDB
○ Operational issues
○ Performance is not great
● Redshift
○ Concurrency limits
● Aurora (MySQL)
○ Aggregations are not optimized
● MemSQL
○ Insufficient performance
○ Too pricey
● Cassandra
○ Not flexible
Druid
● Storage optimized for analytics
● Lambda architecture inside
● JSON-based query language
● Developed by an analytics SaaS company
● Free and open source
● Scalable to petabytes...
Druid Storage
● Columnar
● Inverted index
● Immutable segments
Columnar Storage
● Original data: 100 MB
● Only the queried columns are read: 10 MB
● After compression: 3 MB
Index
● Values are dictionary-encoded
{“USA” -> 1, “Canada” -> 2, “Mexico” -> 3, …}
● Bitmap for every dimension value (used by filters)
“USA” -> [0, 1, 0, 0, 1, 1, 0, 0, 0]
● Column values (used by aggregation queries)
[2, 1, 3, 15, 1, 1, 2, 8, 7]
Note how the bitmap for “USA” marks exactly the rows whose encoded column value is 1.
Data Segments
● Per time interval
○ Skip segments when querying
● Immutable
○ Cache friendly
○ No locking
● Versioned (MVCC)
○ No locking
○ Read-write concurrency
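For illustration, a segment identifier encodes the data source, interval, and version timestamp (the naming convention is Druid's; the data source name here is hypothetical):

inappevents_2015-12-01T00:00:00.000Z_2015-12-02T00:00:00.000Z_2016-01-05T12:00:00.000Z

Re-indexing the same interval produces a segment with a newer version, and queries switch to it atomically (MVCC).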
Data Ingestion
[Diagram: real-time data is ingested via streaming and handed off to deep storage; historical data arrives via batch indexing; the Broker answers queries over both paths]
Real-time Ingestion
● Via Real-Time Node and Firehose
○ No redundancy or HA, thus not recommended
● Via Indexing Service and Tranquility API (see the sketch below)
○ Core API
○ Integrations with streaming frameworks
○ HTTP server
○ Kafka consumer
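As a rough sketch, a single event can be pushed through the Tranquility HTTP server with a plain POST (the host, port, and data source are hypothetical; the /v1/post/<dataSource> path follows Tranquility's server API):

~# curl -X POST -H "Content-Type: application/json" -d '{"timestamp": "2016-01-01T12:00:00Z", "app_id": "com.example.app", "country": "US", "monetary": 0.99}' "http://tranquility-host:8200/v1/post/inappevents"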
Batch Ingestion
● File based (HDFS, S3, …)
● Indexers
○ Internal indexer
■ For datasets < 1 GB
○ External Hadoop cluster
○ Spark indexer
■ Work in progress
Ingestion Spec
● Parsing configuration (flat JSON, *SV)
● Dimensions
● Metrics
● Granularity
○ Segment granularity
○ Query granularity
● I/O configuration
○ Where to read data from
● Tuning configuration
○ Indexer tuning
● Partitioning and replication
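Tying these pieces together, a minimal dataSchema sketch might look like this (field names follow Druid 0.8.x; the data source, dimensions, and metrics are assumptions mirroring the sample query later in this deck; ioConfig and tuningConfig are omitted):

{
  "dataSchema": {
    "dataSource": "inappevents",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "iso" },
        "dimensionsSpec": { "dimensions": ["app_id", "country", "media_source", "campaign"] }
      }
    },
    "metricsSpec": [
      { "type": "count", "name": "events_count" },
      { "type": "doubleSum", "name": "revenue", "fieldName": "monetary" }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "MINUTE"
    }
  }
}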
Real-time ingestion
[Diagram: indexing tasks (Task 1, Task 2) cover consecutive intervals along the time axis, each accepting events that arrive within its interval window]
Minimum indexing slots = Data sources x Partitions x Replicas x 2
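For example (hypothetical numbers): 2 data sources × 2 partitions × 2 replicas × 2 = 16 indexing slots. The trailing ×2 covers the hand-off overlap, when tasks for the previous interval are still publishing segments while tasks for the next interval are already running.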
Query Types
● Group by
○ Grouping by multiple dimensions
● Top N
○ Like grouping by a single dimension
● Timeseries
○ Without grouping over dimensions
● Search
○ Dimension values lookup
● Time boundary
○ Find the available data timeframe
● Metadata queries
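For instance, a topN query over a single dimension might look like this (a sketch; the data source and dimension are borrowed from the sample query below):

{
  "queryType": "topN",
  "dataSource": "inappevents",
  "dimension": "media_source",
  "threshold": 10,
  "metric": "events_count",
  "granularity": "all",
  "aggregations": [
    { "type": "count", "name": "events_count" }
  ],
  "intervals": ["2015-12-01T00:00:00.000/2016-01-01T00:00:00.000"]
}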
Tips for Querying
● Prefer topN over groupBy
● Prefer timeseries over topN and groupBy
● Use limits (and priorities) - see the sketch below
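A limit and a priority can be attached to a groupBy query like so (a sketch; "limitSpec" and "context" are standard query fields, the values are assumptions):

"limitSpec": { "type": "default", "limit": 100, "columns": ["events_count"] },
"context": { "priority": 10, "timeout": 60000 }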
Query Spec
● Data source
● Dimensions
● Interval
● Filters
● Aggregations
● Post-aggregations
● Granularity
● Context (query configuration)
● Limit
Sample Query
~# curl -X POST -d@query.json -H "Content-Type: application/json" "http://druidbroker:8082/druid/v2?pretty"

query.json:
{
  "queryType": "groupBy",
  "dataSource": "inappevents",
  "granularity": "hour",
  "dimensions": ["media_source", "campaign"],
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "app_id", "value": "com.comuto" },
      { "type": "selector", "dimension": "country", "value": "RU" }
    ]
  },
  "aggregations": [
    { "type": "count", "name": "events_count" },
    { "type": "doubleSum", "name": "revenue", "fieldName": "monetary" }
  ],
  "intervals": ["2015-12-01T00:00:00.000/2016-01-01T00:00:00.000"]
}
Caching
● Historical node level
○ By segment
● Broker level
○ By segment and query
○ “groupBy” is disabled on purpose!
● By default - local caching
● In production - use memcached (see the sketch below)
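A broker-side sketch of the relevant runtime.properties (the property names come from Druid's cache configuration; the memcached host is hypothetical):

druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
druid.cache.type=memcached
druid.cache.hosts=memcached-host:11211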
Load Rules
● Can be defined
○ On a data source
○ On a “tier”
● What can be set
○ Replication factor
○ Load period
○ Drop period
● Can be used to separate “hot” data from “cold” data (example below)
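For example, rules that keep the last month on a “hot” tier and everything else on the default tier might look like this (a sketch; the replicant counts are assumptions):

[
  { "type": "loadByPeriod", "period": "P1M", "tieredReplicants": { "hot": 2 } },
  { "type": "loadForever", "tieredReplicants": { "_default_tier": 1 } }
]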
Druid Components
[Diagram: Historical Nodes, Real-time Nodes, and Broker Nodes, coordinated by the Coordinator and the Indexing Service (Overlord + Middle Managers), backed by Deep Storage and Metadata Storage; a second view adds the Cache and a Load Balancer in front of the Brokers]
Druid Components (Explained)
● Coordinator
○ Manages segments
● Real-time Nodes
○ Pull data in real time and index it
● Historical Nodes
○ Keep historical segments
● Overlord
○ Accepts tasks and distributes them to Middle Managers
● Middle Manager
○ Executes submitted tasks via Peons
● Broker Nodes
○ Route queries to Real-time and Historical nodes and merge the results
● Deep Storage
○ Segment backup (HDFS, S3, …)
Failover
● Coordinator and Overlord
○ HA
● Real-time nodes
○ Tasks are replicated
○ Pool of nodes
● Historical nodes
○ Data is replicated
○ Pool of nodes
○ All segments are backed up in the deep storage
● Brokers
○ Pool of nodes
○ Load balancer at the front
Druid at Appsflyer
[Diagram: multiple “Druid Sink” services consume events and push them to Druid through the Tranquility API, with S3 as the batch path; the custom sink is probably not needed anymore due to native support in the Tranquility package]
Druid in Production
● Provisioning using Chef
● r3.8xlarge (sample configuration is OK)
● Redundancy for Coordinator and Overlord (node per AZ)
● Historical and real-time nodes are spread across AZs
● LB - Consul from HashiCorp
● Service discovery - Consul again
● Memcached
● Monitoring via the Graphite Emitter extension (config sketch below)
○ https://github.com/druid-io/druid/pull/1978
● Alerting via Sensu
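A sketch of the emitter settings in runtime.properties (property names from the graphite-emitter extension; the Graphite host is hypothetical):

druid.emitter=graphite
druid.emitter.graphite.hostname=graphite.internal
druid.emitter.graphite.port=2003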
IAP Distribution
● 3 different node types (instead of 6)
● Unpack and run
● Some useful wrappers
● Built-in examples for a quick start
● Commercial support
● PlyQL, Pivot inside
http://imply.io
Tips
● ZooKeeper is heavily used
○ Choose appropriate hardware/network for the ZK machines
● Use the latest version (0.8.3)
○ Restartable tasks
○ Indexing time improvement! (https://github.com/druid-io/druid/pull/1960)
○ Data sketches library
● All exceptions are useful
When Not to Choose Druid?
● When the data is not time-series
● When data cardinality is high
● When the number of output rows is high
● When setup costs must be avoided
Non-time-series Workarounds
● You must still have some timestamp
● Rebuild everything to order by your timestamp
● Or use single-dimension partitioning (see the sketch below)
○ Segments are partitioned by timestamp first, then by dimension range
○ Find the optimal target segment size
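In Hadoop batch ingestion this goes in the tuning configuration, roughly like so (a sketch; the partition dimension and target size are assumptions):

"partitionsSpec": {
  "type": "dimension",
  "partitionDimension": "app_id",
  "targetPartitionSize": 5000000
}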
Still, please don’t use Druid for non-time series!
Tools: Pivot
Tools: Panoramix
Thank you!