Down the event-driven road: Experiences of integrating
streaming into analytic data platforms
Dr. Dominik Benz, Head of Machine Learning Engineering, inovex GmbH
Confluent Meetup Munich, 8.10.2018
Integrate existing (batch) data sources?
Check consistency with data sources?
Build realtime data visualizations?
https://flic.kr/p/5eQA7e
https://flic.kr/p/bpFt7U
Down the event-driven road ..
Analytic (Streaming) Data Platforms
Integrating existing (batch) data sources
Checking consistency
Building realtime visualizations
Wrap up & Summary
A typical analytic data platform
ingress → raw → processed → datahub → analysis → egress
Scheduling, orchestration, metadata; user access, system integration, development
› ingress: flat files, databases, APIs, ...
› processing: batch processing (Spark, Hive, ..)
› datahub: (Hive) tables
› scheduling / metadata: Airflow, Hive Metastore
› egress / analysis: SQL, notebooks (Zeppelin, ..)
A typical (?) streaming data platform
ingress → raw → processed → datahub → analysis → egress
Scheduling, orchestration, metadata; user access, system integration, development
› ingress: input data (streams), Kafka Connect
› processing: stream processing (Kafka Streams, Nifi, ..)
› datahub: (Kafka) topics, KTables, ..
› scheduling / metadata: (Confluent) Schema Registry
› egress / analysis: KSQL
Down the event-driven road: Integrating existing (batch) data sources
Integrating web tracking
company website → tracking service (tracking pixel) → raw tracking data

Integrating web tracking: setup / constraints
› Hortonworks-based platform, including Nifi and Confluent Platform
› Apache Airflow established as scheduling / workflow tool, integrated into monitoring, alerting, ..
› Tracking service: currently a batch-oriented API (request data, get download links, ..), but a click event stream is planned
› Developers / analysts with mixed backgrounds w.r.t. programming skills

Apache Nifi in a Nutshell
› drag-and-drop visual definition of data pipelines
› various built-in connectors (file, stream, database, service, ...)
› event-based processing paradigm
› built-in queues, data provenance, backpressure handling, registry, ...
› focus: ingest & lightweight (!) transformation
› not a complex event processor (like Kafka Streams, Flink, Spark Streaming, ...)
› integrated into the HDP stack

Apache Airflow in a nutshell
› python library to define & schedule batch workflows
› programmatic specification of a "DAG" (= tasks + dependencies)
› clean handling of job run metadata (success, duration, ..)
› developed by Airbnb, open-sourced 2015
› built-in standard operators (bash, hive, spark, kubernetes, ..)
› easily extendible (custom operators, ..)
› once used -> never Oozie again :-)
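The "DAG = tasks + dependencies" model described above can be sketched without Airflow itself. Below is a toy scheduler (all names hypothetical, deliberately not the Airflow API) that runs callables in dependency order and records per-task status, mimicking the job-run metadata handling mentioned in the bullets:

```python
# Toy illustration of the "DAG = tasks + dependencies" idea --
# a sketch, NOT the Airflow API.

def run_dag(tasks, dependencies):
    """Run callables in topological order; return a status per task.

    tasks:        {name: callable}
    dependencies: {name: [upstream task names]}
    """
    done, status = set(), {}
    while len(done) < len(tasks):
        progressed = False
        for name, func in tasks.items():
            if name in done:
                continue
            if all(up in done for up in dependencies.get(name, [])):
                try:
                    func()                  # run the task
                    status[name] = "success"
                except Exception:
                    status[name] = "failed"
                done.add(name)
                progressed = True
        if not progressed:
            raise ValueError("cycle in DAG")
    return status

# the hourly tracking workflow from the talk, as dummy tasks
log = []
tasks = {
    "trigger_download": lambda: log.append("trigger"),
    "fetch_links":      lambda: log.append("fetch"),
    "store_data":       lambda: log.append("store"),
}
deps = {"fetch_links": ["trigger_download"], "store_data": ["fetch_links"]}
status = run_dag(tasks, deps)
print(status)
```

A real Airflow DAG would express the same dependencies with operators and `>>` chaining, and add scheduling, retries, and the metadata database on top.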
Integrating web tracking: options
tracking service → tracking data

Option: Airflow only
  + integrated into monitoring, ..
  + job status handling, reloading
  - not prepared for future stream API
  - handling file content complicated
Option: Unified abstraction (e.g. Apache Beam)
  + one model for batch / stream ingest
  - comparatively high entry barrier
Option: Nifi only
  + visual pipeline definition
  + easy handling of file content
  + event-based paradigm
  + operators available
  - custom status handling, reloading
Option: Kafka Connect
  + fault-tolerant
  + scalable setup
  - custom connector coding
  - custom status handling, reloading
Integrating web tracking: chosen solution – Airflow + Nifi
› Combines advantages of Airflow & Nifi
› Prepared for future streaming API
› Integrated into monitoring, alerting, ..
› Status handling / reloading easy
tracking service: trigger (hourly) download → check status (sensors) → trigger, fetch download links → download, process, store data
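The hourly flow above (trigger, poll the status via sensors, fetch the download links, then download and store) boils down to a poll-until-ready pattern. A hedged sketch with a faked tracking-service API; every function and URL here is a hypothetical stand-in, not the real service:

```python
import time

# Hypothetical stand-in for the batch-oriented tracking-service API:
# an export job is triggered, polled until ready, then its links fetched.
_state = {"polls": 0}

def trigger_export():                 # "trigger (hourly) download"
    _state["polls"] = 0
    return "job-42"                   # made-up job id

def job_ready(job_id):                # what an Airflow sensor would poll
    _state["polls"] += 1
    return _state["polls"] >= 3       # pretend it is ready on the 3rd poll

def fetch_download_links(job_id):     # "trigger, fetch download links"
    return [f"https://example.invalid/{job_id}/part-{i}" for i in range(2)]

def run_hourly_batch(max_polls=10, wait_s=0):
    job_id = trigger_export()
    for _ in range(max_polls):        # the sensor loop
        if job_ready(job_id):
            return fetch_download_links(job_id)
        time.sleep(wait_s)
    raise TimeoutError("export never became ready")

links = run_hourly_batch()
print(links)
```

In the chosen solution, Airflow owns this loop (its sensor operators are exactly this poll-and-wait), while the actual download and content processing of each link is handed to Nifi.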
Down the event-driven road: Checking consistency
Checking consistency: Customer Consent
customer portal → grants / revokes consent → customer (consent) database (stores consent)
database → writes consent to Hive / database → Kafka consent event: in sync?
https://flic.kr/p/9yHuk8

Checking consistency: setup / constraints
› Analysts need an up-to-date version of customer consent information in the platform
› Hard correctness requirements (especially regarding revoked consent)
› Continuous monitoring of correctness
› Alerting in case of differences
Checking Consistency: Statistics Events
customer portal → kafka
› use existing channel (kafka)
› source injects periodic "statistics events" into the stream, with a defined measure point (in time)
{type: GRANT, cid: 12, ts: 2018-10-01 11:00:00, ..}
{type: GRANT, cid: 10, ts: 2018-10-01 11:01:00, ..}
{type: REVOK, cid: 09, ts: 2018-10-01 11:01:05, ..}
{type: STAT, measure_ts: 2018-10-01 11:01:20, stats: {num_consent_v1: 72625, num_consent_v2: 6252, ..}}
(events ordered by time)
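Building the periodic STAT event can be sketched in plain Python (no Kafka client involved; the stats keys below are simplified stand-ins for the num_consent_* counters on the slide):

```python
from collections import Counter

def make_stat_event(events, measure_ts):
    """Summarize all consent events up to measure_ts into a STAT event.

    Timestamps are 'YYYY-MM-DD HH:MM:SS' strings, so lexicographic
    comparison matches chronological order.
    """
    counts = Counter(e["type"] for e in events if e["ts"] <= measure_ts)
    return {
        "type": "STAT",
        "measure_ts": measure_ts,
        # simplified stand-ins for the slide's num_consent_* counters
        "stats": {"num_grant": counts["GRANT"], "num_revoke": counts["REVOK"]},
    }

stream = [
    {"type": "GRANT", "cid": 12, "ts": "2018-10-01 11:00:00"},
    {"type": "GRANT", "cid": 10, "ts": "2018-10-01 11:01:00"},
    {"type": "REVOK", "cid": 9,  "ts": "2018-10-01 11:01:05"},
]
stat = make_stat_event(stream, "2018-10-01 11:01:20")
print(stat)  # stats: {'num_grant': 2, 'num_revoke': 1}
```

In production this runs at the source (the customer portal side) and the resulting STAT event is produced into the same Kafka topic as the consent events themselves.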
Checking Consistency: Evaluate Statistics Event
› perform count on target side (Hive) up to $measurePoint
› compare counts
› counts = simple plausibility check, but more elaborate checks (hashes) thinkable
{type: STAT, measure_ts: 2018-10-01 11:01:20, stats: {num_consent_v1: 72625, num_consent_v2: 6252, ..}}
in sync?
{measure_ts: 2018-10-01 11:01:20, hive_stats: {num_consent_v1: 72625, num_consent_v2: 6252, ..}}
Customer (consent) database
Down the event-driven road: Building realtime visualizations
Realtime visualizations: Online Shop Purchases
online shop → JMS (purchase event) → normalization, filtering, aggregation, .. → realtime dashboard
https://flic.kr/p/9yHuk8

Realtime visualizations: setup / constraints
› Goal: timely insights into various purchase aspects (items bought last 5 min, ..)
› flexible / configurable frontend (time window, aggregation dimension, ..)
› scalable to 100s / 1000s of dashboard users
› low latency of dashboard backend
Realtime visualizations: components / options
Layers: JMS → processing → transport layer → service backend → service API

› Option 1 (aggregation during processing):
  Kafka Connect → Kafka Streams → Kafka → Kafka Connect → HBase → Phoenix / JDBC → Spring Boot
› Option 2 (built-in, configurable aggregation):
  Nifi → Kafka → Tranquility → Druid → Spring Boot
› Option 3 (aggregation at query time):
  Nifi → Kafka → Kafka Connect → HBase → Phoenix / JDBC → Spring Boot
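Where the aggregation runs is the real difference between the three options; the aggregation itself is simple. A minimal tumbling-window count ("items bought per 5 minutes") in plain Python, the kind of logic Kafka Streams would run continuously in Option 1 or Druid would apply at ingest in Option 2:

```python
from collections import defaultdict

WINDOW_S = 300  # 5-minute tumbling windows, matching "items bought last 5min"

def aggregate(purchases):
    """Count purchases per item per 5-minute window (epoch-second timestamps)."""
    counts = defaultdict(int)
    for p in purchases:
        window_start = p["ts"] - p["ts"] % WINDOW_S  # floor to window boundary
        counts[(window_start, p["item"])] += 1
    return dict(counts)

events = [
    {"ts": 1000, "item": "shoe"},
    {"ts": 1100, "item": "shoe"},
    {"ts": 1400, "item": "hat"},   # falls into the next window (starts at 1200)
]
print(aggregate(events))
# {(900, 'shoe'): 2, (1200, 'hat'): 1}
```

In Option 3 the same computation happens only when the dashboard asks, as a GROUP BY over raw rows in HBase via Phoenix, which keeps ingest simple but puts the cost on every query.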
Realtime visualizations: chosen solution
JMS → Nifi → Kafka → Tranquility → Druid → Spring Boot
› Druid: time series database with focus on
  › realtime ingestion, good Kafka integration
  › "slice-and-dice" queries
  › distributed scale-out architecture
› Event processing kept simple in Nifi
  › mainly cleaning, transformation
  › aggregation is pushed down to Druid
› But: yet another distributed system .. :-(
› Experiences good so far, but needs work / skills
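A "slice-and-dice" question such as "items bought in the last 5 minutes" maps onto a Druid native timeseries query. The sketch below only builds the query body following Druid's native query format; the datasource, metric, and field names are made up, and in practice the JSON would be POSTed to Druid's broker:

```python
import json

def last_5min_query(datasource, start_iso, end_iso):
    """Build a Druid native timeseries query body.

    Structure follows Druid's native query format; datasource and
    metric/field names here are hypothetical examples.
    """
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "minute",
        "intervals": [f"{start_iso}/{end_iso}"],
        "aggregations": [
            {"type": "longSum", "name": "items_bought", "fieldName": "quantity"}
        ],
    }

q = last_5min_query("purchases", "2018-10-08T12:00:00Z", "2018-10-08T12:05:00Z")
print(json.dumps(q, indent=2))
```

Because the aggregation spec lives in the query (and in the ingestion spec at load time), the dashboard's time window and aggregation dimension stay configurable without touching the Nifi pipeline, which is exactly the flexibility the constraints asked for.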
Down the event-driven road: Wrap up & Summary

The human factor ..
› Technology moves from batch to stream – what about people?
› Analysts' world = often a batch world
  › tooling centered around static datasets
  › can (and must) be generated from streams
  › but: education towards stream- / event-based thinking necessary!
› Incremental / stream-based data exchange = paradigm shift
  › efforts / commitment "from both ends" necessary
https://flic.kr/p/f2Wx6t
Stream me up, Scotty ..
The future is event-based, but on the way:
› Existing batch-oriented APIs
  › use (scheduled) event-based tools for easier later migration
› Checking consistency
  › inject plausibility checks into the data stream
› Realtime visualizations
  › Druid + Kafka: a powerful and flexible combination
› Don't forget the human in the loop!
Thank you
Dr. Dominik Benz
inovex GmbH
Park Plaza
Ludwig-Erhard-Allee 6
76131 Karlsruhe