Down the event-driven road: Experiences of integrating
streaming into analytic data platforms
Dr. Dominik Benz, Head of Machine Learning Engineering, inovex GmbH
Confluent Meetup Munich, 8.10.2018
Integrate existing (batch) data sources?
Check consistency with data sources?
Build realtime data visualizations?
https://flic.kr/p/5eQA7e
https://flic.kr/p/bpFt7U
Down the event-driven road ..
Analytic (Streaming) Data Platforms
Integrating existing (batch) data sources
Checking consistency
Building realtime visualizations
Wrap up & Summary
A typical analytic data platform
ingress → raw → processed → datahub → analysis → egress
Scheduling, orchestration, metadata; user access, system integration, development
› ingress: flat files, databases, APIs, ...
› processing: batch processing (Spark, Hive, ..)
› datahub: (Hive) tables
› scheduling / metadata: Airflow, Hive Metastore
› egress / analysis: SQL, notebooks (Zeppelin, ..)
A typical (?) streaming data platform
ingress → raw → processed → datahub → analysis → egress
Scheduling, orchestration, metadata; user access, system integration, development
› ingress: input data (streams), Kafka Connect
› processing: stream processing (Kafka Streams, Nifi, ..)
› datahub: (Kafka) topics, KTables, ..
› scheduling / metadata: (Confluent) Schema Registry
› egress / analysis: KSQL
Down the event-driven road: Integrating existing (batch) data sources
Integrating web tracking
company website → tracking service (tracking pixel) → raw tracking data

Integrating web tracking: setup / constraints
› Hortonworks-based platform, including Nifi and Confluent Platform
› Apache Airflow established as scheduling / workflow tool, integrated into monitoring, alerting, ..
› Tracking service: currently a batch-oriented API (request data, get download links, ..), but a click event stream is planned
› Developers / analysts with mixed backgrounds w.r.t. programming skills

Apache Nifi in a Nutshell
› drag-and-drop visual definition of data pipelines
› various built-in connectors (file, stream, database, service, ...)
› event-based processing paradigm
› built-in queues, data provenance, backpressure handling, registry, ...
› focus: ingest & lightweight (!) transformation
› not a complex event processor (like Kafka Streams, Flink, Spark Streaming, ...)
› integrated into the HDP stack

Apache Airflow in a nutshell
› python library to define & schedule batch workflows
› programmatic specification of a "DAG" (= tasks + dependencies)
› clean handling of job run metadata (success, duration, ..)
› developed by Airbnb, open-sourced 2015
› built-in standard operators (bash, hive, spark, kubernetes, ..)
› easily extendible (custom operators, ..)
› once used -> never Oozie again :-)
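The "DAG = tasks + dependencies" model described above can be sketched without Airflow itself. Below is a toy scheduler (all names hypothetical, deliberately not the Airflow API) that runs callables in dependency order and records per-task status, mimicking the job-run metadata handling mentioned in the bullets:

```python
# Toy illustration of the "DAG = tasks + dependencies" idea --
# a sketch, NOT the Airflow API.

def run_dag(tasks, dependencies):
    """Run callables in topological order; return a status per task.

    tasks:        {name: callable}
    dependencies: {name: [upstream task names]}
    """
    done, status = set(), {}
    while len(done) < len(tasks):
        progressed = False
        for name, func in tasks.items():
            if name in done:
                continue
            if all(up in done for up in dependencies.get(name, [])):
                try:
                    func()                  # run the task
                    status[name] = "success"
                except Exception:
                    status[name] = "failed"
                done.add(name)
                progressed = True
        if not progressed:
            raise ValueError("cycle in DAG")
    return status

# the hourly tracking workflow from the talk, as dummy tasks
log = []
tasks = {
    "trigger_download": lambda: log.append("trigger"),
    "fetch_links":      lambda: log.append("fetch"),
    "store_data":       lambda: log.append("store"),
}
deps = {"fetch_links": ["trigger_download"], "store_data": ["fetch_links"]}
status = run_dag(tasks, deps)
print(status)
```

A real Airflow DAG would express the same dependencies with operators and `>>` chaining, and add scheduling, retries, and the metadata database on top.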
Integrating web tracking: options
tracking service → tracking data

Option: Airflow only
  + integrated into monitoring, ..
  + job status handling, reloading
  - not prepared for future stream API
  - handling file content complicated
Option: Unified abstraction (e.g. Apache Beam)
  + one model for batch / stream ingest
  - comparatively high entry barrier
Option: Nifi only
  + visual pipeline definition
  + easy handling of file content
  + event-based paradigm
  + operators available
  - custom status handling, reloading
Option: Kafka Connect
  + fault-tolerant
  + scalable setup
  - custom connector coding
  - custom status handling, reloading
Integrating web tracking: chosen solution – Airflow + Nifi
› Combines advantages of Airflow & Nifi
› Prepared for future streaming API
› Integrated into monitoring, alerting, ..
› Status handling / reloading easy
tracking service: trigger (hourly) download → check status (sensors) → trigger, fetch download links → download, process, store data
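The hourly flow above (trigger, poll the status via sensors, fetch the download links, then download and store) boils down to a poll-until-ready pattern. A hedged sketch with a faked tracking-service API; every function and URL here is a hypothetical stand-in, not the real service:

```python
import time

# Hypothetical stand-in for the batch-oriented tracking-service API:
# an export job is triggered, polled until ready, then its links fetched.
_state = {"polls": 0}

def trigger_export():                 # "trigger (hourly) download"
    _state["polls"] = 0
    return "job-42"                   # made-up job id

def job_ready(job_id):                # what an Airflow sensor would poll
    _state["polls"] += 1
    return _state["polls"] >= 3       # pretend it is ready on the 3rd poll

def fetch_download_links(job_id):     # "trigger, fetch download links"
    return [f"https://example.invalid/{job_id}/part-{i}" for i in range(2)]

def run_hourly_batch(max_polls=10, wait_s=0):
    job_id = trigger_export()
    for _ in range(max_polls):        # the sensor loop
        if job_ready(job_id):
            return fetch_download_links(job_id)
        time.sleep(wait_s)
    raise TimeoutError("export never became ready")

links = run_hourly_batch()
print(links)
```

In the chosen solution, Airflow owns this loop (its sensor operators are exactly this poll-and-wait), while the actual download and content processing of each link is handed to Nifi.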
Down the event-driven road: Checking consistency
Checking consistency: Customer Consent
customer portal → grants / revokes consent → customer (consent) database (stores consent)
database → writes consent to Hive / database → Kafka consent event: in sync?
https://flic.kr/p/9yHuk8

Checking consistency: setup / constraints
› Analysts need an up-to-date version of customer consent information in the platform
› Hard correctness requirements (especially regarding revoked consent)
› Continuous monitoring of correctness
› Alerting in case of differences
Checking Consistency: Statistics Events
customer portal → kafka
› use existing channel (kafka)
› source injects periodic "statistics events" into the stream, with a defined measure point (in time)
{type: GRANT, cid: 12, ts: 2018-10-01 11:00:00, ..}
{type: GRANT, cid: 10, ts: 2018-10-01 11:01:00, ..}
{type: REVOK, cid: 09, ts: 2018-10-01 11:01:05, ..}
{type: STAT, measure_ts: 2018-10-01 11:01:20, stats: {num_consent_v1: 72625, num_consent_v2: 6252, ..}}
(events ordered by time)
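Building the periodic STAT event can be sketched in plain Python (no Kafka client involved; the stats keys below are simplified stand-ins for the num_consent_* counters on the slide):

```python
from collections import Counter

def make_stat_event(events, measure_ts):
    """Summarize all consent events up to measure_ts into a STAT event.

    Timestamps are 'YYYY-MM-DD HH:MM:SS' strings, so lexicographic
    comparison matches chronological order.
    """
    counts = Counter(e["type"] for e in events if e["ts"] <= measure_ts)
    return {
        "type": "STAT",
        "measure_ts": measure_ts,
        # simplified stand-ins for the slide's num_consent_* counters
        "stats": {"num_grant": counts["GRANT"], "num_revoke": counts["REVOK"]},
    }

stream = [
    {"type": "GRANT", "cid": 12, "ts": "2018-10-01 11:00:00"},
    {"type": "GRANT", "cid": 10, "ts": "2018-10-01 11:01:00"},
    {"type": "REVOK", "cid": 9,  "ts": "2018-10-01 11:01:05"},
]
stat = make_stat_event(stream, "2018-10-01 11:01:20")
print(stat)  # stats: {'num_grant': 2, 'num_revoke': 1}
```

In production this runs at the source (the customer portal side) and the resulting STAT event is produced into the same Kafka topic as the consent events themselves.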
Checking Consistency: Evaluate Statistics Event
› perform count on target side (Hive) up to $measurePoint
› compare counts
› counts = simple plausibility check, but more elaborate checks (hashes) thinkable
{type: STAT, measure_ts: 2018-10-01 11:01:20, stats: {num_consent_v1: 72625, num_consent_v2: 6252, ..}}
in sync?
{measure_ts: 2018-10-01 11:01:20, hive_stats: {num_consent_v1: 72625, num_consent_v2: 6252, ..}}
Customer (consent) database
Down the event-driven road: Building realtime visualizations
Realtime visualizations: Online Shop Purchases
online shop → JMS (purchase event) → normalization, filtering, aggregation, .. → realtime dashboard
https://flic.kr/p/9yHuk8

Realtime visualizations: setup / constraints
› Goal: timely insights into various purchase aspects (items bought last 5 min, ..)
› flexible / configurable frontend (time window, aggregation dimension, ..)
› scalable to 100s / 1000s of dashboard users
› low latency of dashboard backend
Realtime visualizations: components / options
Layers: JMS → processing → transport layer → service backend → service API

› Option 1 (aggregation during processing):
  Kafka Connect → Kafka Streams → Kafka → Kafka Connect → HBase → Phoenix / JDBC → Spring Boot
› Option 2 (built-in, configurable aggregation):
  Nifi → Kafka → Tranquility → Druid → Spring Boot
› Option 3 (aggregation at query time):
  Nifi → Kafka → Kafka Connect → HBase → Phoenix / JDBC → Spring Boot
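Where the aggregation runs is the real difference between the three options; the aggregation itself is simple. A minimal tumbling-window count ("items bought per 5 minutes") in plain Python, the kind of logic Kafka Streams would run continuously in Option 1 or Druid would apply at ingest in Option 2:

```python
from collections import defaultdict

WINDOW_S = 300  # 5-minute tumbling windows, matching "items bought last 5min"

def aggregate(purchases):
    """Count purchases per item per 5-minute window (epoch-second timestamps)."""
    counts = defaultdict(int)
    for p in purchases:
        window_start = p["ts"] - p["ts"] % WINDOW_S  # floor to window boundary
        counts[(window_start, p["item"])] += 1
    return dict(counts)

events = [
    {"ts": 1000, "item": "shoe"},
    {"ts": 1100, "item": "shoe"},
    {"ts": 1400, "item": "hat"},   # falls into the next window (starts at 1200)
]
print(aggregate(events))
# {(900, 'shoe'): 2, (1200, 'hat'): 1}
```

In Option 3 the same computation happens only when the dashboard asks, as a GROUP BY over raw rows in HBase via Phoenix, which keeps ingest simple but puts the cost on every query.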
Realtime visualizations: chosen solution
JMS → Nifi → Kafka → Tranquility → Druid → Spring Boot
› Druid: time series database with focus on
  › realtime ingestion, good Kafka integration
  › "slice-and-dice" queries
  › distributed scale-out architecture
› Event processing kept simple in Nifi
  › mainly cleaning, transformation
  › aggregation is pushed down to Druid
› But: yet another distributed system .. :-(
› Experiences good so far, but needs work / skills
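A "slice-and-dice" question such as "items bought in the last 5 minutes" maps onto a Druid native timeseries query. The sketch below only builds the query body following Druid's native query format; the datasource, metric, and field names are made up, and in practice the JSON would be POSTed to Druid's broker:

```python
import json

def last_5min_query(datasource, start_iso, end_iso):
    """Build a Druid native timeseries query body.

    Structure follows Druid's native query format; datasource and
    metric/field names here are hypothetical examples.
    """
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "minute",
        "intervals": [f"{start_iso}/{end_iso}"],
        "aggregations": [
            {"type": "longSum", "name": "items_bought", "fieldName": "quantity"}
        ],
    }

q = last_5min_query("purchases", "2018-10-08T12:00:00Z", "2018-10-08T12:05:00Z")
print(json.dumps(q, indent=2))
```

Because the aggregation spec lives in the query (and in the ingestion spec at load time), the dashboard's time window and aggregation dimension stay configurable without touching the Nifi pipeline, which is exactly the flexibility the constraints asked for.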
Down the event-driven road: Wrap up & Summary

The human factor ..
› Technology moves from batch to stream – what about people?
› Analysts' world = often a batch world
  › tooling centered around static datasets
  › can (and must) be generated from streams
  › but: education towards stream- / event-based thinking necessary!
› Incremental / stream-based data exchange = paradigm shift
  › efforts / commitment "from both ends" necessary
https://flic.kr/p/f2Wx6t
Stream me up, Scotty ..
The future is event-based, but on the way:
› Existing batch-oriented APIs
  › use (scheduled) event-based tools for easier later migration
› Checking consistency
  › inject plausibility checks into the data stream
› Realtime visualizations
  › Druid + Kafka: a powerful and flexible combination
› Don't forget the human in the loop!
Thank you
Dr. Dominik Benz
inovex GmbH
Park Plaza
Ludwig-Erhard-Allee 6
76131 Karlsruhe