Christian Kreutzfeldt – Static vs Dynamic Stream Processing

Post on 08-Jan-2017

STATIC VS DYNAMIC STREAM PROCESSING

Christian Kreutzfeldt (@mnxfst)

1. Introduction

2. Stream Processing - First Encounter

3. Increasing number of Use Cases

4. Arising Implementation Issues

5. Requirements for Stream Processing Framework

6. Way to SPQR (+ short demo)

7. Way to Apache Flink (extension points + short demo)

8. Future (hope to come)

9. Q&A

Christian Kreutzfeldt (@mnxfst)

Senior Software Developer & Architect at Otto Group Business Intelligence Department

Tech Lead “Real-Time Stream Processing”

Computer Science at University of Luebeck

w/ catalogue business, e-commerce and over-the-counter retail

Multichannel Retail

covering the entire portfolio of retail services across the value-added chain

Services

World’s Second-Largest Online Retailer in End-Consumer Business

Europe’s Largest Online Retailer in End-Consumer Fashion & Lifestyle Business

providing retail-related financial services across the value-added chain

Financial Services

definition of business intelligence strategy

BI Strategy

talent recruitment & training, networking & consulting

Consulting

evaluation & impl. of data driven business models

Business Development

maintaining & providing data pools

Data Pool

software-as-a-service solutions

SaaS Products

Otto Group Business Intelligence Department: driven by data, inspired by our customers

Otto Group Business Intelligence Department: dedicated to open source

stream processing framework

SPQR

scheduling framework for painfree agile development of your datahub

Schedoscope

framework for developing real-world machine learning solutions

Palladium

follow us on github.com/ottogroup

Stream Processing: first steps w/ unified tracking

Unified Tracking

Stream Processing: prevent quality problems

Unified Tracking

Tagging Template

Tagging Template

Tagging Template

Tagging Template

EventStream

Event Validator

akka-based real stream processing

customer sessions

search sessions

user-agent identification

dynamic profile selection

dynamic stream queries

Stream Processing: developing project ideas

Umberto Salvagnin https://www.flickr.com/photos/kaibara/4688161016 (cc by 2.0)

Stream Processing: software development issues

resource intensive use-case implementation

required ops support for topology deployment and monitoring

rather static implementations than highly flexible ones

highly time consuming

Static Topologies (Queries)

Dynamic Data

Highly Flexible Context

Stream Processing: requirements to ease the pain

unified runtime environment

operations support

support for multiple sources and sinks

real stream processing

easy-to-extend

steep learning curve

Stream Processing: working w/ data the business way

no-code topology definition (the SQL way)

self dependent, immediate deployments

consistent monitoring (behavior / result retrieval)

adjustment through re-deployments

Dynamic Topologies (Queries)

Dynamic Data

Highly Flexible Context
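
The difference between static and dynamic queries over a running stream can be illustrated with a small sketch. This is not taken from the talk or from SPQR; the class `DynamicQuerySketch`, its methods, and the query names are hypothetical, and events are simplified to plain strings. The point it shows is only that queries are registered and removed while events keep flowing, without redeploying anything.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical illustration: queries can change at runtime, the stream keeps running.
final class DynamicQuerySketch {

    // Currently active queries, keyed by a name chosen by the query author.
    private final Map<String, Predicate<String>> queries = new ConcurrentHashMap<>();

    void register(String name, Predicate<String> query) {
        queries.put(name, query);
    }

    void remove(String name) {
        queries.remove(name);
    }

    // Returns the sorted names of all currently registered queries matching the event.
    List<String> process(String event) {
        List<String> matches = new ArrayList<>();
        queries.forEach((name, query) -> {
            if (query.test(event)) {
                matches.add(name);
            }
        });
        Collections.sort(matches);
        return matches;
    }

    public static void main(String[] args) {
        DynamicQuerySketch engine = new DynamicQuerySketch();
        engine.register("searches", e -> e.startsWith("search:"));
        System.out.println(engine.process("search:shoes"));

        // A second query is added while the "stream" is already being processed.
        engine.register("shoes", e -> e.contains("shoes"));
        System.out.println(engine.process("search:shoes"));

        // Removing a query takes effect for the next event.
        engine.remove("searches");
        System.out.println(engine.process("search:shoes"));
    }
}
```

With a static topology, each of these changes would instead mean changing code and redeploying.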

Stream Processing: framework decision

unified runtime environment

operations support

support for multiple sources and sinks

real stream processing

easy-to-extend

steep learning curve

SPQR (spooker)

no-code topology definition

self dependent deployments

consistent monitoring

immediate deployments

short feedback circuit

SPQR: concepts

independent library deployments into node repositories for later use

library deployment

configuration based pipeline descriptions

zero-code topologies

support for ad hoc queries, immediate adjustments and short feedback circuits

ad hoc queries
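
The idea of a configuration-based pipeline description on top of a pre-deployed component library can be sketched as follows. This is not SPQR's actual configuration format or API: the `->` notation, the class `ZeroCodePipelineSketch`, and the component names are assumptions made up for illustration. The sketch only shows that the pipeline author references components by name and writes no code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of a "zero-code" pipeline assembled from a textual description.
final class ZeroCodePipelineSketch {

    // Stand-in for a component library that was deployed earlier for later use.
    static final Map<String, Function<String, String>> LIBRARY = new LinkedHashMap<>();

    static {
        LIBRARY.put("trim", s -> s.trim());
        LIBRARY.put("lowercase", s -> s.toLowerCase());
        LIBRARY.put("mask-digits", s -> s.replaceAll("[0-9]", "#"));
    }

    // Builds a pipeline from a description such as "trim -> lowercase".
    static Function<String, String> build(String description) {
        Function<String, String> pipeline = Function.identity();
        for (String rawName : description.split("->")) {
            String name = rawName.trim();
            Function<String, String> step = LIBRARY.get(name);
            if (step == null) {
                throw new IllegalArgumentException("unknown component: " + name);
            }
            pipeline = pipeline.andThen(step);
        }
        return pipeline;
    }

    public static void main(String[] args) {
        Function<String, String> pipeline = build("trim -> lowercase -> mask-digits");
        System.out.println(pipeline.apply("  Order 4711 SHIPPED "));
    }
}
```

Adjusting the pipeline then means editing the description string and rebuilding, which is what keeps the feedback circuit short.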

https://github.com/ottogroup/spqr

SPQR: architecture

DEMO

Dynamic Stream Processing: importance for (business) acceptance

no-code topology definition

self dependent deployments

consistent monitoring

immediate deployments

short feedback circuit

steep learning curve, focus on functionality instead of implementation, better representation

little or no ops support, shorter time-to-execution, independence from tech teams, easier to use

short feedback circuit, easier to adjust

support people in trying out new ideas, get more people to work with data streams

use the representation defined by the topology author as the foundation for monitoring, so that topology author and ops team share a common understanding

Dynamic Stream Processing: from spqr to apache flink - it’s all there

Martin Grandjean - http://www.martingrandjean.ch/wp-content/uploads/2013/10/Graphe3.png (cc by-sa 3.0)

akka

Dynamic Stream Processing: variety of ways to interact with apache flink

variety of message types (request/response) available to interact with job manager / cluster:

● RequestNumberRegisteredTaskManager
● RequestTotalNumberOfSlots
● SubmitJob
● CancelJob
● RequestPartitionState
● RequestJobStatus
● RequestRunningJobs
● RequestRunningJobsStatus
● RequestJob
● RequestRegisteredTaskManagers
● RequestStackTrace
● RequestJobManagerStatus
● AccumulatorMessage (RequestAccumulatorResultsStringified,...)
● ...

Apache Flink: short feedback circuit & consistent monitoring (impl)

akka

FlinkMetricsCollector spawns RunningJobsManager

RunningJobsManager queries JobManager

RunningJobsManager spawns a JobMetricsCollector for each job

JobMetricsCollector queries JobManager

Apache Flink: short feedback circuit & consistent monitoring (impl)

akka

RunningJobsManager

public void preStart() throws Exception {
    // every 5 seconds, ask the remote JobManager for the status of all running jobs;
    // the reply is delivered to this actor
    context().system().scheduler().schedule(
        FiniteDuration.Zero(),
        FiniteDuration.apply(5, TimeUnit.SECONDS),
        this.remoteJobManagerRef,
        JobManagerMessages.getRequestRunningJobsStatus(),
        context().dispatcher(),
        getSelf()
    );
}

receive RunningJobsStatus

extract job identifier

start job metrics collector

JobMetricsCollector

public void preStart() throws Exception {
    // every 5 seconds, ask the remote JobManager for the accumulator results of this job;
    // the reply is delivered to this actor
    context().system().scheduler().schedule(
        FiniteDuration.Zero(),
        FiniteDuration.apply(5, TimeUnit.SECONDS),
        this.remoteJobManagerRef,
        new RequestAccumulatorResults(this.jobId),
        context().dispatcher(),
        getSelf()
    );
}

receive AccumulatorResultsFound
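
The polling pattern behind these collectors (send a request at a fixed interval, handle each reply) can be tried out without Akka or Flink. The following is a minimal sketch using only the JDK's `ScheduledExecutorService`; it is not the talk's implementation. The class `MetricsPollerSketch`, the `collect` method, and the string-valued metrics source standing in for the JobManager are all hypothetical.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for an actor that polls a job manager for metrics.
final class MetricsPollerSketch {

    // Polls the source at a fixed interval until `polls` results have been collected.
    static List<String> collect(Supplier<String> source, int polls, long intervalMillis) {
        List<String> results = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(polls);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            // Initial delay 0, then one request per interval (the talk's snippets use 5 seconds).
            scheduler.scheduleAtFixedRate(() -> {
                if (done.getCount() > 0) {
                    results.add(source.get());
                    done.countDown();
                }
            }, 0, intervalMillis, TimeUnit.MILLISECONDS);
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while polling", e);
        } finally {
            scheduler.shutdownNow();
        }
        return results;
    }

    public static void main(String[] args) {
        // Fake metrics source: pretends that 10 more records were processed per poll.
        AtomicInteger processed = new AtomicInteger();
        List<String> snapshots =
            collect(() -> "records-processed=" + processed.addAndGet(10), 3, 100);
        snapshots.forEach(System.out::println);
    }
}
```

In the actor-based version the reply arrives as a message instead of a return value, but the scheduling idea is the same.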

Apache Flink: metrics retrieval through accumulators

DEMO

https://nifi.apache.org/

Apache Flink: how to move on

deploy metrics

under construction

Apache Flink: topology definition & deployments (integration points)

akka

no-code topology definition

self dependent deployments

immediate deployments

expects code

requires far too much framework modifications

the place to be

https://nifi.apache.org/

metrics

deploy

Apache Flink: relevance

Static Data, Static Queries

Static Data, Dynamic Queries

Dynamic Data, Static Queries

Dynamic Data, Dynamic Queries

SQL

https://nifi.apache.org/

metrics

deploy

Apache Flink: apache zeppelin points in the right direction

Static Data, Static Queries

Static Data, Dynamic Queries

Dynamic Data, Static Queries

Dynamic Data, Dynamic Queries

SQL

http://www.ottogroup.com/en/karriere/

We are hiring!