Big Data Architecture

41
BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH Big Data Architectures Guido Schmutz

Transcript of Big Data Architecture

Page 1: Big Data Architecture

BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH

Big Data Architectures

Guido Schmutz

Page 2: Big Data Architecture

Guido Schmutz

Working for Trivadis for more than 18 yearsOracle ACE Director for Fusion Middleware and SOACo-Author of different booksConsultant, Trainer Software Architect for Java, Oracle, SOA and Big Data / Fast DataMember of Trivadis Architecture BoardTechnology Manager @ Trivadis

More than 25 years of software development experience

Contact: [email protected]: http://guidoschmutz.wordpress.comTwitter: gschmutz

Page 3: Big Data Architecture

Agenda

1. Introduction2. Traditional Architecture for Big Data3. Streaming Analytics Architecture for Fast Data4. Lambda/Kappa/Unifed Architecture for Big Data5. Summary

Page 4: Big Data Architecture

Introduction

Page 5: Big Data Architecture

Big Data is still “work in progress”

Choosing the right architecture is key for any (big data) project

Big Data is still quite a young field and therefore there are no standard architectures available which have been used for years

In the past few years, a few architectures have evolved and have been discussed online

Know the use cases before choosing your architecture

To have one/a few reference architectures can help in choosing the right components

Page 6: Big Data Architecture

Hadoop Ecosystem – many choices ….Management/Monitoring

Core

Analytics Workflow/JobUnstructuredDataSources

StructuredDataSources

SQLonHadoop

SerializationDataStorage Security

Page 7: Big Data Architecture

Important Properties to choose a Big Data Architecture

Latency

Keep raw and un-interpreted data “forever” ?

Volume, Velocity, Variety, Veracity

Ad-Hoc Query Capabilities needed ?

Robustness & Fault Tolerance

Scalability

Page 8: Big Data Architecture

From Volume and Variety to Velocity

Big Data has evolved …

and the Hadoop Ecosystem as well ….

Past

BigData =Volume&Variety

Present

BigData =Volume&Variety&Velocity

Past

BatchProcessing

TimetoinsightofHours

Present

Batch &StreamProcessing

Timetoinsight inSeconds

Adapted fromClouderablogarticle

Page 9: Big Data Architecture

Traditional Architecture for Big Data

Page 10: Big Data Architecture

“Traditional Architecture” for Big Data

DataCollection (Analytical)DataProcessing ResultStoreData

Sources

Channel

DataConsumer

Reports

Service

AnalyticTools

AlertingTools

Social

RDBMS

Sensor

ERP

Logfiles

Mobile

Machine

Batchcompute

Stage

ResultStore

QueryEngine

ComputedInformation

RawData(Reservoir)

=DatainMotion =DataatRest

Page 11: Big Data Architecture

Use Case 1) – Click Stream analysis: 360 degree view of customer

DataCollection (Analytical)DataProcessing ResultStoreData

SourcesData

Consumer

ChannelBatch

compute

ComputedInformation

RawData(Reservoir)

ResultStore

QueryEngine

Reports

AnalyticTools

Logfiles

Page 12: Big Data Architecture

Use Case 2) – Ingest Relational Data into Hadoop and make it accessible

DataCollection (Analytical)DataProcessing ResultStoreData

SourcesData

Consumer

RDBMSBatch

compute

ComputedInformation

RawData(Reservoir)

ResultStore

QueryEngine

Reports

Service

AnalyticTools

AlertingTools

Page 13: Big Data Architecture

Use Case 2a) – Ingest Relational Data into Hadoop and make it accessible

DataCollection (Analytical)DataProcessing ResultStoreData

SourcesData

Consumer

RDBMS(CDC) Batch

compute

ComputedInformation

RawData(Reservoir)

ResultStore

QueryEngine

Reports

Service

AnalyticTools

AlertingTools

Channel

Page 14: Big Data Architecture

“Hadoop Ecosystem” Technology Mapping

DataCollection (Analytical)DataProcessing ResultStoreData

Sources

Channel

DataConsumer

Reports

Service

AnalyticTools

AlertingTools

Social

RDBMS

Sensor

ERP

Logfiles

Mobile

Machine

Batchcompute

StagingResultStore

QueryEngine

ComputedInformation

RawData(Reservoir)

=DatainMotion =DataatRest

Page 15: Big Data Architecture

Apache Spark – the new kid on the block

Apache Spark is a fast and general engine for large-scale data processing• The hot trend in Big Data!• Originally developed 2009 in UC Berkley’s AMPLab• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x

faster on disk• One of the largest OSS communities in big data with over 200 contributors in 50+

organizations• Open Sourced in 2010 – since 2014 part of Apache Software foundation• Supported by many vendors

Page 16: Big Data Architecture

Motivation – Why Apache Spark?

Apache Hadoop MapReduce: Data Sharing on Disk

Apache Spark: Speed up processing by using Memory instead of Disks

map reduce . . .Input

HDFSread

HDFSwrite

HDFSread

HDFSwrite

op1 op2 . . .Input

Output

Output

Page 17: Big Data Architecture

Apache Spark “Ecosystem”

SparkSQL(BatchProcessing)

BlinkDB(ApproximateQuerying)

SparkStreaming(Real-Time)

MLlib,SparkR(MachineLearning)

GraphX(GraphProcessing)

SparkCoreAPIandExecutionModel

SparkStandalone MESOS YARN HDFS Elastic

Search NoSQL S3

Libraries

CoreRuntime

ClusterResourceManagers Data/DataStores

Page 18: Big Data Architecture

Use Case 3) – Predictive Maintenance through Machine Learning on collected data

DataCollection (Analytical)DataProcessing ResultStoreData

SourcesData

Consumer

MachineBatch

compute

ComputedInformation

RawData(Reservoir)

ResultStore

QueryEngine

Reports

Service

AnalyticTools

AlertingTools

DB

StagingFile

Channel

Page 19: Big Data Architecture

“Spark Ecosystem” Technology Mapping

DataCollection (Analytical)DataProcessing ResultStoreData

Sources

Channel

DataConsumer

Reports

Service

AnalyticTools

AlertingTools

Social

RDBMS

Sensor

ERP

Logfiles

Mobile

Machine

Batchcompute

StageResultStore

QueryEngine

ComputedInformation

RawData(Reservoir)

=DatainMotion =DataatRest

Page 20: Big Data Architecture

Traditional Architecture for Big Data

• Batch Processing

• Not for low latency use cases

• Spark can speed up, but if positioned as alternative to Hadoop Map/Reduce, it’s still Batch Processing

• Spark Ecosystems offers a lot of additional advanced analytic capabilities (machine learning, graph processing, …)

Page 21: Big Data Architecture

Streaming Analytics Architecturefor Big Data

Page 22: Big Data Architecture

Streaming Analytics Architecture for Big Dataaka. (Complex) Event Processing)

DataCollection

Batchcompute

DataSources

Channel

DataConsumer

Reports

Service

AnalyticTools

AlertingTools

Social

Logfiles

Sensor

RDBMS

ERP

Mobile

Machine

(Analytical)Real-TimeDataProcessing

Stream/EventProcessing

ResultStore

Messaging

ResultStore

=DatainMotion =DataatRest

Page 23: Big Data Architecture

Use Case 4) Alerting in Internet of Things (IoT)

DataCollection

Batchcompute

DataSources

Channel

DataConsumer

AnalyticTools

AlertingTools

Sensor

Machine

(Analytical)Real-TimeDataProcessing

Stream/EventProcessing

ResultStore

Messaging

ResultStore

Page 24: Big Data Architecture

Use Case 5) Real-Time Analytics on Sensor Events

DataCollection

Batchcompute

DataSources

Channel

DataConsumer

AnalyticTools

Sensor

Machine

(Analytical)Real-TimeDataProcessing

Stream/EventProcessing

ResultStore

Messaging

ResultStore

Page 25: Big Data Architecture

Unified Log (Event) Processing

Stream processing allows for computing feeds off of other feeds

Derived feeds are no different than original feedsthey are computed off

Single deployment of “Unified Log” but logically different feeds

MeterReadings

Collector

Enrich/Transform

AggregatebyMinute

RawMeterReadings

MeterwithCustomer

Meterby CustomerbyMinute

Customer AggregatebyMinute

MeterbyMinute

Persist

MeterbyMinute

Persist

RawMeterReadings

Page 26: Big Data Architecture

Streaming Analytics Technology Mapping

DataCollection

Batchcompute

DataSources

Channel

DataConsumer

Reports

Service

AnalyticTools

AlertingTools

Social

Logfiles

Sensor

RDBMS

ERP

Mobile

Machine

(Analytical)Real-TimeDataProcessing

Stream/EventProcessing

ResultStore

Messaging

ResultStore

=DatainMotion =DataatRest

Page 27: Big Data Architecture

Streaming Analytics Architecture for Big Data

The solution for low latency use cases

Process each event separately => low latency

Process events in micro-batches => increases latency but offers better reliability

Previously known as “Complex Event Processing”

Keep the data moving / Data in Motion instead of Data at Rest => raw events are (often) not stored

Page 28: Big Data Architecture

Lambda Architecture for Big Data

Page 29: Big Data Architecture

“Lambda Architecture” for Big Data

DataCollection

(Analytical)BatchDataProcessing

Batchcompute

ResultStoreDataSources

Channel

DataConsumer

Reports

Service

AnalyticTools

AlertingTools

Social

RDBMS

Sensor

ERP

Logfiles

Mobile

Machine

(Analytical)Real-TimeDataProcessing

Stream/EventProcessing

Batchcompute

Messaging

ResultStore

QueryEngine

ResultStore

ComputedInformation

RawData(Reservoir)

=DatainMotion =DataatRest

Page 30: Big Data Architecture

Use Case 6) Social Media and Social Network Analysis

DataCollection

(Analytical)BatchDataProcessing

Batchcompute

ResultStoreDataSources

Channel

DataConsumer

Reports

Service

AnalyticTools

AlertingTools

Social

(Analytical)Real-TimeDataProcessing

Stream/EventProcessing

Batchcompute

Messaging

ResultStore

QueryEngine

ResultStore

ComputedInformation

RawData(Reservoir)

=DatainMotion =DataatRest

Page 31: Big Data Architecture

Lambda Architecture for Big Data

Combines (Big) Data at Rest with (Fast) Data in Motion

Closes the gap from high-latency batch processing

Keeps the raw information forever

Makes it possible to rerun analytics operations on whole data set if necessary => because the old run had an error or => because we have found a better algorithm we want to apply

Have to implement functionality twice• Once for batch• Once for real-time streaming

Page 32: Big Data Architecture

„Kappa“ Architecture for Big Data

Page 33: Big Data Architecture

“Kappa Architecture” for Big Data

DataCollection

“RawDataReservoir”

Batchcompute

DataSources

Messaging

DataConsumer

Reports

Service

AnalyticTools

AlertingTools

Social

Logfiles

Sensor

RDBMS

ERP

Mobile

Machine

(Analytical)Real-TimeDataProcessing

Stream/EventProcessing

ResultStore

Messaging

ResultStore

RawData(Reservoir)

=DatainMotion =DataatRest

Page 34: Big Data Architecture

Kappa Architecture for Big Data

Today the stream processing infrastructure are as scalable as Big Data processing architectures• Some using the same base infrastructure, i.e. Hadoop YARN

Only implement processing / analytics logic once

Can Replay historical events out of an historical (raw) event store• Provided by either the Messaging or Raw Data (Reservoir) component

Updates of processing logic / Event replay are handled by deploying new version of logic in parallel to old one• New logic will reprocess events until it caught up with the current events and then

the old version can be de-commissioned.

Page 35: Big Data Architecture

„Unified“ Architecture for Big Data

Page 36: Big Data Architecture

“Unified Architecture” for Big Data

DataCollection

(Analytical)BatchDataProcessing(CalculateModelsofincomingdata)

Batchcompute

ResultStoreDataSources

Channel

DataConsumer

Reports

Service

AnalyticTools

AlertingTools

Social

RDBMS

Sensor

ERP

Logfiles

Mobile

Machine

(Analytical)Real-TimeDataProcessing

Stream/EventProcessing

Batchcompute

Messaging

ResultStore

QueryEngine

ResultStore

ComputedInformation

RawData(Reservoir)

=DatainMotion =DataatRest

PredictionModels

Page 37: Big Data Architecture

Use Case 7) Fraud Detection

DataCollection

(Analytical)BatchDataProcessing(CalculateModelsofincomingdata)

Batchcompute

ResultStoreDataSources

Channel

DataConsumer

Reports

Service

AnalyticTools

AlertingTools

RDBMS

Sensor

ERP

Logfiles

Mobile

Machine

(Analytical)Real-TimeDataProcessing

Stream/EventProcessing

Batchcompute

Messaging

ResultStore

QueryEngine

ResultStore

ComputedInformation

RawData(Reservoir)

PredictionModels

Page 38: Big Data Architecture

Summary

Page 39: Big Data Architecture

Summary

Know your use cases and then choose your architecture and the relevant components/products/frameworks

You don’t have to use all the components of the Hadoop Ecosystem to be successful

Big Data is still quite a young field and therefore there are no standard architectures available which have been used for years

Lambda, Kappa Architecture are best practices architectures which you have to adapt to your environment

Page 40: Big Data Architecture
Page 41: Big Data Architecture

Guido SchmutzTechnology Manager

[email protected]