DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Posted on 15-Jan-2015


Description

We at DECK36 show how the "Log everything!" requirement can be implemented with Hadoop, Amazon EMR, and Twitter Storm.

Transcript of DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Dr. Stefan Schadwinkel and Mike Lohmann


Who we are.

Log everything

Mike Lohmann: Architecture

Author (PHPMagazin, iX, heise.de)

Dr. Stefan Schadwinkel: Analytics

Author (heise.de, Cereb. Cortex, EJN, J. Neurophysiol.)


Agenda.


What we did. What we do.

Log everything! - Our way from Requirement to Solution

Infrastructure and technologies: Simple, Scalable, Open Source

Happy business users.


What we did.


Creating & operating education communities

Web applications

Multi-language

Different market rules in different countries

Consolidating the technological basis for multiple (new) products


DECK36 GmbH & Co. KG


DECK36 is a young spin-off from ICANS

7 core engineers with longstanding expertise

(operate, scale, automate, analyze)

Consulting and engineering services for the etruvian group and external customers


Facts and figures: PokerStrategy.com

PokerStrategy.com: Education since 2005

6,000,000 registered users

19 languages

2,800,000 PI/day

700,000 posts/day

7,600,000 requests/day


Moving on…


Build more Education communities like PokerStrategy…

Assume PokerStrategy KPIs(?)

Other Business models

Add mobile and the social web…

Our requirement: Log everything!

Logging Tools / Technologies

Producer: web/mobile apps, JS frontend, servers, databases

Transport
Now: RabbitMQ + Erlang consumer, or Kafka + any other consumer
Was: Flume

Storage
Now: S3 storage + Hadoop with EMR, or any other storage
Was: Virtualized in-house Hadoop

Analytics
MapReduce with Hive/Pig; results in any format (Excel, QlikView, RDBMS, ...)
Realtime datastream analytics: Storm / Trident

Logging Infrastructure

[Architecture diagram] Producers (Apps 1-x, NodeJS, databases and servers) publish to RabbitMQ; consumers write to S3 storage and feed Graylog and Zabbix; a Hadoop cluster processes the S3 data, with results landing in an RDBMS or tools like Excel, QlikView, Tableau, SAS, etc.

Realtime Datastream Analytics (Storm): a Nimbus master, a three-node Zookeeper ensemble, and three Supervisors running the Workers.

Producer

[Sequence diagram] A request to /Home reaches the PageController, which emits a PageHitEvent. A listener picks up the event and calls Logger::log() on the Monolog logger; processors, handlers, and a formatter turn it into a JSON log message, which is published to a local RabbitMQ instance and forwarded on via shovel.
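The listener/processor/handler/formatter chain can be sketched in a few lines. This is a conceptual Python sketch of the formatter step only; the class and field names are illustrative assumptions, not the actual ICANS component API (which is PHP/Monolog):

```python
import json
from datetime import datetime, timezone

class PageHitEvent:
    """Illustrative event object; field names are assumptions, not the real API."""
    def __init__(self, url, user_id):
        self.url = url
        self.user_id = user_id

def format_event(event):
    # Formatter step: serialize the event into the JSON log message
    # that gets published to the local RabbitMQ instance.
    record = {
        "event": type(event).__name__,
        "url": event.url,
        "user_id": event.user_id,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)

msg = format_event(PageHitEvent("/Home", 42))
print(msg)
```

The key point is that producers only ever emit self-describing JSON messages; everything downstream (transport, storage, analytics) stays agnostic of the application that logged them.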

Producer JS (in progress)

[Diagram] On /Home, the JS client tracks events and triggers a WebSocket call to a NodeJS DataCollector, which validates the events (buffering in local storage) and publishes them to a local RabbitMQ instance, forwarded on via shovel.

Producer

LoggingComponent: provides interfaces, filters and handlers

LoggingBundle: glues it all together with Symfony2

Drupal Logging Module: uses the LoggingComponent

JS Frontend Client: LogClient for browsers (in progress)

https://github.com/ICANS/IcansLoggingComponent
https://github.com/ICANS/IcansLoggingBundle
https://github.com/ICANS/drupal-logging-module
https://github.com/DECK36/starlog-js-frontend-client

Transport

1st solution: Flume

+ Part of the Hadoop ecosystem

+ Flexible central config, extensible via plugins

- Not mature software (flume, flume-ng, plugin interfaces, ...)

- Central config has problems with Puppet

2nd solution: RabbitMQ

+ Local RabbitMQ cluster

+ Decentralized config (producers & consumers simply connect)

- HDFS sink not pre-packaged

Storage

1st solution: Self-hosted Hadoop

- Virtualized infrastructure makes HDFS redundant

- High costs (cluster always running, admin work)

2nd solution: Cloud storage

+ Amazon S3

+ Elastic MapReduce: Hadoop on demand

+ Cost-effective (you only pay for what you use)

Compaction

The RabbitMQ consumer (Erlang) stores the data to the cloud. Yet we have a mixed message stream, but want Hive-partitioned output:

s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/part-00000.lzo

MapReduce:

Streaming (stdin/stdout to any tool)

Computation (Hive, Pig, Cascalog, etc.)

Amazon Redshift: PostgreSQL-compatible data warehouse

Hive partitioning!
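A partitioned key of that shape is easy to derive from a message timestamp. A minimal sketch (the bucket and website names are placeholders, as in the slide; the helper itself is illustrative, not the Erlang consumer's actual code):

```python
from datetime import date

def partitioned_key(bucket, website, day, part=0):
    # Build a Hive-partitioned S3 key. The year=/month=/day= directory
    # names let Hive prune partitions instead of scanning the whole bucket.
    return (
        f"s3://{bucket}/icanslog/{website}/icans.content/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/part-{part:05d}.lzo"
    )

key = partitioned_key("my-bucket", "example.com", date(2012, 10, 1))
print(key)
```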

Analytics

Cascalog is Clojure, Clojure is Lisp:

(?<- (stdout) [?person] (age ?person ?age) … (< ?age 30))

Reading the query: ?<- is the query operator, (stdout) is the Cascading output tap, and [?person] names the columns of the dataset generated by the query. (age ?person ?age) is a "generator" and (< ?age 30) a "predicate"; a query can contain as many of these as you want, both can be any Clojure function, and Clojure can call anything that is available within a JVM.
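For readers without a Lisp background: the query is essentially a filter-and-project over a relation. A Python sketch of the same logic, with a made-up `ages` dataset standing in for the Cascalog generator:

```python
# Made-up (person, age) relation standing in for the "age" generator
ages = [("alice", 28), ("bob", 35), ("carol", 22)]

# Project ?person for every row where the predicate (< ?age 30) holds
under_30 = [person for person, age in ages if age < 30]
print(under_30)
```

The difference in practice is that Cascalog compiles such queries into Cascading/MapReduce jobs, so the same declarative form runs over terabytes on a Hadoop cluster.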

Analytics

• We use Cascalog to preprocess and organize that incoming flow of log messages:

Analytics

Let's run the Cascalog processing on Amazon EMR:

./elastic-mapreduce --create --name "Log Message Compaction" \
  --bootstrap-action s3://[BUCKET]/mapreduce/configure-daemons \
  --num-instances $NUM \
  --slave-instance-type m1.large \
  --master-instance-type m1.large \
  --jar s3://[BUCKET]/mapreduce/compaction/icans-cascalog.jar \
  --step-action TERMINATE_JOB_FLOW \
  --step-name "Cascalog" \
  --main-class icans.cascalogjobs.processing.compaction \
  --args "s3://[BUCKET]/incoming/*/*/*/","s3://[BUCKET]/icanslog","s3://[BUCKET]/icanslog-error"

Analytics

Now we can access the log data within Hive and store results again to S3:

Analytics

Now get the stats by executing a query. We can then simply copy the data from S3 and import it into any local analytical tool: Excel, Redshift, QlikView, R, etc.
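Once the query results are copied down from S3 (e.g. as CSV), even stdlib tooling is enough for a quick look. A sketch with a hypothetical two-column export (not the actual Hive output schema):

```python
import csv
import io
from collections import Counter

# Hypothetical CSV export of the Hive query results
data = "day,url\n2012-10-01,/Home\n2012-10-01,/Forum\n2012-10-02,/Home\n"

# Count page hits per day straight from the downloaded file
hits_per_day = Counter(row["day"] for row in csv.DictReader(io.StringIO(data)))
print(dict(hits_per_day))
```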

Realtime Datastream Analytics

• Storm: Hadoop for realtime analytics

• Rock-solid HA concept

• Highly scalable

• Can: process streams (and trigger events), provide DRPC functionality, work on enormous data loads

• Fancy names for the modules (spouts/bolts/tuples/topology)

• Easy to use: small, easy-to-understand API; DevMode

• Add new topologies at run time
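The spout/bolt idea itself is simple: spouts emit a stream of tuples, bolts consume and transform them, and a topology wires them together. Storm is JVM-based; the following Python sketch only mirrors the concepts, not Storm's actual API:

```python
from collections import Counter

def page_hit_spout():
    # Spout: the source of the tuple stream (here, three canned page hits)
    yield from [("/Home",), ("/Forum",), ("/Home",)]

class CountBolt:
    # Bolt: consumes tuples and keeps a running count per URL
    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        self.counts[tup[0]] += 1

# The "topology": one spout wired directly to one bolt
bolt = CountBolt()
for tup in page_hit_spout():
    bolt.execute(tup)
print(bolt.counts["/Home"])
```

In real Storm, Nimbus distributes such a topology across supervisors, and the counting bolt would run as many parallel workers partitioned by field grouping.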


Happy business users!

Questions they often have can be automated (ETL, reports)

New questions can be explored (ad hoc, search)

Insights can be used as feedback into the system (decisions, WebSockets)

Data-driven applications can be created that can be used by multiple websites or tailored to individual needs.


Merci.

Questions?


Contacts.


Dr. Stefan Schadwinkel

stefan.schadwinkel@deck36.de

ICANS_StScha

Mike Lohmann

mike.lohmann@deck36.de

mikelohmann


Tools/Technologies


DECK36 GmbH & Co. KG

Valentinskamp 18

20354 Hamburg

Germany

Phone: +49 40 22 63 82 9-0

Fax: +49 40 38 67 15 92

Web: www.deck36.de