JOSA TechTalks - Real-Time and Big Data

Post on 17-Jul-2015

REAL-TIME AND BIG DATA

Mahmoud M. Jalajel

OUTLINE

• Intro: Real-time with Big Data

• The Lambda Architecture

• The Relay Model

WHY SOLVE FOR REAL-TIME

• Real-time offers more business value

• Live Web Analytics

• Recommendations

• Real-time = (semi-)real-time

• Event to index ~ single-digit minutes

• Query duration ~ single-digit seconds

REAL-TIME IMPLEMENTATION

• Incremental Implementation

• Stream processing / No full data context

• A real-time implementation is:

• Far more useful

• Faster

• Easily adaptable to batch mode
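A minimal sketch of the incremental idea above, assuming a hypothetical page-view event of the form (user_id, url). Each update sees only the current event, with no full data context, and batch mode is the same update logic replayed over a stored dataset. The class and function names are illustrative, not from the talk.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[str, str]  # hypothetical (user_id, url) pair


class PageViewCounter:
    """Counts page views incrementally, one event at a time."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def update(self, event: Event) -> None:
        # Only the current event is needed, never the full history.
        _user, url = event
        self.counts[url] += 1


def run_batch(events: Iterable[Event]) -> Counter:
    """Batch mode: replay a stored dataset through the same update logic."""
    counter = PageViewCounter()
    for event in events:
        counter.update(event)
    return counter.counts


events = [("u1", "/home"), ("u2", "/home"), ("u1", "/pricing")]
batch_counts = run_batch(events)
```

The same `update` method can be driven by a live stream or by a loop over stored data, which is why an incremental implementation adapts easily to batch mode.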

REAL-TIME IN HADOOP

MongoDB Query Time (optimized, single-node)

Hive Query Time (5 nodes)

Hangs, crashes, starts begging for mercy then commits suicide and weepingly dies

A few hours

2 Seconds 15 Minutes

LAMBDA ARCHITECTURE

Created by: Nathan Marz

lambda-architecture.net

LAMBDA ARCHITECTURE

BASIC ASSUMPTIONS

1. Query = Function(All Data)

2. Data are immutable timely facts

3. Append-Only (CRUD becomes CR)

4. Human Fault-Tolerance
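The assumptions above can be sketched in a few lines, assuming a hypothetical "user lives in city" fact schema. Facts are immutable and timestamped, the master list is only ever appended to, and a query is a plain function over all the data. Names such as `Fact`, `record`, and `current_city` are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """An immutable, timestamped fact (hypothetical schema)."""
    user: str
    city: str
    timestamp: int  # seconds since epoch


master: List[Fact] = []


def record(fact: Fact) -> None:
    # Append-only: facts are created and read, never updated or deleted.
    master.append(fact)


def current_city(all_data: List[Fact], user: str) -> Optional[str]:
    # Query = Function(All Data): the answer is derived from every fact.
    facts = [f for f in all_data if f.user == user]
    if not facts:
        return None
    return max(facts, key=lambda f: f.timestamp).city


record(Fact("sara", "Amman", 100))
record(Fact("sara", "Irbid", 200))  # a move is a new fact, not an update
```

Because the raw facts are never overwritten, a buggy query function can be fixed and re-run over the same data, which is one way to read the human fault-tolerance assumption.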

THE BATCH LAYER

• Accepts stream of data

• Appends to master dataset

• Uses: HDFS

THE SERVING LAYER

• Precomputes different views

• Works on full dataset

• Refreshes regularly offline

• Batch views are usually stored in a key-value store
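As a rough illustration of precomputing a view over the full dataset, the sketch below rebuilds hourly page-view counts from scratch. It assumes a hypothetical (url, timestamp) event format and uses a plain dict in place of a real key-value store.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int]      # hypothetical (url, timestamp in seconds)
ViewKey = Tuple[str, int]    # (url, hour bucket)


def build_batch_view(master_dataset: Iterable[Event]) -> Dict[ViewKey, int]:
    """Recompute page views per (url, hour) from the full master dataset."""
    view: Dict[ViewKey, int] = defaultdict(int)
    for url, ts in master_dataset:
        view[(url, ts // 3600)] += 1
    # The returned dict stands in for a key-value store.
    return dict(view)


master_dataset = [("/home", 10), ("/home", 20), ("/home", 3700), ("/about", 50)]
batch_view = build_batch_view(master_dataset)
```

Each scheduled refresh would call `build_batch_view` again on the whole dataset and swap in the new view, so queries become simple key lookups.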

CHECKPOINT

• Typical Hadoop Setup

• Slow, inefficient

• Outdated, usually lagging by hours or days

• Although accurate for surveyed data

• Costly to re-run. Real-time is not an option

THE SPEED LAYER

• Works with recent data

• Complements results

• Incremental implementation
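A minimal sketch of a speed-layer view, assuming the same hypothetical hourly page-view counts: it is updated one event at a time and holds only the recent hours that the batch views have not yet covered. The `expire_before` method is an illustrative assumption about how old entries might be dropped.

```python
from collections import defaultdict
from typing import Dict, Tuple

ViewKey = Tuple[str, int]  # (url, hour bucket)


class RealtimeView:
    """Incrementally maintained counts for recent data only."""

    def __init__(self) -> None:
        self.counts: Dict[ViewKey, int] = defaultdict(int)

    def on_event(self, url: str, ts: int) -> None:
        # Incremental update: no access to the full dataset is needed.
        self.counts[(url, ts // 3600)] += 1

    def expire_before(self, hour: int) -> None:
        # Drop hours that a finished batch run now covers.
        kept = {k: v for k, v in self.counts.items() if k[1] >= hour}
        self.counts = defaultdict(int, kept)


view = RealtimeView()
view.on_event("/home", 7200)
view.on_event("/home", 7300)
view.on_event("/home", 11000)
```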

THE FULL PICTURE

Query Merging
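Query merging can be sketched as combining a batch view with a real-time view at read time. The example assumes hypothetical hourly counts keyed by (url, hour), and assumes the real-time view holds only hours the batch view does not yet cover, so nothing is counted twice.

```python
from typing import Dict, Iterable, Tuple

ViewKey = Tuple[str, int]  # (url, hour bucket)


def merged_count(
    batch_view: Dict[ViewKey, int],
    realtime_view: Dict[ViewKey, int],
    url: str,
    hours: Iterable[int],
) -> int:
    """Answer a query by summing both views over the requested hours."""
    return sum(
        batch_view.get((url, h), 0) + realtime_view.get((url, h), 0)
        for h in hours
    )


batch_view = {("/home", 0): 2, ("/home", 1): 1}   # precomputed, older hours
realtime_view = {("/home", 2): 4}                 # recent hour, not yet batched
total = merged_count(batch_view, realtime_view, "/home", range(0, 3))
```

Keeping the two views disjoint is the delicate part; if the ranges overlap, the merge logic has to decide which side wins, which is one reason query merging appears among the drawbacks.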

EXAMPLE TECHNOLOGIES

DRUID EXAMPLE

REVIEWING LA

PROs

• Modular

• Flexible

• Self-Auditing

• Proven components

CONs

• Complex

• Maintainability

• Query Merging

THE RELAY MODEL

RELAY MODEL

Query Merging

THE WORKFLOW

REVIEWING RM

PROs

• Coherent, Simpler than LA

• Extensible to full LA

• Cheaper

CONs

• Master Data Storage

• Query flexibility

WHY NOT HADOOP NOW?

• Too much time, no capacity

• Too soon or too late

• Too expensive

• Hammer/nail problem

CONCLUSIONS

• Think big data, now!

• No need to invest years of development to perfect a big data system.

• Start now! Gradually grow system requirements and engineering skill set

• Select scalable components

Mahmoud Jalajel – @mjalajel

Questions?

APPENDIX

Apache Kafka

Apache Storm

Apache Storm with external systems