Simplified Data Management And Process Scheduling in Hadoop


Somebody Still Has To Investigate

Do you think we can find the location and the owner of the “streams” dataset today?

STREAMS { trackId:long, userId:long, ts:timestamp, ... }
Location: hdfs://data/core/streams
Format: avro
Tags: etl
Properties: official=>true, frequency=>hourly
Description: "userId started to stream trackId at time ts"

users = LOAD 'data.user' USING HCatLoader();

val users = hiveContext.hql("FROM data.user SELECT name, country")

users = LOAD '/data/core/user/part-00000.avro' USING AvroStorage();

Non-HCatalog way in Pig

ID  NAME  COUNTRY  GENDER
1   JOSH  US       M
2   ADAM  PL       M
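For contrast, the non-HCatalog read can also be sketched in Spark (this is an assumption for illustration, not from the deck: it needs Spark 1.4+ and the third-party spark-avro package); like the Pig AvroStorage example above, it bakes the HDFS path and the Avro format into the job:

import org.apache.spark.sql.{DataFrame, SQLContext}

// Non-HCatalog way in Spark (assumed sketch): the HDFS path and the Avro format
// are hard-coded, so moving the data or switching it to ORC means editing every job.
def loadUsersDirectly(sqlContext: SQLContext): DataFrame =
  sqlContext.read
    .format("com.databricks.spark.avro")
    .load("/data/core/user/part-00000.avro")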


Switching to ORC requires reimplementing the reader code in hundreds of production jobs...

users = LOAD 'data.users' USING HCatLoader();

ORC
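To make that point concrete, here is a hedged Scala sketch (the migration statements are assumptions for illustration, not from the deck): because the reading job resolves data.users through the metastore, it never mentions Avro or ORC and keeps working after the owner rewrites the table in ORC.

import org.apache.spark.sql.hive.HiveContext

// The reading job is format-agnostic: it only knows the table name.
def readUsers(hiveContext: HiveContext) =
  hiveContext.sql("FROM data.users SELECT name, country")

// Owner-side migration sketch (assumed example): write an ORC copy of the data,
// then swap it in place of data.users; readers above need no code change.
def migrateToOrc(hiveContext: HiveContext): Unit = {
  hiveContext.sql(
    "CREATE TABLE data.users_orc STORED AS ORC AS SELECT * FROM data.users")
  // ... verify the copy, then replace data.users with data.users_orc ...
}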

The picture comes from http://hortonworks.com/blog/introduction-apache-falcon-hadoop. Thanks Hortonworks!

Raw Data → Cleansed Data → Conformed Data → Presented Data

Raw Data → Presented Data

Which Elephant Is Yours?

A. Elephantus Dirtus

B. Elephantus Cleanus

Backup Slides

Falcon’s Adoption

■ Top Level Project since December 2014
■ 14 contributors from 3 companies
■ Originated and heavily used at InMobi
  ● 400+ pipelines and 2000+ data feeds
■ Also used at Expedia and at some undisclosed companies

Future Enhancements And Ideas

■ Improved Web UI [FALCON-790]
  ● More extensive search box, more widgets
  ● The “today morning” dashboard [FALCON-994]
  ● Re-running processes

■ Automatic discovery of datasets in HDFS and Hive
■ Streaming feeds and processes, e.g. Storm, Spark Streaming
■ Triage of data processing issues [FALCON-796]
■ HDFS snapshots
■ High availability of the Falcon server
