Simplified Data Management And Process Scheduling in Hadoop


Somebody Still Has To Investigate

Do you think we can find the location and the owner of the “streams” dataset today?

STREAMS { trackId:long, userId:long, ts:timestamp, ... }
Location: hdfs://data/core/streams
Format: avro
Tags: etl
Properties: official=>true, frequency=>hourly
Description: "userId started to stream trackId at time ts"

users = LOAD 'data.user' USING HCatLoader();

val users = hiveContext.hql("FROM data.user SELECT name, country")

users = LOAD '/data/core/user/part-00000.avro' USING AvroStorage();

Non-HCatalog way in Pig

ID  NAME  COUNTRY  GENDER
1   JOSH  US       M
2   ADAM  PL       M
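For contrast, the non-HCatalog read can also be sketched in Spark (this is an assumption for illustration, not from the deck: it needs Spark 1.4+ and the third-party spark-avro package); like the Pig AvroStorage example above, it bakes the HDFS path and the Avro format into the job:

import org.apache.spark.sql.{DataFrame, SQLContext}

// Non-HCatalog way in Spark (assumed sketch): the HDFS path and the Avro format
// are hard-coded, so moving the data or switching it to ORC means editing every job.
def loadUsersDirectly(sqlContext: SQLContext): DataFrame =
  sqlContext.read
    .format("com.databricks.spark.avro")
    .load("/data/core/user/part-00000.avro")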


Switching to ORC requires reimplementing the reader code in hundreds of production jobs...

users = LOAD 'data.users' USING HCatLoader();

ORC
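To make that point concrete, here is a hedged Scala sketch (the migration statements are assumptions for illustration, not from the deck): because the reading job resolves data.users through the metastore, it never mentions Avro or ORC and keeps working after the owner rewrites the table in ORC.

import org.apache.spark.sql.hive.HiveContext

// The reading job is format-agnostic: it only knows the table name.
def readUsers(hiveContext: HiveContext) =
  hiveContext.sql("FROM data.users SELECT name, country")

// Owner-side migration sketch (assumed example): write an ORC copy of the data,
// then swap it in place of data.users; readers above need no code change.
def migrateToOrc(hiveContext: HiveContext): Unit = {
  hiveContext.sql(
    "CREATE TABLE data.users_orc STORED AS ORC AS SELECT * FROM data.users")
  // ... verify the copy, then replace data.users with data.users_orc ...
}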

The picture comes from http://hortonworks.com/blog/introduction-apache-falcon-hadoop. Thanks Hortonworks!

Raw Data → Cleansed Data → Conformed Data → Presented Data

Raw Data → Presented Data

Which Elephant Is Yours?

A. Elephantus Dirtus

B. Elephantus Cleanus

Backup Slides

Falcon’s Adoption

■ Top Level Project since December 2014
■ 14 contributors from 3 companies
■ Originated and heavily used at InMobi
  ● 400+ pipelines and 2000+ data feeds
■ Also used at Expedia and at some undisclosed companies

Future Enhancements And Ideas

■ Improved Web UI [FALCON-790]
  ● More extensive search box, more widgets
  ● The “today morning” dashboard [FALCON-994]
  ● Re-running processes

■ Automatic discovery of datasets in HDFS and Hive
■ Streaming feeds and processes, e.g. Storm, Spark Streaming
■ Triage of data processing issues [FALCON-796]
■ HDFS snapshots
■ High availability of the Falcon server
