Hadoop YARN - PRACE Agenda Systems (Indico) · 8 Presentation title Hadoop applications Hundreds of...

12
1 www.prace-ri.eu Presentation title Hadoop YARN EPCC, The University of Edinburgh Amy Krause, Andreas Vroutsis

Transcript of Hadoop YARN - PRACE Agenda Systems (Indico) · 8 Presentation title Hadoop applications Hundreds of...

Page 1: Hadoop YARN - PRACE Agenda Systems (Indico) · 8 Presentation title Hadoop applications Hundreds of millions of Tweets are processed, stored, cached, served and analysed every day

1 www.prace-ri.euPresentation title

Hadoop YARN

EPCC, The University of Edinburgh

Amy Krause, Andreas Vroutsis

Page 2: Hadoop YARN - PRACE Agenda Systems (Indico) · 8 Presentation title Hadoop applications Hundreds of millions of Tweets are processed, stored, cached, served and analysed every day

2 www.prace-ri.euPresentation title

Outline

▶ Hadoop ecosystem

▶ Hadoop YARN

▶ Hadoop use cases

Page 3: Hadoop YARN - PRACE Agenda Systems (Indico) · 8 Presentation title Hadoop applications Hundreds of millions of Tweets are processed, stored, cached, served and analysed every day

3 www.prace-ri.euPresentation title

Hadoop system

MapReduceHBaseMahout

YARN

HDFS

Spark

Page 4: Hadoop YARN - PRACE Agenda Systems (Indico) · 8 Presentation title Hadoop applications Hundreds of millions of Tweets are processed, stored, cached, served and analysed every day

4 www.prace-ri.euPresentation title

Hadoop YARN overview

▶ YARN is a job scheduler

▶ It manages resources (nodes)

▶ It manages applications (a single job or a DAG of jobs)

▶ The Application Manager negotiates resources with the Resource Manager and works

with the Node Manager to execute and monitor tasks

Page 5: Hadoop YARN - PRACE Agenda Systems (Indico) · 8 Presentation title Hadoop applications Hundreds of millions of Tweets are processed, stored, cached, served and analysed every day

5 www.prace-ri.euPresentation title

YARN Architecture

Page 6: Hadoop YARN - PRACE Agenda Systems (Indico) · 8 Presentation title Hadoop applications Hundreds of millions of Tweets are processed, stored, cached, served and analysed every day

6 www.prace-ri.euPresentation title

YARN Components

▶ ResourceManager

▶ Scheduler: assigns cluster resources to queues and applications

▶ ApplicationsManager: accepts job submissions and creates the ApplicationMaster

▶ NodeManager:

▶ responsible for launching and managing containers

▶ Manages the “health” of the node and reports to ResourceManager

▶ ApplicationMaster

▶ specifies the tasks to execute in a container

▶ Container

Page 7: Hadoop YARN - PRACE Agenda Systems (Indico) · 8 Presentation title Hadoop applications Hundreds of millions of Tweets are processed, stored, cached, served and analysed every day

7 www.prace-ri.euPresentation title

Hadoop Applications

▶ Financial

▶ Retail

▶ IoT

▶ Fast processing of volatile and large volumes of data

▶ Reliable, robust and scalable for peak demand

▶ Storage and analytics

▶ Data Analytics

▶ Structured, semi-structured and unstructured data

▶ Data storage

▶ Data lakes

Page 8: Hadoop YARN - PRACE Agenda Systems (Indico) · 8 Presentation title Hadoop applications Hundreds of millions of Tweets are processed, stored, cached, served and analysed every day

8 www.prace-ri.euPresentation title

Hadoop applications

▶ Hundreds of millions of Tweets are processed, stored,

cached, served and analysed every day

▶ Hadoop clusters are running both compute and HDFS

▶ “We have multiple clusters storing over 500 PB divided in four

groups (real time, processing, data warehouse and cold

storage). Our biggest cluster is over 10k nodes. We run 150k

applications and launch 130M containers per day.”

▶ Optimised data storage and workflow solutions within Hadoop

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale.htmlhttps://github.com/twitter/elephant-bird

Page 9: Hadoop YARN - PRACE Agenda Systems (Indico) · 8 Presentation title Hadoop applications Hundreds of millions of Tweets are processed, stored, cached, served and analysed every day

9 www.prace-ri.euPresentation title

Hadoop applications

▶ “Expedia provisions Hadoop clusters

using Amazon Elastic Map Reduce (Amazon

EMR) to analyze and process streams of data

coming from Expedia’s global network of websites,

primarily clickstream, user interaction, and supply

data"

https://aws.amazon.com/solutions/case-studies/expedia/

Page 10: Hadoop YARN - PRACE Agenda Systems (Indico) · 8 Presentation title Hadoop applications Hundreds of millions of Tweets are processed, stored, cached, served and analysed every day

10 www.prace-ri.euPresentation title

… and many more!

Page 11: Hadoop YARN - PRACE Agenda Systems (Indico) · 8 Presentation title Hadoop applications Hundreds of millions of Tweets are processed, stored, cached, served and analysed every day

11 www.prace-ri.euPresentation title

A little more on Spark

▶ Explicitly supports caching data

▶ Speeds up iterative algorithms

▶ Can use HDFS as the data source

▶ More that just map/reduce

▶ Transformations:

▶ map, filter, union, Cartesian, join, sample…

▶ Actions:

▶ reduce, collect, count, first, countBy, foreach…

Page 12: Hadoop YARN - PRACE Agenda Systems (Indico) · 8 Presentation title Hadoop applications Hundreds of millions of Tweets are processed, stored, cached, served and analysed every day

12 www.prace-ri.euPresentation title

THANK YOU FOR YOUR ATTENTION

www.prace-ri.eu