Cloud Infrastructures Slide Set 8 - More Cloud Technologies - Mesos, Spark
"We wanted people to be able to program for the data center just like they program for their laptop." - Ben Hindman, Co-Creator of Apache Mesos
• Mesos = a centralized, fault-tolerant cluster manager.
• Designed for distributed computing environments
• Provides resource management and resource isolation
http://iankent.uk/2014/02/26/a-quick-introduction-to-apache-mesos/
• Mesos joins multiple physical resources into a single virtual resource (opposite of classic virtualization)
• Schedules CPU & memory across the cluster
• Trend: clusters of commodity hardware
• Many cloud computing frameworks exist today
• Each cluster compute framework has its pros & cons > No framework suits all use cases
• a) Split cluster > Run one framework per sub-cluster
• b) Virtualize and allocate a set of VMs to each framework
• (-) Suboptimal server utilization
• (-) Inefficient data sharing
• > Inappropriate allocation granularity for both
• Compute frameworks often divide workloads into jobs and tasks.
• Tasks often have a short execution duration.
• Often, multiple jobs can run per node.
• > Jobs should run where the data is. > Better ratio of time spent on data transport vs. computation.
• Short job execution times enable higher cluster utilization.
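The transport-vs-computation ratio above can be illustrated with a back-of-the-envelope calculation. All concrete numbers here (data size, link speed) are illustrative assumptions; only the ~23 s task duration echoes the figure cited later for map tasks:

```python
# Compare running a short task where its data lives vs. first shipping the
# data to a remote node. Numbers are assumptions for illustration only.
data_bytes = 10 * 10**9           # assume a 10 GB input partition
network_bw = 1 * 10**9 / 8        # assume a 1 Gbit/s link, in bytes/s
compute_seconds = 23              # a short map task (cf. the ~23 s figure below)

transfer_seconds = data_bytes / network_bw    # time just to move the data
remote_total = transfer_seconds + compute_seconds
local_total = compute_seconds

print(f"transfer: {transfer_seconds:.0f} s, "
      f"remote: {remote_total:.0f} s, local: {local_total} s")
# With these assumptions, shipping the data takes far longer than computing on it.
```

Under these assumptions the transfer alone dwarfs the computation, which is why schedulers prefer data-local placement for short tasks.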
A uniform, generic approach of sharing cluster resources such as CPU time
and data across compute frameworks would be desirable.
• ZooKeeper
• Mesos masters
• Mesos slaves
• Frameworks
• Chronos, Marathon, ….
• Aurora, Hadoop, Jenkins, Spark, Torque
• Master daemon manages the slave daemons
• Slave daemon runs on each cluster node
http://mesos.apache.org/documentation/latest/mesos-architecture/
• Master controls resources across applications by making
• Resource offers
• Master decides about resource allocation to frameworks based on organizational policy
• Organization policies
• Fair sharing
• Strict priority
• New policy strategies can be added as plug-ins.
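The two policies named above can be sketched as pluggable selection functions. This is a conceptual sketch, not the Mesos allocator API; the framework records and field names are invented for illustration:

```python
# Conceptual sketch: an allocator chooses which framework receives the next
# resource offer, according to a pluggable organizational policy.

def fair_sharing(frameworks):
    # Offer to the framework with the smallest allocation so far.
    return min(frameworks, key=lambda f: f["allocated_cpus"])

def strict_priority(frameworks):
    # Offer to the highest-priority framework (lower number = higher priority).
    return min(frameworks, key=lambda f: f["priority"])

frameworks = [
    {"name": "spark",  "allocated_cpus": 4,  "priority": 2},
    {"name": "hadoop", "allocated_cpus": 12, "priority": 1},
]

print(fair_sharing(frameworks)["name"])     # spark (has the least so far)
print(strict_priority(frameworks)["name"])  # hadoop (priority 1)
```

Swapping the policy function corresponds to the plug-in mechanism mentioned above: the offer loop stays the same while the selection strategy changes.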
• Runs on top of Mesos
• Consists of two components:
• Scheduler
• Executor
• Scheduler
• registers with the master
• receives resource offerings from the master
• decides what to do with resources offered by the master within the framework
• Executor
• launched on slave nodes
• runs framework tasks
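The offer cycle between master, scheduler, and executor can be modeled as a toy loop. This is a conceptual sketch, not the real Mesos framework API; the offer dictionaries and resource figures are made up for illustration:

```python
# Toy model of the Mesos offer cycle: the master offers slave resources to a
# framework scheduler, which either launches a task with part of the offer
# or declines it (so the master can offer the resources elsewhere).

def scheduler_decide(offer, task_cpus, task_mem):
    """Framework scheduler: decide what to do with an offer from the master."""
    if offer["cpus"] >= task_cpus and offer["mem"] >= task_mem:
        # Accept: launch one task, which the executor on that slave would run.
        return {"slave": offer["slave"], "cpus": task_cpus, "mem": task_mem}
    return None  # decline the offer

offers = [
    {"slave": "node-1", "cpus": 1, "mem": 512},
    {"slave": "node-2", "cpus": 4, "mem": 4096},
]

# This framework needs 2 CPUs / 2048 MB per task: only node-2's offer fits.
launched = [t for t in (scheduler_decide(o, 2, 2048) for o in offers) if t]
print(launched)
```

Declining offers is what keeps the master framework-agnostic: it never needs to understand a framework's tasks, only its accept/decline decisions.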
Provides a "thin resource sharing layer that enables fine-grained sharing across diverse cluster computing frameworks, by giving frameworks a common interface for accessing cluster resources." - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
• How to match resources to a task?
• Be framework agnostic.
• Adapt to different scheduling needs.
• Be highly scalable.
• Scheduling must be HA and fault-tolerant.
• Addresses large data warehouse scenarios, such as Facebook’s Hadoop data warehouse (~1200 nodes in 2010).
• Median job length ~84 s, consisting of
• MapReduce tasks of ~23 s
„Apache Spark is a fast and general-purpose cluster computing system.“
- https://spark.apache.org/docs/latest/
• Included Tools
• Spark SQL - SQL and structured data processing.
• MLlib - Machine learning library
• GraphX - Graph processing
• Spark Streaming - scalable, high-throughput, fault-tolerant stream processing of live data streams
• much wider class of applications than MapReduce
• automatic fault-tolerance
https://spark.apache.org/research.html
• Spark is well designed for data analytics use cases > cyclic data flow
• Iterative algorithms, e.g. machine learning algorithms and graph algorithms such as PageRank
• Interactive data mining: a user loads data into RAM across a cluster and queries it repeatedly
• Streaming applications: maintain aggregate state over time
https://spark.apache.org/research.html
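The iterative use case can be made concrete with a toy PageRank loop. This is plain Python, not Spark code: the in-memory `links` dataset plays the role of a cached RDD that every pass reuses, whereas a MapReduce implementation would re-read it from disk on each iteration. Graph, damping factor, and iteration count are illustrative assumptions:

```python
# Toy PageRank over a tiny in-memory graph: each iteration reuses the same
# dataset, which is exactly the cyclic data flow Spark is designed for.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # assumed example graph
ranks = {page: 1.0 for page in links}

for _ in range(10):                                  # each pass reuses `links`
    contribs = {page: 0.0 for page in links}
    for page, outgoing in links.items():
        for dest in outgoing:
            contribs[dest] += ranks[page] / len(outgoing)
    # Standard damping: 15% teleport, 85% from incoming contributions.
    ranks = {page: 0.15 + 0.85 * c for page, c in contribs.items()}

print({p: round(r, 2) for p, r in sorted(ranks.items())})
```

The loop's working set never leaves memory between iterations; in Spark the equivalent would be a persisted RDD that the driver queries repeatedly.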
• RDD = resilient distributed dataset
• RDDs can be stored in memory between queries without requiring replication
• RDDs can rebuild lost data by lineage > Redo all steps required to get the data (map, join, groupBy)
https://spark.apache.org/research.html
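Lineage-based recovery can be sketched in a few lines. This `ToyRDD` class is an invented miniature for illustration, not Spark's actual RDD implementation: it records its transformation steps and replays them when the cached result is lost, instead of relying on a replica:

```python
# Minimal sketch of lineage: a dataset remembers how it was computed and can
# recompute itself from the immutable source if its cached copy is lost.

class ToyRDD:
    def __init__(self, source, transforms=()):
        self.source = source          # immutable input data
        self.transforms = transforms  # lineage: the steps that built this RDD
        self.cache = None             # stands in for an in-memory partition

    def map(self, fn):
        # Transformations don't compute anything; they only extend the lineage.
        return ToyRDD(self.source, self.transforms + (fn,))

    def collect(self):
        if self.cache is None:        # lost or never computed: replay lineage
            data = self.source
            for fn in self.transforms:
                data = [fn(x) for x in data]
            self.cache = data
        return self.cache

rdd = ToyRDD([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
print(rdd.collect())   # [11, 21, 31]
rdd.cache = None       # simulate losing the in-memory partition
print(rdd.collect())   # rebuilt from lineage: [11, 21, 31]
```

Because recovery is recomputation rather than replication, keeping RDDs purely in memory stays cheap, which is where the multi-pass speedups come from.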
"RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics."
https://spark.apache.org/research.html
• Spark applications run as independent sets of processes on a cluster
• coordinated by the SparkContext in the main program (= driver program)
• SparkContext can connect to several types of cluster managers
• Spark standalone manager
• Apache Mesos
• Apache Hadoop YARN
https://spark.apache.org/docs/latest/cluster-overview.html
• Spark acquires executors on nodes in the cluster
• Executor = process
• runs computations
• stores data for your app
• Sends app code (JARs, Python files) specified in the SparkContext to the executors
• Spark sends tasks for the executors to run
https://spark.apache.org/docs/latest/cluster-overview.html
• Each app gets its own executor processes
• lives while the app lives
• runs tasks in multiple threads
• = isolation between apps
• each driver schedules its own tasks
• different apps > different executors > different JVMs
https://spark.apache.org/docs/latest/programming-guide.html
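The executor model above can be mimicked in plain Python. This is an analogy, not Spark code: a stdlib `ThreadPoolExecutor` stands in for a Spark executor, each app gets its own pool (so apps never share a process), and the pool runs the app's tasks in multiple threads:

```python
# Analogy for the Spark executor model: one pool ("executor") per app, alive
# for the app's lifetime, running that app's tasks in multiple threads.
from concurrent.futures import ThreadPoolExecutor

def run_app(app_name, tasks):
    # Separate pool per app = no in-process sharing between apps (isolation).
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Each "task" here just squares a number and tags it with its app.
        return list(executor.map(lambda t: f"{app_name}:{t * t}", tasks))

print(run_app("app-A", [1, 2, 3]))  # app-A's executor runs its tasks in threads
print(run_app("app-B", [4, 5]))     # app-B gets a completely separate executor
```

In real Spark the isolation is even stronger: different apps run their executors in different JVM processes, as the slide notes, so they cannot share state without an external store.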
• neo4j.com
• docker.com
• http://unionfs.filesystems.org/
• mesos.apache.org
• spark.apache.org