Cloud Infrastructures Slide Set 8 - More Cloud Technologies - Mesos, Spark
"We wanted people to be able to program for the data center just like they program for their laptop." - Ben Hindman, Co-Creator of Apache Mesos
• Mesos = a centralized, fault-tolerant cluster manager.
• Designed for distributed computing environments
• Provides resource management and resource isolation
http://iankent.uk/2014/02/26/a-quick-introduction-to-apache-mesos/
• Mesos joins multiple physical resources into a single virtual resource (opposite of classic virtualization)
• Schedules CPU & memory across the cluster
• Trend: clusters of commodity hardware
• Many cloud computing frameworks exist today
• Each cluster compute framework has its pros & cons > No framework suits all use cases
• a) Split cluster > Run one framework per sub-cluster
• b) Virtualize and allocate a set of VMs to each framework
• (-) Suboptimal server utilization
• (-) Inefficient data sharing
• > Inappropriate allocation granularity for both
• Compute frameworks often divide workloads into jobs and tasks.
• Tasks often have a short execution duration.
• Often, multiple jobs can run per node.
• > Jobs should run where the data is. > Better ratio of time spent on data transport vs. computation.
• Short job execution times enable higher cluster utilization.
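The transport-vs-computation ratio above can be illustrated with a back-of-the-envelope calculation. All concrete numbers here (data size, link speed) are illustrative assumptions; only the ~23 s task duration echoes the figure cited later for map tasks:

```python
# Compare running a short task where its data lives vs. first shipping the
# data to a remote node. Numbers are assumptions for illustration only.
data_bytes = 10 * 10**9           # assume a 10 GB input partition
network_bw = 1 * 10**9 / 8        # assume a 1 Gbit/s link, in bytes/s
compute_seconds = 23              # a short map task (cf. the ~23 s figure below)

transfer_seconds = data_bytes / network_bw    # time just to move the data
remote_total = transfer_seconds + compute_seconds
local_total = compute_seconds

print(f"transfer: {transfer_seconds:.0f} s, "
      f"remote: {remote_total:.0f} s, local: {local_total} s")
# With these assumptions, shipping the data takes far longer than computing on it.
```

Under these assumptions the transfer alone dwarfs the computation, which is why schedulers prefer data-local placement for short tasks.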
A uniform, generic approach of sharing cluster resources such as CPU time
and data across compute frameworks would be desirable.
• ZooKeeper
• Mesos masters
• Mesos slaves
• Frameworks
• Chronos, Marathon, ….
• Aurora, Hadoop, Jenkins, Spark, Torque
• Master daemon manages the slave daemons
• Slave daemon runs on each cluster node
http://mesos.apache.org/documentation/latest/mesos-architecture/
• Master controls resources across applications by making
• Resource offers
• Master decides about resource allocation to frameworks based on organizational policy
• Organization policies
• Fair sharing
• Strict priority
• New policy strategies can be added as plug-ins.
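The two policies named above can be sketched as pluggable selection functions. This is a conceptual sketch, not the Mesos allocator API; the framework records and field names are invented for illustration:

```python
# Conceptual sketch: an allocator chooses which framework receives the next
# resource offer, according to a pluggable organizational policy.

def fair_sharing(frameworks):
    # Offer to the framework with the smallest allocation so far.
    return min(frameworks, key=lambda f: f["allocated_cpus"])

def strict_priority(frameworks):
    # Offer to the highest-priority framework (lower number = higher priority).
    return min(frameworks, key=lambda f: f["priority"])

frameworks = [
    {"name": "spark",  "allocated_cpus": 4,  "priority": 2},
    {"name": "hadoop", "allocated_cpus": 12, "priority": 1},
]

print(fair_sharing(frameworks)["name"])     # spark (has the least so far)
print(strict_priority(frameworks)["name"])  # hadoop (priority 1)
```

Swapping the policy function corresponds to the plug-in mechanism mentioned above: the offer loop stays the same while the selection strategy changes.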
• Runs on top of Mesos
• Consists of two components:
• Scheduler
• Executor
• Scheduler
• registers with the master
• receives resource offerings from the master
• decides what to do with resources offered by the master within the framework
• Executor
• launched on slave nodes
• runs framework tasks
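The offer cycle between master, scheduler, and executor can be modeled as a toy loop. This is a conceptual sketch, not the real Mesos framework API; the offer dictionaries and resource figures are made up for illustration:

```python
# Toy model of the Mesos offer cycle: the master offers slave resources to a
# framework scheduler, which either launches a task with part of the offer
# or declines it (so the master can offer the resources elsewhere).

def scheduler_decide(offer, task_cpus, task_mem):
    """Framework scheduler: decide what to do with an offer from the master."""
    if offer["cpus"] >= task_cpus and offer["mem"] >= task_mem:
        # Accept: launch one task, which the executor on that slave would run.
        return {"slave": offer["slave"], "cpus": task_cpus, "mem": task_mem}
    return None  # decline the offer

offers = [
    {"slave": "node-1", "cpus": 1, "mem": 512},
    {"slave": "node-2", "cpus": 4, "mem": 4096},
]

# This framework needs 2 CPUs / 2048 MB per task: only node-2's offer fits.
launched = [t for t in (scheduler_decide(o, 2, 2048) for o in offers) if t]
print(launched)
```

Declining offers is what keeps the master framework-agnostic: it never needs to understand a framework's tasks, only its accept/decline decisions.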
Provides a "thin resource sharing layer that enables fine-grained sharing across diverse cluster computing frameworks, by giving frameworks a common interface for accessing cluster resources." - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
• How to match resources to a task?
• Be framework agnostic.
• Adapt to different scheduling needs.
• Be highly scalable.
• Scheduling must be HA and fault-tolerant.
• Addresses large data warehouse scenarios, such as Facebook’s Hadoop data warehouse (~1200 nodes in 2010).
• Median job length ~84 s, consisting of
• MapReduce tasks of ~23 s
„Apache Spark is a fast and general-purpose cluster computing system.“
- https://spark.apache.org/docs/latest/
• Included Tools
• Spark SQL - SQL and structured data processing.
• MLlib - Machine learning library
• GraphX - Graph processing
• Spark Streaming - scalable, high-throughput, fault-tolerant stream processing of live data streams
• much wider class of applications than MapReduce
• automatic fault-tolerance
https://spark.apache.org/research.html
• Spark is well designed for data analytics use cases > cyclic data flow
• Iterative algorithms, e.g. machine learning algorithms and graph algorithms such as PageRank
• Interactive data mining: a user loads data into RAM across a cluster and queries it repeatedly
• Streaming applications: maintain aggregate state over time
https://spark.apache.org/research.html
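The iterative use case can be made concrete with a toy PageRank loop. This is plain Python, not Spark code: the in-memory `links` dataset plays the role of a cached RDD that every pass reuses, whereas a MapReduce implementation would re-read it from disk on each iteration. Graph, damping factor, and iteration count are illustrative assumptions:

```python
# Toy PageRank over a tiny in-memory graph: each iteration reuses the same
# dataset, which is exactly the cyclic data flow Spark is designed for.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # assumed example graph
ranks = {page: 1.0 for page in links}

for _ in range(10):                                  # each pass reuses `links`
    contribs = {page: 0.0 for page in links}
    for page, outgoing in links.items():
        for dest in outgoing:
            contribs[dest] += ranks[page] / len(outgoing)
    # Standard damping: 15% teleport, 85% from incoming contributions.
    ranks = {page: 0.15 + 0.85 * c for page, c in contribs.items()}

print({p: round(r, 2) for p, r in sorted(ranks.items())})
```

The loop's working set never leaves memory between iterations; in Spark the equivalent would be a persisted RDD that the driver queries repeatedly.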
• RDD = resilient distributed dataset
• RDDs can be stored in memory between queries without requiring replication
• RDDs can rebuild lost data by lineage > Redo all steps required to get the data (map, join, groupBy)
https://spark.apache.org/research.html
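Lineage-based recovery can be sketched in a few lines. This `ToyRDD` class is an invented miniature for illustration, not Spark's actual RDD implementation: it records its transformation steps and replays them when the cached result is lost, instead of relying on a replica:

```python
# Minimal sketch of lineage: a dataset remembers how it was computed and can
# recompute itself from the immutable source if its cached copy is lost.

class ToyRDD:
    def __init__(self, source, transforms=()):
        self.source = source          # immutable input data
        self.transforms = transforms  # lineage: the steps that built this RDD
        self.cache = None             # stands in for an in-memory partition

    def map(self, fn):
        # Transformations don't compute anything; they only extend the lineage.
        return ToyRDD(self.source, self.transforms + (fn,))

    def collect(self):
        if self.cache is None:        # lost or never computed: replay lineage
            data = self.source
            for fn in self.transforms:
                data = [fn(x) for x in data]
            self.cache = data
        return self.cache

rdd = ToyRDD([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
print(rdd.collect())   # [11, 21, 31]
rdd.cache = None       # simulate losing the in-memory partition
print(rdd.collect())   # rebuilt from lineage: [11, 21, 31]
```

Because recovery is recomputation rather than replication, keeping RDDs purely in memory stays cheap, which is where the multi-pass speedups come from.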
"RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics."
https://spark.apache.org/research.html
• Spark applications run as independent sets of processes on a cluster
• coordinated by the SparkContext in the main program (= driver program)
• SparkContext can connect to several types of cluster managers
• Spark standalone manager
• Apache Mesos
• Apache Hadoop YARN
https://spark.apache.org/docs/latest/cluster-overview.html
• Spark acquires executors on nodes in the cluster
• Executor = process
• runs computations
• stores data for your app
• Sends app code (JARs, Python files) specified in the SparkContext to the executors
• Spark sends tasks for the executors to run
https://spark.apache.org/docs/latest/cluster-overview.html
• Each app gets its own executor processes
• lives while the app lives
• runs tasks in multiple threads
• = isolation between apps
• each driver schedules its own tasks
• different apps > different executors > different JVMs
https://spark.apache.org/docs/latest/programming-guide.html
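The executor model above can be mimicked in plain Python. This is an analogy, not Spark code: a stdlib `ThreadPoolExecutor` stands in for a Spark executor, each app gets its own pool (so apps never share a process), and the pool runs the app's tasks in multiple threads:

```python
# Analogy for the Spark executor model: one pool ("executor") per app, alive
# for the app's lifetime, running that app's tasks in multiple threads.
from concurrent.futures import ThreadPoolExecutor

def run_app(app_name, tasks):
    # Separate pool per app = no in-process sharing between apps (isolation).
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Each "task" here just squares a number and tags it with its app.
        return list(executor.map(lambda t: f"{app_name}:{t * t}", tasks))

print(run_app("app-A", [1, 2, 3]))  # app-A's executor runs its tasks in threads
print(run_app("app-B", [4, 5]))     # app-B gets a completely separate executor
```

In real Spark the isolation is even stronger: different apps run their executors in different JVM processes, as the slide notes, so they cannot share state without an external store.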
• neo4j.com
• docker.com
• http://unionfs.filesystems.org/
• mesos.apache.org
• spark.apache.org