HUG August 2010: Mesos
Transcript of HUG August 2010: Mesos
Mesos: A Resource Management Platform for Hadoop and Big Data Clusters
Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
UC Berkeley
Challenges in Hadoop Cluster Management
» Isolation (both fault and resource), e.g. if co-locating production and experimental jobs
» Testing and deploying upgrades
» JobTracker scalability (in larger clusters)
» Running non-Hadoop jobs, e.g. MPI, streaming MapReduce, etc.
Mesos
Mesos is a cluster resource manager over which multiple instances of Hadoop and other distributed applications can run.
[Diagram: Hadoop (production), Hadoop (ad-hoc), and MPI running side by side on Mesos, which spans all cluster nodes]
What's Different About It?
Other resource managers exist today, including:
» Hadoop on Demand
» Batch schedulers (e.g. Torque)
» VM schedulers (e.g. Eucalyptus)
However, these have two problems with Hadoop-like workloads:
» Data locality is compromised due to static partitioning of nodes
» Utilization is hurt because jobs hold nodes for their full duration
Mesos addresses these through two features:
» Fine-grained sharing at the level of tasks
» A two-level scheduling model where jobs control placement
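The two-level model can be sketched as follows (an illustrative Python sketch, not the real Mesos API; all class and method names here are hypothetical): the master makes resource offers to each framework, and the framework's own scheduler decides which offers to accept and where to launch tasks.

```python
# Illustrative sketch of two-level scheduling via resource offers.
# These classes are hypothetical, not the real Mesos API.

class Offer:
    def __init__(self, node, cpus, mem):
        self.node, self.cpus, self.mem = node, cpus, mem

class LocalityScheduler:
    """A framework scheduler that accepts only offers on nodes holding its data."""
    def __init__(self, preferred_nodes):
        self.preferred_nodes = set(preferred_nodes)

    def resource_offer(self, offer):
        # The framework, not the master, decides placement:
        # accept offers on preferred nodes, decline the rest.
        return offer.node in self.preferred_nodes

# The master offers each free node's resources to the framework in turn.
scheduler = LocalityScheduler(preferred_nodes=["node2", "node4"])
offers = [Offer("node%d" % i, cpus=4, mem=8192) for i in range(1, 6)]
accepted = [o.node for o in offers if scheduler.resource_offer(o)]
```

Declined offers go back to the master, which can re-offer those resources to another framework, which is how fine-grained sharing and framework-controlled placement coexist.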
Fine-Grained Sharing
[Figure: two cluster timelines. Under coarse-grained sharing (HOD, etc.), each of Hadoop 1, 2, and 3 holds a static block of nodes for its full duration; under fine-grained sharing (Mesos), tasks from Hadoop 1, 2, and 3 are interleaved across all nodes over time]
Mesos Architecture
[Diagram: a Mesos master coordinates a Mesos slave on each node. Framework schedulers (Hadoop 0.20 scheduler, Hadoop 0.19 scheduler, MPI scheduler) submit their jobs to the master; per-node framework executors (Hadoop 20 executor, Hadoop 19 executor, MPI executor) run the tasks on the slaves]
Design Goals
» Scalability (to 10,000s of nodes)
» Robustness (even to master failure)
» Flexibility (support a wide variety of frameworks)
Resulting design: a simple, minimal core that pushes resource selection logic to frameworks.
Other Features
» Master fault tolerance using ZooKeeper
» Resource isolation using Linux Containers: isolates CPU and memory between tasks on each node; in the newest kernels, can isolate network & disk I/O too
» Web UI for viewing cluster state
» Deploy scripts for private clusters and EC2
Mesos Status
» Prototype in ~10,000 lines of C++
» Ported frameworks: Hadoop (0.20.2), MPI (MPICH2), Torque
» New frameworks: Spark, a Scala framework for iterative & interactive jobs
» Test deployments at Twitter and Facebook
Results
[Figure, three panels:
» Dynamic Resource Sharing — share of cluster (0-100%) over time (s) for MPI, Hadoop, and Spark running concurrently on one Mesos cluster.
» Data Locality & Job Running Times — percentage of local map tasks and job running time (s) for static partitioning vs. Mesos with no delay scheduling, 1 s delay scheduling, and 5 s delay scheduling.
» Scalability — task startup overhead (s, 0-1) vs. number of nodes (up to 50,000).]
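The delay-scheduling variants compared in the data-locality results can be sketched simply (illustrative Python, hypothetical names, not the Mesos or Hadoop API): when none of a job's preferred nodes has a free slot, the scheduler waits up to a small delay for a local slot before falling back to any available one.

```python
# Illustrative sketch of delay scheduling (hypothetical names).
# A job skips non-local slots until it has waited `max_delay` seconds,
# then accepts any slot to avoid starvation.

def delay_schedule(offers, preferred_nodes, waited, max_delay):
    """offers: node names with a free slot, in arrival order.
    Returns (chosen_node, is_local); chosen_node is None to keep waiting."""
    for node in offers:
        if node in preferred_nodes:
            return node, True          # local slot: take it immediately
    if waited >= max_delay:
        return offers[0], False        # waited long enough: run non-locally
    return None, False                 # keep waiting for a local slot

# Job prefers node3; only non-local slots are free at first.
assert delay_schedule(["node1", "node2"], {"node3"}, 0.5, 5.0) == (None, False)
# After the delay expires, it accepts a non-local slot.
assert delay_schedule(["node1", "node2"], {"node3"}, 5.0, 5.0) == ("node1", False)
# A local slot is always taken on sight.
assert delay_schedule(["node1", "node3"], {"node3"}, 0.0, 5.0) == ("node3", True)
```

This is why the 1 s and 5 s variants in the figure trade a small wait for much higher map-task locality.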
Mesos Demo
Spark: A Framework for Iterative and Interactive Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica
Spark Goals
» Support iterative applications: common in machine learning but problematic for MapReduce, Dryad, etc.
» Retain MapReduce's fault tolerance & scalability
» Experiment with programmability: integrate into the Scala programming language; support interactive use from the Scala interpreter
Key Abstraction
Resilient Distributed Datasets (RDDs):
» Collections of elements distributed across the cluster that can persist across parallel operations
» Can be stored in memory, on disk, etc.
» Can be transformed with map, filter, etc.
» Automatically rebuilt on failure
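The "automatically rebuilt on failure" property follows from each RDD remembering how it was derived from its parent. A minimal sketch of that lineage idea (illustrative Python, not Spark's implementation): if a cached result is lost, it is recomputed by re-applying the recorded transformation to the parent.

```python
# Illustrative sketch of lineage-based recovery (not Spark's implementation).

class RDD:
    def __init__(self, parent=None, transform=None, data=None):
        self.parent, self.transform = parent, transform
        self._data = data          # base data, if this is a source RDD
        self._cache = None         # cached result, if persisted

    def map(self, f):
        # Record the lineage: the new RDD = map(f) over this one.
        return RDD(parent=self, transform=lambda xs: [f(x) for x in xs])

    def compute(self):
        if self._cache is not None:
            return self._cache
        if self._data is not None:
            return self._data
        # Recompute from the parent using the recorded transformation.
        return self.transform(self.parent.compute())

    def cache(self):
        self._cache = self.compute()
        return self

base = RDD(data=[1, 2, 3])
doubled = base.map(lambda x: 2 * x).cache()
doubled._cache = None              # simulate losing the cached partition
recovered = doubled.compute()      # rebuilt from lineage
```

Because only the (small) lineage graph must be stored reliably, fault tolerance comes without replicating the data itself.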
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver builds the base RDD (lines) from HDFS blocks 1-3, derives transformed RDDs (errors, messages), and caches the result in memory on three workers (Cache 1-3); each parallel operation sends tasks to the workers and collects results from the cached RDD]
Example: Logistic Regression
Goal: find the best line separating two sets of points.
[Diagram: a plane of + and – points, a random initial line, and the target separating line]
Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p => {
    val scale = (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
    scale * p.x
  }).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
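The same loop can be written as a self-contained single-machine sketch in plain Python (no Spark; an illustration of the gradient update the Scala code above performs over the cached dataset):

```python
import math

# Single-machine sketch of the logistic-regression loop above (no Spark).
# Each point is (x, y) with feature vector x and label y in {-1, +1}.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train(points, w, iterations):
    for _ in range(iterations):
        # Sum per-point gradients, mirroring map(...).reduce(_ + _).
        gradient = [0.0] * len(w)
        for x, y in points:
            scale = (1 / (1 + math.exp(-y * dot(w, x))) - 1) * y
            for j in range(len(w)):
                gradient[j] += scale * x[j]
        w = [wj - gj for wj, gj in zip(w, gradient)]   # w -= gradient
    return w

# Toy 1-D data: the label is the sign of the single feature.
points = [([1.0], 1.0), ([2.0], 1.0), ([-1.0], -1.0), ([-2.0], -1.0)]
w = train(points, [0.0], iterations=10)
# The learned weight separates the toy points: sign(w . x) == y.
assert all((dot(w, x) > 0) == (y > 0) for x, y in points)
```

Note that only `data` is read each iteration; in Spark it stays cached in memory, which is what makes the repeated passes cheap.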
Logistic Regression Performance
[Plot: running time (s) vs. number of iterations (1 to 30) for Hadoop and Spark. Hadoop: 127 s / iteration. Spark: first iteration 174 s, further iterations 6 s]
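At those per-iteration costs the gap grows with iteration count; a quick back-of-the-envelope check using only the figures quoted above:

```python
# Totals implied by the quoted per-iteration costs, at 30 iterations.
hadoop_total = 30 * 127          # 127 s for every iteration
spark_total = 174 + 29 * 6       # 174 s first, 6 s for each of the rest
speedup = hadoop_total / spark_total
```

That is 3810 s vs. 348 s, roughly an 11x end-to-end difference at 30 iterations, driven entirely by the cached dataset after the first pass.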
Spark Demo
Conclusion
Mesos provides a stable platform to multiplex resources among diverse cluster applications.
Spark is a new cluster programming framework for iterative & interactive jobs, enabled by Mesos.
Both are open source (but in very early alpha!)
http://github.com/mesos
http://mesos.berkeley.edu