© 2016 Mesosphere, Inc. All Rights Reserved.
BIG DATA ON DC/OS
Susan X. Huynh, Women in Big Data Meetup, Apr. 2017
ANNOUNCEMENTS
• https://www.womeninbigdata.org/ - strengthening diversity in the big data field
• Member appreciation night @Intel on April 6th
• https://mesosphere.com/careers/
OUTLINE
DC/OS for Big Data Ops - 10 min.
Demo: Deploy a Data Pipeline on a DC/OS Cluster - 20 min.
AUDIENCE
BIG DATA: DATA PIPELINES
• A typical data pipeline: data source → message bus → analytics engine → storage
• Example components: web service / logs / events (data sources), Kafka (message bus), Spark (analytics engine), HDFS / Cassandra database / file system (storage)
DC/OS FOR BIG DATA OPS
• A typical project lifecycle of a data pipeline
1. Development: on one machine - e.g., laptop
2. Deploy to production: cluster of machines (this is where DC/OS comes in)
3. Ops in production: upgrade Spark version
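Step 3 is where a package manager helps. A sketch of what a Spark upgrade could look like with the DC/OS CLI (the version string is illustrative, and the exact flow depends on the package and CLI version in use):

```shell
# List the versions of the Spark package available in the catalog
dcos package describe spark --package-versions

# Reinstall the Spark framework pinned to a newer version
# (illustrative version string; running jobs are unaffected by the
# framework reinstall, but check the package's upgrade notes)
dcos package uninstall spark
dcos package install spark --package-version=1.0.9-2.1.0-1 --yes
```
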
PRODUCTION & OPS FOR BIG DATA
• Deployment
• Efficient resource management
• Scaling: adding more nodes running a service, e.g., Kafka or Spark
• Fault tolerance: restarting failed tasks
• Maintenance
• Backups & restore, e.g., for Cassandra
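As a concrete example of the scaling point above: on DC/OS, growing a Kafka service is typically a configuration change rather than hands-on node setup. A hedged sketch — the options-file field names shown are illustrative and vary by package version:

```shell
# options.json: ask the Kafka package for more brokers
# (field names are illustrative; check the package's config schema)
cat > options.json <<'EOF'
{
  "brokers": {
    "count": 5
  }
}
EOF

# Install (or reconfigure) the Kafka package with these options;
# the scheduler places the extra brokers on available nodes
dcos package install kafka --options=options.json --yes
```
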
EXAMPLE: HDFS DEPLOYMENT
http://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-common/ClusterSetup.html
Hadoop Startup
To start a Hadoop cluster you will need to start both the HDFS and YARN cluster.
Format a new distributed filesystem:
$ $HADOOP_PREFIX/bin/hdfs namenode -format <cluster_name>
Start the HDFS with the following command, run on the designated NameNode:
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
Run a script to start DataNodes on all slaves:
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
Start the YARN with the following command, run on the designated ResourceManager:
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
Run a script to start NodeManagers on all slaves:
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager
Start a standalone WebAppProxy server. If multiple servers are used with load balancing it should be run on each of them:
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh start proxyserver --config $HADOOP_CONF_DIR
Start the MapReduce JobHistory Server with the following command, run on the designated server:
$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR
EXAMPLE: HDFS DEPLOYMENT
(Illustrated on the next slides …) With DC/OS, the HDFS package scheduler:
1. Allocates resources (nodes, cpus, mem, disk) on the cluster
2. Copies HDFS binaries & config files to each node
3. Starts HDFS Journal Node processes
4. Starts HDFS Name Node processes
1. Format NameNode 0
2. Bootstrap NameNode 1
5. Starts HDFS ZKFC processes (colocated with NN)
6. Starts HDFS Data Node processes
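The contrast with the manual Hadoop instructions is the point: on DC/OS, all of the steps above are driven automatically by the HDFS package. A sketch — the service name and the plan subcommand are assumptions based on the 2017-era DC/OS HDFS package and may differ by version:

```shell
# One command kicks off the entire deployment plan:
# resource allocation, binary/config distribution, and process startup
dcos package install hdfs --yes

# Follow the plan as journal, name, ZKFC, and data nodes come up
# (subcommand shape assumed; older CLIs used `dcos hdfs plan show`)
dcos hdfs plan status deploy
```
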
EXAMPLE: HDFS DEPLOYMENT: BEFORE
• Cluster: 6 nodes (node 1 – node 6)
• Planned HDFS processes and per-task resources:
• 3 Journal Nodes: 1.0 CPU, 2 GB mem, 5 GB disk each
• 2 Name Nodes (Name Node 1, Name Node 2): 1.0 CPU, 2 GB mem, 5 GB disk each
• 2 ZKFC processes: 1.0 CPU, 1 GB mem each, colocated with the Name Nodes
• 3 Data Nodes: 1.0 CPU, 2 GB mem, 5 GB disk each
EXAMPLE: HDFS DEPLOYMENT: AFTER
• The same 6-node cluster, now running all HDFS processes: 3 Journal Nodes, 2 Name Nodes (Name Node 1, Name Node 2), 2 ZKFC processes (colocated with the Name Nodes), and 3 Data Nodes, spread across nodes 1–6
DEMO: RUN DATA ANALYTICS PIPELINE ON 6-NODE CLUSTER
DEMO: CLUSTER SETUP
• Create a 6-node DC/OS Cluster in AWS
1. https://dcos.io/install/
2. Specify key pair
3. Takes about 10 min. to spin up
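Once the cluster is up, the DC/OS CLI can be pointed at it; the master URL below is a placeholder, and on CLI versions from this era the equivalent was `dcos config set core.dcos_url <url>`:

```shell
# Point the CLI at the new cluster's master (placeholder URL)
dcos cluster setup https://<master-public-ip>

# Sanity check: list the agent nodes (expect 6 for this demo)
dcos node
```
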
DEMO: TWEETER DATA PIPELINE
• Tweeter (web service) → Kafka (message bus) → Spark + Zeppelin (analytics engine) → Cassandra (database / storage)
• Marathon-LB (load balancer) routes requests to the Tweeter instances
• Post Tweets (tweet bot) drives traffic into the web service
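The pipeline's backing services are all installed from the DC/OS package catalog; based on the Tweeter demo's README, the setup looks roughly like this (package names as of 2017; real runs typically pass `--options` files rather than defaults):

```shell
# Install the data services the Tweeter pipeline depends on
dcos package install cassandra --yes
dcos package install kafka --yes
dcos package install marathon-lb --yes
dcos package install zeppelin --yes
```
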
DEMO: ANALYTICS IN ZEPPELIN / SPARK
RECAP
• DC/OS simplifies production & ops for big data
• Demo: deploy a full data pipeline on a 6-node cluster
• Please do try this at home!
REFERENCES
• Creating a DC/OS cluster: dcos.io
• Tweeter: https://github.com/mesosphere/tweeter
THANK YOU!