Big Data on DC/OS

© 2016 Mesosphere, Inc. All Rights Reserved.

BIG DATA ON DC/OS


Susan X. Huynh, Women in Big Data Meetup, Apr. 2017


ANNOUNCEMENTS


• https://www.womeninbigdata.org/ - strengthening diversity in the big data field

• Member appreciation night @Intel on April 6th

• https://mesosphere.com/careers/



OUTLINE


DC/OS for Big Data Ops - 10 min.

Demo: Deploy a Data Pipeline on a DC/OS Cluster - 20 min.


AUDIENCE



BIG DATA: DATA PIPELINES


• A typical data pipeline: data source → message bus → analytics engine → storage

• data source: web service, logs, events

• message bus: Kafka

• analytics engine: Spark

• storage: HDFS file system, Cassandra database
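
As a rough illustration of the four stages above, here is a minimal in-memory sketch in Python. Everything in it is a stand-in assumed for illustration: a `queue.Queue` plays the role of the message bus (Kafka), a word count plays the analytics engine (Spark), and a dict plays storage (Cassandra or HDFS). It uses none of those systems' real APIs, and the function names are hypothetical.

```python
from collections import Counter
from queue import Queue


def data_source():
    """Hypothetical data source: yields a few log-like events."""
    yield from ["user login", "user logout", "user login"]


def publish(bus, events):
    """Message bus stage: a Queue stands in for Kafka."""
    for event in events:
        bus.put(event)


def analyze(bus):
    """Analytics stage: drain the bus and count words (stand-in for a Spark job)."""
    counts = Counter()
    while not bus.empty():
        counts.update(bus.get().split())
    return counts


def run_pipeline():
    """Wire the stages together: source -> bus -> analytics -> storage."""
    bus = Queue()
    storage = {}  # stand-in for Cassandra or HDFS
    publish(bus, data_source())
    storage["word_counts"] = dict(analyze(bus))
    return storage


result = run_pipeline()
print(result)
```

In a real deployment each stage runs as its own distributed service, which is where the deployment and ops work discussed in the following slides comes from.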


DC/OS FOR BIG DATA OPS


• A typical project lifecycle of a data pipeline

1. Development: on one machine - e.g., laptop

2. Deploy to production: on a cluster of machines (DC/OS helps here)

3. Ops in production: e.g., upgrading the Spark version (DC/OS helps here)


PRODUCTION & OPS FOR BIG DATA


• Deployment

• Efficient resource management

• Scaling: adding more nodes running, e.g., Kafka or Spark

• Fault tolerance: restarting failed tasks

• Maintenance

• Backup & restore, e.g., for Cassandra


EXAMPLE: HDFS DEPLOYMENT


http://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-common/ClusterSetup.html

Hadoop Startup

To start a Hadoop cluster you will need to start both the HDFS and YARN cluster.

Format a new distributed filesystem:

$ $HADOOP_PREFIX/bin/hdfs namenode -format <cluster_name>

Start the HDFS with the following command, run on the designated NameNode:

$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode

Run a script to start DataNodes on all slaves:

$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode

Start the YARN with the following command, run on the designated ResourceManager:

$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

Run a script to start NodeManagers on all slaves:

$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager

Start a standalone WebAppProxy server. If multiple servers are used with load balancing it should be run on each of them:

$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh start proxyserver --config $HADOOP_CONF_DIR

Start the MapReduce JobHistory Server with the following command, run on the designated server:

$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR

DC/OS Manual


EXAMPLE: HDFS DEPLOYMENT


(Illustrated on next slide …)

1. Allocates resources (nodes, cpus, mem, disk) on the cluster

2. Copies HDFS binaries & config files to each node

3. Starts HDFS Journal Node processes

4. Starts HDFS Name Node processes

    1. Format NameNode 0

    2. Bootstrap NameNode 1

5. Starts HDFS ZKFC processes (colocated with NN)

6. Starts HDFS Data Node processes
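
The ordering of the steps above can be sketched as a small sequential plan. This is only an illustration of "run each step in order and stop at the first failure", not how DC/OS implements it; the `Step` class, `run_plan`, and the `execute` callback are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    substeps: list = field(default_factory=list)


# Mirrors the numbered list on the slide; the wording is paraphrased.
HDFS_PLAN = [
    Step("allocate resources (nodes, cpus, mem, disk)"),
    Step("copy HDFS binaries & config files to each node"),
    Step("start Journal Node processes"),
    Step("start Name Node processes",
         ["format NameNode 0", "bootstrap NameNode 1"]),
    Step("start ZKFC processes (colocated with NN)"),
    Step("start Data Node processes"),
]


def run_plan(plan, execute):
    """Run actions strictly in order and stop at the first failure.

    `execute` is a caller-supplied callable that returns True on success.
    Returns (completed_actions, failed_action_or_None).
    """
    completed = []
    for step in plan:
        # A step with substeps is carried out as its substeps, in order.
        for action in step.substeps or [step.name]:
            if not execute(action):
                return completed, action
            completed.append(action)
    return completed, None


# Dry run in which every action succeeds.
done, failed = run_plan(HDFS_PLAN, execute=lambda action: True)
print(len(done), failed)
```

Passing a different `execute` callback (for example, one that fails on "bootstrap NameNode 1") shows that later steps such as starting the Data Nodes are never attempted once an earlier step fails.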


EXAMPLE: HDFS DEPLOYMENT: BEFORE


Cluster: node 1, node 2, node 3, node 4, node 5, node 6

Tasks to place:

• Journal Node (×3): 1.0 cpu, 2 GB mem, 5 GB disk

• Name Node 1, Name Node 2: 1.0 cpu, 2 GB mem, 5 GB disk

• Data Node (×3): 1.0 cpu, 2 GB mem, 5 GB disk

• ZKFC Node (×2): 1.0 cpu, 1 GB mem, colocated w/ NN
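
To make the resource figures on this slide concrete, the sketch below adds up the total demand of the HDFS tasks. It assumes the listed figures are per task and that the ZKFC tasks need no disk (none is listed). The per-node capacity is hypothetical, since the slides do not state node sizes, and the fit check is deliberately coarse.

```python
# Per-task requirements as read from the slide (assumed to be per task).
REQUIREMENTS = {
    "journal": {"count": 3, "cpu": 1.0, "mem_gb": 2, "disk_gb": 5},
    "name":    {"count": 2, "cpu": 1.0, "mem_gb": 2, "disk_gb": 5},
    "data":    {"count": 3, "cpu": 1.0, "mem_gb": 2, "disk_gb": 5},
    "zkfc":    {"count": 2, "cpu": 1.0, "mem_gb": 1, "disk_gb": 0},
}


def total_demand(requirements):
    """Sum count * per-task requirement for each resource type."""
    totals = {"cpu": 0.0, "mem_gb": 0, "disk_gb": 0}
    for spec in requirements.values():
        for resource in totals:
            totals[resource] += spec["count"] * spec[resource]
    return totals


def fits(totals, node_count, per_node):
    """Coarse check on aggregate capacity only; ignores per-node packing
    and placement constraints such as colocating ZKFC with a Name Node."""
    return all(totals[r] <= node_count * per_node[r] for r in totals)


# Hypothetical per-node capacity; not taken from the slides.
NODE_CAPACITY = {"cpu": 4.0, "mem_gb": 16, "disk_gb": 100}

totals = total_demand(REQUIREMENTS)
cluster_ok = fits(totals, node_count=6, per_node=NODE_CAPACITY)
print(totals, cluster_ok)
```

Under these assumptions the ten tasks need 10 cpus, 18 GB of memory, and 40 GB of disk in total.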


EXAMPLE: HDFS DEPLOYMENT: AFTER


Cluster: node 1, node 2, node 3, node 4, node 5, node 6

Tasks running on the cluster:

• Journal Node (×3)

• Name Node 1, Name Node 2

• Data Node (×3)

• ZKFC Node (×2)


DEMO: RUN DATA ANALYTICS PIPELINE ON 6-NODE CLUSTER



DEMO: CLUSTER SETUP


• Create a 6-node DC/OS Cluster in AWS

1. https://dcos.io/install/

2. Specify key pair

3. Takes about 10 min. to spin up



DEMO: TWEETER DATA PIPELINE


• tweet bot: PostTweets

• load balancer: MarathonLB

• data source: Tweeter web service (×3)

• message bus: Kafka

• analytics engine: Zeppelin, Spark

• storage: Cassandra database


DEMO: ANALYTICS IN ZEPPELIN / SPARK



RECAP


• DC/OS simplifies production & ops for big data

• Demo: deploy a full data pipeline on a 6-node cluster

• Please do try this at home!


REFERENCES


• Creating a DC/OS cluster: dcos.io

• Tweeter: https://github.com/mesosphere/tweeter



THANK YOU!

