Table of Contents

Lab Overview
    HOL-SDC-1309 - vSphere Big Data Extensions Lab Modules
    Verify Hadoop Clusters Have Started
Module 1 - Hadoop POC In Under An Hour
    Module Overview
    Manage Hadoop Pooled Resources
    Create Basic Hadoop Cluster via Web Client
    Create Hadoop Cluster with Serengeti CLI
    Add Data and Run a MapReduce Job
    Scale Out Hadoop Cluster via the Web UI
    Scale Out Cluster via Serengeti CLI
Module 2 - Fast and Easy Deployment of Hadoop Clusters
    Module Overview
    Configure and Deploy Hadoop Clusters
    Resize Hadoop Cluster After Creation
    Export Configuration and Create Customized Cluster
Module 3 - Compute Only Clusters on Shared HDFS
    Module Overview
    Create Compute Only Cluster
    Hadoop Filesystem Commands Within CLI
Module 4 - Highly Available Hadoop
    Module Overview
    How to Create Hadoop Cluster with HA Enabled
    Kill the NameNode and Verify HA Restart
Module 5 - Fast and Easy Deployment of HBase Clusters
    Module Overview
    Configure and Deploy HBase Clusters
    Manage Hadoop Pooled Resources
Module 6 - Elastic Hadoop
    Module Overview
    Manage Existing Tier1 and Tier2 Clusters
    Manual Hadoop Elasticity
    Automatic Hadoop Elasticity
HOL-SDC-1309
Page 1VMware Beta Program CONFIDENTIAL
Lab Overview
HOL-SDC-1309 - vSphere Big Data Extensions Lab Modules

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers, designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop is being used by enterprises across verticals for Big Data analytics to help make better business decisions based on large data sets.
VMware enables you to easily and efficiently deploy and use Hadoop on your existing virtual infrastructure through vSphere Big Data Extensions (BDE). BDE makes Hadoop virtualization-aware, improves performance in virtual environments, and enables deployment of highly available Hadoop clusters in minutes. vSphere BDE automates deployment of a Hadoop cluster, and thus provides better Hadoop manageability and usability.
In this lab you will execute 15-minute Lightning labs to configure and deploy Hadoop, as well as HBase, clusters on local storage in minutes. You will also create compute-only clusters that allow the use of shared storage across multiple MapReduce clusters, providing multi-tenancy and enabling easy scale-in or scale-out of compute resources. You will also add vSphere High Availability (HA) to improve the resiliency of your Hadoop clusters.
There is a full-length lab to simulate a complete Hadoop proof of concept. In the POC module, you will configure and deploy your cluster, add data to HDFS, and run MapReduce jobs against your deployed cluster.
In the final module, you will configure manual and automatic scaling of your Hadoop clusters. You will use resource pools with differing priorities and run MapReduce jobs to see how vSphere will scale cluster nodes in or out based on your priorities and the resource demands placed on the system.
Note: Some of the lab modules contain lengthy command lines that must be typed into the PuTTY session. To ease this process, there is a Commands.txt file on the desktop in the "Lab Files" folder. You can copy the relevant commands from this file and paste them into PuTTY if you don't want to type them manually.
The modules and timing are as follows:
Hadoop POC In Under an Hour
• Add resources
• Create cluster Hadoop/HBase
• Put data into HDFS
• Execute MR/HBase jobs
• Visualize results (Partner product)
Fast and Easy Deployment of Hadoop Cluster 15 Minutes
• Create and resize standard Hadoop clusters with multiple distros and configs
• Modify Hadoop configuration after creation (e.g. change scheduler from FIFO to Fair)
• Manage resources (add/delete Networks, Resource Pools, Datastores)
Create Compute only clusters on shared HDFS 15 Minutes
• Deploy with HVE to enable locality
• Show node placement policy controls in Serengeti
Highly Available Hadoop 15 Minutes
• Deploy master nodes on shared storage with HA enabled
• Kill the NameNode process and see the node automatically restart
Create HBase Cluster 15 Minutes
• Create and resize HBase cluster
• Manage resources (add/delete Networks, Resource Pools, Datastores)
Simulate Elasticity POC 45 Minutes
• Create Tier 1 and Tier 2 clusters
• Execute MR jobs on both clusters
• Show manual elasticity
• Show automated elasticity
Lab Captains: Michael West, Andy Hill, Robert Jensen
Verify Hadoop Clusters Have Started

As part of the deployment of your lab, the Hadoop clusters that were created for you have been automatically started. It is possible that the clusters you need did not start successfully. Before starting any module, perform the following steps to verify that your clusters have started.
Four clusters have been pre-created for you. The small_Hbase cluster is only needed if you are going to do the Fast and Easy Deployment of HBase Clusters module. If not, do not start this cluster, as it takes significant system resources.
Connect to Serengeti CLI
From the Windows Desktop perform the following steps:
1) Click on the PuTTY icon
2) Select the SerengetiCLI session
3) Click the Open button
4) At the OS login prompt, enter the password. It is password
Connect to Serengeti CLI to Verify Running Clusters
1. To open the CLI, type serengeti. Note that CLI commands are case sensitive.
2. Type connect --host localhost:8443 to connect to the management server. The username is administrator@corp and the password is VMware1!
You are now in a command line environment that can interact directly with your Hadoop clusters.
Listing Hadoop Cluster Details
1. To see your clusters, type cluster list (note that the up arrow will let you see your command history).
2. The small_cluster, Tier1 and Tier2 clusters must have a STATUS of RUNNING. If the status is STOPPED or ERROR, you will need to start the cluster again.
3. The HBase cluster only needs to be running if you are going to do the HBase Deployment module.
4. Note: Clusters take several minutes to start, so you don't want to start a cluster you are not going to use.
Start a Hadoop Cluster
1) Type cluster start --name "cluster name". Replace "cluster name" with the name of the cluster that needs to be started.
Note: You do not need to wait for the clusters to start, since the first few steps in each module do not depend on the clusters running. Feel free to continue, and check back on the status of the start command.
Module 1 - Hadoop POC In Under An Hour
Module Overview

Hadoop clusters typically require specialized expertise and dedicated hardware infrastructure to deploy. In this module we will explore the benefits of running Hadoop on VMware vSphere. By virtualizing Hadoop clusters, you are able to deploy multiple VMs per host, which also allows you to separate data from compute. By doing this, you can seamlessly scale the compute layer within your Hadoop cluster, while keeping the data separate. Other benefits of running Hadoop on vSphere include:
• Run multiple compute workloads on the same physical hardware, optimizing resource utilization
• Eliminate the need for dedicated hardware to run Hadoop workloads
• Inherit better reliability and flexibility due to the High Availability (HA), vMotion, and DRS features of the vSphere platform
In this module, we will simulate a rapid proof of concept using vSphere Big Data Extensions. We will explore the following key concepts:
• Mapping vSphere resources to Big Data Extensions resources for consumption by Hadoop
• Quickly create multiple types of Hadoop clusters
• Load data and run MapReduce jobs
• Run a Pig script via the Serengeti CLI
• Simple scale-out of a Hadoop compute node on vSphere
Note: If you have not already done so, you MUST run the "Verify Hadoop Clusters Have Started" step under the Lab Overview section prior to doing this module.
Let's get started!
Manage Hadoop Pooled Resources

Hadoop makes excellent use of the system resources that are made available to it. In an environment with shared physical resources that have been virtualized, it is important to appropriately assign the resources that can be used by your Hadoop clusters. vSphere allows you to specifically make available CPU, RAM, storage and virtual networks to your Hadoop clusters. In this module, you will use the vSphere Big Data Extensions plugin to add network and storage resources to the Hadoop clusters.
Login to vSphere Web Client
Open Firefox and log in to the vSphere Web Client by checking the "Use Windows session authentication" checkbox and clicking the Login button. The username is corp\administrator and the password is VMware1! (Note: the ! is part of the password.)
Explore the vSphere Environment
In the vSphere Web Client, click the "Hosts and Clusters" icon.
Hosts and Clusters View
First, take a look at the resource pools that are configured in this vSphere environment. vSphere Big Data Extensions will leverage these resource pools to ensure our Hadoop clusters have the resources they need based upon business need, while also ensuring they do not overconsume resources and impact other applications.
Datastore View
Next, click over to the Datastores tab, just to get a sense of the datastores and networks that are configured in this environment.

Notice that there is an NFS volume configured, and there are also local VMFS volumes configured on each ESXi host. In the next steps, we'll configure our Hadoop clusters to use both shared and local storage, which is a key benefit of using vSphere Big Data Extensions.
Navigate to Big Data Extensions Plugin
To get to the Big Data Extensions plugin, first click the "Home" icon, then choose "Big Data Extensions" from the sidebar menu.
Explore BDE Plugin
First, let's take a look at the Hadoop clusters that are already configured in this environment. Click on the "Big Data Clusters" item in the sidebar menu.
View Hadoop Clusters
Notice that there are four Hadoop clusters configured in this vSphere environment. The columnar view on the right indicates each cluster's name, status, which distribution it is running, which vSphere resource pool it belongs to, and the list of nodes. As we saw in the last lesson, resource pools are an important way to manage how Hadoop consumes the underlying physical resources.

This is an important differentiator over using dedicated physical hardware for Hadoop, where resources may be wasted when Hadoop jobs are not running. vSphere allows you to run a mix of workloads, while also guaranteeing resources based upon business needs.
View Cluster Actions
Right-click on one of the clusters in the right-hand pane, and note all the actions that can be taken on a cluster from within the vSphere Web Client. We will investigate these further in a future lesson.

Now click back to the Big Data Extensions main menu by clicking the button indicated in step 2 in the screenshot above.
Click Resources
Click the "Resources" item under Inventory lists, highlighted above.
Map vCenter Resources to BDE Inventory Items
This screen is where we map vSphere datastores into constructs that Big Data Extensions will allocate to Hadoop clusters. Notice that several mappings have already been made.

Big Data Extensions can consume both shared and local storage as appropriate for the specific need. In this screen, we can see that there is a "defaultDSShared" item that is mapped to the ds-site-a-nfs01 vSphere datastore. There is also a "dsLOCAL" item that is mapped to any vSphere datastore that is local to a host and begins with the name "esx". Wildcards allow multiple datastores to be easily managed and consumed by our Hadoop clusters.
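To see how such a wildcard pattern picks up multiple datastores, here is a small illustrative sketch. The local datastore names are hypothetical, and Python's fnmatch is used only to mimic the glob-style match; it is not how BDE implements it internally:

```python
from fnmatch import fnmatch

# Hypothetical inventory: two host-local VMFS volumes plus the shared NFS volume.
datastores = ["esx-01a-local", "esx-02a-local", "ds-site-a-nfs01"]

# A pattern like "esx*" matches every datastore whose name begins with "esx",
# so all per-host local volumes are covered by a single mapping.
local = [d for d in datastores if fnmatch(d, "esx*")]
print(local)  # ['esx-01a-local', 'esx-02a-local']
```

The benefit is that when a new host (with a new local datastore matching the pattern) joins the cluster, no new mapping is needed.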
Go ahead and click the plus sign to walk through creating a new datastore mapping.
In the Add Datastore popup, you would enter any name that you choose, the actual vSphere datastore name, and then indicate whether the datastore is shared or local. Since we already have all the mappings we need, click Cancel when you are done.
Create Basic Hadoop Cluster via Web Client

In this lesson, we will create a Hadoop cluster via the vSphere Web Client.
Navigate to Big Data Extensions Plugin
Click on "Big Data Extensions" in the side bar.
Simulate Creating a Basic Hadoop Cluster
A Basic Hadoop cluster mimics the standard deployment you see in physical Hadoop clusters. The DataNode and TaskTracker reside within a single VM. In other lessons you will see that it is advantageous to separate these services into their own VMs.
Click "Big Data Clusters"
Click the New Cluster button
Click on the icon indicated above, which is the "new cluster" button.
Name and type
You will choose your preferred Hadoop distribution. Supported distros include Cloudera, MapR, Hortonworks, and Pivotal HD. We will use the open-source Apache distribution in this module.
There are several deployment types for your clusters. You can mimic the typical physical Hadoop deployment with the Basic Hadoop Cluster type, which separates the NameNode and JobTracker into their own virtual machines while each TaskTracker and DataNode combination shares a single virtual machine. You also have the option of separating the compute (TaskTracker) from the DataNode using the Data/Compute Separation Hadoop option. This facilitates the elastic scaling of compute that you can see in Module 6.
For this Module, select the following options:
Big Data Cluster Name : Basic Hadoop
Hadoop distribution: apache
Deployment Type: Basic Hadoop Cluster
Select the custom template
Each distinct Hadoop node configuration is called a NodeGroup. You will see specific NodeGroups based on the Deployment Type you selected, but you can also use the Command Line Interface, or the Customize Deployment Type in the UI, to define any type of NodeGroup you want. In this section, you are sizing the virtual machine CPU, RAM and data storage for each NodeGroup. You will also define the number of instances of a specific NodeGroup to deploy. In the image above, you are going to deploy 3 Worker nodes (each containing a TaskTracker and DataNode), 1 ComputeMaster (JobTracker) and 1 DataMaster (NameNode).
Click the Resource template button, and select Customize
Customize the template
Note that you can select shared or local storage. Typically, Hadoop has been deployed with local storage to provide the data locality that is central to its performance. You can see that each NodeGroup can be configured with its own datastore type. This means that, for instance, your DataNodes can run on local storage, while you have the JobTracker and NameNode on shared storage. This allows the use of vSphere HA or FT to improve the availability of those nodes while still ensuring data locality.
Change the defaults to:
vCpu number : 1
Memory size : 3748
Storage Size : 10
Datastore type : Shared
Click OK.
Select the Resources for the Cluster
Make sure to select the Customize option and size each NodeGroup's resources as in the previous step.
Set the number of nodes for each worker to 1.
Network and resource pool
Leave the Hadoop topology and Network settings at their default values.
Click the select button to select a resource pool.
Select Resource Pools
Select one of the resource pools listed above, and click OK.
Cancel Creation
Depending on the size of the cluster, it takes anywhere from 6 to 20 minutes to deploy and be running. Due to resource and time constraints for the lab, we will not actually create the cluster.

Click Cancel to cancel the deployment, and watch the video below to see a deployment of a Hadoop cluster.
Video
Create Hadoop Cluster with Serengeti CLI

In the last lesson, we used the vSphere Web Client to walk through creating a new Hadoop cluster. We will now run through the same process using the Serengeti CLI. The CLI allows you to have finer-grained control over cluster creation, including the ability to specify which roles run on which nodes in the cluster.
Use Putty to SSH to management-server
Click the PuTTY icon on the desktop, choose the SerengetiCLI item, and click Open.

Log in as root with a password of 'password'.
Connect to the Serengeti CLI
1. Type serengeti
2. Type connect --host localhost:8443 to connect to the management server. The username is administrator@corp and the password is VMware1!
Explore the Serengeti CLI
Try out the following commands in the CLI to get an idea of how the environment is configured:

• cluster list - lists all the Hadoop clusters and some of their configuration
• resourcepool list - lists vSphere resource pools
• datastore list - lists the Serengeti datastores
• network list - lists the network mappings
Create a Hadoop Cluster via the CLI
Now we will walk through how to create a Hadoop cluster via the CLI. This process is similar to using the vSphere Web Client, but there are more options available.
View Specfile
Hadoop cluster configuration can be controlled via spec files. Let's take a look inside one of the spec files before we create the cluster.
From the Serengeti management Server:
1. cd /opt/serengeti/conf
2. more small_cluster.json
Using a JSON file via the CLI allows more control over the configuration of the cluster, including role placement across nodes in the cluster.
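As a sketch of what such a spec file can contain, the fragment below shows the general shape of a NodeGroup definition. The field names follow the Serengeti spec-file convention, but the values are illustrative; the actual contents of small_cluster.json in your lab may differ:

```json
{
  "nodeGroups": [
    {
      "name": "master",
      "roles": ["hadoop_namenode", "hadoop_jobtracker"],
      "instanceNum": 1,
      "cpuNum": 1,
      "memCapacityMB": 3748,
      "storage": { "type": "SHARED", "sizeGB": 10 }
    },
    {
      "name": "worker",
      "roles": ["hadoop_datanode", "hadoop_tasktracker"],
      "instanceNum": 3,
      "cpuNum": 1,
      "memCapacityMB": 3748,
      "storage": { "type": "LOCAL", "sizeGB": 10 }
    }
  ]
}
```

Note how the roles list is what lets the CLI place specific Hadoop services on specific NodeGroups, which is the finer-grained control the Web Client wizard does not expose.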
Create the Cluster - Video
This video shows the process to create a compute-only cluster using an existing HDFS filesystem. We won't actually create another cluster in this lab due to time constraints, but here is the command to use in the CLI along with the JSON file:
cluster create --name SharedHDFSTest --specfile SharedHDFS.json
Add Data and Run a MapReduce Job

In this section we will:

• Use the HDFS put command from the CLI to add files to the Hadoop filesystem
• Run a MapReduce job in an existing Hadoop cluster
• Run a Pig script in an existing Hadoop cluster
• From the vSphere Web Client, use the Hadoop management pages to view job status and the results file
Use Putty to SSH to management-server
You should still be connected to the Serengeti CLI, but if not, re-connect as follows:
Click the PuTTY icon, choose the SerengetiCLI session, and click the Open button.

Log in as serengeti with a password of 'password'.
Connect to the Serengeti CLI
1. Type serengeti
2. Type connect --host localhost:8443 to connect to the management server. The username is administrator@corp and the password is VMware1!
Select the small_cluster as our target
To choose the small_cluster as the target we will be working with, enter the following command into the CLI:
cluster target --name small_cluster
Put data into HDFS
As a simple example of a MapReduce job, we will do a word count on the Serengeti User's Guide. We first need to upload a text version of the document into the HDFS filesystem:
fs put --from /home/serengeti/serengeti.txt --to /tmp/input/serengetitest
Open the MapReduce Status Page
Back in the vSphere Web Client, open the MapReduce status page by right-clicking on the small_cluster line and choosing "Open MapReduce Status Page" from the context menu.

Once this page opens, you can return to the Serengeti CLI window. We will come back to this status page after we execute the MapReduce job.
Run MapReduce
To run our MapReduce job, enter the following command in the CLI:
mr jar --jarfile /opt/serengeti/cli/lib/hadoop-examples-1.2.1.jar --mainclass org.apache.hadoop.examples.WordCount --args "/tmp/input /tmp/output"

This command executes the WordCount MapReduce job that is included in the hadoop-examples jar file that comes with Serengeti. This class reads the input from the /tmp/input directory, executes the MR job, and stores the results in the /tmp/output directory.
View Map Reduce Status Page
Go back to your web browser and scroll down to view the MapReduce results.

Click refresh in the browser address bar.

Scroll down and look at the running and completed jobs sections. The job we submitted does not take long to run, so it may already be completed by the time you view the page.

Once the job completes (you may have to refresh the page a couple of times), click on the hyperlinked Jobid to view some details about the job.
View MR Job Details
Feel free to explore this page and look at the statistics for the job we submitted.
When we executed the MapReduce job to do a word count on the Serengeti User's Guide, here is what happened, at a simplified level:
1. Map Step: The master node takes the input data, divides it into smaller units of work, and distributes these to the worker nodes, which further subdivide them. In the WordCount example, each line in the file is broken into words, and the map function outputs key/value pairs containing the word and the number of occurrences in that line.
2. Reduce Step: The master node collects all the results back from the worker nodes, sums the values for each word (key), and outputs a single key/value pair with the word and its sum.
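The two steps above can be sketched locally in a few lines of Python. This is only an illustration of the WordCount map/reduce logic on a tiny in-memory input, not the distributed Hadoop implementation:

```python
from collections import defaultdict

def map_phase(lines):
    # Map step: emit a (word, 1) pair for every word on every line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce step: sum the values for each distinct key (word).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

text = ["the quick brown fox", "the lazy dog"]
result = reduce_phase(map_phase(text))
print(result["the"])  # prints 2: "the" appears once on each line
```

In a real Hadoop run, the map output is partitioned and shuffled across worker nodes before the reduce step, which is what lets the same two functions scale to very large inputs.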
Sort the Results Using Pig
The default results file output by the MapReduce job is sorted alphabetically by word. To make our results file easier to understand, we will run a simple Pig script to sort the file by the number of occurrences of each word, in ascending order.
Back in the Serengeti CLI window, type the following command:
pig script /home/serengeti/sort.pig
Once the command completes as pictured above, move on to the next step.
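The transformation the Pig script performs is conceptually simple: read the (word, count) pairs from the WordCount output and re-emit them ordered by count. Here is a local Python sketch of the same idea, assuming the tab-separated "word<TAB>count" line format that WordCount emits (the sample words are hypothetical):

```python
def sort_by_count(lines):
    # Each WordCount output line is "word\tcount"; parse it, then sort ascending by count.
    pairs = []
    for line in lines:
        word, count = line.rsplit("\t", 1)
        pairs.append((word, int(count)))
    pairs.sort(key=lambda p: p[1])
    return [f"{word}\t{count}" for word, count in pairs]

sample = ["hadoop\t5", "the\t12", "serengeti\t3"]
print(sort_by_count(sample))  # least frequent first: serengeti, hadoop, the
```

The Pig script does this as a distributed job over HDFS, so the sort scales to files far larger than local memory.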
Open HDFS Status Page
Back in the vSphere Web Client, make sure you are on the Hadoop Clusters page.
Right-click on the small_cluster and choose the "Open HDFS Status Page" option.
Browse HDFS Filesystem
Click the "Browse the filesystem" link as shown in the screenshot above.
Navigate HDFS Filesystem
We need to navigate to the directory /tmp/output/wordcount-sorted. You can simply type this directory into the "Goto:" field and click the Go button, or manually click through the directories until you reach that folder.
View the Sorted Results
Now that we are in the right directory, simply click the "part-r-0000" file to view the results that we sorted with the Pig script.
Scroll through the file
To view the complete file, you will need to click the "View Next chunk" link near the top of the page.

The word count results are sorted by the number of occurrences of each word, in ascending order. If you scroll to the bottom of the last chunk, you will see the most common words, such as "the" and "Hadoop."
Scale Out Hadoop Cluster via the Web UI

This lesson will show you how to scale out a Hadoop cluster via the vSphere Web Client. The next lesson will walk through the same procedure using the CLI.

The ability to easily scale your Hadoop clusters up and down is a key benefit of running Hadoop on vSphere. It is very difficult and costly to achieve the same results on dedicated physical hardware.
Navigate to Big Data Extensions Plugin
If you are not already in the Big Data Extensions plugin, navigate back to it by clicking the Home icon, then choosing "Big Data Extensions" from the sidebar menu as shown above.
Click Hadoop Clusters
Click "Big Data Clusters" in the sidebar menu.
Scale Out the small_cluster
Right-Click on small_cluster and choose Scale Out...
Change Instance Number to 2
Change the instance number to 2, and click OK.

Upon execution of this task, Big Data Extensions would clone a new worker node and handle adding it to the Hadoop cluster automatically.

Note that in this lab environment, nothing will actually happen when you click OK, due to resource constraints. Go ahead and click Cancel.
Video of the Resize Process
This video shows you the resize process in action.
Scale Out Cluster via Serengeti CLI

In this lesson, we will scale out a cluster using the CLI.
Open the Command Line Interface (CLI)
Some Hadoop cluster management can be done through the vSphere Web Client, as we saw in the last lesson. We are going to look at changing cluster size through the CLI.
From the Windows Desktop perform the following steps:
1. Click on the PuTTY icon
2. Select the SerengetiCLI session
3. Click the Load button
4. Click the Open button
5. At the OS login prompt, enter the password. It is password
Open the Serengeti CLI
1. To open the CLI, type serengeti. Note that CLI commands are case sensitive.
2. Type connect --host localhost:8443 to connect to the management server. The username is administrator@corp, and the password is VMware1!

You are now in a command line environment that can interact directly with your Hadoop clusters.
Listing Hadoop Cluster Details
1. To see your clusters, type cluster list (note that the up arrow will let you see your command history).
2. Notice that we currently have 4 workers in the Tier1 cluster. In the next step, we will expand this to 5 workers.
View Cluster Resize Help
First, let's look at the help for the cluster resize command.
Type:
help cluster resize
Take a look at the keywords for the command.
Enter the Resize Command
As you can see from the help information, the command we need to enter in order to resize the Tier1 cluster to 5 worker nodes is:
cluster resize --name Tier1 --nodeGroup worker --instanceNum 5
Note: If you plan to take additional modules in this lab, you may not want to actually run this command, since it will take time to complete and may impact the functionality of other lab modules.

If you choose to, you may enter the command, and you will see the screen above that indicates Tier1-worker-4 is cloning. If you have the vSphere Web Client open, you will see the progress of the clone VM task. It may take several minutes for the operation to complete.

Eventually, the task status in the CLI will update to "waiting for ip," then to "VM ready," and through several other steps as the cluster reconfigures itself.

These steps can take some time, and you do not need to wait for them to complete, since this is the last step of this lab module.
Module 2 - Fast and Easy Deployment of Hadoop Clusters
Module Overview

Hadoop clusters typically require specialized expertise and dedicated hardware infrastructure to deploy. In this module you will see how easy it is to configure your Hadoop cluster nodes, size the virtual machines - including CPU, memory and storage - and deploy into your existing vSphere environment. As resource demands change over time - or throughout the day - you can resize the Hadoop cluster to accommodate these changes. Lastly, once a cluster is configured, you will see how to export that configuration and use it to create or update other Hadoop clusters.
Note: You MUST run the "Verify Hadoop Clusters Have Started" step under the Lab Overview section prior to doing this module.
Configure and Deploy Hadoop Clusters

In this module, you will deploy an Apache Hadoop cluster using the vSphere Web Client and vSphere Big Data Extensions.
Navigate to Hosts and Clusters
Click on Hosts and Clusters
Create Resource pool
Resource Pools allow you to limit the amount of CPU and Memory that can be consumed by your Hadoop cluster, and as you will see in Module 6, they are also the mechanism for establishing the priority of one cluster over another in the case of resource contention.
Right-click on the cluster named Cluster Site A, and select New Resource Pool.
Configure resource pool
Name the resource pool MyHadoopCluster.
Leave all settings at the default level and click OK.
Return to homepage
Click the Home button at the top to return to the homepage.
Navigate to Big Data Extensions Plugin
This is a vCenter Plugin providing specific capabilities to configure, deploy, and manage your Big Data environment.
Click on the "Big Data Extensions" tab
Select Hadoop Clusters
Four Hadoop clusters have been created for this lab. If any cluster that you need has not started or has an error status, follow the directions in the "Verify Hadoop Clusters Have Started" step under the Lab Overview section prior to doing this module.
Click on the Hadoop clusters tab.
Create Hadoop Cluster
Click Create New Hadoop Cluster
Name and type
You will choose your preferred Hadoop Distribution. Supported distros include Cloudera, MapR, Hortonworks, and PivotalHD. We will use the open source Apache distribution in this module.
There are several deployment types for your clusters. You can mimic the typical physical Hadoop deployment with the Basic Hadoop Cluster. This type will separate the Namenode and Jobtracker into their own Virtual Machines; however, each Tasktracker and Datanode combination will be in a single Virtual Machine. You also have the option of separating the Compute (Tasktracker) from the Datanode using the Data/Compute Separation Hadoop option. This facilitates the elastic scaling of Compute you can see in Module 6.
For this Module, select the following options:
Hadoop Cluster Name : Basic Hadoop
Hadoop Distro: Apache
Deployment Type: Basic Hadoop Cluster.
Select the custom template
Each distinct Hadoop node configuration is called a NodeGroup. You will see specific NodeGroups based on the Deployment Type you selected, but you can also use the Command Line Interface to define any type of NodeGroup you want. In this section, you are sizing the virtual machine CPU, RAM, and data storage for each NodeGroup. You will also define the number of nodes of a specific NodeGroup to deploy. In the image above, you are going to deploy 3 Worker Nodes, each containing a TaskTracker and DataNode, 1 ComputeMaster (Jobtracker), and 1 DataMaster (NameNode).
Click the Resource template button, and select Customize
Customize the template
Note that you can select Shared or Local storage. Typically, Hadoop has been deployed with local storage to provide the data locality that is central to its performance. You can see that each NodeGroup can be configured with its own Datastore type. This means that, for instance, your DataNodes can run on Local storage, while you have the Jobtracker and Namenode on Shared storage. This allows the use of vSphere HA or FT to improve the availability of those nodes while still ensuring data locality.
Change the defaults to:
vCPU number: 1
Memory size: 3748
Storage size: 10
Datastore type: Shared
Click OK.
Select the Resources for the Cluster
Make sure to select the Customize option and size each NodeGroup's resources as in the previous step.
Set the number of nodes for each worker to 1.
Network and resource pool
Leave the network set to Defaultnetwork.
Click the Select button to choose a resource pool.
Select the proper resource pool
Select the resource pool, MyHadoopCluster, that you created in an earlier step.
Click OK.
Cancel creation
Depending on the size of the cluster, it takes anywhere from 6 to 20 minutes to deploy and be running. Due to resource and time constraints for the lab, we will not actually create the cluster.
Click Cancel to cancel the deployment, then watch the video below to see a deployment of a Hadoop cluster.
Video
Resize Hadoop cluster after creation
As resource demands change over time - or throughout the day - you can resize the Hadoop cluster to accommodate these changes. In this module, you will use the vSphere Big Data Extensions Plugin to resize an existing cluster.
Navigate to Big Data Extensions Plugin
Click on the "Big Data Extensions" tab
Select Hadoop Clusters
Click on the Hadoop clusters tab.
Select the Cluster
You may choose any of the running clusters for the resize process. Because of resource and timing constraints in the lab environment, we will not actually complete the creation of additional nodes.
Right-click the cluster in the Center Panel list of clusters.
Select Scale Out
Scaling out in our environment creates additional nodes for the NodeGroup you select. vSphere will automatically provision the Virtual Machines, install and configure the appropriate Hadoop components for your selected NodeGroup, and start up the services.
Select Scale Out.
Select the NodeGroup to resize
Select the node group you want to resize.
Select the new number of instances.
Click Cancel.
Due to the time it takes to make configuration changes and resource constraints in the lab environment, we will not be making any changes to the cluster.
Watch the video below to see the scale out of a cluster.
Video
Export configuration and create customized cluster
Once a Hadoop cluster is configured, you can export that configuration and use it to create or update the configuration of other Hadoop clusters. In this module, you will export a running configuration and deploy a customized cluster from that configuration.
Connect to the Big Data Extensions Command Line Interface (CLI)
Open Putty from the Windows Desktop, and select the SerengetiCLI server.
Click Open.
Login to the management appliance
vSphere Big Data Extensions takes advantage of the open source Serengeti project started by VMware last year. You are now connecting to the Serengeti Management Appliance.
Login to the management appliance, using
login : serengeti
password : password
Connect to the Big Data Extensions CLI
Once you have logged into the Serengeti Management Appliance, you will start the Command Line Interface (CLI).
At the Linux operating system prompt, type
serengeti
to start the Serengeti CLI.
Connect to Serengeti server
Now that you are in the CLI, you need to connect to the specific Serengeti Server you want to use. (Note: This step may seem unnecessary because you are already logged in to the Serengeti Server; however, the CLI can be run on your client machine as well. In that case, the need to connect to a specific server is obvious.)
Type
connect --host localhost:8443
Username : administrator@corp
password : VMware1!
to connect to the Serengeti server.
List Cluster Information
Locate the running cluster by typing:
cluster list --name small_cluster
Export cluster configuration
To change the cluster's configuration, we must first export it to a configuration file.
Type:
cluster export --name small_cluster --specFile /home/serengeti/small_cluster.json
Configuration file
The cluster configuration file is stored as a JSON file. To see its contents, exit the Serengeti CLI and
type the command "more /home/serengeti/small_cluster.json"
You can edit it with your favorite text editor, and when you are done, just save it. Notice that the configuration includes definitions of the NodeGroups and specific Hadoop configurations.
Due to time constraints for the lab, we won't be editing the file. A sample of the file is provided below.
small_cluster.json

{
  "nodeGroups" : [
    {
      "name" : "master",
      "roles" : [
        "hadoop_namenode",
        "hadoop_jobtracker"
      ],
      "instanceNum" : 1,
      "storage" : {
        "type" : "shared",
        "shares" : "NORMAL",
        "sizeGB" : 2
      },
      "cpuNum" : 1,
      "memCapacityMB" : 1024,
      "swapRatio" : 1.0,
      "haFlag" : "on",
      "configuration" : {
        "hadoop" : {}
      }
    },
    {
      "name" : "worker",
      "roles" : [
        "hadoop_datanode",
        "hadoop_tasktracker"
      ],
      "instanceNum" : 1,
      "storage" : {
        "type" : "shared",
        "shares" : "NORMAL",
        "sizeGB" : 2
      },
      "cpuNum" : 1,
      "memCapacityMB" : 1024,
      "swapRatio" : 1.0,
      "haFlag" : "off",
      "configuration" : {
        "hadoop" : {}
      }
    },
    {
      "name" : "client",
      "roles" : [
        "hadoop_client",
        "pig",
        "hive",
        "hive_server"
      ],
      "instanceNum" : 1,
      "storage" : {
        "type" : "shared",
        "shares" : "NORMAL",
        "sizeGB" : 2
      },
      "cpuNum" : 1,
      "memCapacityMB" : 1024,
      "swapRatio" : 1.0,
      "haFlag" : "off",
      "configuration" : {
        "hadoop" : {}
      }
    }
  ],
  "configuration" : {
    "hadoop" : {
      "core-site.xml" : {},
      "hdfs-site.xml" : {},
      "mapred-site.xml" : {},
      "hadoop-env.sh" : {},
      "log4j.properties" : {},
      "fair-scheduler.xml" : {},
      "capacity-scheduler.xml" : {},
      "mapred-queue-acls.xml" : {}
    }
  },
  "specFile" : false
}
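Before passing a spec file like the one above to the cluster create command, a quick grep-based sanity check can catch a mistyped key. The sketch below is illustrative only: it writes a minimal hypothetical spec to a temp file, whereas in the lab you would point SPEC at your exported /home/serengeti/small_cluster.json.

```shell
# Sketch: quick grep-based sanity check of a cluster spec before "cluster create".
# The minimal spec written below is hypothetical scratch data; in the lab you
# would point SPEC at /home/serengeti/small_cluster.json instead.
SPEC=$(mktemp)
cat > "$SPEC" <<'EOF'
{
  "nodeGroups" : [
    { "name" : "worker", "instanceNum" : 1, "cpuNum" : 1, "memCapacityMB" : 1024 }
  ]
}
EOF

# Flag any expected key that is absent (a mistyped key would show up here).
for key in nodeGroups instanceNum cpuNum memCapacityMB; do
  grep -q "\"$key\"" "$SPEC" || echo "missing key: $key"
done
echo "check complete"
rm -f "$SPEC"
```

This does not validate the JSON syntax itself, but it is a fast way to spot a missing or misspelled key before a long cluster deployment fails.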
Changing the CPU count using the Spec File
Open a new Putty session from the Windows Desktop, and select the SerengetiCLI.
Click Open.
Login to the management appliance
Login to the management appliance, using
Login : serengeti
password : password
Edit the JSON file
Open the JSON file in the vi editor by typing:
vi /opt/serengeti/conf/small_cluster.json
Change cpu count
Move the cursor to the line with "cpuNum".
Press
i
on the keyboard to enter insert mode.
Change the number 1 to 2 and press
esc
on the keyboard to exit insert mode.
Type
:wq
on the keyboard to save the changes and quit vi.
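If you prefer a non-interactive edit, the same change can be made with sed. The sketch below is illustrative: it works on a scratch file containing just the key being changed, rather than the lab's real /opt/serengeti/conf/small_cluster.json, and assumes GNU sed (as found on the Linux-based appliance) for the -i in-place flag.

```shell
# Sketch: the same cpuNum edit done non-interactively with sed instead of vi.
# Illustrative scratch file; in the lab you would copy
# /opt/serengeti/conf/small_cluster.json here first.
cat > /tmp/small_cluster.json <<'EOF'
{ "cpuNum" : 1, "memCapacityMB" : 1024 }
EOF

# The spec file writes the key with a space before the colon, so match that.
sed -i 's/"cpuNum" : 1/"cpuNum" : 2/' /tmp/small_cluster.json

grep '"cpuNum"' /tmp/small_cluster.json
# prints: { "cpuNum" : 2, "memCapacityMB" : 1024 }
```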
Deploy custom cluster
To create a custom cluster from the file you just edited, you would enter the Serengeti CLI and type the command below.
cluster create --name small_cluster_2cpu --specFile /home/serengeti/small_cluster.json
Due to time and resource constraints in our lab environment, we will not execute the command, but we have created a video showing the above command.
Video
Module 3 - Compute Only Clusters on Shared HDFS
Module Overview
Hadoop clusters typically require specialized expertise and dedicated hardware infrastructure to deploy. In the previous module you deployed a Basic Hadoop cluster that separated the Namenode and Jobtracker into their own Virtual Machines and kept each Tasktracker and Datanode combination in a single Virtual Machine. In this module you will see how easy it is to not only separate your Jobtracker and Namenode, but also to put Tasktrackers and Datanodes into their own VMs as well. This separation of Compute and Data is the key element of the Elastic Scaling demonstrated in Module 6 of this lab. Specifically, you will create a Compute Only cluster that deploys Jobtracker, Namenode and Tasktracker nodes, but does not create new Datanodes. Instead, you will point to an existing Hadoop File System (HDFS) that was previously created. The value in this is that many organizations today have isolated Hadoop clusters that make use of some of the same data. You can now easily spin up a cluster and point it to existing data in HDFS instead of copying it into a new filesystem.
Note: If you have not done so in a previous module, you MUST run the "Verify Hadoop Clusters Have Started" step under the Lab Overview section prior to doing this module.
Create Compute Only cluster
You will deploy a Hadoop compute-only cluster that uses an external HDFS filesystem and HVE.
Hadoop Virtualization Extensions (HVE) are changes VMware has submitted to the open source Apache community to make Hadoop run better on virtualized infrastructure. HVE refines Hadoop's replica placement, task scheduling, and balancer policies. Hadoop clusters implemented on virtualized infrastructure have full awareness of the topology on which they are running, which enhances their reliability and performance. For more information about HVE, you can refer to https://issues.apache.org/jira/browse/HADOOP-8468.
Connect to the Big Data Extensions CLI
Open Putty, and select the SerengetiCLI.
Click Open.
Login to the management appliance
Login to the management appliance, using
Login : serengeti
password : password
Start Big Data Extensions Command Line Interface (CLI)
Type
serengeti
to start the Serengeti CLI.
Connect to Serengeti server
Type
connect --host localhost:8443
Username : administrator@corp
password : VMware1!
to connect to the Serengeti server.
Hadoop Rack topology
Hadoop makes placement and execution decisions based on datacenter topology. Administrators provide their datacenter topology via a topology file. It specifies, for instance, the racks in the datacenter and the servers on each rack. In a virtual environment we have introduced the concept of a nodegroup to represent servers (that are actually VMs) that are running on a specific ESXi host. You can make Hadoop topology aware by uploading your topology file through the Big Data Extensions CLI. We are showing you a very simple example that only defines the racks and physical hosts.
To do this, upload the rack topology file by typing:
topology upload --fileName /opt/serengeti/conf/rack_topology.txt
The content of the file is :
rack1: esx-01a.corp.local, esx-02a.corp.local, esx-03a.corp.local
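Before uploading, it can be worth sanity-checking the topology file format. The sketch below is illustrative: it writes a scratch copy of the one-line file shown above rather than touching /opt/serengeti/conf/rack_topology.txt, and uses only awk.

```shell
# Sketch: sanity-check a rack topology file of the format shown above.
# Illustrative only - works on a scratch copy, not the lab's real file.
cat > /tmp/rack_topology.txt <<'EOF'
rack1: esx-01a.corp.local, esx-02a.corp.local, esx-03a.corp.local
EOF

# Every line should look like "rackname: host, host, ..." - flag any that do not.
awk -F': ' 'NF != 2 { print "bad line " NR ": " $0 }' /tmp/rack_topology.txt

# Count the hosts listed for each rack.
awk -F': ' '{ n = split($2, h, /, */); print $1 " has " n " hosts" }' /tmp/rack_topology.txt
# prints: rack1 has 3 hosts
rm -f /tmp/rack_topology.txt
```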
List topology
Verify that the topology has been uploaded by typing:
topology list
You should see that rack1 contains the 3 ESXi hosts in your vCenter.
Configuring Compute Only Hadoop Cluster
As we saw in Modules 1 and 2, Hadoop clusters can be created directly through the vSphere Big Data Extensions plugin. They can also be created through the CLI using a JSON specFile. The specFile contains the cluster configuration and points to the external Hadoop filesystem using the "externalHDFS" tag. This tag points to the Namenode of an existing Hadoop cluster.
This enables the new cluster to use the already existing HDFS filesystem while deploying Master and compute resources.
{
  "externalHDFS": "hdfs://192.168.110.123:8020",
  "distro": "PivotalHD",
  "nodeGroups": [
    {
      "name": "master",
      "roles": [
        "hadoop_jobtracker"
      ],
      "instanceNum": 1,
      "storage": {
        "type": "SHARED",
        "sizeGB": 1
      },
      "cpuNum": 1,
      "memCapacityMB": 1024,
      "haFlag": "off",
      "rpNames": [
        "Tier2RP"
      ]
    },
    {
      "name": "worker",
      "roles": [
        "hadoop_tasktracker"
      ],
      "instanceNum": 1,
      "cpuNum": 1,
      "memCapacityMB": 1024,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 1
      },
      "rpNames": [
        "Tier2RP" // change this to the resource pool added via Serengeti CLI
      ]
    },
    {
      "name": "client",
      "roles": [
        "hadoop_client"
      ],
      "instanceNum": 1,
      "cpuNum": 1,
      "memCapacityMB": 1024,
      "storage": {
        "type": "SHARED",
        "sizeGB": 1
      },
      "rpNames": [
        "Tier2RP"
      ]
    }
  ],
  "configuration": {
    "hadoop": {
      "core-site.xml": {
        // check for all settings at http://hadoop.apache.org/docs/stable/core-default.html
        // note: any value (int, float, boolean, string) must be enclosed in double quotes; here is a sample:
        // "io.file.buffer.size": "4096"
      },
      "hdfs-site.xml": {
        // check for all settings at http://hadoop.apache.org/docs/stable/hdfs-default.html
      },
      "mapred-site.xml": {
        // check for all settings at http://hadoop.apache.org/docs/stable/mapred-default.html
      },
      "hadoop-env.sh": {
        // "HADOOP_HEAPSIZE": "",
        // "HADOOP_NAMENODE_OPTS": "",
        // "HADOOP_DATANODE_OPTS": "",
        // "HADOOP_SECONDARYNAMENODE_OPTS": "",
        // "HADOOP_JOBTRACKER_OPTS": "",
        // "HADOOP_TASKTRACKER_OPTS": "",
        // "HADOOP_CLASSPATH": "",
        // "JAVA_HOME": "",
        // "PATH": ""
      },
      "log4j.properties": {
        // "hadoop.root.logger": "INFO,RFA",
        // "log4j.appender.RFA.MaxBackupIndex": "10",
        // "log4j.appender.RFA.MaxFileSize": "100MB",
        // "hadoop.security.logger": "DEBUG,DRFA"
      }
    }
  }
}
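To double-check which Namenode a compute-only spec points at, the externalHDFS value can be extracted with a one-line sed. This is an illustrative sketch using a hypothetical scratch file that mirrors the "externalHDFS" line of the spec, not a lab step.

```shell
# Sketch: extract the Namenode endpoint from a compute-only spec, to confirm
# which HDFS the new cluster will attach to. Hypothetical scratch data.
cat > /tmp/compute_only.json <<'EOF'
{ "externalHDFS": "hdfs://192.168.110.123:8020", "distro": "PivotalHD" }
EOF

sed -n 's/.*"externalHDFS": *"\([^"]*\)".*/\1/p' /tmp/compute_only.json
# prints: hdfs://192.168.110.123:8020
rm -f /tmp/compute_only.json
```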
Deploy Hadoop Cluster With PivotalHD Distro
From the Big Data Extensions CLI you can deploy the Compute Only cluster with PivotalHD as the distro and take advantage of HVE to provide virtual topology awareness. Below is an example of the command used to deploy an alternate distro to Apache. In this example, the file Pivotal.txt would specify the PivotalHD distro to be used. We will not actually execute this command in the lab because the PivotalHD distro has not been installed on the Serengeti server.
Type:
cluster create --name Pivotal --topology HVE --distro PivotalHD --specFile /opt/serengeti/conf/Pivotal.txt
Due to time and resource constraints in our lab environment, do not execute the command. The video below shows the deployment of a Compute Only Hadoop cluster.
Video
Hadoop Filesystem Commands Within CLI
In the previous section you saw the creation of a Compute Only Hadoop cluster that used an external HDFS filesystem. In this section you will use the Big Data Extensions CLI to upload files to the HDFS filesystem and verify that they have been uploaded.
Connect to Serengeti Server With Putty
Open Putty, and select the SerengetiCLI.
Click Open.
Login to the management appliance
Login to the management appliance, using
Login : serengeti
password : password
Start Big Data Extensions Command Line Interface (CLI)
Type
serengeti
to start the Serengeti CLI.
Connect to Serengeti server
Type
connect --host localhost:8443
Username : administrator@corp
password : VMware1!
to connect to the Serengeti server.
Connect to Cluster
We are going to load data into the small_cluster, so we need to point the CLI at that target.
Connect to the target cluster by typing:
cluster target --name small_cluster
Upload and download data from HDFS
We can use traditional put/get commands within the CLI to upload/download data from the Hadoop Filesystem (HDFS).
To upload the file /etc/inittab from the Linux EXT4 filesystem to /tmp/input/inittab in HDFS, issue the following command:
fs put --from /etc/inittab --to /tmp/input/inittab
To download the data to a new Linux file, type:
fs get --from /tmp/input/inittab --to /tmp/local-inittab
List files in HDFS
Type:
fs ls /tmp/input
to list the files there.
Verify that the file you just uploaded is there. Note: you can also use the Big Data Extensions Plugin to launch the Hadoop HDFS page to browse the filesystem from a web page. Simply go to your list of clusters in the vSphere Web Client, right-click the small_cluster in the Center Panel, and select Open HDFS Status Page. You can browse the filesystem from there.
Module 4 - Highly Available Hadoop
Module Overview
This is a single 15-minute lab.
vSphere provides a well-known capability to automatically restart VMs when a physical infrastructure failure occurs. If an ESXi host fails, vSphere HA will automatically restart the failed VM on another host in your vSphere cluster. vSphere Big Data Extensions adds to this capability by monitoring specific Hadoop processes and restarting the nodes when those processes fail. In this lab we will take a running Hadoop cluster, kill the Namenode process, and see that vSphere detects the process failure and automatically restarts the node.
Note: You MUST run the "Verify Hadoop Clusters Have Started" step under the Lab Overview section prior to doing this module.
How to Create a Hadoop Cluster with HA Enabled
Let's start by getting comfortable with the Big Data Extensions vCenter plugin and see how to create a Hadoop cluster with HA enabled.
Accessing the Big Data Extensions in vCenter
Open Firefox from your Desktop and enter Username corp\administrator and password VMware1!
Navigate to Big Data Extensions
Click on the Home icon at the top of the screen. In the Inventories panel, click on the Big Data Extensions icon.
Working with Clusters
Click on Hadoop Clusters in the Inventory Lists panel. You can now view details of Hadoop clusters already deployed.
Hadoop and HBase clusters are already running.
Notice that 4 Hadoop clusters have previously been created for you. We will be working with the small_cluster in this module. Click on the small_cluster to drill into the details.
Hadoop Cluster Nodes are Virtual Machines
Notice that the Hadoop cluster is made up of 3 VMs. The NodeGroup defines the Hadoop roles that have been enabled on those VMs and, ultimately, the Hadoop processes that are running. As a reminder, the Namenode keeps the directory tree of the Hadoop file system (HDFS) and tracks where data is stored across the filesystem. The Namenode does not actually store the data, but if it is down, the data is unavailable. In this Hadoop cluster the data is stored in the Worker Node VM. The small_cluster-master-0 VM contains the Hadoop Namenode process. Click on the small_cluster-master-0 VM to see its details.
Virtual Machine Availability Enabled
You are now looking at the small_cluster-master-0 Virtual Machine (VM) detail information. Click on the Summary tab to drill in further.
Namenode Virtual Machine Details
Now that you are looking at the Virtual Machine details for the Namenode VM, you can see whether it is protected by HA. Hover the mouse over the icon highlighted above to see the protection level. Next we will open the Command Line Interface (CLI) to see how this cluster was created.
Command Line Interface (CLI) for granular configuration
Hadoop cluster creation can be done through the vCenter UI; please try the other modules in this lab for details on that process. We are going to look at detailed configuration through the CLI.
From the Windows Desktop perform the following steps:
1) Click on the Putty Icon
2) Select the SerengetiCLI session
3) Click the Load button
4) Click the Open button
5) At the OS login enter the Password. It is password
Cluster Configuration using JSON file
Cluster definitions are done using JSON files. These specfiles define the nodes that make up your Hadoop clusters, including types of nodes, which Hadoop roles will be configured in each node, how many to deploy, resources allocated to each node, HA/FT on or off, node placement on hosts, and even affinity between types of nodes. (The modules on creating Hadoop clusters go into more detail on this.)
1) type cd /opt/serengeti/conf to move to the directory that contains the JSON files
2) type ls -al to list the files in that directory.
The cluster we looked at with the vSphere Web Client was named small_cluster. small_cluster.json is the file that was used to define that cluster.
small_cluster is defined by the small_cluster.json file
1) Type more small_cluster.json
Notice the NodeGroup with the name "master". The master NodeGroup contains two roles: Jobtracker and Namenode. These roles map directly to Chef recipes that are used to orchestrate the provisioning of the VMs. Also notice that HA is set to ON for the master NodeGroup and OFF for the worker NodeGroup. When we create the cluster through the command line, we simply reference this specfile in the cluster create command. We have already done that for you in this lab.
Kill the Namenode and Verify HA Restart
Now we are going to kill the Namenode process and see what happens.
Connect to the Namenode VM
From the Windows Desktop perform the following steps:
1) Click on the Putty Icon
2) Select the Namenode session
3) Click the Load button
4) Click the Open button
5) At the OS login enter the Password. It is password
Find the Namenode Process
1) type ps -ef | grep proc_namenode
This command lists the processes running on the system and searches for the string "proc_namenode". Remember the process ID; you will use it in the next step.
Kill the Namenode process
1) Type sudo kill -9 "Process_ID", replacing "Process_ID" with the process ID you identified in the previous step.
2) Type ps -ef | grep proc_namenode again to verify that the process is terminated.
This command terminates the process you identified. The Namenode service is now not running, and this Hadoop cluster cannot access data stored in the HDFS filesystem.
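The two manual steps above (find the PID, then kill it) can be combined into one scripted pattern. The sketch below is illustrative and deliberately targets a harmless sleep process rather than the real proc_namenode, so it is safe to try anywhere.

```shell
# Sketch: the manual "find the PID, then kill it" steps as one scripted pattern.
# Illustrative only - targets a harmless sleep process instead of proc_namenode.
sleep 300 &
TARGET=$!

# Same shape as: ps -ef | grep proc_namenode  (the PID is the second column)
PID=$(ps -ef | grep 'sleep 300' | grep -v grep | awk '{print $2}' | head -1)

kill -9 "$PID"

# In the lab you would re-run the ps|grep to confirm; here we ask the shell.
rc=0; wait "$TARGET" 2>/dev/null || rc=$?
echo "exit status $rc (137 means killed by SIGKILL)"
```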
Watch the Restart of the Namenode
We will navigate to the Console screen of the Namenode VM. Go back to the vSphere Web Client using Firefox.
1) Click on the Home icon or tab
2) Click on the Hosts and Clusters icon
Find the Namenode VM and Launch the Console
1. Expand the inventory on the left-hand side until you find the vSphere Big Data Extensions Resource Pool.
2. Expand the vSphere Big Data Extensions Server Small_cluster Resource Pool.
3. Expand the Master Resource Pool.
4. Click on the Small Cluster_Master-0 VM and launch the Console. Notice that the VM is restarting. It should take about 2 minutes to restart.
View the Namenode restart
Notice that the VM is restarting. It takes a couple of minutes for HA to determine that the Namenode process has failed and to initiate a restart.
Verify that the Namenode has Restarted
Verify New Namenode Process
The Namenode process is again running in the small_cluster-master-0 VM. vSphere identified the failure of the Namenode process and initiated an automatic restart, reducing the potential downtime for this Hadoop cluster.
Module 5 - Fast and Easy Deployment of HBase Clusters
Module Overview
Hadoop clusters typically require specialized expertise and dedicated hardware infrastructure to deploy. In this module you will see how easy it is to go beyond Hadoop deployment to configure your HBase cluster nodes, size the virtual machines - including CPU, Memory and Storage - and deploy into your existing vSphere environment.
Note: If you have not done so in a previous module, you MUST run the "Verify Hadoop Clusters Have Started" step under the Lab Overview section prior to doing this module.
Configure and Deploy HBase Clusters
In this module, you will see how to configure and deploy an HBase cluster using the vSphere Big Data Extensions Plugin.
Navigate to Hosts and Clusters
From the vSphere Web Client, click on Hosts and Clusters.
Create Resource pool
If you already created a resource pool in a previous module, skip down to the step "Return to homepage". Resource Pools allow you to limit the amount of CPU and Memory that can be consumed by your clusters, and as you will see in Module 6, they are also the mechanism for establishing the priority of one cluster over another in the case of resource contention.
Right-click on the cluster named Cluster Site A, and select New Resource Pool.
Configure resource pool
Name the resource pool MyHadoopCluster.
Leave all settings at the default level and click OK.
Return to homepage
Click the home button at the top to return to the homepage.
Navigate to Big Data Extensions Plugin
This is a vCenter Plugin providing specific capabilities to configure, deploy, and manage your Big Data environment.
Click on the "Big Data Extensions" tab
Select Hadoop Clusters
Click on the Hadoop clusters tab.
Create Hadoop Cluster
Four Hadoop clusters have been created for this lab. If any cluster that you need has not started or has an error status, follow the directions in the "Verify Hadoop Clusters Have Started" step under the Lab Overview section prior to doing this module.
Click Create New Hadoop Cluster
Name and type
You will choose your preferred Hadoop Distribution. Supported distros include Cloudera, MapR, Hortonworks, and PivotalHD. We will use the open source Apache distribution in this module.
There are several deployment types for your clusters. You can mimic the typical physical Hadoop deployment with the Basic Hadoop Cluster. This type will separate the Namenode and Jobtracker into their own Virtual Machines; however, each Tasktracker and Datanode combination will be in a single Virtual Machine. You also have the option of separating the Compute (Tasktracker) from the Datanode using the Data/Compute Separation Hadoop option. This facilitates the elastic scaling of Compute you can see in Module 6.
For this module you will be deploying an HBase cluster.
Select the following options:
Hadoop Cluster Name : HBase
Hadoop Distro: Apache
Deployment Type: HBase Cluster
Select the custom template
Each distinct Hadoop node configuration is called a NodeGroup. You will see specific NodeGroups based on the Deployment Type you selected, but you can also use the Command Line Interface to define any type of NodeGroup you want. In this section, you are sizing the virtual machine CPU, RAM, and data storage for each NodeGroup. You will also define the number of nodes of a specific NodeGroup to deploy. In the image above, you are going to deploy 3 Worker Nodes, each containing a TaskTracker and DataNode, 1 ComputeMaster (Jobtracker), and 1 DataMaster (NameNode).
Click the Resource template button, and select Customize
Customize the template
Note that you can select Shared or Local storage. Typically, Hadoop has been deployed with local storage to provide the data locality that is central to its performance. You can see that each NodeGroup can be configured with its own Datastore type. This means that, for instance, your DataNodes can run on Local storage, while you have the Jobtracker and Namenode on Shared storage. This allows the use of vSphere HA or FT to improve the availability of those nodes while still ensuring data locality.
Change the defaults to:
vCPU number: 1
Memory size: 1024
Storage size: 2
Datastore type: Shared
Click OK.
Select the Resources for the Hbase Cluster
Make sure to select the Customize option and size each NodeGroup's resources as in the previous step.
Set the number of nodes for each worker and client NodeGroup to 1.
Network and resource pool
Leave the network set to Defaultnetwork.
Click the Select button to choose a resource pool.
Select the proper resource pool
Select the resource pool, MyHadoopCluster, that you created in an earlier step.
Click OK.
Cancel creation
Depending on the size of the cluster, it takes anywhere from 6 to 20 minutes to deploy and be running. Due to resource and time constraints for the lab, we will not actually create the cluster.
Click Cancel to cancel the deployment, and watch the video below to see a deployment of an HBase cluster.
Video
Manage Hadoop Pooled Resources

Hadoop makes excellent use of the system resources that are made available to it. In an environment with shared physical resources that have been virtualized, it is important to appropriately assign the resources that can be used by your Hadoop clusters. vSphere allows you to specifically make available CPU, RAM, storage, and virtual networks to your Hadoop clusters. In this module, you will use the vSphere Big Data Extensions plugin to add network and storage resources to the Hadoop clusters.
Navigate to Big Data Extensions Plugin
Click on the "Big Data Extensions" tab
Select Resources
Click on the Resources tab.
Find Your Datastores
This process is not creating new datastores. It is simply allowing the administrator to determine which datastores can be used when creating Hadoop clusters. vSphere will then create virtual disks across those datastores during cluster creation.
Select the Datastores tab.
Add datastore
Click on the plus sign in the upper left corner to open the add datastore window.
Add datastore details
Fill out the information for the datastores you want to add. The Name you specify can be used in specfiles to refer to this set of datastores.
Name : Test datastores
The display name for this set of datastores.
Datastore: test*
This selects all datastores whose names begin with "test".
Datastore type: Shared
Select whether the datastores are shared or local storage.
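The "test*" pattern behaves like a shell-style wildcard. As an illustration of that matching rule (the datastore names below are invented for the example):

```python
# Shell-style glob matching, as used by the datastore name pattern above:
# "test*" selects every datastore whose name begins with "test".
from fnmatch import fnmatch

datastores = ["test01", "test02", "prod-ds01", "testing-local"]
selected = [ds for ds in datastores if fnmatch(ds, "test*")]
print(selected)  # ['test01', 'test02', 'testing-local']
```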
Select Cancel, because we have already added the datastores to your environment.
Networks
You are able to easily segment network traffic for specific clusters by adding multiple networks and using them in the cluster create specfiles.
Select the Networks tab.
Add network
Click on the plus sign in the upper left corner to open the add networks window.
Network information
Fill out the information for your selected network.
Name: The name you will refer to when creating your cluster specfiles.
Port group name: The name of the port group to which the network is attached.
Use DHCP to obtain IP address: Check this if there is DHCP on the network.
IP range: Type the IP range that the VMs can use.
Subnet mask: The subnet mask of the network.
Gateway: The gateway of the network.
DNS: The DNS server of the network.
Select Cancel to exit the wizard.
Module 6 - Elastic Hadoop
Module Overview

vSphere Big Data Extensions adds to the resource monitoring and sharing capabilities of vSphere. You will configure manual and automatic scaling of your Hadoop clusters. You will use resource pools with differing priorities and run MapReduce jobs to see how vSphere will scale cluster nodes in or out based on your priorities and the resource demands placed on the system. We will begin by introducing the vCenter extensions that provide the new Big Data functionality and show you how to monitor resource consumption of your clusters. Next you will manually resize your clusters, including creation of new cluster nodes, in support of increased resource demand. Finally, you will execute a MapReduce job called Pi on two separate clusters with different priorities. You will see how vSphere can automatically respond to resource contention by powering down lower-priority cluster nodes.
Note: You MUST run the "Verify Hadoop Clusters Have Started" step under the Lab Overview section prior to doing this module. The Tier1 and Tier2 clusters must have a status of RUNNING.
Elastic Hadoop Video
If you are running short of time and do not want to complete the Elastic Hadoop lab, we have included this video to show it in action.
Manage Existing Tier1 and Tier2 Clusters

We will get familiar with the clusters pre-created for this lab and use the Hadoop administrative views. We will also see the CPU performance views that will be used in the later part of the module.
Accessing the Big Data Extensions in vCenter
Open Firefox from your desktop and enter the username corp\administrator and the password VMware1!
We are going to navigate to the Big Data Extensions functionality in vCenter.
Navigate to Big Data Extensions
Click on the Home icon at the top of the screen. In the Inventories panel, click on the Big Data Extensions icon.
Working with Clusters
Click on Hadoop Clusters in the Inventory Lists panel. You can now view details of Hadoop clusters already deployed.
Manually Scale Out Hadoop Cluster
Those of you familiar with vSphere are comfortable with the idea of scaling individual VMs up or down. With Big Data Extensions, not only can we add resources to individual Hadoop nodes, but we can also add new nodes to existing clusters or power down nodes that are not needed for current workloads. To add new nodes to an existing cluster:
1) Right Click on the Tier1 Cluster
2) Select the Scale Out menu item
Choose The Number of Instances to Deploy
You can now choose the NodeGroup that you want to change and the total number of instances to deploy. Just a reminder that NodeGroups define the Hadoop roles that are configured on all VMs associated with a particular NodeGroup. We do not want to add instances in this NodeGroup because space is limited in our lab environment.
1) Click the Cancel Button.
View Hadoop File System (HDFS) Details
Deployed Hadoop clusters contain administrative pages that are available via your web browser. You can access those pages directly from vCenter. To view Hadoop File System (HDFS) information:
1) Right Click on the Tier1 Cluster
2) Click on Open HDFS Status Page
Note that this page is deployed in a separate tab in your Firefox browser.
Hadoop NameNode Page
Click on a few of the links to see the wealth of information that is available on thesepages.
View MapReduce Job Details
Deployed Hadoop clusters contain administrative pages that are available via your web browser. You can access those pages directly from vCenter. To view MapReduce information:
1) Right Click on the Tier1 Cluster
2) Click on Open MapReduce Status Page
Note that this page is deployed in a separate tab in your Firefox browser.
MapReduce Job Details Page
Click on a few of the links to see the wealth of information that is available on thesepages.
vSphere Performance Views
We are only going to look at CPU usage information in this lab. You should note that vSphere will monitor, and take action based upon, CPU and memory usage.
1) Click on the Tier1 cluster
Find One of The Worker VMs in Tier1 Cluster
In our clusters, the Data VMs contain the DataNode role for Hadoop. The Worker VMs contain the TaskTracker role and are responsible for executing the tasks that make up a job. Our goal is to make sure that we have the right number of Worker (TaskTracker) VMs available for the workload and prioritization defined for the clusters. Here we are going to monitor the performance of a single worker node from each of our two clusters.
1) Click on the Tier1-worker-0 VM in the Tier1 cluster.
Navigate to Advanced CPU Monitoring
Navigate to the Advanced CPU Performance tab for the Tier1-worker-0 VM. Get familiar with the CPU usage on this chart. Later in the module we will configure a specific chart view to monitor the load on the VM.
1. Click on the Monitor tab
2. Click on the Performance tab
3. Click on the Advanced tab
4. Click to close the Advanced panel
Manual Hadoop Elasticity

We will use the CLI to see how to deploy clusters into specific resource pools. We will also see how to directly access Hadoop clusters to scale in (power down) nodes and to resize (add new nodes) using manual commands.
Command Line Interface (CLI) For Manual Elasticity
Some Hadoop cluster management can be done through the vCenter UI. Please try the other modules in this lab for details on that process. We are going to look at changing cluster size through the CLI.
From the Windows Desktop perform the following steps:
1) Click on the Putty Icon
2) Select the SerengetiCLI session
3) Click the Load button
4) Click the Open button
5) At the OS login, enter the password: password
Cluster Configuration using JSON
Cluster definitions are done using JSON files. These are specfiles that define the nodes that make up your Hadoop clusters, including types of nodes, what Hadoop roles will be configured in each node, how many to deploy, resources allocated to each node, HA/FT on or off, node placement on hosts, and even affinity between types of nodes. (The modules on creating Hadoop clusters go into more detail on this.)
1) Type cd /opt/serengeti/conf to move to the directory that contains the JSON files
2) Type ls -al to list the files in that directory.
The cluster we looked at with the vSphere Web Client was named Tier1. Tier1.json is the file that was used to define that cluster.
Tier1 Cluster is Defined by Tier1.json
1) Type more Tier1.json
Notice the NodeGroup with the name Master. The Master NodeGroup contains two roles: JobTracker and NameNode. These roles map directly to Chef recipes that are used to orchestrate the provisioning of the VMs. Also notice that HA is set to ON for the Master NodeGroup. When we create the cluster through the command line, we simply reference this specfile in the cluster create command. We have already done that for you in this lab. Notice that we have specified the resource pool that this cluster will be deployed into. This is important for prioritization of clusters, as you will see later in the module.
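As a rough illustration of what such a specfile looks like, here is a hand-written sketch following the general Serengeti spec format. This is not the lab's actual Tier1.json; the field values, sizes, and resource pool name are assumptions for illustration only:

```json
{
  "nodeGroups": [
    {
      "name": "Master",
      "roles": ["hadoop_jobtracker", "hadoop_namenode"],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 2048,
      "storage": { "type": "SHARED", "sizeGB": 20 },
      "haFlag": "on",
      "rpNames": ["Tier1 Hadoop Clusters"]
    },
    {
      "name": "Worker",
      "roles": ["hadoop_tasktracker"],
      "instanceNum": 4,
      "cpuNum": 1,
      "memCapacityMB": 1024,
      "storage": { "type": "LOCAL", "sizeGB": 10 },
      "haFlag": "off",
      "rpNames": ["Tier1 Hadoop Clusters"]
    }
  ]
}
```

The points the text calls out are visible here: the Master NodeGroup carries the JobTracker and NameNode roles with HA on and shared storage, while Workers use local storage, and the resource pool is named per NodeGroup.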
Serengeti CLI
1. To open the CLI, type Serengeti. The CLI commands are case-sensitive.
2. Type connect --host localhost:8443 to connect to our management server.

The username is root and the password is VMware1!
You are now in a command-line environment that can interact directly with your Hadoop clusters.
Listing Hadoop Cluster Details
1. To see your clusters, type cluster list (note that the up arrow will let you see your command history). Notice that AUTO ELASTIC is set to Disabled for both our Tier1 and Tier2 clusters. This means that if you want to power a node in a Hadoop cluster on or off because a workload has changed, you must do it manually. We will automate this later in the lab.
Cluster List with More Details
1. Type cluster list --name Tier1 --detail to see additional details of the Tier1 cluster. Note that all 4 Worker nodes have a STATUS of Service Ready. This means the VMs are powered on and the Hadoop services are running.
Manually Change Running Cluster Nodes
If you need to change the number of worker nodes that are running, you will execute a single cluster command. Executing this command will take a couple of minutes and may impact the results of the Automated Elasticity lab. We recommend that you view the results in the video at the beginning of the module. Also, note that you can perform this operation through the vCenter Big Data Extensions plugin GUI by right-clicking on the cluster name and selecting "Scale Out".
1) The correct command is:
cluster setParam --name Tier1 --elasticityMode MANUAL --targetComputeNodeNum 3
These commands are case-sensitive. Notice that one of the VMs has been powered off.
Resizing Hadoop Clusters
Note: Do not execute the cluster resize command. This lab environment does not have enough storage to allocate additional nodes.
1. You can also manually add new nodes to an existing cluster (creating new VMs to extend the nodes in the cluster) with the cluster resize command. Type help cluster resize to see the syntax. Also, note that you can perform this operation through the vCenter Big Data Extensions plugin GUI by right-clicking on the cluster name and selecting "Scale Up/Down".
Automatic Hadoop Elasticity

We will execute MapReduce jobs on both our Tier1 and Tier2 clusters and see how vSphere responds to the consumption of CPU from multiple clusters with different priority levels. vSphere also supports scaling clusters in or out based on memory contention; however, we will focus on CPU contention in this lab. Note: The resources available to this lab are highly dependent upon the number of labs being deployed in the HOL environment. Your results may be different from those shown in the screenshots.
Start MapReduce Job on Tier2 Cluster
1. From the Windows desktop, click on Putty
2. Click on Tier2JobtrackerNode
3. Click on Load
4. Click on Open. The password is password
Show the MapReduce Script
1. Type cd /usr/lib/hadoop. This moves you to the Hadoop directory, which contains our script.
2. Type ls -al run* to see the Python scripts that call MapReduce Java apps. We are going to use the runPi.py script.
Run Pi MapReduce on Tier2
1. Type python runPi.py. This will start a Pi calculation MapReduce job that will saturate the CPU usage on the worker VMs of your Tier2 cluster. This script executes a CPU-heavy MapReduce job that will use 100% of the available resources in the worker (TaskTracker) VMs in our cluster. Note: It is possible that your results could be significantly different based on the total resource usage in the HOL environment.
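A wrapper like runPi.py typically just launches Hadoop's bundled Pi estimator example, a Monte Carlo MapReduce job. The sketch below is an assumption about what such a script might do; the jar name, path, and argument values are invented and may not match the lab image's actual runPi.py:

```python
# Hypothetical sketch of a runPi.py-style wrapper: it launches Hadoop's
# bundled "pi" example, a Monte Carlo MapReduce job that estimates Pi.
# The jar name and argument values below are assumptions for illustration.
import subprocess


def build_pi_command(maps=10, samples=1_000_000):
    """Build the command line for the Hadoop Pi estimator example.

    More map tasks and more samples per map mean more CPU load on the
    worker (TaskTracker) VMs and a more accurate Pi estimate.
    """
    return ["hadoop", "jar", "hadoop-examples.jar",
            "pi", str(maps), str(samples)]


def run_pi():
    # On a real cluster node this blocks until the MapReduce job completes.
    subprocess.run(build_pi_command(), check=True)
```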
Check Tier2 CPU Usage through Web Client
Click on the Home icon at the top of the screen. In the Inventories panel, click on the Big Data Extensions icon.
View Your Cluster List
Click on Hadoop Clusters in the Inventory Lists panel. You can now view details of Hadoop clusters already deployed.
Select Your Tier2 Cluster
1) Click on Tier2 Cluster
Find The Worker-0 VMs in Tier2 Cluster
In our clusters, the Data VMs contain the DataNode role for Hadoop. The Worker VMs contain the TaskTracker role and are responsible for executing the tasks that make up a job. Our goal is to make sure that we have the right number of Worker (TaskTracker) VMs available for the workload and prioritization defined for the clusters. Here we are going to monitor the performance of a single worker node from each of our two clusters.
1) Click on the Tier2-worker-0 VM in the Tier2 cluster.
Navigate to Advanced CPU Monitoring
Navigate to the Advanced CPU Performance tab for the Tier2-worker-0 VM.
1. Click on the Monitor tab
2. Click on the Performance tab
3. Click on the Advanced tab
4. Click to close the Advanced panel
Create Custom Chart for Tier2-worker-0 VM
You are going to create a custom chart that contains CPU Usage and CPU Ready Time. You will save this as a chart called "Elasticity Testing".
1) Click on Chart Options
Select the Performance Metrics for Your Custom Chart
1) Make sure that Target Object 0 is deselected and that Tier2-worker-0 is selected
2) Select the Ready Counter
3) Select the Usage Counter
Create "Elasticity Testing" View For Tier2 Worker-0 VM
1) Select "Save Options As" and type "Elasticity Testing" as the name.
This chart will now let you see CPU Ready Time and CPU Usage in a single pane.
A quick note on reading these numbers. This data is accumulated in 20-second intervals. You are looking at the average CPU utilization % over that interval. Ready time is a measure of the amount of time that a vCPU is ready to run but has not yet been scheduled on a physical CPU. This number should be less than 10% per vCPU. The collection interval is 20 seconds (or 20,000 milliseconds). We are running with 1 vCPU per VM, so Ready time above 2,000 milliseconds potentially signals that there is contention for resources and we may need to power down a Hadoop node VM to optimize performance of the clusters. Note: Because of the nature of our HOL environment, there can be spikes in Ready time that are unrelated to the workload within your individual labs. This means that VMs will tend to power on or off more often than in other physical infrastructure. It is also possible that you will not see any VMs power down. If you do not see results in two to three minutes, move on in the lab, because the Ready time did not exceed the threshold needed to invoke the power off. You can see the expected behavior in the video at the beginning of the module.
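The arithmetic behind that 2,000-millisecond rule of thumb can be sketched as follows. The 10%-per-vCPU guideline and 20-second sample interval are from the text above; the helper function name is our own:

```python
# Convert a raw CPU Ready value (milliseconds per sample) from a vSphere
# real-time performance chart into a percentage of the sample interval.
SAMPLE_INTERVAL_MS = 20_000  # real-time charts sample every 20 seconds


def ready_percent(ready_ms, num_vcpus=1):
    """Percent of the interval each vCPU spent ready but not scheduled."""
    return ready_ms / (SAMPLE_INTERVAL_MS * num_vcpus) * 100


# Rule of thumb from the lab: keep Ready below 10% per vCPU. With 1 vCPU,
# that works out to 2,000 ms per 20-second sample.
print(ready_percent(2000))  # 10.0 -> at the contention threshold
print(ready_percent(500))   # 2.5 -> healthy
```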
View Elasticity Testing Chart
1) Select "Elasticity Testing" from the Chart View Drop List.
Tier2-worker-0 Resource Consumption
Notice that we are using 100% of the one vCPU that is assigned to this VM. The Ready Time number should be relatively low; however, as mentioned in the note above, our lab environment will cause some Ready Time spikes due to the extreme over-allocation of resources to support thousands of VMs with limited physical hardware.
Start MapReduce Job on Tier1 Cluster
Now we want to repeat our previous process and start the MapReduce job on our Tier1 cluster.
1. From the Windows desktop, click on Putty
2. Click on Tier1JobtrackerNode
3. Click on Load
4. Click on Open. The password is password
Show the MapReduce Script
1. Type cd /usr/lib/hadoop. This moves you to the Hadoop directory, which contains our script.
2. Type ls -al run* to see the Python scripts that call MapReduce Java apps. We are going to use the runPi.py script.
Run Pi MapReduce on Tier1
1. Type python runPi.py. This will start a Pi calculation MapReduce job that will saturate the CPU usage on the worker VMs of your Tier1 cluster. This script executes a CPU-heavy MapReduce job that will use 100% of the available resources in the worker (TaskTracker) VMs in our cluster. Note: Because of the nature of our lab environment, it is possible that you will not see 100% CPU. You can see the expected result in the video at the beginning of the module.
Check Tier1 CPU Usage Through the Web Client
If you have not left the Performance Chart page we used to view Tier2 CPU, then click twice on the navigation drop list to go back to your cluster list. You can also navigate directly there from the drop list or by taking the path we used previously: Home -> Big Data Extensions -> Hadoop Clusters -> Tier1
Select Your Tier1 Cluster
1) Click on Tier1 Cluster
Find Your Tier1-worker-0 VM
In our clusters, the Data VMs contain the DataNode role for Hadoop. The Worker VMs contain the TaskTracker role and are responsible for executing the tasks that make up a job. Our goal is to make sure that we have the right number of Worker (TaskTracker) VMs available for the workload and prioritization defined for the clusters. Here we are going to monitor the performance of a single worker node from each of our two clusters.
1) Click on the Tier1-worker-0 VM in the Tier1 cluster.
Create "Elasticity Testing" View For Tier1 Worker-0 VM
1) Make sure that Target Object 0 is deselected and that Tier1-worker-0 is selected
2) Select the Ready Counter
3) Select the Usage Counter
4) Select "Save Options As" and name it "Elasticity Testing"
Tier1-worker-0 VM Resource Consumption
You should see CPU for this VM at 100% usage, as expected. You also should be seeing some increase in Ready Time. Note: As previously mentioned, due to the nature of our lab environment, you may not see 100% CPU usage. To see the expected behavior, you can view the video at the beginning of the module.
Tiered Service Levels - Set Resource Pool Priorities
We now want to show how to increase the priority of the Tier1 Hadoop cluster. We do that by setting the CPU shares in the Tier1 cluster's resource pool to High. Note that the shares were already set to High for you.
1) Click on the Home Icon or Home Tab
2) Click on Hosts and Clusters
Raise the Priority on Your Tier1 Resource Pool
Raising the priority of a resource pool that contains a Hadoop cluster means that the cluster will get a higher share of resources than clusters that are created in lower-priority resource pools.
1) Expand the inventory list on the left-hand side of the screen and click on the Tier1 Hadoop Clusters resource pool
2) Click on the Manage tab in the middle panel of the screen

Notice that CPU Shares is already set to High, but this is where you can change this setting.
Edit Tier1 Clusters Resource Pool CPU Shares
1) Click on Edit
2) Notice Shares are set to High
3) Click on OK
Constrain CPU Resource
Because of the nuances of our Hands-on Lab environment, we are going to arbitrarily limit the amount of CPU available to non-Tier1 VMs by setting a CPU reservation on the Tier1 Hadoop Clusters resource pool. This is not something you need to do in your own environment to enable elastic scaling.
1) Right click on the Tier1 Hadoop Clusters Resource Pool and select Edit Settings
2) Set a CPU reservation of 4144 MHz and click OK
As you view the performance charts later in the lab, you might like to come back here and play with this reservation amount. Increasing it will starve the Tier2 cluster, resulting in increases in CPU Ready time for its VMs.
Verify Worker Node VMs are Powered on
1) Click on Related Objects
2) Click on Virtual Machines
Verify that all Tier1 worker nodes are powered on. They should be, unless you powered them off in a previous lab.
Verify Tier2 Worker Nodes are Powered On
1) Click on Tier2 Hadoop Clusters Resource Pool
2) Click on Related Objects
3) Click on Virtual Machines
Verify that all Tier2 worker nodes are powered on. They should be, unless you powered them off in a previous lab.
Change Elasticity Mode to Auto
Now that we have set the priority of the Tier1 cluster resource pool to High, we want vSphere to automatically manage the number of Hadoop nodes that are running, based on the workloads and that prioritization. We will set the elasticity level through the CLI.
From the Windows Desktop perform the following steps:
1) Click on the Putty Icon
2) Select the SerengetiCLI session
3) Click the Load button
4) Click the Open button
5) At the OS login, enter the password: password
Connect to Serengeti CLI to Set Elasticity Mode
1. To open the CLI, type Serengeti. The CLI commands are case-sensitive.
2. Type connect --host localhost:8443 to connect to our management server.

The username is root and the password is VMware1!
You are now in a command-line environment that can interact directly with your Hadoop clusters. Note: We have sometimes seen the cluster entirely CPU-bound during this test, which can make connecting to Serengeti and running this command difficult. This is an artifact of our HOL environment. If you are unable to execute this portion of the lab, please see the video at the beginning of the lab for the expected results.
Listing Hadoop Cluster Details
1. To see your clusters, type cluster list (note that the up arrow will let you see your command history). Notice that AUTO ELASTIC is set to Disabled for both our Tier1 and Tier2 clusters.
Turn on Auto Elasticity Mode
To Turn on Auto Elasticity:
1) Type cluster setParam --name Tier2 --elasticityMode auto
2) Type cluster setParam --name Tier1 --elasticityMode auto
Note: You can also change the elasticity mode through the Big Data Extensions vCenter plugin: right-click on the cluster you want to set, select Set Elasticity, then select Auto.

Because you are consuming host CPU by running the runPi workload, this command takes a little longer than normal. Expect about 2 minutes for each cluster setParam command.
Monitor Power Off/On Tasks
It may take a few minutes for vSphere to determine that a node needs to be powered off.
1) Go back to the vSphere Web Client
2) On the right side of the screen, you will see the Recent Tasks panel. Click on All
3) Click on More Tasks.
VMs Powering On/Off
In a couple of minutes you should see VMs in your Tier1 and Tier2 clusters begin to power down. As mentioned before, the nature of our Hands-on Lab infrastructure will make this somewhat unpredictable, but generally you will see more Tier2 VMs power off than Tier1. Note: You should click the refresh button on this page to view the updated tasks more quickly. If you do not see this occur in a couple of minutes, please view the video at the beginning of the module for the expected result. Sometimes the Ready time threshold for powering down is not met and the VMs may not power off.
Monitoring CPU Performance Metrics
We will navigate back to our custom performance views to see what is happening with CPU Usage and Ready time.
1) Click on the Home icon at the top of the screen.
2) In the Inventories panel, click on the Big Data Extensions icon.
View Your Cluster List
Click on Hadoop Clusters in the Inventory Lists panel. You can now view details of Hadoop clusters already deployed.
Select Your Tier1 Cluster
1) Click on Tier1 Cluster
Find The Worker-0 VMs in Tier1 Cluster
In our clusters, the Data VMs contain the DataNode role for Hadoop. The Worker VMs contain the TaskTracker role and are responsible for executing the tasks that make up a job. Our goal is to make sure that we have the right number of Worker (TaskTracker) VMs available for the workload and prioritization defined for the clusters. Here we are going to monitor the performance of a single worker node from each of our two clusters.
1) Click on the Tier1-worker-0 VM in the Tier1 cluster.
Monitor Ready Time Reduction
1. Click on Monitor
2. Click on Performance
3. Click on the Chart Options view drop list and select "Elasticity Testing". This will give you your CPU Usage and Ready view.

You should see some reduction in the Ready time spikes based on a reduction in the CPU consumption across the cluster. Note: this will be dependent upon the infrastructure anomalies described earlier in the module.
Conclusion

Thank you for participating in the VMware 2013 Hands-on Labs. Be sure to visit http://hol.vmware.com/ to continue your lab experience online.
Lab SKU: HOL-SDC-1309
Version: 20140213-184824