Hadoop Deployment Manual

Bright Cluster Manager 7.0

Revision: abbc0ba

Date: Tue May 1 2018

©2015 Bright Computing, Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Bright Computing, Inc.

Trademarks
Linux is a registered trademark of Linus Torvalds. PathScale is a registered trademark of Cray, Inc. Red Hat and all Red Hat-based trademarks are trademarks or registered trademarks of Red Hat, Inc. SUSE is a registered trademark of Novell, Inc. PGI is a registered trademark of The Portland Group Compiler Technology, STMicroelectronics, Inc. SGE is a trademark of Sun Microsystems, Inc. FLEXlm is a registered trademark of Globetrotter Software, Inc. Maui Cluster Scheduler is a trademark of Adaptive Computing, Inc. ScaleMP is a registered trademark of ScaleMP, Inc. All other trademarks are the property of their respective owners.

Rights and Restrictions
All statements, specifications, recommendations, and technical information contained herein are current or planned as of the date of publication of this document. They are reliable as of the time of this writing and are presented without warranty of any kind, expressed or implied. Bright Computing, Inc. shall not be liable for technical or editorial errors or omissions which may occur in this document. Bright Computing, Inc. shall not be liable for any damages resulting from the use of this document.

Limitation of Liability and Damages Pertaining to Bright Computing, Inc.
The Bright Cluster Manager product principally consists of free software that is licensed by the Linux authors free of charge. Bright Computing, Inc. shall have no liability, nor will Bright Computing, Inc. provide any warranty for the Bright Cluster Manager, to the extent that is permitted by law. Unless confirmed in writing, the Linux authors and/or third parties provide the program as is, without any warranty, either expressed or implied, including, but not limited to, marketability or suitability for a specific purpose. The user of the Bright Cluster Manager product shall accept the full risk for the quality or performance of the product. Should the product malfunction, the costs for repair, service, or correction will be borne by the user of the Bright Cluster Manager product. No copyright owner or third party who has modified or distributed the program as permitted in this license shall be held liable for damages, including general or specific damages, damages caused by side effects or consequential damages, resulting from the use of the program or the un-usability of the program (including, but not limited to, loss of data, incorrect processing of data, losses that must be borne by you or others, or the inability of the program to work together with any other program), even if a copyright owner or third party had been advised about the possibility of such damages, unless such copyright owner or third party has signed a writing to the contrary.

Table of Contents

0.1 About This Manual
0.2 About The Manuals In General
0.3 Getting Administrator-Level Support

1 Introduction
1.1 What Is Hadoop About?
1.2 Available Hadoop Implementations
1.3 Further Documentation

2 Installing Hadoop
2.1 Command-line Installation Of Hadoop Using cm-hadoop-setup -c <filename>
2.1.1 Usage
2.1.2 An Install Run
2.2 Ncurses Installation Of Hadoop Using cm-hadoop-setup
2.3 Avoiding Misconfigurations During Hadoop Installation
2.3.1 NameNode Configuration Choices
2.4 Installing Hadoop With Lustre
2.4.1 Lustre Internal Server Installation
2.4.2 Lustre External Server Installation
2.4.3 Lustre Client Installation
2.4.4 Lustre Hadoop Configuration
2.5 Hadoop Installation In A Cloud

3 Viewing And Managing HDFS From CMDaemon
3.1 cmgui And The HDFS Resource
3.1.1 The HDFS Instance Overview Tab
3.1.2 The HDFS Instance Settings Tab
3.1.3 The HDFS Instance Tasks Tab
3.1.4 The HDFS Instance Notes Tab
3.2 cmsh And hadoop Mode
3.2.1 The show And overview Commands
3.2.2 The Tasks Commands

4 Hadoop Services
4.1 Hadoop Services From cmgui
4.1.1 Configuring Hadoop Services From cmgui
4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui

5 Running Hadoop Jobs
5.1 Shakedown Runs
5.2 Example End User Job Run

6 Hadoop-related Projects
6.1 Accumulo
6.1.1 Accumulo Installation With cm-accumulo-setup
6.1.2 Accumulo Removal With cm-accumulo-setup
6.1.3 Accumulo MapReduce Example
6.2 Hive
6.2.1 Hive Installation With cm-hive-setup
6.2.2 Hive Removal With cm-hive-setup
6.2.3 Beeline
6.3 Kafka
6.3.1 Kafka Installation With cm-kafka-setup
6.3.2 Kafka Removal With cm-kafka-setup
6.4 Pig
6.4.1 Pig Installation With cm-pig-setup
6.4.2 Pig Removal With cm-pig-setup
6.4.3 Using Pig
6.5 Spark
6.5.1 Spark Installation With cm-spark-setup
6.5.2 Spark Removal With cm-spark-setup
6.5.3 Using Spark
6.6 Sqoop
6.6.1 Sqoop Installation With cm-sqoop-setup
6.6.2 Sqoop Removal With cm-sqoop-setup
6.7 Storm
6.7.1 Storm Installation With cm-storm-setup
6.7.2 Storm Removal With cm-storm-setup
6.7.3 Using Storm

Preface

Welcome to the Hadoop Deployment Manual for Bright Cluster Manager 7.0.

0.1 About This Manual
This manual is aimed at helping cluster administrators install, understand, configure, and manage the Hadoop capabilities of Bright Cluster Manager. The administrator is expected to be reasonably familiar with the Bright Cluster Manager Administrator Manual.

0.2 About The Manuals In General
Regularly updated versions of the Bright Cluster Manager 7.0 manuals are available on updated clusters by default at /cm/shared/docs/cm. The latest updates are always online at http://support.brightcomputing.com/manuals.

• The Installation Manual describes installation procedures for a basic cluster.

• The Administrator Manual describes the general management of the cluster.

• The User Manual describes the user environment and how to submit jobs for the end user.

• The Cloudbursting Manual describes how to deploy the cloud capabilities of the cluster.

• The Developer Manual has useful information for developers who would like to program with Bright Cluster Manager.

• The OpenStack Deployment Manual describes how to deploy OpenStack with Bright Cluster Manager.

• The Hadoop Deployment Manual describes how to deploy Hadoop with Bright Cluster Manager.

• The UCS Deployment Manual describes how to deploy the Cisco UCS server with Bright Cluster Manager.

If the manuals are downloaded and kept in one local directory, then in most PDF viewers, clicking on a cross-reference in one manual that refers to a section in another manual opens and displays that section in the second manual. Navigating back and forth between documents is usually possible with keystrokes or mouse clicks.

For example: <Alt>-<Backarrow> in Acrobat Reader, or clicking on the bottom leftmost navigation button of xpdf, both navigate back to the previous document.


The manuals constantly evolve to keep up with the development of the Bright Cluster Manager environment and the addition of new hardware and/or applications. The manuals also regularly incorporate customer feedback. Administrator and user input is greatly valued at Bright Computing, so any comments, suggestions, or corrections will be very gratefully accepted at manuals@brightcomputing.com.

0.3 Getting Administrator-Level Support
Unless the Bright Cluster Manager reseller offers support, support is provided by Bright Computing over e-mail via support@brightcomputing.com. Section 10.2 of the Administrator Manual has more details on working with support.

1 Introduction

1.1 What Is Hadoop About?
Hadoop is the core implementation of a technology mostly used in the analysis of data sets that are large and unstructured in comparison with the data stored in regular relational databases. The analysis of such data, called data-mining, can be done better with Hadoop than with relational databases, for certain types of parallelizable problems. Hadoop has the following characteristics in comparison with relational databases:

1. Less structured input: Key-value pairs are used as records for the data sets instead of a database.

2. Scale-out rather than scale-up design: For large data sets, if the size of a parallelizable problem increases linearly, the corresponding cost of scaling up a single machine to solve it tends to grow exponentially, simply because the hardware requirements tend to get exponentially expensive. If, however, the system that solves it is a cluster, then the corresponding cost tends to grow linearly, because it can be solved by scaling out the cluster with a linear increase in the number of processing nodes.

Scaling out can be done, with some effort, for database problems, using a parallel relational database implementation. However, scale-out is inherent in Hadoop, and therefore often easier to implement with Hadoop. The Hadoop scale-out approach is based on the following design:

• Clustered storage: Instead of a single node with a special, large storage device, a distributed filesystem (HDFS) using commodity hardware devices across many nodes stores the data.

• Clustered processing: Instead of using a single node with many processors, the parallel processing needs of the problem are distributed out over many nodes. The procedure is called the MapReduce algorithm, and is based on the following approach:

– The distribution process "maps" the initial state of the problem into processes, out to the nodes, ready to be handled in parallel.

– Processing tasks are carried out on the data at the nodes themselves.

– The results are "reduced" back to one result.

3. Automated failure handling at application level for data: Replication of the data takes place across the DataNodes, which are the nodes holding the data. If a DataNode has failed, then another node which has the replicated data on it is used instead automatically. Hadoop switches over quickly, in comparison to replicated database clusters, due to not having to check database table consistency.

1.2 Available Hadoop Implementations
Bright Cluster Manager supports the Hadoop implementations provided by the following organizations:

1. Apache (http://apache.org): This is the upstream source for the Hadoop core and some related components, which all the other implementations use.

2. Cloudera (http://www.cloudera.com): Cloudera provides some extra premium functionality and components on top of a Hadoop suite. One of the extra components that Cloudera provides is the Cloudera Management Suite, a major proprietary management layer with some premium features.

3. Hortonworks (http://hortonworks.com): Hortonworks Data Platform (HDP) is a fully open source Hadoop suite.

4. Pivotal HD: Pivotal Hadoop Distribution is a completely Apache-compliant distribution with extensive analytic toolsets.

The ISO image for Bright Cluster Manager, available at http://www.brightcomputing.com/Download, can include Hadoop for all 4 implementations. During installation from the ISO, the administrator can choose which implementation to install (section 3.3.14 of the Installation Manual).

The contents and versions of the Hadoop implementations provided by Bright Computing at the time of writing (April 2015) are as follows:

• Apache, packaged as cm-apache-hadoop by Bright Computing. This provides:

– Hadoop versions 1.2.1, 2.2.0, and 2.6.0
– HBase version 0.98.11 for Hadoop 1.2.1
– HBase version 1.0.0 for Hadoop 2.2.0 and Hadoop 2.6.0
– ZooKeeper version 3.4.6

• Cloudera, packaged as cm-cloudera-hadoop by Bright Computing. This provides:

– CDH 4.7.1 (based on Apache Hadoop 2.0, HBase 0.94.15, and ZooKeeper 3.4.5)
– CDH 5.3.2 (based on Apache Hadoop 2.5.0, HBase 0.98.6, and ZooKeeper 3.4.5)


• Hortonworks, packaged as cm-hortonworks-hadoop by Bright Computing. This provides:

– HDP 1.3.9 (based on Apache Hadoop 1.2.0, HBase 0.94.6, and ZooKeeper 3.4.5)
– HDP 2.2.0 (based on Apache Hadoop 2.6.0, HBase 0.98.4, and ZooKeeper 3.4.6)

• Pivotal: Pivotal HD version 2.1.0 is based on Apache Hadoop 2.2.0. It runs fully integrated on Bright Cluster Manager, but is not available as a package in the Bright Cluster Manager repositories. This is because Pivotal prefers to distribute the software itself. The user can download the software from https://network.pivotal.io/products/pivotal-hd after registering with Pivotal.

1.3 Further Documentation
Further documentation is provided in the installed tarballs of the Hadoop version, after the Bright Cluster Manager installation (Chapter 2) has been carried out. A starting point is listed in the table below:

Hadoop version       Path under /cm/shared/apps/hadoop
Apache 1.2.1         Apache/1.2.1/docs/index.html
Apache 2.2.0         Apache/2.2.0/share/doc/hadoop/index.html
Apache 2.6.0         Apache/2.6.0/share/doc/hadoop/index.html
Cloudera CDH 4.7.1   Cloudera/2.0.0-cdh4.7.1/share/doc/index.html
Cloudera CDH 5.3.2   Cloudera/2.5.0-cdh5.3.2/share/doc/index.html
Hortonworks HDP      Online documentation is available at http://docs.hortonworks.com


2 Installing Hadoop

In Bright Cluster Manager, a Hadoop instance can be configured and run either via the command-line (section 2.1) or via an ncurses GUI (section 2.2). Both options can be carried out with the cm-hadoop-setup script, which is run from a head node. The script is a part of the cluster-tools package, and uses tarballs from the Apache Hadoop project. The Bright Cluster Manager With Hadoop installation ISO includes the cm-apache-hadoop package, which contains tarballs from the Apache Hadoop project suitable for cm-hadoop-setup.

2.1 Command-line Installation Of Hadoop Using cm-hadoop-setup -c <filename>

2.1.1 Usage
[root@bright70 ~]# cm-hadoop-setup -h

USAGE: /cm/local/apps/cluster-tools/bin/cm-hadoop-setup [-c <filename> | -u <name> | -h]

OPTIONS:
  -c <filename>  -- Hadoop config file to use
  -u <name>      -- uninstall Hadoop instance
  -h             -- show usage

EXAMPLES:
  cm-hadoop-setup -c /tmp/config.xml
  cm-hadoop-setup -u foo
  cm-hadoop-setup (no options: a GUI will be started)

Some sample configuration files are provided in the directory /cm/local/apps/cluster-tools/hadoop/conf/:

  hadoop1conf.xml       (for Hadoop 1.x)
  hadoop2conf.xml       (for Hadoop 2.x)
  hadoop2fedconf.xml    (for Hadoop 2.x with NameNode federation)
  hadoop2haconf.xml     (for Hadoop 2.x with High Availability)
  hadoop2lustreconf.xml (for Hadoop 2.x with Lustre support)
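A typical workflow is to copy one of the sample files, edit it locally, and pass it to the setup script. A minimal sketch, assuming the Hadoop 1.x sample is used:

Example

[root@bright70 ~]# cp /cm/local/apps/cluster-tools/hadoop/conf/hadoop1conf.xml /tmp/
(edit /tmp/hadoop1conf.xml as needed, then run:)
[root@bright70 ~]# cm-hadoop-setup -c /tmp/hadoop1conf.xml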


2.1.2 An Install Run
An XML template can be used, based on the examples in the directory /cm/local/apps/cluster-tools/hadoop/conf/.

In the XML template, the path for a tarball component is enclosed by <archive> </archive> tag pairs. The tarball components can be as indicated:

• <archive>hadoop tarball</archive>

• <archive>hbase tarball</archive>

• <archive>zookeeper tarball</archive>

The tarball components can be picked up from URLs as listed in section 1.2. The paths of the tarball component files that are to be used should be set up as needed before running cm-hadoop-setup.

The downloaded tarball components should be placed in the /tmp directory if the default definitions in the default XML files are used:

Example

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/conf
[root@bright70 conf]# grep 'archive>' hadoop1conf.xml | grep -o /.*gz
/tmp/hadoop-1.2.1.tar.gz
/tmp/zookeeper-3.4.6.tar.gz
/tmp/hbase-0.98.11-hadoop1-bin.tar.gz

Files under /tmp are not intended to stay around permanently. The administrator may therefore wish to place the tarball components in a more permanent part of the filesystem instead, and change the XML definitions accordingly.

A Hadoop instance name, for example Myhadoop, can also be defined in the XML file, within the <name></name> tag pair.

Hadoop NameNodes and SecondaryNameNodes handle HDFS metadata, while DataNodes manage HDFS data. The data must be stored in the filesystem of the nodes. The default path for where the data is stored can be specified within the <dataroot></dataroot> tag pair. Multiple paths can also be set, using comma-separated paths. NameNodes, SecondaryNameNodes, and DataNodes each use the value or values set within the <dataroot></dataroot> tag pair for their root directories.
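For illustration, a minimal fragment combining the tags discussed so far might look like the following. The values are examples only; the sample XML files contain further required settings:

Example

<name>Myhadoop</name>
<archive>/tmp/hadoop-1.2.1.tar.gz</archive>
<dataroot>/data1,/data2</dataroot>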

If needed, more specific tags can be used for each node type. This is useful in the case where hardware differs for the various node types. For example:

• a NameNode with 2 disk drives for Hadoop use

• a DataNode with 4 disk drives for Hadoop use

The XML file used by cm-hadoop-setup can in this case use the tag pairs:

• <namenodedatadirs></namenodedatadirs>

• <datanodedatadirs></datanodedatadirs>

If these are not specified, then the value within the <dataroot></dataroot> tag pair is used.


Example

• <namenodedatadirs>/data1,/data2</namenodedatadirs>

• <datanodedatadirs>/data1,/data2,/data3,/data4</datanodedatadirs>

Hadoop should then have the following dfs.*.dir properties added to it via the hdfs-site.xml configuration file. For the preceding tag pairs, the property values should be set as follows:

Example

• dfs.namenode.name.dir with values: /data1/hadoop/hdfs/namenode,/data2/hadoop/hdfs/namenode

• dfs.datanode.data.dir with values: /data1/hadoop/hdfs/datanode,/data2/hadoop/hdfs/datanode,/data3/hadoop/hdfs/datanode,/data4/hadoop/hdfs/datanode
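In hdfs-site.xml, each of these becomes a standard Hadoop property element. A sketch for the NameNode directories, using the example values above:

Example

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data1/hadoop/hdfs/namenode,/data2/hadoop/hdfs/namenode</value>
</property>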

An install run then displays output like the following:

Example

-rw-r--r-- 1 root root 63851630 Feb  4 15:13 hadoop-1.2.1.tar.gz
[root@bright70 ~]# cm-hadoop-setup -c /tmp/hadoop1conf.xml
Reading config from file '/tmp/hadoop1conf.xml'... done.
Hadoop flavor 'Apache', release '1.2.1'
Will now install Hadoop in /cm/shared/apps/hadoop/Apache/1.2.1 and configure instance 'Myhadoop'
Hadoop distro being installed... done.
Zookeeper being installed... done.
HBase being installed... done.
Creating module file... done.
Configuring Hadoop instance on local filesystem and images... done.
Updating images:
starting imageupdate for node 'node003'... started.
starting imageupdate for node 'node002'... started.
starting imageupdate for node 'node001'... started.
starting imageupdate for node 'node004'... started.
Waiting for imageupdate to finish... done.
Creating Hadoop instance in cmdaemon... done.
Formatting HDFS... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.

The Hadoop instance should now be running. The name defined for it in the XML file should show up within cmgui, in the Hadoop HDFS resource tree folder (figure 2.1).


Figure 2.1: A Hadoop Instance In cmgui

The instance name is also displayed within cmsh when the list command is run in hadoop mode:

Example

[root@bright70 ~]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Myhadoop    1.2.1           Apache               /etc/hadoop/Myhadoop

The instance can be removed as follows:

Example

[root@bright70 ~]# cm-hadoop-setup -u Myhadoop
Requested uninstall of Hadoop instance 'Myhadoop'.
Uninstalling Hadoop instance 'Myhadoop'. Removing:
/etc/hadoop/Myhadoop
/var/lib/hadoop/Myhadoop
/var/log/hadoop/Myhadoop
/var/run/hadoop/Myhadoop
/tmp/hadoop/Myhadoop
/etc/hadoop/Myhadoop/zookeeper
/var/lib/zookeeper/Myhadoop
/var/log/zookeeper/Myhadoop
/var/run/zookeeper/Myhadoop
/etc/hadoop/Myhadoop/hbase
/var/log/hbase/Myhadoop
/var/run/hbase/Myhadoop
/etc/init.d/hadoop-Myhadoop-*
Module file(s) deleted.
Uninstall successfully completed.

2.2 Ncurses Installation Of Hadoop Using cm-hadoop-setup

Running cm-hadoop-setup without any options starts up an ncurses GUI (figure 2.2).


Figure 2.2: The cm-hadoop-setup Welcome Screen

This provides an interactive way to add and remove Hadoop instances, along with HBase and ZooKeeper components. Some explanations of the items being configured are given along the way. In addition, some minor validation checks are done, and some options are restricted.

The suggested default values will work. Other values can be chosen instead of the defaults, but some care in selection is usually a good idea. This is because Hadoop is complex software, which means that values other than the defaults can sometimes lead to unworkable configurations (section 2.3).

The ncurses installation results in an XML configuration file. This file can be used with the -c option of cm-hadoop-setup to carry out the installation.

2.3 Avoiding Misconfigurations During Hadoop Installation

A misconfiguration can be defined as a configuration that works badly, or not at all.

For Hadoop to work well, the following common misconfigurations should normally be avoided. Some of these result in warnings from Bright Cluster Manager validation checks during configuration, but can be overridden. An override is useful for cases where the administrator would just like to, for example, test some issues not related to scale or performance.

2.3.1 NameNode Configuration Choices
One of the following NameNode configuration options must be chosen when Hadoop is installed. The choice should be made with care, because changing between the options after installation is not possible.

Hadoop 1.x NameNode Configuration Choices
• NameNode can optionally have a SecondaryNameNode.

SecondaryNameNode offloads metadata operations from NameNode, and also stores the metadata offline to some extent.

It is not by any means a high availability solution. While recovery from a failed head node is possible from SecondaryNameNode, it is not easy, and it is not recommended or supported by Bright Cluster Manager.

Hadoop 2.x NameNode Configuration Choices
• NameNode and SecondaryNameNode can run as in Hadoop 1.x.

However, the following configurations are also possible:

• NameNode HA with manual failover: In this configuration Hadoop has NameNode1 and NameNode2 up at the same time, with one active and one on standby. Which NameNode is active and which is on standby is set manually by the administrator. If one NameNode fails, then failover must be executed manually. Metadata changes are managed by ZooKeeper, which relies on a quorum of JournalNodes. The number of JournalNodes is therefore set to 3, 5, 7, ...

• NameNode HA with automatic failover: As for the manual case, except that in this case ZooKeeper manages failover too. Which NameNode is active and which is on standby is therefore decided automatically.

• NameNode Federation: In NameNode Federation, the storage of metadata is split among several NameNodes, each of which has a corresponding SecondaryNameNode. Each pair takes care of a part of HDFS.

In Bright Cluster Manager there are 4 NameNodes in a default NameNode federation:

– /user
– /tmp
– /staging
– /hbase

User applications do not have to know this mapping. This is because ViewFS on the client side maps the selected path to the corresponding NameNode. Thus, for example, hdfs -ls /tmp/example does not need to know that /tmp is managed by another NameNode.

Cloudera advise against using NameNode Federation for production purposes at present, due to its development status.

2.4 Installing Hadoop With Lustre
The Lustre filesystem has a client-server configuration.

2.4.1 Lustre Internal Server Installation
The procedure for installing a Lustre server varies.

The Bright Cluster Manager Knowledge Base article describes a procedure that uses Lustre sources from Whamcloud. The article is at http://kb.brightcomputing.com/faq/index.php?action=artikel&cat=9&id=176, and may be used as guidance.


2.4.2 Lustre External Server Installation
Lustre can also be configured so that the servers run external to Bright Cluster Manager. The Lustre Intel IEEL 2.x version can be configured in this manner.

2.4.3 Lustre Client Installation
It is preferred that the Lustre clients are installed on the head node, as well as on all the nodes that are to be Hadoop nodes. The clients should be configured to provide a Lustre mount on the nodes. If the Lustre client cannot be installed on the head node, then Bright Cluster Manager has the following limitations during installation and maintenance:

• the head node cannot be used to run Hadoop services

• end users cannot perform Hadoop operations, such as job submission, on the head node. Operations such as those should instead be carried out while logged in to one of the Hadoop nodes.

In the remainder of this section, a Lustre mount point of /mnt/lustre is assumed, but it can be set to any convenient directory mount point.

The user IDs and group IDs of the Lustre server and clients should be consistent. It is quite likely that they differ when first set up. The IDs should be checked at least for the following users and groups:

• users: hdfs, mapred, yarn, hbase, zookeeper

• groups: hadoop, zookeeper, hbase

If they do not match on the server and clients, then they must be made consistent manually, so that the UID and GID of the Lustre server users are changed to match the UID and GID of the Bright Cluster Manager users. A quick way to compare the IDs is sketched below.
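A minimal sketch for such a check, run on both the server and a client, assuming the standard id and getent tools are available:

Example

# print UID and GIDs for each Hadoop-related user
for u in hdfs mapred yarn hbase zookeeper; do id "$u"; done
# print the GID for each Hadoop-related group
getent group hadoop zookeeper hbase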

Once consistency has been checked, and read/write access is working to LustreFS, the Hadoop integration can be configured.

2.4.4 Lustre Hadoop Configuration
Lustre Hadoop XML Configuration File Setup
Hadoop integration can be configured by using the file /cm/local/apps/cluster-tools/hadoop/conf/hadoop2lustreconf.xml as a starting point for the configuration. It can be copied over to, for example, /root/hadooplustreconfig.xml.

The Intel Distribution for Hadoop (IDH) and Cloudera can both run with Lustre under Bright Cluster Manager. The configuration for these can be done as follows:

• IDH

– A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair:

Example

<afs>
  <fstype>lustre</fstype>
  <fsroot>/mnt/lustre/hadoop</fsroot>
</afs>

• Cloudera

– A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair.

– In addition, an <fsjar></fsjar> tag pair must be specified manually, for the jar that the Intel IEEL 2.x distribution provides:

Example

<afs>
  <fstype>lustre</fstype>
  <fsroot>/mnt/lustre/hadoop</fsroot>
  <fsjar>/root/lustre/hadoop-lustre-plugin-2.3.0.jar</fsjar>
</afs>

The installation of the Lustre plugin is automatic if this jar name is set to the right name, when the cm-hadoop-setup script is run.

Lustre Hadoop Installation With cm-hadoop-setup
The XML configuration file specifies how Lustre should be integrated in Hadoop. If the configuration file is at /root/hadooplustreconfig.xml, then it can be run as:

Example

cm-hadoop-setup -c /root/hadooplustreconfig.xml

As part of configuring Hadoop to use Lustre, the execution will:

• Set the ACLs on the directory specified within the <fsroot></fsroot> tag pair. This was set to /mnt/lustre/hadoop earlier on, as an example.

• Copy the Lustre plugin from its jar path, as specified in the XML file, to the correct place on the client nodes.

Specifically, the subdirectory share/hadoop/common/lib is copied into a directory relative to the Hadoop installation directory. For example, the Cloudera version of Hadoop, version 2.3.0-cdh5.1.2, has the Hadoop installation directory /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2. The copy is therefore carried out in this case from /root/lustre/hadoop-lustre-plugin-2.3.0.jar to /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2/share/hadoop/common/lib.
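A quick manual check that the plugin jar has landed where expected can be carried out afterwards. The path below follows the Cloudera example above:

Example

ls /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2/share/hadoop/common/lib | grep -i lustre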


Lustre Hadoop Integration In cmsh And cmgui
In cmsh, Lustre integration is indicated in hadoop mode:

Example

[hadoop2->hadoop]% show hdfs1 | grep -i lustre
Hadoop root for Lustre  /mnt/lustre/hadoop
Use Lustre              yes

In cmgui, the Overview tab in the items for the Hadoop HDFS resource indicates if Lustre is running, along with its mount point (figure 2.3).

Figure 2.3: A Hadoop Instance With Lustre In cmgui

2.5 Hadoop Installation In A Cloud
Hadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual), or a Cluster Extension configuration (Chapter 3 of the Cloudbursting Manual). In both cases, the cloud nodes used should be at least m1.medium.

• For Cluster-On-Demand, the following considerations apply:

– There are no specific issues. After a stop/start cycle, Hadoop recognizes the new IP addresses, and refreshes the list of nodes accordingly (section 2.4.1 of the Cloudbursting Manual).

• For Cluster Extension, the following considerations apply:

– To install Hadoop on cloud nodes, the XML configuration /cm/local/apps/cluster-tools/hadoop/conf/hadoop2clusterextensionconf.xml can be used as a guide.


– In the hadoop2clusterextensionconf.xml file, the cloud director that is to be used with the Hadoop cloud nodes must be specified by the administrator with the <edge></edge> tag pair:

Example

<edge>
  <hosts>eu-west-1-director</hosts>
</edge>

Maintenance operations, such as a format, will automatically and transparently be carried out by cmdaemon running on the cloud director, and not on the head node.

There are some shortcomings as a result of relying upon the cloud director:

– Cloud nodes depend on the same cloud director.

– While Hadoop installation (cm-hadoop-setup) is run on the head node, users must run Hadoop commands (job submissions, and so on) from the director, not from the head node.

– It is not possible to mix cloud and non-cloud nodes for the same Hadoop instance. That is, a local Hadoop instance cannot be extended by adding cloud nodes.


3 Viewing And Managing HDFS From CMDaemon

3.1 cmgui And The HDFS Resource
In cmgui, clicking on the resource folder for Hadoop in the resource tree opens up an Overview tab that displays a list of Hadoop HDFS instances, as in figure 2.1.

Clicking on an instance brings up a tab from a series of tabs associated with that instance, as in figure 3.1.

Figure 3.1: Tabs For A Hadoop Instance In cmgui

The possible tabs are: Overview (section 3.1.1), Settings (section 3.1.2), Tasks (section 3.1.3), and Notes (section 3.1.4). Various sub-panes are displayed in the possible tabs.


3.1.1 The HDFS Instance Overview Tab
The Overview tab pane (figure 3.2) arranges Hadoop-related items in sub-panes for convenient viewing.

Figure 3.2: Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed:

• Statistics: In the first sub-pane row are statistics associated with the Hadoop instance. These include node data on the number of applications that are running, as well as memory and capacity usage.

• Metrics: In the second sub-pane row, a metric can be selected for display as a graph.

• Roles: In the third sub-pane row, the roles associated with a node are displayed.

3.1.2 The HDFS Instance Settings Tab
The Settings tab pane (figure 3.3) displays the configuration of the Hadoop instance.


Figure 3.3: Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed:

• WebHDFS: Allows HDFS to be modified via the web.

• Description: An arbitrary, user-defined description.

• Directory settings for the Hadoop instance: Directories for configuration, temporary purposes, and logging.

• Topology awareness setting: Optimizes Hadoop for racks, switches, or nothing, when it comes to network topology.

• Replication factors: Set the default and maximum replication allowed.

• Balancing settings: Spread loads evenly across the instance.

3.1.3 The HDFS Instance Tasks Tab
The Tasks tab (figure 3.4) displays Hadoop-related tasks that the administrator can execute.


Figure 3.4: Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to:

• Restart, start, or stop all the Hadoop daemons.

• Start/Stop the Hadoop balancer.

• Format the Hadoop filesystem. A warning is given before formatting.

• Carry out a manual failover for the NameNode. This feature means making the other NameNode the active one, and is available only for Hadoop instances that support NameNode failover.

3.1.4 The HDFS Instance Notes Tab
The Notes tab is like a jotting pad, and can be used by the administrator to write notes for the associated Hadoop instance.

3.2 cmsh And hadoop Mode
The cmsh front end can be used instead of cmgui to display information on Hadoop-related values, and to carry out Hadoop-related tasks.

3.2.1 The show And overview Commands
The show Command
Within hadoop mode, the show command displays parameters that correspond mostly to cmgui's Settings tab in the Hadoop resource (section 3.1.2):

Example

[root@bright70 conf]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Apache121   1.2.1           Apache               /etc/hadoop/Apache121
[bright70->hadoop]% show apache121
Parameter                               Value
-------------------------------------   ------------------------------------
Automatic failover                      Disabled
Balancer Threshold                      10
Balancer period                         2
Balancing policy                        dataNode
Cluster ID
Configuration directory                 /etc/hadoop/Apache121
Configuration directory for HBase       /etc/hadoop/Apache121/hbase
Configuration directory for ZooKeeper   /etc/hadoop/Apache121/zookeeper
Creation time                           Fri, 07 Feb 2014 17:03:01 CET
Default replication factor              3
Enable HA for YARN                      no
Enable NFS gateway                      no
Enable Web HDFS                         no
HA enabled                              no
HA name service
HDFS Block size                         67108864
HDFS Permission                         no
HDFS Umask                              077
HDFS log directory                      /var/log/hadoop/Apache121
Hadoop distribution                     Apache
Hadoop root
Hadoop version                          1.2.1
Installation directory for HBase        /cm/shared/apps/hadoop/Apache/hbase-0+
Installation directory for ZooKeeper    /cm/shared/apps/hadoop/Apache/zookeepe+
Installation directory                  /cm/shared/apps/hadoop/Apache/1.2.1
Maximum replication factor              512
Network
Revision
Root directory for data                 /var/lib
Temporary directory                     /tmp/hadoop/Apache121
Topology                                Switch
Use HTTPS                               no
Use Lustre                              no
Use federation                          no
Use only HTTPS                          no
YARN automatic failover                 Hadoop
description                             installed from: /tmp/hadoop-1.2.1.tar+
name                                    Apache121
notes

The overview Command
Similarly, the overview command corresponds somewhat to cmgui's Overview tab in the Hadoop resource (section 3.1.1), and provides Hadoop-related information on the system resources that are used:

Example

[bright70->hadoop]% overview apache121
Parameter                       Value
-----------------------------   ---------------------------
Name                            apache121
Capacity total                  0B
Capacity used                   0B
Capacity remaining              0B
Heap memory total               0B
Heap memory used                0B
Heap memory remaining           0B
Non-heap memory total           0B
Non-heap memory used            0B
Non-heap memory remaining       0B
Nodes available                 0
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     0
Total blocks                    0
Missing blocks                  0
Under-replicated blocks         0
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       0
Applications running            0
Applications pending            0
Applications submitted          0
Applications completed          0
Applications failed             0
Federation setup                no

Role                            Node
------------------------------  ---------------------------
DataNode, ZooKeeper             bright70,node001,node002

3.2.2 The Tasks Commands
Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The services Commands For Hadoop Services
Hadoop services can be started, stopped, and restarted, with:

• restartallservices

• startallservices

• stopallservices

Example

[bright70->hadoop]% restartallservices apache121
Will now stop all Hadoop services for instance 'apache121'... done.
Will now start all Hadoop services for instance 'apache121'... done.
[bright70->hadoop]%

The startbalancer And stopbalancer Commands For Hadoop
For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer

• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:

• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp  Iteration  Bytes Moved  Bytes To Move  Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%
Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121
For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.


4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:


• Selecting A Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui
The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means it includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.


5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220
================================================================
Press [CTRL+C] to stop...
================================================================
...
================================================================
start cleaning directories...
================================================================
...
================================================================
clean directories done
================================================================
...
================================================================
start doing gen_test...
================================================================
...
14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2
...
Job Counters
...
Map-Reduce Framework
...
org.apache.hadoop.examples.terasort.TeraGen$Counters
...
14/03/24 15:07:03 INFO terasort.TeraSort: starting
...
14/03/24 15:09:12 INFO terasort.TeraSort: done
================================================================================
gen_test done
================================================================================
...
================================================================================
start doing PI test...
================================================================================
Working Directory = /user/root/bbp
...

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                       Value
------------------------------  ---------------------------------
Name                            Apache220
Capacity total                  27.56GB
Capacity used                   72.46MB
Capacity remaining              16.41GB
Heap memory total               280.7MB
Heap memory used                152.1MB
Heap memory remaining           128.7MB
Non-heap memory total           258.1MB
Non-heap memory used            251.9MB
Non-heap memory remaining       6.155MB
Nodes available                 3
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     72
Total blocks                    31
Missing blocks                  0
Under-replicated blocks         2
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       59666
Applications running            1
Applications pending            0
Applications submitted          7
Applications completed          6
Applications failed             0
High availability               Yes (automatic failover disabled)
Federation setup                no

Role                                                            Node
--------------------------------------------------------------  -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper  node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper              node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.

The user fred can then submit a run from a pi value estimator, from the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache/2.2.0
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
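For instance, a longer run with more maps and samples (the values here are arbitrary illustrations) yields a closer estimate:

Example

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 1000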


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation, And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are, by default, installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option:

-p <rootpass>

is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.

The option:

-s <heapsize>

is not mandatory. If not set, a default value of 1GB is used.

The option:

--master <nodename>

is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run as the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest \
-i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret \
--count 10 --minKeySize 10 --maxKeySize 10 \
--minValueSize 78 --maxValueSize 78 --table sort --splits 10

6.2 Hive
Apache Hive is a data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation, And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use:

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%' IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;
DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.

– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists. A filled-in version of these statements is sketched after this list.

• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script, using hive and beeline.
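As an illustration of the GRANT statements listed earlier, the following shows them with the placeholders filled in, assuming a metastore database named metastore_hdfs1 (the name that appears in the example run below) and a hypothetical password seCret1:

Example

mysql> GRANT ALL PRIVILEGES ON metastore_hdfs1.* TO 'hive'@'%' IDENTIFIED BY 'seCret1';
mysql> FLUSH PRIVILEGES;
mysql> DROP DATABASE IF EXISTS metastore_hdfs1;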

An Example Run With cm-hive-setup

Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client...
-- testing 'beeline' client...
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000
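Beeline can run any HiveQL statement in the same way. For instance, a row count on the test table listed in the output above (assuming it exists, as shown there) might look like:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SELECT COUNT(*) FROM test;'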

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation, And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
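After installation, basic produce/consume operation can be checked with the console tools that ship with Kafka. The following is a minimal sketch. The module name kafka/hdfs1 is an assumption that follows the naming pattern of the other components, and node001 is assumed to be one of the ZooKeeper nodes, with ZooKeeper on its default port 2181 and the Kafka broker on its default port 9092:

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 --replication-factor 1 --partitions 1 --topic test
[root@bright70 ~]# kafka-console-producer.sh --broker-list node001:9092 --topic test
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 --topic test --from-beginning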

6.3.2 Kafka Removal With cm-kafka-setup

cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig

Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended, by language design, to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup

Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation And What Pig Installation Does

The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup

cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig

Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce mode", that is, it uses the corresponding HDFS installation to store and process data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
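A batch script along these lines can be put together as follows. This is a minimal word-count sketch, not the smoke test itself; the HDFS input path /user/root/input.txt is assumed to exist, while the output path must not yet exist:

[root@bright70 ~]# cat > /tmp/wordcount.pig <<'EOF'
-- count how often each word occurs in an HDFS text file
lines  = LOAD '/user/root/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
STORE counts INTO '/user/root/wordcount_out';
EOF
[root@bright70 ~]# pig -f /tmp/wordcount.pig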

6.5 Spark

Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup

Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does

The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: this is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


The Spark Master service runs on the active head node by default, but it can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: this is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup in standalone mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode.
Spark Master service will be run on node node005.
Spark will use all DataNodes as WorkerNodes.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup

cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark

Spark supports two deploy modes for launching Spark applications on YARN. Consider the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process that is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar
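For a Spark instance deployed in Standalone mode, the application is submitted to the Spark Master service instead of to YARN. The following is a minimal sketch, assuming the Master from the earlier Standalone example runs on node005 and listens on Spark's default port 7077:

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master spark://node005.cm.cluster:7077 --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar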

6.6 Sqoop

Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop: it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup

Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation And What Sqoop Installation Does

The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option, in order to specify the connector version to be used:

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
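Once installation has completed, a table can be imported from MySQL into HDFS with sqoop import. The following is a minimal sketch. The module name sqoop/hdfs1 is an assumption that follows the naming pattern of the other components, while the database mydb, the user dbuser, and the table employees are illustrative:

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://bright70.cm.cluster/mydb --username dbuser -P --table employees --target-dir /user/root/employees -m 1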


6.6.2 Sqoop Removal With cm-sqoop-setup

cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm

Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemon to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup

Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does

The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• By default, the cm-storm-setup script installs Storm only on the active head node and on the DataNodes of the chosen Hadoop instance. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at: http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup

The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm

The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2
470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2
[root@hadoopdev ~]# storm list
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15
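A running topology can be removed from the cluster with the storm kill command, which deactivates the topology and then removes it after a timeout:

[root@bright70 ~]# storm kill WordCount2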

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 2: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

copy2015 Bright Computing Inc All Rights Reserved This manual or partsthereof may not be reproduced in any form unless permitted by contractor by written permission of Bright Computing Inc

TrademarksLinux is a registered trademark of Linus Torvalds PathScale is a regis-tered trademark of Cray Inc Red Hat and all Red Hat-based trademarksare trademarks or registered trademarks of Red Hat Inc SUSE is a reg-istered trademark of Novell Inc PGI is a registered trademark of ThePortland Group Compiler Technology STMicroelectronics Inc SGE is atrademark of Sun Microsystems Inc FLEXlm is a registered trademarkof Globetrotter Software Inc Maui Cluster Scheduler is a trademark ofAdaptive Computing Inc ScaleMP is a registered trademark of ScaleMPInc All other trademarks are the property of their respective owners

Rights and RestrictionsAll statements specifications recommendations and technical informa-tion contained herein are current or planned as of the date of publicationof this document They are reliable as of the time of this writing and arepresented without warranty of any kind expressed or implied BrightComputing Inc shall not be liable for technical or editorial errors oromissions which may occur in this document Bright Computing Incshall not be liable for any damages resulting from the use of this docu-ment

Limitation of Liability and Damages Pertaining toBright Computing IncThe Bright Cluster Manager product principally consists of free softwarethat is licensed by the Linux authors free of charge Bright ComputingInc shall have no liability nor will Bright Computing Inc provide anywarranty for the Bright Cluster Manager to the extent that is permittedby law Unless confirmed in writing the Linux authors andor third par-ties provide the program as is without any warranty either expressed orimplied including but not limited to marketability or suitability for aspecific purpose The user of the Bright Cluster Manager product shallaccept the full risk for the quality or performance of the product Shouldthe product malfunction the costs for repair service or correction will beborne by the user of the Bright Cluster Manager product No copyrightowner or third party who has modified or distributed the program aspermitted in this license shall be held liable for damages including gen-eral or specific damages damages caused by side effects or consequentialdamages resulting from the use of the program or the un-usability of theprogram (including but not limited to loss of data incorrect processingof data losses that must be borne by you or others or the inability of theprogram to work together with any other program) even if a copyrightowner or third party had been advised about the possibility of such dam-ages unless such copyright owner or third party has signed a writing tothe contrary

Table of Contents

Table of Contents i01 About This Manual iii02 About The Manuals In General iii03 Getting Administrator-Level Support iv

1 Introduction 111 What Is Hadoop About 112 Available Hadoop Implementations 213 Further Documentation 3

2 Installing Hadoop 521 Command-line Installation Of Hadoop Using

cm-hadoop-setup -c ltfilenamegt 5211 Usage 5212 An Install Run 6

22 Ncurses Installation Of Hadoop Using cm-hadoop-setup 823 Avoiding Misconfigurations During Hadoop Installation 9

231 NameNode Configuration Choices 924 Installing Hadoop With Lustre 10

241 Lustre Internal Server Installation 10242 Lustre External Server Installation 11243 Lustre Client Installation 11244 Lustre Hadoop Configuration 11

25 Hadoop Installation In A Cloud 13

3 Viewing And Managing HDFS From CMDaemon 1531 cmgui And The HDFS Resource 15

311 The HDFS Instance Overview Tab 16312 The HDFS Instance Settings Tab 16313 The HDFS Instance Tasks Tab 17314 The HDFS Instance Notes Tab 18

32 cmsh And hadoop Mode 18321 The show And overview Commands 18322 The Tasks Commands 20

4 Hadoop Services 2341 Hadoop Services From cmgui 23

411 Configuring Hadoop Services From cmgui 23412 Monitoring And Modifying The Running Of

Hadoop Services From cmgui 24

ii Table of Contents

5 Running Hadoop Jobs 2751 Shakedown Runs 2752 Example End User Job Run 29

6 Hadoop-related Projects 3161 Accumulo 31

611 Accumulo Installation With cm-accumulo-setup 32612 Accumulo Removal With cm-accumulo-setup 33613 Accumulo MapReduce Example 33

62 Hive 34621 Hive Installation With cm-hive-setup 34622 Hive Removal With cm-hive-setup 36623 Beeline 36

63 Kafka 36631 Kafka Installation With cm-kafka-setup 36632 Kafka Removal With cm-kafka-setup 37

64 Pig 37641 Pig Installation With cm-pig-setup 37642 Pig Removal With cm-pig-setup 38643 Using Pig 38

65 Spark 39651 Spark Installation With cm-spark-setup 39652 Spark Removal With cm-spark-setup 41653 Using Spark 41

66 Sqoop 41661 Sqoop Installation With cm-sqoop-setup 41662 Sqoop Removal With cm-sqoop-setup 43

67 Storm 43671 Storm Installation With cm-storm-setup 43672 Storm Removal With cm-storm-setup 44673 Using Storm 44

Preface

Welcome to the Hadoop Deployment Manual for Bright Cluster Manager70

01 About This ManualThis manual is aimed at helping cluster administrators install under-stand configure and manage the Hadoop capabilities of Bright ClusterManager The administrator is expected to be reasonably familiar withthe Bright Cluster Manager Administrator Manual

02 About The Manuals In GeneralRegularly updated versions of the Bright Cluster Manager70 manuals are available on updated clusters by default atcmshareddocscm The latest updates are always online athttpsupportbrightcomputingcommanuals

bull The Installation Manual describes installation procedures for a basiccluster

bull The Administrator Manual describes the general management of thecluster

bull The User Manual describes the user environment and how to submitjobs for the end user

bull The Cloudbursting Manual describes how to deploy the cloud capa-bilities of the cluster

bull The Developer Manual has useful information for developers whowould like to program with Bright Cluster Manager

bull The OpenStack Deployment Manual describes how to deploy Open-Stack with Bright Cluster Manager

bull The Hadoop Deployment Manual describes how to deploy Hadoopwith Bright Cluster Manager

bull The UCS Deployment Manual describes how to deploy the Cisco UCSserver with Bright Cluster Manager

If the manuals are downloaded and kept in one local directory then inmost pdf viewers clicking on a cross-reference in one manual that refersto a section in another manual opens and displays that section in the sec-ond manual Navigating back and forth between documents is usuallypossible with keystrokes or mouse clicks

For example ltAltgt-ltBackarrowgt in Acrobat Reader or clicking onthe bottom leftmost navigation button of xpdf both navigate back to theprevious document

iv Table of Contents

The manuals constantly evolve to keep up with the development ofthe Bright Cluster Manager environment and the addition of new hard-ware andor applications The manuals also regularly incorporate cus-tomer feedback Administrator and user input is greatly valued at BrightComputing So any comments suggestions or corrections will be verygratefully accepted at manualsbrightcomputingcom

03 Getting Administrator-Level SupportUnless the Bright Cluster Manager reseller offers support sup-port is provided by Bright Computing over e-mail via supportbrightcomputingcom Section 102 of the Administrator Manual hasmore details on working with support

1Introduction

11 What Is Hadoop AboutHadoop is the core implementation of a technology mostly used in theanalysis of data sets that are large and unstructured in comparison withthe data stored in regular relational databases The analysis of such datacalled data-mining can be done better with Hadoop than with relationaldatabases for certain types of parallelizable problems Hadoop has thefollowing characteristics in comparison with relational databases

1 Less structured input Key value pairs are used as records for thedata sets instead of a database

2 Scale-out rather than scale-up design For large data sets if thesize of a parallelizable problem increases linearly the correspond-ing cost of scaling up a single machine to solve it tends to grow ex-ponentially simply because the hardware requirements tend to getexponentially expensive If however the system that solves it is acluster then the corresponding cost tends to grow linearly becauseit can be solved by scaling out the cluster with a linear increase inthe number of processing nodes

Scaling out can be done with some effort for database problemsusing a parallel relational database implementation However scale-out is inherent in Hadoop and therefore often easier to implementwith Hadoop The Hadoop scale-out approach is based on the fol-lowing design

bull Clustered storage Instead of a single node with a speciallarge storage device a distributed filesystem (HDFS) usingcommodity hardware devices across many nodes stores thedata

bull Clustered processing Instead of using a single node with manyprocessors the parallel processing needs of the problem aredistributed out over many nodes The procedure is called theMapReduce algorithm and is based on the following approach

ndash The distribution process ldquomapsrdquo the initial state of the prob-lem into processes out to the nodes ready to be handled inparallel

ndash Processing tasks are carried out on the data at nodes them-selves

copy Bright Computing Inc

2 Introduction

ndash The results are ldquoreducedrdquo back to one result

3 Automated failure handling at application level for data Repli-cation of the data takes place across the DataNodes which are thenodes holding the data If a DataNode has failed then anothernode which has the replicated data on it is used instead automat-ically Hadoop switches over quickly in comparison to replicateddatabase clusters due to not having to check database table consis-tency

12 Available Hadoop ImplementationsBright Cluster Manager supports the Hadoop implementations providedby the following organizations

1 Apache (httpapacheorg) This is the upstream source forthe Hadoop core and some related components which all the otherimplementations use

2 Cloudera (httpwwwclouderacom) Cloudera providessome extra premium functionality and components on top of aHadoop suite One of the extra components that Cloudera providesis the Cloudera Management Suite a major proprietary manage-ment layer with some premium features

3 Hortonworks (httphortonworkscom) Hortonworks DataPlatform (HDP) is a fully open source Hadoop suite

4 Pivotal HD Pivotal Hadoop Distribution is a completely Apache-compliant distribution with extensive analytic toolsets

The ISO image for Bright Cluster Manager available at httpwwwbrightcomputingcomDownload can include Hadoop for all 4 im-plementations During installation from the ISO the administrator canchoose which implementation to install (section 3314 of the InstallationManual)

The contents and versions of the Hadoop implementations providedby Bright Computing at the time of writing (April 2015) are as follows

bull Apache packaged as cm-apache-hadoop by Bright ComputingThis provides

ndash Hadoop versions 121 220 and 260

ndash HBase version 09811 for Hadoop 121

ndash HBase version 100 for Hadoop 220 and Hadoop 260

ndash ZooKeeper version 346

bull Cloudera packaged as cm-cloudera-hadoop by Bright Comput-ing This provides

ndash CDH 471 (based on Apache Hadoop 20 HBase 09415 andZooKeeper 345)

ndash CDH 532 (based on Apache Hadoop 250 HBase 0986 andZooKeeper 345)

copy Bright Computing Inc

13 Further Documentation 3

bull Hortonworks packaged as cm-hortonworks-hadoop by BrightComputing This provides

ndash HDP 139 (based on Apache Hadoop 120 HBase 0946 andZooKeeper 345)

ndash HDP 220 (based on Apache Hadoop 260 HBase 0984 andZooKeeper 346)

PivotalPivotal HD version 210 is based on Apache Hadoop 220 It runs

fully integrated on Bright Cluster Manager but is not available as a pack-age in the Bright Cluster Manager repositories This is because Pivotalprefers to distribute the software itself The user can download the soft-ware from httpsnetworkpivotalioproductspivotal-hdafter registering with Pivotal

13 Further DocumentationFurther documentation is provided in the installed tarballs of the Hadoopversion after the Bright Cluster Manager installation (Chapter 2) has beencarried out A starting point is listed in the table below

Hadoop version Path under cmsharedappshadoop

Apache 121 Apache121docsindexhtml

Apache 220Apache220sharedochadoopindexhtml

Apache 260 Apache260sharedochadoopindexhtml

Cloudera CDH 471Cloudera200-cdh471sharedocindexhtml

Cloudera CDH 532Cloudera250-cdh532sharedocindexhtml

Hortonworks HDPOnline documentation is available at httpdocshortonworkscom

copy Bright Computing Inc

2Installing Hadoop

In Bright Cluster Manager a Hadoop instance can be configured andrun either via the command-line (section 21) or via an ncurses GUI (sec-tion 22) Both options can be carried out with the cm-hadoop-setupscript which is run from a head node The script is a part of thecluster-tools package and uses tarballs from the Apache Hadoopproject The Bright Cluster Manager With Hadoop installation ISO in-cludes the cm-apache-hadoop package which contains tarballs fromthe Apache Hadoop project suitable for cm-hadoop-setup

21 Command-line Installation Of Hadoop Usingcm-hadoop-setup -c ltfilenamegt

211 Usage[rootbright70 ~] cm-hadoop-setup -h

USAGE cmlocalappscluster-toolsbincm-hadoop-setup [-c ltfilenamegt

| -u ltnamegt | -h]

OPTIONS

-c ltfilenamegt -- Hadoop config file to use

-u ltnamegt -- uninstall Hadoop instance

-h -- show usage

EXAMPLES

cm-hadoop-setup -c tmpconfigxml

cm-hadoop-setup -u foo

cm-hadoop-setup (no options a gui will be started)

Some sample configuration files are provided in the directory

cmlocalappscluster-toolshadoopconf

hadoop1confxml (for Hadoop 1x)

hadoop2confxml (for Hadoop 2x)

hadoop2fedconfxml (for Hadoop 2x with NameNode federation)

hadoop2haconfxml (for Hadoop 2x with High Availability)

hadoop2lustreconfxml (for Hadoop 2x with Lustre support)

copy Bright Computing Inc

6 Installing Hadoop

212 An Install RunAn XML template can be used based on the examples in the directorycmlocalappscluster-toolshadoopconf

In the XML template the path for a tarball component is enclosed byltarchivegt ltarchivegt tag pairs The tarball components can be asindicated

bull ltarchivegthadoop tarballltarchivegt

bull ltarchivegthbase tarballltarchivegt

bull ltarchivegtzookeeper tarballltarchivegt

The tarball components can be picked up from URLs as listed in sec-tion 12 The paths of the tarball component files that are to be usedshould be set up as needed before running cm-hadoop-setup

The downloaded tarball components should be placed in the tmpdirectory if the default definitions in the default XML files are used

Example

[rootbright70 ~] cd cmlocalappscluster-toolshadoopconf

[rootbright70 conf] grep rsquoarchivegtrsquo hadoop1confxml | grep -o gz

tmphadoop-121targz

tmpzookeeper-346targz

tmphbase-09811-hadoop1-bintargz

Files under tmp are not intended to stay around permanently Theadministrator may therefore wish to place the tarball components in amore permanent part of the filesystem instead and change the XML def-initions accordingly

A Hadoop instance name for example Myhadoop can also be definedin the XML file within the ltnamegtltnamegt tag pair

Hadoop NameNodes and SecondaryNameNodes handle HDFS meta-data while DataNodes manage HDFS data The data must be stored inthe filesystem of the nodes The default path for where the data is storedcan be specified within the ltdatarootgtltdatarootgt tag pair Mul-tiple paths can also be set using comma-separated paths NameNodesSecondaryNameNodes and DataNodes each use the value or values setwithin the ltdatarootgtltdatarootgt tag pair for their root directories

If needed more specific tags can be used for each node type This isuseful in the case where hardware differs for the various node types Forexample

bull a NameNode with 2 disk drives for Hadoop use

bull a DataNode with 4 disk drives for Hadoop use

The XML file used by cm-hadoop-setup can in this case use the tagpairs

bull ltnamenodedatadirsgtltnamenodedatadirsgt

bull ltdatanodedatadirsgtltdatanodedatadirsgt

If these are not specified then the value within theltdatarootgtltdatarootgt tag pair is used

copy Bright Computing Inc

21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt 7

Example

bull ltnamenodedatadirsgtdata1data2ltnamenodedatadirsgt

bull ltdatanodedatadirsgtdata1data2data3data4ltdatanodedatadirsgt

Hadoop should then have the following dfsnamedir proper-ties added to it via the hdfs-sitexml configuration file For the pre-ceding tag pairs the property values should be set as follows

Example

bull dfsnamenodenamedir with valuesdata1hadoophdfsnamenodedata2hadoophdfsnamenode

bull dfsdatanodenamedir with valuesdata1hadoophdfsdatanodedata2hadoophdfsdatanodedata3hadoophdfsdatanodedata4hadoophdfsdatanode

An install run then displays output like the following

Example

-rw-r--r-- 1 root root 63851630 Feb 4 1513 hadoop-121targz

[rootbright70 ~] cm-hadoop-setup -c tmphadoop1confxml

Reading config from file rsquotmphadoop1confxmlrsquo done

Hadoop flavor rsquoApachersquo release rsquo121rsquo

Will now install Hadoop in cmsharedappshadoopApache121 and configure instance rsquoMyhadooprsquo

Hadoop distro being installed done

Zookeeper being installed done

HBase being installed done

Creating module file done

Configuring Hadoop instance on local filesystem and images done

Updating images

starting imageupdate for node rsquonode003rsquo started

starting imageupdate for node rsquonode002rsquo started

starting imageupdate for node rsquonode001rsquo started

starting imageupdate for node rsquonode004rsquo started

Waiting for imageupdate to finish done

Creating Hadoop instance in cmdaemon done

Formatting HDFS done

Waiting for datanodes to come up done

Setting up HDFS done

The Hadoop instance should now be running The name defined forit in the XML file should show up within cmgui in the Hadoop HDFSresource tree folder (figure 21)

copy Bright Computing Inc

8 Installing Hadoop

Figure 21 A Hadoop Instance In cmgui

The instance name is also displayed within cmshwhen the list com-mand is run in hadoop mode

Example

[rootbright70 ~] cmsh

[bright70] hadoop

[bright70-gthadoop] list

Name (key) Hadoop version Hadoop distribution Configuration directory

---------- -------------- ------------------- -----------------------

Myhadoop 121 Apache etchadoopMyhadoop

The instance can be removed as follows

Example

[rootbright70 ~] cm-hadoop-setup -u Myhadoop

Requested uninstall of Hadoop instance rsquoMyhadooprsquo

Uninstalling Hadoop instance rsquoMyhadooprsquo

Removing

etchadoopMyhadoop

varlibhadoopMyhadoop

varloghadoopMyhadoop

varrunhadoopMyhadoop

tmphadoopMyhadoop

etchadoopMyhadoopzookeeper

varlibzookeeperMyhadoop

varlogzookeeperMyhadoop

varrunzookeeperMyhadoop

etchadoopMyhadoophbase

varloghbaseMyhadoop

varrunhbaseMyhadoop

etcinitdhadoop-Myhadoop-Module file(s) deleted

Uninstall successfully completed

22 Ncurses Installation Of Hadoop Usingcm-hadoop-setup

Running cm-hadoop-setup without any options starts up an ncursesGUI (figure 22)

copy Bright Computing Inc

23 Avoiding Misconfigurations During Hadoop Installation 9

Figure 22 The cm-hadoop-setup Welcome Screen

This provides an interactive way to add and remove Hadoop instancesalong with HBase and Zookeeper components Some explanations of theitems being configured are given along the way In addition some minorvalidation checks are done and some options are restricted

The suggested default values will work Other values can be choseninstead of the defaults but some care in selection usually a good ideaThis is because Hadoop is a complex software which means that valuesother than the defaults can sometimes lead to unworkable configurations(section 23)

The ncurses installation results in an XML configuration file This filecan be used with the -c option of cm-hadoop-setup to carry out theinstallation

23 Avoiding Misconfigurations During HadoopInstallation

A misconfiguration can be defined as a configuration that works badly ornot at all

For Hadoop to work well the following common misconfigurationsshould normally be avoided Some of these result in warnings from BrightCluster Manager validation checks during configuration but can be over-ridden An override is useful for cases where the administrator wouldjust like to for example test some issues not related to scale or perfor-mance

231 NameNode Configuration ChoicesOne of the following NameNode configuration options must be chosenwhen Hadoop is installed The choice should be made with care becausechanging between the options after installation is not possible

Hadoop 1x NameNode Configuration Choicesbull NameNode can optionally have a SecondaryNameNode

SecondaryNameNode offloads metadata operations from Name-Node and also stores the metadata offline to some extent

It is not by any means a high availability solution While recovery

copy Bright Computing Inc

10 Installing Hadoop

from a failed head node is possible from SecondaryNameNode it isnot easy and it is not recommended or supported by Bright ClusterManager

Hadoop 2x NameNode Configuration Choicesbull NameNode and SecondaryNameNode can run as in Hadoop 1x

However the following configurations are also possible

bull NameNode HA with manual failover In this configuration Hadoophas NameNode1 and NameNode2 up at the same time with one ac-tive and one on standby Which NameNode is active and which ison standby is set manually by the administrator If one NameNodefails then failover must be executed manually Metadata changesare managed by ZooKeeper which relies on a quorum of Journal-Nodes The number of JournalNodes is therefore set to 3 5 7

bull NameNode HA with automatic failover As for the manual caseexcept that in this case ZooKeeper manages failover too WhichNameNode is active and which is on standby is therefore decidedautomatically

bull NameNode Federation In NameNode Fedaration the storage ofmetadata is split among several NameNodes each of which has acorresponding SecondaryNameNode Each pair takes care of a partof HDFS

In Bright Cluster Manager there are 4 NameNodes in a default Na-meNode federation

ndash user

ndash tmp

ndash staging

ndash hbase

User applications do not have to know this mapping This isbecause ViewFS on the client side maps the selected path tothe corresponding NameNode Thus for example hdfs -lstmpexample does not need to know that tmp is managed byanother NameNode

Cloudera advise against using NameNode Federation for produc-tion purposes at present due to its development status

24 Installing Hadoop With LustreThe Lustre filesystem has a client-server configuration

241 Lustre Internal Server InstallationThe procedure for installing a Lustre server varies

The Bright Cluster Manager Knowledge Base article describes aprocedure that uses Lustre sources from Whamcloud The article isat httpkbbrightcomputingcomfaqindexphpaction=artikelampcat=9ampid=176 and may be used as guidance

copy Bright Computing Inc

24 Installing Hadoop With Lustre 11

242 Lustre External Server InstallationLustre can also be configured so that the servers run external to BrightCluster Manager The Lustre Intel IEEL 2x version can be configured inthis manner

243 Lustre Client InstallationIt is preferred that the Lustre clients are installed on the head node aswell as on all the nodes that are to be Hadoop nodes The clients shouldbe configured to provide a Lustre mount on the nodes If the Lustre clientcannot be installed on the head node then Bright Cluster Manager hasthe following limitations during installation and maintenance

bull the head node cannot be used to run Hadoop services

bull end users cannot perform Hadoop operations such as job submis-sion on the head node Operations such as those should instead becarried out while logged in to one of the Hadoop nodes

In the remainder of this section a Lustre mount point of mntlustreis assumed but it can be set to any convenient directory mount point

The user IDs and group IDs of the Lustre server and clients shouldbe consistent It is quite likely that they differ when first set up The IDsshould be checked at least for the following users and groups

bull users hdfs mapred yarn hbase zookeeper

bull groups hadoop zookeeper hbase

If they do not match on the server and clients then they must be madeconsistent manually so that the UID and GID of the Lustre server usersare changed to match the UID and GID of the Bright Cluster Managerusers

Once consistency has been checked and readwrite access is workingto LustreFS the Hadoop integration can be configured

244 Lustre Hadoop ConfigurationLustre Hadoop XML Configuration File SetupHadoop integration can be configured by using the file cmlocalappscluster-toolshadoopconfhadoop2lustreconfxmlas a starting point for the configuration It can be copied over to forexample roothadooplustreconfigxml

The Intel Distribution for Hadoop (IDH) and Cloudera can both runwith Lustre under Bright Cluster Manager The configuration for thesecan be done as follows

bull IDH

ndash A subdirectory of mntlustre must be specified in thehadoop2lustreconfxml file within the ltafsgtltafsgt tagpair

Example

ltafsgt

ltfstypegtlustreltfstypegt

ltfsrootgtmntlustrehadoopltfsrootgt

copy Bright Computing Inc

12 Installing Hadoop

ltafsgt

bull Cloudera

ndash A subdirectory of mntlustre must be specified in thehadoop2lustreconfxml file within the ltafsgtltafsgt tagpair

ndash In addition an ltfsjargtltfsjargt tag pair must be specifiedmanually for the jar that the Intel IEEL 2x distribution pro-vides

Example

ltafsgt

ltfstypegtlustreltfstypegt

ltfsrootgtmntlustrehadoopltfsrootgt

ltfsjargtrootlustrehadoop-lustre-plugin-230jarltfsjargt

ltafsgt

The installation of the Lustre plugin is automatic if this jarname is set to the right name when the cm-hadoop-setupscript is run

Lustre Hadoop Installation With cm-hadoop-setupThe XML configuration file specifies how Lustre should be integrated inHadoop If the configuration file is at ltroothadooplustreconfigxmlgt then it can be run as

Example

cm-hadoop-setup -c ltroothadooplustreconfigxmlgt

As part of configuring Hadoop to use Lustre the execution will

bull Set the ACLs on the directory specified within the ltfsrootgtltfsrootgttag pair This was set to mntlustrehadoop earlier on as an ex-ample

bull Copy the Lustre plugin from its jar path as specified in the XML fileto the correct place on the client nodes

Specifically the subdirectory sharehadoopcommonlib iscopied into a directory relative to the Hadoop installation direc-tory For example the Cloudera version of Hadoop version 230-cdh512 has the Hadoop installation directory cmshareappshadoopClouder230-cdh512 The copy is therefore car-ried out in this case fromrootlustrehadoop-lustre-plugin-230jartocmsharedappshadoopCloudera230-cdh512sharehadoopcommonlib

copy Bright Computing Inc

25 Hadoop Installation In A Cloud 13

Lustre Hadoop Integration In cmsh and cmgui

In cmsh Lustre integration is indicated in hadoop mode

Example

[hadoop2-gthadoop] show hdfs1 | grep -i lustre

Hadoop root for Lustre mntlustrehadoop

Use Lustre yes

In cmgui the Overview tab in the items for the Hadoop HDFS re-source indicates if Lustre is running along with its mount point (fig-ure 23)

Figure 23 A Hadoop Instance With Lustre In cmgui

25 Hadoop Installation In A CloudHadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual) or a Clus-ter Extension configuration (Chapter 3 of the Cloudbursting Manual) Inboth cases the cloud nodes used should be at least m1medium

bull For Cluster-On-Demand the following considerations apply

ndash There are no specific issues After a stopstart cycle Hadooprecognizes the new IP addresses and refreshes the list of nodesaccordingly (section 241 of the Cloudbursting Manual)

bull For Cluster Extension the following considerations apply

ndash To install Hadoop on cloud nodes the XML configurationcmlocalappscluster-toolshadoopconfhadoop2clusterextensionconfxmlcan be used as a guide

copy Bright Computing Inc

14 Installing Hadoop

ndash In the hadoop2clusterextensionconfxml file the clouddirector that is to be used with the Hadoop cloud nodes mustbe specified by the administrator with the ltedgegtltedgegttag pair

Example

ltedgegt

lthostsgteu-west-1-directorlthostsgt

ltedgegt

Maintenance operations such as a format will automaticallyand transparently be carried out by cmdaemon running on thecloud director and not on the head node

There are some shortcomings as a result of relying upon the clouddirector

ndash Cloud nodes depend on the same cloud director

ndash While Hadoop installation (cm-hadoop-setup) is run on thehead node users must run Hadoop commandsmdashjob submis-sions and so onmdashfrom the director not from the head node

ndash It is not possible to mix cloud and non-cloud nodes for thesame Hadoop instance That is a local Hadoop instance can-not be extended by adding cloud nodes

copy Bright Computing Inc

3Viewing And Managing HDFS

From CMDaemon31 cmgui And The HDFS ResourceIn cmgui clicking on the resource folder for Hadoop in the resource treeopens up an Overview tab that displays a list of Hadoop HDFS instancesas in figure 21

Clicking on an instance brings up a tab from a series of tabs associatedwith that instance as in figure 31

Figure 31 Tabs For A Hadoop Instance In cmgui

The possible tabs are Overview (section 311) Settings (section 312)Tasks (section 313) and Notes (section 314) Various sub-panes aredisplayed in the possible tabs

copy Bright Computing Inc

16 Viewing And Managing HDFS From CMDaemon

311 The HDFS Instance Overview TabThe Overview tab pane (figure 32) arranges Hadoop-related items insub-panes for convenient viewing

Figure 32 Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed

bull Statistics In the first sub-pane row are statistics associated with theHadoop instance These include node data on the number of appli-cations that are running as well as memory and capacity usage

bull Metrics In the second sub-pane row a metric can be selected fordisplay as a graph

bull Roles In the third sub-pane row the roles associated with a nodeare displayed

312 The HDFS Instance Settings TabThe Settings tab pane (figure 33) displays the configuration of theHadoop instance

copy Bright Computing Inc

31 cmgui And The HDFS Resource 17

Figure 33 Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed

bull WebHDFS Allows HDFS to be modified via the web

bull Description An arbitrary user-defined description

bull Directory settings for the Hadoop instance Directories for config-uration temporary purposes and logging

bull Topology awareness setting Optimizes Hadoop for racks switchesor nothing when it comes to network topology

bull Replication factors Set the default and maximum replication al-lowed

bull Balancing settings Spread loads evenly across the instance

313 The HDFS Instance Tasks TabThe Tasks tab (figure 34) displays Hadoop-related tasks that the admin-istrator can execute

copy Bright Computing Inc

18 Viewing And Managing HDFS From CMDaemon

Figure 34 Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to

bull Restart start or stop all the Hadoop daemons

bull StartStop the Hadoop balancer

bull Format the Hadoop filesystem A warning is given before format-ting

bull Carry out a manual failover for the NameNode This feature meansmaking the other NameNode the active one and is available onlyfor Hadoop instances that support NameNode failover

314 The HDFS Instance Notes TabThe Notes tab is like a jotting pad and can be used by the administratorto write notes for the associated Hadoop instance

32 cmsh And hadoop ModeThe cmsh front end can be used instead of cmgui to display informationon Hadoop-related values and to carry out Hadoop-related tasks

321 The show And overview CommandsThe show CommandWithin hadoop mode the show command displays parameters that cor-respond mostly to cmguirsquos Settings tab in the Hadoop resource (sec-tion 312)

Example

[rootbright70 conf] cmsh

[bright70] hadoop

[bright70-gthadoop] list

Name (key) Hadoop version Hadoop distribution Configuration directory

---------- -------------- ------------------- -----------------------

Apache121 121 Apache etchadoopApache121

[bright70-gthadoop] show apache121

Parameter Value

copy Bright Computing Inc

32 cmsh And hadoop Mode 19

------------------------------------- ------------------------------------

Automatic failover Disabled

Balancer Threshold 10

Balancer period 2

Balancing policy dataNode

Cluster ID

Configuration directory etchadoopApache121

Configuration directory for HBase etchadoopApache121hbase

Configuration directory for ZooKeeper etchadoopApache121zookeeper

Creation time Fri 07 Feb 2014 170301 CET

Default replication factor 3

Enable HA for YARN no

Enable NFS gateway no

Enable Web HDFS no

HA enabled no

HA name service

HDFS Block size 67108864

HDFS Permission no

HDFS Umask 077

HDFS log directory varloghadoopApache121

Hadoop distribution Apache

Hadoop root

Hadoop version 121

Installation directory for HBase cmsharedappshadoopApachehbase-0+

Installation directory for ZooKeeper cmsharedappshadoopApachezookeepe+

Installation directory cmsharedappshadoopApache121

Maximum replication factor 512

Network

Revision

Root directory for data varlib

Temporary directory tmphadoopApache121

Topology Switch

Use HTTPS no

Use Lustre no

Use federation no

Use only HTTPS no

YARN automatic failover Hadoop

description installed from tmphadoop-121tar+

name Apache121

notes

The overview CommandSimilarly the overview command corresponds somewhat to thecmguirsquos Overview tab in the Hadoop resource (section 311) and pro-vides hadoop-related information on the system resources that are used

Example

[bright70-gthadoop] overview apache121

Parameter Value

----------------------------- ---------------------------

Name apache121

Capacity total 0B

Capacity used 0B

Capacity remaining 0B

Heap memory total 0B

copy Bright Computing Inc

20 Viewing And Managing HDFS From CMDaemon

Heap memory used 0B

Heap memory remaining 0B

Non-heap memory total 0B

Non-heap memory used 0B

Non-heap memory remaining 0B

Nodes available 0

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 0

Total blocks 0

Missing blocks 0

Under-replicated blocks 0

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 0

Applications running 0

Applications pending 0

Applications submitted 0

Applications completed 0

Applications failed 0

Federation setup no

Role Node

------------------------------ ---------------------------

DataNode ZooKeeper bright70node001node002

322 The Tasks CommandsWithin hadoopmode the following commands run tasks that correspondmostly to the Tasks tab (section 313) in the cmgui Hadoop resource

The services Commands For Hadoop ServicesHadoop services can be started stopped and restarted with

bull restartallservices

bull startallservices

bull stopallservices

Example

[bright70-gthadoop] restartallservices apache121

Will now stop all Hadoop services for instance rsquoapache121rsquo done

Will now start all Hadoop services for instance rsquoapache121rsquo done

[bright70-gthadoop]

The startbalancer And stopbalancer Commands For HadoopFor efficient access to HDFS file block level usage across nodes shouldbe reasonably balanced Hadoop can start and stop balancing across theinstance with the following commands

• startbalancer
• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:


• balancerperiod
• balancerthreshold
• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp  Iteration  Bytes Moved  Bytes To Move  Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%
Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121. For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
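No example run for this command is shown in the manual. As a minimal sketch, assuming an HA-enabled instance named hdfs1 whose NameNodes run on node001 and node002 (hypothetical names), the active status could be moved with:

[bright70->hadoop]% manualfailover --from node001 --to node002 hdfs1

Since both options are optional in the usage line, specifying only the source or only the destination NameNode should also be possible.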


4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node
• Data Node, Zookeeper
• Job Tracker, Task Tracker
• YARN Server, YARN Client
• HBase Server, HBase Client
• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:


• Selecting a Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui

The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means that acting on multiple service selections is possible.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.


5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220
================================================================
Press [CTRL+C] to stop
================================================================
================================================================
start cleaning directories...
================================================================
================================================================
clean directories done
================================================================
================================================================
start doing gen_test...
================================================================
14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2
Job Counters
Map-Reduce Framework
org.apache.hadoop.examples.terasort.TeraGen$Counters
14/03/24 15:07:03 INFO terasort.TeraSort: starting
14/03/24 15:09:12 INFO terasort.TeraSort: done
================================================================================
gen_test done
================================================================================
================================================================================
start doing PI test...
================================================================================
Working Directory = /user/root/bbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                       Value
------------------------------- ---------------------------------
Name                            Apache220
Capacity total                  27.56GB
Capacity used                   72.46MB
Capacity remaining              16.41GB
Heap memory total               280.7MB
Heap memory used                152.1MB
Heap memory remaining           128.7MB
Non-heap memory total           258.1MB
Non-heap memory used            251.9MB
Non-heap memory remaining       6.155MB
Nodes available                 3
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     72
Total blocks                    31
Missing blocks                  0
Under-replicated blocks         2
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       59666
Applications running            1
Applications pending            0
Applications submitted          7
Applications completed          6
Applications failed             0
High availability               Yes (automatic failover disabled)
Federation setup                no

Role                                                           Node
-------------------------------------------------------------- -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper             node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.
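As a sketch based on the preceding description (the exact unset syntax is not shown in the manual), the access might be cleared from cmsh with:

[bright70->user[fred]]% set hadoophdfsaccess ""; commit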

The user fred can then submit a run from a pi value estimator from the example jar file as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache/2.2.0
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
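For example, a longer run using more maps and samples (the values here are illustrative only, and are not from the original example):

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 100000

Increasing the number of samples improves the estimate, at the cost of a longer run time.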


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)
• Hive (section 6.2)
• Kafka (section 6.3)
• Pig (section 6.4)
• Spark (section 6.5)
• Sqoop (section 6.6)
• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables
• Garbage Collector, to delete files no longer needed
• Master, responsible for coordination
• Tracer, which collects traces about Accumulo operations
• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation, And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.
• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.
• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.
• The script assigns no roles to nodes.
• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).
• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.
• The secret string for the instance is a random string created by cm-accumulo-setup.
• A password for the root user must be specified.
• The Tracer service will use the Accumulo user root to connect to Accumulo.
• The services for Garbage Collector, Master, Tracer, and Monitor are, by default, installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.
• A Tablet Server will be started on each DataNode.
• cm-accumulo-setup tries to build the native map library.
• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option:
-p <rootpass>
is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option:
-s <heapsize>
is not mandatory. If not set, a default value of 1GB is used.

The option:
--master <nodename>
is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run as the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client
• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation, And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.
• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:
  – a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/
  – cm-hive-setup is run with the --conn option to specify the connector version to use.

Example:

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly, by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%' IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;
DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:
  – <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.
  – <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.
  – The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.
• The script assigns no roles to nodes.
• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Hive configuration files are copied by the script to under /etc/hadoop/.
• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.
• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.
• The Metastore and HiveServer2 services are started up by the script.
• Validation tests are carried out by the script using hive and beeline.

An Example Run With cm-hive-setup
Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client...
-- testing 'beeline' client...
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 -d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation, And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.
• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.
• The script assigns no roles to nodes.
• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.
• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
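This manual has no usage section for Kafka. As a hedged smoke test after installation, Kafka's own console scripts can be used to create and list a topic. The sketch below assumes that the setup script created a module named kafka/hdfs1, following the naming pattern of the other projects in this chapter, and that a ZooKeeper server for the instance runs on node001 at the default port 2181; both are assumptions, not statements from the manual:

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 --replication-factor 1 --partitions 1 --topic smoketest
[root@bright70 ~]# kafka-topics.sh --list --zookeeper node001:2181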

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation, And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.
• cm-pig-setup installs Pig only on the active head node.
• The script assigns no roles to nodes.
• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.
• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
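The contents of /tmp/smoke.pig are not shown in this manual. As an illustrative sketch only, a minimal word-count script in Pig Latin, with a hypothetical input path, could look like:

-- load a text file from HDFS, one line per record (hypothetical path)
lines = LOAD '/user/root/input.txt' AS (line:chararray);
-- split each line into words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- group identical words together and count each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;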

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming
• SQL queries on structured distributed data, with Spark SQL
• processing with machine learning algorithms, using MLlib
• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation, And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.
• Spark can be installed in two different deployment modes: Standalone or YARN.
  – Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.
    It is possible to force the Standalone mode deployment by using the additional option: --standalone
    When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.
    The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.
  – YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.
    When installing in YARN mode, the script installs Spark only on the active head node.
• The script assigns no roles to nodes.
• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.
• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup In Standalone Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode.
Spark Master service will be run on node node005.
Spark will use all DataNodes as WorkerNodes.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark
Spark supports two deploy modes for launching Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, is not feature-complete, and is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation, And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.
• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:
  – a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/
  – cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.

Example:

--conn /tmp/mysql-connector-java-5.1.34-bin.jar
• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.
• The script assigns no roles to nodes.
• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.
• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup
Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
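No usage example for Sqoop is given in this manual. As a hedged sketch, a bulk import of a MySQL table into HDFS might look like the following; the module name sqoop/hdfs1, the database testdb, the user fred, and the table mytable are all hypothetical:

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://node005/testdb --username fred -P --table mytable --target-dir /user/fred/mytable -m 1

The -P option prompts for the database password, and -m 1 runs the import with a single map task.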


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm
• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"
• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"
• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation, And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.
• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.
• The script assigns no roles to nodes.
• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).
• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar storm.starter.WordCountTopology WordCount2
470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2
[root@hadoopdev ~]# storm list
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15


Preface

Welcome to the Hadoop Deployment Manual for Bright Cluster Manager70

01 About This ManualThis manual is aimed at helping cluster administrators install under-stand configure and manage the Hadoop capabilities of Bright ClusterManager The administrator is expected to be reasonably familiar withthe Bright Cluster Manager Administrator Manual

02 About The Manuals In GeneralRegularly updated versions of the Bright Cluster Manager70 manuals are available on updated clusters by default atcmshareddocscm The latest updates are always online athttpsupportbrightcomputingcommanuals

bull The Installation Manual describes installation procedures for a basiccluster

bull The Administrator Manual describes the general management of thecluster

bull The User Manual describes the user environment and how to submitjobs for the end user

bull The Cloudbursting Manual describes how to deploy the cloud capa-bilities of the cluster

bull The Developer Manual has useful information for developers whowould like to program with Bright Cluster Manager

bull The OpenStack Deployment Manual describes how to deploy Open-Stack with Bright Cluster Manager

bull The Hadoop Deployment Manual describes how to deploy Hadoopwith Bright Cluster Manager

bull The UCS Deployment Manual describes how to deploy the Cisco UCSserver with Bright Cluster Manager

If the manuals are downloaded and kept in one local directory then inmost pdf viewers clicking on a cross-reference in one manual that refersto a section in another manual opens and displays that section in the sec-ond manual Navigating back and forth between documents is usuallypossible with keystrokes or mouse clicks

For example ltAltgt-ltBackarrowgt in Acrobat Reader or clicking onthe bottom leftmost navigation button of xpdf both navigate back to theprevious document

iv Table of Contents

The manuals constantly evolve to keep up with the development ofthe Bright Cluster Manager environment and the addition of new hard-ware andor applications The manuals also regularly incorporate cus-tomer feedback Administrator and user input is greatly valued at BrightComputing So any comments suggestions or corrections will be verygratefully accepted at manualsbrightcomputingcom

03 Getting Administrator-Level SupportUnless the Bright Cluster Manager reseller offers support sup-port is provided by Bright Computing over e-mail via supportbrightcomputingcom Section 102 of the Administrator Manual hasmore details on working with support

1Introduction

11 What Is Hadoop AboutHadoop is the core implementation of a technology mostly used in theanalysis of data sets that are large and unstructured in comparison withthe data stored in regular relational databases The analysis of such datacalled data-mining can be done better with Hadoop than with relationaldatabases for certain types of parallelizable problems Hadoop has thefollowing characteristics in comparison with relational databases

1 Less structured input Key value pairs are used as records for thedata sets instead of a database

2 Scale-out rather than scale-up design For large data sets if thesize of a parallelizable problem increases linearly the correspond-ing cost of scaling up a single machine to solve it tends to grow ex-ponentially simply because the hardware requirements tend to getexponentially expensive If however the system that solves it is acluster then the corresponding cost tends to grow linearly becauseit can be solved by scaling out the cluster with a linear increase inthe number of processing nodes

Scaling out can be done with some effort for database problemsusing a parallel relational database implementation However scale-out is inherent in Hadoop and therefore often easier to implementwith Hadoop The Hadoop scale-out approach is based on the fol-lowing design

bull Clustered storage Instead of a single node with a speciallarge storage device a distributed filesystem (HDFS) usingcommodity hardware devices across many nodes stores thedata

bull Clustered processing Instead of using a single node with manyprocessors the parallel processing needs of the problem aredistributed out over many nodes The procedure is called theMapReduce algorithm and is based on the following approach

ndash The distribution process ldquomapsrdquo the initial state of the prob-lem into processes out to the nodes ready to be handled inparallel

ndash Processing tasks are carried out on the data at nodes them-selves

copy Bright Computing Inc

2 Introduction

ndash The results are ldquoreducedrdquo back to one result

3 Automated failure handling at application level for data Repli-cation of the data takes place across the DataNodes which are thenodes holding the data If a DataNode has failed then anothernode which has the replicated data on it is used instead automat-ically Hadoop switches over quickly in comparison to replicateddatabase clusters due to not having to check database table consis-tency

12 Available Hadoop ImplementationsBright Cluster Manager supports the Hadoop implementations providedby the following organizations

1 Apache (httpapacheorg) This is the upstream source forthe Hadoop core and some related components which all the otherimplementations use

2 Cloudera (httpwwwclouderacom) Cloudera providessome extra premium functionality and components on top of aHadoop suite One of the extra components that Cloudera providesis the Cloudera Management Suite a major proprietary manage-ment layer with some premium features

3 Hortonworks (httphortonworkscom) Hortonworks DataPlatform (HDP) is a fully open source Hadoop suite

4 Pivotal HD Pivotal Hadoop Distribution is a completely Apache-compliant distribution with extensive analytic toolsets

The ISO image for Bright Cluster Manager available at httpwwwbrightcomputingcomDownload can include Hadoop for all 4 im-plementations During installation from the ISO the administrator canchoose which implementation to install (section 3314 of the InstallationManual)

The contents and versions of the Hadoop implementations providedby Bright Computing at the time of writing (April 2015) are as follows

bull Apache packaged as cm-apache-hadoop by Bright ComputingThis provides

ndash Hadoop versions 121 220 and 260

ndash HBase version 09811 for Hadoop 121

ndash HBase version 100 for Hadoop 220 and Hadoop 260

ndash ZooKeeper version 346

bull Cloudera packaged as cm-cloudera-hadoop by Bright Comput-ing This provides

ndash CDH 471 (based on Apache Hadoop 20 HBase 09415 andZooKeeper 345)

ndash CDH 532 (based on Apache Hadoop 250 HBase 0986 andZooKeeper 345)

copy Bright Computing Inc

13 Further Documentation 3

bull Hortonworks packaged as cm-hortonworks-hadoop by BrightComputing This provides

ndash HDP 139 (based on Apache Hadoop 120 HBase 0946 andZooKeeper 345)

ndash HDP 220 (based on Apache Hadoop 260 HBase 0984 andZooKeeper 346)

PivotalPivotal HD version 210 is based on Apache Hadoop 220 It runs

fully integrated on Bright Cluster Manager but is not available as a pack-age in the Bright Cluster Manager repositories This is because Pivotalprefers to distribute the software itself The user can download the soft-ware from httpsnetworkpivotalioproductspivotal-hdafter registering with Pivotal

13 Further DocumentationFurther documentation is provided in the installed tarballs of the Hadoopversion after the Bright Cluster Manager installation (Chapter 2) has beencarried out A starting point is listed in the table below

Hadoop version Path under cmsharedappshadoop

Apache 121 Apache121docsindexhtml

Apache 220Apache220sharedochadoopindexhtml

Apache 260 Apache260sharedochadoopindexhtml

Cloudera CDH 471Cloudera200-cdh471sharedocindexhtml

Cloudera CDH 532Cloudera250-cdh532sharedocindexhtml

Hortonworks HDPOnline documentation is available at httpdocshortonworkscom

copy Bright Computing Inc

2Installing Hadoop

In Bright Cluster Manager a Hadoop instance can be configured andrun either via the command-line (section 21) or via an ncurses GUI (sec-tion 22) Both options can be carried out with the cm-hadoop-setupscript which is run from a head node The script is a part of thecluster-tools package and uses tarballs from the Apache Hadoopproject The Bright Cluster Manager With Hadoop installation ISO in-cludes the cm-apache-hadoop package which contains tarballs fromthe Apache Hadoop project suitable for cm-hadoop-setup

21 Command-line Installation Of Hadoop Usingcm-hadoop-setup -c ltfilenamegt

211 Usage[rootbright70 ~] cm-hadoop-setup -h

USAGE cmlocalappscluster-toolsbincm-hadoop-setup [-c ltfilenamegt

| -u ltnamegt | -h]

OPTIONS

-c ltfilenamegt -- Hadoop config file to use

-u ltnamegt -- uninstall Hadoop instance

-h -- show usage

EXAMPLES

cm-hadoop-setup -c tmpconfigxml

cm-hadoop-setup -u foo

cm-hadoop-setup (no options a gui will be started)

Some sample configuration files are provided in the directory

cmlocalappscluster-toolshadoopconf

hadoop1confxml (for Hadoop 1x)

hadoop2confxml (for Hadoop 2x)

hadoop2fedconfxml (for Hadoop 2x with NameNode federation)

hadoop2haconfxml (for Hadoop 2x with High Availability)

hadoop2lustreconfxml (for Hadoop 2x with Lustre support)

copy Bright Computing Inc

6 Installing Hadoop

212 An Install RunAn XML template can be used based on the examples in the directorycmlocalappscluster-toolshadoopconf

In the XML template the path for a tarball component is enclosed byltarchivegt ltarchivegt tag pairs The tarball components can be asindicated

bull ltarchivegthadoop tarballltarchivegt

bull ltarchivegthbase tarballltarchivegt

bull ltarchivegtzookeeper tarballltarchivegt

The tarball components can be picked up from URLs as listed in sec-tion 12 The paths of the tarball component files that are to be usedshould be set up as needed before running cm-hadoop-setup

The downloaded tarball components should be placed in the tmpdirectory if the default definitions in the default XML files are used

Example

[rootbright70 ~] cd cmlocalappscluster-toolshadoopconf

[rootbright70 conf] grep rsquoarchivegtrsquo hadoop1confxml | grep -o gz

tmphadoop-121targz

tmpzookeeper-346targz

tmphbase-09811-hadoop1-bintargz

Files under tmp are not intended to stay around permanently Theadministrator may therefore wish to place the tarball components in amore permanent part of the filesystem instead and change the XML def-initions accordingly

A Hadoop instance name for example Myhadoop can also be definedin the XML file within the ltnamegtltnamegt tag pair

Hadoop NameNodes and SecondaryNameNodes handle HDFS meta-data while DataNodes manage HDFS data The data must be stored inthe filesystem of the nodes The default path for where the data is storedcan be specified within the ltdatarootgtltdatarootgt tag pair Mul-tiple paths can also be set using comma-separated paths NameNodesSecondaryNameNodes and DataNodes each use the value or values setwithin the ltdatarootgtltdatarootgt tag pair for their root directories

If needed more specific tags can be used for each node type This isuseful in the case where hardware differs for the various node types Forexample

bull a NameNode with 2 disk drives for Hadoop use

bull a DataNode with 4 disk drives for Hadoop use

The XML file used by cm-hadoop-setup can in this case use the tagpairs

bull ltnamenodedatadirsgtltnamenodedatadirsgt

bull ltdatanodedatadirsgtltdatanodedatadirsgt

If these are not specified then the value within theltdatarootgtltdatarootgt tag pair is used

copy Bright Computing Inc

21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt 7

Example

bull ltnamenodedatadirsgtdata1data2ltnamenodedatadirsgt

bull ltdatanodedatadirsgtdata1data2data3data4ltdatanodedatadirsgt

Hadoop should then have the following dfsnamedir proper-ties added to it via the hdfs-sitexml configuration file For the pre-ceding tag pairs the property values should be set as follows

Example

bull dfsnamenodenamedir with valuesdata1hadoophdfsnamenodedata2hadoophdfsnamenode

bull dfsdatanodenamedir with valuesdata1hadoophdfsdatanodedata2hadoophdfsdatanodedata3hadoophdfsdatanodedata4hadoophdfsdatanode

An install run then displays output like the following

Example

-rw-r--r-- 1 root root 63851630 Feb 4 1513 hadoop-121targz

[rootbright70 ~] cm-hadoop-setup -c tmphadoop1confxml

Reading config from file rsquotmphadoop1confxmlrsquo done

Hadoop flavor rsquoApachersquo release rsquo121rsquo

Will now install Hadoop in cmsharedappshadoopApache121 and configure instance rsquoMyhadooprsquo

Hadoop distro being installed done

Zookeeper being installed done

HBase being installed done

Creating module file done

Configuring Hadoop instance on local filesystem and images done

Updating images

starting imageupdate for node rsquonode003rsquo started

starting imageupdate for node rsquonode002rsquo started

starting imageupdate for node rsquonode001rsquo started

starting imageupdate for node rsquonode004rsquo started

Waiting for imageupdate to finish done

Creating Hadoop instance in cmdaemon done

Formatting HDFS done

Waiting for datanodes to come up done

Setting up HDFS done

The Hadoop instance should now be running The name defined forit in the XML file should show up within cmgui in the Hadoop HDFSresource tree folder (figure 21)

copy Bright Computing Inc

8 Installing Hadoop

Figure 21 A Hadoop Instance In cmgui

The instance name is also displayed within cmshwhen the list com-mand is run in hadoop mode

Example

[root@bright70 ~]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Myhadoop    1.2.1           Apache               /etc/hadoop/Myhadoop

The instance can be removed as follows:

Example

[root@bright70 ~]# cm-hadoop-setup -u Myhadoop
Requested uninstall of Hadoop instance 'Myhadoop'.
Uninstalling Hadoop instance 'Myhadoop'.
Removing:
/etc/hadoop/Myhadoop
/var/lib/hadoop/Myhadoop
/var/log/hadoop/Myhadoop
/var/run/hadoop/Myhadoop
/tmp/hadoop/Myhadoop
/etc/hadoop/Myhadoop/zookeeper
/var/lib/zookeeper/Myhadoop
/var/log/zookeeper/Myhadoop
/var/run/zookeeper/Myhadoop
/etc/hadoop/Myhadoop/hbase
/var/log/hbase/Myhadoop
/var/run/hbase/Myhadoop
/etc/init.d/hadoop-Myhadoop-*
Module file(s) deleted.
Uninstall successfully completed.

2.2 Ncurses Installation Of Hadoop Using cm-hadoop-setup

Running cm-hadoop-setup without any options starts up an ncurses GUI (figure 2.2).


Figure 2.2: The cm-hadoop-setup Welcome Screen

This provides an interactive way to add and remove Hadoop instances, along with HBase and Zookeeper components. Some explanations of the items being configured are given along the way. In addition, some minor validation checks are done, and some options are restricted.

The suggested default values will work. Other values can be chosen instead of the defaults, but some care in selection is usually a good idea. This is because Hadoop is complex software, which means that values other than the defaults can sometimes lead to unworkable configurations (section 2.3).

The ncurses installation results in an XML configuration file. This file can be used with the -c option of cm-hadoop-setup to carry out the installation.

2.3 Avoiding Misconfigurations During Hadoop Installation

A misconfiguration can be defined as a configuration that works badly or not at all.

For Hadoop to work well, the following common misconfigurations should normally be avoided. Some of these result in warnings from Bright Cluster Manager validation checks during configuration, but can be overridden. An override is useful for cases where the administrator would just like to, for example, test some issues not related to scale or performance.

2.3.1 NameNode Configuration Choices
One of the following NameNode configuration options must be chosen when Hadoop is installed. The choice should be made with care, because changing between the options after installation is not possible.

Hadoop 1.x NameNode Configuration Choices

• NameNode can optionally have a SecondaryNameNode.

SecondaryNameNode offloads metadata operations from NameNode, and also stores the metadata offline to some extent.

It is not by any means a high availability solution. While recovery from a failed head node is possible from SecondaryNameNode, it is not easy, and it is not recommended or supported by Bright Cluster Manager.

Hadoop 2.x NameNode Configuration Choices

• NameNode and SecondaryNameNode can run as in Hadoop 1.x.

However, the following configurations are also possible:

• NameNode HA with manual failover: In this configuration, Hadoop has NameNode1 and NameNode2 up at the same time, with one active and one on standby. Which NameNode is active, and which is on standby, is set manually by the administrator. If one NameNode fails, then failover must be executed manually. Metadata changes are managed by ZooKeeper, which relies on a quorum of JournalNodes. The number of JournalNodes is therefore set to 3, 5, 7, and so on.

• NameNode HA with automatic failover: As for the manual case, except that in this case ZooKeeper manages failover too. Which NameNode is active, and which is on standby, is therefore decided automatically.

• NameNode Federation: In NameNode Federation, the storage of metadata is split among several NameNodes, each of which has a corresponding SecondaryNameNode. Each pair takes care of a part of HDFS.

In Bright Cluster Manager there are 4 NameNodes in a default NameNode federation:

– /user

– /tmp

– /staging

– /hbase

User applications do not have to know this mapping. This is because ViewFS on the client side maps the selected path to the corresponding NameNode. Thus, for example, hdfs -ls /tmp/example does not need to know that /tmp is managed by another NameNode.

Cloudera advises against using NameNode Federation for production purposes at present, due to its development status.

2.4 Installing Hadoop With Lustre
The Lustre filesystem has a client-server configuration.

2.4.1 Lustre Internal Server Installation
The procedure for installing a Lustre server varies.

The Bright Cluster Manager Knowledge Base article describes a procedure that uses Lustre sources from Whamcloud. The article is at http://kb.brightcomputing.com/faq/index.php?action=artikel&cat=9&id=176, and may be used as guidance.


2.4.2 Lustre External Server Installation
Lustre can also be configured so that the servers run external to Bright Cluster Manager. The Lustre Intel IEEL 2.x version can be configured in this manner.

2.4.3 Lustre Client Installation
It is preferred that the Lustre clients are installed on the head node, as well as on all the nodes that are to be Hadoop nodes. The clients should be configured to provide a Lustre mount on the nodes. If the Lustre client cannot be installed on the head node, then Bright Cluster Manager has the following limitations during installation and maintenance:

• the head node cannot be used to run Hadoop services

• end users cannot perform Hadoop operations, such as job submission, on the head node. Operations such as those should instead be carried out while logged in to one of the Hadoop nodes.

In the remainder of this section, a Lustre mount point of /mnt/lustre is assumed, but it can be set to any convenient directory mount point.

The user IDs and group IDs of the Lustre server and clients should be consistent. It is quite likely that they differ when first set up. The IDs should be checked at least for the following users and groups:

• users: hdfs, mapred, yarn, hbase, zookeeper

• groups: hadoop, zookeeper, hbase

If they do not match on the server and clients, then they must be made consistent manually, so that the UID and GID of the Lustre server users are changed to match the UID and GID of the Bright Cluster Manager users.
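A quick way to compare the IDs on both sides is to run id for each of these users, and getent group for each of the groups, on a client and on the Lustre server, and then compare the outputs. A sketch:

Example

[root@bright70 ~]# for u in hdfs mapred yarn hbase zookeeper; do id $u; done
[root@bright70 ~]# for g in hadoop zookeeper hbase; do getent group $g; done

The same commands, run on the Lustre server, should show matching UID and GID values.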

Once consistency has been checked, and read/write access is working to LustreFS, the Hadoop integration can be configured.

2.4.4 Lustre Hadoop Configuration
Lustre Hadoop XML Configuration File Setup
Hadoop integration can be configured by using the file /cm/local/apps/cluster-tools/hadoop/conf/hadoop2lustreconf.xml as a starting point for the configuration. It can be copied over to, for example, /root/hadooplustreconfig.xml.

The Intel Distribution for Hadoop (IDH) and Cloudera can both run with Lustre under Bright Cluster Manager. The configuration for these can be done as follows:

• IDH

– A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair:

Example

<afs>
  <fstype>lustre</fstype>
  <fsroot>/mnt/lustre/hadoop</fsroot>
</afs>

• Cloudera

– A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair.

– In addition, an <fsjar></fsjar> tag pair must be specified manually, for the jar that the Intel IEEL 2.x distribution provides:

Example

<afs>
  <fstype>lustre</fstype>
  <fsroot>/mnt/lustre/hadoop</fsroot>
  <fsjar>/root/lustre/hadoop-lustre-plugin-2.3.0.jar</fsjar>
</afs>

The installation of the Lustre plugin is automatic if this jar name is set to the right name, when the cm-hadoop-setup script is run.

Lustre Hadoop Installation With cm-hadoop-setup
The XML configuration file specifies how Lustre should be integrated in Hadoop. If the configuration file is at /root/hadooplustreconfig.xml, then it can be run as:

Example

cm-hadoop-setup -c /root/hadooplustreconfig.xml

As part of configuring Hadoop to use Lustre, the execution will:

• Set the ACLs on the directory specified within the <fsroot></fsroot> tag pair. This was set to /mnt/lustre/hadoop earlier on, as an example.

• Copy the Lustre plugin from its jar path, as specified in the XML file, to the correct place on the client nodes.

Specifically, the jar is copied into the subdirectory share/hadoop/common/lib, relative to the Hadoop installation directory. For example, the Cloudera version of Hadoop, version 2.3.0-cdh5.1.2, has the Hadoop installation directory /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2. The copy is therefore carried out in this case from /root/lustre/hadoop-lustre-plugin-2.3.0.jar to /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2/share/hadoop/common/lib.


Lustre Hadoop Integration In cmsh And cmgui

In cmsh, Lustre integration is indicated in hadoop mode:

Example

[hadoop2->hadoop]% show hdfs1 | grep -i lustre
Hadoop root for Lustre  /mnt/lustre/hadoop
Use Lustre              yes

In cmgui, the Overview tab in the items for the Hadoop HDFS resource indicates if Lustre is running, along with its mount point (figure 2.3).

Figure 2.3: A Hadoop Instance With Lustre In cmgui

2.5 Hadoop Installation In A Cloud
Hadoop can make use of cloud services, so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual), or a Cluster Extension configuration (Chapter 3 of the Cloudbursting Manual). In both cases, the cloud nodes used should be at least m1.medium.

• For Cluster-On-Demand, the following considerations apply:

– There are no specific issues. After a stop/start cycle, Hadoop recognizes the new IP addresses, and refreshes the list of nodes accordingly (section 2.4.1 of the Cloudbursting Manual).

• For Cluster Extension, the following considerations apply:

– To install Hadoop on cloud nodes, the XML configuration /cm/local/apps/cluster-tools/hadoop/conf/hadoop2clusterextensionconf.xml can be used as a guide.


– In the hadoop2clusterextensionconf.xml file, the cloud director that is to be used with the Hadoop cloud nodes must be specified by the administrator, with the <edge></edge> tag pair:

Example

<edge>
  <hosts>eu-west-1-director</hosts>
</edge>

Maintenance operations, such as a format, will automatically and transparently be carried out by cmdaemon running on the cloud director, and not on the head node.

There are some shortcomings as a result of relying upon the cloud director:

– Cloud nodes depend on the same cloud director.

– While Hadoop installation (cm-hadoop-setup) is run on the head node, users must run Hadoop commands (job submissions, and so on) from the director, not from the head node.

– It is not possible to mix cloud and non-cloud nodes for the same Hadoop instance. That is, a local Hadoop instance cannot be extended by adding cloud nodes.


3 Viewing And Managing HDFS From CMDaemon

3.1 cmgui And The HDFS Resource
In cmgui, clicking on the resource folder for Hadoop in the resource tree opens up an Overview tab that displays a list of Hadoop HDFS instances, as in figure 2.1.

Clicking on an instance brings up a tab from a series of tabs associated with that instance, as in figure 3.1.

Figure 3.1: Tabs For A Hadoop Instance In cmgui

The possible tabs are: Overview (section 3.1.1), Settings (section 3.1.2), Tasks (section 3.1.3), and Notes (section 3.1.4). Various sub-panes are displayed in the possible tabs.


3.1.1 The HDFS Instance Overview Tab
The Overview tab pane (figure 3.2) arranges Hadoop-related items in sub-panes for convenient viewing.

Figure 3.2: Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed:

• Statistics: In the first sub-pane row are statistics associated with the Hadoop instance. These include node data on the number of applications that are running, as well as memory and capacity usage.

• Metrics: In the second sub-pane row, a metric can be selected for display as a graph.

• Roles: In the third sub-pane row, the roles associated with a node are displayed.

3.1.2 The HDFS Instance Settings Tab
The Settings tab pane (figure 3.3) displays the configuration of the Hadoop instance.


Figure 3.3: Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed:

• WebHDFS: Allows HDFS to be modified via the web.

• Description: An arbitrary, user-defined description.

• Directory settings for the Hadoop instance: Directories for configuration, temporary purposes, and logging.

• Topology awareness setting: Optimizes Hadoop for racks, switches, or nothing, when it comes to network topology.

• Replication factors: Set the default and maximum replication allowed.

• Balancing settings: Spread loads evenly across the instance.

3.1.3 The HDFS Instance Tasks Tab
The Tasks tab (figure 3.4) displays Hadoop-related tasks that the administrator can execute.


Figure 3.4: Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to:

• Restart, start, or stop all the Hadoop daemons.

• Start/Stop the Hadoop balancer.

• Format the Hadoop filesystem. A warning is given before formatting.

• Carry out a manual failover for the NameNode. This feature means making the other NameNode the active one, and is available only for Hadoop instances that support NameNode failover.

3.1.4 The HDFS Instance Notes Tab
The Notes tab is like a jotting pad, and can be used by the administrator to write notes for the associated Hadoop instance.

3.2 cmsh And hadoop Mode
The cmsh front end can be used instead of cmgui to display information on Hadoop-related values, and to carry out Hadoop-related tasks.

3.2.1 The show And overview Commands
The show Command
Within hadoop mode, the show command displays parameters that correspond mostly to cmgui's Settings tab in the Hadoop resource (section 3.1.2):

Example

[root@bright70 conf]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Apache121   1.2.1           Apache               /etc/hadoop/Apache121
[bright70->hadoop]% show apache121
Parameter                               Value


------------------------------------- ------------------------------------
Automatic failover                      Disabled
Balancer Threshold                      10
Balancer period                         2
Balancing policy                        dataNode
Cluster ID
Configuration directory                 /etc/hadoop/Apache121
Configuration directory for HBase       /etc/hadoop/Apache121/hbase
Configuration directory for ZooKeeper   /etc/hadoop/Apache121/zookeeper
Creation time                           Fri 07 Feb 2014 17:03:01 CET
Default replication factor              3
Enable HA for YARN                      no
Enable NFS gateway                      no
Enable Web HDFS                         no
HA enabled                              no
HA name service
HDFS Block size                         67108864
HDFS Permission                         no
HDFS Umask                              077
HDFS log directory                      /var/log/hadoop/Apache121
Hadoop distribution                     Apache
Hadoop root
Hadoop version                          1.2.1
Installation directory for HBase        /cm/shared/apps/hadoop/Apache/hbase-0+
Installation directory for ZooKeeper    /cm/shared/apps/hadoop/Apache/zookeepe+
Installation directory                  /cm/shared/apps/hadoop/Apache/1.2.1
Maximum replication factor              512
Network
Revision
Root directory for data                 /var/lib/
Temporary directory                     /tmp/hadoop/Apache121/
Topology                                Switch
Use HTTPS                               no
Use Lustre                              no
Use federation                          no
Use only HTTPS                          no
YARN automatic failover                 Hadoop
description                             installed from /tmp/hadoop-1.2.1.tar+
name                                    Apache121
notes

The overview Command
Similarly, the overview command corresponds somewhat to cmgui's Overview tab in the Hadoop resource (section 3.1.1), and provides Hadoop-related information on the system resources that are used:

Example

[bright70->hadoop]% overview apache121

Parameter Value

----------------------------- ---------------------------

Name apache121

Capacity total 0B

Capacity used 0B

Capacity remaining 0B

Heap memory total 0B


Heap memory used 0B

Heap memory remaining 0B

Non-heap memory total 0B

Non-heap memory used 0B

Non-heap memory remaining 0B

Nodes available 0

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 0

Total blocks 0

Missing blocks 0

Under-replicated blocks 0

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 0

Applications running 0

Applications pending 0

Applications submitted 0

Applications completed 0

Applications failed 0

Federation setup no

Role Node

------------------------------ ---------------------------

DataNode, ZooKeeper            bright70, node001, node002

3.2.2 The Tasks Commands
Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The *services Commands For Hadoop Services
Hadoop services can be started, stopped, and restarted, with:

• restartallservices

• startallservices

• stopallservices

Example

[bright70->hadoop]% restartallservices apache121
Will now stop all Hadoop services for instance 'apache121'... done.
Will now start all Hadoop services for instance 'apache121'... done.
[bright70->hadoop]%

The startbalancer And stopbalancer Commands For Hadoop
For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer

• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:


• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp  Iteration  Bytes Moved  Bytes To Move  Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%
Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121
For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
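For instance, a manual failover for an HA-enabled instance named apache220 (the instance name here is illustrative) could be run with:

Example

[bright70->hadoop]% manualfailover apache220

The -f and -t options can be added to pick the NameNodes explicitly.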


4 Hadoop Services

4.1 Hadoop Services From cmgui
4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service instance, values can be edited via actions in the associated pane:


• Selecting A Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui
The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means it includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.


5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220
================================================================
Press [CTRL+C] to stop
================================================================
================================================================
start cleaning directories
================================================================
================================================================
clean directories done
================================================================
================================================================
start doing gen_test
================================================================
14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2
Job Counters
Map-Reduce Framework
org.apache.hadoop.examples.terasort.TeraGen$Counters
14/03/24 15:07:03 INFO terasort.TeraSort: starting
14/03/24 15:09:12 INFO terasort.TeraSort: done
================================================================================
gen_test done
================================================================================
================================================================================
start doing PI test
================================================================================
Working Directory = /user/root/bbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220

Parameter Value

------------------------------ ---------------------------------

Name Apache220

Capacity total                 27.56GB
Capacity used                  72.46MB
Capacity remaining             16.41GB
Heap memory total              280.7MB
Heap memory used               152.1MB
Heap memory remaining          128.7MB
Non-heap memory total          258.1MB
Non-heap memory used           251.9MB
Non-heap memory remaining      61.55MB

Nodes available 3

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 72

Total blocks 31

Missing blocks 0

copy Bright Computing Inc

52 Example End User Job Run 29

Under-replicated blocks 2

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 59666

Applications running 1

Applications pending 0

Applications submitted 7

Applications completed 6

Applications failed 0

High availability Yes (automatic failover disabled)

Federation setup no

Role Node

-------------------------------------------------------------- -------

DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper  node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper              node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.

The user fred can then submit a run from a pi value estimator, from the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache/2.2.0
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
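For instance, a run with more maps and samples (the values here are just illustrative) takes longer, but yields a less crude estimate:

Example

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 10000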


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation, And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are, by default, installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option

-p <rootpass>

is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option

-s <heapsize>

is not mandatory. If not set, a default value of 1GB is used.

The option

--master <nodename>

is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest \
-i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret \
--count 10 --minKeySize 10 --maxKeySize 10 \
--minValueSize 78 --maxValueSize 78 --table sort --splits 10

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation, And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use.

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;

DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.


– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node.

It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script, using hive and beeline.

An Example Run With cm-hive-setup

Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client
-- testing 'beeline' client
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 -d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000
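The Metastore can similarly be queried with the hive client. A sketch, assuming the module name follows the same pattern as for the other Hadoop-related projects in this chapter:

Example

[root@bright70 ~]# module load hive/hdfs1
[root@bright70 ~]# hive -e 'SHOW TABLES;'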

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation, And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
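A quick functional check can then be carried out with the topic management script shipped with Kafka. A sketch, assuming the module name follows the pattern of the other projects in this chapter, and that node001 runs ZooKeeper on the default port, 2181:

Example

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 --replication-factor 1 --partitions 1 --topic test
[root@bright70 ~]# kafka-topics.sh --list --zookeeper node001:2181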

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance hdfs1.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation, And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig


14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
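A minimal Pig Latin script, along the lines of the smoke.pig batch example above, might look like the following sketch (the script content and input path are illustrative):

Example

-- smoke.pig: count word occurrences in an HDFS text file
lines = LOAD '/user/root/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;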

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with graphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation, And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option: --standalone

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


The Spark Master service runs on the active head node by default, but can be specified to run on another node, by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release 2.6.0
Spark will be installed in YARN (client/cluster) mode
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup In Standalone Mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release 2.6.0
Spark will be installed to work in Standalone mode
Spark Master service will be run on node node005
Spark will use all DataNodes as WorkerNodes
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.


Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation, And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option, in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
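After installation, a bulk import from MySQL into HDFS can be run with the sqoop client. A sketch, with an illustrative module name, database, table, and user (the credentials and hostname must match the actual MySQL setup):

Example

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://bright70.cm.cluster/testdb --username hdfsuser -P --table mytable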


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation, And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance, by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080/. There they can check the status of WordCount, and can kill it.
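The validation topology can also be checked, and killed, from the command line of a node where the storm module is available. A sketch, assuming the module name follows the pattern of the other projects in this chapter:

Example

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm list
[root@bright70 ~]# storm kill WordCount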

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2

470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15


  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 4: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

Table of Contents (continued)

5 Running Hadoop Jobs
  5.1 Shakedown Runs
  5.2 Example End User Job Run

6 Hadoop-related Projects
  6.1 Accumulo
    6.1.1 Accumulo Installation With cm-accumulo-setup
    6.1.2 Accumulo Removal With cm-accumulo-setup
    6.1.3 Accumulo MapReduce Example
  6.2 Hive
    6.2.1 Hive Installation With cm-hive-setup
    6.2.2 Hive Removal With cm-hive-setup
    6.2.3 Beeline
  6.3 Kafka
    6.3.1 Kafka Installation With cm-kafka-setup
    6.3.2 Kafka Removal With cm-kafka-setup
  6.4 Pig
    6.4.1 Pig Installation With cm-pig-setup
    6.4.2 Pig Removal With cm-pig-setup
    6.4.3 Using Pig
  6.5 Spark
    6.5.1 Spark Installation With cm-spark-setup
    6.5.2 Spark Removal With cm-spark-setup
    6.5.3 Using Spark
  6.6 Sqoop
    6.6.1 Sqoop Installation With cm-sqoop-setup
    6.6.2 Sqoop Removal With cm-sqoop-setup
  6.7 Storm
    6.7.1 Storm Installation With cm-storm-setup
    6.7.2 Storm Removal With cm-storm-setup
    6.7.3 Using Storm

Preface

Welcome to the Hadoop Deployment Manual for Bright Cluster Manager 7.0.

0.1 About This Manual
This manual is aimed at helping cluster administrators install, understand, configure, and manage the Hadoop capabilities of Bright Cluster Manager. The administrator is expected to be reasonably familiar with the Bright Cluster Manager Administrator Manual.

0.2 About The Manuals In General
Regularly updated versions of the Bright Cluster Manager 7.0 manuals are available on updated clusters by default at /cm/shared/docs/cm. The latest updates are always online at http://support.brightcomputing.com/manuals.

• The Installation Manual describes installation procedures for a basic cluster.

• The Administrator Manual describes the general management of the cluster.

• The User Manual describes the user environment and how to submit jobs for the end user.

• The Cloudbursting Manual describes how to deploy the cloud capabilities of the cluster.

• The Developer Manual has useful information for developers who would like to program with Bright Cluster Manager.

• The OpenStack Deployment Manual describes how to deploy OpenStack with Bright Cluster Manager.

• The Hadoop Deployment Manual describes how to deploy Hadoop with Bright Cluster Manager.

• The UCS Deployment Manual describes how to deploy the Cisco UCS server with Bright Cluster Manager.

If the manuals are downloaded and kept in one local directory, then, in most pdf viewers, clicking on a cross-reference in one manual that refers to a section in another manual opens and displays that section in the second manual. Navigating back and forth between documents is usually possible with keystrokes or mouse clicks.

For example: <Alt>-<Backarrow> in Acrobat Reader, or clicking on the bottom leftmost navigation button of xpdf, both navigate back to the previous document.


The manuals constantly evolve to keep up with the development of the Bright Cluster Manager environment and the addition of new hardware and/or applications. The manuals also regularly incorporate customer feedback. Administrator and user input is greatly valued at Bright Computing, so any comments, suggestions, or corrections will be very gratefully accepted at manuals@brightcomputing.com.

0.3 Getting Administrator-Level Support
Unless the Bright Cluster Manager reseller offers support, support is provided by Bright Computing over e-mail via support@brightcomputing.com. Section 10.2 of the Administrator Manual has more details on working with support.

1 Introduction

1.1 What Is Hadoop About?
Hadoop is the core implementation of a technology mostly used in the analysis of data sets that are large and unstructured in comparison with the data stored in regular relational databases. The analysis of such data, called data-mining, can be done better with Hadoop than with relational databases, for certain types of parallelizable problems. Hadoop has the following characteristics in comparison with relational databases:

1. Less structured input: Key value pairs are used as records for the data sets instead of a database.

2. Scale-out rather than scale-up design: For large data sets, if the size of a parallelizable problem increases linearly, the corresponding cost of scaling up a single machine to solve it tends to grow exponentially, simply because the hardware requirements tend to get exponentially expensive. If, however, the system that solves it is a cluster, then the corresponding cost tends to grow linearly, because it can be solved by scaling out the cluster with a linear increase in the number of processing nodes.

Scaling out can be done, with some effort, for database problems, using a parallel relational database implementation. However, scale-out is inherent in Hadoop, and therefore often easier to implement with Hadoop. The Hadoop scale-out approach is based on the following design:

• Clustered storage: Instead of a single node with a special, large storage device, a distributed filesystem (HDFS) using commodity hardware devices across many nodes stores the data.

• Clustered processing: Instead of using a single node with many processors, the parallel processing needs of the problem are distributed out over many nodes. The procedure is called the MapReduce algorithm, and is based on the following approach:

– The distribution process "maps" the initial state of the problem into processes out to the nodes, ready to be handled in parallel.

– Processing tasks are carried out on the data at the nodes themselves.

– The results are "reduced" back to one result. A command-line analogy of this map/shuffle/reduce flow is sketched just after this list.

3. Automated failure handling at application level for data: Replication of the data takes place across the DataNodes, which are the nodes holding the data. If a DataNode has failed, then another node which has the replicated data on it is used instead, automatically. Hadoop switches over quickly in comparison to replicated database clusters, due to not having to check database table consistency.
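The map/shuffle/reduce flow can be loosely illustrated on a single machine with standard Unix tools. The following word-count pipeline is only an analogy, not Hadoop itself: tr plays the role of the map phase (emitting one record per word), sort the shuffle phase (grouping identical keys), and uniq -c the reduce phase (aggregating each group into a count):

Example

[root@bright70 ~]# echo "one fish two fish red fish blue fish" | tr ' ' '\n' | sort | uniq -c
      1 blue
      4 fish
      1 one
      1 red
      1 two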

1.2 Available Hadoop Implementations
Bright Cluster Manager supports the Hadoop implementations provided by the following organizations:

1. Apache (http://apache.org): This is the upstream source for the Hadoop core and some related components, which all the other implementations use.

2. Cloudera (http://www.cloudera.com): Cloudera provides some extra premium functionality and components on top of a Hadoop suite. One of the extra components that Cloudera provides is the Cloudera Management Suite, a major proprietary management layer, with some premium features.

3. Hortonworks (http://hortonworks.com): Hortonworks Data Platform (HDP) is a fully open source Hadoop suite.

4. Pivotal HD: Pivotal Hadoop Distribution is a completely Apache-compliant distribution, with extensive analytic toolsets.

The ISO image for Bright Cluster Manager, available at http://www.brightcomputing.com/Download, can include Hadoop for all 4 implementations. During installation from the ISO, the administrator can choose which implementation to install (section 3.3.14 of the Installation Manual).

The contents and versions of the Hadoop implementations provided by Bright Computing at the time of writing (April 2015) are as follows:

• Apache, packaged as cm-apache-hadoop by Bright Computing. This provides:

– Hadoop versions 1.2.1, 2.2.0, and 2.6.0
– HBase version 0.98.11 for Hadoop 1.2.1
– HBase version 1.0.0 for Hadoop 2.2.0 and Hadoop 2.6.0
– ZooKeeper version 3.4.6

• Cloudera, packaged as cm-cloudera-hadoop by Bright Computing. This provides:

– CDH 4.7.1 (based on Apache Hadoop 2.0, HBase 0.94.15, and ZooKeeper 3.4.5)
– CDH 5.3.2 (based on Apache Hadoop 2.5.0, HBase 0.98.6, and ZooKeeper 3.4.5)


• Hortonworks, packaged as cm-hortonworks-hadoop by Bright Computing. This provides:

– HDP 1.3.9 (based on Apache Hadoop 1.2.0, HBase 0.94.6, and ZooKeeper 3.4.5)
– HDP 2.2.0 (based on Apache Hadoop 2.6.0, HBase 0.98.4, and ZooKeeper 3.4.6)

• Pivotal: Pivotal HD version 2.1.0 is based on Apache Hadoop 2.2.0. It runs fully integrated on Bright Cluster Manager, but is not available as a package in the Bright Cluster Manager repositories. This is because Pivotal prefers to distribute the software itself. The user can download the software from https://network.pivotal.io/products/pivotal-hd, after registering with Pivotal.

1.3 Further Documentation
Further documentation is provided in the installed tarballs of the Hadoop version, after the Bright Cluster Manager installation (Chapter 2) has been carried out. A starting point is listed in the table below:

Hadoop version        Path under /cm/shared/apps/hadoop
--------------        ---------------------------------
Apache 1.2.1          Apache/1.2.1/docs/index.html
Apache 2.2.0          Apache/2.2.0/share/doc/hadoop/index.html
Apache 2.6.0          Apache/2.6.0/share/doc/hadoop/index.html
Cloudera CDH 4.7.1    Cloudera/2.0.0-cdh4.7.1/share/doc/index.html
Cloudera CDH 5.3.2    Cloudera/2.5.0-cdh5.3.2/share/doc/index.html
Hortonworks HDP       Online documentation is available at http://docs.hortonworks.com


2 Installing Hadoop

In Bright Cluster Manager, a Hadoop instance can be configured and run either via the command-line (section 2.1) or via an ncurses GUI (section 2.2). Both options can be carried out with the cm-hadoop-setup script, which is run from a head node. The script is a part of the cluster-tools package, and uses tarballs from the Apache Hadoop project. The Bright Cluster Manager With Hadoop installation ISO includes the cm-apache-hadoop package, which contains tarballs from the Apache Hadoop project suitable for cm-hadoop-setup.

2.1 Command-line Installation Of Hadoop Using cm-hadoop-setup -c <filename>

2.1.1 Usage
[root@bright70 ~]# cm-hadoop-setup -h

USAGE: /cm/local/apps/cluster-tools/bin/cm-hadoop-setup [-c <filename> | -u <name> | -h]

OPTIONS:
-c <filename> -- Hadoop config file to use
-u <name>     -- uninstall Hadoop instance
-h            -- show usage

EXAMPLES:
cm-hadoop-setup -c /tmp/config.xml
cm-hadoop-setup -u foo
cm-hadoop-setup (no options: a gui will be started)

Some sample configuration files are provided in the directory /cm/local/apps/cluster-tools/hadoop/conf/:

hadoop1conf.xml       (for Hadoop 1.x)
hadoop2conf.xml       (for Hadoop 2.x)
hadoop2fedconf.xml    (for Hadoop 2.x with NameNode federation)
hadoop2haconf.xml     (for Hadoop 2.x with High Availability)
hadoop2lustreconf.xml (for Hadoop 2.x with Lustre support)


2.1.2 An Install Run
An XML template can be used based on the examples in the directory /cm/local/apps/cluster-tools/hadoop/conf/.

In the XML template, the path for a tarball component is enclosed by <archive> </archive> tag pairs. The tarball components can be as indicated:

• <archive>hadoop tarball</archive>
• <archive>hbase tarball</archive>
• <archive>zookeeper tarball</archive>

The tarball components can be picked up from URLs as listed in section 1.2. The paths of the tarball component files that are to be used should be set up as needed before running cm-hadoop-setup.

The downloaded tarball components should be placed in the /tmp directory if the default definitions in the default XML files are used:

Example

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/conf
[root@bright70 conf]# grep 'archive>' hadoop1conf.xml | grep -o /.*gz
/tmp/hadoop-1.2.1.tar.gz
/tmp/zookeeper-3.4.6.tar.gz
/tmp/hbase-0.98.11-hadoop1-bin.tar.gz

Files under /tmp are not intended to stay around permanently. The administrator may therefore wish to place the tarball components in a more permanent part of the filesystem instead, and change the XML definitions accordingly.

A Hadoop instance name, for example Myhadoop, can also be defined in the XML file, within the <name></name> tag pair.

Hadoop NameNodes and SecondaryNameNodes handle HDFS metadata, while DataNodes manage HDFS data. The data must be stored in the filesystem of the nodes. The default path for where the data is stored can be specified within the <dataroot></dataroot> tag pair. Multiple paths can also be set, using comma-separated paths. NameNodes, SecondaryNameNodes, and DataNodes each use the value or values set within the <dataroot></dataroot> tag pair for their root directories.
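Putting these tag pairs together, the relevant fragment of a configuration XML file might look like the following. This is only a sketch: the tag names are the ones described above, the values are illustrative, and the complete structure should be taken from the sample files in /cm/local/apps/cluster-tools/hadoop/conf/:

Example

<name>Myhadoop</name>
<dataroot>/data1,/data2</dataroot>
<archive>/tmp/hadoop-1.2.1.tar.gz</archive>
<archive>/tmp/zookeeper-3.4.6.tar.gz</archive>
<archive>/tmp/hbase-0.98.11-hadoop1-bin.tar.gz</archive>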

If needed, more specific tags can be used for each node type. This is useful in the case where hardware differs for the various node types. For example:

• a NameNode with 2 disk drives for Hadoop use
• a DataNode with 4 disk drives for Hadoop use

The XML file used by cm-hadoop-setup can in this case use the tag pairs:

• <namenodedatadirs></namenodedatadirs>
• <datanodedatadirs></datanodedatadirs>

If these are not specified, then the value within the <dataroot></dataroot> tag pair is used.


Example

• <namenodedatadirs>/data1,/data2</namenodedatadirs>
• <datanodedatadirs>/data1,/data2,/data3,/data4</datanodedatadirs>

Hadoop should then have the following dfs.*.dir properties added to it, via the hdfs-site.xml configuration file. For the preceding tag pairs, the property values should be set as follows:

Example

• dfs.namenode.name.dir with values:
/data1/hadoop/hdfs/namenode,/data2/hadoop/hdfs/namenode

• dfs.datanode.data.dir with values:
/data1/hadoop/hdfs/datanode,/data2/hadoop/hdfs/datanode,/data3/hadoop/hdfs/datanode,/data4/hadoop/hdfs/datanode
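In hdfs-site.xml, these values take the standard Hadoop <property> form. A minimal sketch, using the same illustrative paths as above:

Example

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data1/hadoop/hdfs/namenode,/data2/hadoop/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data1/hadoop/hdfs/datanode,/data2/hadoop/hdfs/datanode,/data3/hadoop/hdfs/datanode,/data4/hadoop/hdfs/datanode</value>
</property>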

An install run then displays output like the following:

Example

-rw-r--r-- 1 root root 63851630 Feb  4 15:13 hadoop-1.2.1.tar.gz
[root@bright70 ~]# cm-hadoop-setup -c /tmp/hadoop1conf.xml
Reading config from file '/tmp/hadoop1conf.xml'... done.
Hadoop flavor 'Apache', release '1.2.1'
Will now install Hadoop in /cm/shared/apps/hadoop/Apache/1.2.1 and configure instance 'Myhadoop'
Hadoop distro being installed... done.
Zookeeper being installed... done.
HBase being installed... done.
Creating module file... done.
Configuring Hadoop instance on local filesystem and images... done.
Updating images:
starting imageupdate for node 'node003'... started.
starting imageupdate for node 'node002'... started.
starting imageupdate for node 'node001'... started.
starting imageupdate for node 'node004'... started.
Waiting for imageupdate to finish... done.
Creating Hadoop instance in cmdaemon... done.
Formatting HDFS... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.

The Hadoop instance should now be running. The name defined for it in the XML file should show up within cmgui, in the Hadoop HDFS resource tree folder (figure 2.1).


Figure 2.1: A Hadoop Instance In cmgui

The instance name is also displayed within cmsh when the list command is run in hadoop mode:

Example

[root@bright70 ~]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key) Hadoop version Hadoop distribution Configuration directory
---------- -------------- ------------------- -----------------------
Myhadoop   1.2.1          Apache              /etc/hadoop/Myhadoop

The instance can be removed as follows:

Example

[root@bright70 ~]# cm-hadoop-setup -u Myhadoop
Requested uninstall of Hadoop instance 'Myhadoop'.
Uninstalling Hadoop instance 'Myhadoop'. Removing:
/etc/hadoop/Myhadoop
/var/lib/hadoop/Myhadoop
/var/log/hadoop/Myhadoop
/var/run/hadoop/Myhadoop
/tmp/hadoop/Myhadoop
/etc/hadoop/Myhadoop/zookeeper
/var/lib/zookeeper/Myhadoop
/var/log/zookeeper/Myhadoop
/var/run/zookeeper/Myhadoop
/etc/hadoop/Myhadoop/hbase
/var/log/hbase/Myhadoop
/var/run/hbase/Myhadoop
/etc/init.d/hadoop-Myhadoop-*
Module file(s) deleted.
Uninstall successfully completed.

2.2 Ncurses Installation Of Hadoop Using cm-hadoop-setup

Running cm-hadoop-setup without any options starts up an ncurses GUI (figure 2.2).


Figure 2.2: The cm-hadoop-setup Welcome Screen

This provides an interactive way to add and remove Hadoop instances, along with HBase and ZooKeeper components. Some explanations of the items being configured are given along the way. In addition, some minor validation checks are done, and some options are restricted.

The suggested default values will work. Other values can be chosen instead of the defaults, but some care in selection is usually a good idea. This is because Hadoop is complex software, which means that values other than the defaults can sometimes lead to unworkable configurations (section 2.3).

The ncurses installation results in an XML configuration file. This file can be used with the -c option of cm-hadoop-setup to carry out the installation.

2.3 Avoiding Misconfigurations During Hadoop Installation

A misconfiguration can be defined as a configuration that works badly, or not at all.

For Hadoop to work well, the following common misconfigurations should normally be avoided. Some of these result in warnings from Bright Cluster Manager validation checks during configuration, but can be overridden. An override is useful for cases where the administrator would just like to, for example, test some issues not related to scale or performance.

2.3.1 NameNode Configuration Choices
One of the following NameNode configuration options must be chosen when Hadoop is installed. The choice should be made with care, because changing between the options after installation is not possible.

Hadoop 1.x NameNode Configuration Choices
• NameNode can optionally have a SecondaryNameNode.

SecondaryNameNode offloads metadata operations from NameNode, and also stores the metadata offline to some extent.

It is not by any means a high availability solution. While recovery from a failed head node is possible from SecondaryNameNode, it is not easy, and it is not recommended or supported by Bright Cluster Manager.

Hadoop 2.x NameNode Configuration Choices
• NameNode and SecondaryNameNode can run as in Hadoop 1.x.

However, the following configurations are also possible:

• NameNode HA with manual failover: In this configuration, Hadoop has NameNode1 and NameNode2 up at the same time, with one active and one on standby. Which NameNode is active, and which is on standby, is set manually by the administrator. If one NameNode fails, then failover must be executed manually. Metadata changes are managed by ZooKeeper, which relies on a quorum of JournalNodes. The number of JournalNodes is therefore set to 3, 5, 7, and so on.

• NameNode HA with automatic failover: As for the manual case, except that in this case ZooKeeper manages failover too. Which NameNode is active, and which is on standby, is therefore decided automatically.

• NameNode Federation: In NameNode Federation, the storage of metadata is split among several NameNodes, each of which has a corresponding SecondaryNameNode. Each pair takes care of a part of HDFS.

In Bright Cluster Manager, there are 4 NameNodes in a default NameNode federation:

– /user
– /tmp
– /staging
– /hbase

User applications do not have to know this mapping. This is because ViewFS on the client side maps the selected path to the corresponding NameNode. Thus, for example, hdfs dfs -ls /tmp/example does not need to know that /tmp is managed by another NameNode.

Cloudera advise against using NameNode Federation for production purposes at present, due to its development status.

2.4 Installing Hadoop With Lustre
The Lustre filesystem has a client-server configuration.

2.4.1 Lustre Internal Server Installation
The procedure for installing a Lustre server varies.

The Bright Cluster Manager Knowledge Base article at http://kb.brightcomputing.com/faq/index.php?action=artikel&cat=9&id=176 describes a procedure that uses Lustre sources from Whamcloud, and may be used as guidance.


2.4.2 Lustre External Server Installation
Lustre can also be configured so that the servers run external to Bright Cluster Manager. The Lustre Intel IEEL 2.x version can be configured in this manner.

2.4.3 Lustre Client Installation
It is preferred that the Lustre clients are installed on the head node, as well as on all the nodes that are to be Hadoop nodes. The clients should be configured to provide a Lustre mount on the nodes. If the Lustre client cannot be installed on the head node, then Bright Cluster Manager has the following limitations during installation and maintenance:

• the head node cannot be used to run Hadoop services

• end users cannot perform Hadoop operations, such as job submission, on the head node. Operations such as those should instead be carried out while logged in to one of the Hadoop nodes.

In the remainder of this section, a Lustre mount point of /mnt/lustre is assumed, but it can be set to any convenient directory mount point.

The user IDs and group IDs of the Lustre server and clients should be consistent. It is quite likely that they differ when first set up. The IDs should be checked at least for the following users and groups:

• users: hdfs, mapred, yarn, hbase, zookeeper
• groups: hadoop, zookeeper, hbase

If they do not match on the server and clients, then they must be made consistent manually, so that the UID and GID of the Lustre server users are changed to match the UID and GID of the Bright Cluster Manager users.
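A quick way to inspect these IDs on each side is with the standard id and getent utilities. The loops below are a sketch only, to be run once on the Lustre server and once on a client, so that the outputs can be compared:

Example

[root@bright70 ~]# for u in hdfs mapred yarn hbase zookeeper; do id $u; done
[root@bright70 ~]# for g in hadoop zookeeper hbase; do getent group $g; done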

Once consistency has been checked, and read/write access is working to LustreFS, the Hadoop integration can be configured.

2.4.4 Lustre Hadoop Configuration
Lustre Hadoop XML Configuration File Setup
Hadoop integration can be configured by using the file /cm/local/apps/cluster-tools/hadoop/conf/hadoop2lustreconf.xml as a starting point for the configuration. It can be copied over to, for example, /root/hadooplustreconfig.xml.

The Intel Distribution for Hadoop (IDH) and Cloudera can both run with Lustre under Bright Cluster Manager. The configuration for these can be done as follows:

• IDH

– A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair:

Example

<afs>
  <fstype>lustre</fstype>
  <fsroot>/mnt/lustre/hadoop</fsroot>
</afs>

• Cloudera

– A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair.

– In addition, an <fsjar></fsjar> tag pair must be specified manually, for the jar that the Intel IEEL 2.x distribution provides:

Example

<afs>
  <fstype>lustre</fstype>
  <fsroot>/mnt/lustre/hadoop</fsroot>
  <fsjar>/root/lustre/hadoop-lustre-plugin-2.3.0.jar</fsjar>
</afs>

The installation of the Lustre plugin is automatic if this jar name is set to the right name when the cm-hadoop-setup script is run.

Lustre Hadoop Installation With cm-hadoop-setup
The XML configuration file specifies how Lustre should be integrated in Hadoop. If the configuration file is at /root/hadooplustreconfig.xml, then it can be run as:

Example

cm-hadoop-setup -c /root/hadooplustreconfig.xml

As part of configuring Hadoop to use Lustre, the execution will:

• Set the ACLs on the directory specified within the <fsroot></fsroot> tag pair. This was set to /mnt/lustre/hadoop earlier on, as an example.

• Copy the Lustre plugin from its jar path, as specified in the XML file, to the correct place on the client nodes.

Specifically, the subdirectory ./share/hadoop/common/lib is copied into a directory relative to the Hadoop installation directory. For example, the Cloudera version of Hadoop, version 2.3.0-cdh5.1.2, has the Hadoop installation directory /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2. The copy is therefore carried out in this case from /root/lustre/hadoop-lustre-plugin-2.3.0.jar to /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2/share/hadoop/common/lib.
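A simple check that the plugin jar has landed in place, using the example paths above (a sketch; the exact directory depends on the Hadoop distribution and version installed):

Example

[root@bright70 ~]# ls /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2/share/hadoop/common/lib | grep -i lustre
hadoop-lustre-plugin-2.3.0.jar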


Lustre Hadoop Integration In cmsh And cmgui
In cmsh, Lustre integration is indicated in hadoop mode:

Example

[hadoop2->hadoop]% show hdfs1 | grep -i lustre
Hadoop root for Lustre  /mnt/lustre/hadoop
Use Lustre              yes

In cmgui, the Overview tab in the items for the Hadoop HDFS resource indicates if Lustre is running, along with its mount point (figure 2.3).

Figure 2.3: A Hadoop Instance With Lustre In cmgui

2.5 Hadoop Installation In A Cloud
Hadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual), or a Cluster Extension configuration (Chapter 3 of the Cloudbursting Manual). In both cases, the cloud nodes used should be at least m1.medium.

• For Cluster-On-Demand, the following considerations apply:

– There are no specific issues. After a stop/start cycle, Hadoop recognizes the new IP addresses, and refreshes the list of nodes accordingly (section 2.4.1 of the Cloudbursting Manual).

• For Cluster Extension, the following considerations apply:

– To install Hadoop on cloud nodes, the XML configuration /cm/local/apps/cluster-tools/hadoop/conf/hadoop2clusterextensionconf.xml can be used as a guide.


– In the hadoop2clusterextensionconf.xml file, the cloud director that is to be used with the Hadoop cloud nodes must be specified by the administrator, with the <edge></edge> tag pair:

Example

<edge>
  <hosts>eu-west-1-director</hosts>
</edge>

Maintenance operations, such as a format, will automatically and transparently be carried out by cmdaemon running on the cloud director, and not on the head node.

There are some shortcomings as a result of relying upon the cloud director:

– Cloud nodes depend on the same cloud director.

– While Hadoop installation (cm-hadoop-setup) is run on the head node, users must run Hadoop commands (job submissions, and so on) from the director, not from the head node.

– It is not possible to mix cloud and non-cloud nodes for the same Hadoop instance. That is, a local Hadoop instance cannot be extended by adding cloud nodes.


3 Viewing And Managing HDFS From CMDaemon

3.1 cmgui And The HDFS Resource
In cmgui, clicking on the resource folder for Hadoop in the resource tree opens up an Overview tab that displays a list of Hadoop HDFS instances, as in figure 2.1.

Clicking on an instance brings up a tab from a series of tabs associated with that instance, as in figure 3.1.

Figure 3.1: Tabs For A Hadoop Instance In cmgui

The possible tabs are: Overview (section 3.1.1), Settings (section 3.1.2), Tasks (section 3.1.3), and Notes (section 3.1.4). Various sub-panes are displayed in the possible tabs.


3.1.1 The HDFS Instance Overview Tab
The Overview tab pane (figure 3.2) arranges Hadoop-related items in sub-panes for convenient viewing.

Figure 3.2: Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed:

• Statistics: In the first sub-pane row are statistics associated with the Hadoop instance. These include node data on the number of applications that are running, as well as memory and capacity usage.

• Metrics: In the second sub-pane row, a metric can be selected for display as a graph.

• Roles: In the third sub-pane row, the roles associated with a node are displayed.

3.1.2 The HDFS Instance Settings Tab
The Settings tab pane (figure 3.3) displays the configuration of the Hadoop instance.


Figure 3.3: Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed:

• WebHDFS: Allows HDFS to be modified via the web. An example of web access is sketched after this list.

• Description: An arbitrary, user-defined description.

• Directory settings for the Hadoop instance: Directories for configuration, temporary purposes, and logging.

• Topology awareness setting: Optimizes Hadoop for racks, switches, or nothing, when it comes to network topology.

• Replication factors: Set the default and maximum replication allowed.

• Balancing settings: Spread loads evenly across the instance.
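With WebHDFS enabled, HDFS can be reached over HTTP on the NameNode, via the standard WebHDFS REST API. A minimal sketch, assuming a NameNode on node001 and the default WebHDFS port of 50070, lists the contents of /tmp as follows:

Example

[root@bright70 ~]# curl -i "http://node001:50070/webhdfs/v1/tmp?op=LISTSTATUS"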

3.1.3 The HDFS Instance Tasks Tab
The Tasks tab (figure 3.4) displays Hadoop-related tasks that the administrator can execute.


Figure 3.4: Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to:

• Restart, start, or stop all the Hadoop daemons.

• Start/Stop the Hadoop balancer.

• Format the Hadoop filesystem. A warning is given before formatting.

• Carry out a manual failover for the NameNode. This feature means making the other NameNode the active one, and is available only for Hadoop instances that support NameNode failover.

3.1.4 The HDFS Instance Notes Tab
The Notes tab is like a jotting pad, and can be used by the administrator to write notes for the associated Hadoop instance.

3.2 cmsh And hadoop Mode
The cmsh front end can be used instead of cmgui to display information on Hadoop-related values, and to carry out Hadoop-related tasks.

3.2.1 The show And overview Commands
The show Command
Within hadoop mode, the show command displays parameters that correspond mostly to cmgui's Settings tab in the Hadoop resource (section 3.1.2):

Example

[root@bright70 conf]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key) Hadoop version Hadoop distribution Configuration directory
---------- -------------- ------------------- -----------------------
Apache121  1.2.1          Apache              /etc/hadoop/Apache121
[bright70->hadoop]% show apache121
Parameter                               Value
--------------------------------------- ------------------------------------
Automatic failover                      Disabled
Balancer Threshold                      10
Balancer period                         2
Balancing policy                        dataNode
Cluster ID
Configuration directory                 /etc/hadoop/Apache121
Configuration directory for HBase       /etc/hadoop/Apache121/hbase
Configuration directory for ZooKeeper   /etc/hadoop/Apache121/zookeeper
Creation time                           Fri, 07 Feb 2014 17:03:01 CET
Default replication factor              3
Enable HA for YARN                      no
Enable NFS gateway                      no
Enable Web HDFS                         no
HA enabled                              no
HA name service
HDFS Block size                         67108864
HDFS Permission                         no
HDFS Umask                              077
HDFS log directory                      /var/log/hadoop/Apache121
Hadoop distribution                     Apache
Hadoop root
Hadoop version                          1.2.1
Installation directory for HBase        /cm/shared/apps/hadoop/Apache/hbase-0+
Installation directory for ZooKeeper    /cm/shared/apps/hadoop/Apache/zookeepe+
Installation directory                  /cm/shared/apps/hadoop/Apache/1.2.1
Maximum replication factor              512
Network
Revision
Root directory for data                 /var/lib/
Temporary directory                     /tmp/hadoop/Apache121
Topology                                Switch
Use HTTPS                               no
Use Lustre                              no
Use federation                          no
Use only HTTPS                          no
YARN automatic failover                 Hadoop
description                             installed from: /tmp/hadoop-1.2.1.tar+
name                                    Apache121
notes

The overview Command
Similarly, the overview command corresponds somewhat to cmgui's Overview tab in the Hadoop resource (section 3.1.1), and provides hadoop-related information on the system resources that are used:

Example

[bright70->hadoop]% overview apache121
Parameter                       Value
------------------------------- ---------------------------
Name                            apache121
Capacity: total                 0B
Capacity: used                  0B
Capacity: remaining             0B
Heap memory: total              0B
Heap memory: used               0B
Heap memory: remaining          0B
Non-heap memory: total          0B
Non-heap memory: used           0B
Non-heap memory: remaining      0B
Nodes available                 0
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     0
Total blocks                    0
Missing blocks                  0
Under-replicated blocks         0
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       0
Applications running            0
Applications pending            0
Applications submitted          0
Applications completed          0
Applications failed             0
Federation setup                no

Role                            Node
------------------------------- ---------------------------
DataNode, ZooKeeper             bright70,node001,node002

3.2.2 The Tasks Commands
Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The *services Commands For Hadoop Services
Hadoop services can be started, stopped, and restarted with:

• restartallservices
• startallservices
• stopallservices

Example

[bright70->hadoop]% restartallservices apache121
Will now stop all Hadoop services for instance 'apache121'... done.
Will now start all Hadoop services for instance 'apache121'... done.
[bright70->hadoop]%

The startbalancer And stopbalancer Commands For Hadoop
For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer
• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:


• balancerperiod
• balancerthreshold
• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp  Iteration  Bytes Moved  Bytes To Move  Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%

Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121
For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
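For example, a failover to the NameNode running on node002, for an instance named hdfs1, might be run as follows (the node and instance names here are illustrative only):

Example

[bright70->hadoop]% manualfailover --to node002 hdfs1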


4 Hadoop Services

4.1 Hadoop Services From cmgui
4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node
• Data Node, Zookeeper
• Job Tracker, Task Tracker
• YARN Server, YARN Client
• HBase Server, HBase Client
• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:


• Selecting A Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui
The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means that they include the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.


5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220
================================================================
Press [CTRL+C] to stop...
================================================================
...
================================================================
start cleaning directories...
================================================================
...
================================================================
clean directories done
================================================================
...
================================================================
start doing gen_test...
================================================================
...
14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2
...
Job Counters
...
Map-Reduce Framework
...
org.apache.hadoop.examples.terasort.TeraGen$Counters
...
14/03/24 15:07:03 INFO terasort.TeraSort: starting
...
14/03/24 15:09:12 INFO terasort.TeraSort: done
================================================================================
gen_test done
================================================================================
...
================================================================================
start doing PI test...
================================================================================
Working Directory = /user/root/bbp
...

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                       Value
------------------------------- ---------------------------------
Name                            Apache220
Capacity: total                 27.56GB
Capacity: used                  72.46MB
Capacity: remaining             16.41GB
Heap memory: total              280.7MB
Heap memory: used               152.1MB
Heap memory: remaining          128.7MB
Non-heap memory: total          258.1MB
Non-heap memory: used           251.9MB
Non-heap memory: remaining      6.155MB
Nodes available                 3
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     72
Total blocks                    31
Missing blocks                  0
Under-replicated blocks         2
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       59666
Applications running            1
Applications pending            0
Applications submitted          7
Applications completed          6
Applications failed             0
High availability               Yes (automatic failover disabled)
Federation setup                no

Role                                                            Node
--------------------------------------------------------------- -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper  node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper              node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.
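A sketch of how the access might be revoked again, assuming cmsh's usual behavior of clearing a parameter that is set to a blank value:

Example

[bright70->user[fred]]% set hadoophdfsaccess ""; commit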

The user fred can then submit a run from a pi value estimator, from the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache/2.2.0
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
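For instance, a rerun with more maps and samples (the values here are chosen only for illustration) takes longer, but yields a closer estimate of pi:

Example

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 100000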


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)
• Hive (section 6.2)
• Kafka (section 6.3)
• Pig (section 6.4)
• Spark (section 6.5)
• Sqoop (section 6.6)
• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables
• Garbage Collector, to delete files no longer needed
• Master, responsible for coordination
• Tracer, collecting traces about Accumulo operations
• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation, And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use the Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are, by default, installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option:
-p <rootpass>
is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.

The option:
-s <heapsize>
is not mandatory. If not set, a default value of 1GB is used.

The option:
--master <nodename>
is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest \
-i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret \
--count 10 --minKeySize 10 --maxKeySize 10 \
--minValueSize 78 --maxValueSize 78 --table sort --splits 10

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client
• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation, And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use:

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;
DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.

– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node.

It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script using hive and beeline.

An Example Run With cm-hive-setup

Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client...
-- testing 'beeline' client...
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation, And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation, And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and process data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
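The contents of /tmp/smoke.pig are not shown here. For illustration, a minimal Pig Latin script of the same kind, a hypothetical word count over an HDFS file, could be created and run as follows, with the hadoop and pig modules loaded as above (the input path /user/root/input.txt is an assumption):

Example

[root@bright70 ~]# cat > /tmp/wordcount.pig <<'EOF'
-- load each line of the input file as a single chararray field
lines = LOAD '/user/root/input.txt' AS (line:chararray);
-- split each line into words, then count how often each word occurs
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
counts = FOREACH (GROUP words BY word) GENERATE group, COUNT(words);
STORE counts INTO '/user/root/wordcount_out';
EOF
[root@bright70 ~]# pig -f /tmp/wordcount.pig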

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

  – Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

    It is possible to force the Standalone mode deployment by using the additional option: --standalone

    When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


    The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

  – YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

    When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup In Standalone Mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode.
Spark Master service will be run on node node005.
Spark will use all DataNodes as WorkerNodes.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.


Installation successfully completed.
Finished.
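In Standalone mode, the SparkPi example can then be submitted directly to the Spark Master. The following is a sketch only: it assumes the default Spark Master port of 7077 on node005 from the run above, and the examples jar path as used in section 6.5.3:

Example

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master spark://node005:7077 --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar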

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance:

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar
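In either deploy mode, the submitted run shows up as a YARN application, so its state can be checked with the standard yarn CLI of the Hadoop instance. A sketch, assuming the hadoop module for the instance is loaded:

Example

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# yarn application -list -appStates ALL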

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop: it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

  – a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

  – cm-sqoop-setup is run with the --conn option, in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
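Once installation has completed, imports are run with the sqoop executable. The following is a sketch of a MySQL-to-HDFS import: it assumes that the setup script publishes a sqoop/hdfs1 environment module, and the database host dbhost, database dbname, table mytable, and user sqoopuser are all placeholders to be replaced with real values:

Example

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://dbhost/dbname --username sqoopuser -P --table mytable --target-dir /user/root/mytable -m 1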


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance:

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at: http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance:

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar storm.starter.WordCountTopology WordCount2
470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2
[root@hadoopdev ~]# storm list
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15
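The topology keeps running until it is killed. As with the WordCount validation topology of section 6.7.1, this can be done from the Storm UI, or from the command line:

Example

[root@bright70 ~]# storm kill WordCount2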


Preface

Welcome to the Hadoop Deployment Manual for Bright Cluster Manager 7.0.

0.1 About This Manual
This manual is aimed at helping cluster administrators install, understand, configure, and manage the Hadoop capabilities of Bright Cluster Manager. The administrator is expected to be reasonably familiar with the Bright Cluster Manager Administrator Manual.

0.2 About The Manuals In General
Regularly updated versions of the Bright Cluster Manager 7.0 manuals are available on updated clusters by default at /cm/shared/docs/cm. The latest updates are always online at http://support.brightcomputing.com/manuals.

• The Installation Manual describes installation procedures for a basic cluster.

• The Administrator Manual describes the general management of the cluster.

• The User Manual describes the user environment and how to submit jobs for the end user.

• The Cloudbursting Manual describes how to deploy the cloud capabilities of the cluster.

• The Developer Manual has useful information for developers who would like to program with Bright Cluster Manager.

• The OpenStack Deployment Manual describes how to deploy OpenStack with Bright Cluster Manager.

• The Hadoop Deployment Manual describes how to deploy Hadoop with Bright Cluster Manager.

• The UCS Deployment Manual describes how to deploy the Cisco UCS server with Bright Cluster Manager.

If the manuals are downloaded and kept in one local directory, then in most pdf viewers, clicking on a cross-reference in one manual that refers to a section in another manual opens and displays that section in the second manual. Navigating back and forth between documents is usually possible with keystrokes or mouse clicks.

For example, <Alt>-<Backarrow> in Acrobat Reader, or clicking on the bottom leftmost navigation button of xpdf, both navigate back to the previous document.


The manuals constantly evolve to keep up with the development of the Bright Cluster Manager environment and the addition of new hardware and/or applications. The manuals also regularly incorporate customer feedback. Administrator and user input is greatly valued at Bright Computing. So any comments, suggestions, or corrections will be very gratefully accepted at manuals@brightcomputing.com.

0.3 Getting Administrator-Level Support
Unless the Bright Cluster Manager reseller offers support, support is provided by Bright Computing over e-mail via support@brightcomputing.com. Section 10.2 of the Administrator Manual has more details on working with support.

1 Introduction

1.1 What Is Hadoop About
Hadoop is the core implementation of a technology mostly used in the analysis of data sets that are large and unstructured in comparison with the data stored in regular relational databases. The analysis of such data, called data-mining, can be done better with Hadoop than with relational databases, for certain types of parallelizable problems. Hadoop has the following characteristics in comparison with relational databases:

1. Less structured input: Key-value pairs are used as records for the data sets instead of a database.

2. Scale-out rather than scale-up design: For large data sets, if the size of a parallelizable problem increases linearly, the corresponding cost of scaling up a single machine to solve it tends to grow exponentially, simply because the hardware requirements tend to get exponentially expensive. If, however, the system that solves it is a cluster, then the corresponding cost tends to grow linearly, because it can be solved by scaling out the cluster with a linear increase in the number of processing nodes.

Scaling out can be done, with some effort, for database problems, using a parallel relational database implementation. However, scale-out is inherent in Hadoop, and therefore often easier to implement with Hadoop. The Hadoop scale-out approach is based on the following design:

• Clustered storage: Instead of a single node with a special, large storage device, a distributed filesystem (HDFS) using commodity hardware devices across many nodes stores the data.

• Clustered processing: Instead of using a single node with many processors, the parallel processing needs of the problem are distributed out over many nodes. The procedure is called the MapReduce algorithm, and is based on the following approach:

  – The distribution process "maps" the initial state of the problem into processes out to the nodes, ready to be handled in parallel.

  – Processing tasks are carried out on the data at nodes themselves.


  – The results are "reduced" back to one result.

3. Automated failure handling at application level for data: Replication of the data takes place across the DataNodes, which are the nodes holding the data. If a DataNode has failed, then another node which has the replicated data on it is used instead automatically. Hadoop switches over quickly in comparison to replicated database clusters, due to not having to check database table consistency.

1.2 Available Hadoop Implementations
Bright Cluster Manager supports the Hadoop implementations provided by the following organizations:

1. Apache (http://apache.org): This is the upstream source for the Hadoop core and some related components, which all the other implementations use.

2. Cloudera (http://www.cloudera.com): Cloudera provides some extra premium functionality and components on top of a Hadoop suite. One of the extra components that Cloudera provides is the Cloudera Management Suite, a major proprietary management layer, with some premium features.

3. Hortonworks (http://hortonworks.com): Hortonworks Data Platform (HDP) is a fully open source Hadoop suite.

4. Pivotal HD: Pivotal Hadoop Distribution is a completely Apache-compliant distribution with extensive analytic toolsets.

The ISO image for Bright Cluster Manager, available at http://www.brightcomputing.com/Download, can include Hadoop for all 4 implementations. During installation from the ISO, the administrator can choose which implementation to install (section 3.3.14 of the Installation Manual).

The contents and versions of the Hadoop implementations provided by Bright Computing at the time of writing (April 2015) are as follows:

• Apache, packaged as cm-apache-hadoop by Bright Computing. This provides:

  – Hadoop versions 1.2.1, 2.2.0, and 2.6.0

  – HBase version 0.98.11 for Hadoop 1.2.1

  – HBase version 1.0.0 for Hadoop 2.2.0 and Hadoop 2.6.0

  – ZooKeeper version 3.4.6

• Cloudera, packaged as cm-cloudera-hadoop by Bright Computing. This provides:

  – CDH 4.7.1 (based on Apache Hadoop 2.0, HBase 0.94.15, and ZooKeeper 3.4.5)

  – CDH 5.3.2 (based on Apache Hadoop 2.5.0, HBase 0.98.6, and ZooKeeper 3.4.5)


• Hortonworks, packaged as cm-hortonworks-hadoop by Bright Computing. This provides:

  – HDP 1.3.9 (based on Apache Hadoop 1.2.0, HBase 0.94.6, and ZooKeeper 3.4.5)

  – HDP 2.2.0 (based on Apache Hadoop 2.6.0, HBase 0.98.4, and ZooKeeper 3.4.6)

Pivotal
Pivotal HD version 2.1.0 is based on Apache Hadoop 2.2.0. It runs fully integrated on Bright Cluster Manager, but is not available as a package in the Bright Cluster Manager repositories. This is because Pivotal prefers to distribute the software itself. The user can download the software from https://network.pivotal.io/products/pivotal-hd, after registering with Pivotal.

1.3 Further Documentation
Further documentation is provided in the installed tarballs of the Hadoop version, after the Bright Cluster Manager installation (Chapter 2) has been carried out. A starting point is listed in the table below:

Hadoop version        Path under /cm/shared/apps/hadoop
--------------------  --------------------------------------------
Apache 1.2.1          Apache/1.2.1/docs/index.html
Apache 2.2.0          Apache/2.2.0/share/doc/hadoop/index.html
Apache 2.6.0          Apache/2.6.0/share/doc/hadoop/index.html
Cloudera CDH 4.7.1    Cloudera/2.0.0-cdh4.7.1/share/doc/index.html
Cloudera CDH 5.3.2    Cloudera/2.5.0-cdh5.3.2/share/doc/index.html
Hortonworks HDP       Online documentation is available at
                      http://docs.hortonworks.com


2 Installing Hadoop

In Bright Cluster Manager, a Hadoop instance can be configured and run either via the command-line (section 2.1) or via an ncurses GUI (section 2.2). Both options can be carried out with the cm-hadoop-setup script, which is run from a head node. The script is a part of the cluster-tools package, and uses tarballs from the Apache Hadoop project. The Bright Cluster Manager With Hadoop installation ISO includes the cm-apache-hadoop package, which contains tarballs from the Apache Hadoop project suitable for cm-hadoop-setup.

2.1 Command-line Installation Of Hadoop Using cm-hadoop-setup -c <filename>

2.1.1 Usage
[root@bright70 ~]# cm-hadoop-setup -h

USAGE: /cm/local/apps/cluster-tools/bin/cm-hadoop-setup [-c <filename>
       | -u <name> | -h]

OPTIONS:
  -c <filename>    -- Hadoop config file to use
  -u <name>        -- uninstall Hadoop instance
  -h               -- show usage

EXAMPLES:
  cm-hadoop-setup -c /tmp/config.xml
  cm-hadoop-setup -u foo
  cm-hadoop-setup   (no options, a gui will be started)

Some sample configuration files are provided in the directory /cm/local/apps/cluster-tools/hadoop/conf/:

hadoop1conf.xml        (for Hadoop 1.x)
hadoop2conf.xml        (for Hadoop 2.x)
hadoop2fedconf.xml     (for Hadoop 2.x with NameNode federation)
hadoop2haconf.xml      (for Hadoop 2.x with High Availability)
hadoop2lustreconf.xml  (for Hadoop 2.x with Lustre support)


2.1.2 An Install Run
An XML template can be used based on the examples in the directory /cm/local/apps/cluster-tools/hadoop/conf/.

In the XML template, the path for a tarball component is enclosed by <archive> </archive> tag pairs. The tarball components can be as indicated:

• <archive>hadoop tarball</archive>

• <archive>hbase tarball</archive>

• <archive>zookeeper tarball</archive>

The tarball components can be picked up from URLs as listed in section 1.2. The paths of the tarball component files that are to be used should be set up as needed before running cm-hadoop-setup.

The downloaded tarball components should be placed in the /tmp directory, if the default definitions in the default XML files are used:

Example

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/conf
[root@bright70 conf]# grep 'archive>' hadoop1conf.xml | grep -o '/.*gz'
/tmp/hadoop-1.2.1.tar.gz
/tmp/zookeeper-3.4.6.tar.gz
/tmp/hbase-0.98.11-hadoop1-bin.tar.gz

Files under /tmp are not intended to stay around permanently. The administrator may therefore wish to place the tarball components in a more permanent part of the filesystem instead, and change the XML definitions accordingly.

A Hadoop instance name, for example Myhadoop, can also be defined in the XML file, within the <name></name> tag pair.

Hadoop NameNodes and SecondaryNameNodes handle HDFS metadata, while DataNodes manage HDFS data. The data must be stored in the filesystem of the nodes. The default path for where the data is stored can be specified within the <dataroot></dataroot> tag pair. Multiple paths can also be set, using comma-separated paths. NameNodes, SecondaryNameNodes, and DataNodes each use the value, or values, set within the <dataroot></dataroot> tag pair for their root directories.
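For illustration, a sketch of how these two tag pairs might look in the XML file, assuming the flat layout of the sample files (the sample files in /cm/local/apps/cluster-tools/hadoop/conf/ remain the authoritative reference for the structure):

Example

<name>Myhadoop</name>
<dataroot>/data1,/data2</dataroot>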

If needed, more specific tags can be used for each node type. This is useful in the case where hardware differs for the various node types. For example:

• a NameNode with 2 disk drives for Hadoop use

• a DataNode with 4 disk drives for Hadoop use

The XML file used by cm-hadoop-setup can in this case use the tag pairs:

• <namenodedatadirs></namenodedatadirs>

• <datanodedatadirs></datanodedatadirs>

If these are not specified, then the value within the <dataroot></dataroot> tag pair is used.


Example

• <namenodedatadirs>/data1,/data2</namenodedatadirs>

• <datanodedatadirs>/data1,/data2,/data3,/data4</datanodedatadirs>

Hadoop should then have the following dfs.*.dir properties added to it via the hdfs-site.xml configuration file. For the preceding tag pairs, the property values should be set as follows:

Example

• dfs.namenode.name.dir with values: /data1/hadoop/hdfs/namenode,/data2/hadoop/hdfs/namenode

• dfs.datanode.data.dir with values: /data1/hadoop/hdfs/datanode,/data2/hadoop/hdfs/datanode,/data3/hadoop/hdfs/datanode,/data4/hadoop/hdfs/datanode
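As a minimal sketch, the first of these properties would appear in hdfs-site.xml as an entry of the following form (using the example paths above):

Example

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data1/hadoop/hdfs/namenode,/data2/hadoop/hdfs/namenode</value>
</property>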

An install run then displays output like the following

Example

-rw-r--r-- 1 root root 63851630 Feb  4 15:13 hadoop-1.2.1.tar.gz
[root@bright70 ~]# cm-hadoop-setup -c /tmp/hadoop1conf.xml
Reading config from file '/tmp/hadoop1conf.xml'... done.
Hadoop flavor 'Apache', release '1.2.1'
Will now install Hadoop in /cm/shared/apps/hadoop/Apache/1.2.1 and configure instance 'Myhadoop'
Hadoop distro being installed... done.
Zookeeper being installed... done.
HBase being installed... done.
Creating module file... done.
Configuring Hadoop instance on local filesystem and images... done.
Updating images:
starting imageupdate for node 'node003'... started
starting imageupdate for node 'node002'... started
starting imageupdate for node 'node001'... started
starting imageupdate for node 'node004'... started
Waiting for imageupdate to finish... done.
Creating Hadoop instance in cmdaemon... done.
Formatting HDFS... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.

The Hadoop instance should now be running. The name defined for it in the XML file should show up within cmgui, in the Hadoop HDFS resource tree folder (figure 2.1).


Figure 2.1: A Hadoop Instance In cmgui

The instance name is also displayed within cmsh when the list command is run in hadoop mode:

Example

[root@bright70 ~]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Myhadoop    1.2.1           Apache               /etc/hadoop/Myhadoop

The instance can be removed as follows

Example

[root@bright70 ~]# cm-hadoop-setup -u Myhadoop
Requested uninstall of Hadoop instance 'Myhadoop'.
Uninstalling Hadoop instance 'Myhadoop'
Removing:
/etc/hadoop/Myhadoop
/var/lib/hadoop/Myhadoop
/var/log/hadoop/Myhadoop
/var/run/hadoop/Myhadoop
/tmp/hadoop/Myhadoop
/etc/hadoop/Myhadoop/zookeeper
/var/lib/zookeeper/Myhadoop
/var/log/zookeeper/Myhadoop
/var/run/zookeeper/Myhadoop
/etc/hadoop/Myhadoop/hbase
/var/log/hbase/Myhadoop
/var/run/hbase/Myhadoop
/etc/init.d/hadoop-Myhadoop-*
Module file(s) deleted.
Uninstall successfully completed.

2.2 Ncurses Installation Of Hadoop Using cm-hadoop-setup

Running cm-hadoop-setup without any options starts up an ncurses GUI (figure 2.2).


Figure 2.2: The cm-hadoop-setup Welcome Screen

This provides an interactive way to add and remove Hadoop instances, along with HBase and Zookeeper components. Some explanations of the items being configured are given along the way. In addition, some minor validation checks are done, and some options are restricted.

The suggested default values will work. Other values can be chosen instead of the defaults, but some care in selection is usually a good idea. This is because Hadoop is complex software, which means that values other than the defaults can sometimes lead to unworkable configurations (section 2.3).

The ncurses installation results in an XML configuration file. This file can be used with the -c option of cm-hadoop-setup to carry out the installation.

2.3 Avoiding Misconfigurations During Hadoop Installation

A misconfiguration can be defined as a configuration that works badly, or not at all.

For Hadoop to work well, the following common misconfigurations should normally be avoided. Some of these result in warnings from Bright Cluster Manager validation checks during configuration, but can be overridden. An override is useful for cases where the administrator would just like to, for example, test some issues not related to scale or performance.

2.3.1 NameNode Configuration Choices
One of the following NameNode configuration options must be chosen when Hadoop is installed. The choice should be made with care, because changing between the options after installation is not possible.

Hadoop 1.x NameNode Configuration Choices
• NameNode can optionally have a SecondaryNameNode.

  SecondaryNameNode offloads metadata operations from NameNode, and also stores the metadata offline to some extent.

  It is not by any means a high availability solution. While recovery from a failed head node is possible from SecondaryNameNode, it is not easy, and it is not recommended or supported by Bright Cluster Manager.

Hadoop 2.x NameNode Configuration Choices
• NameNode and SecondaryNameNode can run as in Hadoop 1.x.

However, the following configurations are also possible:

• NameNode HA with manual failover: In this configuration, Hadoop has NameNode1 and NameNode2 up at the same time, with one active and one on standby. Which NameNode is active, and which is on standby, is set manually by the administrator. If one NameNode fails, then failover must be executed manually. Metadata changes are managed by ZooKeeper, which relies on a quorum of JournalNodes. The number of JournalNodes is therefore set to 3, 5, 7...

• NameNode HA with automatic failover: As for the manual case, except that in this case ZooKeeper manages failover too. Which NameNode is active, and which is on standby, is therefore decided automatically.

• NameNode Federation: In NameNode Federation, the storage of metadata is split among several NameNodes, each of which has a corresponding SecondaryNameNode. Each pair takes care of a part of HDFS.

In Bright Cluster Manager, there are 4 NameNodes in a default NameNode federation:

  – /user

  – /tmp

  – /staging

  – /hbase

User applications do not have to know this mapping. This is because ViewFS on the client side maps the selected path to the corresponding NameNode. Thus, for example, hdfs -ls /tmp/example does not need to know that /tmp is managed by another NameNode.

Cloudera advise against using NameNode Federation for production purposes at present, due to its development status.

2.4 Installing Hadoop With Lustre
The Lustre filesystem has a client-server configuration.

2.4.1 Lustre Internal Server Installation
The procedure for installing a Lustre server varies.

The Bright Cluster Manager Knowledge Base article describes a procedure that uses Lustre sources from Whamcloud. The article is at http://kb.brightcomputing.com/faq/index.php?action=artikel&cat=9&id=176, and may be used as guidance.


2.4.2 Lustre External Server Installation
Lustre can also be configured so that the servers run external to Bright Cluster Manager. The Lustre Intel IEEL 2.x version can be configured in this manner.

2.4.3 Lustre Client Installation
It is preferred that the Lustre clients are installed on the head node, as well as on all the nodes that are to be Hadoop nodes. The clients should be configured to provide a Lustre mount on the nodes. If the Lustre client cannot be installed on the head node, then Bright Cluster Manager has the following limitations during installation and maintenance:

• the head node cannot be used to run Hadoop services

• end users cannot perform Hadoop operations, such as job submission, on the head node. Operations such as those should instead be carried out while logged in to one of the Hadoop nodes.

In the remainder of this section, a Lustre mount point of /mnt/lustre is assumed, but it can be set to any convenient directory mount point.

The user IDs and group IDs of the Lustre server and clients should be consistent. It is quite likely that they differ when first set up. The IDs should be checked at least for the following users and groups:

• users: hdfs, mapred, yarn, hbase, zookeeper

• groups: hadoop, zookeeper, hbase

If they do not match on the server and clients, then they must be made consistent manually, so that the UID and GID of the Lustre server users are changed to match the UID and GID of the Bright Cluster Manager users.

Once consistency has been checked, and read/write access is working to LustreFS, the Hadoop integration can be configured.

2.4.4 Lustre Hadoop Configuration
Lustre Hadoop XML Configuration File Setup
Hadoop integration can be configured by using the file /cm/local/apps/cluster-tools/hadoop/conf/hadoop2lustreconf.xml as a starting point for the configuration. It can be copied over to, for example, /root/hadooplustreconfig.xml.

The Intel Distribution for Hadoop (IDH) and Cloudera can both run with Lustre under Bright Cluster Manager. The configuration for these can be done as follows:

• IDH

  – A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair:

Example

<afs>
  <fstype>lustre</fstype>
  <fsroot>/mnt/lustre/hadoop</fsroot>


</afs>

• Cloudera

  – A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair.

  – In addition, an <fsjar></fsjar> tag pair must be specified manually, for the jar that the Intel IEEL 2.x distribution provides:

Example

<afs>
  <fstype>lustre</fstype>
  <fsroot>/mnt/lustre/hadoop</fsroot>
  <fsjar>/root/lustre/hadoop-lustre-plugin-2.3.0.jar</fsjar>
</afs>

The installation of the Lustre plugin is automatic if this jar name is set to the right name, when the cm-hadoop-setup script is run.

Lustre Hadoop Installation With cm-hadoop-setup
The XML configuration file specifies how Lustre should be integrated in Hadoop. If the configuration file is at /root/hadooplustreconfig.xml, then it can be run as:

Example

cm-hadoop-setup -c /root/hadooplustreconfig.xml

As part of configuring Hadoop to use Lustre, the execution will:

• Set the ACLs on the directory specified within the <fsroot></fsroot> tag pair. This was set to /mnt/lustre/hadoop earlier on, as an example.

• Copy the Lustre plugin from its jar path, as specified in the XML file, to the correct place on the client nodes.

  Specifically, the subdirectory share/hadoop/common/lib is copied into a directory relative to the Hadoop installation directory. For example, the Cloudera version of Hadoop, version 2.3.0-cdh5.1.2, has the Hadoop installation directory /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2. The copy is therefore carried out in this case from /root/lustre/hadoop-lustre-plugin-2.3.0.jar to /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2/share/hadoop/common/lib.


Lustre Hadoop Integration In cmsh and cmgui

In cmsh, Lustre integration is indicated in hadoop mode:

Example

[hadoop2->hadoop]% show hdfs1 | grep -i lustre
Hadoop root for Lustre    /mnt/lustre/hadoop
Use Lustre                yes

In cmgui, the Overview tab in the items for the Hadoop HDFS resource indicates if Lustre is running, along with its mount point (figure 2.3).

Figure 2.3: A Hadoop Instance With Lustre In cmgui

25 Hadoop Installation In A CloudHadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual) or a Clus-ter Extension configuration (Chapter 3 of the Cloudbursting Manual) Inboth cases the cloud nodes used should be at least m1medium

bull For Cluster-On-Demand the following considerations apply

ndash There are no specific issues After a stopstart cycle Hadooprecognizes the new IP addresses and refreshes the list of nodesaccordingly (section 241 of the Cloudbursting Manual)

bull For Cluster Extension the following considerations apply

ndash To install Hadoop on cloud nodes the XML configurationcmlocalappscluster-toolshadoopconfhadoop2clusterextensionconfxmlcan be used as a guide


  – In the hadoop2clusterextensionconf.xml file, the cloud director that is to be used with the Hadoop cloud nodes must be specified by the administrator, with the <edge></edge> tag pair:

Example

<edge>
  <hosts>eu-west-1-director</hosts>
</edge>

    Maintenance operations, such as a format, will automatically and transparently be carried out by cmdaemon running on the cloud director, and not on the head node.

There are some shortcomings as a result of relying upon the cloud director:

  – Cloud nodes depend on the same cloud director.

  – While Hadoop installation (cm-hadoop-setup) is run on the head node, users must run Hadoop commands (job submissions, and so on) from the director, not from the head node.

  – It is not possible to mix cloud and non-cloud nodes for the same Hadoop instance. That is, a local Hadoop instance cannot be extended by adding cloud nodes.


3 Viewing And Managing HDFS From CMDaemon

3.1 cmgui And The HDFS Resource
In cmgui, clicking on the resource folder for Hadoop in the resource tree opens up an Overview tab that displays a list of Hadoop HDFS instances, as in figure 2.1.

Clicking on an instance brings up a tab from a series of tabs associated with that instance, as in figure 3.1.

Figure 3.1: Tabs For A Hadoop Instance In cmgui

The possible tabs are: Overview (section 3.1.1), Settings (section 3.1.2), Tasks (section 3.1.3), and Notes (section 3.1.4). Various sub-panes are displayed in the possible tabs.


3.1.1 The HDFS Instance Overview Tab
The Overview tab pane (figure 3.2) arranges Hadoop-related items in sub-panes for convenient viewing.

Figure 3.2: Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed:

• Statistics: In the first sub-pane row are statistics associated with the Hadoop instance. These include node data on the number of applications that are running, as well as memory and capacity usage.

• Metrics: In the second sub-pane row, a metric can be selected for display as a graph.

• Roles: In the third sub-pane row, the roles associated with a node are displayed.

3.1.2 The HDFS Instance Settings Tab
The Settings tab pane (figure 3.3) displays the configuration of the Hadoop instance.


Figure 3.3: Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed:

• WebHDFS: Allows HDFS to be modified via the web.

• Description: An arbitrary, user-defined description.

• Directory settings for the Hadoop instance: Directories for configuration, temporary purposes, and logging.

• Topology awareness setting: Optimizes Hadoop for racks, switches, or nothing, when it comes to network topology.

• Replication factors: Set the default and maximum replication allowed.

• Balancing settings: Spread loads evenly across the instance.

3.1.3 The HDFS Instance Tasks Tab
The Tasks tab (figure 3.4) displays Hadoop-related tasks that the administrator can execute.


Figure 3.4: Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to:

• Restart, start, or stop all the Hadoop daemons.

• Start/Stop the Hadoop balancer.

• Format the Hadoop filesystem. A warning is given before formatting.

• Carry out a manual failover for the NameNode. This feature means making the other NameNode the active one, and is available only for Hadoop instances that support NameNode failover.

3.1.4 The HDFS Instance Notes Tab
The Notes tab is like a jotting pad, and can be used by the administrator to write notes for the associated Hadoop instance.

3.2 cmsh And hadoop Mode
The cmsh front end can be used instead of cmgui to display information on Hadoop-related values, and to carry out Hadoop-related tasks.

3.2.1 The show And overview Commands
The show Command
Within hadoop mode, the show command displays parameters that correspond mostly to cmgui's Settings tab in the Hadoop resource (section 3.1.2):

Example

[root@bright70 conf]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Apache121   1.2.1           Apache               /etc/hadoop/Apache121
[bright70->hadoop]% show apache121
Parameter                               Value
--------------------------------------- ------------------------------------
Automatic failover                      Disabled
Balancer Threshold                      10
Balancer period                         2
Balancing policy                        dataNode
Cluster ID
Configuration directory                 /etc/hadoop/Apache121
Configuration directory for HBase       /etc/hadoop/Apache121/hbase
Configuration directory for ZooKeeper   /etc/hadoop/Apache121/zookeeper
Creation time                           Fri, 07 Feb 2014 17:03:01 CET
Default replication factor              3
Enable HA for YARN                      no
Enable NFS gateway                      no
Enable Web HDFS                         no
HA enabled                              no
HA name service
HDFS Block size                         67108864
HDFS Permission                         no
HDFS Umask                              077
HDFS log directory                      /var/log/hadoop/Apache121
Hadoop distribution                     Apache
Hadoop root
Hadoop version                          1.2.1
Installation directory for HBase        /cm/shared/apps/hadoop/Apache/hbase-0+
Installation directory for ZooKeeper    /cm/shared/apps/hadoop/Apache/zookeepe+
Installation directory                  /cm/shared/apps/hadoop/Apache/1.2.1
Maximum replication factor              512
Network
Revision
Root directory for data                 /var/lib
Temporary directory                     /tmp/hadoop/Apache121
Topology                                Switch
Use HTTPS                               no
Use Lustre                              no
Use federation                          no
Use only HTTPS                          no
YARN automatic failover                 Hadoop
description                             installed from: /tmp/hadoop-1.2.1.tar+
name                                    Apache121
notes

The overview Command
Similarly, the overview command corresponds somewhat to cmgui's Overview tab in the Hadoop resource (section 3.1.1), and provides hadoop-related information on the system resources that are used:

Example

[bright70->hadoop]% overview apache121
Parameter                      Value
-----------------------------  ---------------------------
Name                           apache121
Capacity total                 0B
Capacity used                  0B
Capacity remaining             0B
Heap memory total              0B
Heap memory used               0B
Heap memory remaining          0B
Non-heap memory total          0B
Non-heap memory used           0B
Non-heap memory remaining      0B
Nodes available                0
Nodes dead                     0
Nodes decommissioned           0
Nodes decommission in progress 0
Total files                    0
Total blocks                   0
Missing blocks                 0
Under-replicated blocks        0
Scheduled replication blocks   0
Pending replication blocks     0
Block report average Time      0
Applications running           0
Applications pending           0
Applications submitted         0
Applications completed         0
Applications failed            0
Federation setup               no

Role                           Node
------------------------------ ---------------------------
DataNode, ZooKeeper            bright70,node001,node002

3.2.2 The Tasks Commands
Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The services Commands For Hadoop Services
Hadoop services can be started, stopped, and restarted with:

• restartallservices

• startallservices

• stopallservices

Example

[bright70->hadoop]% restartallservices apache121
Will now stop all Hadoop services for instance 'apache121'... done.
Will now start all Hadoop services for instance 'apache121'... done.
[bright70->hadoop]%

The startbalancer And stopbalancer Commands For Hadoop
For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer

• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:

• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp  Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%
Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121
For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>
The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>
The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
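For example, for an instance with NameNode HA, the active status could be handed to a particular NameNode as follows (a sketch; node002 stands in for one of the instance's NameNodes):

Example

[bright70->hadoop]% manualfailover --to node002 apache220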


4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:


• Selecting A Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui
The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means it includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.


5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220
================================================================
Press [CTRL+C] to stop
================================================================
================================================================
start cleaning directories
================================================================
================================================================
clean directories done
================================================================
================================================================
start doing gen_test
================================================================
14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2
Job Counters
Map-Reduce Framework
org.apache.hadoop.examples.terasort.TeraGen$Counters
14/03/24 15:07:03 INFO terasort.TeraSort: starting
14/03/24 15:09:12 INFO terasort.TeraSort: done
================================================================================
gen_test done
================================================================================
================================================================================
start doing PI test
================================================================================
Working Directory = /user/root/bbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6-gthadoop] overview apache220

Parameter Value

------------------------------ ---------------------------------

Name Apache220

Capacity total 2756GB

Capacity used 7246MB

Capacity remaining 1641GB

Heap memory total 2807MB

Heap memory used 1521MB

Heap memory remaining 1287MB

Non-heap memory total 2581MB

Non-heap memory used 2519MB

Non-heap memory remaining 6155MB

Nodes available 3

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 72

Total blocks 31

Missing blocks 0

copy Bright Computing Inc

52 Example End User Job Run 29

Under-replicated blocks 2

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 59666

Applications running 1

Applications pending 0

Applications submitted 7

Applications completed 6

Applications failed 0

High availability Yes (automatic failover disabled)

Federation setup no

Role Node

-------------------------------------------------------------- -------

DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper node001

DataNode, Journal, NameNode, YARNClient, ZooKeeper node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user, fred, can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.
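For instance, a minimal sketch of revoking the access again in cmsh, assuming an empty value is committed in the same way as above:

[bright70->user[fred]]% set hadoophdfsaccess ""; commit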

The user fred can then submit a run of the pi value estimator from the example jar file as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5

Job Finished in 19.732 seconds

Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options (1 and 5 in the example). The result can be improved with greater values for both.
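For instance, a run with more maps and samples gives a closer estimate. The values below are chosen purely for illustration:

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 10000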


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation, And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use the Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are by default installed and run on the head node. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option -p <rootpass> is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option -s <heapsize> is not mandatory. If not set, a default value of 1GB is used.

The option --master <nodename> is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005

Accumulo release '1.6.2'

Accumulo GC, Master, Monitor, and Tracer services will be run on node node005.

Found Hadoop instance 'hdfs1', release 2.6.0

Accumulo being installed... done.

Creating directories for Accumulo... done.

Creating module file for Accumulo... done.

Creating configuration files for Accumulo... done.

Updating images... done.

Setting up Accumulo directories in HDFS... done.

Executing 'accumulo init'... done.

Initializing services for Accumulo (on DataNodes)... done.

Initializing master services for Accumulo... done.

Waiting for NameNode to be ready... done.

Executing validation test... done.

Installation successfully completed.

Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1

Requested removal of Accumulo for Hadoop instance 'hdfs1'.

Stopping/removing services... done.

Removing module file... done.

Removing additional Accumulo directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run as the accumulo system user.

Example

[root@bright70 ~]# su - accumulo

bash-4.1$ module load accumulo/hdfs1

bash-4.1$ cd $ACCUMULO_HOME

bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \


org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 --maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10
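The result of the ingest can then be inspected interactively with the Accumulo shell. The following is a minimal sketch, reusing the root password secret from the example above:

bash-4.1$ accumulo shell -u root -p secret -e "tables"
bash-4.1$ accumulo shell -u root -p secret -e "scan -t sort"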

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation, And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use.

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;

DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.


– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node.

It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script using hive and beeline.

An Example Run With cm-hive-setup
Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005

Hive release '1.1.0-bin'

Using MySQL server on active headnode.

Hive service will be run on node node005.

Using MySQL Connector/J installed in /usr/share/java/

Hive being installed... done.

Creating directories for Hive... done.

Creating module file for Hive... done.

Creating configuration files for Hive... done.

Initializing database 'metastore_hdfs1' in MySQL... done.

Waiting for NameNode to be ready... done.

Creating HDFS directories for Hive... done.

Updating images... done.

Waiting for NameNode to be ready... done.

Hive setup validation:

-- testing 'hive' client

-- testing 'beeline' client

Hive setup validation... done.

Installation successfully completed.

Finished.


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance 'hdfs1'.

Stopping/removing services... done.

Removing module file... done.

Removing additional Hive directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 -d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'

Connecting to jdbc:hive2://node005.cm.cluster:10000

Connected to: Apache Hive (version 1.1.0)

Driver: Hive JDBC (version 1.1.0)

Transaction isolation: TRANSACTION_REPEATABLE_READ

+-----------+--+

| tab_name  |

+-----------+--+

| test      |

| test2     |

+-----------+--+

2 rows selected (0.243 seconds)

Beeline version 1.1.0 by Apache Hive

Closing: 0: jdbc:hive2://node005.cm.cluster:10000
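Beeline can also run short HiveQL snippets directly with the -e option. A minimal sketch, using a hypothetical table name test3:

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 -d org.apache.hive.jdbc.HiveDriver -e 'CREATE TABLE test3 (id INT, name STRING); SHOW TABLES;'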

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation, And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/kafka_2.11-0.8.2.1.tgz

Kafka release '0.8.2.1' for Scala '2.11'

Found Hadoop instance 'hdfs1', release 1.2.1

Kafka being installed... done.

Creating directories for Kafka... done.

Creating module file for Kafka... done.

Creating configuration files for Kafka... done.

Updating images... done.

Initializing services for Kafka (on ZooKeeper nodes)... done.

Executing validation test... done.

Installation successfully completed.

Finished.
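A quick functional check can then be carried out with the standard Kafka command-line tools. The following is a sketch only: it assumes that a kafka/hdfs1 module exists, following the module naming pattern of the other projects in this chapter, and that node001 runs ZooKeeper on port 2181 and a Kafka broker on the default port 9092:

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 --replication-factor 1 --partitions 1 --topic test
[root@bright70 ~]# echo "hello" | kafka-console-producer.sh --broker-list node001:9092 --topic test
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 --topic test --from-beginning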

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1

Requested removal of Kafka for Hadoop instance 'hdfs1'.

Stopping/removing services... done.

Removing module file... done.

Removing additional Kafka directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation, And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/pig-0.14.0.tar.gz

Pig release '0.14.0'

Pig being installed... done.

Creating directories for Pig... done.

Creating module file for Pig... done.

Creating configuration files for Pig... done.

Waiting for NameNode to be ready...

Waiting for NameNode to be ready... done.

Validating Pig setup...

Validating Pig setup... done.

Installation successfully completed.

Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1

Requested removal of Pig for Hadoop instance 'hdfs1'.

Stopping/removing services... done.

Removing module file... done.

Removing additional Pig directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig


14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
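The contents of a batch script such as /tmp/smoke.pig are plain Pig Latin. A minimal word-count sketch, with hypothetical input and output paths, might look like:

-- smoke.pig: count word occurrences in an HDFS text file
lines  = LOAD '/user/root/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/user/root/wordcount_out';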

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation, And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master.

Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks HDP 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release 2.6.0

Spark will be installed in YARN (client/cluster) mode.

Spark being installed... done.

Creating directories for Spark... done.

Creating module file for Spark... done.

Creating configuration files for Spark... done.

Waiting for NameNode to be ready... done.

Copying Spark assembly jar to HDFS... done.

Waiting for NameNode to be ready... done.

Validating Spark setup... done.

Installation successfully completed.

Finished.

An Example Run With cm-spark-setup In Standalone Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release 2.6.0

Spark will be installed to work in Standalone mode.

Spark Master service will be run on node node005.

Spark will use all DataNodes as WorkerNodes.

Spark being installed... done.

Creating directories for Spark... done.

Creating module file for Spark... done.

Creating configuration files for Spark... done.

Updating images... done.

Initializing Spark Master service... done.

Initializing Spark Worker service... done.

Validating Spark setup... done.


Installation successfully completed.

Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance 'hdfs1'.

Stopping/removing services... done.

Removing module file... done.

Removing additional Spark directories... done.

Removal successfully completed.

Finished.

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1

[root@bright70 ~]# spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1

[root@bright70 ~]# spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar
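When Spark has instead been installed in Standalone mode, the same example can be submitted directly to the Spark Master. The following is a sketch, assuming the Master runs on node005 on the default Standalone port 7077:

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master spark://node005:7077 --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar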

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop: it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation, And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup
Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005

Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar

Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'

Sqoop service will be run on node node005.

Found Hadoop instance 'hdfs1', release 2.2.0

Sqoop being installed... done.

Creating directories for Sqoop... done.

Creating module file for Sqoop... done.

Creating configuration files for Sqoop... done.

Updating images... done.

Installation successfully completed.

Finished.
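A bulk import from MySQL into HDFS then follows the standard Sqoop pattern. The following is a minimal sketch, in which the module name sqoop/hdfs1, the database testdb, and the table mytable are all hypothetical:

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://node005.cm.cluster/testdb --username fred -P --table mytable --target-dir /user/fred/mytable -m 1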


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance 'hdfs1'.

Stopping/removing services... done.

Removing module file... done.

Removing additional Sqoop directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemon to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation, And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t apache-storm-0.9.4.tar.gz --nimbus node005

Storm release '0.9.4'

Storm Nimbus and UI services will be run on node node005.

Found Hadoop instance 'hdfs1', release 2.2.0

Storm being installed... done.

Creating directories for Storm... done.

Creating module file for Storm... done.

Creating configuration files for Storm... done.

Updating images... done.

Initializing worker services for Storm (on DataNodes)... done.

Initializing Nimbus services for Storm... done.

Executing validation test... done.

Installation successfully completed.

Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.
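The same check and cleanup can also be done from the command line. A sketch, using the storm client that the module provides:

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm list
[root@bright70 ~]# storm kill WordCount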

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance 'hdfs1'.

Stopping/removing services... done.

Removing module file... done.

Removing additional Storm directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1

[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2

470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...

476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

[==================================================] 3248859 / 3248859

File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}

687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15

copy Bright Computing Inc


The manuals constantly evolve to keep up with the development of the Bright Cluster Manager environment and the addition of new hardware and/or applications. The manuals also regularly incorporate customer feedback. Administrator and user input is greatly valued at Bright Computing. So any comments, suggestions, or corrections will be very gratefully accepted at manuals@brightcomputing.com.

0.3 Getting Administrator-Level Support
Unless the Bright Cluster Manager reseller offers support, support is provided by Bright Computing over e-mail via support@brightcomputing.com. Section 10.2 of the Administrator Manual has more details on working with support.

1 Introduction

1.1 What Is Hadoop About
Hadoop is the core implementation of a technology mostly used in the analysis of data sets that are large and unstructured in comparison with the data stored in regular relational databases. The analysis of such data, called data-mining, can be done better with Hadoop than with relational databases, for certain types of parallelizable problems. Hadoop has the following characteristics in comparison with relational databases:

1. Less structured input: Key-value pairs are used as records for the data sets instead of a database.

2. Scale-out rather than scale-up design: For large data sets, if the size of a parallelizable problem increases linearly, the corresponding cost of scaling up a single machine to solve it tends to grow exponentially, simply because the hardware requirements tend to get exponentially expensive. If, however, the system that solves it is a cluster, then the corresponding cost tends to grow linearly, because it can be solved by scaling out the cluster with a linear increase in the number of processing nodes.

Scaling out can be done, with some effort, for database problems, using a parallel relational database implementation. However, scale-out is inherent in Hadoop, and therefore often easier to implement with Hadoop. The Hadoop scale-out approach is based on the following design:

• Clustered storage: Instead of a single node with a special, large storage device, a distributed filesystem (HDFS) using commodity hardware devices across many nodes stores the data.

• Clustered processing: Instead of using a single node with many processors, the parallel processing needs of the problem are distributed out over many nodes. The procedure is called the MapReduce algorithm, and is based on the following approach:

– The distribution process "maps" the initial state of the problem into processes out to the nodes, ready to be handled in parallel.

– Processing tasks are carried out on the data at the nodes themselves.


– The results are "reduced" back to one result.

3. Automated failure handling at application level for data: Replication of the data takes place across the DataNodes, which are the nodes holding the data. If a DataNode has failed, then another node which has the replicated data on it is used instead automatically. Hadoop switches over quickly, in comparison to replicated database clusters, due to not having to check database table consistency.

1.2 Available Hadoop Implementations
Bright Cluster Manager supports the Hadoop implementations provided by the following organizations:

1. Apache (http://apache.org): This is the upstream source for the Hadoop core and some related components, which all the other implementations use.

2. Cloudera (http://www.cloudera.com): Cloudera provides some extra premium functionality and components on top of a Hadoop suite. One of the extra components that Cloudera provides is the Cloudera Management Suite, a major proprietary management layer, with some premium features.

3. Hortonworks (http://hortonworks.com): Hortonworks Data Platform (HDP) is a fully open source Hadoop suite.

4. Pivotal HD: Pivotal Hadoop Distribution is a completely Apache-compliant distribution, with extensive analytic toolsets.

The ISO image for Bright Cluster Manager, available at http://www.brightcomputing.com/Download, can include Hadoop for all 4 implementations. During installation from the ISO, the administrator can choose which implementation to install (section 3.3.14 of the Installation Manual).

The contents and versions of the Hadoop implementations provided by Bright Computing at the time of writing (April 2015) are as follows:

• Apache, packaged as cm-apache-hadoop by Bright Computing. This provides:

– Hadoop versions 1.2.1, 2.2.0, and 2.6.0

– HBase version 0.98.11 for Hadoop 1.2.1

– HBase version 1.0.0 for Hadoop 2.2.0 and Hadoop 2.6.0

– ZooKeeper version 3.4.6

• Cloudera, packaged as cm-cloudera-hadoop by Bright Computing. This provides:

– CDH 4.7.1 (based on Apache Hadoop 2.0, HBase 0.94.15, and ZooKeeper 3.4.5)

– CDH 5.3.2 (based on Apache Hadoop 2.5.0, HBase 0.98.6, and ZooKeeper 3.4.5)


• Hortonworks, packaged as cm-hortonworks-hadoop by Bright Computing. This provides:

– HDP 1.3.9 (based on Apache Hadoop 1.2.0, HBase 0.94.6, and ZooKeeper 3.4.5)

– HDP 2.2.0 (based on Apache Hadoop 2.6.0, HBase 0.98.4, and ZooKeeper 3.4.6)

Pivotal
Pivotal HD version 2.1.0 is based on Apache Hadoop 2.2.0. It runs fully integrated on Bright Cluster Manager, but is not available as a package in the Bright Cluster Manager repositories. This is because Pivotal prefers to distribute the software itself. The user can download the software from https://network.pivotal.io/products/pivotal-hd, after registering with Pivotal.

1.3 Further Documentation
Further documentation is provided in the installed tarballs of the Hadoop version, after the Bright Cluster Manager installation (Chapter 2) has been carried out. A starting point is listed in the table below:

Hadoop version        Path under /cm/shared/apps/hadoop

Apache 1.2.1          Apache/1.2.1/docs/index.html

Apache 2.2.0          Apache/2.2.0/share/doc/hadoop/index.html

Apache 2.6.0          Apache/2.6.0/share/doc/hadoop/index.html

Cloudera CDH 4.7.1    Cloudera/2.0.0-cdh4.7.1/share/doc/index.html

Cloudera CDH 5.3.2    Cloudera/2.5.0-cdh5.3.2/share/doc/index.html

Hortonworks HDP       Online documentation is available at http://docs.hortonworks.com


2 Installing Hadoop

In Bright Cluster Manager, a Hadoop instance can be configured and run either via the command-line (section 2.1) or via an ncurses GUI (section 2.2). Both options can be carried out with the cm-hadoop-setup script, which is run from a head node. The script is a part of the cluster-tools package, and uses tarballs from the Apache Hadoop project. The Bright Cluster Manager With Hadoop installation ISO includes the cm-apache-hadoop package, which contains tarballs from the Apache Hadoop project suitable for cm-hadoop-setup.

2.1 Command-line Installation Of Hadoop Using cm-hadoop-setup -c <filename>

2.1.1 Usage
[root@bright70 ~]# cm-hadoop-setup -h

USAGE: /cm/local/apps/cluster-tools/bin/cm-hadoop-setup [-c <filename> | -u <name> | -h]

OPTIONS:

-c <filename> -- Hadoop config file to use

-u <name>     -- uninstall Hadoop instance

-h            -- show usage

EXAMPLES:

cm-hadoop-setup -c /tmp/config.xml

cm-hadoop-setup -u foo

cm-hadoop-setup (no options, a GUI will be started)

Some sample configuration files are provided in the directory

/cm/local/apps/cluster-tools/hadoop/conf/

hadoop1conf.xml (for Hadoop 1.x)

hadoop2conf.xml (for Hadoop 2.x)

hadoop2fedconf.xml (for Hadoop 2.x with NameNode federation)

hadoop2haconf.xml (for Hadoop 2.x with High Availability)

hadoop2lustreconf.xml (for Hadoop 2.x with Lustre support)


2.1.2 An Install Run
An XML template can be used based on the examples in the directory /cm/local/apps/cluster-tools/hadoop/conf/.

In the XML template, the path for a tarball component is enclosed by <archive></archive> tag pairs. The tarball components can be as indicated:

• <archive>hadoop tarball</archive>

• <archive>hbase tarball</archive>

• <archive>zookeeper tarball</archive>

The tarball components can be picked up from URLs as listed in section 1.2. The paths of the tarball component files that are to be used should be set up as needed before running cm-hadoop-setup.

The downloaded tarball components should be placed in the /tmp directory if the default definitions in the default XML files are used:

Example

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/conf

[root@bright70 conf]# grep 'archive>' hadoop1conf.xml | grep -o /.*gz

/tmp/hadoop-1.2.1.tar.gz

/tmp/zookeeper-3.4.6.tar.gz

/tmp/hbase-0.98.11-hadoop1-bin.tar.gz

Files under /tmp are not intended to stay around permanently. The administrator may therefore wish to place the tarball components in a more permanent part of the filesystem instead, and change the XML definitions accordingly.

A Hadoop instance name, for example Myhadoop, can also be defined in the XML file, within the <name></name> tag pair.

Hadoop NameNodes and SecondaryNameNodes handle HDFS metadata, while DataNodes manage HDFS data. The data must be stored in the filesystem of the nodes. The default path for where the data is stored can be specified within the <dataroot></dataroot> tag pair. Multiple paths can also be set, using comma-separated paths. NameNodes, SecondaryNameNodes, and DataNodes each use the value or values set within the <dataroot></dataroot> tag pair for their root directories.

If needed, more specific tags can be used for each node type. This is useful in the case where hardware differs for the various node types. For example:

• a NameNode with 2 disk drives for Hadoop use

• a DataNode with 4 disk drives for Hadoop use

The XML file used by cm-hadoop-setup can in this case use the tag pairs:

• <namenodedatadirs></namenodedatadirs>

• <datanodedatadirs></datanodedatadirs>

If these are not specified, then the value within the <dataroot></dataroot> tag pair is used.


Example

• <namenodedatadirs>/data1,/data2</namenodedatadirs>

• <datanodedatadirs>/data1,/data2,/data3,/data4</datanodedatadirs>

Hadoop should then have the following dfs.*.dir properties added to it via the hdfs-site.xml configuration file. For the preceding tag pairs, the property values should be set as follows:

Example

• dfs.namenode.name.dir with values: /data1/hadoop/hdfs/namenode,/data2/hadoop/hdfs/namenode

• dfs.datanode.data.dir with values: /data1/hadoop/hdfs/datanode,/data2/hadoop/hdfs/datanode,/data3/hadoop/hdfs/datanode,/data4/hadoop/hdfs/datanode
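Putting the documented tag pairs together, a fragment of such an XML template might then look like the following sketch. Only tag pairs already discussed are shown; the complete structure can be taken from the sample files under /cm/local/apps/cluster-tools/hadoop/conf/:

<name>Myhadoop</name>
<archive>/tmp/hadoop-1.2.1.tar.gz</archive>
<dataroot>/data1,/data2</dataroot>
<namenodedatadirs>/data1,/data2</namenodedatadirs>
<datanodedatadirs>/data1,/data2,/data3,/data4</datanodedatadirs>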

An install run then displays output like the following:

Example

-rw-r--r-- 1 root root 63851630 Feb  4 15:13 hadoop-1.2.1.tar.gz

[root@bright70 ~]# cm-hadoop-setup -c /tmp/hadoop1conf.xml

Reading config from file '/tmp/hadoop1conf.xml'... done

Hadoop flavor 'Apache', release '1.2.1'

Will now install Hadoop in /cm/shared/apps/hadoop/Apache/1.2.1 and configure instance 'Myhadoop'

Hadoop distro being installed... done.

Zookeeper being installed... done.

HBase being installed... done.

Creating module file... done.

Configuring Hadoop instance on local filesystem and images... done.

Updating images:

starting imageupdate for node 'node003'... started

starting imageupdate for node 'node002'... started

starting imageupdate for node 'node001'... started

starting imageupdate for node 'node004'... started

Waiting for imageupdate to finish... done.

Creating Hadoop instance in cmdaemon... done.

Formatting HDFS... done.

Waiting for datanodes to come up... done.

Setting up HDFS... done.

The Hadoop instance should now be running. The name defined for it in the XML file should show up within cmgui, in the Hadoop HDFS resource tree folder (figure 2.1).


Figure 2.1: A Hadoop Instance In cmgui

The instance name is also displayed within cmsh when the list command is run in hadoop mode:

Example

[root@bright70 ~]# cmsh

[bright70]% hadoop

[bright70->hadoop]% list

Name (key) Hadoop version Hadoop distribution Configuration directory

---------- -------------- ------------------- -----------------------

Myhadoop 1.2.1 Apache /etc/hadoop/Myhadoop

The instance can be removed as follows:

Example

[root@bright70 ~]# cm-hadoop-setup -u Myhadoop

Requested uninstall of Hadoop instance 'Myhadoop'.

Uninstalling Hadoop instance 'Myhadoop'

Removing:

/etc/hadoop/Myhadoop

/var/lib/hadoop/Myhadoop

/var/log/hadoop/Myhadoop

/var/run/hadoop/Myhadoop

/tmp/hadoop/Myhadoop/

/etc/hadoop/Myhadoop/zookeeper

/var/lib/zookeeper/Myhadoop

/var/log/zookeeper/Myhadoop

/var/run/zookeeper/Myhadoop

/etc/hadoop/Myhadoop/hbase

/var/log/hbase/Myhadoop

/var/run/hbase/Myhadoop

/etc/init.d/hadoop-Myhadoop-*

Module file(s) deleted.

Uninstall successfully completed.

2.2 Ncurses Installation Of Hadoop Using cm-hadoop-setup

Running cm-hadoop-setup without any options starts up an ncurses GUI (figure 2.2).


Figure 2.2: The cm-hadoop-setup Welcome Screen

This provides an interactive way to add and remove Hadoop instances, along with HBase and ZooKeeper components. Some explanations of the items being configured are given along the way. In addition, some minor validation checks are done, and some options are restricted.

The suggested default values will work. Other values can be chosen instead of the defaults, but some care in selection is usually a good idea. This is because Hadoop is complex software, which means that values other than the defaults can sometimes lead to unworkable configurations (section 2.3).

The ncurses installation results in an XML configuration file. This file can be used with the -c option of cm-hadoop-setup to carry out the installation.

2.3 Avoiding Misconfigurations During Hadoop Installation

A misconfiguration can be defined as a configuration that works badly, or not at all.

For Hadoop to work well, the following common misconfigurations should normally be avoided. Some of these result in warnings from Bright Cluster Manager validation checks during configuration, but can be overridden. An override is useful for cases where the administrator would just like to, for example, test some issues not related to scale or performance.

2.3.1 NameNode Configuration Choices
One of the following NameNode configuration options must be chosen when Hadoop is installed. The choice should be made with care, because changing between the options after installation is not possible.

Hadoop 1.x NameNode Configuration Choices
• NameNode can optionally have a SecondaryNameNode.

SecondaryNameNode offloads metadata operations from NameNode, and also stores the metadata offline to some extent.

It is not by any means a high availability solution. While recovery from a failed head node is possible from SecondaryNameNode, it is not easy, and it is not recommended or supported by Bright Cluster Manager.

Hadoop 2.x NameNode Configuration Choices
• NameNode and SecondaryNameNode can run as in Hadoop 1.x.

However, the following configurations are also possible:

• NameNode HA with manual failover: In this configuration, Hadoop has NameNode1 and NameNode2 up at the same time, with one active and one on standby. Which NameNode is active and which is on standby is set manually by the administrator. If one NameNode fails, then failover must be executed manually. Metadata changes are managed by ZooKeeper, which relies on a quorum of JournalNodes. The number of JournalNodes is therefore set to 3, 5, 7, and so on.

• NameNode HA with automatic failover: As for the manual case, except that in this case ZooKeeper manages failover too. Which NameNode is active and which is on standby is therefore decided automatically.

• NameNode Federation: In NameNode Federation, the storage of metadata is split among several NameNodes, each of which has a corresponding SecondaryNameNode. Each pair takes care of a part of HDFS.

In Bright Cluster Manager, there are 4 NameNodes in a default NameNode federation:

– /user

– /tmp

– /staging

– /hbase

User applications do not have to know this mapping. This is because ViewFS on the client side maps the selected path to the corresponding NameNode. Thus, for example, hdfs -ls /tmp/example does not need to know that /tmp is managed by another NameNode.

Cloudera advise against using NameNode Federation for production purposes at present, due to its development status.

2.4 Installing Hadoop With Lustre
The Lustre filesystem has a client-server configuration.

2.4.1 Lustre Internal Server Installation
The procedure for installing a Lustre server varies.

The Bright Cluster Manager Knowledge Base article describes a procedure that uses Lustre sources from Whamcloud. The article is at http://kb.brightcomputing.com/faq/index.php?action=artikel&cat=9&id=176, and may be used as guidance.


2.4.2 Lustre External Server Installation
Lustre can also be configured so that the servers run external to Bright Cluster Manager. The Lustre Intel IEEL 2.x version can be configured in this manner.

2.4.3 Lustre Client Installation
It is preferred that the Lustre clients are installed on the head node, as well as on all the nodes that are to be Hadoop nodes. The clients should be configured to provide a Lustre mount on the nodes. If the Lustre client cannot be installed on the head node, then Bright Cluster Manager has the following limitations during installation and maintenance:

• the head node cannot be used to run Hadoop services

• end users cannot perform Hadoop operations, such as job submission, on the head node. Operations such as those should instead be carried out while logged in to one of the Hadoop nodes.

In the remainder of this section, a Lustre mount point of /mnt/lustre is assumed, but it can be set to any convenient directory mount point.

The user IDs and group IDs of the Lustre server and clients should be consistent. It is quite likely that they differ when first set up. The IDs should be checked, at least for the following users and groups:

• users: hdfs, mapred, yarn, hbase, zookeeper

• groups: hadoop, zookeeper, hbase

If they do not match on the server and clients, then they must be made consistent manually, so that the UID and GID of the Lustre server users are changed to match the UID and GID of the Bright Cluster Manager users.
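A quick way to compare the IDs on the server and on a client is the standard id utility. The numeric values shown here are purely illustrative:

[root@bright70 ~]# id hdfs
uid=600(hdfs) gid=600(hadoop) groups=600(hadoop)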

Once consistency has been checked, and read/write access is working to LustreFS, the Hadoop integration can be configured.

2.4.4 Lustre Hadoop Configuration
Lustre Hadoop XML Configuration File Setup
Hadoop integration can be configured by using the file /cm/local/apps/cluster-tools/hadoop/conf/hadoop2lustreconf.xml as a starting point for the configuration. It can be copied over to, for example, /root/hadooplustreconfig.xml.

The Intel Distribution for Hadoop (IDH) and Cloudera can both run with Lustre under Bright Cluster Manager. The configuration for these can be done as follows:

• IDH

– A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair:

Example

<afs>

<fstype>lustre</fstype>

<fsroot>/mnt/lustre/hadoop</fsroot>

</afs>

• Cloudera

– A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair.

– In addition, an <fsjar></fsjar> tag pair must be specified manually, for the jar that the Intel IEEL 2.x distribution provides:

Example

<afs>

<fstype>lustre</fstype>

<fsroot>/mnt/lustre/hadoop</fsroot>

<fsjar>/root/lustre/hadoop-lustre-plugin-2.3.0.jar</fsjar>

</afs>

The installation of the Lustre plugin is automatic if this jar name is set to the right name, when the cm-hadoop-setup script is run.

Lustre Hadoop Installation With cm-hadoop-setup
The XML configuration file specifies how Lustre should be integrated in Hadoop. If the configuration file is at /root/hadooplustreconfig.xml, then it can be run as:

Example

cm-hadoop-setup -c /root/hadooplustreconfig.xml

As part of configuring Hadoop to use Lustre, the execution will:

• Set the ACLs on the directory specified within the <fsroot></fsroot> tag pair. This was set to /mnt/lustre/hadoop earlier on, as an example.

• Copy the Lustre plugin from its jar path, as specified in the XML file, to the correct place on the client nodes.

Specifically, it is copied into the subdirectory share/hadoop/common/lib, relative to the Hadoop installation directory. For example, the Cloudera version of Hadoop, version 2.3.0-cdh5.1.2, has the Hadoop installation directory /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2. The copy is therefore carried out, in this case, from /root/lustre/hadoop-lustre-plugin-2.3.0.jar to /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2/share/hadoop/common/lib.


Lustre Hadoop Integration In cmsh and cmgui

In cmsh, Lustre integration is indicated in hadoop mode:

Example

[hadoop2->hadoop]% show hdfs1 | grep -i lustre

Hadoop root for Lustre /mnt/lustre/hadoop

Use Lustre yes

In cmgui, the Overview tab in the items for the Hadoop HDFS resource indicates if Lustre is running, along with its mount point (figure 2.3).

Figure 2.3: A Hadoop Instance With Lustre In cmgui

2.5 Hadoop Installation In A Cloud
Hadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual) or a Cluster Extension configuration (Chapter 3 of the Cloudbursting Manual). In both cases, the cloud nodes used should be at least m1.medium.

• For Cluster-On-Demand, the following considerations apply:

– There are no specific issues. After a stop/start cycle, Hadoop recognizes the new IP addresses, and refreshes the list of nodes accordingly (section 2.4.1 of the Cloudbursting Manual).

• For Cluster Extension, the following considerations apply:

– To install Hadoop on cloud nodes, the XML configuration /cm/local/apps/cluster-tools/hadoop/conf/hadoop2clusterextensionconf.xml can be used as a guide.


– In the hadoop2clusterextensionconf.xml file, the cloud director that is to be used with the Hadoop cloud nodes must be specified by the administrator, with the <edge></edge> tag pair:

Example

<edge>

<hosts>eu-west-1-director</hosts>

</edge>

Maintenance operations, such as a format, will automatically and transparently be carried out by cmdaemon running on the cloud director, and not on the head node.

There are some shortcomings as a result of relying upon the cloud director:

– Cloud nodes depend on the same cloud director.

– While Hadoop installation (cm-hadoop-setup) is run on the head node, users must run Hadoop commands (job submissions, and so on) from the director, not from the head node.

– It is not possible to mix cloud and non-cloud nodes for the same Hadoop instance. That is, a local Hadoop instance cannot be extended by adding cloud nodes.


3 Viewing And Managing HDFS From CMDaemon

3.1 cmgui And The HDFS Resource
In cmgui, clicking on the resource folder for Hadoop in the resource tree opens up an Overview tab that displays a list of Hadoop HDFS instances, as in figure 2.1.

Clicking on an instance brings up a tab from a series of tabs associated with that instance, as in figure 3.1.

Figure 3.1: Tabs For A Hadoop Instance In cmgui

The possible tabs are: Overview (section 3.1.1), Settings (section 3.1.2), Tasks (section 3.1.3), and Notes (section 3.1.4). Various sub-panes are displayed in the possible tabs.


3.1.1 The HDFS Instance Overview Tab
The Overview tab pane (figure 3.2) arranges Hadoop-related items in sub-panes for convenient viewing.

Figure 3.2: Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed:

• Statistics: In the first sub-pane row are statistics associated with the Hadoop instance. These include node data on the number of applications that are running, as well as memory and capacity usage.

• Metrics: In the second sub-pane row, a metric can be selected for display as a graph.

• Roles: In the third sub-pane row, the roles associated with a node are displayed.

3.1.2 The HDFS Instance Settings Tab
The Settings tab pane (figure 3.3) displays the configuration of the Hadoop instance.


Figure 3.3: Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed:

• WebHDFS: Allows HDFS to be modified via the web.

• Description: An arbitrary, user-defined description.

• Directory settings for the Hadoop instance: Directories for configuration, temporary purposes, and logging.

• Topology awareness setting: Optimizes Hadoop for racks, switches, or nothing, when it comes to network topology.

• Replication factors: Set the default and maximum replication allowed.

• Balancing settings: Spread loads evenly across the instance.

3.1.3 The HDFS Instance Tasks Tab
The Tasks tab (figure 3.4) displays Hadoop-related tasks that the administrator can execute.


Figure 3.4: Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to:

• Restart, start, or stop all the Hadoop daemons.

• Start/Stop the Hadoop balancer.

• Format the Hadoop filesystem. A warning is given before formatting.

• Carry out a manual failover for the NameNode. This feature means making the other NameNode the active one, and is available only for Hadoop instances that support NameNode failover.

3.1.4 The HDFS Instance Notes Tab
The Notes tab is like a jotting pad, and can be used by the administrator to write notes for the associated Hadoop instance.

3.2 cmsh And hadoop Mode
The cmsh front end can be used instead of cmgui to display information on Hadoop-related values, and to carry out Hadoop-related tasks.

3.2.1 The show And overview Commands
The show Command
Within hadoop mode, the show command displays parameters that correspond mostly to cmgui's Settings tab in the Hadoop resource (section 3.1.2):

Example

[root@bright70 conf]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Apache121   1.2.1           Apache               /etc/hadoop/Apache121
[bright70->hadoop]% show apache121
Parameter                             Value
------------------------------------- ------------------------------------
Automatic failover                    Disabled
Balancer Threshold                    10
Balancer period                       2
Balancing policy                      dataNode
Cluster ID
Configuration directory               /etc/hadoop/Apache121
Configuration directory for HBase     /etc/hadoop/Apache121/hbase
Configuration directory for ZooKeeper /etc/hadoop/Apache121/zookeeper
Creation time                         Fri, 07 Feb 2014 17:03:01 CET
Default replication factor            3
Enable HA for YARN                    no
Enable NFS gateway                    no
Enable Web HDFS                       no
HA enabled                            no
HA name service
HDFS Block size                       67108864
HDFS Permission                       no
HDFS Umask                            077
HDFS log directory                    /var/log/hadoop/Apache121
Hadoop distribution                   Apache
Hadoop root
Hadoop version                        1.2.1
Installation directory for HBase      /cm/shared/apps/hadoop/Apache/hbase-0+
Installation directory for ZooKeeper  /cm/shared/apps/hadoop/Apache/zookeepe+
Installation directory                /cm/shared/apps/hadoop/Apache/1.2.1
Maximum replication factor            512
Network
Revision
Root directory for data               /var/lib
Temporary directory                   /tmp/hadoop/Apache121
Topology                              Switch
Use HTTPS                             no
Use Lustre                            no
Use federation                        no
Use only HTTPS                        no
YARN automatic failover               Hadoop
description                           installed from /tmp/hadoop-1.2.1.tar+
name                                  Apache121
notes

The overview Command
Similarly, the overview command corresponds somewhat to cmgui's Overview tab in the Hadoop resource (section 3.1.1), and provides Hadoop-related information on the system resources that are used.

Example

[bright70->hadoop]% overview apache121
Parameter                      Value
------------------------------ ---------------------------
Name                           apache121
Capacity total                 0B
Capacity used                  0B
Capacity remaining             0B
Heap memory total              0B
Heap memory used               0B
Heap memory remaining          0B
Non-heap memory total          0B
Non-heap memory used           0B
Non-heap memory remaining      0B
Nodes available                0
Nodes dead                     0
Nodes decommissioned           0
Nodes decommission in progress 0
Total files                    0
Total blocks                   0
Missing blocks                 0
Under-replicated blocks        0
Scheduled replication blocks   0
Pending replication blocks     0
Block report average Time      0
Applications running           0
Applications pending           0
Applications submitted         0
Applications completed         0
Applications failed            0
Federation setup               no

Role                           Node
------------------------------ ---------------------------
DataNode, ZooKeeper            bright70,node001,node002

3.2.2 The Tasks Commands
Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The services Commands For Hadoop Services
Hadoop services can be started, stopped, and restarted with:

• restartallservices

• startallservices

• stopallservices

Example

[bright70->hadoop]% restartallservices apache121
Will now stop all Hadoop services for instance 'apache121'... done.
Will now start all Hadoop services for instance 'apache121'... done.
[bright70->hadoop]%

The startbalancer And stopbalancer Commands For Hadoop
For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer

• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:


• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp  Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%
Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121 For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
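
For instance, moving the active status to the NameNode on node002 for an instance hdfs1 might be done as follows. This is a hedged sketch; the instance and node names are placeholders rather than values taken from a run:

Example

[bright70->hadoop]% manualfailover --to node002 hdfs1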


4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:


• Selecting A Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui
The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


  The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means it includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.


5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220

================================================================

Press [CTRL+C] to stop

================================================================

================================================================

start cleaning directories

================================================================

================================================================

clean directories done

================================================================

================================================================

start doing gen_test

================================================================

14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2

14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2

Job Counters

Map-Reduce Framework


org.apache.hadoop.examples.terasort.TeraGen$Counters

14/03/24 15:07:03 INFO terasort.TeraSort: starting

14/03/24 15:09:12 INFO terasort.TeraSort: done

================================================================================

gen_test done

================================================================================

================================================================================

start doing PI test

================================================================================

Working Directory = /user/root/bbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                      Value
------------------------------ ---------------------------------
Name                           Apache220
Capacity total                 27.56GB
Capacity used                  72.46MB
Capacity remaining             16.41GB
Heap memory total              280.7MB
Heap memory used               152.1MB
Heap memory remaining          128.7MB
Non-heap memory total          258.1MB
Non-heap memory used           251.9MB
Non-heap memory remaining      6.155MB
Nodes available                3
Nodes dead                     0
Nodes decommissioned           0
Nodes decommission in progress 0
Total files                    72
Total blocks                   31
Missing blocks                 0
Under-replicated blocks        2
Scheduled replication blocks   0
Pending replication blocks     0
Block report average Time      59666
Applications running           1
Applications pending           0
Applications submitted         7
Applications completed         6
Applications failed            0
High availability              Yes (automatic failover disabled)
Federation setup               no

Role                                                           Node
-------------------------------------------------------------- -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper             node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.
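
In cmsh terms, one way the unset can be expressed is with the clear command. The following is a hedged sketch rather than a verified run:

Example

[bright70->user[fred]]% clear hadoophdfsaccess; commit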

The user fred can then submit a run from the pi value estimator in the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
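
For instance, a run with more maps and samples typically gets closer to the true value of pi. The figures below are illustrative only, and not reproduced from a real run:

Example

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 10000
Estimated value of Pi is 3.14159200000000000000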


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation, And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are, by default, installed and run on the head node. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup
The option

-p <rootpass>

is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option

-s <heapsize>

is not mandatory. If not set, a default value of 1GB is used.

The option

--master <nodename>

is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed
Finished

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed
Finished

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10
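
After the ingest job completes, the new table can be checked for from the Accumulo shell. This is a hedged sketch, not output reproduced from this manual; the shell prompts for the root password set at installation time:

Example

bash-4.1$ # the table listing should now include the 'sort' table created by TeraSortIngest
bash-4.1$ accumulo shell -u root -e 'tables'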

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation, And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

  – a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

  – cm-hive-setup is run with the --conn option to specify the connector version to be used:

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;
DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

  – <metastoredb> is the name of the Metastore database to be used by Hive. The same name is used later by cm-hive-setup.


  – <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

  – The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script, using hive and beeline.

An Example Run With cm-hive-setup

Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client
-- testing 'beeline' client
Hive setup validation... done.
Installation successfully completed
Finished


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed
Finished

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 -d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation, And What Kafka Installation Does
The following applies to using cm-kafka-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed
Finished
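
No usage run for Kafka is shown in this manual. As a hedged sketch, assuming that the module name follows the kafka/<instance> pattern used by the other projects in this chapter, and that node001 runs one of the instance's ZooKeeper servers on the default port 2181, a test topic can be created and listed with the standard Kafka 0.8.2 tools:

Example

[root@bright70 ~]# module load kafka/hdfs1
# create a single-partition, unreplicated topic named 'test'
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 \
--replication-factor 1 --partitions 1 --topic test
# confirm that the topic now exists
[root@bright70 ~]# kafka-topics.sh --list --zookeeper node001:2181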

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed
Finished

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.

Prerequisites For Pig Installation, And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed
Finished

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed
Finished

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
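
The contents of /tmp/smoke.pig are not reproduced in this manual. As a purely hypothetical illustration, a minimal word-count script of the kind that could serve as such a smoke test might look like:

Example

-- smoke.pig: count word occurrences in a text file already stored in HDFS
A = LOAD '/user/root/input.txt' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
DUMP D;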

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with graphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation, And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

  – Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

    It is possible to force the Standalone mode deployment by using the additional option --standalone.

    When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.

    The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

  – YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

    When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release 2.6.0
Spark will be installed in YARN (client/cluster) mode
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed
Finished

An Example Run With cm-spark-setup in standalone mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release 2.6.0
Spark will be installed to work in Standalone mode
Spark Master service will be run on node node005
Spark will use all DataNodes as WorkerNodes
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed
Finished

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed
Finished

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster. Its output ends up in the YARN logs, as sketched after this example.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar
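
In yarn-cluster mode, the driver output, including the computed value of Pi, goes to the application logs rather than to the terminal. As a hedged sketch, assuming that YARN log aggregation is enabled for the instance, the standard YARN log tool can retrieve it; the application ID shown is a placeholder:

Example

[root@bright70 ~]# yarn logs -applicationId application_1428485231528_0001 | grep 'Pi is'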

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop: it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.

Prerequisites For Sqoop Installation, And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

  – a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

  – cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used:

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz \
--conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed
Finished
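
No usage run for Sqoop is shown in this manual. As a hedged sketch, assuming a module name of sqoop/hdfs1 in line with the other projects in this chapter, and a hypothetical MySQL database testdb containing a table pets, an import into HDFS might look like:

Example

[root@bright70 ~]# module load sqoop/hdfs1
# import the MySQL table 'pets' into HDFS under /user/root/pets
[root@bright70 ~]# sqoop import --connect jdbc:mysql://bright70.cm.cluster/testdb \
--username root -P --table pets --target-dir /user/root/pets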


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed
Finished

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation, And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed
Finished

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.
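
The validation topology can also be checked and removed from the command line, using the storm client that becomes available after loading the module for the instance:

Example

[root@bright70 ~]# module load storm/hdfs1
# show running topologies, then remove the WordCount validation topology
[root@bright70 ~]# storm list
[root@bright70 ~]# storm kill WordCount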

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed
Finished

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2
470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15


1 Introduction

1.1 What Is Hadoop About
Hadoop is the core implementation of a technology mostly used in the analysis of data sets that are large and unstructured in comparison with the data stored in regular relational databases. The analysis of such data, called data-mining, can be done better with Hadoop than with relational databases, for certain types of parallelizable problems. Hadoop has the following characteristics in comparison with relational databases:

1. Less structured input: Key value pairs are used as records for the data sets instead of a database.

2. Scale-out rather than scale-up design: For large data sets, if the size of a parallelizable problem increases linearly, the corresponding cost of scaling up a single machine to solve it tends to grow exponentially, simply because the hardware requirements tend to get exponentially expensive. If, however, the system that solves it is a cluster, then the corresponding cost tends to grow linearly, because it can be solved by scaling out the cluster with a linear increase in the number of processing nodes.

   Scaling out can be done, with some effort, for database problems, using a parallel relational database implementation. However, scale-out is inherent in Hadoop, and therefore often easier to implement with Hadoop. The Hadoop scale-out approach is based on the following design:

   • Clustered storage: Instead of a single node with a special, large storage device, a distributed filesystem (HDFS) using commodity hardware devices across many nodes stores the data.

   • Clustered processing: Instead of using a single node with many processors, the parallel processing needs of the problem are distributed out over many nodes. The procedure is called the MapReduce algorithm, and is based on the following approach:

     – The distribution process "maps" the initial state of the problem into processes out to the nodes, ready to be handled in parallel.

     – Processing tasks are carried out on the data at nodes themselves.

     – The results are "reduced" back to one result.

3. Automated failure handling at application level, for data: Replication of the data takes place across the DataNodes, which are the nodes holding the data. If a DataNode has failed, then another node which has the replicated data on it is used instead automatically. Hadoop switches over quickly, in comparison to replicated database clusters, due to not having to check database table consistency.

1.2 Available Hadoop Implementations
Bright Cluster Manager supports the Hadoop implementations provided by the following organizations:

1. Apache (http://apache.org): This is the upstream source for the Hadoop core and some related components, which all the other implementations use.

2. Cloudera (http://www.cloudera.com): Cloudera provides some extra premium functionality and components on top of a Hadoop suite. One of the extra components that Cloudera provides is the Cloudera Management Suite, a major proprietary management layer, with some premium features.

3. Hortonworks (http://hortonworks.com): Hortonworks Data Platform (HDP) is a fully open source Hadoop suite.

4. Pivotal HD: Pivotal Hadoop Distribution is a completely Apache-compliant distribution with extensive analytic toolsets.

The ISO image for Bright Cluster Manager, available at http://www.brightcomputing.com/Download, can include Hadoop for all 4 implementations. During installation from the ISO, the administrator can choose which implementation to install (section 3.3.14 of the Installation Manual).

The contents and versions of the Hadoop implementations provided by Bright Computing at the time of writing (April 2015) are as follows:

• Apache, packaged as cm-apache-hadoop by Bright Computing. This provides:

  – Hadoop versions 1.2.1, 2.2.0, and 2.6.0

  – HBase version 0.98.11 for Hadoop 1.2.1

  – HBase version 1.0.0 for Hadoop 2.2.0 and Hadoop 2.6.0

  – ZooKeeper version 3.4.6

• Cloudera, packaged as cm-cloudera-hadoop by Bright Computing. This provides:

  – CDH 4.7.1 (based on Apache Hadoop 2.0, HBase 0.94.15, and ZooKeeper 3.4.5)

  – CDH 5.3.2 (based on Apache Hadoop 2.5.0, HBase 0.98.6, and ZooKeeper 3.4.5)


• Hortonworks, packaged as cm-hortonworks-hadoop by Bright Computing. This provides:

  – HDP 1.3.9 (based on Apache Hadoop 1.2.0, HBase 0.94.6, and ZooKeeper 3.4.5)

  – HDP 2.2.0 (based on Apache Hadoop 2.6.0, HBase 0.98.4, and ZooKeeper 3.4.6)

Pivotal
Pivotal HD version 2.1.0 is based on Apache Hadoop 2.2.0. It runs fully integrated on Bright Cluster Manager, but is not available as a package in the Bright Cluster Manager repositories. This is because Pivotal prefers to distribute the software itself. The user can download the software from https://network.pivotal.io/products/pivotal-hd, after registering with Pivotal.

1.3 Further Documentation
Further documentation is provided in the installed tarballs of the Hadoop version, after the Bright Cluster Manager installation (Chapter 2) has been carried out. A starting point is listed in the table below:

Hadoop version      Path under /cm/shared/apps/hadoop
Apache 1.2.1        Apache/1.2.1/docs/index.html
Apache 2.2.0        Apache/2.2.0/share/doc/hadoop/index.html
Apache 2.6.0        Apache/2.6.0/share/doc/hadoop/index.html
Cloudera CDH 4.7.1  Cloudera/2.0.0-cdh4.7.1/share/doc/index.html
Cloudera CDH 5.3.2  Cloudera/2.5.0-cdh5.3.2/share/doc/index.html
Hortonworks HDP     Online documentation is available at http://docs.hortonworks.com


2 Installing Hadoop

In Bright Cluster Manager, a Hadoop instance can be configured and run either via the command-line (section 2.1), or via an ncurses GUI (section 2.2). Both options can be carried out with the cm-hadoop-setup script, which is run from a head node. The script is a part of the cluster-tools package, and uses tarballs from the Apache Hadoop project. The Bright Cluster Manager With Hadoop installation ISO includes the cm-apache-hadoop package, which contains tarballs from the Apache Hadoop project suitable for cm-hadoop-setup.

2.1 Command-line Installation Of Hadoop Using cm-hadoop-setup -c <filename>

2.1.1 Usage
[root@bright70 ~]# cm-hadoop-setup -h

USAGE: /cm/local/apps/cluster-tools/bin/cm-hadoop-setup [-c <filename> | -u <name> | -h]

OPTIONS:
  -c <filename>  -- Hadoop config file to use
  -u <name>      -- uninstall Hadoop instance
  -h             -- show usage

EXAMPLES:
  cm-hadoop-setup -c /tmp/config.xml
  cm-hadoop-setup -u foo
  cm-hadoop-setup (no options: a gui will be started)

Some sample configuration files are provided in the directory /cm/local/apps/cluster-tools/hadoop/conf/:

  hadoop1conf.xml       (for Hadoop 1.x)
  hadoop2conf.xml       (for Hadoop 2.x)
  hadoop2fedconf.xml    (for Hadoop 2.x with NameNode federation)
  hadoop2haconf.xml     (for Hadoop 2.x with High Availability)
  hadoop2lustreconf.xml (for Hadoop 2.x with Lustre support)


2.1.2 An Install Run
An XML template can be used based on the examples in the directory /cm/local/apps/cluster-tools/hadoop/conf/.

In the XML template, the path for a tarball component is enclosed by <archive> </archive> tag pairs. The tarball components can be as indicated:

• <archive>hadoop tarball</archive>

• <archive>hbase tarball</archive>

• <archive>zookeeper tarball</archive>

The tarball components can be picked up from URLs as listed in section 1.2. The paths of the tarball component files that are to be used should be set up as needed before running cm-hadoop-setup.

The downloaded tarball components should be placed in the /tmp directory if the default definitions in the default XML files are used:

Example

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/conf
[root@bright70 conf]# grep 'archive>' hadoop1conf.xml | grep -o /.*gz
/tmp/hadoop-1.2.1.tar.gz
/tmp/zookeeper-3.4.6.tar.gz
/tmp/hbase-0.98.11-hadoop1-bin.tar.gz

Files under /tmp are not intended to stay around permanently. The administrator may therefore wish to place the tarball components in a more permanent part of the filesystem instead, and change the XML definitions accordingly.

A Hadoop instance name, for example Myhadoop, can also be defined in the XML file, within the <name></name> tag pair.

Hadoop NameNodes and SecondaryNameNodes handle HDFS metadata, while DataNodes manage HDFS data. The data must be stored in the filesystem of the nodes. The default path for where the data is stored can be specified within the <dataroot></dataroot> tag pair, as sketched in the example after this paragraph. Multiple paths can also be set, using comma-separated paths. NameNodes, SecondaryNameNodes, and DataNodes each use the value, or values, set within the <dataroot></dataroot> tag pair for their root directories.
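
For instance, a tag pair along the following lines would store the data under two paths. The paths are illustrative placeholders only:

Example

<dataroot>/data1,/data2</dataroot>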

If needed more specific tags can be used for each node type This isuseful in the case where hardware differs for the various node types Forexample

bull a NameNode with 2 disk drives for Hadoop use

bull a DataNode with 4 disk drives for Hadoop use

The XML file used by cm-hadoop-setup can in this case use the tagpairs

bull ltnamenodedatadirsgtltnamenodedatadirsgt

bull ltdatanodedatadirsgtltdatanodedatadirsgt

If these are not specified then the value within theltdatarootgtltdatarootgt tag pair is used

copy Bright Computing Inc

21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt 7

Example

bull ltnamenodedatadirsgtdata1data2ltnamenodedatadirsgt

bull ltdatanodedatadirsgtdata1data2data3data4ltdatanodedatadirsgt

Hadoop should then have the following dfsnamedir proper-ties added to it via the hdfs-sitexml configuration file For the pre-ceding tag pairs the property values should be set as follows

Example

• dfs.namenode.name.dir with values: /data1/hadoop/hdfs/namenode,/data2/hadoop/hdfs/namenode

• dfs.datanode.data.dir with values: /data1/hadoop/hdfs/datanode,/data2/hadoop/hdfs/datanode,/data3/hadoop/hdfs/datanode,/data4/hadoop/hdfs/datanode
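For reference, such a property appears in hdfs-site.xml in the standard Hadoop property format. The sketch below shows only the NameNode entry, assuming the values above; cm-hadoop-setup generates the actual entries:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data1/hadoop/hdfs/namenode,/data2/hadoop/hdfs/namenode</value>
</property>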

An install run then displays output like the following:

Example

-rw-r--r-- 1 root root 63851630 Feb  4 15:13 hadoop-1.2.1.tar.gz

[root@bright70 ~]# cm-hadoop-setup -c /tmp/hadoop1conf.xml
Reading config from file '/tmp/hadoop1conf.xml'... done.
Hadoop flavor 'Apache', release '1.2.1'
Will now install Hadoop in /cm/shared/apps/hadoop/Apache/1.2.1 and configure instance 'Myhadoop'
Hadoop distro being installed... done.
Zookeeper being installed... done.
HBase being installed... done.
Creating module file... done.
Configuring Hadoop instance on local filesystem and images... done.
Updating images:
starting imageupdate for node 'node003'... started.
starting imageupdate for node 'node002'... started.
starting imageupdate for node 'node001'... started.
starting imageupdate for node 'node004'... started.
Waiting for imageupdate to finish... done.
Creating Hadoop instance in cmdaemon... done.
Formatting HDFS... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.

The Hadoop instance should now be running. The name defined for it in the XML file should show up within cmgui, in the Hadoop HDFS resource tree folder (figure 2.1).


Figure 2.1: A Hadoop Instance In cmgui

The instance name is also displayed within cmsh when the list command is run in hadoop mode:

Example

[root@bright70 ~]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------- --------------- -------------------- ------------------------
Myhadoop    1.2.1           Apache               /etc/hadoop/Myhadoop

The instance can be removed as follows:

Example

[root@bright70 ~]# cm-hadoop-setup -u Myhadoop
Requested uninstall of Hadoop instance 'Myhadoop'.
Uninstalling Hadoop instance 'Myhadoop'
Removing:
/etc/hadoop/Myhadoop
/var/lib/hadoop/Myhadoop
/var/log/hadoop/Myhadoop
/var/run/hadoop/Myhadoop
/tmp/hadoop/Myhadoop
/etc/hadoop/Myhadoop/zookeeper
/var/lib/zookeeper/Myhadoop
/var/log/zookeeper/Myhadoop
/var/run/zookeeper/Myhadoop
/etc/hadoop/Myhadoop/hbase
/var/log/hbase/Myhadoop
/var/run/hbase/Myhadoop
/etc/init.d/hadoop-Myhadoop-*
Module file(s) deleted.
Uninstall successfully completed.

2.2 Ncurses Installation Of Hadoop Using cm-hadoop-setup

Running cm-hadoop-setup without any options starts up an ncurses GUI (figure 2.2).


Figure 2.2: The cm-hadoop-setup Welcome Screen

This provides an interactive way to add and remove Hadoop instances, along with HBase and ZooKeeper components. Some explanations of the items being configured are given along the way. In addition, some minor validation checks are done, and some options are restricted.

The suggested default values will work. Other values can be chosen instead of the defaults, but some care in selection is usually a good idea. This is because Hadoop is complex software, which means that values other than the defaults can sometimes lead to unworkable configurations (section 2.3).

The ncurses installation results in an XML configuration file. This file can be used with the -c option of cm-hadoop-setup to carry out the installation.

2.3 Avoiding Misconfigurations During Hadoop Installation

A misconfiguration can be defined as a configuration that works badly or not at all.

For Hadoop to work well, the following common misconfigurations should normally be avoided. Some of these result in warnings from Bright Cluster Manager validation checks during configuration, but can be overridden. An override is useful for cases where the administrator would just like to, for example, test some issues not related to scale or performance.

2.3.1 NameNode Configuration Choices
One of the following NameNode configuration options must be chosen when Hadoop is installed. The choice should be made with care, because changing between the options after installation is not possible.

Hadoop 1.x NameNode Configuration Choices

• NameNode can optionally have a SecondaryNameNode.

SecondaryNameNode offloads metadata operations from NameNode, and also stores the metadata offline to some extent.

It is not by any means a high availability solution. While recovery from a failed head node is possible from SecondaryNameNode, it is not easy, and it is not recommended or supported by Bright Cluster Manager.

Hadoop 2.x NameNode Configuration Choices

• NameNode and SecondaryNameNode can run as in Hadoop 1.x.

However, the following configurations are also possible:

• NameNode HA with manual failover: In this configuration, Hadoop has NameNode1 and NameNode2 up at the same time, with one active and one on standby. Which NameNode is active and which is on standby is set manually by the administrator. If one NameNode fails, then failover must be executed manually. Metadata changes are managed by ZooKeeper, which relies on a quorum of JournalNodes. The number of JournalNodes is therefore set to 3, 5, 7...

• NameNode HA with automatic failover: As for the manual case, except that in this case ZooKeeper manages failover too. Which NameNode is active and which is on standby is therefore decided automatically.

• NameNode Federation: In NameNode Federation, the storage of metadata is split among several NameNodes, each of which has a corresponding SecondaryNameNode. Each pair takes care of a part of HDFS.

In Bright Cluster Manager there are 4 NameNodes in a default NameNode federation:

– /user

– /tmp

– /staging

– /hbase

User applications do not have to know this mapping. This is because ViewFS on the client side maps the selected path to the corresponding NameNode. Thus, for example, a listing of /tmp/example with hdfs dfs -ls does not need to know that /tmp is managed by another NameNode.

Cloudera advises against using NameNode Federation for production purposes at present, due to its development status.

2.4 Installing Hadoop With Lustre
The Lustre filesystem has a client-server configuration.

2.4.1 Lustre Internal Server Installation
The procedure for installing a Lustre server varies.

The Bright Cluster Manager Knowledge Base article describes a procedure that uses Lustre sources from Whamcloud. The article is at http://kb.brightcomputing.com/faq/index.php?action=artikel&cat=9&id=176, and may be used as guidance.


2.4.2 Lustre External Server Installation
Lustre can also be configured so that the servers run external to Bright Cluster Manager. The Lustre Intel IEEL 2.x version can be configured in this manner.

2.4.3 Lustre Client Installation
It is preferred that the Lustre clients are installed on the head node, as well as on all the nodes that are to be Hadoop nodes. The clients should be configured to provide a Lustre mount on the nodes. If the Lustre client cannot be installed on the head node, then Bright Cluster Manager has the following limitations during installation and maintenance:

• the head node cannot be used to run Hadoop services

• end users cannot perform Hadoop operations, such as job submission, on the head node. Operations such as those should instead be carried out while logged in to one of the Hadoop nodes.

In the remainder of this section, a Lustre mount point of /mnt/lustre is assumed, but it can be set to any convenient directory mount point.

The user IDs and group IDs of the Lustre server and clients should be consistent. It is quite likely that they differ when first set up. The IDs should be checked at least for the following users and groups:

• users: hdfs, mapred, yarn, hbase, zookeeper

• groups: hadoop, zookeeper, hbase

If they do not match on the server and clients, then they must be made consistent manually, so that the UID and GID of the Lustre server users are changed to match the UID and GID of the Bright Cluster Manager users.
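A simple way to compare the IDs is to print them on the Lustre server and on a client, and check the outputs against each other. The following loops are just a sketch using standard tools:

[root@bright70 ~]# for u in hdfs mapred yarn hbase zookeeper; do id $u; done
[root@bright70 ~]# for g in hadoop zookeeper hbase; do getent group $g; done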

Once consistency has been checked, and read/write access is working to LustreFS, the Hadoop integration can be configured.

2.4.4 Lustre Hadoop Configuration
Lustre Hadoop XML Configuration File Setup
Hadoop integration can be configured by using the file /cm/local/apps/cluster-tools/hadoop/conf/hadoop2lustreconf.xml as a starting point for the configuration. It can be copied over to, for example, /root/hadooplustreconfig.xml.

The Intel Distribution for Hadoop (IDH) and Cloudera can both run with Lustre under Bright Cluster Manager. The configuration for these can be done as follows:

• IDH

– A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair:

Example

<afs>
<fstype>lustre</fstype>
<fsroot>/mnt/lustre/hadoop</fsroot>
</afs>

• Cloudera

– A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair.

– In addition, an <fsjar></fsjar> tag pair must be specified manually, for the jar that the Intel IEEL 2.x distribution provides:

Example

<afs>
<fstype>lustre</fstype>
<fsroot>/mnt/lustre/hadoop</fsroot>
<fsjar>/root/lustre/hadoop-lustre-plugin-2.3.0.jar</fsjar>
</afs>

The installation of the Lustre plugin is automatic if this jar name is set to the right name, when the cm-hadoop-setup script is run.

Lustre Hadoop Installation With cm-hadoop-setup
The XML configuration file specifies how Lustre should be integrated in Hadoop. If the configuration file is at /root/hadooplustreconfig.xml, then it can be run as:

Example

cm-hadoop-setup -c /root/hadooplustreconfig.xml

As part of configuring Hadoop to use Lustre, the execution will:

• Set the ACLs on the directory specified within the <fsroot></fsroot> tag pair. This was set to /mnt/lustre/hadoop earlier on, as an example.

• Copy the Lustre plugin from its jar path, as specified in the XML file, to the correct place on the client nodes.

Specifically, the subdirectory ./share/hadoop/common/lib is copied into a directory relative to the Hadoop installation directory. For example, the Cloudera version of Hadoop, version 2.3.0-cdh5.1.2, has the Hadoop installation directory /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2. The copy is therefore carried out, in this case, from:
/root/lustre/hadoop-lustre-plugin-2.3.0.jar
to
/cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2/share/hadoop/common/lib


Lustre Hadoop Integration In cmsh And cmgui

In cmsh, Lustre integration is indicated in hadoop mode:

Example

[hadoop2->hadoop]% show hdfs1 | grep -i lustre
Hadoop root for Lustre  /mnt/lustre/hadoop
Use Lustre              yes

In cmgui, the Overview tab in the items for the Hadoop HDFS resource indicates if Lustre is running, along with its mount point (figure 2.3).

Figure 2.3: A Hadoop Instance With Lustre In cmgui

2.5 Hadoop Installation In A Cloud
Hadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual), or a Cluster Extension configuration (Chapter 3 of the Cloudbursting Manual). In both cases, the cloud nodes used should be at least m1.medium.

• For Cluster-On-Demand, the following considerations apply:

– There are no specific issues. After a stop/start cycle, Hadoop recognizes the new IP addresses, and refreshes the list of nodes accordingly (section 2.4.1 of the Cloudbursting Manual).

• For Cluster Extension, the following considerations apply:

– To install Hadoop on cloud nodes, the XML configuration /cm/local/apps/cluster-tools/hadoop/conf/hadoop2clusterextensionconf.xml can be used as a guide.


– In the hadoop2clusterextensionconf.xml file, the cloud director that is to be used with the Hadoop cloud nodes must be specified by the administrator with the <edge></edge> tag pair:

Example

<edge>
<hosts>eu-west-1-director</hosts>
</edge>

Maintenance operations, such as a format, will automatically and transparently be carried out by cmdaemon running on the cloud director, and not on the head node.

There are some shortcomings as a result of relying upon the cloud director:

– Cloud nodes depend on the same cloud director.

– While Hadoop installation (cm-hadoop-setup) is run on the head node, users must run Hadoop commands (job submissions and so on) from the director, not from the head node.

– It is not possible to mix cloud and non-cloud nodes for the same Hadoop instance. That is, a local Hadoop instance cannot be extended by adding cloud nodes.


3 Viewing And Managing HDFS From CMDaemon

3.1 cmgui And The HDFS Resource
In cmgui, clicking on the resource folder for Hadoop in the resource tree opens up an Overview tab that displays a list of Hadoop HDFS instances, as in figure 2.1.

Clicking on an instance brings up a tab from a series of tabs associated with that instance, as in figure 3.1.

Figure 3.1: Tabs For A Hadoop Instance In cmgui

The possible tabs are: Overview (section 3.1.1), Settings (section 3.1.2), Tasks (section 3.1.3), and Notes (section 3.1.4). Various sub-panes are displayed in the possible tabs.


3.1.1 The HDFS Instance Overview Tab
The Overview tab pane (figure 3.2) arranges Hadoop-related items in sub-panes for convenient viewing.

Figure 3.2: Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed:

• Statistics: In the first sub-pane row are statistics associated with the Hadoop instance. These include node data on the number of applications that are running, as well as memory and capacity usage.

• Metrics: In the second sub-pane row, a metric can be selected for display as a graph.

• Roles: In the third sub-pane row, the roles associated with a node are displayed.

3.1.2 The HDFS Instance Settings Tab
The Settings tab pane (figure 3.3) displays the configuration of the Hadoop instance.


Figure 3.3: Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed:

• WebHDFS: Allows HDFS to be modified via the web (a request sketch is shown after this list).

• Description: An arbitrary, user-defined description.

• Directory settings for the Hadoop instance: Directories for configuration, temporary purposes, and logging.

• Topology awareness setting: Optimizes Hadoop for racks, switches, or nothing, when it comes to network topology.

• Replication factors: Set the default and maximum replication allowed.

• Balancing settings: Spread loads evenly across the instance.
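As an illustration of what enabling WebHDFS makes possible, a directory listing can be requested over HTTP with the standard WebHDFS REST API. The node name and port in the sketch below are assumptions (50070 is the stock NameNode HTTP port), not values taken from this manual:

[root@bright70 ~]# curl -i "http://node001:50070/webhdfs/v1/user?op=LISTSTATUS"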

3.1.3 The HDFS Instance Tasks Tab
The Tasks tab (figure 3.4) displays Hadoop-related tasks that the administrator can execute.


Figure 3.4: Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to:

• Restart, start, or stop all the Hadoop daemons.

• Start/Stop the Hadoop balancer.

• Format the Hadoop filesystem. A warning is given before formatting.

• Carry out a manual failover for the NameNode. This feature means making the other NameNode the active one, and is available only for Hadoop instances that support NameNode failover.

3.1.4 The HDFS Instance Notes Tab
The Notes tab is like a jotting pad, and can be used by the administrator to write notes for the associated Hadoop instance.

3.2 cmsh And hadoop Mode
The cmsh front end can be used instead of cmgui to display information on Hadoop-related values, and to carry out Hadoop-related tasks.

3.2.1 The show And overview Commands
The show Command
Within hadoop mode, the show command displays parameters that correspond mostly to cmgui's Settings tab in the Hadoop resource (section 3.1.2):

Example

[root@bright70 conf]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------- --------------- -------------------- ------------------------
Apache121   1.2.1           Apache               /etc/hadoop/Apache121
[bright70->hadoop]% show apache121
Parameter                               Value
--------------------------------------- ------------------------------------
Automatic failover                      Disabled
Balancer Threshold                      10
Balancer period                         2
Balancing policy                        dataNode
Cluster ID
Configuration directory                 /etc/hadoop/Apache121
Configuration directory for HBase       /etc/hadoop/Apache121/hbase
Configuration directory for ZooKeeper   /etc/hadoop/Apache121/zookeeper
Creation time                           Fri, 07 Feb 2014 17:03:01 CET
Default replication factor              3
Enable HA for YARN                      no
Enable NFS gateway                      no
Enable Web HDFS                         no
HA enabled                              no
HA name service
HDFS Block size                         67108864
HDFS Permission                         no
HDFS Umask                              077
HDFS log directory                      /var/log/hadoop/Apache121
Hadoop distribution                     Apache
Hadoop root
Hadoop version                          1.2.1
Installation directory for HBase        /cm/shared/apps/hadoop/Apache/hbase-0+
Installation directory for ZooKeeper    /cm/shared/apps/hadoop/Apache/zookeepe+
Installation directory                  /cm/shared/apps/hadoop/Apache/1.2.1
Maximum replication factor              512
Network
Revision
Root directory for data                 /var/lib
Temporary directory                     /tmp/hadoop/Apache121
Topology                                Switch
Use HTTPS                               no
Use Lustre                              no
Use federation                          no
Use only HTTPS                          no
YARN automatic failover                 Hadoop
description                             installed from: /tmp/hadoop-1.2.1.tar+
name                                    Apache121
notes

The overview Command
Similarly, the overview command corresponds somewhat to cmgui's Overview tab in the Hadoop resource (section 3.1.1), and provides Hadoop-related information on the system resources that are used.

Example

[bright70->hadoop]% overview apache121
Parameter                       Value
------------------------------- ---------------------------
Name                            apache121
Capacity total                  0B
Capacity used                   0B
Capacity remaining              0B
Heap memory total               0B
Heap memory used                0B
Heap memory remaining           0B
Non-heap memory total           0B
Non-heap memory used            0B
Non-heap memory remaining       0B
Nodes available                 0
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     0
Total blocks                    0
Missing blocks                  0
Under-replicated blocks         0
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       0
Applications running            0
Applications pending            0
Applications submitted          0
Applications completed          0
Applications failed             0
Federation setup                no

Role                            Node
------------------------------- ---------------------------
DataNode, ZooKeeper             bright70,node001,node002

3.2.2 The Tasks Commands
Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The services Commands For Hadoop Services
Hadoop services can be started, stopped, and restarted with:

• restartallservices

• startallservices

• stopallservices

Example

[bright70->hadoop]% restartallservices apache121
Will now stop all Hadoop services for instance 'apache121'... done.
Will now start all Hadoop services for instance 'apache121'... done.
[bright70->hadoop]%

The startbalancer And stopbalancer Commands For Hadoop
For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer

• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:

copy Bright Computing Inc

32 cmsh And hadoop Mode 21

• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop*]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp    Iteration    Bytes Moved    Bytes To Move    Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%

Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121
For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
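A hypothetical invocation, assuming an HA-enabled instance named hdfs1 with NameNodes on node001 and node002, might look like:

[bright70->hadoop]% manualfailover --from node001 --to node002 hdfs1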


4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service instance, values can be edited via actions in the associated pane:


• Selecting A Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui
The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means it includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.


5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this, in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220

================================================================
Press [CTRL+C] to stop...
================================================================
================================================================
start cleaning directories...
================================================================
================================================================
clean directories done
================================================================
================================================================
start doing gen_test...
================================================================
14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2
...
Job Counters
...
Map-Reduce Framework
...
org.apache.hadoop.examples.terasort.TeraGen$Counters
...
14/03/24 15:07:03 INFO terasort.TeraSort: starting
...
14/03/24 15:09:12 INFO terasort.TeraSort: done
================================================================================
gen_test done
================================================================================
================================================================================
start doing PI test...
================================================================================
Working Directory = /user/root/bbp
...

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                       Value
------------------------------- ---------------------------------
Name                            Apache220
Capacity total                  27.56GB
Capacity used                   72.46MB
Capacity remaining              16.41GB
Heap memory total               280.7MB
Heap memory used                152.1MB
Heap memory remaining           128.7MB
Non-heap memory total           258.1MB
Non-heap memory used            251.9MB
Non-heap memory remaining       6.155MB
Nodes available                 3
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     72
Total blocks                    31
Missing blocks                  0
Under-replicated blocks         2
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       59666
Applications running            1
Applications pending            0
Applications submitted          7
Applications completed          6
Applications failed             0
High availability               Yes (automatic failover disabled)
Federation setup                no

Role                                                           Node
-------------------------------------------------------------- -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper             node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user, fred, can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.

The user fred can then submit a run from a pi value estimator, from the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache/2.2.0
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
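For instance, increasing both parameters gives a closer estimate, at the cost of a longer run. The values below are only illustrative:

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 10000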


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are, by default, installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option:

-p <rootpass>

is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option:

-s <heapsize>

is not mandatory. If not set, a default value of 1GB is used.

The option:

--master <nodename>

is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use.

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;
DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.

– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node.

It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script, using hive and beeline.

An Example Run With cm-hive-setup
Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client...
-- testing 'beeline' client...
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 -d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
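After installation, a quick produce/consume smoke test can be run with the console tools shipped in the Kafka tarball. The module name, broker host, port numbers, and topic below are assumptions for illustration, not values produced by cm-kafka-setup:

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 --replication-factor 1 --partitions 1 --topic smoketest
[root@bright70 ~]# echo "hello" | kafka-console-producer.sh --broker-list node001:9092 --topic smoketest
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 --topic smoketest --from-beginning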

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
...
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
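The content of /tmp/smoke.pig is not shown in this manual. A hypothetical minimal Pig Latin script of that kind, which counts words in an HDFS file /tmp/input.txt (an assumed path), could look like:

-- load lines, split into words, group by word, and count (hypothetical example)
lines = LOAD '/tmp/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;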

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option: --standalone

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.

The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup In Standalone Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode
Spark Master service will be run on node node005
Spark will use all DataNodes as WorkerNodes
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.


Installation successfully completed

Finished

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client --class \
org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster --class \
org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop: it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not, as yet, support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option, in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup
Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
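Once installed, a typical use of Sqoop is a bulk import of a MySQL table into HDFS. The module name, database, credentials, table, and target directory below are illustrative assumptions, not values from this manual:

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://node005/testdb \
--username fred --password secret --table mytable --target-dir /user/fred/mytable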


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance, by default. A node other than master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2
...
470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list
...
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15


Page 8: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

2 Introduction

– The results are "reduced" back to one result.

3. Automated failure handling at application level, for data: Replication of the data takes place across the DataNodes, which are the nodes holding the data. If a DataNode has failed, then another node which has the replicated data on it is used instead, automatically. Hadoop switches over quickly in comparison to replicated database clusters, due to not having to check database table consistency.

1.2 Available Hadoop Implementations
Bright Cluster Manager supports the Hadoop implementations provided by the following organizations:

1. Apache (http://apache.org): This is the upstream source for the Hadoop core and some related components, which all the other implementations use.

2. Cloudera (http://www.cloudera.com): Cloudera provides some extra premium functionality and components on top of a Hadoop suite. One of the extra components that Cloudera provides is the Cloudera Management Suite, a major proprietary management layer, with some premium features.

3. Hortonworks (http://hortonworks.com): Hortonworks Data Platform (HDP) is a fully open source Hadoop suite.

4. Pivotal HD: Pivotal Hadoop Distribution is a completely Apache-compliant distribution with extensive analytic toolsets.

The ISO image for Bright Cluster Manager, available at http://www.brightcomputing.com/Download, can include Hadoop for all 4 implementations. During installation from the ISO, the administrator can choose which implementation to install (section 3.3.14 of the Installation Manual).

The contents and versions of the Hadoop implementations provided by Bright Computing at the time of writing (April 2015) are as follows:

• Apache, packaged as cm-apache-hadoop by Bright Computing. This provides:

– Hadoop versions 1.2.1, 2.2.0, and 2.6.0
– HBase version 0.98.11 for Hadoop 1.2.1
– HBase version 1.0.0 for Hadoop 2.2.0 and Hadoop 2.6.0
– ZooKeeper version 3.4.6

• Cloudera, packaged as cm-cloudera-hadoop by Bright Computing. This provides:

– CDH 4.7.1 (based on Apache Hadoop 2.0, HBase 0.94.15, and ZooKeeper 3.4.5)
– CDH 5.3.2 (based on Apache Hadoop 2.5.0, HBase 0.98.6, and ZooKeeper 3.4.5)


• Hortonworks, packaged as cm-hortonworks-hadoop by Bright Computing. This provides:

– HDP 1.3.9 (based on Apache Hadoop 1.2.0, HBase 0.94.6, and ZooKeeper 3.4.5)
– HDP 2.2.0 (based on Apache Hadoop 2.6.0, HBase 0.98.4, and ZooKeeper 3.4.6)

• Pivotal: Pivotal HD version 2.1.0 is based on Apache Hadoop 2.2.0. It runs fully integrated on Bright Cluster Manager, but is not available as a package in the Bright Cluster Manager repositories. This is because Pivotal prefers to distribute the software itself. The user can download the software from https://network.pivotal.io/products/pivotal-hd, after registering with Pivotal.

1.3 Further Documentation
Further documentation is provided in the installed tarballs of the Hadoop version, after the Bright Cluster Manager installation (Chapter 2) has been carried out. A starting point is listed in the table below:

Hadoop version       Path under /cm/shared/apps/hadoop
--------------       ----------------------------------
Apache 1.2.1         Apache/1.2.1/docs/index.html
Apache 2.2.0         Apache/2.2.0/share/doc/hadoop/index.html
Apache 2.6.0         Apache/2.6.0/share/doc/hadoop/index.html
Cloudera CDH 4.7.1   Cloudera/2.0.0-cdh4.7.1/share/doc/index.html
Cloudera CDH 5.3.2   Cloudera/2.5.0-cdh5.3.2/share/doc/index.html
Hortonworks HDP      Online documentation is available at http://docs.hortonworks.com


2 Installing Hadoop

In Bright Cluster Manager, a Hadoop instance can be configured and run either via the command-line (section 2.1) or via an ncurses GUI (section 2.2). Both options can be carried out with the cm-hadoop-setup script, which is run from a head node. The script is a part of the cluster-tools package, and uses tarballs from the Apache Hadoop project. The Bright Cluster Manager With Hadoop installation ISO includes the cm-apache-hadoop package, which contains tarballs from the Apache Hadoop project suitable for cm-hadoop-setup.

2.1 Command-line Installation Of Hadoop Using cm-hadoop-setup -c <filename>

2.1.1 Usage
[root@bright70 ~]# cm-hadoop-setup -h

USAGE: /cm/local/apps/cluster-tools/bin/cm-hadoop-setup [-c <filename> | -u <name> | -h]

OPTIONS:
  -c <filename>  -- Hadoop config file to use
  -u <name>      -- uninstall Hadoop instance
  -h             -- show usage

EXAMPLES:
  cm-hadoop-setup -c /tmp/config.xml
  cm-hadoop-setup -u foo
  cm-hadoop-setup   (no options: a GUI will be started)

Some sample configuration files are provided in the directory /cm/local/apps/cluster-tools/hadoop/conf/:

  hadoop1conf.xml        (for Hadoop 1.x)
  hadoop2conf.xml        (for Hadoop 2.x)
  hadoop2fedconf.xml     (for Hadoop 2.x with NameNode federation)
  hadoop2haconf.xml      (for Hadoop 2.x with High Availability)
  hadoop2lustreconf.xml  (for Hadoop 2.x with Lustre support)


2.1.2 An Install Run
An XML template can be used, based on the examples in the directory /cm/local/apps/cluster-tools/hadoop/conf/.

In the XML template, the path for a tarball component is enclosed by <archive></archive> tag pairs. The tarball components can be as indicated:

• <archive>hadoop tarball</archive>

• <archive>hbase tarball</archive>

• <archive>zookeeper tarball</archive>

The tarball components can be picked up from URLs as listed in section 1.2. The paths of the tarball component files that are to be used should be set up as needed before running cm-hadoop-setup.

The downloaded tarball components should be placed in the /tmp directory, if the default definitions in the default XML files are used:

Example

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/conf
[root@bright70 conf]# grep 'archive>' hadoop1conf.xml | grep -o '/.*gz'
/tmp/hadoop-1.2.1.tar.gz
/tmp/zookeeper-3.4.6.tar.gz
/tmp/hbase-0.98.11-hadoop1-bin.tar.gz

Files under /tmp are not intended to stay around permanently. The administrator may therefore wish to place the tarball components in a more permanent part of the filesystem instead, and change the XML definitions accordingly.
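For instance, the tarballs could be moved to a hypothetical permanent directory /local/tarballs/, and the paths in a working copy of the XML file updated in one pass. A sketch, in which the directory and file names are assumptions:

Example

[root@bright70 conf]# mkdir -p /local/tarballs
[root@bright70 conf]# mv /tmp/*.tar.gz /local/tarballs/
[root@bright70 conf]# cp hadoop1conf.xml /root/myconf.xml
[root@bright70 conf]# sed -i 's|/tmp/|/local/tarballs/|g' /root/myconf.xml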

A Hadoop instance name, for example Myhadoop, can also be defined in the XML file, within the <name></name> tag pair.

Hadoop NameNodes and SecondaryNameNodes handle HDFS metadata, while DataNodes manage HDFS data. The data must be stored in the filesystem of the nodes. The default path for where the data is stored can be specified within the <dataroot></dataroot> tag pair. Multiple paths can also be set, using comma-separated paths. NameNodes, SecondaryNameNodes, and DataNodes each use the value, or values, set within the <dataroot></dataroot> tag pair for their root directories.

If needed, more specific tags can be used for each node type. This is useful in the case where hardware differs for the various node types. For example:

• a NameNode with 2 disk drives for Hadoop use

• a DataNode with 4 disk drives for Hadoop use

The XML file used by cm-hadoop-setup can in this case use the tag pairs:

• <namenodedatadirs></namenodedatadirs>

• <datanodedatadirs></datanodedatadirs>

If these are not specified, then the value within the <dataroot></dataroot> tag pair is used.


Example

• <namenodedatadirs>/data1,/data2</namenodedatadirs>

• <datanodedatadirs>/data1,/data2,/data3,/data4</datanodedatadirs>

Hadoop should then have the corresponding dfs.*.dir properties added to it via the hdfs-site.xml configuration file. For the preceding tag pairs, the property values should be set as follows:

Example

• dfs.namenode.name.dir, with values: /data1/hadoop/hdfs/namenode,/data2/hadoop/hdfs/namenode

• dfs.datanode.data.dir, with values: /data1/hadoop/hdfs/datanode,/data2/hadoop/hdfs/datanode,/data3/hadoop/hdfs/datanode,/data4/hadoop/hdfs/datanode
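Once the instance is up, the values that actually took effect can be verified on a Hadoop 2.x instance with the standard hdfs getconf tool. A minimal sketch, assuming the module for the instance has been loaded (module avail shows the exact name):

Example

[root@bright70 ~]# hdfs getconf -confKey dfs.namenode.name.dir
[root@bright70 ~]# hdfs getconf -confKey dfs.datanode.data.dir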

An install run then displays output like the following

Example

-rw-r--r-- 1 root root 63851630 Feb  4 15:13 hadoop-1.2.1.tar.gz

[root@bright70 ~]# cm-hadoop-setup -c /tmp/hadoop1conf.xml
Reading config from file '/tmp/hadoop1conf.xml'... done.
Hadoop flavor 'Apache', release '1.2.1'
Will now install Hadoop in /cm/shared/apps/hadoop/Apache/1.2.1 and configure instance 'Myhadoop'
Hadoop distro being installed... done.
Zookeeper being installed... done.
HBase being installed... done.
Creating module file... done.
Configuring Hadoop instance on local filesystem and images... done.
Updating images:
starting imageupdate for node 'node003'... started.
starting imageupdate for node 'node002'... started.
starting imageupdate for node 'node001'... started.
starting imageupdate for node 'node004'... started.
Waiting for imageupdate to finish... done.
Creating Hadoop instance in cmdaemon... done.
Formatting HDFS... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.

The Hadoop instance should now be running. The name defined for it in the XML file should show up within cmgui, in the Hadoop HDFS resource tree folder (figure 2.1).


Figure 2.1: A Hadoop Instance In cmgui

The instance name is also displayed within cmsh when the list command is run in hadoop mode:

Example

[root@bright70 ~]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Myhadoop    1.2.1           Apache               /etc/hadoop/Myhadoop

The instance can be removed as follows

Example

[root@bright70 ~]# cm-hadoop-setup -u Myhadoop
Requested uninstall of Hadoop instance 'Myhadoop'.
Uninstalling Hadoop instance 'Myhadoop'. Removing:
/etc/hadoop/Myhadoop
/var/lib/hadoop/Myhadoop
/var/log/hadoop/Myhadoop
/var/run/hadoop/Myhadoop
/tmp/hadoop/Myhadoop
/etc/hadoop/Myhadoop/zookeeper
/var/lib/zookeeper/Myhadoop
/var/log/zookeeper/Myhadoop
/var/run/zookeeper/Myhadoop
/etc/hadoop/Myhadoop/hbase
/var/log/hbase/Myhadoop
/var/run/hbase/Myhadoop
/etc/init.d/hadoop-Myhadoop-*
Module file(s) deleted.
Uninstall successfully completed.

2.2 Ncurses Installation Of Hadoop Using cm-hadoop-setup

Running cm-hadoop-setup without any options starts up an ncurses GUI (figure 2.2).


Figure 2.2: The cm-hadoop-setup Welcome Screen

This provides an interactive way to add and remove Hadoop instances, along with HBase and ZooKeeper components. Some explanations of the items being configured are given along the way. In addition, some minor validation checks are done, and some options are restricted.

The suggested default values will work. Other values can be chosen instead of the defaults, but some care in selection is usually a good idea. This is because Hadoop is complex software, which means that values other than the defaults can sometimes lead to unworkable configurations (section 2.3).

The ncurses installation results in an XML configuration file. This file can be used with the -c option of cm-hadoop-setup to carry out the installation.

2.3 Avoiding Misconfigurations During Hadoop Installation

A misconfiguration can be defined as a configuration that works badly, or not at all.

For Hadoop to work well, the following common misconfigurations should normally be avoided. Some of these result in warnings from Bright Cluster Manager validation checks during configuration, but can be overridden. An override is useful for cases where the administrator would just like to, for example, test some issues not related to scale or performance.

2.3.1 NameNode Configuration Choices
One of the following NameNode configuration options must be chosen when Hadoop is installed. The choice should be made with care, because changing between the options after installation is not possible.

Hadoop 1.x NameNode Configuration Choices
• NameNode can optionally have a SecondaryNameNode.

SecondaryNameNode offloads metadata operations from NameNode, and also stores the metadata offline to some extent.

It is not by any means a high availability solution. While recovery from a failed head node is possible from SecondaryNameNode, it is not easy, and it is not recommended or supported by Bright Cluster Manager.

Hadoop 2.x NameNode Configuration Choices
• NameNode and SecondaryNameNode can run as in Hadoop 1.x.

However, the following configurations are also possible:

• NameNode HA with manual failover: In this configuration, Hadoop has NameNode1 and NameNode2 up at the same time, with one active and one on standby. Which NameNode is active, and which is on standby, is set manually by the administrator. If one NameNode fails, then failover must be executed manually. Metadata changes are managed by ZooKeeper, which relies on a quorum of JournalNodes. The number of JournalNodes is therefore set to 3, 5, 7, and so on.

• NameNode HA with automatic failover: As for the manual case, except that in this case ZooKeeper manages failover too. Which NameNode is active, and which is on standby, is therefore decided automatically.

• NameNode Federation: In NameNode Federation, the storage of metadata is split among several NameNodes, each of which has a corresponding SecondaryNameNode. Each pair takes care of a part of HDFS.

In Bright Cluster Manager, there are 4 NameNodes in a default NameNode federation:

– /user
– /tmp
– /staging
– /hbase

User applications do not have to know this mapping. This is because ViewFS on the client side maps the selected path to the corresponding NameNode. Thus, for example, hdfs -ls /tmp/example does not need to know that /tmp is managed by another NameNode.

Cloudera advise against using NameNode Federation for production purposes at present, due to its development status.

2.4 Installing Hadoop With Lustre
The Lustre filesystem has a client-server configuration.

2.4.1 Lustre Internal Server Installation
The procedure for installing a Lustre server varies.

The Bright Cluster Manager Knowledge Base article describes a procedure that uses Lustre sources from Whamcloud. The article is at http://kb.brightcomputing.com/faq/index.php?action=artikel&cat=9&id=176, and may be used as guidance.


2.4.2 Lustre External Server Installation
Lustre can also be configured so that the servers run external to Bright Cluster Manager. The Lustre Intel IEEL 2.x version can be configured in this manner.

2.4.3 Lustre Client Installation
It is preferred that the Lustre clients are installed on the head node, as well as on all the nodes that are to be Hadoop nodes. The clients should be configured to provide a Lustre mount on the nodes. If the Lustre client cannot be installed on the head node, then Bright Cluster Manager has the following limitations during installation and maintenance:

• the head node cannot be used to run Hadoop services

• end users cannot perform Hadoop operations, such as job submission, on the head node. Operations such as those should instead be carried out while logged in to one of the Hadoop nodes

In the remainder of this section, a Lustre mount point of /mnt/lustre is assumed, but it can be set to any convenient directory mount point.

The user IDs and group IDs of the Lustre server and clients should be consistent. It is quite likely that they differ when first set up. The IDs should be checked at least for the following users and groups:

• users: hdfs, mapred, yarn, hbase, zookeeper

• groups: hadoop, zookeeper, hbase

If they do not match on the server and clients, then they must be made consistent manually, so that the UID and GID of the Lustre server users are changed to match the UID and GID of the Bright Cluster Manager users.
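A quick way of checking is to run the same loop on the head node and on the Lustre server, and then compare the outputs. A minimal sketch:

Example

[root@bright70 ~]# for u in hdfs mapred yarn hbase zookeeper; do id $u; done
[root@bright70 ~]# for g in hadoop zookeeper hbase; do getent group $g; done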

Once consistency has been checked, and read/write access is working to LustreFS, the Hadoop integration can be configured.

2.4.4 Lustre Hadoop Configuration
Lustre Hadoop XML Configuration File Setup
Hadoop integration can be configured by using the file /cm/local/apps/cluster-tools/hadoop/conf/hadoop2lustreconf.xml as a starting point for the configuration. It can be copied over to, for example, /root/hadooplustreconfig.xml.

The Intel Distribution for Hadoop (IDH) and Cloudera can both run with Lustre under Bright Cluster Manager. The configuration for these can be done as follows:

• IDH

– A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair:

Example

<afs>
  <fstype>lustre</fstype>
  <fsroot>/mnt/lustre/hadoop</fsroot>
</afs>

• Cloudera

– A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair.

– In addition, an <fsjar></fsjar> tag pair must be specified manually, for the jar that the Intel IEEL 2.x distribution provides:

Example

<afs>
  <fstype>lustre</fstype>
  <fsroot>/mnt/lustre/hadoop</fsroot>
  <fsjar>/root/lustre/hadoop-lustre-plugin-2.3.0.jar</fsjar>
</afs>

The installation of the Lustre plugin is automatic if this jar name is set to the right name, when the cm-hadoop-setup script is run.

Lustre Hadoop Installation With cm-hadoop-setup
The XML configuration file specifies how Lustre should be integrated in Hadoop. If the configuration file is at /root/hadooplustreconfig.xml, then it can be run as:

Example

cm-hadoop-setup -c /root/hadooplustreconfig.xml

As part of configuring Hadoop to use Lustre, the execution will:

• Set the ACLs on the directory specified within the <fsroot></fsroot> tag pair. This was set to /mnt/lustre/hadoop earlier on, as an example.

• Copy the Lustre plugin from its jar path, as specified in the XML file, to the correct place on the client nodes.

Specifically, the subdirectory ./share/hadoop/common/lib is copied into a directory relative to the Hadoop installation directory. For example, the Cloudera version of Hadoop, version 2.3.0-cdh5.1.2, has the Hadoop installation directory /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2. The copy is therefore carried out, in this case, from /root/lustre/hadoop-lustre-plugin-2.3.0.jar to /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2/share/hadoop/common/lib.


Lustre Hadoop Integration In cmsh And cmgui

In cmsh, Lustre integration is indicated in hadoop mode:

Example

[hadoop2->hadoop]% show hdfs1 | grep -i lustre
Hadoop root for Lustre          /mnt/lustre/hadoop
Use Lustre                      yes

In cmgui, the Overview tab in the items for the Hadoop HDFS resource indicates if Lustre is running, along with its mount point (figure 2.3).

Figure 2.3: A Hadoop Instance With Lustre In cmgui

2.5 Hadoop Installation In A Cloud
Hadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual), or a Cluster Extension configuration (Chapter 3 of the Cloudbursting Manual). In both cases, the cloud nodes used should be at least m1.medium.

• For Cluster-On-Demand, the following considerations apply:

– There are no specific issues. After a stop/start cycle, Hadoop recognizes the new IP addresses, and refreshes the list of nodes accordingly (section 2.4.1 of the Cloudbursting Manual).

• For Cluster Extension, the following considerations apply:

– To install Hadoop on cloud nodes, the XML configuration /cm/local/apps/cluster-tools/hadoop/conf/hadoop2clusterextensionconf.xml can be used as a guide.


– In the hadoop2clusterextensionconf.xml file, the cloud director that is to be used with the Hadoop cloud nodes must be specified by the administrator, with the <edge></edge> tag pair:

Example

<edge>
  <hosts>eu-west-1-director</hosts>
</edge>

Maintenance operations, such as a format, will automatically and transparently be carried out by cmdaemon running on the cloud director, and not on the head node.

There are some shortcomings as a result of relying upon the cloud director:

– Cloud nodes depend on the same cloud director.

– While Hadoop installation (cm-hadoop-setup) is run on the head node, users must run Hadoop commands (job submissions, and so on) from the director, and not from the head node.

– It is not possible to mix cloud and non-cloud nodes for the same Hadoop instance. That is, a local Hadoop instance cannot be extended by adding cloud nodes.


3 Viewing And Managing HDFS From CMDaemon

3.1 cmgui And The HDFS Resource
In cmgui, clicking on the resource folder for Hadoop in the resource tree opens up an Overview tab that displays a list of Hadoop HDFS instances, as in figure 2.1.

Clicking on an instance brings up a tab from a series of tabs associated with that instance, as in figure 3.1.

Figure 3.1: Tabs For A Hadoop Instance In cmgui

The possible tabs are: Overview (section 3.1.1), Settings (section 3.1.2), Tasks (section 3.1.3), and Notes (section 3.1.4). Various sub-panes are displayed in the possible tabs.


3.1.1 The HDFS Instance Overview Tab
The Overview tab pane (figure 3.2) arranges Hadoop-related items in sub-panes for convenient viewing.

Figure 3.2: Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed:

• Statistics: In the first sub-pane row are statistics associated with the Hadoop instance. These include node data on the number of applications that are running, as well as memory and capacity usage.

• Metrics: In the second sub-pane row, a metric can be selected for display as a graph.

• Roles: In the third sub-pane row, the roles associated with a node are displayed.

3.1.2 The HDFS Instance Settings Tab
The Settings tab pane (figure 3.3) displays the configuration of the Hadoop instance.


Figure 3.3: Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed:

• WebHDFS: Allows HDFS to be modified via the web.

• Description: An arbitrary, user-defined description.

• Directory settings for the Hadoop instance: Directories for configuration, temporary purposes, and logging.

• Topology awareness setting: Optimizes Hadoop for racks, switches, or nothing, when it comes to network topology.

• Replication factors: Set the default and maximum replication allowed.

• Balancing settings: Spread loads evenly across the instance.

3.1.3 The HDFS Instance Tasks Tab
The Tasks tab (figure 3.4) displays Hadoop-related tasks that the administrator can execute.


Figure 3.4: Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to:

• Restart, start, or stop all the Hadoop daemons.

• Start/Stop the Hadoop balancer.

• Format the Hadoop filesystem. A warning is given before formatting.

• Carry out a manual failover for the NameNode. This feature means making the other NameNode the active one, and is available only for Hadoop instances that support NameNode failover.

3.1.4 The HDFS Instance Notes Tab
The Notes tab is like a jotting pad, and can be used by the administrator to write notes for the associated Hadoop instance.

3.2 cmsh And hadoop Mode
The cmsh front end can be used instead of cmgui to display information on Hadoop-related values, and to carry out Hadoop-related tasks.

3.2.1 The show And overview Commands
The show Command
Within hadoop mode, the show command displays parameters that correspond mostly to cmgui's Settings tab in the Hadoop resource (section 3.1.2):

Example

[root@bright70 conf]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Apache121   1.2.1           Apache               /etc/hadoop/Apache121
[bright70->hadoop]% show apache121
Parameter                               Value
--------------------------------------- ------------------------------------
Automatic failover                      Disabled
Balancer Threshold                      10
Balancer period                         2
Balancing policy                        dataNode
Cluster ID
Configuration directory                 /etc/hadoop/Apache121
Configuration directory for HBase       /etc/hadoop/Apache121/hbase
Configuration directory for ZooKeeper   /etc/hadoop/Apache121/zookeeper
Creation time                           Fri, 07 Feb 2014 17:03:01 CET
Default replication factor              3
Enable HA for YARN                      no
Enable NFS gateway                      no
Enable Web HDFS                         no
HA enabled                              no
HA name service
HDFS Block size                         67108864
HDFS Permission                         no
HDFS Umask                              077
HDFS log directory                      /var/log/hadoop/Apache121
Hadoop distribution                     Apache
Hadoop root
Hadoop version                          1.2.1
Installation directory for HBase        /cm/shared/apps/hadoop/Apache/hbase-0+
Installation directory for ZooKeeper    /cm/shared/apps/hadoop/Apache/zookeepe+
Installation directory                  /cm/shared/apps/hadoop/Apache/1.2.1
Maximum replication factor              512
Network
Revision
Root directory for data                 /var/lib
Temporary directory                     /tmp/hadoop/Apache121
Topology                                Switch
Use HTTPS                               no
Use Lustre                              no
Use federation                          no
Use only HTTPS                          no
YARN automatic failover                 Hadoop
description                             installed from: /tmp/hadoop-1.2.1.tar+
name                                    Apache121
notes

The overview Command
Similarly, the overview command corresponds somewhat to cmgui's Overview tab in the Hadoop resource (section 3.1.1), and provides Hadoop-related information on the system resources that are used:

Example

[bright70->hadoop]% overview apache121
Parameter                       Value
------------------------------  ---------------------------
Name                            apache121
Capacity total                  0B
Capacity used                   0B
Capacity remaining              0B
Heap memory total               0B
Heap memory used                0B
Heap memory remaining           0B
Non-heap memory total           0B
Non-heap memory used            0B
Non-heap memory remaining       0B
Nodes available                 0
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     0
Total blocks                    0
Missing blocks                  0
Under-replicated blocks         0
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       0
Applications running            0
Applications pending            0
Applications submitted          0
Applications completed          0
Applications failed             0
Federation setup                no

Role                            Node
------------------------------  ---------------------------
DataNode, ZooKeeper             bright70,node001,node002

3.2.2 The Tasks Commands
Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The services Commands For Hadoop Services
Hadoop services can be started, stopped, and restarted, with:

• restartallservices

• startallservices

• stopallservices

Example

[bright70->hadoop]% restartallservices apache121
Will now stop all Hadoop services for instance 'apache121'... done.
Will now start all Hadoop services for instance 'apache121'... done.
[bright70->hadoop]%

The startbalancer And stopbalancer Commands For Hadoop
For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer

• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:

• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp  Iteration  Bytes Moved  Bytes To Move  Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%
Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121
For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
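For example, for a hypothetical HA-enabled instance hdfs1 with NameNodes running on node001 and node002, a failover from the first to the second could be triggered as in the following sketch (the node and instance names are assumptions):

Example

[bright70->hadoop]% manualfailover --from node001 --to node002 hdfs1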


4 Hadoop Services

4.1 Hadoop Services From cmgui
4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:


• Selecting a Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui
The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui

The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means it includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly, via buttons.


5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this, in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220
================================================================
Press [CTRL+C] to stop...
================================================================
================================================================
start cleaning directories...
================================================================
================================================================
clean directories done
================================================================
================================================================
start doing gen_test...
================================================================
14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2
...
Job Counters
...
Map-Reduce Framework
...
org.apache.hadoop.examples.terasort.TeraGen$Counters
...
14/03/24 15:07:03 INFO terasort.TeraSort: starting
...
14/03/24 15:09:12 INFO terasort.TeraSort: done
================================================================================
gen_test done
================================================================================
================================================================================
start doing PI test...
================================================================================
Working Directory = /user/root/bbp
...

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                       Value
------------------------------  ---------------------------------
Name                            Apache220
Capacity total                  27.56GB
Capacity used                   72.46MB
Capacity remaining              16.41GB
Heap memory total               280.7MB
Heap memory used                152.1MB
Heap memory remaining           128.7MB
Non-heap memory total           258.1MB
Non-heap memory used            251.9MB
Non-heap memory remaining       6.155MB
Nodes available                 3
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     72
Total blocks                    31
Missing blocks                  0
Under-replicated blocks         2
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       59666
Applications running            1
Applications pending            0
Applications submitted          7
Applications completed          6
Applications failed             0
High availability               Yes (automatic failover disabled)
Federation setup                no

Role                                                            Node
--------------------------------------------------------------  -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper  node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper              node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user, fred, can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestionsThe access can be unset by leaving a blank for the instance option
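As an illustrative sketch, revoking the access again could be done by committing an empty value:

Example

[bright70->user[fred]]% set hadoophdfsaccess ""; commit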

The user fred can then submit a run from a pi value estimator, from the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
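For instance, a run with 4 maps and 10000 samples per map (illustrative values) takes longer, but yields a less crude estimate:

Example

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 10000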


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation, And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use the Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are by default installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup
The option:

-p <rootpass>

is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option:

-s <heapsize>

is not mandatory. If not set, a default value of 1GB is used.

The option:

--master <nodename>

is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

6.2 Hive
Apache Hive is a data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation, And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use:

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly, by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;
DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.

– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script, using hive and beeline.

An Example Run With cm-hive-setup
Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client...
-- testing 'beeline' client...
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 -d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation, And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.
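The manual has no separate subsection on using Kafka, but a quick functional check can be carried out with the console tools shipped in the Kafka tarball. A hedged sketch, assuming a kafka/hdfs1 module (module avail shows the exact name), with node001 being one of the ZooKeeper nodes, and with the default broker port 9092 and ZooKeeper port 2181:

Example

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 --replication-factor 1 --partitions 1 --topic test
[root@bright70 ~]# echo hello | kafka-console-producer.sh --broker-list node001:9092 --topic test
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 --topic test --from-beginning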

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended, by language design, to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation, And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
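The contents of /tmp/smoke.pig are not shown in this manual. As a purely hypothetical illustration, a minimal script that counts the lines of a file already uploaded to HDFS (the path /user/root/input.txt is an assumption) could be created and run as follows:

Example

[root@bright70 ~]# cat > /tmp/smoke.pig <<'EOF'
-- count the lines of a file stored in HDFS
lines = LOAD '/user/root/input.txt' USING PigStorage();
grouped = GROUP lines ALL;
counted = FOREACH grouped GENERATE COUNT(lines);
DUMP counted;
EOF
[root@bright70 ~]# pig -v -f /tmp/smoke.pig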

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation, And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option: --standalone

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.

The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup in standalone modeExample

[rootbright70 ~] cm-spark-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpspark-130-bin-hadoop24tgz --standalone --master node005

Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode
Spark Master service will be run on node node005
Spark will use all DataNodes as WorkerNodes
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.


Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar
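If Spark was instead deployed in Standalone mode, then the same example can be submitted to the Spark Master directly. The following is a minimal sketch, which assumes a Master set up on node005, as in the earlier installation example, and listening on Spark's standard port, 7077:

Example

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master spark://node005:7077 --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar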

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, is not feature-complete, and is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

  – a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j

  – cm-sqoop-setup is run with the --conn option, in order to specify the connector version to be used:

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005

Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
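After installation, imports can be run with the sqoop client, which transfers the data via MapReduce jobs. The following sketch imports a MySQL table into HDFS. The module name follows the instance naming used by the other setup scripts, and the database testdb, the table employees, and the user sqoopuser are illustrative assumptions that depend on the local MySQL setup:

Example

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://bright70.cm.cluster/testdb --username sqoopuser -P --table employees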


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemon to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t apache-storm-0.9.4.tar.gz --nimbus node005

Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar storm.starter.WordCountTopology WordCount2

470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...

476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15
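A submitted topology can also be stopped from the command line, by name, with the standard storm client:

Example

[root@bright70 ~]# storm kill WordCount2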


• Hortonworks, packaged as cm-hortonworks-hadoop by Bright Computing. This provides:

  – HDP 1.3.9 (based on Apache Hadoop 1.2.0, HBase 0.94.6, and ZooKeeper 3.4.5)

  – HDP 2.2.0 (based on Apache Hadoop 2.6.0, HBase 0.98.4, and ZooKeeper 3.4.6)

Pivotal
Pivotal HD version 2.1.0 is based on Apache Hadoop 2.2.0. It runs fully integrated on Bright Cluster Manager, but is not available as a package in the Bright Cluster Manager repositories. This is because Pivotal prefers to distribute the software itself. The user can download the software from https://network.pivotal.io/products/pivotal-hd, after registering with Pivotal.

1.3 Further Documentation
Further documentation is provided in the installed tarballs of the Hadoop version, after the Bright Cluster Manager installation (Chapter 2) has been carried out. A starting point is listed in the table below:

Hadoop version       Path under /cm/shared/apps/hadoop
--------------       ---------------------------------
Apache 1.2.1         Apache/1.2.1/docs/index.html
Apache 2.2.0         Apache/2.2.0/share/doc/hadoop/index.html
Apache 2.6.0         Apache/2.6.0/share/doc/hadoop/index.html
Cloudera CDH 4.7.1   Cloudera/2.0.0-cdh4.7.1/share/doc/index.html
Cloudera CDH 5.3.2   Cloudera/2.5.0-cdh5.3.2/share/doc/index.html
Hortonworks HDP      online documentation is available at http://docs.hortonworks.com


2 Installing Hadoop

In Bright Cluster Manager, a Hadoop instance can be configured and run either via the command-line (section 2.1) or via an ncurses GUI (section 2.2). Both options can be carried out with the cm-hadoop-setup script, which is run from a head node. The script is a part of the cluster-tools package, and uses tarballs from the Apache Hadoop project. The Bright Cluster Manager With Hadoop installation ISO includes the cm-apache-hadoop package, which contains tarballs from the Apache Hadoop project suitable for cm-hadoop-setup.

2.1 Command-line Installation Of Hadoop Using cm-hadoop-setup -c <filename>

2.1.1 Usage
[root@bright70 ~]# cm-hadoop-setup -h

USAGE: /cm/local/apps/cluster-tools/bin/cm-hadoop-setup [-c <filename> | -u <name> | -h]

OPTIONS:
  -c <filename>    -- Hadoop config file to use
  -u <name>        -- uninstall Hadoop instance
  -h               -- show usage

EXAMPLES:
  cm-hadoop-setup -c /tmp/config.xml
  cm-hadoop-setup -u foo
  cm-hadoop-setup    (no options: a gui will be started)

Some sample configuration files are provided in the directory /cm/local/apps/cluster-tools/hadoop/conf/:

hadoop1conf.xml        (for Hadoop 1.x)
hadoop2conf.xml        (for Hadoop 2.x)
hadoop2fedconf.xml     (for Hadoop 2.x with NameNode federation)
hadoop2haconf.xml      (for Hadoop 2.x with High Availability)
hadoop2lustreconf.xml  (for Hadoop 2.x with Lustre support)


2.1.2 An Install Run
An XML template can be used based on the examples in the directory /cm/local/apps/cluster-tools/hadoop/conf/.

In the XML template, the path for a tarball component is enclosed by <archive></archive> tag pairs. The tarball components can be as indicated:

• <archive>hadoop tarball</archive>

• <archive>hbase tarball</archive>

• <archive>zookeeper tarball</archive>

The tarball components can be picked up from URLs as listed in section 1.2. The paths of the tarball component files that are to be used should be set up as needed before running cm-hadoop-setup.

The downloaded tarball components should be placed in the /tmp directory if the default definitions in the default XML files are used:

Example

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/conf
[root@bright70 conf]# grep 'archive>' hadoop1conf.xml | grep -o "/.*gz"
/tmp/hadoop-1.2.1.tar.gz
/tmp/zookeeper-3.4.6.tar.gz
/tmp/hbase-0.98.11-hadoop1-bin.tar.gz

Files under /tmp are not intended to stay around permanently. The administrator may therefore wish to place the tarball components in a more permanent part of the filesystem instead, and change the XML definitions accordingly.

A Hadoop instance name, for example Myhadoop, can also be defined in the XML file, within the <name></name> tag pair.

Hadoop NameNodes and SecondaryNameNodes handle HDFS metadata, while DataNodes manage HDFS data. The data must be stored in the filesystem of the nodes. The default path for where the data is stored can be specified within the <dataroot></dataroot> tag pair. Multiple paths can also be set, using comma-separated paths. NameNodes, SecondaryNameNodes, and DataNodes each use the value, or values, set within the <dataroot></dataroot> tag pair for their root directories.
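For instance, a sketch of these two tag pairs in an XML template (the instance name and the two data paths are illustrative values, not defaults):

Example

<name>Myhadoop</name>
<dataroot>/data1,/data2</dataroot>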

If needed, more specific tags can be used for each node type. This is useful in the case where hardware differs for the various node types. For example:

• a NameNode with 2 disk drives for Hadoop use

• a DataNode with 4 disk drives for Hadoop use

The XML file used by cm-hadoop-setup can in this case use the tag pairs:

• <namenodedatadirs></namenodedatadirs>

• <datanodedatadirs></datanodedatadirs>

If these are not specified, then the value within the <dataroot></dataroot> tag pair is used.


Example

• <namenodedatadirs>/data1,/data2</namenodedatadirs>

• <datanodedatadirs>/data1,/data2,/data3,/data4</datanodedatadirs>

Hadoop should then have the following dfs.*.dir properties added to it, via the hdfs-site.xml configuration file. For the preceding tag pairs, the property values should be set as follows:

Example

• dfs.namenode.name.dir, with values:
  /data1/hadoop/hdfs/namenode,/data2/hadoop/hdfs/namenode

• dfs.datanode.data.dir, with values:
  /data1/hadoop/hdfs/datanode,/data2/hadoop/hdfs/datanode,/data3/hadoop/hdfs/datanode,/data4/hadoop/hdfs/datanode

An install run then displays output like the following:

Example

-rw-r--r-- 1 root root 63851630 Feb  4 15:13 /tmp/hadoop-1.2.1.tar.gz
[root@bright70 ~]# cm-hadoop-setup -c /tmp/hadoop1conf.xml

Reading config from file '/tmp/hadoop1conf.xml'... done.
Hadoop flavor 'Apache', release '1.2.1'
Will now install Hadoop in /cm/shared/apps/hadoop/Apache/1.2.1 and configure instance 'Myhadoop'
Hadoop distro being installed... done.
Zookeeper being installed... done.
HBase being installed... done.
Creating module file... done.
Configuring Hadoop instance on local filesystem and images... done.
Updating images:
starting imageupdate for node 'node003'... started.
starting imageupdate for node 'node002'... started.
starting imageupdate for node 'node001'... started.
starting imageupdate for node 'node004'... started.
Waiting for imageupdate to finish... done.
Creating Hadoop instance in cmdaemon... done.
Formatting HDFS... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.

The Hadoop instance should now be running. The name defined for it in the XML file should show up within cmgui, in the Hadoop HDFS resource tree folder (figure 2.1).


Figure 2.1: A Hadoop Instance In cmgui

The instance name is also displayed within cmsh when the list command is run in hadoop mode:

Example

[root@bright70 ~]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Myhadoop    1.2.1           Apache               /etc/hadoop/Myhadoop

The instance can be removed as follows:

Example

[root@bright70 ~]# cm-hadoop-setup -u Myhadoop

Requested uninstall of Hadoop instance 'Myhadoop'.
Uninstalling Hadoop instance 'Myhadoop'
Removing:
/etc/hadoop/Myhadoop
/var/lib/hadoop/Myhadoop
/var/log/hadoop/Myhadoop
/var/run/hadoop/Myhadoop
/tmp/hadoop/Myhadoop
/etc/hadoop/Myhadoop/zookeeper
/var/lib/zookeeper/Myhadoop
/var/log/zookeeper/Myhadoop
/var/run/zookeeper/Myhadoop
/etc/hadoop/Myhadoop/hbase
/var/log/hbase/Myhadoop
/var/run/hbase/Myhadoop
/etc/init.d/hadoop-Myhadoop-*
Module file(s) deleted.
Uninstall successfully completed.

2.2 Ncurses Installation Of Hadoop Using cm-hadoop-setup

Running cm-hadoop-setup without any options starts up an ncurses GUI (figure 2.2).


Figure 2.2: The cm-hadoop-setup Welcome Screen

This provides an interactive way to add and remove Hadoop instances, along with HBase and ZooKeeper components. Some explanations of the items being configured are given along the way. In addition, some minor validation checks are done, and some options are restricted.

The suggested default values will work. Other values can be chosen instead of the defaults, but some care in selection is usually a good idea. This is because Hadoop is complex software, which means that values other than the defaults can sometimes lead to unworkable configurations (section 2.3).

The ncurses installation results in an XML configuration file. This file can be used with the -c option of cm-hadoop-setup to carry out the installation.

2.3 Avoiding Misconfigurations During Hadoop Installation

A misconfiguration can be defined as a configuration that works badly, or not at all.

For Hadoop to work well, the following common misconfigurations should normally be avoided. Some of these result in warnings from Bright Cluster Manager validation checks during configuration, but can be overridden. An override is useful for cases where the administrator would just like to, for example, test some issues not related to scale or performance.

2.3.1 NameNode Configuration Choices
One of the following NameNode configuration options must be chosen when Hadoop is installed. The choice should be made with care, because changing between the options after installation is not possible.

Hadoop 1.x NameNode Configuration Choices

• NameNode can optionally have a SecondaryNameNode.

  SecondaryNameNode offloads metadata operations from NameNode, and also stores the metadata offline to some extent.

  It is not by any means a high availability solution. While recovery from a failed head node is possible from SecondaryNameNode, it is not easy, and it is not recommended or supported by Bright Cluster Manager.

Hadoop 2.x NameNode Configuration Choices

• NameNode and SecondaryNameNode can run as in Hadoop 1.x.

However, the following configurations are also possible:

• NameNode HA with manual failover: In this configuration, Hadoop has NameNode1 and NameNode2 up at the same time, with one active and one on standby. Which NameNode is active and which is on standby is set manually by the administrator. If one NameNode fails, then failover must be executed manually. Metadata changes are managed by ZooKeeper, which relies on a quorum of JournalNodes. The number of JournalNodes is therefore set to 3, 5, 7, and so on.

• NameNode HA with automatic failover: As for the manual case, except that in this case ZooKeeper manages failover too. Which NameNode is active and which is on standby is therefore decided automatically.

• NameNode Federation: In NameNode Federation, the storage of metadata is split among several NameNodes, each of which has a corresponding SecondaryNameNode. Each pair takes care of a part of HDFS.

  In Bright Cluster Manager there are 4 NameNodes in a default NameNode federation:

  – /user

  – /tmp

  – /staging

  – /hbase

  User applications do not have to know this mapping. This is because ViewFS on the client side maps the selected path to the corresponding NameNode. Thus, for example, hdfs dfs -ls /tmp/example does not need to know that /tmp is managed by another NameNode.

  Cloudera advise against using NameNode Federation for production purposes at present, due to its development status.

2.4 Installing Hadoop With Lustre
The Lustre filesystem has a client-server configuration.

2.4.1 Lustre Internal Server Installation
The procedure for installing a Lustre server varies.

The Bright Cluster Manager Knowledge Base article describes a procedure that uses Lustre sources from Whamcloud. The article is at http://kb.brightcomputing.com/faq/index.php?action=artikel&cat=9&id=176, and may be used as guidance.


2.4.2 Lustre External Server Installation
Lustre can also be configured so that the servers run external to Bright Cluster Manager. The Lustre Intel IEEL 2.x version can be configured in this manner.

2.4.3 Lustre Client Installation
It is preferred that the Lustre clients are installed on the head node, as well as on all the nodes that are to be Hadoop nodes. The clients should be configured to provide a Lustre mount on the nodes. If the Lustre client cannot be installed on the head node, then Bright Cluster Manager has the following limitations during installation and maintenance:

• the head node cannot be used to run Hadoop services

• end users cannot perform Hadoop operations, such as job submission, on the head node. Operations such as those should instead be carried out while logged in to one of the Hadoop nodes

In the remainder of this section, a Lustre mount point of /mnt/lustre is assumed, but it can be set to any convenient directory mount point.

The user IDs and group IDs of the Lustre server and clients should be consistent. It is quite likely that they differ when first set up. The IDs should be checked at least for the following users and groups:

• users: hdfs, mapred, yarn, hbase, zookeeper

• groups: hadoop, zookeeper, hbase

If they do not match on the server and clients, then they must be made consistent manually, so that the UID and GID of the Lustre server users are changed to match the UID and GID of the Bright Cluster Manager users.
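A quick way to compare them is to run the standard id utility for each of these users, on the Lustre server and on a client, and to check that the reported UIDs and GIDs match. A minimal sketch:

Example

[root@bright70 ~]# for u in hdfs mapred yarn hbase zookeeper; do id $u; done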

Once consistency has been checked, and read/write access is working to LustreFS, the Hadoop integration can be configured.

2.4.4 Lustre Hadoop Configuration
Lustre Hadoop XML Configuration File Setup
Hadoop integration can be configured by using the file /cm/local/apps/cluster-tools/hadoop/conf/hadoop2lustreconf.xml as a starting point for the configuration. It can be copied over to, for example, /root/hadooplustreconfig.xml.

The Intel Distribution for Hadoop (IDH) and Cloudera can both run with Lustre under Bright Cluster Manager. The configuration for these can be done as follows:

• IDH

  – A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair:

Example

<afs>
  <fstype>lustre</fstype>
  <fsroot>/mnt/lustre/hadoop</fsroot>
</afs>

• Cloudera

  – A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair.

  – In addition, an <fsjar></fsjar> tag pair must be specified manually, for the jar that the Intel IEEL 2.x distribution provides:

Example

<afs>
  <fstype>lustre</fstype>
  <fsroot>/mnt/lustre/hadoop</fsroot>
  <fsjar>/root/lustre/hadoop-lustre-plugin-2.3.0.jar</fsjar>
</afs>

The installation of the Lustre plugin is automatic if this jar name is set to the right name, when the cm-hadoop-setup script is run.

Lustre Hadoop Installation With cm-hadoop-setup
The XML configuration file specifies how Lustre should be integrated in Hadoop. If the configuration file is at /root/hadooplustreconfig.xml, then it can be run as:

Example

cm-hadoop-setup -c /root/hadooplustreconfig.xml

As part of configuring Hadoop to use Lustre, the execution will:

• Set the ACLs on the directory specified within the <fsroot></fsroot> tag pair. This was set to /mnt/lustre/hadoop earlier on, as an example.

• Copy the Lustre plugin from its jar path, as specified in the XML file, to the correct place on the client nodes.

  Specifically, the plugin is copied into the subdirectory share/hadoop/common/lib, relative to the Hadoop installation directory. For example, the Cloudera version of Hadoop, version 2.3.0-cdh5.1.2, has the Hadoop installation directory /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2. The copy is therefore carried out, in this case, from /root/lustre/hadoop-lustre-plugin-2.3.0.jar to /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2/share/hadoop/common/lib.


Lustre Hadoop Integration In cmsh and cmgui

In cmsh, Lustre integration is indicated in hadoop mode:

Example

[hadoop2->hadoop]% show hdfs1 | grep -i lustre
Hadoop root for Lustre                 /mnt/lustre/hadoop
Use Lustre                             yes

In cmgui, the Overview tab in the items for the Hadoop HDFS resource indicates if Lustre is running, along with its mount point (figure 2.3).

Figure 2.3: A Hadoop Instance With Lustre In cmgui

2.5 Hadoop Installation In A Cloud
Hadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual), or a Cluster Extension configuration (Chapter 3 of the Cloudbursting Manual). In both cases, the cloud nodes used should be at least m1.medium.

• For Cluster-On-Demand, the following considerations apply:

  – There are no specific issues. After a stop/start cycle, Hadoop recognizes the new IP addresses, and refreshes the list of nodes accordingly (section 2.4.1 of the Cloudbursting Manual).

• For Cluster Extension, the following considerations apply:

  – To install Hadoop on cloud nodes, the XML configuration /cm/local/apps/cluster-tools/hadoop/conf/hadoop2clusterextensionconf.xml can be used as a guide.


  – In the hadoop2clusterextensionconf.xml file, the cloud director that is to be used with the Hadoop cloud nodes must be specified by the administrator with the <edge></edge> tag pair:

Example

<edge>
  <hosts>eu-west-1-director</hosts>
</edge>

    Maintenance operations, such as a format, will automatically and transparently be carried out by cmdaemon running on the cloud director, and not on the head node.

  There are some shortcomings as a result of relying upon the cloud director:

  – Cloud nodes depend on the same cloud director.

  – While Hadoop installation (cm-hadoop-setup) is run on the head node, users must run Hadoop commands (job submissions, and so on) from the director, and not from the head node.

  – It is not possible to mix cloud and non-cloud nodes for the same Hadoop instance. That is, a local Hadoop instance cannot be extended by adding cloud nodes.


3 Viewing And Managing HDFS From CMDaemon

3.1 cmgui And The HDFS Resource
In cmgui, clicking on the resource folder for Hadoop in the resource tree opens up an Overview tab that displays a list of Hadoop HDFS instances, as in figure 2.1.

Clicking on an instance brings up a tab from a series of tabs associated with that instance, as in figure 3.1.

Figure 3.1: Tabs For A Hadoop Instance In cmgui

The possible tabs are: Overview (section 3.1.1), Settings (section 3.1.2), Tasks (section 3.1.3), and Notes (section 3.1.4). Various sub-panes are displayed in the possible tabs.


3.1.1 The HDFS Instance Overview Tab
The Overview tab pane (figure 3.2) arranges Hadoop-related items in sub-panes for convenient viewing.

Figure 3.2: Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed:

• Statistics: In the first sub-pane row are statistics associated with the Hadoop instance. These include node data on the number of applications that are running, as well as memory and capacity usage.

• Metrics: In the second sub-pane row, a metric can be selected for display as a graph.

• Roles: In the third sub-pane row, the roles associated with a node are displayed.

3.1.2 The HDFS Instance Settings Tab
The Settings tab pane (figure 3.3) displays the configuration of the Hadoop instance.


Figure 3.3: Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed:

• WebHDFS: Allows HDFS to be modified via the web.

• Description: An arbitrary, user-defined description.

• Directory settings for the Hadoop instance: Directories for configuration, temporary purposes, and logging.

• Topology awareness setting: Optimizes Hadoop for racks, switches, or nothing, when it comes to network topology.

• Replication factors: Set the default and maximum replication allowed.

• Balancing settings: Spread loads evenly across the instance.

3.1.3 The HDFS Instance Tasks Tab
The Tasks tab (figure 3.4) displays Hadoop-related tasks that the administrator can execute.


Figure 3.4: Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to:

• Restart, start, or stop all the Hadoop daemons.

• Start/Stop the Hadoop balancer.

• Format the Hadoop filesystem. A warning is given before formatting.

• Carry out a manual failover for the NameNode. This feature means making the other NameNode the active one, and is available only for Hadoop instances that support NameNode failover.

3.1.4 The HDFS Instance Notes Tab
The Notes tab is like a jotting pad, and can be used by the administrator to write notes for the associated Hadoop instance.

3.2 cmsh And hadoop Mode
The cmsh front end can be used instead of cmgui to display information on Hadoop-related values, and to carry out Hadoop-related tasks.

3.2.1 The show And overview Commands
The show Command
Within hadoop mode, the show command displays parameters that correspond mostly to cmgui's Settings tab in the Hadoop resource (section 3.1.2):

Example

[root@bright70 conf]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Apache121   1.2.1           Apache               /etc/hadoop/Apache121
[bright70->hadoop]% show apache121
Parameter                             Value


------------------------------------- ------------------------------------
Automatic failover                    Disabled
Balancer Threshold                    10
Balancer period                       2
Balancing policy                      dataNode
Cluster ID
Configuration directory               /etc/hadoop/Apache121
Configuration directory for HBase     /etc/hadoop/Apache121/hbase
Configuration directory for ZooKeeper /etc/hadoop/Apache121/zookeeper
Creation time                         Fri, 07 Feb 2014 17:03:01 CET
Default replication factor            3
Enable HA for YARN                    no
Enable NFS gateway                    no
Enable Web HDFS                       no
HA enabled                            no
HA name service
HDFS Block size                       67108864
HDFS Permission                       no
HDFS Umask                            077
HDFS log directory                    /var/log/hadoop/Apache121
Hadoop distribution                   Apache
Hadoop root
Hadoop version                        1.2.1
Installation directory for HBase      /cm/shared/apps/hadoop/Apache/hbase-0+
Installation directory for ZooKeeper  /cm/shared/apps/hadoop/Apache/zookeepe+
Installation directory                /cm/shared/apps/hadoop/Apache/1.2.1
Maximum replication factor            512
Network
Revision
Root directory for data               /var/lib
Temporary directory                   /tmp/hadoop/Apache121
Topology                              Switch
Use HTTPS                             no
Use Lustre                            no
Use federation                        no
Use only HTTPS                        no
YARN automatic failover               Hadoop
description                           installed from: /tmp/hadoop-1.2.1.tar+
name                                  Apache121
notes

The overview Command
Similarly, the overview command corresponds somewhat to cmgui's Overview tab in the Hadoop resource (section 3.1.1), and provides Hadoop-related information on the system resources that are used:

Example

[bright70->hadoop]% overview apache121
Parameter                      Value
------------------------------ ---------------------------
Name                           apache121
Capacity total                 0B
Capacity used                  0B
Capacity remaining             0B
Heap memory total              0B


Heap memory used               0B
Heap memory remaining          0B
Non-heap memory total          0B
Non-heap memory used           0B
Non-heap memory remaining      0B
Nodes available                0
Nodes dead                     0
Nodes decommissioned           0
Nodes decommission in progress 0
Total files                    0
Total blocks                   0
Missing blocks                 0
Under-replicated blocks        0
Scheduled replication blocks   0
Pending replication blocks     0
Block report average Time      0
Applications running           0
Applications pending           0
Applications submitted         0
Applications completed         0
Applications failed            0
Federation setup               no

Role                           Node
------------------------------ ---------------------------
DataNode, ZooKeeper            bright70,node001,node002

3.2.2 The Tasks Commands
Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The *services Commands For Hadoop Services
Hadoop services can be started, stopped, and restarted with:

• restartallservices

• startallservices

• stopallservices

Example

[bright70->hadoop]% restartallservices apache121
Will now stop all Hadoop services for instance 'apache121'... done.
Will now start all Hadoop services for instance 'apache121'... done.
[bright70->hadoop]%

The startbalancer And stopbalancer Commands For Hadoop
For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer

• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:


• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp  Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%
Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121. For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.


4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service instance, values can be edited via actions in the associated pane:


• Selecting a Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui
The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


  The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means it includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.


5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this, in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220

================================================================

Press [CTRL+C] to stop

================================================================

================================================================

start cleaning directories

================================================================

================================================================

clean directories done

================================================================

================================================================

start doing gen_test

================================================================

14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2

Job Counters

Map-Reduce Framework


org.apache.hadoop.examples.terasort.TeraGen$Counters
14/03/24 15:07:03 INFO terasort.TeraSort: starting
14/03/24 15:09:12 INFO terasort.TeraSort: done

================================================================================

gen_test done

================================================================================

================================================================================

start doing PI test

================================================================================

Working Directory = /user/root/bbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                      Value
------------------------------ ---------------------------------
Name                           Apache220
Capacity total                 27.56GB
Capacity used                  72.46MB
Capacity remaining             16.41GB
Heap memory total              280.7MB
Heap memory used               152.1MB
Heap memory remaining          128.7MB
Non-heap memory total          258.1MB
Non-heap memory used           251.9MB
Non-heap memory remaining      6.155MB
Nodes available                3
Nodes dead                     0
Nodes decommissioned           0
Nodes decommission in progress 0
Total files                    72
Total blocks                   31
Missing blocks                 0


Under-replicated blocks        2
Scheduled replication blocks   0
Pending replication blocks     0
Block report average Time      59666
Applications running           1
Applications pending           0
Applications submitted         7
Applications completed         6
Applications failed            0
High availability              Yes (automatic failover disabled)
Federation setup               no

Role                                                            Node
--------------------------------------------------------------- -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper  node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper              node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user, fred, can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.

The user fred can then submit a run from a pi value estimator, from the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache/2.2.0
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5

Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
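For example, a run with more maps and samples (the values here are chosen just for illustration) takes longer, but returns an estimate much closer to the true value of pi:

Example

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 1000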


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use the Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are, by default, installed and run on the head node. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option -p <rootpass> is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option -s <heapsize> is not mandatory. If not set, a default value of 1GB is used.

The option --master <nodename> is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005

Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.
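After a successful installation, an interactive check can be made with the Accumulo shell, run as the accumulo system user in the same way as in section 6.1.3. The root password is the one that was passed with -p:

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ accumulo shell -u root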

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1

Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run as the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

6.2 Hive
Apache Hive is a data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

  – a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j

  – cm-hive-setup is run with the --conn option to specify the connector version to use:

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;
DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

  – <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.


  – <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

  – The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node.

  It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script, using hive and beeline.

An Example Run With cm-hive-setup

Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005

Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client...
-- testing 'beeline' client...
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup

cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline

The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000
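Beeline can likewise run arbitrary HiveQL non-interactively with the -e option. A sketch, creating a table on the same HiveServer2 as in the preceding example (the table name test3 is hypothetical):

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'CREATE TABLE test3 (id INT, name STRING);'

A subsequent SHOW TABLES run, as in the preceding example, should then also list test3.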

6.3 Kafka

Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup

Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does

The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
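After installation, basic produce/consume operation can be checked with the console tools shipped with Kafka. The following is a sketch only: the module name kafka/hdfs1, the ZooKeeper node node001, and the broker port 9092 are assumptions, to be adapted to the actual instance:

Example

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 \
--replication-factor 1 --partitions 1 --topic smoketest
[root@bright70 ~]# echo "hello kafka" | kafka-console-producer.sh \
--broker-list node001:9092 --topic smoketest
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 \
--topic smoketest --from-beginning --max-messages 1
hello kafka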

6.3.2 Kafka Removal With cm-kafka-setup

cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig

Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup

Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation And What Pig Installation Does

The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup

cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig

Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode"; that is, it uses the corresponding HDFS installation to store and process data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance. A minimal batch-mode script is sketched next.
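As an illustration, a minimal word-count script can be created and run in batch mode as follows. The script is hypothetical, and not part of the Bright Cluster Manager distribution; it assumes that a plain text file has already been copied into HDFS as /user/root/input.txt:

Example

[root@bright70 ~]# module load hadoop/hdfs1 pig/hdfs1
[root@bright70 ~]# cat > /tmp/wordcount.pig <<'EOF'
-- load each line of the input file as a single chararray field
lines = LOAD '/user/root/input.txt' AS (line:chararray);
-- split each line into words, one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- group identical words together, and count each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/user/root/wordcount-out';
EOF
[root@bright70 ~]# pig -f /tmp/wordcount.pig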

6.5 Spark

Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup

Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does

The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup in standalone mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode.
Spark Master service will be run on node node005.
Spark will use all DataNodes as WorkerNodes.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup

cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark

Spark supports two deploy modes for launching Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar
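For a Spark instance installed in Standalone mode, applications are submitted to the Spark Master instead. A sketch, assuming that the Master was installed on node005, as in the earlier standalone installation example, and that it listens on Spark's default master port, 7077:

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master spark://node005:7077 \
--class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar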

6.6 Sqoop

Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop: it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup

Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation And What Sqoop Installation Does

The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j

– cm-sqoop-setup is run with the --conn option, in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz \
--conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005.
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
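Once Sqoop is installed, a MySQL table can be imported into HDFS with sqoop import. The following is a sketch only: the module name sqoop/hdfs1, the database mydb, the user fred, and the table employees are hypothetical names, and -P makes Sqoop prompt for the database password:

Example

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://bright70/mydb \
--username fred -P --table employees --target-dir /user/root/employees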


6.6.2 Sqoop Removal With cm-sqoop-setup

cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm

Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while each worker node for Storm runs an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup

Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does

The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005.
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup

The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm

The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/\
examples/storm-starter/storm-starter-topologies-*.jar \
storm.starter.WordCountTopology WordCount2
470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2
[root@hadoopdev ~]# storm list
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15
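A submitted topology can be stopped with the storm kill command:

[root@bright70 ~]# storm kill WordCount2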

2 Installing Hadoop

In Bright Cluster Manager, a Hadoop instance can be configured and run either via the command-line (section 2.1) or via an ncurses GUI (section 2.2). Both options can be carried out with the cm-hadoop-setup script, which is run from a head node. The script is a part of the cluster-tools package, and uses tarballs from the Apache Hadoop project. The Bright Cluster Manager With Hadoop installation ISO includes the cm-apache-hadoop package, which contains tarballs from the Apache Hadoop project, suitable for cm-hadoop-setup.

2.1 Command-line Installation Of Hadoop Using cm-hadoop-setup -c <filename>

2.1.1 Usage

[root@bright70 ~]# cm-hadoop-setup -h

USAGE: /cm/local/apps/cluster-tools/bin/cm-hadoop-setup [-c <filename> | -u <name> | -h]

OPTIONS:
  -c <filename>  -- Hadoop config file to use
  -u <name>      -- uninstall Hadoop instance
  -h             -- show usage

EXAMPLES:
  cm-hadoop-setup -c /tmp/config.xml
  cm-hadoop-setup -u foo
  cm-hadoop-setup (no options: a gui will be started)

Some sample configuration files are provided in the directory /cm/local/apps/cluster-tools/hadoop/conf/:

hadoop1conf.xml        (for Hadoop 1.x)
hadoop2conf.xml        (for Hadoop 2.x)
hadoop2fedconf.xml     (for Hadoop 2.x with NameNode federation)
hadoop2haconf.xml      (for Hadoop 2.x with High Availability)
hadoop2lustreconf.xml  (for Hadoop 2.x with Lustre support)


2.1.2 An Install Run

An XML template can be used based on the examples in the directory /cm/local/apps/cluster-tools/hadoop/conf/.

In the XML template, the path for a tarball component is enclosed by <archive></archive> tag pairs. The tarball components can be as indicated:

• <archive>hadoop tarball</archive>

• <archive>hbase tarball</archive>

• <archive>zookeeper tarball</archive>

The tarball components can be picked up from URLs as listed in section 1.2. The paths of the tarball component files that are to be used should be set up as needed before running cm-hadoop-setup.

The downloaded tarball components should be placed in the /tmp directory if the default definitions in the default XML files are used.

Example

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/conf
[root@bright70 conf]# grep 'archive>' hadoop1conf.xml | grep -o '/.*gz'
/tmp/hadoop-1.2.1.tar.gz
/tmp/zookeeper-3.4.6.tar.gz
/tmp/hbase-0.98.11-hadoop1-bin.tar.gz

Files under /tmp are not intended to stay around permanently. The administrator may therefore wish to place the tarball components in a more permanent part of the filesystem instead, and change the XML definitions accordingly.

A Hadoop instance name, for example Myhadoop, can also be defined in the XML file, within the <name></name> tag pair.

Hadoop NameNodes and SecondaryNameNodes handle HDFS metadata, while DataNodes manage HDFS data. The data must be stored in the filesystem of the nodes. The default path for where the data is stored can be specified within the <dataroot></dataroot> tag pair. Multiple paths can also be set, using comma-separated paths. NameNodes, SecondaryNameNodes, and DataNodes each use the value or values set within the <dataroot></dataroot> tag pair for their root directories.

If needed, more specific tags can be used for each node type. This is useful in the case where hardware differs for the various node types. For example:

• a NameNode with 2 disk drives for Hadoop use

• a DataNode with 4 disk drives for Hadoop use

The XML file used by cm-hadoop-setup can in this case use the tag pairs:

• <namenodedatadirs></namenodedatadirs>

• <datanodedatadirs></datanodedatadirs>

If these are not specified, then the value within the <dataroot></dataroot> tag pair is used.


Example

• <namenodedatadirs>/data1,/data2</namenodedatadirs>

• <datanodedatadirs>/data1,/data2,/data3,/data4</datanodedatadirs>

Hadoop should then have the following properties added to it via the hdfs-site.xml configuration file. For the preceding tag pairs, the property values should be set as follows:

Example

• dfs.namenode.name.dir, with values:
/data1/hadoop/hdfs/namenode,/data2/hadoop/hdfs/namenode

• dfs.datanode.data.dir, with values:
/data1/hadoop/hdfs/datanode,/data2/hadoop/hdfs/datanode,/data3/hadoop/hdfs/datanode,/data4/hadoop/hdfs/datanode

An install run then displays output like the following:

Example

-rw-r--r-- 1 root root 63851630 Feb  4 15:13 hadoop-1.2.1.tar.gz
[root@bright70 ~]# cm-hadoop-setup -c /tmp/hadoop1conf.xml
Reading config from file '/tmp/hadoop1conf.xml'... done.
Hadoop flavor 'Apache', release '1.2.1'
Will now install Hadoop in /cm/shared/apps/hadoop/Apache/1.2.1 and configure instance 'Myhadoop'
Hadoop distro being installed... done.
Zookeeper being installed... done.
HBase being installed... done.
Creating module file... done.
Configuring Hadoop instance on local filesystem and images... done.
Updating images:
starting imageupdate for node 'node003'... started.
starting imageupdate for node 'node002'... started.
starting imageupdate for node 'node001'... started.
starting imageupdate for node 'node004'... started.
Waiting for imageupdate to finish... done.
Creating Hadoop instance in cmdaemon... done.
Formatting HDFS... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.

The Hadoop instance should now be running. The name defined for it in the XML file should show up within cmgui, in the Hadoop HDFS resource tree folder (figure 2.1).


Figure 2.1: A Hadoop Instance In cmgui

The instance name is also displayed within cmsh when the list command is run in hadoop mode:

Example

[root@bright70 ~]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Myhadoop    1.2.1           Apache               /etc/hadoop/Myhadoop

The instance can be removed as follows:

Example

[root@bright70 ~]# cm-hadoop-setup -u Myhadoop
Requested uninstall of Hadoop instance 'Myhadoop'.
Uninstalling Hadoop instance 'Myhadoop'
Removing:
/etc/hadoop/Myhadoop
/var/lib/hadoop/Myhadoop
/var/log/hadoop/Myhadoop
/var/run/hadoop/Myhadoop
/tmp/hadoop/Myhadoop
/etc/hadoop/Myhadoop/zookeeper
/var/lib/zookeeper/Myhadoop
/var/log/zookeeper/Myhadoop
/var/run/zookeeper/Myhadoop
/etc/hadoop/Myhadoop/hbase
/var/log/hbase/Myhadoop
/var/run/hbase/Myhadoop
/etc/init.d/hadoop-Myhadoop-*
Module file(s) deleted.
Uninstall successfully completed.

2.2 Ncurses Installation Of Hadoop Using cm-hadoop-setup

Running cm-hadoop-setup without any options starts up an ncurses GUI (figure 2.2).


Figure 2.2: The cm-hadoop-setup Welcome Screen

This provides an interactive way to add and remove Hadoop instances, along with HBase and Zookeeper components. Some explanations of the items being configured are given along the way. In addition, some minor validation checks are done, and some options are restricted.

The suggested default values will work. Other values can be chosen instead of the defaults, but some care in selection is usually a good idea. This is because Hadoop is complex software, which means that values other than the defaults can sometimes lead to unworkable configurations (section 2.3).

The ncurses installation results in an XML configuration file. This file can be used with the -c option of cm-hadoop-setup to carry out the installation.

2.3 Avoiding Misconfigurations During Hadoop Installation

A misconfiguration can be defined as a configuration that works badly, or not at all.

For Hadoop to work well, the following common misconfigurations should normally be avoided. Some of these result in warnings from Bright Cluster Manager validation checks during configuration, but can be overridden. An override is useful for cases where the administrator would just like to, for example, test some issues not related to scale or performance.

2.3.1 NameNode Configuration Choices

One of the following NameNode configuration options must be chosen when Hadoop is installed. The choice should be made with care, because changing between the options after installation is not possible.

Hadoop 1.x NameNode Configuration Choices

• NameNode can optionally have a SecondaryNameNode.

SecondaryNameNode offloads metadata operations from NameNode, and also stores the metadata offline to some extent.

It is not by any means a high availability solution. While recovery from a failed head node is possible from SecondaryNameNode, it is not easy, and it is not recommended or supported by Bright Cluster Manager.

Hadoop 2.x NameNode Configuration Choices

• NameNode and SecondaryNameNode can run as in Hadoop 1.x.

However, the following configurations are also possible:

• NameNode HA with manual failover: In this configuration, Hadoop has NameNode1 and NameNode2 up at the same time, with one active and one on standby. Which NameNode is active and which is on standby is set manually by the administrator. If one NameNode fails, then failover must be executed manually. Metadata changes are managed by ZooKeeper, which relies on a quorum of JournalNodes. The number of JournalNodes is therefore set to 3, 5, 7...

• NameNode HA with automatic failover: As for the manual case, except that in this case ZooKeeper manages failover too. Which NameNode is active and which is on standby is therefore decided automatically.

• NameNode Federation: In NameNode Federation, the storage of metadata is split among several NameNodes, each of which has a corresponding SecondaryNameNode. Each pair takes care of a part of HDFS.

In Bright Cluster Manager there are 4 NameNodes in a default NameNode federation:

– /user

– /tmp

– /staging

– /hbase

User applications do not have to know this mapping. This is because ViewFS on the client side maps the selected path to the corresponding NameNode. Thus, for example, hdfs -ls /tmp/example does not need to know that /tmp is managed by another NameNode.

Cloudera advises against using NameNode Federation for production purposes at present, due to its development status.

2.4 Installing Hadoop With Lustre

The Lustre filesystem has a client-server configuration.

2.4.1 Lustre Internal Server Installation

The procedure for installing a Lustre server varies.

The Bright Cluster Manager Knowledge Base article describes a procedure that uses Lustre sources from Whamcloud. The article is at http://kb.brightcomputing.com/faq/index.php?action=artikel&cat=9&id=176, and may be used as guidance.


2.4.2 Lustre External Server Installation

Lustre can also be configured so that the servers run external to Bright Cluster Manager. The Lustre Intel IEEL 2.x version can be configured in this manner.

2.4.3 Lustre Client Installation

It is preferred that the Lustre clients are installed on the head node, as well as on all the nodes that are to be Hadoop nodes. The clients should be configured to provide a Lustre mount on the nodes. If the Lustre client cannot be installed on the head node, then Bright Cluster Manager has the following limitations during installation and maintenance:

• the head node cannot be used to run Hadoop services

• end users cannot perform Hadoop operations, such as job submission, on the head node. Operations such as those should instead be carried out while logged in to one of the Hadoop nodes.

In the remainder of this section, a Lustre mount point of /mnt/lustre is assumed, but it can be set to any convenient directory mount point.

The user IDs and group IDs of the Lustre server and clients should be consistent. It is quite likely that they differ when first set up. The IDs should be checked at least for the following users and groups:

• users: hdfs, mapred, yarn, hbase, zookeeper

• groups: hadoop, zookeeper, hbase

If they do not match on the server and clients, then they must be made consistent manually, so that the UID and GID of the Lustre server users are changed to match the UID and GID of the Bright Cluster Manager users.
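For example, the IDs can be listed on both sides with the id command, and a mismatched account on the Lustre server can then be adjusted with usermod and groupmod. The numeric IDs and the host name lustreserver in the following sketch are illustrative only:

[root@bright70 ~]# for u in hdfs mapred yarn hbase zookeeper; do id $u; done
[root@lustreserver ~]# groupmod -g 497 hadoop
[root@lustreserver ~]# usermod -u 498 -g 497 hdfs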

Once consistency has been checked, and read/write access is working to LustreFS, the Hadoop integration can be configured.

2.4.4 Lustre Hadoop Configuration

Lustre Hadoop XML Configuration File Setup

Hadoop integration can be configured by using the file /cm/local/apps/cluster-tools/hadoop/conf/hadoop2lustreconf.xml as a starting point for the configuration. It can be copied over to, for example, /root/hadooplustreconfig.xml.

The Intel Distribution for Hadoop (IDH) and Cloudera can both run with Lustre under Bright Cluster Manager. The configuration for these can be done as follows:

• IDH

– A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair:

Example

<afs>
  <fstype>lustre</fstype>
  <fsroot>/mnt/lustre/hadoop</fsroot>
</afs>

• Cloudera

– A subdirectory of /mnt/lustre must be specified in the hadoop2lustreconf.xml file, within the <afs></afs> tag pair.

– In addition, an <fsjar></fsjar> tag pair must be specified manually, for the jar that the Intel IEEL 2.x distribution provides:

Example

<afs>
  <fstype>lustre</fstype>
  <fsroot>/mnt/lustre/hadoop</fsroot>
  <fsjar>/root/lustre/hadoop-lustre-plugin-2.3.0.jar</fsjar>
</afs>

The installation of the Lustre plugin is automatic if this jar name is set to the right name, when the cm-hadoop-setup script is run.

Lustre Hadoop Installation With cm-hadoop-setup

The XML configuration file specifies how Lustre should be integrated in Hadoop. If the configuration file is at /root/hadooplustreconfig.xml, then it can be run as:

Example

cm-hadoop-setup -c /root/hadooplustreconfig.xml

As part of configuring Hadoop to use Lustre, the execution will:

• Set the ACLs on the directory specified within the <fsroot></fsroot> tag pair. This was set to /mnt/lustre/hadoop earlier on, as an example.

• Copy the Lustre plugin from its jar path, as specified in the XML file, to the correct place on the client nodes.

Specifically, the plugin is copied into the subdirectory share/hadoop/common/lib, relative to the Hadoop installation directory. For example, the Cloudera version of Hadoop, version 2.3.0-cdh5.1.2, has the Hadoop installation directory /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2. The copy is therefore carried out in this case from /root/lustre/hadoop-lustre-plugin-2.3.0.jar to /cm/shared/apps/hadoop/Cloudera/2.3.0-cdh5.1.2/share/hadoop/common/lib.


Lustre Hadoop Integration In cmsh and cmgui

In cmsh, Lustre integration is indicated in hadoop mode:

Example

[hadoop2->hadoop]% show hdfs1 | grep -i lustre
Hadoop root for Lustre  /mnt/lustre/hadoop
Use Lustre              yes

In cmgui, the Overview tab in the items for the Hadoop HDFS resource indicates if Lustre is running, along with its mount point (figure 2.3).

Figure 2.3: A Hadoop Instance With Lustre In cmgui

2.5 Hadoop Installation In A Cloud

Hadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual), or a Cluster Extension configuration (Chapter 3 of the Cloudbursting Manual). In both cases, the cloud nodes used should be at least m1.medium.

• For Cluster-On-Demand, the following considerations apply:

– There are no specific issues. After a stop/start cycle, Hadoop recognizes the new IP addresses, and refreshes the list of nodes accordingly (section 2.4.1 of the Cloudbursting Manual).

• For Cluster Extension, the following considerations apply:

– To install Hadoop on cloud nodes, the XML configuration /cm/local/apps/cluster-tools/hadoop/conf/hadoop2clusterextensionconf.xml can be used as a guide.


– In the hadoop2clusterextensionconf.xml file, the cloud director that is to be used with the Hadoop cloud nodes must be specified by the administrator, with the <edge></edge> tag pair:

Example

<edge>
  <hosts>eu-west-1-director</hosts>
</edge>

Maintenance operations, such as a format, will automatically and transparently be carried out by cmdaemon running on the cloud director, and not on the head node.

There are some shortcomings as a result of relying upon the cloud director:

– Cloud nodes depend on the same cloud director.

– While Hadoop installation (cm-hadoop-setup) is run on the head node, users must run Hadoop commands, job submissions and so on, from the director, not from the head node.

– It is not possible to mix cloud and non-cloud nodes for the same Hadoop instance. That is, a local Hadoop instance cannot be extended by adding cloud nodes.


3 Viewing And Managing HDFS From CMDaemon

3.1 cmgui And The HDFS Resource

In cmgui, clicking on the resource folder for Hadoop in the resource tree opens up an Overview tab that displays a list of Hadoop HDFS instances, as in figure 2.1.

Clicking on an instance brings up a tab from a series of tabs associated with that instance, as in figure 3.1.

Figure 3.1: Tabs For A Hadoop Instance In cmgui

The possible tabs are: Overview (section 3.1.1), Settings (section 3.1.2), Tasks (section 3.1.3), and Notes (section 3.1.4). Various sub-panes are displayed in the possible tabs.


3.1.1 The HDFS Instance Overview Tab

The Overview tab pane (figure 3.2) arranges Hadoop-related items in sub-panes for convenient viewing.

Figure 3.2: Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed:

• Statistics: In the first sub-pane row are statistics associated with the Hadoop instance. These include node data on the number of applications that are running, as well as memory and capacity usage.

• Metrics: In the second sub-pane row, a metric can be selected for display as a graph.

• Roles: In the third sub-pane row, the roles associated with a node are displayed.

3.1.2 The HDFS Instance Settings Tab

The Settings tab pane (figure 3.3) displays the configuration of the Hadoop instance.


Figure 3.3: Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed:

• WebHDFS: Allows HDFS to be modified via the web.

• Description: An arbitrary, user-defined description.

• Directory settings for the Hadoop instance: Directories for configuration, temporary purposes, and logging.

• Topology awareness setting: Optimizes Hadoop for racks, switches, or nothing, when it comes to network topology.

• Replication factors: Set the default and maximum replication allowed.

• Balancing settings: Spread loads evenly across the instance.

3.1.3 The HDFS Instance Tasks Tab

The Tasks tab (figure 3.4) displays Hadoop-related tasks that the administrator can execute.


Figure 3.4: Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to:

• Restart, start, or stop all the Hadoop daemons.

• Start/Stop the Hadoop balancer.

• Format the Hadoop filesystem. A warning is given before formatting.

• Carry out a manual failover for the NameNode. This feature means making the other NameNode the active one, and is available only for Hadoop instances that support NameNode failover.

3.1.4 The HDFS Instance Notes Tab

The Notes tab is like a jotting pad, and can be used by the administrator to write notes for the associated Hadoop instance.

3.2 cmsh And hadoop Mode

The cmsh front end can be used instead of cmgui to display information on Hadoop-related values, and to carry out Hadoop-related tasks.

3.2.1 The show And overview Commands

The show Command

Within hadoop mode, the show command displays parameters that correspond mostly to cmgui's Settings tab in the Hadoop resource (section 3.1.2):

Example

[root@bright70 conf]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Apache121   1.2.1           Apache               /etc/hadoop/Apache121
[bright70->hadoop]% show apache121
Parameter                              Value
-------------------------------------  ------------------------------------
Automatic failover                     Disabled
Balancer Threshold                     10
Balancer period                        2
Balancing policy                       dataNode
Cluster ID
Configuration directory                /etc/hadoop/Apache121
Configuration directory for HBase      /etc/hadoop/Apache121/hbase
Configuration directory for ZooKeeper  /etc/hadoop/Apache121/zookeeper
Creation time                          Fri, 07 Feb 2014 17:03:01 CET
Default replication factor             3
Enable HA for YARN                     no
Enable NFS gateway                     no
Enable Web HDFS                        no
HA enabled                             no
HA name service
HDFS Block size                        67108864
HDFS Permission                        no
HDFS Umask                             077
HDFS log directory                     /var/log/hadoop/Apache121
Hadoop distribution                    Apache
Hadoop root
Hadoop version                         1.2.1
Installation directory for HBase       /cm/shared/apps/hadoop/Apache/hbase-0+
Installation directory for ZooKeeper   /cm/shared/apps/hadoop/Apache/zookeepe+
Installation directory                 /cm/shared/apps/hadoop/Apache/1.2.1
Maximum replication factor             512
Network
Revision
Root directory for data                /var/lib
Temporary directory                    /tmp/hadoop/Apache121
Topology                               Switch
Use HTTPS                              no
Use Lustre                             no
Use federation                         no
Use only HTTPS                         no
YARN automatic failover                Hadoop
description                            installed from: /tmp/hadoop-1.2.1.tar+
name                                   Apache121
notes

The overview Command

Similarly, the overview command corresponds somewhat to cmgui's Overview tab in the Hadoop resource (section 3.1.1), and provides Hadoop-related information on the system resources that are used:

Example

[bright70->hadoop]% overview apache121
Parameter                       Value
------------------------------  ---------------------------
Name                            apache121
Capacity total                  0B
Capacity used                   0B
Capacity remaining              0B
Heap memory total               0B
Heap memory used                0B
Heap memory remaining           0B
Non-heap memory total           0B
Non-heap memory used            0B
Non-heap memory remaining       0B
Nodes available                 0
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     0
Total blocks                    0
Missing blocks                  0
Under-replicated blocks         0
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       0
Applications running            0
Applications pending            0
Applications submitted          0
Applications completed          0
Applications failed             0
Federation setup                no

Role                            Node
------------------------------  ---------------------------
DataNode, ZooKeeper             bright70,node001,node002

3.2.2 The Tasks Commands

Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The services Commands For Hadoop Services

Hadoop services can be started, stopped, and restarted, with:

• restartallservices

• startallservices

• stopallservices

Example

[bright70->hadoop]% restartallservices apache121
Will now stop all Hadoop services for instance 'apache121'... done.
Will now start all Hadoop services for instance 'apache121'... done.
[bright70->hadoop]%

The startbalancer And stopbalancer Commands For Hadoop

For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer

• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:

• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop*]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp  Iteration#  Bytes Moved  Bytes To Move  Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%
Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121
For details type: events details 152727

The formathdfs Command

Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command

Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
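For example, for an HA-enabled instance with a standby NameNode on node002, the active status could be handed over to node002 as follows (a sketch, with a hypothetical instance name):

[bright70->hadoop]% manualfailover --to node002 hdfs1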


4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui

Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:


• Selecting A Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui

The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


The options buttons act just as for any other service (section 3.11 of the Administrator Manual), which means it includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.


5 Running Hadoop Jobs

5.1 Shakedown Runs

The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220
================================================================
Press [CTRL+C] to stop...
================================================================
...
================================================================
start cleaning directories...
================================================================
...
================================================================
clean directories done
================================================================
================================================================
start doing gen_test...
================================================================
14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2
...
Job Counters
...
Map-Reduce Framework
...
org.apache.hadoop.examples.terasort.TeraGen$Counters
...
14/03/24 15:07:03 INFO terasort.TeraSort: starting
...
14/03/24 15:09:12 INFO terasort.TeraSort: done
================================================================================
gen_test done
================================================================================
================================================================================
start doing PI test...
================================================================================
Working Directory = /user/root/bbp
...

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                       Value
------------------------------  ---------------------------------
Name                            Apache220
Capacity total                  27.56GB
Capacity used                   7.246MB
Capacity remaining              16.41GB
Heap memory total               280.7MB
Heap memory used                152.1MB
Heap memory remaining           128.7MB
Non-heap memory total           258.1MB
Non-heap memory used            251.9MB
Non-heap memory remaining       6.155MB
Nodes available                 3
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     72
Total blocks                    31
Missing blocks                  0
Under-replicated blocks         2
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       59666
Applications running            1
Applications pending            0
Applications submitted          7
Applications completed          6
Applications failed             0
High availability               Yes (automatic failover disabled)
Federation setup                no

Role                                                            Node
--------------------------------------------------------------  -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper  node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper              node002

5.2 Example End User Job Run

Running a job from a jar file individually can be done by an end user.

An end user, fred, can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.

The user fred can then submit a run from the pi value estimator in the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache/2.2.0
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/\
hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both, as in the run sketched next.
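For instance, a run with 4 maps and 100000 samples per map gives a much closer estimate (output not shown):

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/\
hadoop-mapreduce-examples-2.2.0.jar pi 4 100000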


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo

Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup

Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation And What Accumulo Installation Does

The following applies to using cm-accumulo-setup:

bull A Hadoop instance with ZooKeeper must already be installed

bull Hadoop can be configured with a single NameNode or NameNodeHA but not with NameNode federation

bull The cm-accumulo-setup script only installs Accumulo on the ac-tive head node and on the DataNodes of the chosen Hadoop in-stance

bull The script assigns no roles to nodes

bull Accumulo executables are copied by the script to a subdirectory un-der cmsharedhadoop

bull Accumulo configuration files are copied by the script to underetchadoop This is done both on the active headnode andon the necessary image(s)

bull By default Accumulo Tablet Servers are set to use 1GB of memoryA different value can be set via cm-accumulo-setup

bull The secret string for the instance is a random string created bycm-accumulo-setup

bull A password for the root user must be specified

bull The Tracer service will use Accumulo user root to connect to Ac-cumulo

bull The services for Garbage Collector Master Tracer and Monitor areby default installed and run on the headnode They can be installedand run on another node instead as shown in the next exampleusing the --master option

bull A Tablet Server will be started on each DataNode

bull cm-accumulo-setup tries to build the native map library

bull Validation tests are carried out by the script

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option -p <rootpass> is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.

The option -s <heapsize> is not mandatory. If not set, a default value of 1GB is used.

The option --master <nodename> is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005

Accumulo release '1.6.2'

Accumulo GC, Master, Monitor, and Tracer services will be run on node node005

Found Hadoop instance 'hdfs1', release 2.6.0

Accumulo being installed done

Creating directories for Accumulo done

Creating module file for Accumulo done

Creating configuration files for Accumulo done

Updating images done

Setting up Accumulo directories in HDFS done

Executing 'accumulo init' done

Initializing services for Accumulo (on DataNodes) done

Initializing master services for Accumulo done

Waiting for NameNode to be ready done

Executing validation test done

Installation successfully completed

Finished

6.1.2 Accumulo Removal With cm-accumulo-setup

cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1

Requested removal of Accumulo for Hadoop instance 'hdfs1'.

Stopping/removing services done

Removing module file done

Removing additional Accumulo directories done

Updating images done

Removal successfully completed

Finished

6.1.3 Accumulo MapReduce Example

Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo

bash-4.1$ module load accumulo/hdfs1

bash-4.1$ cd $ACCUMULO_HOME

bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 --maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10
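
After the ingest job completes, the resulting sort table can be inspected from the Accumulo shell. The following is a minimal sketch only, reusing the root user and the password secret from the example above:

Example

bash-4.1$ accumulo shell -u root -p secret -e 'scan -t sort'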

6.2 Hive

Apache Hive is a data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup

Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client (a session sketch is shown after this list):

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;

DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.

– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script, using hive and beeline.
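
The GRANT statements from the list above can be carried out in an interactive MySQL session along the following lines. This is a sketch only: the database name metastore_hdfs1 and the password hivepass are placeholders chosen for the illustration:

Example

[root@bright70 ~]# mysql -u root -p
mysql> GRANT ALL PRIVILEGES ON metastore_hdfs1.* TO 'hive'@'%' IDENTIFIED BY 'hivepass';
mysql> FLUSH PRIVILEGES;
mysql> DROP DATABASE IF EXISTS metastore_hdfs1;
mysql> quit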

An Example Run With cm-hive-setup

Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005

Hive release '1.1.0-bin'

Using MySQL server on active headnode

Hive service will be run on node node005

Using MySQL Connector/J installed in /usr/share/java/

Hive being installed done

Creating directories for Hive done

Creating module file for Hive done

Creating configuration files for Hive done

Initializing database 'metastore_hdfs1' in MySQL done

Waiting for NameNode to be ready done

Creating HDFS directories for Hive done

Updating images done

Waiting for NameNode to be ready done

Hive setup validation

-- testing 'hive' client

-- testing 'beeline' client

Hive setup validation done

Installation successfully completed

Finished

6.2.2 Hive Removal With cm-hive-setup

cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance 'hdfs1'.

Stopping/removing services done

Removing module file done

Removing additional Hive directories done

Updating images done

Removal successfully completed

Finished

6.2.3 Beeline

The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 -d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'

Connecting to jdbc:hive2://node005.cm.cluster:10000

Connected to: Apache Hive (version 1.1.0)

Driver: Hive JDBC (version 1.1.0)

Transaction isolation: TRANSACTION_REPEATABLE_READ

+-----------+--+

| tab_name |

+-----------+--+

| test |

| test2 |

+-----------+--+

2 rows selected (0.243 seconds)

Beeline version 1.1.0 by Apache Hive

Closing: 0: jdbc:hive2://node005.cm.cluster:10000
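
Beeline can run other HiveQL statements in the same way. For instance, a table could be created and listed with something like the following sketch; the table name test3 is illustrative:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 -d org.apache.hive.jdbc.HiveDriver -e 'CREATE TABLE test3 (id INT, name STRING); SHOW TABLES;'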

6.3 Kafka

Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup

Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does
The following applies to using cm-kafka-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/kafka_2.11-0.8.2.1.tgz

Kafka release '0.8.2.1' for Scala '2.11'

Found Hadoop instance 'hdfs1', release 1.2.1

Kafka being installed done

Creating directories for Kafka done

Creating module file for Kafka done

Creating configuration files for Kafka done

Updating images done

Initializing services for Kafka (on ZooKeeper nodes) done

Executing validation test done

Installation successfully completed

Finished
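
Basic Kafka operation can then be checked with the console scripts shipped in the tarball. The following is a minimal sketch: the module name kafka/hdfs1 and the ZooKeeper endpoint node001:2181 are assumptions for this illustration, and should be adjusted to the actual instance:

Example

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 --replication-factor 1 --partitions 1 --topic test
[root@bright70 ~]# kafka-topics.sh --list --zookeeper node001:2181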

6.3.2 Kafka Removal With cm-kafka-setup

cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1

Requested removal of Kafka for Hadoop instance hdfs1

Stopping/removing services done

Removing module file done

Removing additional Kafka directories done

Updating images done

Removal successfully completed

Finished

6.4 Pig

Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup

Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.

Prerequisites For Pig Installation And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/pig-0.14.0.tar.gz

Pig release '0.14.0'

Pig being installed done

Creating directories for Pig done

Creating module file for Pig done

Creating configuration files for Pig done

Waiting for NameNode to be ready

Waiting for NameNode to be ready done

Validating Pig setup

Validating Pig setup done

Installation successfully completed

Finished

6.4.2 Pig Removal With cm-pig-setup

cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1

Requested removal of Pig for Hadoop instance 'hdfs1'.

Stopping/removing services done

Removing module file done

Removing additional Pig directories done

Updating images done

Removal successfully completed

Finished

6.4.3 Using Pig

Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance. A script along the lines of the sketch below could serve as the /tmp/smoke.pig input.
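
A minimal Pig Latin word-count script of this kind might look as follows; the HDFS input path /user/root/input.txt is illustrative, and the file must have been copied into HDFS beforehand:

Example

-- /tmp/smoke.pig: word count over a file previously copied into HDFS
lines = LOAD '/user/root/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;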

6.5 Spark

Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup

Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode. This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.

The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode. This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release 2.6.0

Spark will be installed in YARN (client/cluster) mode

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Waiting for NameNode to be ready done

Copying Spark assembly jar to HDFS done

Waiting for NameNode to be ready done

Validating Spark setup done

Installation successfully completed

Finished

An Example Run With cm-spark-setup In Standalone Mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release 2.6.0

Spark will be installed to work in Standalone mode

Spark Master service will be run on node node005

Spark will use all DataNodes as WorkerNodes

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Updating images done

Initializing Spark Master service done

Initializing Spark Worker service done

Validating Spark setup done

Installation successfully completed

Finished

6.5.2 Spark Removal With cm-spark-setup

cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance 'hdfs1'.

Stopping/removing services done

Removing module file done

Removing additional Spark directories done

Removal successfully completed

Finished

6.5.3 Using Spark

Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1

[root@bright70 ~]# spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1

[root@bright70 ~]# spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar
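
In either deploy mode, the state of a submitted application can be followed with the standard YARN command-line tools. A sketch; the application ID placeholder must be replaced with an actual ID taken from the list output:

Example

[root@bright70 ~]# yarn application -list
[root@bright70 ~]# yarn logs -applicationId <application ID>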

6.6 Sqoop

Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup

Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.

Prerequisites For Sqoop Installation And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005

Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar

Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'

Sqoop service will be run on node node005

Found Hadoop instance 'hdfs1', release 2.2.0

Sqoop being installed done

Creating directories for Sqoop done

Creating module file for Sqoop done

Creating configuration files for Sqoop done

Updating images done

Installation successfully completed

Finished
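
Basic connectivity between Sqoop and MySQL can then be checked from the node running Sqoop. This is a sketch only: the module name sqoop/hdfs1 and the JDBC URL are assumptions for the illustration, and should be adjusted to the actual setup:

Example

[root@node005 ~]# module load sqoop/hdfs1
[root@node005 ~]# sqoop list-databases --connect jdbc:mysql://bright70.cm.cluster/ --username root -P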

6.6.2 Sqoop Removal With cm-sqoop-setup

cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance 'hdfs1'.

Stopping/removing services done

Removing module file done

Removing additional Sqoop directories done

Updating images done

Removal successfully completed

Finished

6.7 Storm

Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemon to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup

Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node, and on the necessary image(s).

• Validation tests are carried out by the script.

An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t apache-storm-0.9.4.tar.gz --nimbus node005

Storm release '0.9.4'

Storm Nimbus and UI services will be run on node node005

Found Hadoop instance 'hdfs1', release 2.2.0

Storm being installed done

Creating directories for Storm done

Creating module file for Storm done

Creating configuration files for Storm done

Updating images done

Initializing worker services for Storm (on DataNodes) done

Initializing Nimbus services for Storm done

Executing validation test done

Installation successfully completed

Finished

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup

The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance 'hdfs1'.

Stopping/removing services done

Removing module file done

Removing additional Storm directories done

Updating images done

Removal successfully completed

Finished

6.7.3 Using Storm

The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1

[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2

470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...

476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

[==================================================] 3248859 / 3248859

File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}

687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15
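
A running topology can also be removed with the storm client, which is what the kill action in the Storm UI mentioned earlier does:

Example

[root@bright70 ~]# storm kill WordCount2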

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 11: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

6 Installing Hadoop

212 An Install RunAn XML template can be used based on the examples in the directorycmlocalappscluster-toolshadoopconf

In the XML template the path for a tarball component is enclosed byltarchivegt ltarchivegt tag pairs The tarball components can be asindicated

bull ltarchivegthadoop tarballltarchivegt

bull ltarchivegthbase tarballltarchivegt

bull ltarchivegtzookeeper tarballltarchivegt

The tarball components can be picked up from URLs as listed in sec-tion 12 The paths of the tarball component files that are to be usedshould be set up as needed before running cm-hadoop-setup

The downloaded tarball components should be placed in the tmpdirectory if the default definitions in the default XML files are used

Example

[rootbright70 ~] cd cmlocalappscluster-toolshadoopconf

[rootbright70 conf] grep rsquoarchivegtrsquo hadoop1confxml | grep -o gz

tmphadoop-121targz

tmpzookeeper-346targz

tmphbase-09811-hadoop1-bintargz

Files under tmp are not intended to stay around permanently Theadministrator may therefore wish to place the tarball components in amore permanent part of the filesystem instead and change the XML def-initions accordingly

A Hadoop instance name for example Myhadoop can also be definedin the XML file within the ltnamegtltnamegt tag pair

Hadoop NameNodes and SecondaryNameNodes handle HDFS meta-data while DataNodes manage HDFS data The data must be stored inthe filesystem of the nodes The default path for where the data is storedcan be specified within the ltdatarootgtltdatarootgt tag pair Mul-tiple paths can also be set using comma-separated paths NameNodesSecondaryNameNodes and DataNodes each use the value or values setwithin the ltdatarootgtltdatarootgt tag pair for their root directories

If needed more specific tags can be used for each node type This isuseful in the case where hardware differs for the various node types Forexample

bull a NameNode with 2 disk drives for Hadoop use

bull a DataNode with 4 disk drives for Hadoop use

The XML file used by cm-hadoop-setup can in this case use the tagpairs

bull ltnamenodedatadirsgtltnamenodedatadirsgt

bull ltdatanodedatadirsgtltdatanodedatadirsgt

If these are not specified then the value within theltdatarootgtltdatarootgt tag pair is used

copy Bright Computing Inc

21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt 7

Example

bull ltnamenodedatadirsgtdata1data2ltnamenodedatadirsgt

bull ltdatanodedatadirsgtdata1data2data3data4ltdatanodedatadirsgt

Hadoop should then have the following dfsnamedir proper-ties added to it via the hdfs-sitexml configuration file For the pre-ceding tag pairs the property values should be set as follows

Example

bull dfsnamenodenamedir with valuesdata1hadoophdfsnamenodedata2hadoophdfsnamenode

bull dfsdatanodenamedir with valuesdata1hadoophdfsdatanodedata2hadoophdfsdatanodedata3hadoophdfsdatanodedata4hadoophdfsdatanode

An install run then displays output like the following

Example

-rw-r--r-- 1 root root 63851630 Feb 4 1513 hadoop-121targz

[rootbright70 ~] cm-hadoop-setup -c tmphadoop1confxml

Reading config from file rsquotmphadoop1confxmlrsquo done

Hadoop flavor rsquoApachersquo release rsquo121rsquo

Will now install Hadoop in cmsharedappshadoopApache121 and configure instance rsquoMyhadooprsquo

Hadoop distro being installed done

Zookeeper being installed done

HBase being installed done

Creating module file done

Configuring Hadoop instance on local filesystem and images done

Updating images

starting imageupdate for node rsquonode003rsquo started

starting imageupdate for node rsquonode002rsquo started

starting imageupdate for node rsquonode001rsquo started

starting imageupdate for node rsquonode004rsquo started

Waiting for imageupdate to finish done

Creating Hadoop instance in cmdaemon done

Formatting HDFS done

Waiting for datanodes to come up done

Setting up HDFS done

The Hadoop instance should now be running The name defined forit in the XML file should show up within cmgui in the Hadoop HDFSresource tree folder (figure 21)

copy Bright Computing Inc

8 Installing Hadoop

Figure 21 A Hadoop Instance In cmgui

The instance name is also displayed within cmshwhen the list com-mand is run in hadoop mode

Example

[rootbright70 ~] cmsh

[bright70] hadoop

[bright70-gthadoop] list

Name (key) Hadoop version Hadoop distribution Configuration directory

---------- -------------- ------------------- -----------------------

Myhadoop 121 Apache etchadoopMyhadoop

The instance can be removed as follows

Example

[rootbright70 ~] cm-hadoop-setup -u Myhadoop

Requested uninstall of Hadoop instance rsquoMyhadooprsquo

Uninstalling Hadoop instance rsquoMyhadooprsquo

Removing

etchadoopMyhadoop

varlibhadoopMyhadoop

varloghadoopMyhadoop

varrunhadoopMyhadoop

tmphadoopMyhadoop

etchadoopMyhadoopzookeeper

varlibzookeeperMyhadoop

varlogzookeeperMyhadoop

varrunzookeeperMyhadoop

etchadoopMyhadoophbase

varloghbaseMyhadoop

varrunhbaseMyhadoop

etcinitdhadoop-Myhadoop-Module file(s) deleted

Uninstall successfully completed

22 Ncurses Installation Of Hadoop Usingcm-hadoop-setup

Running cm-hadoop-setup without any options starts up an ncursesGUI (figure 22)

copy Bright Computing Inc

23 Avoiding Misconfigurations During Hadoop Installation 9

Figure 22 The cm-hadoop-setup Welcome Screen

This provides an interactive way to add and remove Hadoop instancesalong with HBase and Zookeeper components Some explanations of theitems being configured are given along the way In addition some minorvalidation checks are done and some options are restricted

The suggested default values will work Other values can be choseninstead of the defaults but some care in selection usually a good ideaThis is because Hadoop is a complex software which means that valuesother than the defaults can sometimes lead to unworkable configurations(section 23)

The ncurses installation results in an XML configuration file This filecan be used with the -c option of cm-hadoop-setup to carry out theinstallation

23 Avoiding Misconfigurations During HadoopInstallation

A misconfiguration can be defined as a configuration that works badly ornot at all

For Hadoop to work well the following common misconfigurationsshould normally be avoided Some of these result in warnings from BrightCluster Manager validation checks during configuration but can be over-ridden An override is useful for cases where the administrator wouldjust like to for example test some issues not related to scale or perfor-mance

231 NameNode Configuration ChoicesOne of the following NameNode configuration options must be chosenwhen Hadoop is installed The choice should be made with care becausechanging between the options after installation is not possible

Hadoop 1x NameNode Configuration Choicesbull NameNode can optionally have a SecondaryNameNode

SecondaryNameNode offloads metadata operations from Name-Node and also stores the metadata offline to some extent

It is not by any means a high availability solution While recovery

copy Bright Computing Inc

10 Installing Hadoop

from a failed head node is possible from SecondaryNameNode it isnot easy and it is not recommended or supported by Bright ClusterManager

Hadoop 2x NameNode Configuration Choicesbull NameNode and SecondaryNameNode can run as in Hadoop 1x

However the following configurations are also possible

bull NameNode HA with manual failover In this configuration Hadoophas NameNode1 and NameNode2 up at the same time with one ac-tive and one on standby Which NameNode is active and which ison standby is set manually by the administrator If one NameNodefails then failover must be executed manually Metadata changesare managed by ZooKeeper which relies on a quorum of Journal-Nodes The number of JournalNodes is therefore set to 3 5 7

bull NameNode HA with automatic failover As for the manual caseexcept that in this case ZooKeeper manages failover too WhichNameNode is active and which is on standby is therefore decidedautomatically

bull NameNode Federation In NameNode Fedaration the storage ofmetadata is split among several NameNodes each of which has acorresponding SecondaryNameNode Each pair takes care of a partof HDFS

In Bright Cluster Manager there are 4 NameNodes in a default Na-meNode federation

ndash user

ndash tmp

ndash staging

ndash hbase

User applications do not have to know this mapping This isbecause ViewFS on the client side maps the selected path tothe corresponding NameNode Thus for example hdfs -lstmpexample does not need to know that tmp is managed byanother NameNode

Cloudera advise against using NameNode Federation for produc-tion purposes at present due to its development status

24 Installing Hadoop With LustreThe Lustre filesystem has a client-server configuration

241 Lustre Internal Server InstallationThe procedure for installing a Lustre server varies

The Bright Cluster Manager Knowledge Base article describes aprocedure that uses Lustre sources from Whamcloud The article isat httpkbbrightcomputingcomfaqindexphpaction=artikelampcat=9ampid=176 and may be used as guidance

copy Bright Computing Inc

24 Installing Hadoop With Lustre 11

242 Lustre External Server InstallationLustre can also be configured so that the servers run external to BrightCluster Manager The Lustre Intel IEEL 2x version can be configured inthis manner

243 Lustre Client InstallationIt is preferred that the Lustre clients are installed on the head node aswell as on all the nodes that are to be Hadoop nodes The clients shouldbe configured to provide a Lustre mount on the nodes If the Lustre clientcannot be installed on the head node then Bright Cluster Manager hasthe following limitations during installation and maintenance

bull the head node cannot be used to run Hadoop services

bull end users cannot perform Hadoop operations such as job submis-sion on the head node Operations such as those should instead becarried out while logged in to one of the Hadoop nodes

In the remainder of this section a Lustre mount point of mntlustreis assumed but it can be set to any convenient directory mount point

The user IDs and group IDs of the Lustre server and clients shouldbe consistent It is quite likely that they differ when first set up The IDsshould be checked at least for the following users and groups

bull users hdfs mapred yarn hbase zookeeper

bull groups hadoop zookeeper hbase

If they do not match on the server and clients then they must be madeconsistent manually so that the UID and GID of the Lustre server usersare changed to match the UID and GID of the Bright Cluster Managerusers

Once consistency has been checked and readwrite access is workingto LustreFS the Hadoop integration can be configured

244 Lustre Hadoop ConfigurationLustre Hadoop XML Configuration File SetupHadoop integration can be configured by using the file cmlocalappscluster-toolshadoopconfhadoop2lustreconfxmlas a starting point for the configuration It can be copied over to forexample roothadooplustreconfigxml

The Intel Distribution for Hadoop (IDH) and Cloudera can both runwith Lustre under Bright Cluster Manager The configuration for thesecan be done as follows

bull IDH

ndash A subdirectory of mntlustre must be specified in thehadoop2lustreconfxml file within the ltafsgtltafsgt tagpair

Example

ltafsgt

ltfstypegtlustreltfstypegt

ltfsrootgtmntlustrehadoopltfsrootgt

copy Bright Computing Inc

12 Installing Hadoop

ltafsgt

bull Cloudera

ndash A subdirectory of mntlustre must be specified in thehadoop2lustreconfxml file within the ltafsgtltafsgt tagpair

ndash In addition an ltfsjargtltfsjargt tag pair must be specifiedmanually for the jar that the Intel IEEL 2x distribution pro-vides

Example

ltafsgt

ltfstypegtlustreltfstypegt

ltfsrootgtmntlustrehadoopltfsrootgt

ltfsjargtrootlustrehadoop-lustre-plugin-230jarltfsjargt

ltafsgt

The installation of the Lustre plugin is automatic if this jarname is set to the right name when the cm-hadoop-setupscript is run

Lustre Hadoop Installation With cm-hadoop-setupThe XML configuration file specifies how Lustre should be integrated inHadoop If the configuration file is at ltroothadooplustreconfigxmlgt then it can be run as

Example

cm-hadoop-setup -c ltroothadooplustreconfigxmlgt

As part of configuring Hadoop to use Lustre the execution will

bull Set the ACLs on the directory specified within the ltfsrootgtltfsrootgttag pair This was set to mntlustrehadoop earlier on as an ex-ample

bull Copy the Lustre plugin from its jar path as specified in the XML fileto the correct place on the client nodes

Specifically the subdirectory sharehadoopcommonlib iscopied into a directory relative to the Hadoop installation direc-tory For example the Cloudera version of Hadoop version 230-cdh512 has the Hadoop installation directory cmshareappshadoopClouder230-cdh512 The copy is therefore car-ried out in this case fromrootlustrehadoop-lustre-plugin-230jartocmsharedappshadoopCloudera230-cdh512sharehadoopcommonlib

copy Bright Computing Inc

25 Hadoop Installation In A Cloud 13

Lustre Hadoop Integration In cmsh and cmgui

In cmsh Lustre integration is indicated in hadoop mode

Example

[hadoop2-gthadoop] show hdfs1 | grep -i lustre

Hadoop root for Lustre mntlustrehadoop

Use Lustre yes

In cmgui the Overview tab in the items for the Hadoop HDFS re-source indicates if Lustre is running along with its mount point (fig-ure 23)

Figure 23 A Hadoop Instance With Lustre In cmgui

25 Hadoop Installation In A CloudHadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual) or a Clus-ter Extension configuration (Chapter 3 of the Cloudbursting Manual) Inboth cases the cloud nodes used should be at least m1medium

bull For Cluster-On-Demand the following considerations apply

ndash There are no specific issues After a stopstart cycle Hadooprecognizes the new IP addresses and refreshes the list of nodesaccordingly (section 241 of the Cloudbursting Manual)

bull For Cluster Extension the following considerations apply

ndash To install Hadoop on cloud nodes the XML configurationcmlocalappscluster-toolshadoopconfhadoop2clusterextensionconfxmlcan be used as a guide

copy Bright Computing Inc

14 Installing Hadoop

ndash In the hadoop2clusterextensionconfxml file the clouddirector that is to be used with the Hadoop cloud nodes mustbe specified by the administrator with the ltedgegtltedgegttag pair

Example

ltedgegt

lthostsgteu-west-1-directorlthostsgt

ltedgegt

Maintenance operations such as a format will automaticallyand transparently be carried out by cmdaemon running on thecloud director and not on the head node

There are some shortcomings as a result of relying upon the clouddirector

ndash Cloud nodes depend on the same cloud director

ndash While Hadoop installation (cm-hadoop-setup) is run on thehead node users must run Hadoop commandsmdashjob submis-sions and so onmdashfrom the director not from the head node

ndash It is not possible to mix cloud and non-cloud nodes for thesame Hadoop instance That is a local Hadoop instance can-not be extended by adding cloud nodes

copy Bright Computing Inc

3Viewing And Managing HDFS

From CMDaemon31 cmgui And The HDFS ResourceIn cmgui clicking on the resource folder for Hadoop in the resource treeopens up an Overview tab that displays a list of Hadoop HDFS instancesas in figure 21

Clicking on an instance brings up a tab from a series of tabs associatedwith that instance as in figure 31

Figure 31 Tabs For A Hadoop Instance In cmgui

The possible tabs are Overview (section 311) Settings (section 312)Tasks (section 313) and Notes (section 314) Various sub-panes aredisplayed in the possible tabs

copy Bright Computing Inc

16 Viewing And Managing HDFS From CMDaemon

311 The HDFS Instance Overview TabThe Overview tab pane (figure 32) arranges Hadoop-related items insub-panes for convenient viewing

Figure 32 Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed

bull Statistics In the first sub-pane row are statistics associated with theHadoop instance These include node data on the number of appli-cations that are running as well as memory and capacity usage

bull Metrics In the second sub-pane row a metric can be selected fordisplay as a graph

bull Roles In the third sub-pane row the roles associated with a nodeare displayed

312 The HDFS Instance Settings TabThe Settings tab pane (figure 33) displays the configuration of theHadoop instance

copy Bright Computing Inc

31 cmgui And The HDFS Resource 17

Figure 33 Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed

bull WebHDFS Allows HDFS to be modified via the web

bull Description An arbitrary user-defined description

bull Directory settings for the Hadoop instance Directories for config-uration temporary purposes and logging

bull Topology awareness setting Optimizes Hadoop for racks switchesor nothing when it comes to network topology

bull Replication factors Set the default and maximum replication al-lowed

bull Balancing settings Spread loads evenly across the instance

313 The HDFS Instance Tasks TabThe Tasks tab (figure 34) displays Hadoop-related tasks that the admin-istrator can execute

copy Bright Computing Inc

18 Viewing And Managing HDFS From CMDaemon

Figure 34 Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to

bull Restart start or stop all the Hadoop daemons

bull StartStop the Hadoop balancer

bull Format the Hadoop filesystem A warning is given before format-ting

bull Carry out a manual failover for the NameNode This feature meansmaking the other NameNode the active one and is available onlyfor Hadoop instances that support NameNode failover

314 The HDFS Instance Notes TabThe Notes tab is like a jotting pad and can be used by the administratorto write notes for the associated Hadoop instance

32 cmsh And hadoop ModeThe cmsh front end can be used instead of cmgui to display informationon Hadoop-related values and to carry out Hadoop-related tasks

321 The show And overview CommandsThe show CommandWithin hadoop mode the show command displays parameters that cor-respond mostly to cmguirsquos Settings tab in the Hadoop resource (sec-tion 312)

Example

[rootbright70 conf] cmsh

[bright70] hadoop

[bright70-gthadoop] list

Name (key) Hadoop version Hadoop distribution Configuration directory

---------- -------------- ------------------- -----------------------

Apache121 121 Apache etchadoopApache121

[bright70-gthadoop] show apache121

Parameter Value

copy Bright Computing Inc

32 cmsh And hadoop Mode 19

------------------------------------- ------------------------------------

Automatic failover Disabled

Balancer Threshold 10

Balancer period 2

Balancing policy dataNode

Cluster ID

Configuration directory etchadoopApache121

Configuration directory for HBase etchadoopApache121hbase

Configuration directory for ZooKeeper etchadoopApache121zookeeper

Creation time Fri 07 Feb 2014 170301 CET

Default replication factor 3

Enable HA for YARN no

Enable NFS gateway no

Enable Web HDFS no

HA enabled no

HA name service

HDFS Block size 67108864

HDFS Permission no

HDFS Umask 077

HDFS log directory varloghadoopApache121

Hadoop distribution Apache

Hadoop root

Hadoop version 121

Installation directory for HBase cmsharedappshadoopApachehbase-0+

Installation directory for ZooKeeper cmsharedappshadoopApachezookeepe+

Installation directory cmsharedappshadoopApache121

Maximum replication factor 512

Network

Revision

Root directory for data varlib

Temporary directory tmphadoopApache121

Topology Switch

Use HTTPS no

Use Lustre no

Use federation no

Use only HTTPS no

YARN automatic failover Hadoop

description installed from tmphadoop-121tar+

name Apache121

notes

The overview CommandSimilarly the overview command corresponds somewhat to thecmguirsquos Overview tab in the Hadoop resource (section 311) and pro-vides hadoop-related information on the system resources that are used

Example

[bright70-gthadoop] overview apache121

Parameter Value

----------------------------- ---------------------------

Name apache121

Capacity total 0B

Capacity used 0B

Capacity remaining 0B

Heap memory total 0B

copy Bright Computing Inc

20 Viewing And Managing HDFS From CMDaemon

Heap memory used 0B

Heap memory remaining 0B

Non-heap memory total 0B

Non-heap memory used 0B

Non-heap memory remaining 0B

Nodes available 0

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 0

Total blocks 0

Missing blocks 0

Under-replicated blocks 0

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 0

Applications running 0

Applications pending 0

Applications submitted 0

Applications completed 0

Applications failed 0

Federation setup no

Role Node

------------------------------ ---------------------------

DataNode ZooKeeper bright70node001node002

322 The Tasks CommandsWithin hadoopmode the following commands run tasks that correspondmostly to the Tasks tab (section 313) in the cmgui Hadoop resource

The services Commands For Hadoop ServicesHadoop services can be started stopped and restarted with

bull restartallservices

bull startallservices

bull stopallservices

Example

[bright70-gthadoop] restartallservices apache121

Will now stop all Hadoop services for instance rsquoapache121rsquo done

Will now start all Hadoop services for instance rsquoapache121rsquo done

[bright70-gthadoop]

The startbalancer And stopbalancer Commands For HadoopFor efficient access to HDFS file block level usage across nodes shouldbe reasonably balanced Hadoop can start and stop balancing across theinstance with the following commands

bull startbalancer

bull stopbalancer

The balancer policy threshold and period can be retrieved or set forthe instance using the parameters

copy Bright Computing Inc

32 cmsh And hadoop Mode 21

bull balancerperiod

bull balancerthreshold

bull balancingpolicy

Example

[bright70-gthadoop] get apache121 balancerperiod

2

[bright70-gthadoop] set apache121 balancerperiod 3

[bright70-gthadoop] commit

[bright70-gthadoop] startbalancer apache121

Code 0

Starting Hadoop balancer daemon (hadoop-apache121-balancer)starting balancer logging to varloghadoopapache121hdfshadoop-hdfs-balancer-bright70out

Time Stamp Iteration Bytes Moved Bytes To Move Bytes Being Moved

The cluster is balanced Exiting

[bright70-gthadoop]

Thu Mar 20 152702 2014 [notice] bright70 Started balancer for apache121 For details type events details 152727

The formathdfs CommandUsage

formathdfs ltHDFSgtThe formathdfs command formats an instance so that it can be reused

Existing Hadoop services for the instance are stopped first before format-ting HDFS and started again after formatting is complete

Example

[bright70-gthadoop] formathdfs apache121

Will now format and set up HDFS for instance rsquoapache121rsquo

Stopping datanodes done

Stopping namenodes done

Formatting HDFS done

Starting namenode (host rsquobright70rsquo) done

Starting datanodes done

Waiting for datanodes to come up done

Setting up HDFS done

[bright70-gthadoop]

The manualfailover CommandUsage

manualfailover [-f|--from ltNameNodegt] [-t|--to ltother Na-meNodegt] ltHDFSgt

The manualfailover command allows the active status of a Na-meNode to be moved to another NameNode in the Hadoop instance Thisis only available for Hadoop instances that support NameNode failoverThe active status can be move from andor to a NameNode

copy Bright Computing Inc

4Hadoop Services

41 Hadoop Services From cmgui

411 Configuring Hadoop Services From cmguiHadoop services can be configured on head or regular nodes

Configuration in cmgui can be carried out by selecting the node in-stance on which the service is to run and selecting the Hadoop tab ofthat node This then displays a list of possible Hadoop services withinsub-panes (figure 41)

Figure 41 Services In The Hadoop Tab Of A Node In cmgui

The services displayed are

bull Name Node Secondary Name Node

bull Data Node Zookeeper

bull Job Tracker Task Tracker

bull YARN Server YARN Client

bull HBase Server HBase Client

bull Journal

For each Hadoop service instance values can be edited via actions inthe associated pane

copy Bright Computing Inc

24 Hadoop Services

bull Selecting A Hadoop instance The name of the possible Hadoopinstances can be selected via a drop-down menu

bull Configuring multiple Hadoop instances More than one Hadoopinstance can run on the cluster Using the copy+ button adds an in-stance

bull Advanced options for instances The Advanced button allowsmany configuration options to be edited for each node instance foreach service This is illustrated by the example in figure 42 wherethe advanced configuration options of the Data Node service of thenode001 instance are shown

Figure 42 Advanced Configuration Options For The Data Node serviceIn cmgui

Typically care must be taken to use non-conflicting ports and direc-tories for each Hadoop instance

412 Monitoring And Modifying The Running Of HadoopServices From cmgui

The status of Hadoop services can be viewed and changed in the follow-ing locations

bull The Services tab of the node instance Within the Nodes or HeadNodes resource for a head or regular node instance the Servicestab can be selected The Hadoop services can then be viewed andaltered from within the display pane (figure 43)

Figure 43 Services In The Services Tab Of A Node In cmgui

copy Bright Computing Inc

41 Hadoop Services From cmgui 25

The options buttons act just like for any other service (section 311of the Administrator Manual) which means it includes the possibilityof acting on multiple service selections

bull The Tasks tab of the Hadoop instance Within the HadoopHDFS resource for a specific Hadoop instance the Tasks tab (sec-tion 313) conveniently allows all Hadoop daemons to be (re)startedand stopped directly via buttons

copy Bright Computing Inc

5Running Hadoop Jobs

51 Shakedown RunsThe cm-hadoop-testssh script is provided as part of Bright Clus-ter Managerrsquos cluster-tools package The administrator can use thescript to conveniently submit example jar files in the Hadoop installationto a job client of a Hadoop instance

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220
================================================================
Press [CTRL+C] to stop
================================================================
================================================================
start cleaning directories
================================================================
================================================================
clean directories done
================================================================
================================================================
start doing gen_test
================================================================
14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2
...
Job Counters
...
Map-Reduce Framework
...
org.apache.hadoop.examples.terasort.TeraGen$Counters
...
14/03/24 15:07:03 INFO terasort.TeraSort: starting
14/03/24 15:09:12 INFO terasort.TeraSort: done
================================================================================
gen_test done
================================================================================
================================================================================
start doing PI test
================================================================================
Working Directory = /user/root/bbp
...

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                      Value
------------------------------ ---------------------------------
Name                           Apache220
Capacity total                 2756GB
Capacity used                  7246MB
Capacity remaining             1641GB
Heap memory total              2807MB
Heap memory used               1521MB
Heap memory remaining          1287MB
Non-heap memory total          2581MB
Non-heap memory used           2519MB
Non-heap memory remaining      6155MB
Nodes available                3
Nodes dead                     0
Nodes decommissioned           0
Nodes decommission in progress 0
Total files                    72
Total blocks                   31
Missing blocks                 0
Under-replicated blocks        2
Scheduled replication blocks   0
Pending replication blocks     0
Block report average Time      59666
Applications running           1
Applications pending           0
Applications submitted         7
Applications completed         6
Applications failed            0
High availability              Yes (automatic failover disabled)
Federation setup               no

Role                                                           Node
-------------------------------------------------------------- -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper             node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.

The user fred can then submit a run of a pi value estimator from the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
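For example, a longer run, with 4 maps and 10000 samples per map (values chosen here only for illustration), takes more time but returns a closer estimate:

Example

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 10000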


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)
• Hive (section 6.2)
• Kafka (section 6.3)
• Pig (section 6.4)
• Spark (section 6.5)
• Sqoop (section 6.6)
• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables
• Garbage Collector, to delete files no longer needed
• Master, responsible for coordination
• Tracer, collecting traces about Accumulo operations
• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation, And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.
• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.
• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.
• The script assigns no roles to nodes.
• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).
• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.
• The secret string for the instance is a random string created by cm-accumulo-setup.
• A password for the root user must be specified.
• The Tracer service will use Accumulo user root to connect to Accumulo.
• The services for Garbage Collector, Master, Tracer, and Monitor are by default installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.
• A Tablet Server will be started on each DataNode.
• cm-accumulo-setup tries to build the native map library.
• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option:

-p <rootpass>

is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option:

-s <heapsize>

is not mandatory. If not set, a default value of 1GB is used.

The option:

--master <nodename>

is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005

Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005.
Found Hadoop instance 'hdfs1', release 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10
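The ingested table can then be inspected interactively with the Accumulo shell. A minimal sketch, assuming the same root password secret used in the submission above (shell output not shown):

Example

bash-4.1$ accumulo shell -u root -p secret
root@hdfs1> tables
root@hdfs1> scan -t sort
root@hdfs1> quit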

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client
• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation, And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/
– cm-hive-setup is run with the --conn option to specify the connector version to use.

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;

DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.


– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists. (A combined example session is sketched after this list.)

• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.
• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Hive configuration files are copied by the script to under /etc/hadoop/.
• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.
• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.
• The Metastore and HiveServer2 services are started up by the script.
• Validation tests are carried out by the script using hive and beeline.
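As referenced in the list above, the metastore database must be prepared beforehand. A sketch of a combined MySQL session, with metastore_hdfs1 as the database name and an illustrative password:

Example

[root@bright70 ~]# mysql -u root -p
mysql> GRANT ALL PRIVILEGES ON metastore_hdfs1.* TO 'hive'@'%' IDENTIFIED BY 'seCret';
mysql> FLUSH PRIVILEGES;
mysql> DROP DATABASE IF EXISTS metastore_hdfs1;
mysql> quit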

An Example Run With cm-hive-setup
Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <hivepass> \
--metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005

Hive release '1.1.0-bin'
Using MySQL server on active headnode.
Hive service will be run on node node005.
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client...
-- testing 'beeline' client...
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation, And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.
• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.
• The script assigns no roles to nodes.
• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.
• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/kafka_2.11-0.8.2.1.tgz

Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
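Basic Kafka operation can then be checked with the console tools shipped in the Kafka tarball. A minimal sketch, assuming a kafka module for the instance, analogous to the other project modules, and a ZooKeeper server on node001 (module name, node, and topic name are illustrative):

Example

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 --replication-factor 1 --partitions 1 --topic test
[root@bright70 ~]# kafka-topics.sh --list --zookeeper node001:2181
test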

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance hdfs1.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation, And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.
• cm-pig-setup installs Pig only on the active head node.
• The script assigns no roles to nodes.
• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.
• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/pig-0.14.0.tar.gz

Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
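As an illustration, a Pig Latin script such as the /tmp/smoke.pig used in the preceding batch run might contain a simple word count. The script contents and the input path shown here are hypothetical:

Example

[root@bright70 ~]# cat /tmp/smoke.pig
-- load lines of text, split them into words, and count each word
lines = LOAD '/user/root/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;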

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming
• SQL queries on structured distributed data, with Spark SQL
• processing with machine learning algorithms, using MLlib
• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation, And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.
• Spark can be installed in two different deployment modes: Standalone or YARN.
  – Standalone mode: this is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x. It is possible to force the Standalone mode deployment by using the additional option --standalone. When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance. The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.
  – YARN mode: this is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option. When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.
• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.
• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz

Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release 2.6.0
Spark will be installed in YARN (client/cluster) mode.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup In Standalone Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005

Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release 2.6.0
Spark will be installed to work in Standalone mode.
Spark Master service will be run on node node005.
Spark will use all DataNodes as WorkerNodes.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation, And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.
• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:
  – a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/
  – cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.
• The script assigns no roles to nodes.
• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.
• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup
Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz \
--conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005

Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005.
Found Hadoop instance 'hdfs1', release 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
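This manual gives no further Sqoop usage example. As an illustration only, a bulk import of a MySQL table into HDFS might look like the following, with the module name, database, table, user, and target directory all hypothetical:

Example

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://mysqlserver.cm.cluster/testdb \
--username fred -P --table pets --target-dir /user/fred/pets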


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm
• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"
• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"
• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation, And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.
• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than master can be specified by using the option --master, or its alias for this setup script, --nimbus.
• The script assigns no roles to nodes.
• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).
• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t apache-storm-0.9.4.tar.gz --nimbus node005

Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005.
Found Hadoop instance 'hdfs1', release 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080/. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2
470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list
...
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15
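Besides from the Storm UI, a topology can also be killed from the command line; for example, the WordCount2 topology submitted above:

Example

[root@bright70 ~]# storm kill WordCount2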

Page 12: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt 7

Example

bull ltnamenodedatadirsgtdata1data2ltnamenodedatadirsgt

bull ltdatanodedatadirsgtdata1data2data3data4ltdatanodedatadirsgt

Hadoop should then have the following dfsnamedir proper-ties added to it via the hdfs-sitexml configuration file For the pre-ceding tag pairs the property values should be set as follows

Example

bull dfsnamenodenamedir with valuesdata1hadoophdfsnamenodedata2hadoophdfsnamenode

bull dfsdatanodenamedir with valuesdata1hadoophdfsdatanodedata2hadoophdfsdatanodedata3hadoophdfsdatanodedata4hadoophdfsdatanode

An install run then displays output like the following

Example

-rw-r--r-- 1 root root 63851630 Feb 4 1513 hadoop-121targz

[rootbright70 ~] cm-hadoop-setup -c tmphadoop1confxml

Reading config from file rsquotmphadoop1confxmlrsquo done

Hadoop flavor rsquoApachersquo release rsquo121rsquo

Will now install Hadoop in cmsharedappshadoopApache121 and configure instance rsquoMyhadooprsquo

Hadoop distro being installed done

Zookeeper being installed done

HBase being installed done

Creating module file done

Configuring Hadoop instance on local filesystem and images done

Updating images

starting imageupdate for node rsquonode003rsquo started

starting imageupdate for node rsquonode002rsquo started

starting imageupdate for node rsquonode001rsquo started

starting imageupdate for node rsquonode004rsquo started

Waiting for imageupdate to finish done

Creating Hadoop instance in cmdaemon done

Formatting HDFS done

Waiting for datanodes to come up done

Setting up HDFS done

The Hadoop instance should now be running The name defined forit in the XML file should show up within cmgui in the Hadoop HDFSresource tree folder (figure 21)

copy Bright Computing Inc

8 Installing Hadoop

Figure 21 A Hadoop Instance In cmgui

The instance name is also displayed within cmshwhen the list com-mand is run in hadoop mode

Example

[rootbright70 ~] cmsh

[bright70] hadoop

[bright70-gthadoop] list

Name (key) Hadoop version Hadoop distribution Configuration directory

---------- -------------- ------------------- -----------------------

Myhadoop 121 Apache etchadoopMyhadoop

The instance can be removed as follows

Example

[rootbright70 ~] cm-hadoop-setup -u Myhadoop

Requested uninstall of Hadoop instance rsquoMyhadooprsquo

Uninstalling Hadoop instance rsquoMyhadooprsquo

Removing

etchadoopMyhadoop

varlibhadoopMyhadoop

varloghadoopMyhadoop

varrunhadoopMyhadoop

tmphadoopMyhadoop

etchadoopMyhadoopzookeeper

varlibzookeeperMyhadoop

varlogzookeeperMyhadoop

varrunzookeeperMyhadoop

etchadoopMyhadoophbase

varloghbaseMyhadoop

varrunhbaseMyhadoop

etcinitdhadoop-Myhadoop-Module file(s) deleted

Uninstall successfully completed

22 Ncurses Installation Of Hadoop Usingcm-hadoop-setup

Running cm-hadoop-setup without any options starts up an ncursesGUI (figure 22)

copy Bright Computing Inc

23 Avoiding Misconfigurations During Hadoop Installation 9

Figure 22 The cm-hadoop-setup Welcome Screen

This provides an interactive way to add and remove Hadoop instancesalong with HBase and Zookeeper components Some explanations of theitems being configured are given along the way In addition some minorvalidation checks are done and some options are restricted

The suggested default values will work Other values can be choseninstead of the defaults but some care in selection usually a good ideaThis is because Hadoop is a complex software which means that valuesother than the defaults can sometimes lead to unworkable configurations(section 23)

The ncurses installation results in an XML configuration file This filecan be used with the -c option of cm-hadoop-setup to carry out theinstallation

23 Avoiding Misconfigurations During HadoopInstallation

A misconfiguration can be defined as a configuration that works badly ornot at all

For Hadoop to work well the following common misconfigurationsshould normally be avoided Some of these result in warnings from BrightCluster Manager validation checks during configuration but can be over-ridden An override is useful for cases where the administrator wouldjust like to for example test some issues not related to scale or perfor-mance

231 NameNode Configuration ChoicesOne of the following NameNode configuration options must be chosenwhen Hadoop is installed The choice should be made with care becausechanging between the options after installation is not possible

Hadoop 1x NameNode Configuration Choicesbull NameNode can optionally have a SecondaryNameNode

SecondaryNameNode offloads metadata operations from Name-Node and also stores the metadata offline to some extent

It is not by any means a high availability solution While recovery

copy Bright Computing Inc

10 Installing Hadoop

from a failed head node is possible from SecondaryNameNode it isnot easy and it is not recommended or supported by Bright ClusterManager

Hadoop 2x NameNode Configuration Choicesbull NameNode and SecondaryNameNode can run as in Hadoop 1x

However the following configurations are also possible

bull NameNode HA with manual failover In this configuration Hadoophas NameNode1 and NameNode2 up at the same time with one ac-tive and one on standby Which NameNode is active and which ison standby is set manually by the administrator If one NameNodefails then failover must be executed manually Metadata changesare managed by ZooKeeper which relies on a quorum of Journal-Nodes The number of JournalNodes is therefore set to 3 5 7

bull NameNode HA with automatic failover As for the manual caseexcept that in this case ZooKeeper manages failover too WhichNameNode is active and which is on standby is therefore decidedautomatically

bull NameNode Federation In NameNode Fedaration the storage ofmetadata is split among several NameNodes each of which has acorresponding SecondaryNameNode Each pair takes care of a partof HDFS

In Bright Cluster Manager there are 4 NameNodes in a default Na-meNode federation

ndash user

ndash tmp

ndash staging

ndash hbase

User applications do not have to know this mapping This isbecause ViewFS on the client side maps the selected path tothe corresponding NameNode Thus for example hdfs -lstmpexample does not need to know that tmp is managed byanother NameNode

Cloudera advise against using NameNode Federation for produc-tion purposes at present due to its development status

24 Installing Hadoop With LustreThe Lustre filesystem has a client-server configuration

241 Lustre Internal Server InstallationThe procedure for installing a Lustre server varies

The Bright Cluster Manager Knowledge Base article describes aprocedure that uses Lustre sources from Whamcloud The article isat httpkbbrightcomputingcomfaqindexphpaction=artikelampcat=9ampid=176 and may be used as guidance

copy Bright Computing Inc

24 Installing Hadoop With Lustre 11

242 Lustre External Server InstallationLustre can also be configured so that the servers run external to BrightCluster Manager The Lustre Intel IEEL 2x version can be configured inthis manner

243 Lustre Client InstallationIt is preferred that the Lustre clients are installed on the head node aswell as on all the nodes that are to be Hadoop nodes The clients shouldbe configured to provide a Lustre mount on the nodes If the Lustre clientcannot be installed on the head node then Bright Cluster Manager hasthe following limitations during installation and maintenance

bull the head node cannot be used to run Hadoop services

bull end users cannot perform Hadoop operations such as job submis-sion on the head node Operations such as those should instead becarried out while logged in to one of the Hadoop nodes

In the remainder of this section a Lustre mount point of mntlustreis assumed but it can be set to any convenient directory mount point

The user IDs and group IDs of the Lustre server and clients shouldbe consistent It is quite likely that they differ when first set up The IDsshould be checked at least for the following users and groups

bull users hdfs mapred yarn hbase zookeeper

bull groups hadoop zookeeper hbase

If they do not match on the server and clients then they must be madeconsistent manually so that the UID and GID of the Lustre server usersare changed to match the UID and GID of the Bright Cluster Managerusers

Once consistency has been checked and readwrite access is workingto LustreFS the Hadoop integration can be configured

244 Lustre Hadoop ConfigurationLustre Hadoop XML Configuration File SetupHadoop integration can be configured by using the file cmlocalappscluster-toolshadoopconfhadoop2lustreconfxmlas a starting point for the configuration It can be copied over to forexample roothadooplustreconfigxml

The Intel Distribution for Hadoop (IDH) and Cloudera can both runwith Lustre under Bright Cluster Manager The configuration for thesecan be done as follows

bull IDH

ndash A subdirectory of mntlustre must be specified in thehadoop2lustreconfxml file within the ltafsgtltafsgt tagpair

Example

ltafsgt

ltfstypegtlustreltfstypegt

ltfsrootgtmntlustrehadoopltfsrootgt

copy Bright Computing Inc

12 Installing Hadoop

ltafsgt

bull Cloudera

ndash A subdirectory of mntlustre must be specified in thehadoop2lustreconfxml file within the ltafsgtltafsgt tagpair

ndash In addition an ltfsjargtltfsjargt tag pair must be specifiedmanually for the jar that the Intel IEEL 2x distribution pro-vides

Example

ltafsgt

ltfstypegtlustreltfstypegt

ltfsrootgtmntlustrehadoopltfsrootgt

ltfsjargtrootlustrehadoop-lustre-plugin-230jarltfsjargt

ltafsgt

The installation of the Lustre plugin is automatic if this jarname is set to the right name when the cm-hadoop-setupscript is run

Lustre Hadoop Installation With cm-hadoop-setupThe XML configuration file specifies how Lustre should be integrated inHadoop If the configuration file is at ltroothadooplustreconfigxmlgt then it can be run as

Example

cm-hadoop-setup -c ltroothadooplustreconfigxmlgt

As part of configuring Hadoop to use Lustre the execution will

bull Set the ACLs on the directory specified within the ltfsrootgtltfsrootgttag pair This was set to mntlustrehadoop earlier on as an ex-ample

bull Copy the Lustre plugin from its jar path as specified in the XML fileto the correct place on the client nodes

Specifically the subdirectory sharehadoopcommonlib iscopied into a directory relative to the Hadoop installation direc-tory For example the Cloudera version of Hadoop version 230-cdh512 has the Hadoop installation directory cmshareappshadoopClouder230-cdh512 The copy is therefore car-ried out in this case fromrootlustrehadoop-lustre-plugin-230jartocmsharedappshadoopCloudera230-cdh512sharehadoopcommonlib

copy Bright Computing Inc

25 Hadoop Installation In A Cloud 13

Lustre Hadoop Integration In cmsh and cmgui

In cmsh Lustre integration is indicated in hadoop mode

Example

[hadoop2-gthadoop] show hdfs1 | grep -i lustre

Hadoop root for Lustre mntlustrehadoop

Use Lustre yes

In cmgui the Overview tab in the items for the Hadoop HDFS re-source indicates if Lustre is running along with its mount point (fig-ure 23)

Figure 23 A Hadoop Instance With Lustre In cmgui

25 Hadoop Installation In A CloudHadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual) or a Clus-ter Extension configuration (Chapter 3 of the Cloudbursting Manual) Inboth cases the cloud nodes used should be at least m1medium

bull For Cluster-On-Demand the following considerations apply

ndash There are no specific issues After a stopstart cycle Hadooprecognizes the new IP addresses and refreshes the list of nodesaccordingly (section 241 of the Cloudbursting Manual)

bull For Cluster Extension the following considerations apply

ndash To install Hadoop on cloud nodes the XML configurationcmlocalappscluster-toolshadoopconfhadoop2clusterextensionconfxmlcan be used as a guide

copy Bright Computing Inc

14 Installing Hadoop

ndash In the hadoop2clusterextensionconfxml file the clouddirector that is to be used with the Hadoop cloud nodes mustbe specified by the administrator with the ltedgegtltedgegttag pair

Example

ltedgegt

lthostsgteu-west-1-directorlthostsgt

ltedgegt

Maintenance operations such as a format will automaticallyand transparently be carried out by cmdaemon running on thecloud director and not on the head node

There are some shortcomings as a result of relying upon the clouddirector

ndash Cloud nodes depend on the same cloud director

ndash While Hadoop installation (cm-hadoop-setup) is run on thehead node users must run Hadoop commandsmdashjob submis-sions and so onmdashfrom the director not from the head node

ndash It is not possible to mix cloud and non-cloud nodes for thesame Hadoop instance That is a local Hadoop instance can-not be extended by adding cloud nodes

copy Bright Computing Inc

3Viewing And Managing HDFS

From CMDaemon31 cmgui And The HDFS ResourceIn cmgui clicking on the resource folder for Hadoop in the resource treeopens up an Overview tab that displays a list of Hadoop HDFS instancesas in figure 21

Clicking on an instance brings up a tab from a series of tabs associatedwith that instance as in figure 31

Figure 31 Tabs For A Hadoop Instance In cmgui

The possible tabs are Overview (section 311) Settings (section 312)Tasks (section 313) and Notes (section 314) Various sub-panes aredisplayed in the possible tabs

copy Bright Computing Inc

16 Viewing And Managing HDFS From CMDaemon

311 The HDFS Instance Overview TabThe Overview tab pane (figure 32) arranges Hadoop-related items insub-panes for convenient viewing

Figure 32 Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed

bull Statistics In the first sub-pane row are statistics associated with theHadoop instance These include node data on the number of appli-cations that are running as well as memory and capacity usage

bull Metrics In the second sub-pane row a metric can be selected fordisplay as a graph

bull Roles In the third sub-pane row the roles associated with a nodeare displayed

312 The HDFS Instance Settings TabThe Settings tab pane (figure 33) displays the configuration of theHadoop instance

copy Bright Computing Inc

31 cmgui And The HDFS Resource 17

Figure 33 Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed

bull WebHDFS Allows HDFS to be modified via the web

bull Description An arbitrary user-defined description

bull Directory settings for the Hadoop instance Directories for config-uration temporary purposes and logging

bull Topology awareness setting Optimizes Hadoop for racks switchesor nothing when it comes to network topology

bull Replication factors Set the default and maximum replication al-lowed

bull Balancing settings Spread loads evenly across the instance

313 The HDFS Instance Tasks TabThe Tasks tab (figure 34) displays Hadoop-related tasks that the admin-istrator can execute

copy Bright Computing Inc

18 Viewing And Managing HDFS From CMDaemon

Figure 34 Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to

bull Restart start or stop all the Hadoop daemons

bull StartStop the Hadoop balancer

bull Format the Hadoop filesystem A warning is given before format-ting

bull Carry out a manual failover for the NameNode This feature meansmaking the other NameNode the active one and is available onlyfor Hadoop instances that support NameNode failover

314 The HDFS Instance Notes TabThe Notes tab is like a jotting pad and can be used by the administratorto write notes for the associated Hadoop instance

32 cmsh And hadoop ModeThe cmsh front end can be used instead of cmgui to display informationon Hadoop-related values and to carry out Hadoop-related tasks

321 The show And overview CommandsThe show CommandWithin hadoop mode the show command displays parameters that cor-respond mostly to cmguirsquos Settings tab in the Hadoop resource (sec-tion 312)

Example

[rootbright70 conf] cmsh

[bright70] hadoop

[bright70-gthadoop] list

Name (key) Hadoop version Hadoop distribution Configuration directory

---------- -------------- ------------------- -----------------------

Apache121 121 Apache etchadoopApache121

[bright70-gthadoop] show apache121

Parameter Value

copy Bright Computing Inc

32 cmsh And hadoop Mode 19

------------------------------------- ------------------------------------

Automatic failover Disabled

Balancer Threshold 10

Balancer period 2

Balancing policy dataNode

Cluster ID

Configuration directory etchadoopApache121

Configuration directory for HBase etchadoopApache121hbase

Configuration directory for ZooKeeper etchadoopApache121zookeeper

Creation time Fri 07 Feb 2014 170301 CET

Default replication factor 3

Enable HA for YARN no

Enable NFS gateway no

Enable Web HDFS no

HA enabled no

HA name service

HDFS Block size 67108864

HDFS Permission no

HDFS Umask 077

HDFS log directory varloghadoopApache121

Hadoop distribution Apache

Hadoop root

Hadoop version 121

Installation directory for HBase cmsharedappshadoopApachehbase-0+

Installation directory for ZooKeeper cmsharedappshadoopApachezookeepe+

Installation directory cmsharedappshadoopApache121

Maximum replication factor 512

Network

Revision

Root directory for data varlib

Temporary directory tmphadoopApache121

Topology Switch

Use HTTPS no

Use Lustre no

Use federation no

Use only HTTPS no

YARN automatic failover Hadoop

description installed from tmphadoop-121tar+

name Apache121

notes

The overview CommandSimilarly the overview command corresponds somewhat to thecmguirsquos Overview tab in the Hadoop resource (section 311) and pro-vides hadoop-related information on the system resources that are used

Example

[bright70-gthadoop] overview apache121

Parameter Value

----------------------------- ---------------------------

Name apache121

Capacity total 0B

Capacity used 0B

Capacity remaining 0B

Heap memory total 0B

copy Bright Computing Inc

20 Viewing And Managing HDFS From CMDaemon

Heap memory used 0B

Heap memory remaining 0B

Non-heap memory total 0B

Non-heap memory used 0B

Non-heap memory remaining 0B

Nodes available 0

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 0

Total blocks 0

Missing blocks 0

Under-replicated blocks 0

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 0

Applications running 0

Applications pending 0

Applications submitted 0

Applications completed 0

Applications failed 0

Federation setup no

Role Node

------------------------------ ---------------------------

DataNode ZooKeeper bright70node001node002

322 The Tasks CommandsWithin hadoopmode the following commands run tasks that correspondmostly to the Tasks tab (section 313) in the cmgui Hadoop resource

The services Commands For Hadoop ServicesHadoop services can be started stopped and restarted with

bull restartallservices

bull startallservices

bull stopallservices

Example

[bright70-gthadoop] restartallservices apache121

Will now stop all Hadoop services for instance rsquoapache121rsquo done

Will now start all Hadoop services for instance rsquoapache121rsquo done

[bright70-gthadoop]

The startbalancer And stopbalancer Commands For HadoopFor efficient access to HDFS file block level usage across nodes shouldbe reasonably balanced Hadoop can start and stop balancing across theinstance with the following commands

bull startbalancer

bull stopbalancer

The balancer policy threshold and period can be retrieved or set forthe instance using the parameters

copy Bright Computing Inc

32 cmsh And hadoop Mode 21

bull balancerperiod

bull balancerthreshold

bull balancingpolicy

Example

[bright70-gthadoop] get apache121 balancerperiod

2

[bright70-gthadoop] set apache121 balancerperiod 3

[bright70-gthadoop] commit

[bright70-gthadoop] startbalancer apache121

Code 0

Starting Hadoop balancer daemon (hadoop-apache121-balancer)starting balancer logging to varloghadoopapache121hdfshadoop-hdfs-balancer-bright70out

Time Stamp Iteration Bytes Moved Bytes To Move Bytes Being Moved

The cluster is balanced. Exiting...

[bright70->hadoop]

Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121. For details type: events details 152727

The formathdfs Command
Usage:

formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop] formathdfs apache121

Will now format and set up HDFS for instance 'apache121'

Stopping datanodes done

Stopping namenodes done

Formatting HDFS done

Starting namenode (host 'bright70') done

Starting datanodes done

Waiting for datanodes to come up done

Setting up HDFS done

[bright70->hadoop]

The manualfailover Command
Usage:

manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
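For example, the active status could be handed over as follows. This is a sketch rather than a recorded session: an HA-enabled instance hdfs1 with a standby NameNode on node002 is assumed:

[bright70->hadoop] manualfailover --to node002 hdfs1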

4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:

• Selecting A Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui
The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui

The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means it includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.

5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/

[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220

================================================================

Press [CTRL+C] to stop

================================================================

================================================================

start cleaning directories

================================================================

================================================================

clean directories done

================================================================

================================================================

start doing gen_test

================================================================

14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2

14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2

Job Counters

Map-Reduce Framework

org.apache.hadoop.examples.terasort.TeraGen$Counters

14/03/24 15:07:03 INFO terasort.TeraSort: starting

14/03/24 15:09:12 INFO terasort.TeraSort: done

================================================================================

gen_test done

================================================================================

================================================================================

start doing PI test

================================================================================

Working Directory = /user/root/bbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop] overview apache220

Parameter Value

------------------------------ ---------------------------------

Name Apache220

Capacity total 2756GB

Capacity used 7246MB

Capacity remaining 1641GB

Heap memory total 2807MB

Heap memory used 1521MB

Heap memory remaining 1287MB

Non-heap memory total 2581MB

Non-heap memory used 2519MB

Non-heap memory remaining 6155MB

Nodes available 3

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 72

Total blocks 31

Missing blocks 0

Under-replicated blocks 2

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 59666

Applications running 1

Applications pending 0

Applications submitted 7

Applications completed 6

Applications failed 0

High availability Yes (automatic failover disabled)

Federation setup no

Role Node

-------------------------------------------------------------- -------

DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper node001

DataNode, Journal, NameNode, YARNClient, ZooKeeper node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user, fred, can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]] set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.
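A sketch of unsetting the access in this way, with the session output omitted (setting an empty value is assumed to clear the property here):

[bright70->user[fred]] set hadoophdfsaccess ""; commit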

The user fred can then submit a run from a pi value estimator from the example jar file as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5

Job Finished in 19.732 seconds

Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
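For example, a hypothetical run with more maps and samples (runtimes and the exact estimate depend on the cluster) brings the estimate much closer to 3.14159:

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 16 100000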

6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).

6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use the Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are by default installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option -p <rootpass> is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.

The option -s <heapsize> is not mandatory. If not set, a default value of 1GB is used.

The option --master <nodename> is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005

Accumulo release '1.6.2'

Accumulo GC, Master, Monitor, and Tracer services will be run on node node005

Found Hadoop instance 'hdfs1', release 2.6.0

Accumulo being installed done

Creating directories for Accumulo done

Creating module file for Accumulo done

Creating configuration files for Accumulo done

Updating images done

Setting up Accumulo directories in HDFS done

Executing 'accumulo init' done

Initializing services for Accumulo (on DataNodes) done

Initializing master services for Accumulo done

Waiting for NameNode to be ready done

Executing validation test done

Installation successfully completed

Finished

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1

Requested removal of Accumulo for Hadoop instance 'hdfs1'

Stopping/removing services done

Removing module file done

Removing additional Accumulo directories done

Updating images done

Removal successfully completed

Finished

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo

bash-4.1$ module load accumulo/hdfs1

bash-4.1$ cd $ACCUMULO_HOME

bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 --maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10
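The ingested table can then be checked from the Accumulo shell, along the following lines. This is a sketch: the root password set during installation is assumed, and the scan output is omitted:

bash-4.1$ accumulo shell -u root -p secret -e "scan -t sort"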

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use.

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;

DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.

– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node.

It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script using hive and beeline.

An Example Run With cm-hive-setup
Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005

Hive release '1.1.0-bin'

Using MySQL server on active headnode

Hive service will be run on node node005

Using MySQL Connector/J installed in /usr/share/java/

Hive being installed done

Creating directories for Hive done

Creating module file for Hive done

Creating configuration files for Hive done

Initializing database 'metastore_hdfs1' in MySQL done

Waiting for NameNode to be ready done

Creating HDFS directories for Hive done

Updating images done

Waiting for NameNode to be ready done

Hive setup validation

-- testing 'hive' client

-- testing 'beeline' client

Hive setup validation done

Installation successfully completed

Finished

6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance 'hdfs1'

Stopping/removing services done

Removing module file done

Removing additional Hive directories done

Updating images done

Removal successfully completed

Finished

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 -d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'

Connecting to jdbc:hive2://node005.cm.cluster:10000

Connected to: Apache Hive (version 1.1.0)

Driver: Hive JDBC (version 1.1.0)

Transaction isolation: TRANSACTION_REPEATABLE_READ

+-----------+--+

| tab_name |

+-----------+--+

| test |

| test2 |

+-----------+--+

2 rows selected (0.243 seconds)

Beeline version 1.1.0 by Apache Hive

Closing: 0: jdbc:hive2://node005.cm.cluster:10000
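Other HiveQL statements can be submitted in the same way. A minimal sketch, assuming the same HiveServer2 endpoint as above; the table name is illustrative:

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 -d org.apache.hive.jdbc.HiveDriver -e 'CREATE TABLE test3 (id INT, name STRING);'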

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does
The following applies to using cm-kafka-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/kafka_2.11-0.8.2.1.tgz

Kafka release '0.8.2.1' for Scala '2.11'

Found Hadoop instance 'hdfs1', release 1.2.1

Kafka being installed done

Creating directories for Kafka done

Creating module file for Kafka done

Creating configuration files for Kafka done

Updating images done

Initializing services for Kafka (on ZooKeeper nodes) done

Executing validation test done

Installation successfully completed

Finished
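After installation, Kafka can be exercised from the command line with the scripts that ship with Kafka 0.8.x. The following is a sketch only: the module name, topic name, and ZooKeeper/broker hosts and ports are assumptions that depend on the local setup:

[root@bright70 ~]# module load kafka/hdfs1

[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 --replication-factor 1 --partitions 1 --topic test

[root@bright70 ~]# echo "hello" | kafka-console-producer.sh --broker-list node001:9092 --topic test

[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 --topic test --from-beginning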

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1

Requested removal of Kafka for Hadoop instance hdfs1

Stopping/removing services done

Removing module file done

Removing additional Kafka directories done

Updating images done

Removal successfully completed

Finished

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.

Prerequisites For Pig Installation And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/pig-0.14.0.tar.gz

Pig release '0.14.0'

Pig being installed done

Creating directories for Pig done

Creating module file for Pig done

Creating configuration files for Pig done

Waiting for NameNode to be ready

Waiting for NameNode to be ready done

Validating Pig setup

Validating Pig setup done

Installation successfully completed

Finished

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1

Requested removal of Pig for Hadoop instance 'hdfs1'

Stopping/removing services done

Removing module file done

Removing additional Pig directories done

Updating images done

Removal successfully completed

Finished

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
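As an illustration of batch mode, a minimal word-count script can be created and run as follows. The script content and the input path are assumptions for this sketch, and are not the contents of the smoke.pig script used above:

[root@bright70 ~]# cat > /tmp/wordcount.pig <<'EOF'
-- load a text file from HDFS, split each line into words, and count them
lines  = LOAD '/user/root/input.txt' USING TextLoader() AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
DUMP counts;
EOF
[root@bright70 ~]# pig -f /tmp/wordcount.pig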

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with graphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.

The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release 2.6.0

Spark will be installed in YARN (client/cluster) mode

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Waiting for NameNode to be ready done

Copying Spark assembly jar to HDFS done

Waiting for NameNode to be ready done

Validating Spark setup done

Installation successfully completed

Finished

An Example Run With cm-spark-setup in standalone mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release 2.6.0

Spark will be installed to work in Standalone mode

Spark Master service will be run on node node005

Spark will use all DataNodes as WorkerNodes

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Updating images done

Initializing Spark Master service done

Initializing Spark Worker service done

Validating Spark setup done

Installation successfully completed

Finished

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance 'hdfs1'

Stopping/removing services done

Removing module file done

Removing additional Spark directories done

Removal successfully completed

Finished

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of Application Master.

[root@bright70 ~]# module load spark/hdfs1

[root@bright70 ~]# spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1

[root@bright70 ~]# spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar
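In yarn-cluster mode, the Pi estimate is printed by the driver inside the Application Master, so it appears in the application logs rather than on the terminal. With YARN log aggregation enabled, it can be retrieved afterwards along the following lines; the application ID shown is illustrative:

[root@bright70 ~]# yarn logs -applicationId application_1428494279779_0012 | grep "Pi is"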

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop: it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.

Prerequisites For Sqoop Installation And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup
Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005

Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar

Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'

Sqoop service will be run on node node005

Found Hadoop instance 'hdfs1', release 2.2.0

Sqoop being installed done

Creating directories for Sqoop done

Creating module file for Sqoop done

Creating configuration files for Sqoop done

Updating images done

Installation successfully completed

Finished
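Once installed, a MySQL table can be imported into HDFS with the sqoop import tool. A minimal sketch, run on the node where Sqoop was installed in the preceding example; the module name, database, credentials, and table are placeholders for the local setup:

[root@node005 ~]# module load sqoop/hdfs1

[root@node005 ~]# sqoop import --connect jdbc:mysql://bright70.cm.cluster/<database> --username <user> -P --table <table> --target-dir /user/root/<table>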

6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance 'hdfs1'

Stopping/removing services done

Removing module file done

Removing additional Sqoop directories done

Updating images done

Removal successfully completed

Finished

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• Validation tests are carried out by the script.

An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t apache-storm-0.9.4.tar.gz --nimbus node005

Storm release '0.9.4'

Storm Nimbus and UI services will be run on node node005

Found Hadoop instance 'hdfs1', release 2.2.0

Storm being installed done

Creating directories for Storm done

Creating module file for Storm done

Creating configuration files for Storm done

Updating images done

Initializing worker services for Storm (on DataNodes) done

Initializing Nimbus services for Storm done

Executing validation test done

Installation successfully completed

Finished

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance 'hdfs1'

Stopping/removing services done

Removing module file done

Removing additional Storm directories done

Updating images done

Removal successfully completed

Finished

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1

[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2

470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...

476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

[==================================================] 3248859 / 3248859

File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}

687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15
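A running topology can be stopped with the storm kill command, for example as follows (Storm deactivates the topology and removes it after a delay):

[root@hadoopdev ~]# storm kill WordCount2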

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 13: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

8 Installing Hadoop

Figure 21 A Hadoop Instance In cmgui

The instance name is also displayed within cmshwhen the list com-mand is run in hadoop mode

Example

[rootbright70 ~] cmsh

[bright70] hadoop

[bright70-gthadoop] list

Name (key) Hadoop version Hadoop distribution Configuration directory

---------- -------------- ------------------- -----------------------

Myhadoop 121 Apache etchadoopMyhadoop

The instance can be removed as follows

Example

[rootbright70 ~] cm-hadoop-setup -u Myhadoop

Requested uninstall of Hadoop instance rsquoMyhadooprsquo

Uninstalling Hadoop instance rsquoMyhadooprsquo

Removing

etchadoopMyhadoop

varlibhadoopMyhadoop

varloghadoopMyhadoop

varrunhadoopMyhadoop

tmphadoopMyhadoop

etchadoopMyhadoopzookeeper

varlibzookeeperMyhadoop

varlogzookeeperMyhadoop

varrunzookeeperMyhadoop

etchadoopMyhadoophbase

varloghbaseMyhadoop

varrunhbaseMyhadoop

etcinitdhadoop-Myhadoop-Module file(s) deleted

Uninstall successfully completed

22 Ncurses Installation Of Hadoop Usingcm-hadoop-setup

Running cm-hadoop-setup without any options starts up an ncursesGUI (figure 22)

copy Bright Computing Inc

23 Avoiding Misconfigurations During Hadoop Installation 9

Figure 22 The cm-hadoop-setup Welcome Screen

This provides an interactive way to add and remove Hadoop instancesalong with HBase and Zookeeper components Some explanations of theitems being configured are given along the way In addition some minorvalidation checks are done and some options are restricted

The suggested default values will work Other values can be choseninstead of the defaults but some care in selection usually a good ideaThis is because Hadoop is a complex software which means that valuesother than the defaults can sometimes lead to unworkable configurations(section 23)

The ncurses installation results in an XML configuration file This filecan be used with the -c option of cm-hadoop-setup to carry out theinstallation

23 Avoiding Misconfigurations During HadoopInstallation

A misconfiguration can be defined as a configuration that works badly ornot at all

For Hadoop to work well the following common misconfigurationsshould normally be avoided Some of these result in warnings from BrightCluster Manager validation checks during configuration but can be over-ridden An override is useful for cases where the administrator wouldjust like to for example test some issues not related to scale or perfor-mance

231 NameNode Configuration ChoicesOne of the following NameNode configuration options must be chosenwhen Hadoop is installed The choice should be made with care becausechanging between the options after installation is not possible

Hadoop 1x NameNode Configuration Choicesbull NameNode can optionally have a SecondaryNameNode

SecondaryNameNode offloads metadata operations from Name-Node and also stores the metadata offline to some extent

It is not by any means a high availability solution While recovery

copy Bright Computing Inc

10 Installing Hadoop

from a failed head node is possible from SecondaryNameNode it isnot easy and it is not recommended or supported by Bright ClusterManager

Hadoop 2x NameNode Configuration Choicesbull NameNode and SecondaryNameNode can run as in Hadoop 1x

However the following configurations are also possible

bull NameNode HA with manual failover In this configuration Hadoophas NameNode1 and NameNode2 up at the same time with one ac-tive and one on standby Which NameNode is active and which ison standby is set manually by the administrator If one NameNodefails then failover must be executed manually Metadata changesare managed by ZooKeeper which relies on a quorum of Journal-Nodes The number of JournalNodes is therefore set to 3 5 7

bull NameNode HA with automatic failover As for the manual caseexcept that in this case ZooKeeper manages failover too WhichNameNode is active and which is on standby is therefore decidedautomatically

bull NameNode Federation In NameNode Fedaration the storage ofmetadata is split among several NameNodes each of which has acorresponding SecondaryNameNode Each pair takes care of a partof HDFS

In Bright Cluster Manager there are 4 NameNodes in a default Na-meNode federation

ndash user

ndash tmp

ndash staging

ndash hbase

User applications do not have to know this mapping This isbecause ViewFS on the client side maps the selected path tothe corresponding NameNode Thus for example hdfs -lstmpexample does not need to know that tmp is managed byanother NameNode

Cloudera advise against using NameNode Federation for produc-tion purposes at present due to its development status

24 Installing Hadoop With LustreThe Lustre filesystem has a client-server configuration

241 Lustre Internal Server InstallationThe procedure for installing a Lustre server varies

The Bright Cluster Manager Knowledge Base article describes aprocedure that uses Lustre sources from Whamcloud The article isat httpkbbrightcomputingcomfaqindexphpaction=artikelampcat=9ampid=176 and may be used as guidance

copy Bright Computing Inc

24 Installing Hadoop With Lustre 11

242 Lustre External Server InstallationLustre can also be configured so that the servers run external to BrightCluster Manager The Lustre Intel IEEL 2x version can be configured inthis manner

243 Lustre Client InstallationIt is preferred that the Lustre clients are installed on the head node aswell as on all the nodes that are to be Hadoop nodes The clients shouldbe configured to provide a Lustre mount on the nodes If the Lustre clientcannot be installed on the head node then Bright Cluster Manager hasthe following limitations during installation and maintenance

bull the head node cannot be used to run Hadoop services

bull end users cannot perform Hadoop operations such as job submis-sion on the head node Operations such as those should instead becarried out while logged in to one of the Hadoop nodes

In the remainder of this section a Lustre mount point of mntlustreis assumed but it can be set to any convenient directory mount point

The user IDs and group IDs of the Lustre server and clients shouldbe consistent It is quite likely that they differ when first set up The IDsshould be checked at least for the following users and groups

bull users hdfs mapred yarn hbase zookeeper

bull groups hadoop zookeeper hbase

If they do not match on the server and clients then they must be madeconsistent manually so that the UID and GID of the Lustre server usersare changed to match the UID and GID of the Bright Cluster Managerusers

Once consistency has been checked and readwrite access is workingto LustreFS the Hadoop integration can be configured

244 Lustre Hadoop ConfigurationLustre Hadoop XML Configuration File SetupHadoop integration can be configured by using the file cmlocalappscluster-toolshadoopconfhadoop2lustreconfxmlas a starting point for the configuration It can be copied over to forexample roothadooplustreconfigxml

The Intel Distribution for Hadoop (IDH) and Cloudera can both runwith Lustre under Bright Cluster Manager The configuration for thesecan be done as follows

bull IDH

ndash A subdirectory of mntlustre must be specified in thehadoop2lustreconfxml file within the ltafsgtltafsgt tagpair

Example

ltafsgt

ltfstypegtlustreltfstypegt

ltfsrootgtmntlustrehadoopltfsrootgt

copy Bright Computing Inc

12 Installing Hadoop

ltafsgt

bull Cloudera

ndash A subdirectory of mntlustre must be specified in thehadoop2lustreconfxml file within the ltafsgtltafsgt tagpair

ndash In addition an ltfsjargtltfsjargt tag pair must be specifiedmanually for the jar that the Intel IEEL 2x distribution pro-vides

Example

ltafsgt

ltfstypegtlustreltfstypegt

ltfsrootgtmntlustrehadoopltfsrootgt

ltfsjargtrootlustrehadoop-lustre-plugin-230jarltfsjargt

ltafsgt

The installation of the Lustre plugin is automatic if this jarname is set to the right name when the cm-hadoop-setupscript is run

Lustre Hadoop Installation With cm-hadoop-setupThe XML configuration file specifies how Lustre should be integrated inHadoop If the configuration file is at ltroothadooplustreconfigxmlgt then it can be run as

Example

cm-hadoop-setup -c ltroothadooplustreconfigxmlgt

As part of configuring Hadoop to use Lustre the execution will

bull Set the ACLs on the directory specified within the ltfsrootgtltfsrootgttag pair This was set to mntlustrehadoop earlier on as an ex-ample

bull Copy the Lustre plugin from its jar path as specified in the XML fileto the correct place on the client nodes

Specifically the subdirectory sharehadoopcommonlib iscopied into a directory relative to the Hadoop installation direc-tory For example the Cloudera version of Hadoop version 230-cdh512 has the Hadoop installation directory cmshareappshadoopClouder230-cdh512 The copy is therefore car-ried out in this case fromrootlustrehadoop-lustre-plugin-230jartocmsharedappshadoopCloudera230-cdh512sharehadoopcommonlib

copy Bright Computing Inc

25 Hadoop Installation In A Cloud 13

Lustre Hadoop Integration In cmsh and cmgui

In cmsh Lustre integration is indicated in hadoop mode

Example

[hadoop2-gthadoop] show hdfs1 | grep -i lustre

Hadoop root for Lustre mntlustrehadoop

Use Lustre yes

In cmgui the Overview tab in the items for the Hadoop HDFS re-source indicates if Lustre is running along with its mount point (fig-ure 23)

Figure 23 A Hadoop Instance With Lustre In cmgui

25 Hadoop Installation In A CloudHadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual) or a Clus-ter Extension configuration (Chapter 3 of the Cloudbursting Manual) Inboth cases the cloud nodes used should be at least m1medium

bull For Cluster-On-Demand the following considerations apply

ndash There are no specific issues After a stopstart cycle Hadooprecognizes the new IP addresses and refreshes the list of nodesaccordingly (section 241 of the Cloudbursting Manual)

bull For Cluster Extension the following considerations apply

ndash To install Hadoop on cloud nodes the XML configurationcmlocalappscluster-toolshadoopconfhadoop2clusterextensionconfxmlcan be used as a guide

copy Bright Computing Inc

14 Installing Hadoop

ndash In the hadoop2clusterextensionconfxml file the clouddirector that is to be used with the Hadoop cloud nodes mustbe specified by the administrator with the ltedgegtltedgegttag pair

Example

ltedgegt

lthostsgteu-west-1-directorlthostsgt

ltedgegt

Maintenance operations such as a format will automaticallyand transparently be carried out by cmdaemon running on thecloud director and not on the head node

There are some shortcomings as a result of relying upon the clouddirector

ndash Cloud nodes depend on the same cloud director

ndash While Hadoop installation (cm-hadoop-setup) is run on thehead node users must run Hadoop commandsmdashjob submis-sions and so onmdashfrom the director not from the head node

ndash It is not possible to mix cloud and non-cloud nodes for thesame Hadoop instance That is a local Hadoop instance can-not be extended by adding cloud nodes

copy Bright Computing Inc

3Viewing And Managing HDFS

From CMDaemon31 cmgui And The HDFS ResourceIn cmgui clicking on the resource folder for Hadoop in the resource treeopens up an Overview tab that displays a list of Hadoop HDFS instancesas in figure 21

Clicking on an instance brings up a tab from a series of tabs associatedwith that instance as in figure 31

Figure 31 Tabs For A Hadoop Instance In cmgui

The possible tabs are Overview (section 311) Settings (section 312)Tasks (section 313) and Notes (section 314) Various sub-panes aredisplayed in the possible tabs

copy Bright Computing Inc

16 Viewing And Managing HDFS From CMDaemon

311 The HDFS Instance Overview TabThe Overview tab pane (figure 32) arranges Hadoop-related items insub-panes for convenient viewing

Figure 32 Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed

bull Statistics In the first sub-pane row are statistics associated with theHadoop instance These include node data on the number of appli-cations that are running as well as memory and capacity usage

bull Metrics In the second sub-pane row a metric can be selected fordisplay as a graph

bull Roles In the third sub-pane row the roles associated with a nodeare displayed

312 The HDFS Instance Settings TabThe Settings tab pane (figure 33) displays the configuration of theHadoop instance

copy Bright Computing Inc

31 cmgui And The HDFS Resource 17

Figure 33 Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed

bull WebHDFS Allows HDFS to be modified via the web

bull Description An arbitrary user-defined description

bull Directory settings for the Hadoop instance Directories for config-uration temporary purposes and logging

bull Topology awareness setting Optimizes Hadoop for racks switchesor nothing when it comes to network topology

bull Replication factors Set the default and maximum replication al-lowed

bull Balancing settings Spread loads evenly across the instance

313 The HDFS Instance Tasks TabThe Tasks tab (figure 34) displays Hadoop-related tasks that the admin-istrator can execute

copy Bright Computing Inc

18 Viewing And Managing HDFS From CMDaemon

Figure 34 Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to

bull Restart start or stop all the Hadoop daemons

bull StartStop the Hadoop balancer

bull Format the Hadoop filesystem A warning is given before format-ting

bull Carry out a manual failover for the NameNode This feature meansmaking the other NameNode the active one and is available onlyfor Hadoop instances that support NameNode failover

314 The HDFS Instance Notes TabThe Notes tab is like a jotting pad and can be used by the administratorto write notes for the associated Hadoop instance

32 cmsh And hadoop ModeThe cmsh front end can be used instead of cmgui to display informationon Hadoop-related values and to carry out Hadoop-related tasks

321 The show And overview CommandsThe show CommandWithin hadoop mode the show command displays parameters that cor-respond mostly to cmguirsquos Settings tab in the Hadoop resource (sec-tion 312)

Example

[rootbright70 conf] cmsh

[bright70] hadoop

[bright70-gthadoop] list

Name (key) Hadoop version Hadoop distribution Configuration directory

---------- -------------- ------------------- -----------------------

Apache121 121 Apache etchadoopApache121

[bright70-gthadoop] show apache121

Parameter Value

copy Bright Computing Inc

32 cmsh And hadoop Mode 19

------------------------------------- ------------------------------------

Automatic failover Disabled

Balancer Threshold 10

Balancer period 2

Balancing policy dataNode

Cluster ID

Configuration directory etchadoopApache121

Configuration directory for HBase etchadoopApache121hbase

Configuration directory for ZooKeeper etchadoopApache121zookeeper

Creation time Fri 07 Feb 2014 170301 CET

Default replication factor 3

Enable HA for YARN no

Enable NFS gateway no

Enable Web HDFS no

HA enabled no

HA name service

HDFS Block size 67108864

HDFS Permission no

HDFS Umask 077

HDFS log directory varloghadoopApache121

Hadoop distribution Apache

Hadoop root

Hadoop version 121

Installation directory for HBase cmsharedappshadoopApachehbase-0+

Installation directory for ZooKeeper cmsharedappshadoopApachezookeepe+

Installation directory cmsharedappshadoopApache121

Maximum replication factor 512

Network

Revision

Root directory for data varlib

Temporary directory tmphadoopApache121

Topology Switch

Use HTTPS no

Use Lustre no

Use federation no

Use only HTTPS no

YARN automatic failover Hadoop

description installed from tmphadoop-121tar+

name Apache121

notes

The overview CommandSimilarly the overview command corresponds somewhat to thecmguirsquos Overview tab in the Hadoop resource (section 311) and pro-vides hadoop-related information on the system resources that are used

Example

[bright70->hadoop]% overview apache121
Parameter                       Value
------------------------------- ---------------------------
Name                            apache121
Capacity total                  0B
Capacity used                   0B
Capacity remaining              0B
Heap memory total               0B
Heap memory used                0B
Heap memory remaining           0B
Non-heap memory total           0B
Non-heap memory used            0B
Non-heap memory remaining       0B
Nodes available                 0
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     0
Total blocks                    0
Missing blocks                  0
Under-replicated blocks         0
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       0
Applications running            0
Applications pending            0
Applications submitted          0
Applications completed          0
Applications failed             0
Federation setup                no

Role                            Node
------------------------------- ---------------------------
DataNode, ZooKeeper             bright70,node001,node002

3.2.2 The Tasks Commands
Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The services Commands For Hadoop Services
Hadoop services can be started, stopped, and restarted with:

• restartallservices

• startallservices

• stopallservices

Example

[bright70->hadoop]% restartallservices apache121
Will now stop all Hadoop services for instance 'apache121'... done.
Will now start all Hadoop services for instance 'apache121'... done.
[bright70->hadoop]%

The startbalancer And stopbalancer Commands For Hadoop
For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer

• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:


• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp  Iteration  Bytes Moved  Bytes To Move  Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%

Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121
For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
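
For example, a minimal sketch of a run, for a hypothetical HA-enabled instance named hdfs1, moving the active status from a NameNode on node001 to one on node002 (the node names are illustrative):

Example

[bright70->hadoop]% manualfailover --from node001 --to node002 hdfs1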


4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service instance, values can be edited via actions in the associated pane:


• Selecting a Hadoop instance: the name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: more than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: the Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui

The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


The options buttons act just as they do for any other service (section 3.11 of the Administrator Manual), which means this includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.


5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, repeatedly running several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220
================================================================
Press [CTRL+C] to stop
================================================================
================================================================
start cleaning directories
================================================================
================================================================
clean directories done
================================================================
================================================================
start doing gen_test
================================================================
14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2
...
Job Counters
...
Map-Reduce Framework
...
org.apache.hadoop.examples.terasort.TeraGen$Counters
...
14/03/24 15:07:03 INFO terasort.TeraSort: starting
...
14/03/24 15:09:12 INFO terasort.TeraSort: done
================================================================================
gen_test done
================================================================================
================================================================================
start doing PI test
================================================================================
Working Directory = /user/root/bbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                       Value
------------------------------- ---------------------------------
Name                            Apache220
Capacity total                  2756GB
Capacity used                   7246MB
Capacity remaining              1641GB
Heap memory total               2807MB
Heap memory used                1521MB
Heap memory remaining           1287MB
Non-heap memory total           2581MB
Non-heap memory used            2519MB
Non-heap memory remaining       6155MB
Nodes available                 3
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     72
Total blocks                    31
Missing blocks                  0
Under-replicated blocks         2
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       59666
Applications running            1
Applications pending            0
Applications submitted          7
Applications completed          6
Applications failed             0
High availability               Yes (automatic failover disabled)
Federation setup                no

Role                                                            Node
--------------------------------------------------------------- -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper  node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper              node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.
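
The following is a minimal cmsh sketch, assuming the same apache220 instance, of checking the current value and then unsetting it by supplying a blank value (the exact unset idiom is a sketch; tab-completion shows the accepted forms):

Example

[bright70->user[fred]]% get hadoophdfsaccess
apache220
[bright70->user[fred]]% set hadoophdfsaccess ""; commit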

The user fred can then submit a run of a pi value estimator from the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options (1 and 5 in the example). The result can be improved with greater values for both.
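
For example, a run with more maps and more samples (the values here are chosen purely for illustration) takes longer, but yields a better estimate:

Example

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 16 100000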


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files that are no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation, And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are, by default, installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option -p <rootpass> is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option -s <heapsize> is not mandatory. If not set, a default value of 1GB is used.

The option --master <nodename> is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005

Accumulo release '1.6.2'
Accumulo GC, Master, Monitor and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10
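
Once the ingest job has completed, the result can be inspected from the Accumulo shell. A minimal sketch, reusing the credentials and table name from the example above, scans the sort table:

Example

bash-4.1$ accumulo shell -u root -p secret -e "scan -t sort"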

6.2 Hive
Apache Hive is a data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation, And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

  – a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

  – cm-hive-setup is run with the --conn option to specify the connector version to use.

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;

DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

  – <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.


  – <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

  – The DROP line is needed only if a database with that name already exists.
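
As an illustration, these statements can be run non-interactively with the mysql client. The following is a minimal sketch, assuming the metastore database name metastore_hdfs1 (the name that also appears in the example run further on), and prompting for the MySQL root password:

Example

[root@bright70 ~]# mysql -u root -p -e "GRANT ALL PRIVILEGES ON metastore_hdfs1.* \
TO 'hive'@'%' IDENTIFIED BY '<hivepass>'; FLUSH PRIVILEGES;"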

• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script using hive and beeline.

An Example Run With cm-hive-setup
Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005

Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client
-- testing 'beeline' client
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation, And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/kafka_2.11-0.8.2.1.tgz

Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
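
A quick functional check of the installation can be carried out with the command-line tools shipped with Kafka. The following is a minimal sketch, assuming the module name kafka/hdfs1, a ZooKeeper service on node001 at the default port 2181, and an arbitrary topic name:

Example

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 \
--replication-factor 1 --partitions 1 --topic smoketest
[root@bright70 ~]# kafka-topics.sh --list --zookeeper node001:2181
smoketest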

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance hdfs1.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation, And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/pig-0.14.0.tar.gz

Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
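
The contents of a batch script such as /tmp/smoke.pig are ordinary Pig Latin. The following is a minimal sketch, with a hypothetical input file and field layout, written as a shell here-document so that it can be pasted and run directly:

Example

[root@bright70 ~]# cat > /tmp/smoke.pig <<'EOF'
-- load a ':'-separated text file that was previously copied into HDFS
users = LOAD '/user/root/passwd' USING PigStorage(':');
-- keep only the first field of each record
names = FOREACH users GENERATE $0 AS name;
DUMP names;
EOF
[root@bright70 ~]# pig -v -f /tmp/smoke.pig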

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing with Spark Streaming

• SQL queries on structured distributed data with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation, And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

  – Standalone mode: this is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

    It is possible to force the Standalone mode deployment by using the additional option --standalone.

    When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.

    The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

  – YARN mode: this is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

    When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz

Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release 2.6.0
Spark will be installed in YARN (client/cluster) mode
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup in standalone mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005

Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release 2.6.0
Spark will be installed to work in Standalone mode
Spark Master service will be run on node node005
Spark will use all DataNodes as WorkerNodes
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master:

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster:

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation, And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

  – a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

  – cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup
Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz \
--conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005

Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
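
After installation, a bulk import from MySQL into HDFS can be run with the sqoop import tool. The following is a minimal sketch, assuming the module name sqoop/hdfs1, and a hypothetical MySQL database mydb, user sqoopuser, and table employees (the -P option prompts for the database password):

Example

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://node005.cm.cluster/mydb \
--username sqoopuser -P --table employees --target-dir /user/root/employees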


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation, And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.
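
The validation topology can also be checked and removed from the command line, instead of from the Storm UI. A minimal sketch, assuming the module name storm/hdfs1:

Example

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm list
[root@bright70 ~]# storm kill WordCount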

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2
...
470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology WordCount2

[root@hadoopdev ~]# storm list
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 14: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

23 Avoiding Misconfigurations During Hadoop Installation 9

Figure 22 The cm-hadoop-setup Welcome Screen

This provides an interactive way to add and remove Hadoop instancesalong with HBase and Zookeeper components Some explanations of theitems being configured are given along the way In addition some minorvalidation checks are done and some options are restricted

The suggested default values will work Other values can be choseninstead of the defaults but some care in selection usually a good ideaThis is because Hadoop is a complex software which means that valuesother than the defaults can sometimes lead to unworkable configurations(section 23)

The ncurses installation results in an XML configuration file This filecan be used with the -c option of cm-hadoop-setup to carry out theinstallation

23 Avoiding Misconfigurations During HadoopInstallation

A misconfiguration can be defined as a configuration that works badly ornot at all

For Hadoop to work well the following common misconfigurationsshould normally be avoided Some of these result in warnings from BrightCluster Manager validation checks during configuration but can be over-ridden An override is useful for cases where the administrator wouldjust like to for example test some issues not related to scale or perfor-mance

231 NameNode Configuration ChoicesOne of the following NameNode configuration options must be chosenwhen Hadoop is installed The choice should be made with care becausechanging between the options after installation is not possible

Hadoop 1x NameNode Configuration Choicesbull NameNode can optionally have a SecondaryNameNode

SecondaryNameNode offloads metadata operations from Name-Node and also stores the metadata offline to some extent

It is not by any means a high availability solution While recovery

copy Bright Computing Inc

10 Installing Hadoop

from a failed head node is possible from SecondaryNameNode it isnot easy and it is not recommended or supported by Bright ClusterManager

Hadoop 2x NameNode Configuration Choicesbull NameNode and SecondaryNameNode can run as in Hadoop 1x

However the following configurations are also possible

bull NameNode HA with manual failover In this configuration Hadoophas NameNode1 and NameNode2 up at the same time with one ac-tive and one on standby Which NameNode is active and which ison standby is set manually by the administrator If one NameNodefails then failover must be executed manually Metadata changesare managed by ZooKeeper which relies on a quorum of Journal-Nodes The number of JournalNodes is therefore set to 3 5 7

bull NameNode HA with automatic failover As for the manual caseexcept that in this case ZooKeeper manages failover too WhichNameNode is active and which is on standby is therefore decidedautomatically

bull NameNode Federation In NameNode Fedaration the storage ofmetadata is split among several NameNodes each of which has acorresponding SecondaryNameNode Each pair takes care of a partof HDFS

In Bright Cluster Manager there are 4 NameNodes in a default Na-meNode federation

ndash user

ndash tmp

ndash staging

ndash hbase

User applications do not have to know this mapping This isbecause ViewFS on the client side maps the selected path tothe corresponding NameNode Thus for example hdfs -lstmpexample does not need to know that tmp is managed byanother NameNode

Cloudera advise against using NameNode Federation for produc-tion purposes at present due to its development status

24 Installing Hadoop With LustreThe Lustre filesystem has a client-server configuration

241 Lustre Internal Server InstallationThe procedure for installing a Lustre server varies

The Bright Cluster Manager Knowledge Base article describes aprocedure that uses Lustre sources from Whamcloud The article isat httpkbbrightcomputingcomfaqindexphpaction=artikelampcat=9ampid=176 and may be used as guidance

copy Bright Computing Inc

24 Installing Hadoop With Lustre 11

242 Lustre External Server InstallationLustre can also be configured so that the servers run external to BrightCluster Manager The Lustre Intel IEEL 2x version can be configured inthis manner

243 Lustre Client InstallationIt is preferred that the Lustre clients are installed on the head node aswell as on all the nodes that are to be Hadoop nodes The clients shouldbe configured to provide a Lustre mount on the nodes If the Lustre clientcannot be installed on the head node then Bright Cluster Manager hasthe following limitations during installation and maintenance

bull the head node cannot be used to run Hadoop services

bull end users cannot perform Hadoop operations such as job submis-sion on the head node Operations such as those should instead becarried out while logged in to one of the Hadoop nodes

In the remainder of this section a Lustre mount point of mntlustreis assumed but it can be set to any convenient directory mount point

The user IDs and group IDs of the Lustre server and clients shouldbe consistent It is quite likely that they differ when first set up The IDsshould be checked at least for the following users and groups

bull users hdfs mapred yarn hbase zookeeper

bull groups hadoop zookeeper hbase

If they do not match on the server and clients then they must be madeconsistent manually so that the UID and GID of the Lustre server usersare changed to match the UID and GID of the Bright Cluster Managerusers

Once consistency has been checked and readwrite access is workingto LustreFS the Hadoop integration can be configured

244 Lustre Hadoop ConfigurationLustre Hadoop XML Configuration File SetupHadoop integration can be configured by using the file cmlocalappscluster-toolshadoopconfhadoop2lustreconfxmlas a starting point for the configuration It can be copied over to forexample roothadooplustreconfigxml

The Intel Distribution for Hadoop (IDH) and Cloudera can both runwith Lustre under Bright Cluster Manager The configuration for thesecan be done as follows

bull IDH

ndash A subdirectory of mntlustre must be specified in thehadoop2lustreconfxml file within the ltafsgtltafsgt tagpair

Example

ltafsgt

ltfstypegtlustreltfstypegt

ltfsrootgtmntlustrehadoopltfsrootgt

copy Bright Computing Inc

12 Installing Hadoop

ltafsgt

bull Cloudera

ndash A subdirectory of mntlustre must be specified in thehadoop2lustreconfxml file within the ltafsgtltafsgt tagpair

ndash In addition an ltfsjargtltfsjargt tag pair must be specifiedmanually for the jar that the Intel IEEL 2x distribution pro-vides

Example

ltafsgt

ltfstypegtlustreltfstypegt

ltfsrootgtmntlustrehadoopltfsrootgt

ltfsjargtrootlustrehadoop-lustre-plugin-230jarltfsjargt

ltafsgt

The installation of the Lustre plugin is automatic if this jarname is set to the right name when the cm-hadoop-setupscript is run

Lustre Hadoop Installation With cm-hadoop-setupThe XML configuration file specifies how Lustre should be integrated inHadoop If the configuration file is at ltroothadooplustreconfigxmlgt then it can be run as

Example

cm-hadoop-setup -c ltroothadooplustreconfigxmlgt

As part of configuring Hadoop to use Lustre the execution will

bull Set the ACLs on the directory specified within the ltfsrootgtltfsrootgttag pair This was set to mntlustrehadoop earlier on as an ex-ample

bull Copy the Lustre plugin from its jar path as specified in the XML fileto the correct place on the client nodes

Specifically the subdirectory sharehadoopcommonlib iscopied into a directory relative to the Hadoop installation direc-tory For example the Cloudera version of Hadoop version 230-cdh512 has the Hadoop installation directory cmshareappshadoopClouder230-cdh512 The copy is therefore car-ried out in this case fromrootlustrehadoop-lustre-plugin-230jartocmsharedappshadoopCloudera230-cdh512sharehadoopcommonlib

copy Bright Computing Inc

25 Hadoop Installation In A Cloud 13

Lustre Hadoop Integration In cmsh and cmgui

In cmsh Lustre integration is indicated in hadoop mode

Example

[hadoop2-gthadoop] show hdfs1 | grep -i lustre

Hadoop root for Lustre mntlustrehadoop

Use Lustre yes

In cmgui the Overview tab in the items for the Hadoop HDFS re-source indicates if Lustre is running along with its mount point (fig-ure 23)

Figure 23 A Hadoop Instance With Lustre In cmgui

25 Hadoop Installation In A CloudHadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual) or a Clus-ter Extension configuration (Chapter 3 of the Cloudbursting Manual) Inboth cases the cloud nodes used should be at least m1medium

bull For Cluster-On-Demand the following considerations apply

ndash There are no specific issues After a stopstart cycle Hadooprecognizes the new IP addresses and refreshes the list of nodesaccordingly (section 241 of the Cloudbursting Manual)

bull For Cluster Extension the following considerations apply

ndash To install Hadoop on cloud nodes the XML configurationcmlocalappscluster-toolshadoopconfhadoop2clusterextensionconfxmlcan be used as a guide

copy Bright Computing Inc

14 Installing Hadoop

ndash In the hadoop2clusterextensionconfxml file the clouddirector that is to be used with the Hadoop cloud nodes mustbe specified by the administrator with the ltedgegtltedgegttag pair

Example

ltedgegt

lthostsgteu-west-1-directorlthostsgt

ltedgegt

Maintenance operations such as a format will automaticallyand transparently be carried out by cmdaemon running on thecloud director and not on the head node

There are some shortcomings as a result of relying upon the clouddirector

ndash Cloud nodes depend on the same cloud director

ndash While Hadoop installation (cm-hadoop-setup) is run on thehead node users must run Hadoop commandsmdashjob submis-sions and so onmdashfrom the director not from the head node

ndash It is not possible to mix cloud and non-cloud nodes for thesame Hadoop instance That is a local Hadoop instance can-not be extended by adding cloud nodes

copy Bright Computing Inc

3Viewing And Managing HDFS

From CMDaemon31 cmgui And The HDFS ResourceIn cmgui clicking on the resource folder for Hadoop in the resource treeopens up an Overview tab that displays a list of Hadoop HDFS instancesas in figure 21

Clicking on an instance brings up a tab from a series of tabs associatedwith that instance as in figure 31

Figure 31 Tabs For A Hadoop Instance In cmgui

The possible tabs are Overview (section 311) Settings (section 312)Tasks (section 313) and Notes (section 314) Various sub-panes aredisplayed in the possible tabs

copy Bright Computing Inc

16 Viewing And Managing HDFS From CMDaemon

311 The HDFS Instance Overview TabThe Overview tab pane (figure 32) arranges Hadoop-related items insub-panes for convenient viewing

Figure 32 Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed

bull Statistics In the first sub-pane row are statistics associated with theHadoop instance These include node data on the number of appli-cations that are running as well as memory and capacity usage

bull Metrics In the second sub-pane row a metric can be selected fordisplay as a graph

bull Roles In the third sub-pane row the roles associated with a nodeare displayed

312 The HDFS Instance Settings TabThe Settings tab pane (figure 33) displays the configuration of theHadoop instance

copy Bright Computing Inc

31 cmgui And The HDFS Resource 17

Figure 33 Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed

bull WebHDFS Allows HDFS to be modified via the web

bull Description An arbitrary user-defined description

bull Directory settings for the Hadoop instance Directories for config-uration temporary purposes and logging

bull Topology awareness setting Optimizes Hadoop for racks switchesor nothing when it comes to network topology

bull Replication factors Set the default and maximum replication al-lowed

bull Balancing settings Spread loads evenly across the instance

313 The HDFS Instance Tasks TabThe Tasks tab (figure 34) displays Hadoop-related tasks that the admin-istrator can execute

copy Bright Computing Inc

18 Viewing And Managing HDFS From CMDaemon

Figure 34 Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to

bull Restart start or stop all the Hadoop daemons

bull StartStop the Hadoop balancer

bull Format the Hadoop filesystem A warning is given before format-ting

bull Carry out a manual failover for the NameNode This feature meansmaking the other NameNode the active one and is available onlyfor Hadoop instances that support NameNode failover

314 The HDFS Instance Notes TabThe Notes tab is like a jotting pad and can be used by the administratorto write notes for the associated Hadoop instance

32 cmsh And hadoop ModeThe cmsh front end can be used instead of cmgui to display informationon Hadoop-related values and to carry out Hadoop-related tasks

321 The show And overview CommandsThe show CommandWithin hadoop mode the show command displays parameters that cor-respond mostly to cmguirsquos Settings tab in the Hadoop resource (sec-tion 312)

Example

[root@bright70 conf]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Apache121   1.2.1           Apache               /etc/hadoop/Apache121
[bright70->hadoop]% show apache121
Parameter                               Value
-------------------------------------   ------------------------------------
Automatic failover                      Disabled
Balancer Threshold                      10
Balancer period                         2
Balancing policy                        dataNode
Cluster ID
Configuration directory                 /etc/hadoop/Apache121
Configuration directory for HBase       /etc/hadoop/Apache121/hbase
Configuration directory for ZooKeeper   /etc/hadoop/Apache121/zookeeper
Creation time                           Fri, 07 Feb 2014 17:03:01 CET
Default replication factor              3
Enable HA for YARN                      no
Enable NFS gateway                      no
Enable Web HDFS                         no
HA enabled                              no
HA name service
HDFS Block size                         67108864
HDFS Permission                         no
HDFS Umask                              077
HDFS log directory                      /var/log/hadoop/Apache121
Hadoop distribution                     Apache
Hadoop root
Hadoop version                          1.2.1
Installation directory for HBase        /cm/shared/apps/hadoop/Apache/hbase-0+
Installation directory for ZooKeeper    /cm/shared/apps/hadoop/Apache/zookeepe+
Installation directory                  /cm/shared/apps/hadoop/Apache/1.2.1
Maximum replication factor              512
Network
Revision
Root directory for data                 /var/lib
Temporary directory                     /tmp/hadoop/Apache121
Topology                                Switch
Use HTTPS                               no
Use Lustre                              no
Use federation                          no
Use only HTTPS                          no
YARN automatic failover                 Hadoop
description                             installed from: /tmp/hadoop-1.2.1.tar+
name                                    Apache121
notes

The overview Command
Similarly, the overview command corresponds somewhat to cmgui's Overview tab in the Hadoop resource (section 3.1.1), and provides Hadoop-related information on the system resources that are used.

Example

[bright70->hadoop]% overview apache121
Parameter                        Value
-----------------------------    ---------------------------
Name                             apache121
Capacity total                   0B
Capacity used                    0B
Capacity remaining               0B
Heap memory total                0B
Heap memory used                 0B
Heap memory remaining            0B
Non-heap memory total            0B
Non-heap memory used             0B
Non-heap memory remaining        0B
Nodes available                  0
Nodes dead                       0
Nodes decommissioned             0
Nodes decommission in progress   0
Total files                      0
Total blocks                     0
Missing blocks                   0
Under-replicated blocks          0
Scheduled replication blocks     0
Pending replication blocks       0
Block report average Time        0
Applications running             0
Applications pending             0
Applications submitted           0
Applications completed           0
Applications failed              0
Federation setup                 no

Role                             Node
------------------------------   ---------------------------
DataNode, ZooKeeper              bright70,node001,node002

3.2.2 The Tasks Commands
Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The services Commands For Hadoop Services
Hadoop services can be started, stopped, and restarted with:

• restartallservices

• startallservices

• stopallservices

Example

[bright70->hadoop]% restartallservices apache121
Will now stop all Hadoop services for instance 'apache121'... done.
Will now start all Hadoop services for instance 'apache121'... done.
[bright70->hadoop]%

The startbalancer And stopbalancer Commands For Hadoop
For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer

• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:


• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp  Iteration  Bytes Moved  Bytes To Move  Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%
Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121. For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>
The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>
The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
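For example, for a hypothetical failover-capable instance hdfs1 with NameNodes running on node001 and node002, where the host names and instance name here are illustrative rather than output from a real session, a failover could be requested with:

Example

[bright70->hadoop]% manualfailover --from node001 --to node002 hdfs1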


4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:


• Selecting a Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui
The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


The options buttons act just as for any other service (section 3.11 of the Administrator Manual), which means it includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.


5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# cm-hadoop-tests.sh apache220

================================================================
Press [CTRL+C] to stop...
================================================================

================================================================
start cleaning directories...
================================================================

================================================================
clean directories done
================================================================

================================================================
start doing gen_test...
================================================================

14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2
Job Counters
Map-Reduce Framework
org.apache.hadoop.examples.terasort.TeraGen$Counters
14/03/24 15:07:03 INFO terasort.TeraSort: starting
14/03/24 15:09:12 INFO terasort.TeraSort: done
================================================================================
gen_test done
================================================================================

================================================================================
start doing PI test
================================================================================
Working Directory = /user/root/bbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                        Value
------------------------------   ---------------------------------
Name                             Apache220
Capacity total                   2756GB
Capacity used                    7246MB
Capacity remaining               1641GB
Heap memory total                2807MB
Heap memory used                 1521MB
Heap memory remaining            1287MB
Non-heap memory total            2581MB
Non-heap memory used             2519MB
Non-heap memory remaining        6155MB
Nodes available                  3
Nodes dead                       0
Nodes decommissioned             0
Nodes decommission in progress   0
Total files                      72
Total blocks                     31
Missing blocks                   0
Under-replicated blocks          2
Scheduled replication blocks     0
Pending replication blocks       0
Block report average Time        59666
Applications running             1
Applications pending             0
Applications submitted           7
Applications completed           6
Applications failed              0
High availability                Yes (automatic failover disabled)
Federation setup                 no

Role                                                             Node
--------------------------------------------------------------   -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper   node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper               node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.
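A sketch of how that unset might look for the same user, where the exact value syntax is ordinary cmsh usage rather than something specified by this manual:

Example

[bright70->user[fred]]% set hadoophdfsaccess ""; commit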

The user fred can then submit a run of a pi value estimator from the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache/2.2.0
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options, 1 and 5 in the example. The result can be improved with greater values for both.
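For example, a run along the following lines, with 16 maps and 100000 samples per map (values chosen here only for illustration), takes longer but returns an estimate with considerably more correct digits:

Example

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 16 100000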


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, which collects traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation, And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use the Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are, by default, installed and run on the head node. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option:
-p <rootpass>
is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.

The option:
-s <heapsize>
is not mandatory. If not set, a default value of 1GB is used.

The option:
--master <nodename>
is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run as the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation, And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use.

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;
DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.


– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive, by default, on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script, using hive and beeline.

An Example Run With cm-hive-setup
Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005

Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client...
-- testing 'beeline' client...
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation, And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
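Once installed, a quick manual smoke test can be run with the stock Kafka command-line tools. The following is an illustrative sketch only: the module name kafka/hdfs1 follows the naming pattern of the other tools in this chapter, and the host node001 (a ZooKeeper node, with the default ZooKeeper port 2181 and Kafka broker port 9092) is an assumption, not output from the setup script:

Example

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --zookeeper node001:2181 --create --topic test --partitions 1 --replication-factor 1
[root@bright70 ~]# kafka-console-producer.sh --broker-list node001:9092 --topic test
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 --topic test --from-beginning

Messages typed into the producer should then appear in the consumer.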

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation, And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
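The smoke.pig script itself is not listed in this manual. As an illustration of the kind of Pig Latin that can be run this way, a minimal word-count script along the following lines would work, assuming an input file /tmp/input.txt has already been uploaded to HDFS:

Example

-- load lines of text, split them into words, and count each word
lines = LOAD '/tmp/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;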

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation, And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option: --standalone

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks HDP 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz

Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup In Standalone Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005

Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode.
Spark Master service will be run on node node005.
Spark will use all DataNodes as WorkerNodes.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark
Spark supports two deploy modes for launching Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.
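As an illustration of such a transfer, a stock Sqoop import of a MySQL table into HDFS looks like the following sketch. The module name sqoop/hdfs1 follows the naming pattern of the other tools in this chapter, and the database host, database, and table names are placeholders rather than values from this manual:

Example

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://dbhost/testdb --username dbuser -P --table employees --target-dir /user/root/employees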

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop: it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation, And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup
Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005

Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation, And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the head node can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.
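Killing it is also possible from the command line with the stock Storm client, as a sketch (the topology name is the one submitted by the setup script):

Example

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm kill WordCount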

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2
470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2
[root@hadoopdev ~]# storm list
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15



3Viewing And Managing HDFS

From CMDaemon31 cmgui And The HDFS ResourceIn cmgui clicking on the resource folder for Hadoop in the resource treeopens up an Overview tab that displays a list of Hadoop HDFS instancesas in figure 21

Clicking on an instance brings up a tab from a series of tabs associatedwith that instance as in figure 31

Figure 31 Tabs For A Hadoop Instance In cmgui

The possible tabs are Overview (section 311) Settings (section 312)Tasks (section 313) and Notes (section 314) Various sub-panes aredisplayed in the possible tabs

copy Bright Computing Inc

16 Viewing And Managing HDFS From CMDaemon

311 The HDFS Instance Overview TabThe Overview tab pane (figure 32) arranges Hadoop-related items insub-panes for convenient viewing

Figure 32 Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed

bull Statistics In the first sub-pane row are statistics associated with theHadoop instance These include node data on the number of appli-cations that are running as well as memory and capacity usage

bull Metrics In the second sub-pane row a metric can be selected fordisplay as a graph

bull Roles In the third sub-pane row the roles associated with a nodeare displayed

312 The HDFS Instance Settings TabThe Settings tab pane (figure 33) displays the configuration of theHadoop instance

copy Bright Computing Inc

31 cmgui And The HDFS Resource 17

Figure 33 Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed

bull WebHDFS Allows HDFS to be modified via the web

bull Description An arbitrary user-defined description

bull Directory settings for the Hadoop instance Directories for config-uration temporary purposes and logging

bull Topology awareness setting Optimizes Hadoop for racks switchesor nothing when it comes to network topology

bull Replication factors Set the default and maximum replication al-lowed

bull Balancing settings Spread loads evenly across the instance

313 The HDFS Instance Tasks TabThe Tasks tab (figure 34) displays Hadoop-related tasks that the admin-istrator can execute

copy Bright Computing Inc

18 Viewing And Managing HDFS From CMDaemon

Figure 34 Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to

bull Restart start or stop all the Hadoop daemons

bull StartStop the Hadoop balancer

bull Format the Hadoop filesystem A warning is given before format-ting

bull Carry out a manual failover for the NameNode This feature meansmaking the other NameNode the active one and is available onlyfor Hadoop instances that support NameNode failover

314 The HDFS Instance Notes TabThe Notes tab is like a jotting pad and can be used by the administratorto write notes for the associated Hadoop instance

32 cmsh And hadoop ModeThe cmsh front end can be used instead of cmgui to display informationon Hadoop-related values and to carry out Hadoop-related tasks

321 The show And overview CommandsThe show CommandWithin hadoop mode the show command displays parameters that cor-respond mostly to cmguirsquos Settings tab in the Hadoop resource (sec-tion 312)

Example

[rootbright70 conf] cmsh

[bright70] hadoop

[bright70-gthadoop] list

Name (key) Hadoop version Hadoop distribution Configuration directory

---------- -------------- ------------------- -----------------------

Apache121 121 Apache etchadoopApache121

[bright70-gthadoop] show apache121

Parameter Value

copy Bright Computing Inc

32 cmsh And hadoop Mode 19

------------------------------------- ------------------------------------

Automatic failover Disabled

Balancer Threshold 10

Balancer period 2

Balancing policy dataNode

Cluster ID

Configuration directory etchadoopApache121

Configuration directory for HBase etchadoopApache121hbase

Configuration directory for ZooKeeper etchadoopApache121zookeeper

Creation time Fri 07 Feb 2014 170301 CET

Default replication factor 3

Enable HA for YARN no

Enable NFS gateway no

Enable Web HDFS no

HA enabled no

HA name service

HDFS Block size 67108864

HDFS Permission no

HDFS Umask 077

HDFS log directory varloghadoopApache121

Hadoop distribution Apache

Hadoop root

Hadoop version 121

Installation directory for HBase cmsharedappshadoopApachehbase-0+

Installation directory for ZooKeeper cmsharedappshadoopApachezookeepe+

Installation directory cmsharedappshadoopApache121

Maximum replication factor 512

Network

Revision

Root directory for data varlib

Temporary directory tmphadoopApache121

Topology Switch

Use HTTPS no

Use Lustre no

Use federation no

Use only HTTPS no

YARN automatic failover Hadoop

description installed from tmphadoop-121tar+

name Apache121

notes

The overview Command
Similarly, the overview command corresponds somewhat to cmgui's Overview tab in the Hadoop resource (section 3.1.1), and provides Hadoop-related information on the system resources that are used:

Example

[bright70->hadoop]% overview apache121
Parameter                       Value
------------------------------  ---------------------------
Name                            apache121
Capacity total                  0B
Capacity used                   0B
Capacity remaining              0B
Heap memory total               0B
Heap memory used                0B
Heap memory remaining           0B
Non-heap memory total           0B
Non-heap memory used            0B
Non-heap memory remaining       0B
Nodes available                 0
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     0
Total blocks                    0
Missing blocks                  0
Under-replicated blocks         0
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       0
Applications running            0
Applications pending            0
Applications submitted          0
Applications completed          0
Applications failed             0
Federation setup                no

Role                            Node
------------------------------  ---------------------------
DataNode, ZooKeeper             bright70,node001,node002

3.2.2 The Tasks Commands
Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The services Commands For Hadoop Services
Hadoop services can be started, stopped, and restarted with:

• restartallservices

• startallservices

• stopallservices

Example

[bright70->hadoop]% restartallservices apache121
Will now stop all Hadoop services for instance 'apache121'... done.
Will now start all Hadoop services for instance 'apache121'... done.
[bright70->hadoop]%

The startbalancer And stopbalancer Commands For Hadoop
For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer

• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:

• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%
Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121. For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
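For example, on an HA-enabled instance, the active status could be handed to a chosen NameNode as in the following minimal sketch, which follows the usage above; the instance name hdfs1 and the target node node002 are illustrative placeholders, not taken from this manual:

[bright70->hadoop]% manualfailover --to node002 hdfs1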

4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:

• Selecting a Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui
The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui

The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means it includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.

5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220

================================================================
Press [CTRL+C] to stop...
================================================================
================================================================
start cleaning directories...
================================================================
================================================================
clean directories done
================================================================
================================================================
start doing gen_test...
================================================================
14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2
...
Job Counters
...
Map-Reduce Framework
...
org.apache.hadoop.examples.terasort.TeraGen$Counters
...
14/03/24 15:07:03 INFO terasort.TeraSort: starting
...
14/03/24 15:09:12 INFO terasort.TeraSort: done
================================================================================
gen_test done
================================================================================
================================================================================
start doing PI test
================================================================================
Working Directory = /user/root/bbp
...

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                       Value
------------------------------  ---------------------------------
Name                            Apache220
Capacity total                  27.56GB
Capacity used                   72.46MB
Capacity remaining              16.41GB
Heap memory total               280.7MB
Heap memory used                152.1MB
Heap memory remaining           128.7MB
Non-heap memory total           258.1MB
Non-heap memory used            251.9MB
Non-heap memory remaining       6.155MB
Nodes available                 3
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     72
Total blocks                    31
Missing blocks                  0
Under-replicated blocks         2
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       596.66
Applications running            1
Applications pending            0
Applications submitted          7
Applications completed          6
Applications failed             0
High availability               Yes (automatic failover disabled)
Federation setup                no

Role                                                           Node
-------------------------------------------------------------- -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper             node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.

The user fred can then submit a run from a pi value estimator, from the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache/2.2.0
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both, as in the sketch below.
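For instance, a run with 16 maps and 100000 samples per map could be submitted with the same jar file; the estimate shown here is illustrative, not output reproduced from an actual run:

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 16 100000
...
Estimated value of Pi is 3.14158440000000000000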

6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).

6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation, And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use the Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are by default installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option -p <rootpass> is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.

The option -s <heapsize> is not mandatory. If not set, a default value of 1GB is used.

The option --master <nodename> is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.
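After installation, a quick interactive check can be carried out with the Accumulo shell. The following is a minimal sketch: the tables command merely lists the tables in the instance, and <rootpass> is the password that was specified during setup:

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ accumulo shell -u root -p <rootpass> -e "tables"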

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation, And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use.

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;
DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.

– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script using hive and beeline.

An Example Run With cm-hive-setup
Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client...
-- testing 'beeline' client...
Hive setup validation... done.
Installation successfully completed.
Finished.

6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000
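Beeline can also run arbitrary HiveQL via the -e option. As a minimal sketch, a table could be created in the same way; the table name test3 and its columns are illustrative placeholders:

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'CREATE TABLE test3 (id INT, name STRING);'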

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation, And What Kafka Installation Does
The following applies to using cm-kafka-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
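A quick functional check can then be carried out with the console tools shipped with Kafka. The following is a minimal sketch: it assumes a broker listening on node001 at the default port 9092, and ZooKeeper on node001 at port 2181. The node names, ports, and topic name are illustrative assumptions, not taken from this manual:

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 --replication-factor 1 --partitions 1 --topic test
[root@bright70 ~]# echo "hello" | kafka-console-producer.sh --broker-list node001:9092 --topic test
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 --topic test --from-beginning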

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.

Prerequisites For Pig Installation, And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance. A sketch of the kind of script that could be run this way is shown below.
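The contents of /tmp/smoke.pig are not given in this manual. The following hypothetical word-count script merely illustrates what such a Pig Latin script could look like; the input path is a placeholder:

-- hypothetical smoke-test script: counts word occurrences in an HDFS file
lines  = LOAD '/user/root/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
DUMP counts;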

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation, And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x. It is possible to force the Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.

The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks HDP 2.x, and Pivotal HD 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup In Standalone Mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode.
Spark Master service will be run on node node005.
Spark will use all DataNodes as WorkerNodes.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark
Spark supports two deploy modes for launching Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar
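For a Standalone mode installation, the same example could be submitted to the Spark Master directly instead. This is a minimal sketch: it assumes that the Master was installed on node005, and that it listens on the default Spark Master port 7077; both are illustrative assumptions:

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master spark://node005:7077 \
--class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar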

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop: it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.

Prerequisites For Sqoop Installation, And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup
Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
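Once installed, a bulk import can be run with the sqoop client. The following minimal sketch imports a MySQL table into HDFS; the database name, table name, credentials, and target directory are illustrative placeholders, not taken from this manual:

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://node005:3306/testdb \
--username fred -P --table pets --target-dir /user/fred/pets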

6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation, And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• Validation tests are carried out by the script.

An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar storm.starter.WordCountTopology WordCount2
470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2
[root@hadoopdev ~]# storm list
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15
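The topology can also be killed from the command line, instead of from the Storm UI, using the standard Storm client:

[root@bright70 ~]# storm kill WordCount2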

31 cmgui And The HDFS Resource 17

Figure 33 Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed

bull WebHDFS Allows HDFS to be modified via the web

bull Description An arbitrary user-defined description

bull Directory settings for the Hadoop instance Directories for config-uration temporary purposes and logging

bull Topology awareness setting Optimizes Hadoop for racks switchesor nothing when it comes to network topology

bull Replication factors Set the default and maximum replication al-lowed

bull Balancing settings Spread loads evenly across the instance

313 The HDFS Instance Tasks TabThe Tasks tab (figure 34) displays Hadoop-related tasks that the admin-istrator can execute

copy Bright Computing Inc

18 Viewing And Managing HDFS From CMDaemon

Figure 34 Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to

bull Restart start or stop all the Hadoop daemons

bull StartStop the Hadoop balancer

bull Format the Hadoop filesystem A warning is given before format-ting

bull Carry out a manual failover for the NameNode This feature meansmaking the other NameNode the active one and is available onlyfor Hadoop instances that support NameNode failover

314 The HDFS Instance Notes TabThe Notes tab is like a jotting pad and can be used by the administratorto write notes for the associated Hadoop instance

32 cmsh And hadoop ModeThe cmsh front end can be used instead of cmgui to display informationon Hadoop-related values and to carry out Hadoop-related tasks

321 The show And overview CommandsThe show CommandWithin hadoop mode the show command displays parameters that cor-respond mostly to cmguirsquos Settings tab in the Hadoop resource (sec-tion 312)

Example

[rootbright70 conf] cmsh

[bright70] hadoop

[bright70-gthadoop] list

Name (key) Hadoop version Hadoop distribution Configuration directory

---------- -------------- ------------------- -----------------------

Apache121 121 Apache etchadoopApache121

[bright70-gthadoop] show apache121

Parameter Value

copy Bright Computing Inc

32 cmsh And hadoop Mode 19

------------------------------------- ------------------------------------

Automatic failover Disabled

Balancer Threshold 10

Balancer period 2

Balancing policy dataNode

Cluster ID

Configuration directory etchadoopApache121

Configuration directory for HBase etchadoopApache121hbase

Configuration directory for ZooKeeper etchadoopApache121zookeeper

Creation time Fri 07 Feb 2014 170301 CET

Default replication factor 3

Enable HA for YARN no

Enable NFS gateway no

Enable Web HDFS no

HA enabled no

HA name service

HDFS Block size 67108864

HDFS Permission no

HDFS Umask 077

HDFS log directory varloghadoopApache121

Hadoop distribution Apache

Hadoop root

Hadoop version 121

Installation directory for HBase cmsharedappshadoopApachehbase-0+

Installation directory for ZooKeeper cmsharedappshadoopApachezookeepe+

Installation directory cmsharedappshadoopApache121

Maximum replication factor 512

Network

Revision

Root directory for data varlib

Temporary directory tmphadoopApache121

Topology Switch

Use HTTPS no

Use Lustre no

Use federation no

Use only HTTPS no

YARN automatic failover Hadoop

description installed from tmphadoop-121tar+

name Apache121

notes

The overview CommandSimilarly the overview command corresponds somewhat to thecmguirsquos Overview tab in the Hadoop resource (section 311) and pro-vides hadoop-related information on the system resources that are used

Example

[bright70-gthadoop] overview apache121

Parameter Value

----------------------------- ---------------------------

Name apache121

Capacity total 0B

Capacity used 0B

Capacity remaining 0B

Heap memory total 0B

copy Bright Computing Inc

20 Viewing And Managing HDFS From CMDaemon

Heap memory used 0B

Heap memory remaining 0B

Non-heap memory total 0B

Non-heap memory used 0B

Non-heap memory remaining 0B

Nodes available 0

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 0

Total blocks 0

Missing blocks 0

Under-replicated blocks 0

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 0

Applications running 0

Applications pending 0

Applications submitted 0

Applications completed 0

Applications failed 0

Federation setup no

Role Node

------------------------------ ---------------------------

DataNode ZooKeeper bright70node001node002

322 The Tasks CommandsWithin hadoopmode the following commands run tasks that correspondmostly to the Tasks tab (section 313) in the cmgui Hadoop resource

The services Commands For Hadoop ServicesHadoop services can be started stopped and restarted with

bull restartallservices

bull startallservices

bull stopallservices

Example

[bright70-gthadoop] restartallservices apache121

Will now stop all Hadoop services for instance rsquoapache121rsquo done

Will now start all Hadoop services for instance rsquoapache121rsquo done

[bright70-gthadoop]

The startbalancer And stopbalancer Commands For HadoopFor efficient access to HDFS file block level usage across nodes shouldbe reasonably balanced Hadoop can start and stop balancing across theinstance with the following commands

bull startbalancer

bull stopbalancer

The balancer policy threshold and period can be retrieved or set forthe instance using the parameters

copy Bright Computing Inc

32 cmsh And hadoop Mode 21

bull balancerperiod

bull balancerthreshold

bull balancingpolicy

Example

[bright70-gthadoop] get apache121 balancerperiod

2

[bright70-gthadoop] set apache121 balancerperiod 3

[bright70-gthadoop] commit

[bright70-gthadoop] startbalancer apache121

Code 0

Starting Hadoop balancer daemon (hadoop-apache121-balancer)starting balancer logging to varloghadoopapache121hdfshadoop-hdfs-balancer-bright70out

Time Stamp Iteration Bytes Moved Bytes To Move Bytes Being Moved

The cluster is balanced Exiting

[bright70-gthadoop]

Thu Mar 20 152702 2014 [notice] bright70 Started balancer for apache121 For details type events details 152727

The formathdfs CommandUsage

formathdfs ltHDFSgtThe formathdfs command formats an instance so that it can be reused

Existing Hadoop services for the instance are stopped first before format-ting HDFS and started again after formatting is complete

Example

[bright70->hadoop] formathdfs apache121

Will now format and set up HDFS for instance 'apache121'.

Stopping datanodes... done

Stopping namenodes... done

Formatting HDFS... done

Starting namenode (host 'bright70')... done

Starting datanodes... done

Waiting for datanodes to come up... done

Setting up HDFS... done

[bright70->hadoop]

The manualfailover Command
Usage:

manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
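For illustration, a hypothetical failover session in hadoop mode could look like the following, where the node names and the instance name hdfs1 are assumptions for the sketch:

Example

[bright70->hadoop] manualfailover --from node001 --to node002 hdfs1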


4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:


• Selecting A Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui

The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means it includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.


5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop

[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220

================================================================

Press [CTRL+C] to stop

================================================================

================================================================

start cleaning directories

================================================================

================================================================

clean directories done

================================================================

================================================================

start doing gen_test

================================================================

14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2

14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2

Job Counters

Map-Reduce Framework


org.apache.hadoop.examples.terasort.TeraGen$Counters

14/03/24 15:07:03 INFO terasort.TeraSort: starting

14/03/24 15:09:12 INFO terasort.TeraSort: done

================================================================================

gen_test done

================================================================================

================================================================================

start doing PI test

================================================================================

Working Directory = /user/root/bbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop] overview apache220

Parameter Value

------------------------------ ---------------------------------

Name Apache220

Capacity total 27.56GB

Capacity used 72.46MB

Capacity remaining 16.41GB

Heap memory total 280.7MB

Heap memory used 152.1MB

Heap memory remaining 128.7MB

Non-heap memory total 258.1MB

Non-heap memory used 251.9MB

Non-heap memory remaining 6.155MB

Nodes available 3

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 72

Total blocks 31

Missing blocks 0


Under-replicated blocks 2

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 59666

Applications running 1

Applications pending 0

Applications submitted 7

Applications completed 6

Applications failed 0

High availability Yes (automatic failover disabled)

Federation setup no

Role Node

-------------------------------------------------------------- -------

DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper node001

DataNode, Journal, NameNode, YARNClient, ZooKeeper node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]] set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.
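As a sketch of the documented blank-value behavior, revoking the access again could then look like the following, where the exact blank-value syntax is an assumption:

Example

[bright70->user[fred]] set hadoophdfsaccess ""; commit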

The user fred can then submit a run from a pi value estimator, from the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache/2.2.0

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5

Job Finished in 19.732 seconds

Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
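For instance, a hypothetical run with more maps and samples (the values here are chosen only for illustration) would typically yield an estimate closer to pi:

Example

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 10000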


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation, And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use the Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are, by default, installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option:

-p <rootpass>

is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.

copy Bright Computing Inc

61 Accumulo 33

The option:

-s <heapsize>

is not mandatory. If not set, a default value of 1GB is used.

The option:

--master <nodename>

is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005

Accumulo release '1.6.2'

Accumulo GC, Master, Monitor, and Tracer services will be run on node node005

Found Hadoop instance 'hdfs1', release 2.6.0

Accumulo being installed... done

Creating directories for Accumulo... done

Creating module file for Accumulo... done

Creating configuration files for Accumulo... done

Updating images... done

Setting up Accumulo directories in HDFS... done

Executing 'accumulo init'... done

Initializing services for Accumulo (on DataNodes)... done

Initializing master services for Accumulo... done

Waiting for NameNode to be ready... done

Executing validation test... done

Installation successfully completed

Finished

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1

Requested removal of Accumulo for Hadoop instance 'hdfs1'

Stopping/removing services... done

Removing module file... done

Removing additional Accumulo directories... done

Updating images... done

Removal successfully completed

Finished

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run as the accumulo system user.

Example

[root@bright70 ~]# su - accumulo

bash-4.1$ module load accumulo/hdfs1

bash-4.1$ cd $ACCUMULO_HOME

bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \


org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 --maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10
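The ingested table can then be inspected interactively with the Accumulo shell. A minimal sketch follows, assuming the Accumulo instance is named hdfs1 and that the root password set at install time is secret:

Example

bash-4.1$ accumulo shell -u root -p secret
root@hdfs1> table sort
root@hdfs1 sort> scan
root@hdfs1 sort> quit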

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation, And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j

– cm-hive-setup is run with the --conn option to specify the connector version to use.

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;

DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the Metastore database to be used by Hive. The same name is used later by cm-hive-setup.

copy Bright Computing Inc

62 Hive 35

– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive, by default, on the active head node.

It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script, using hive and beeline.

An Example Run With cm-hive-setup
Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005

Hive release '1.1.0-bin'

Using MySQL server on active headnode

Hive service will be run on node node005

Using MySQL Connector/J installed in /usr/share/java

Hive being installed... done

Creating directories for Hive... done

Creating module file for Hive... done

Creating configuration files for Hive... done

Initializing database 'metastore_hdfs1' in MySQL... done

Waiting for NameNode to be ready... done

Creating HDFS directories for Hive... done

Updating images... done

Waiting for NameNode to be ready... done

Hive setup validation...

-- testing 'hive' client

-- testing 'beeline' client

Hive setup validation... done

Installation successfully completed

Finished


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance 'hdfs1'

Stopping/removing services... done

Removing module file... done

Removing additional Hive directories... done

Updating images... done

Removal successfully completed

Finished

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 -d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'

Connecting to jdbc:hive2://node005.cm.cluster:10000

Connected to: Apache Hive (version 1.1.0)

Driver: Hive JDBC (version 1.1.0)

Transaction isolation: TRANSACTION_REPEATABLE_READ

+-----------+--+

| tab_name |

+-----------+--+

| test |

| test2 |

+-----------+--+

2 rows selected (0.243 seconds)

Beeline version 1.1.0 by Apache Hive

Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation, And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/kafka_2.11-0.8.2.1.tgz

Kafka release '0.8.2.1' for Scala '2.11'

Found Hadoop instance 'hdfs1', release 1.2.1

Kafka being installed... done

Creating directories for Kafka... done

Creating module file for Kafka... done

Creating configuration files for Kafka... done

Updating images... done

Initializing services for Kafka (on ZooKeeper nodes)... done

Executing validation test... done

Installation successfully completed

Finished
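A quick functional check can be made with the console producer and consumer scripts that ship with Kafka 0.8.x. The following is a sketch only: the module name and node name are assumptions, and the broker port is the Kafka default of 9092:

Example

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 --replication-factor 1 --partitions 1 --topic test
[root@bright70 ~]# kafka-console-producer.sh --broker-list node001:9092 --topic test
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 --topic test --from-beginning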

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1

Requested removal of Kafka for Hadoop instance 'hdfs1'

Stopping/removing services... done

Removing module file... done

Removing additional Kafka directories... done

Updating images... done

Removal successfully completed

Finished

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation, And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/pig-0.14.0.tar.gz

Pig release '0.14.0'

Pig being installed... done

Creating directories for Pig... done

Creating module file for Pig... done

Creating configuration files for Pig... done

Waiting for NameNode to be ready...

Waiting for NameNode to be ready... done

Validating Pig setup...

Validating Pig setup... done

Installation successfully completed

Finished

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1

Requested removal of Pig for Hadoop instance 'hdfs1'

Stopping/removing services... done

Removing module file... done

Removing additional Pig directories... done

Updating images... done

Removal successfully completed

Finished

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig

copy Bright Computing Inc

65 Spark 39

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
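The contents of a script such as /tmp/smoke.pig are not fixed by Bright Cluster Manager. As a minimal hypothetical sketch, a word-count script could be created and run as follows, assuming a text file /user/root/input.txt already exists in HDFS:

Example

[root@bright70 ~]# cat > /tmp/smoke.pig << 'EOF'
-- hypothetical smoke test: count word occurrences in an HDFS text file
lines  = LOAD '/user/root/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
counts = FOREACH (GROUP words BY word) GENERATE group, COUNT(words);
DUMP counts;
EOF
[root@bright70 ~]# pig -v -f /tmp/smoke.pig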

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation, And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option: --standalone

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release 2.6.0

Spark will be installed in YARN (client/cluster) mode

Spark being installed... done

Creating directories for Spark... done

Creating module file for Spark... done

Creating configuration files for Spark... done

Waiting for NameNode to be ready... done

Copying Spark assembly jar to HDFS... done

Waiting for NameNode to be ready... done

Validating Spark setup... done

Installation successfully completed

Finished

An Example Run With cm-spark-setup In Standalone Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release 2.6.0

Spark will be installed to work in Standalone mode

Spark Master service will be run on node node005

Spark will use all DataNodes as WorkerNodes

Spark being installed... done

Creating directories for Spark... done

Creating module file for Spark... done

Creating configuration files for Spark... done

Updating images... done

Initializing Spark Master service... done

Initializing Spark Worker service... done

Validating Spark setup... done


Installation successfully completed

Finished

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance 'hdfs1'

Stopping/removing services... done

Removing module file... done

Removing additional Spark directories... done

Removal successfully completed

Finished

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of Application Master.

[root@bright70 ~]# module load spark/hdfs1

[root@bright70 ~]# spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1

[root@bright70 ~]# spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar
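For a Spark instance installed in Standalone mode, the submission could instead hypothetically target the Spark Master directly, using the standard Spark Master URL form. Here the master node name is an assumption, and 7077 is the Spark Master default port:

Example

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master spark://node005:7077 --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar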

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation, And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup
Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005

Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar

Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'

Sqoop service will be run on node node005

Found Hadoop instance 'hdfs1', release 2.2.0

Sqoop being installed... done

Creating directories for Sqoop... done

Creating module file for Sqoop... done

Creating configuration files for Sqoop... done

Updating images... done

Installation successfully completed

Finished
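After installation, a bulk import can be tried. The following is a sketch only: the module name, database, credentials, and table names are assumptions, while the import subcommand and its options are standard Sqoop 1.4.x usage:

Example

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://node005/testdb --username fred -P --table mytable --target-dir /user/fred/mytable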


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance 'hdfs1'

Stopping/removing services... done

Removing module file... done

Removing additional Sqoop directories... done

Updating images... done

Removal successfully completed

Finished

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation, And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t apache-storm-0.9.4.tar.gz \
--nimbus node005

Storm release '0.9.4'

Storm Nimbus and UI services will be run on node node005

Found Hadoop instance 'hdfs1', release 2.2.0

Storm being installed... done

Creating directories for Storm... done

Creating module file for Storm... done

Creating configuration files for Storm... done

Updating images... done

Initializing worker services for Storm (on DataNodes)... done

Initializing Nimbus services for Storm... done

Executing validation test... done

Installation successfully completed

Finished

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.
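The validation topology can also be killed from the command line, instead of from the Storm UI; a hypothetical session using the standard storm kill command:

Example

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm kill WordCount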

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance 'hdfs1'

Stopping/removing services... done

Removing module file... done

Removing additional Storm directories... done

Updating images... done

Removal successfully completed

Finished

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1

[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2

470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...


476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

[==================================================] 3248859 / 3248859

File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}

687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 17: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

12 Installing Hadoop

ltafsgt

bull Cloudera

ndash A subdirectory of mntlustre must be specified in thehadoop2lustreconfxml file within the ltafsgtltafsgt tagpair

ndash In addition an ltfsjargtltfsjargt tag pair must be specifiedmanually for the jar that the Intel IEEL 2x distribution pro-vides

Example

ltafsgt

ltfstypegtlustreltfstypegt

ltfsrootgtmntlustrehadoopltfsrootgt

ltfsjargtrootlustrehadoop-lustre-plugin-230jarltfsjargt

ltafsgt

The installation of the Lustre plugin is automatic if this jarname is set to the right name when the cm-hadoop-setupscript is run

Lustre Hadoop Installation With cm-hadoop-setupThe XML configuration file specifies how Lustre should be integrated inHadoop If the configuration file is at ltroothadooplustreconfigxmlgt then it can be run as

Example

cm-hadoop-setup -c ltroothadooplustreconfigxmlgt

As part of configuring Hadoop to use Lustre the execution will

bull Set the ACLs on the directory specified within the ltfsrootgtltfsrootgttag pair This was set to mntlustrehadoop earlier on as an ex-ample

bull Copy the Lustre plugin from its jar path as specified in the XML fileto the correct place on the client nodes

Specifically the subdirectory sharehadoopcommonlib iscopied into a directory relative to the Hadoop installation direc-tory For example the Cloudera version of Hadoop version 230-cdh512 has the Hadoop installation directory cmshareappshadoopClouder230-cdh512 The copy is therefore car-ried out in this case fromrootlustrehadoop-lustre-plugin-230jartocmsharedappshadoopCloudera230-cdh512sharehadoopcommonlib

copy Bright Computing Inc

25 Hadoop Installation In A Cloud 13

Lustre Hadoop Integration In cmsh and cmgui

In cmsh Lustre integration is indicated in hadoop mode

Example

[hadoop2-gthadoop] show hdfs1 | grep -i lustre

Hadoop root for Lustre mntlustrehadoop

Use Lustre yes

In cmgui the Overview tab in the items for the Hadoop HDFS re-source indicates if Lustre is running along with its mount point (fig-ure 23)

Figure 23 A Hadoop Instance With Lustre In cmgui

25 Hadoop Installation In A CloudHadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual) or a Clus-ter Extension configuration (Chapter 3 of the Cloudbursting Manual) Inboth cases the cloud nodes used should be at least m1medium

bull For Cluster-On-Demand the following considerations apply

ndash There are no specific issues After a stopstart cycle Hadooprecognizes the new IP addresses and refreshes the list of nodesaccordingly (section 241 of the Cloudbursting Manual)

bull For Cluster Extension the following considerations apply

ndash To install Hadoop on cloud nodes the XML configurationcmlocalappscluster-toolshadoopconfhadoop2clusterextensionconfxmlcan be used as a guide

copy Bright Computing Inc

14 Installing Hadoop

ndash In the hadoop2clusterextensionconfxml file the clouddirector that is to be used with the Hadoop cloud nodes mustbe specified by the administrator with the ltedgegtltedgegttag pair

Example

ltedgegt

lthostsgteu-west-1-directorlthostsgt

ltedgegt

Maintenance operations such as a format will automaticallyand transparently be carried out by cmdaemon running on thecloud director and not on the head node

There are some shortcomings as a result of relying upon the clouddirector

ndash Cloud nodes depend on the same cloud director

ndash While Hadoop installation (cm-hadoop-setup) is run on thehead node users must run Hadoop commandsmdashjob submis-sions and so onmdashfrom the director not from the head node

ndash It is not possible to mix cloud and non-cloud nodes for thesame Hadoop instance That is a local Hadoop instance can-not be extended by adding cloud nodes

copy Bright Computing Inc

3Viewing And Managing HDFS

From CMDaemon31 cmgui And The HDFS ResourceIn cmgui clicking on the resource folder for Hadoop in the resource treeopens up an Overview tab that displays a list of Hadoop HDFS instancesas in figure 21

Clicking on an instance brings up a tab from a series of tabs associatedwith that instance as in figure 31

Figure 31 Tabs For A Hadoop Instance In cmgui

The possible tabs are Overview (section 311) Settings (section 312)Tasks (section 313) and Notes (section 314) Various sub-panes aredisplayed in the possible tabs

copy Bright Computing Inc

16 Viewing And Managing HDFS From CMDaemon

311 The HDFS Instance Overview TabThe Overview tab pane (figure 32) arranges Hadoop-related items insub-panes for convenient viewing

Figure 32 Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed

bull Statistics In the first sub-pane row are statistics associated with theHadoop instance These include node data on the number of appli-cations that are running as well as memory and capacity usage

bull Metrics In the second sub-pane row a metric can be selected fordisplay as a graph

bull Roles In the third sub-pane row the roles associated with a nodeare displayed

312 The HDFS Instance Settings TabThe Settings tab pane (figure 33) displays the configuration of theHadoop instance

copy Bright Computing Inc

31 cmgui And The HDFS Resource 17

Figure 33 Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed

bull WebHDFS Allows HDFS to be modified via the web

bull Description An arbitrary user-defined description

bull Directory settings for the Hadoop instance Directories for config-uration temporary purposes and logging

bull Topology awareness setting Optimizes Hadoop for racks switchesor nothing when it comes to network topology

bull Replication factors Set the default and maximum replication al-lowed

bull Balancing settings Spread loads evenly across the instance

313 The HDFS Instance Tasks TabThe Tasks tab (figure 34) displays Hadoop-related tasks that the admin-istrator can execute

copy Bright Computing Inc

18 Viewing And Managing HDFS From CMDaemon

Figure 34 Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to

bull Restart start or stop all the Hadoop daemons

bull StartStop the Hadoop balancer

bull Format the Hadoop filesystem A warning is given before format-ting

bull Carry out a manual failover for the NameNode This feature meansmaking the other NameNode the active one and is available onlyfor Hadoop instances that support NameNode failover

314 The HDFS Instance Notes TabThe Notes tab is like a jotting pad and can be used by the administratorto write notes for the associated Hadoop instance

32 cmsh And hadoop ModeThe cmsh front end can be used instead of cmgui to display informationon Hadoop-related values and to carry out Hadoop-related tasks

321 The show And overview CommandsThe show CommandWithin hadoop mode the show command displays parameters that cor-respond mostly to cmguirsquos Settings tab in the Hadoop resource (sec-tion 312)

Example

[rootbright70 conf] cmsh

[bright70] hadoop

[bright70-gthadoop] list

Name (key) Hadoop version Hadoop distribution Configuration directory

---------- -------------- ------------------- -----------------------

Apache121 121 Apache etchadoopApache121

[bright70-gthadoop] show apache121

Parameter Value

copy Bright Computing Inc

32 cmsh And hadoop Mode 19

------------------------------------- ------------------------------------

Automatic failover Disabled

Balancer Threshold 10

Balancer period 2

Balancing policy dataNode

Cluster ID

Configuration directory etchadoopApache121

Configuration directory for HBase etchadoopApache121hbase

Configuration directory for ZooKeeper etchadoopApache121zookeeper

Creation time Fri 07 Feb 2014 170301 CET

Default replication factor 3

Enable HA for YARN no

Enable NFS gateway no

Enable Web HDFS no

HA enabled no

HA name service

HDFS Block size 67108864

HDFS Permission no

HDFS Umask 077

HDFS log directory varloghadoopApache121

Hadoop distribution Apache

Hadoop root

Hadoop version 121

Installation directory for HBase cmsharedappshadoopApachehbase-0+

Installation directory for ZooKeeper cmsharedappshadoopApachezookeepe+

Installation directory cmsharedappshadoopApache121

Maximum replication factor 512

Network

Revision

Root directory for data varlib

Temporary directory tmphadoopApache121

Topology Switch

Use HTTPS no

Use Lustre no

Use federation no

Use only HTTPS no

YARN automatic failover Hadoop

description installed from tmphadoop-121tar+

name Apache121

notes

The overview CommandSimilarly the overview command corresponds somewhat to thecmguirsquos Overview tab in the Hadoop resource (section 311) and pro-vides hadoop-related information on the system resources that are used

Example

[bright70-gthadoop] overview apache121

Parameter Value

----------------------------- ---------------------------

Name apache121

Capacity total 0B

Capacity used 0B

Capacity remaining 0B

Heap memory total 0B

copy Bright Computing Inc

20 Viewing And Managing HDFS From CMDaemon

Heap memory used 0B

Heap memory remaining 0B

Non-heap memory total 0B

Non-heap memory used 0B

Non-heap memory remaining 0B

Nodes available 0

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 0

Total blocks 0

Missing blocks 0

Under-replicated blocks 0

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 0

Applications running 0

Applications pending 0

Applications submitted 0

Applications completed 0

Applications failed 0

Federation setup no

Role Node

------------------------------ ---------------------------

DataNode ZooKeeper bright70node001node002

322 The Tasks CommandsWithin hadoopmode the following commands run tasks that correspondmostly to the Tasks tab (section 313) in the cmgui Hadoop resource

The services Commands For Hadoop ServicesHadoop services can be started stopped and restarted with

bull restartallservices

bull startallservices

bull stopallservices

Example

[bright70-gthadoop] restartallservices apache121

Will now stop all Hadoop services for instance rsquoapache121rsquo done

Will now start all Hadoop services for instance rsquoapache121rsquo done

[bright70-gthadoop]

The startbalancer And stopbalancer Commands For HadoopFor efficient access to HDFS file block level usage across nodes shouldbe reasonably balanced Hadoop can start and stop balancing across theinstance with the following commands

bull startbalancer

bull stopbalancer

The balancer policy threshold and period can be retrieved or set forthe instance using the parameters

copy Bright Computing Inc

32 cmsh And hadoop Mode 21

bull balancerperiod

bull balancerthreshold

bull balancingpolicy

Example

[bright70-gthadoop] get apache121 balancerperiod

2

[bright70-gthadoop] set apache121 balancerperiod 3

[bright70-gthadoop] commit

[bright70-gthadoop] startbalancer apache121

Code 0

Starting Hadoop balancer daemon (hadoop-apache121-balancer)starting balancer logging to varloghadoopapache121hdfshadoop-hdfs-balancer-bright70out

Time Stamp Iteration Bytes Moved Bytes To Move Bytes Being Moved

The cluster is balanced Exiting

[bright70-gthadoop]

Thu Mar 20 152702 2014 [notice] bright70 Started balancer for apache121 For details type events details 152727

The formathdfs CommandUsage

formathdfs ltHDFSgtThe formathdfs command formats an instance so that it can be reused

Existing Hadoop services for the instance are stopped first before format-ting HDFS and started again after formatting is complete

Example

[bright70-gthadoop] formathdfs apache121

Will now format and set up HDFS for instance rsquoapache121rsquo

Stopping datanodes done

Stopping namenodes done

Formatting HDFS done

Starting namenode (host rsquobright70rsquo) done

Starting datanodes done

Waiting for datanodes to come up done

Setting up HDFS done

[bright70-gthadoop]

The manualfailover CommandUsage

manualfailover [-f|--from ltNameNodegt] [-t|--to ltother Na-meNodegt] ltHDFSgt

The manualfailover command allows the active status of a Na-meNode to be moved to another NameNode in the Hadoop instance Thisis only available for Hadoop instances that support NameNode failoverThe active status can be move from andor to a NameNode

copy Bright Computing Inc

4Hadoop Services

41 Hadoop Services From cmgui

411 Configuring Hadoop Services From cmguiHadoop services can be configured on head or regular nodes

Configuration in cmgui can be carried out by selecting the node in-stance on which the service is to run and selecting the Hadoop tab ofthat node This then displays a list of possible Hadoop services withinsub-panes (figure 41)

Figure 41 Services In The Hadoop Tab Of A Node In cmgui

The services displayed are

bull Name Node Secondary Name Node

bull Data Node Zookeeper

bull Job Tracker Task Tracker

bull YARN Server YARN Client

bull HBase Server HBase Client

bull Journal

For each Hadoop service instance values can be edited via actions inthe associated pane

copy Bright Computing Inc

24 Hadoop Services

bull Selecting A Hadoop instance The name of the possible Hadoopinstances can be selected via a drop-down menu

bull Configuring multiple Hadoop instances More than one Hadoopinstance can run on the cluster Using the copy+ button adds an in-stance

bull Advanced options for instances The Advanced button allowsmany configuration options to be edited for each node instance foreach service This is illustrated by the example in figure 42 wherethe advanced configuration options of the Data Node service of thenode001 instance are shown

Figure 42 Advanced Configuration Options For The Data Node serviceIn cmgui

Typically care must be taken to use non-conflicting ports and direc-tories for each Hadoop instance

412 Monitoring And Modifying The Running Of HadoopServices From cmgui

The status of Hadoop services can be viewed and changed in the follow-ing locations

bull The Services tab of the node instance Within the Nodes or HeadNodes resource for a head or regular node instance the Servicestab can be selected The Hadoop services can then be viewed andaltered from within the display pane (figure 43)

Figure 43 Services In The Services Tab Of A Node In cmgui

copy Bright Computing Inc

41 Hadoop Services From cmgui 25

The options buttons act just like for any other service (section 311of the Administrator Manual) which means it includes the possibilityof acting on multiple service selections

bull The Tasks tab of the Hadoop instance Within the HadoopHDFS resource for a specific Hadoop instance the Tasks tab (sec-tion 313) conveniently allows all Hadoop daemons to be (re)startedand stopped directly via buttons

copy Bright Computing Inc

5Running Hadoop Jobs

51 Shakedown RunsThe cm-hadoop-testssh script is provided as part of Bright Clus-ter Managerrsquos cluster-tools package The administrator can use thescript to conveniently submit example jar files in the Hadoop installationto a job client of a Hadoop instance

[rootbright70 ~] cd cmlocalappscluster-toolshadoop

[rootbright70 hadoop] cm-hadoop-testssh ltinstancegt

The script runs endlessly and runs several Hadoop test scripts Ifmost lines in the run output are elided for brevity then the structure ofthe truncated output looks something like this in overview

Example

[rootbright70 hadoop] cm-hadoop-testssh apache220

================================================================

Press [CTRL+C] to stop

================================================================

================================================================

start cleaning directories

================================================================

================================================================

clean directories done

================================================================

================================================================

start doing gen_test

================================================================

140324 150537 INFO terasortTeraSort Generating 10000 using 2

140324 150538 INFO mapreduceJobSubmitter number of splits2

Job Counters

Map-Reduce Framework

copy Bright Computing Inc

28 Running Hadoop Jobs

orgapachehadoopexamplesterasortTeraGen$Counters

140324 150703 INFO terasortTeraSort starting

140324 150912 INFO terasortTeraSort done

================================================================================

gen_test done

================================================================================

================================================================================

start doing PI test

================================================================================

Working Directory = userrootbbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop] overview apache220

Parameter Value

------------------------------ ---------------------------------

Name Apache220

Capacity total 2756GB

Capacity used 7246MB

Capacity remaining 1641GB

Heap memory total 2807MB

Heap memory used 1521MB

Heap memory remaining 1287MB

Non-heap memory total 2581MB

Non-heap memory used 2519MB

Non-heap memory remaining 6155MB

Nodes available 3

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 72

Total blocks 31

Missing blocks 0


Under-replicated blocks 2

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 59666

Applications running 1

Applications pending 0

Applications submitted 7

Applications completed 6

Applications failed 0

High availability Yes (automatic failover disabled)

Federation setup no

Role Node

-------------------------------------------------------------- -------

DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper   node001

DataNode, Journal, NameNode, YARNClient, ZooKeeper               node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user, fred, can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]] set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.

The user fred can then submit a run from a pi value estimator from the example jar file as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
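For instance, a run along the same lines as the preceding example, but with more maps and samples, converges closer to pi. The values 4 and 1000 here are simply illustrative larger choices:

Example

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 1000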


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use the Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are by default installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option:

-p <rootpass>

is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option:

-s <heapsize>

is not mandatory. If not set, a default value of 1GB is used.

The option:

--master <nodename>

is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed
Finished

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed
Finished

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \


org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10
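The ingested table can then be checked interactively from the Accumulo shell. The following is a minimal sketch that reuses the root password and the sort table from the preceding example; the exact shell prompt shown is indicative:

Example

bash-4.1$ accumulo shell -u root -p secret
root@hdfs1> tables
root@hdfs1> scan -t sort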

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use:

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;

DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.


– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script, using hive and beeline.

An Example Run With cm-hive-setup

Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz \
--master node005

Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client
-- testing 'beeline' client
Hive setup validation... done.
Installation successfully completed
Finished


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed
Finished

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'

Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/kafka_2.11-0.8.2.1.tgz

Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed
Finished

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance hdfs1.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed
Finished
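A quick functional check of an installed Kafka instance can be carried out with the console tools shipped in the Kafka tarball. The following is a minimal sketch. It assumes the module created by the setup script is named kafka/hdfs1, and that node001 runs a Kafka broker on the default port 9092 and ZooKeeper on the default port 2181; these names and ports are illustrative assumptions:

Example

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 \
--replication-factor 1 --partitions 1 --topic test
[root@bright70 ~]# echo hello | kafka-console-producer.sh \
--broker-list node001:9092 --topic test
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 \
--topic test --from-beginning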

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/pig-0.14.0.tar.gz

Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed
Finished

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed
Finished

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig


14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
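A script such as /tmp/smoke.pig contains plain Pig Latin. A minimal sketch of such a word-count script, assuming an input file /tmp/words.txt in HDFS with one word per line, might be:

words = LOAD '/tmp/words.txt' AS (word:chararray);
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;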

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x. It is possible to force the Standalone mode deployment by using the additional option --standalone.
When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks HDP 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.
When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 \
-j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz

Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed
Finished

An Example Run With cm-spark-setup In Standalone Mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 \
-j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz \
--standalone --master node005

Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode.
Spark Master service will be run on node node005
Spark will use all DataNodes as WorkerNodes.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.


Installation successfully completed

Finished

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed
Finished

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used:

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz \
--conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005

Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed
Finished


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed
Finished
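A typical Sqoop import of a MySQL table into HDFS can then be run as in the following minimal sketch. The module name sqoop/hdfs1, the database mydb, the table mytable, and the user fred are illustrative assumptions:

Example

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://bright70.cm.cluster/mydb \
--username fred -P --table mytable --target-dir /user/fred/mytable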

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t apache-storm-0.9.4.tar.gz --nimbus node005

Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed
Finished

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.
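The validation topology can also be killed from the command line instead of from the UI, as in the following sketch, which reuses the storm/hdfs1 module from the examples in this section:

Example

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm kill WordCount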

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed
Finished

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar storm.starter.WordCountTopology WordCount2
...
470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...

476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list
...
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15



Lustre Hadoop Integration In cmsh And cmgui
In cmsh, Lustre integration is indicated in hadoop mode:

Example

[hadoop2->hadoop] show hdfs1 | grep -i lustre
Hadoop root for Lustre      /mnt/lustre/hadoop
Use Lustre                  yes

In cmgui, the Overview tab in the items for the Hadoop HDFS resource indicates if Lustre is running, along with its mount point (figure 2.3).

Figure 2.3: A Hadoop Instance With Lustre In cmgui

2.5 Hadoop Installation In A Cloud
Hadoop can make use of cloud services so that it runs as a Cluster-On-Demand configuration (Chapter 2 of the Cloudbursting Manual), or a Cluster Extension configuration (Chapter 3 of the Cloudbursting Manual). In both cases, the cloud nodes used should be at least m1.medium.

• For Cluster-On-Demand the following considerations apply:

– There are no specific issues. After a stop/start cycle, Hadoop recognizes the new IP addresses, and refreshes the list of nodes accordingly (section 2.4.1 of the Cloudbursting Manual).

• For Cluster Extension the following considerations apply:

– To install Hadoop on cloud nodes, the XML configuration /cm/local/apps/cluster-tools/hadoop/conf/hadoop2clusterextensionconf.xml can be used as a guide.


– In the hadoop2clusterextensionconf.xml file, the cloud director that is to be used with the Hadoop cloud nodes must be specified by the administrator with the <edge></edge> tag pair:

Example

<edge>
  <hosts>eu-west-1-director</hosts>
</edge>

Maintenance operations, such as a format, will automatically and transparently be carried out by cmdaemon running on the cloud director, and not on the head node.

There are some shortcomings as a result of relying upon the cloud director:

– Cloud nodes depend on the same cloud director.

– While Hadoop installation (cm-hadoop-setup) is run on the head node, users must run Hadoop commands (job submissions, and so on) from the director, not from the head node.

– It is not possible to mix cloud and non-cloud nodes for the same Hadoop instance. That is, a local Hadoop instance cannot be extended by adding cloud nodes.


3 Viewing And Managing HDFS From CMDaemon

3.1 cmgui And The HDFS Resource
In cmgui, clicking on the resource folder for Hadoop in the resource tree opens up an Overview tab that displays a list of Hadoop HDFS instances, as in figure 2.1.

Clicking on an instance brings up a tab from a series of tabs associated with that instance, as in figure 3.1.

Figure 3.1: Tabs For A Hadoop Instance In cmgui

The possible tabs are: Overview (section 3.1.1), Settings (section 3.1.2), Tasks (section 3.1.3), and Notes (section 3.1.4). Various sub-panes are displayed in the possible tabs.


3.1.1 The HDFS Instance Overview Tab
The Overview tab pane (figure 3.2) arranges Hadoop-related items in sub-panes for convenient viewing.

Figure 3.2: Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed:

• Statistics: In the first sub-pane row are statistics associated with the Hadoop instance. These include node data on the number of applications that are running, as well as memory and capacity usage.

• Metrics: In the second sub-pane row, a metric can be selected for display as a graph.

• Roles: In the third sub-pane row, the roles associated with a node are displayed.

3.1.2 The HDFS Instance Settings Tab
The Settings tab pane (figure 3.3) displays the configuration of the Hadoop instance.


Figure 3.3: Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed:

• WebHDFS: Allows HDFS to be modified via the web. A usage sketch is given after this list.

• Description: An arbitrary, user-defined description

• Directory settings for the Hadoop instance: Directories for configuration, temporary purposes, and logging

• Topology awareness setting: Optimizes Hadoop for racks, switches, or nothing, when it comes to network topology

• Replication factors: Set the default and maximum replication allowed

• Balancing settings: Spread loads evenly across the instance
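With WebHDFS enabled, HDFS can be read and modified over plain HTTP via the NameNode REST API. The following is a minimal sketch, assuming a NameNode on node001 with WebHDFS on its default port 50070, and an existing HDFS user fred; these names are illustrative assumptions:

Example

[root@bright70 ~]# curl -i "http://node001:50070/webhdfs/v1/user/fred?op=LISTSTATUS"
[root@bright70 ~]# curl -i -X PUT \
"http://node001:50070/webhdfs/v1/user/fred/newdir?op=MKDIRS&user.name=fred"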

3.1.3 The HDFS Instance Tasks Tab
The Tasks tab (figure 3.4) displays Hadoop-related tasks that the administrator can execute.


Figure 3.4: Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to:

• Restart, start, or stop all the Hadoop daemons

• Start/Stop the Hadoop balancer

• Format the Hadoop filesystem. A warning is given before formatting.

• Carry out a manual failover for the NameNode. This feature means making the other NameNode the active one, and is available only for Hadoop instances that support NameNode failover.

3.1.4 The HDFS Instance Notes Tab
The Notes tab is like a jotting pad, and can be used by the administrator to write notes for the associated Hadoop instance.

3.2 cmsh And hadoop Mode
The cmsh front end can be used instead of cmgui to display information on Hadoop-related values, and to carry out Hadoop-related tasks.

3.2.1 The show And overview Commands
The show Command
Within hadoop mode, the show command displays parameters that correspond mostly to cmgui's Settings tab in the Hadoop resource (section 3.1.2):

Example

[root@bright70 conf]# cmsh
[bright70] hadoop
[bright70->hadoop] list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Apache121   1.2.1           Apache               /etc/hadoop/Apache121
[bright70->hadoop] show apache121
Parameter                               Value
--------------------------------------  ------------------------------------
Automatic failover                      Disabled
Balancer Threshold                      10
Balancer period                         2
Balancing policy                        dataNode
Cluster ID
Configuration directory                 /etc/hadoop/Apache121
Configuration directory for HBase       /etc/hadoop/Apache121/hbase
Configuration directory for ZooKeeper   /etc/hadoop/Apache121/zookeeper
Creation time                           Fri, 07 Feb 2014 17:03:01 CET
Default replication factor              3
Enable HA for YARN                      no
Enable NFS gateway                      no
Enable Web HDFS                         no
HA enabled                              no
HA name service
HDFS Block size                         67108864
HDFS Permission                         no
HDFS Umask                              077
HDFS log directory                      /var/log/hadoop/Apache121
Hadoop distribution                     Apache
Hadoop root
Hadoop version                          1.2.1
Installation directory for HBase        /cm/shared/apps/hadoop/Apache/hbase-0+
Installation directory for ZooKeeper    /cm/shared/apps/hadoop/Apache/zookeepe+
Installation directory                  /cm/shared/apps/hadoop/Apache121
Maximum replication factor              512
Network
Revision
Root directory for data                 /var/lib
Temporary directory                     /tmp/hadoop/Apache121
Topology                                Switch
Use HTTPS                               no
Use Lustre                              no
Use federation                          no
Use only HTTPS                          no
YARN automatic failover                 Hadoop
description                             installed from: /tmp/hadoop-1.2.1.tar+
name                                    Apache121
notes

The overview Command
Similarly, the overview command corresponds somewhat to cmgui's Overview tab in the Hadoop resource (section 3.1.1), and provides Hadoop-related information on the system resources that are used:

Example

[bright70->hadoop] overview apache121
Parameter                       Value
------------------------------  ---------------------------
Name                            apache121
Capacity total                  0B
Capacity used                   0B
Capacity remaining              0B
Heap memory total               0B
Heap memory used                0B
Heap memory remaining           0B
Non-heap memory total           0B
Non-heap memory used            0B
Non-heap memory remaining       0B
Nodes available                 0
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     0
Total blocks                    0
Missing blocks                  0
Under-replicated blocks         0
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       0
Applications running            0
Applications pending            0
Applications submitted          0
Applications completed          0
Applications failed             0
Federation setup                no

Role                            Node
------------------------------  ---------------------------
DataNode, ZooKeeper             bright70, node001, node002

3.2.2 The Tasks Commands
Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The *services Commands For Hadoop Services
Hadoop services can be started, stopped, and restarted, with:

• restartallservices

• startallservices

• stopallservices

Example

[bright70->hadoop] restartallservices apache121
Will now stop all Hadoop services for instance 'apache121'... done.
Will now start all Hadoop services for instance 'apache121'... done.
[bright70->hadoop]

The startbalancer And stopbalancer Commands For Hadoop
For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer

• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:


• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop] get apache121 balancerperiod
2
[bright70->hadoop] set apache121 balancerperiod 3
[bright70->hadoop] commit
[bright70->hadoop] startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp  Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]
Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121
For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop] formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
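For example, a failover can be triggered for an HA-enabled instance without naming the NameNodes explicitly, letting CMDaemon pick them, as in the following sketch (the instance name apache220 is an assumption):

Example

[bright70->hadoop] manualfailover apache220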


4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, ZooKeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:


• Selecting A Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

bull Advanced options for instances The Advanced button allowsmany configuration options to be edited for each node instance foreach service This is illustrated by the example in figure 42 wherethe advanced configuration options of the Data Node service of thenode001 instance are shown

Figure 42 Advanced Configuration Options For The Data Node serviceIn cmgui

Typically care must be taken to use non-conflicting ports and direc-tories for each Hadoop instance

412 Monitoring And Modifying The Running Of HadoopServices From cmgui

The status of Hadoop services can be viewed and changed in the follow-ing locations

bull The Services tab of the node instance Within the Nodes or HeadNodes resource for a head or regular node instance the Servicestab can be selected The Hadoop services can then be viewed andaltered from within the display pane (figure 43)

Figure 43 Services In The Services Tab Of A Node In cmgui

copy Bright Computing Inc

41 Hadoop Services From cmgui 25

The options buttons act just like for any other service (section 311of the Administrator Manual) which means it includes the possibilityof acting on multiple service selections

bull The Tasks tab of the Hadoop instance Within the HadoopHDFS resource for a specific Hadoop instance the Tasks tab (sec-tion 313) conveniently allows all Hadoop daemons to be (re)startedand stopped directly via buttons

copy Bright Computing Inc

5Running Hadoop Jobs

51 Shakedown RunsThe cm-hadoop-testssh script is provided as part of Bright Clus-ter Managerrsquos cluster-tools package The administrator can use thescript to conveniently submit example jar files in the Hadoop installationto a job client of a Hadoop instance

[rootbright70 ~] cd cmlocalappscluster-toolshadoop

[rootbright70 hadoop] cm-hadoop-testssh ltinstancegt

The script runs endlessly and runs several Hadoop test scripts Ifmost lines in the run output are elided for brevity then the structure ofthe truncated output looks something like this in overview

Example

[rootbright70 hadoop] cm-hadoop-testssh apache220

================================================================

Press [CTRL+C] to stop

================================================================

================================================================

start cleaning directories

================================================================

================================================================

clean directories done

================================================================

================================================================

start doing gen_test

================================================================

140324 150537 INFO terasortTeraSort Generating 10000 using 2

140324 150538 INFO mapreduceJobSubmitter number of splits2

Job Counters

Map-Reduce Framework

copy Bright Computing Inc

28 Running Hadoop Jobs

orgapachehadoopexamplesterasortTeraGen$Counters

140324 150703 INFO terasortTeraSort starting

140324 150912 INFO terasortTeraSort done

================================================================================

gen_test done

================================================================================

================================================================================

start doing PI test

================================================================================

Working Directory = userrootbbp

During the run the Overview tab in cmgui (introduced in section 311)for the Hadoop instance should show activity as it refreshes its overviewevery three minutes (figure 51)

Figure 51 Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh the overview command shows the most recent values thatcan be retrieved when the command is run

[mk-hadoop-centos6-gthadoop] overview apache220

Parameter Value

------------------------------ ---------------------------------

Name Apache220

Capacity total 2756GB

Capacity used 7246MB

Capacity remaining 1641GB

Heap memory total 2807MB

Heap memory used 1521MB

Heap memory remaining 1287MB

Non-heap memory total 2581MB

Non-heap memory used 2519MB

Non-heap memory remaining 6155MB

Nodes available 3

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 72

Total blocks 31

Missing blocks 0

copy Bright Computing Inc

52 Example End User Job Run 29

Under-replicated blocks 2

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 59666

Applications running 1

Applications pending 0

Applications submitted 7

Applications completed 6

Applications failed 0

High availability Yes (automatic failover disabled)

Federation setup no

Role Node

-------------------------------------------------------------- -------

DataNode Journal NameNode YARNClient YARNServer ZooKeeper node001

DataNode Journal NameNode YARNClient ZooKeeper node002

52 Example End User Job RunRunning a job from a jar file individually can be done by an end user

An end user fred can be created and issued a password by the ad-ministrator (Chapter 6 of the Administrator Manual) The user must thenbe granted HDFS access for the Hadoop instance by the administrator

Example

[bright70-gtuser[fred]] set hadoophdfsaccess apache220 commit

The possible instance options are shown as tab-completion suggestionsThe access can be unset by leaving a blank for the instance option

The user fred can then submit a run from a pi value estimator fromthe example jar file as follows (some output elided)

Example

[fredbright70 ~]$ module add hadoopApache220Apache220

[fredbright70 ~]$ hadoop jar $HADOOP_PREFIXsharehadoopmapreducehadoop-mapreduce-examples-220jar pi 1 5

Job Finished in 19732 seconds

Estimated value of Pi is 400000000000000000000

The module add line is not needed if the user has the module loadedby default (section 223 of the Administrator Manual)

The input takes the number of maps and number of samples as optionsmdash1 and 5 in the example The result can be improved with greater valuesfor both

copy Bright Computing Inc

6Hadoop-related Projects

Several projects use the Hadoop framework These projects may befocused on data warehousing data-flow programming or other data-processing tasks which Hadoop can handle well Bright Cluster Managerprovides tools to help install the following projects

bull Accumulo (section 61)

bull Hive (section 62)

bull Kafka (section 63)

bull Pig (section 64)

bull Spark (section 65)

bull Sqoop (section 66)

bull Storm (section 67)

61 AccumuloApache Accumulo is a highly-scalable structured distributed key-valuestore based on Googlersquos BigTable Accumulo works on top of Hadoop andZooKeeper Accumulo stores data in HDFS and uses a richer model thanregular key-value stores Keys in Accumulo consist of several elements

An Accumulo instance includes the following main components

bull Tablet Server which manages subsets of all tables

bull Garbage Collector to delete files no longer needed

bull Master responsible of coordination

bull Tracer collection traces about Accumulo operations

bull Monitor web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo ap-plications

The Apache Accumulo tarball can be downloaded from httpaccumuloapacheorg For Hortonworks HDP 21x the Accumulotarball can be downloaded from the Hortonworks website (section 12)

copy Bright Computing Inc

32 Hadoop-related Projects

611 Accumulo Installation With cm-accumulo-setupBright Cluster Manager provides cm-accumulo-setup to carry out theinstallation of Accumulo

Prerequisites For Accumulo Installation And What Accumulo Installa-tion DoesThe following applies to using cm-accumulo-setup

bull A Hadoop instance with ZooKeeper must already be installed

bull Hadoop can be configured with a single NameNode or NameNodeHA but not with NameNode federation

bull The cm-accumulo-setup script only installs Accumulo on the ac-tive head node and on the DataNodes of the chosen Hadoop in-stance

bull The script assigns no roles to nodes

bull Accumulo executables are copied by the script to a subdirectory un-der cmsharedhadoop

bull Accumulo configuration files are copied by the script to underetchadoop This is done both on the active headnode andon the necessary image(s)

bull By default Accumulo Tablet Servers are set to use 1GB of memoryA different value can be set via cm-accumulo-setup

bull The secret string for the instance is a random string created bycm-accumulo-setup

bull A password for the root user must be specified

bull The Tracer service will use Accumulo user root to connect to Ac-cumulo

bull The services for Garbage Collector Master Tracer and Monitor areby default installed and run on the headnode They can be installedand run on another node instead as shown in the next exampleusing the --master option

bull A Tablet Server will be started on each DataNode

bull cm-accumulo-setup tries to build the native map library

bull Validation tests are carried out by the script

The options for cm-accumulo-setup are listed on runningcm-accumulo-setup -h

An Example Run With cm-accumulo-setup

The option-p ltrootpassgt

is mandatory The specified password will also be used by the Tracerservice to connect to Accumulo The password will be stored inaccumulo-sitexml with read and write permissions assigned toroot only

copy Bright Computing Inc

61 Accumulo 33

The option-s ltheapsizegt

is not mandatory If not set a default value of 1GB is used

The option--master ltnodenamegt

is not mandatory It is used to set the node on which the Garbage Collec-tor Master Tracer and Monitor services run If not set then these servicesare run on the head node by default

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup

cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example

Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10
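After the ingest completes, the new table can be inspected from the Accumulo shell. The following is a minimal sketch rather than part of the example above; it assumes the root password secret used in the ingest command, and the shell prompt shown depends on the Accumulo instance name:

bash-4.1$ accumulo shell -u root -p secret
root@hdfs1> tables
root@hdfs1> scan -t sort
root@hdfs1> exit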

6.2 Hive

Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.
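As a quick illustration of HiveQL, a hypothetical hive client session might create and query a small table as follows; the table name, columns, and input file are illustrative assumptions, not part of the setup procedure:

hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> LOAD DATA LOCAL INPATH '/tmp/pokes.txt' INTO TABLE pokes;
hive> SELECT COUNT(*) FROM pokes;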

6.2.1 Hive Installation With cm-hive-setup

Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation And What Hive Installation Does

The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked (a quick way to check the installed release is sketched after this list). Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use.

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;

DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.


– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script, using hive and beeline.
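On RPM-based distributions, the installed Connector/J release can be checked with the package manager before running the script; a minimal sketch:

[root@bright70 ~]# rpm -q mysql-connector-java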

An Example Run With cm-hive-setup

Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <hivepass> \
--metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation:
-- testing 'hive' client... done.
-- testing 'beeline' client... done.
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup

cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline

The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka

Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup

Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does

The following applies to using cm-kafka-setup:


• A Hadoop instance with ZooKeeper must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
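Once the installation has completed, a quick functional check can be carried out with the console producer and consumer scripts shipped with Kafka. The following is a minimal sketch rather than part of the setup procedure; the module name kafka/hdfs1, the ZooKeeper node node001, the default ports 2181 and 9092, and the topic name test are illustrative assumptions, and the Kafka bin directory is assumed to be on the PATH after the module is loaded:

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 \
--replication-factor 1 --partitions 1 --topic test
[root@bright70 ~]# echo "hello" | kafka-console-producer.sh \
--broker-list node001:9092 --topic test
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 \
--topic test --from-beginning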

6.3.2 Kafka Removal With cm-kafka-setup

cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig

Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup

Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation And What Pig Installation Does

The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready... done.
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup

cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig

Pig consists of an executable, pig, which can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
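The smoke.pig script referenced above is not supplied with the installation; as an illustration only, a minimal Pig Latin script of that kind could count words in an HDFS file. The file name and its contents here are hypothetical:

-- /tmp/smoke.pig: hypothetical word-count smoke test
lines = LOAD '/user/root/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;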

6.5 Spark

Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup

Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does

The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: this is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: this is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release 2.6.0
Spark will be installed in YARN (client/cluster) mode.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup in standalone mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release 2.6.0
Spark will be installed to work in Standalone mode.
Spark Master service will be run on node node005.
Spark will use all DataNodes as WorkerNodes.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup

cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark

Spark supports two deploy modes for launching Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar
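In yarn-cluster mode, the driver output, including the computed value of Pi, ends up in the ApplicationMaster logs rather than on the terminal. A way to retrieve it, sketched here under the assumption that YARN log aggregation is enabled, is to pass the application ID printed by spark-submit to yarn logs:

[root@bright70 ~]# yarn logs -applicationId <application ID> | grep "Pi is roughly"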

6.6 Sqoop

Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup

Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation And What Sqoop Installation Does

The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz \
--conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
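After installation, a table can be imported from MySQL into HDFS with sqoop import. The following is a minimal sketch rather than part of the setup procedure; the module name sqoop/hdfs1, the database mydb, the table mytable, and the target directory are illustrative assumptions:

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://bright70.cm.cluster/mydb \
--username <user> -P --table mytable --target-dir /user/root/mytable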


6.6.2 Sqoop Removal With cm-sqoop-setup

cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm

Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemon to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup

Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does

The following applies to using cm-storm-setup:

• A Hadoop instance with ZooKeeper must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.
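The same check and cleanup can also be carried out from the command line. A minimal sketch, assuming the module name storm/hdfs1 used elsewhere in this chapter:

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm list
[root@bright70 ~]# storm kill WordCount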

6.7.2 Storm Removal With cm-storm-setup

The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm

The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/\
storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2
470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/\
hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar \
to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/\
storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-\
f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-\
topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-\
41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to \
assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed \
mode with conf {"topology.workers":3,"topology.debug":true}
687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2
[root@hadoopdev ~]# storm list
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15



61 AccumuloApache Accumulo is a highly-scalable structured distributed key-valuestore based on Googlersquos BigTable Accumulo works on top of Hadoop andZooKeeper Accumulo stores data in HDFS and uses a richer model thanregular key-value stores Keys in Accumulo consist of several elements

An Accumulo instance includes the following main components

bull Tablet Server which manages subsets of all tables

bull Garbage Collector to delete files no longer needed

bull Master responsible of coordination

bull Tracer collection traces about Accumulo operations

bull Monitor web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo ap-plications

The Apache Accumulo tarball can be downloaded from httpaccumuloapacheorg For Hortonworks HDP 21x the Accumulotarball can be downloaded from the Hortonworks website (section 12)

copy Bright Computing Inc

32 Hadoop-related Projects

611 Accumulo Installation With cm-accumulo-setupBright Cluster Manager provides cm-accumulo-setup to carry out theinstallation of Accumulo

Prerequisites For Accumulo Installation And What Accumulo Installa-tion DoesThe following applies to using cm-accumulo-setup

bull A Hadoop instance with ZooKeeper must already be installed

bull Hadoop can be configured with a single NameNode or NameNodeHA but not with NameNode federation

bull The cm-accumulo-setup script only installs Accumulo on the ac-tive head node and on the DataNodes of the chosen Hadoop in-stance

bull The script assigns no roles to nodes

bull Accumulo executables are copied by the script to a subdirectory un-der cmsharedhadoop

bull Accumulo configuration files are copied by the script to underetchadoop This is done both on the active headnode andon the necessary image(s)

bull By default Accumulo Tablet Servers are set to use 1GB of memoryA different value can be set via cm-accumulo-setup

bull The secret string for the instance is a random string created bycm-accumulo-setup

bull A password for the root user must be specified

bull The Tracer service will use Accumulo user root to connect to Ac-cumulo

bull The services for Garbage Collector Master Tracer and Monitor areby default installed and run on the headnode They can be installedand run on another node instead as shown in the next exampleusing the --master option

bull A Tablet Server will be started on each DataNode

bull cm-accumulo-setup tries to build the native map library

bull Validation tests are carried out by the script

The options for cm-accumulo-setup are listed on runningcm-accumulo-setup -h

An Example Run With cm-accumulo-setup

The option-p ltrootpassgt

is mandatory The specified password will also be used by the Tracerservice to connect to Accumulo The password will be stored inaccumulo-sitexml with read and write permissions assigned toroot only

copy Bright Computing Inc

61 Accumulo 33

The option-s ltheapsizegt

is not mandatory If not set a default value of 1GB is used

The option--master ltnodenamegt

is not mandatory It is used to set the node on which the Garbage Collec-tor Master Tracer and Monitor services run If not set then these servicesare run on the head node by default

Example

[rootbright70 ~] cm-accumulo-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -p ltrootpassgt -s ltheapsizegt -t tmpaccumulo-162-bintargz --master node005

Accumulo release rsquo162rsquo

Accumulo GC Master Monitor and Tracer services will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 260

Accumulo being installed done

Creating directories for Accumulo done

Creating module file for Accumulo done

Creating configuration files for Accumulo done

Updating images done

Setting up Accumulo directories in HDFS done

Executing rsquoaccumulo initrsquo done

Initializing services for Accumulo (on DataNodes) done

Initializing master services for Accumulo done

Waiting for NameNode to be ready done

Executing validation test done

Installation successfully completed

Finished

612 Accumulo Removal With cm-accumulo-setupcm-accumulo-setup should also be used to remove the Accumulo in-stance Data and metadata will not be removed

Example

[rootbright70 ~] cm-accumulo-setup -u hdfs1

Requested removal of Accumulo for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Accumulo directories done

Updating images done

Removal successfully completed

Finished

613 Accumulo MapReduce ExampleAccumulo jobs must be run using accumulo system user

Example

[rootbright70 ~] su - accumulo

bash-41$ module load accumulohdfs1

bash-41$ cd $ACCUMULO_HOME

bash-41$ bintoolsh libaccumulo-examples-simplejar

copy Bright Computing Inc

34 Hadoop-related Projects

orgapacheaccumuloexamplessimplemapreduceTeraSortIngest -i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 --maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

62 HiveApache Hive is a data warehouse software It stores its data using HDFSand can query it via the SQL-like HiveQL language Metadata values forits tables and partitions are kept in the Hive Metastore which is an SQLdatabase typically MySQL or Postgres Data can be exposed to clientsusing the following client-server mechanisms

bull Metastore accessed with the hive client

bull HiveServer2 accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locationsspecified in Section 12 depending on the chosen distribution

621 Hive Installation With cm-hive-setupBright Cluster Manager provides cm-hive-setup to carry out Hive in-stallation

Prerequisites For Hive Installation And What Hive Installation DoesThe following applies to using cm-hive-setup

bull A Hadoop instance must already be installed

bull Before running the script the version of themysql-connector-java package should be checked Hiveworks with releases 5118 or earlier of this package Ifmysql-connector-java provides a newer release then thefollowing must be done to ensure that Hive setup works

ndash a suitable 5118 or earlier release of ConnectorJ isdownloaded from httpdevmysqlcomdownloadsconnectorj

ndash cm-hive-setup is run with the --conn option to specify theconnector version to use

Example

--conn tmpmysql-connector-java-5118-binjar

bull Before running the script the following statements must be exe-cuted explicitly by the administrator using a MySQL client

GRANT ALL PRIVILEGES ON ltmetastoredbgt TO rsquohiversquorsquorsquoIDENTIFIED BY rsquolthivepassgtrsquoFLUSH PRIVILEGES

DROP DATABASE IF EXISTS ltmetastoredbgt

In the preceding statements

ndash ltmetastoredbgt is the name of metastore database to be used byHive The same name is used later by cm-hive-setup

copy Bright Computing Inc

62 Hive 35

ndash lthivepassgt is the password for hive user The same passwordis used later by cm-hive-setup

ndash The DROP line is needed only if a database with that namealready exists

bull The cm-hive-setup script installs Hive by default on the activehead node

It can be installed on another node instead as shown in the nextexample with the use of the --master option In that case Con-nectorJ should be installed in the software image of the node

bull The script assigns no roles to nodes

bull Hive executables are copied by the script to a subdirectory undercmsharedhadoop

bull Hive configuration files are copied by the script to under etchadoop

bull The instance of MySQL on the head node is initialized as theMetastore database for the Bright Cluster Manager by the scriptA different MySQL server can be specified by using the options--mysqlserver and --mysqlport

bull The data warehouse is created by the script in HDFS in userhivewarehouse

bull The Metastore and HiveServer2 services are started up by the script

bull Validation tests are carried out by the script using hive andbeeline

An Example Run With cm-hive-setupExample

[rootbright70 ~] cm-hive-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -p lthivepassgt --metastoredb ltmetastoredbgt -t tmpapache-hive-110-bintargz --master node005

Hive release rsquo110-binrsquo

Using MySQL server on active headnode

Hive service will be run on node node005

Using MySQL ConnectorJ installed in usrsharejava

Hive being installed done

Creating directories for Hive done

Creating module file for Hive done

Creating configuration files for Hive done

Initializing database rsquometastore_hdfs1rsquo in MySQL done

Waiting for NameNode to be ready done

Creating HDFS directories for Hive done

Updating images done

Waiting for NameNode to be ready done

Hive setup validation

-- testing rsquohiversquo client

-- testing rsquobeelinersquo client

Hive setup validation done

Installation successfully completed

Finished

copy Bright Computing Inc

36 Hadoop-related Projects

622 Hive Removal With cm-hive-setupcm-hive-setup should also be used to remove the Hive instance Dataand metadata will not be removed

Example

[rootbright70 ~] cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Hive directories done

Updating images done

Removal successfully completed

Finished

623 BeelineThe latest Hive releases include HiveServer2 which supports Beeline com-mand shell Beeline is a JDBC client based on the SQLLine CLI (httpsqllinesourceforgenet) In the following example Beelineconnects to HiveServer2

Example

[rootbright70 ~] beeline -u jdbchive2node005cmcluster10000 -d orgapachehivejdbcHiveDriver -e rsquoSHOW TABLESrsquo

Connecting to jdbchive2node005cmcluster10000

Connected to Apache Hive (version 110)

Driver Hive JDBC (version 110)

Transaction isolation TRANSACTION_REPEATABLE_READ

+-----------+--+

| tab_name |

+-----------+--+

| test |

| test2 |

+-----------+--+

2 rows selected (0243 seconds)

Beeline version 110 by Apache Hive

Closing 0 jdbchive2node005cmcluster10000

63 KafkaApache Kafka is a distributed publish-subscribe messaging system Amongother usages Kafka is used as a replacement for message broker forwebsite activity tracking for log aggregation The Apache Kafka tar-ball should be downloaded from httpkafkaapacheorg wheredifferent pre-built tarballs are available depeding on the preferred Scalaversion

631 Kafka Installation With cm-kafka-setupBright Cluster Manager provides cm-kafka-setup to carry out Kafkainstallation

Prerequisites For Kafka Installation And What Kafka Installation DoesThe following applies to using cm-kafka-setup

copy Bright Computing Inc

64 Pig 37

bull A Hadoop instance with ZooKeeper must already be installed

bull cm-kafka-setup installs Kafka only on the ZooKeeper nodes

bull The script assigns no roles to nodes

bull Kafka is copied by the script to a subdirectory under cmsharedhadoop

bull Kafka configuration files are copied by the script to under etchadoop

An Example Run With cm-kafka-setup

Example

[rootbright70 ~] cm-kafka-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpkafka_211-0821tgz

Kafka release rsquo0821rsquo for Scala rsquo211rsquo

Found Hadoop instance rsquohdfs1rsquo release 121

Kafka being installed done

Creating directories for Kafka done

Creating module file for Kafka done

Creating configuration files for Kafka done

Updating images done

Initializing services for Kafka (on ZooKeeper nodes) done

Executing validation test done

Installation successfully completed

Finished

632 Kafka Removal With cm-kafka-setupcm-kafka-setup should also be used to remove the Kafka instance

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'
Stopping/removing services... done
Removing module file... done
Removing additional Kafka directories... done
Updating images... done
Removal successfully completed
Finished

6.4 Pig

Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended, by language design, to fit well with “embarrassingly parallel” problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup

Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.

Prerequisites For Pig Installation And What Pig Installation Does

The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done
Creating directories for Pig... done
Creating module file for Pig... done
Creating configuration files for Pig... done
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done
Validating Pig setup...
Validating Pig setup... done
Installation successfully completed
Finished

6.4.2 Pig Removal With cm-pig-setup

cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'
Stopping/removing services... done
Removing module file... done
Removing additional Pig directories... done
Updating images... done
Removal successfully completed
Finished

6.4.3 Using Pig

Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in “MapReduce Mode”, that is, it uses the corresponding HDFS installation to store and process data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases Pig runs in “MapReduce mode”, thus working on the corresponding HDFS instance.
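A Pig Latin script is a plain text file. The following word-count script is illustrative only, and is not the /tmp/smoke.pig used above; it assumes that a text file has already been uploaded to HDFS as /user/root/input.txt.

Example

[root@bright70 ~]# cat > /tmp/wordcount.pig <<'EOF'
lines  = LOAD '/user/root/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
DUMP counts;
EOF
[root@bright70 ~]# pig -f /tmp/wordcount.pig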

6.5 Spark

Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup

Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does

The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

  – Standalone mode: this is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

    It is possible to force Standalone mode deployment by using the additional option --standalone.

    When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.

    The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

  – YARN mode: this is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

    When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode
Spark being installed... done
Creating directories for Spark... done
Creating module file for Spark... done
Creating configuration files for Spark... done
Waiting for NameNode to be ready... done
Copying Spark assembly jar to HDFS... done
Waiting for NameNode to be ready... done
Validating Spark setup... done
Installation successfully completed
Finished

An Example Run With cm-spark-setup in standalone mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode
Spark Master service will be run on node node005
Spark will use all DataNodes as WorkerNodes
Spark being installed... done
Creating directories for Spark... done
Creating module file for Spark... done
Creating configuration files for Spark... done
Updating images... done
Initializing Spark Master service... done
Initializing Spark Worker service... done
Validating Spark setup... done

Installation successfully completed
Finished

6.5.2 Spark Removal With cm-spark-setup

cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'
Stopping/removing services... done
Removing module file... done
Removing additional Spark directories... done
Removal successfully completed
Finished

6.5.3 Using Spark

Spark supports two deploy modes for launching Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In “yarn-client” mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

• In “yarn-cluster” mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar
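In “yarn-cluster” mode the driver output, including the value of Pi computed by SparkPi, ends up in the YARN container logs rather than on the terminal. A submitted application can be tracked, and its aggregated logs retrieved, with the standard YARN client tools. The following sketch assumes that log aggregation is enabled for the instance; the application ID shown is a placeholder.

Example

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# yarn application -list
[root@bright70 ~]# yarn logs -applicationId application_1423000000000_0001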

6.6 Sqoop

Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop: it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup

Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.

Prerequisites For Sqoop Installation And What Sqoop Installation Does

The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

  – a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

  – cm-sqoop-setup is run with the --conn option, in order to specify the connector version to be used:

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz \
--conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done
Creating directories for Sqoop... done
Creating module file for Sqoop... done
Creating configuration files for Sqoop... done
Updating images... done
Installation successfully completed
Finished
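After installation, a table can be imported from MySQL into HDFS with the sqoop import tool. The following sketch uses placeholder names throughout: a MySQL database mydb, a table mytable, and a database user dbuser with read privileges on that table are assumed to exist.

Example

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://bright70.cm.cluster/mydb \
--username dbuser -P --table mytable --target-dir /user/root/mytable -m 1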

6.6.2 Sqoop Removal With cm-sqoop-setup

cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'
Stopping/removing services... done
Removing module file... done
Removing additional Sqoop directories... done
Updating images... done
Removal successfully completed
Finished

6.7 Storm

Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run “jobs” in Hadoop, and “topologies” in Storm

• the master node for Hadoop jobs runs the “JobTracker” or “ResourceManager” daemon to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called “Nimbus”

• each worker node in Hadoop runs a daemon called “TaskTracker” or “NodeManager”, while each worker node in Storm runs an analogous daemon called “Supervisor”

• both Hadoop, in the case of NameNode HA, and Storm leverage “ZooKeeper” for coordination

6.7.1 Storm Installation With cm-storm-setup

Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does

The following applies to using cm-storm-setup:

• A Hadoop instance with ZooKeeper must already be installed.

• By default, the cm-storm-setup script installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node and on the necessary image(s).

• Validation tests are carried out by the script.

An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t apache-storm-0.9.4.tar.gz --nimbus node005

Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done
Creating directories for Storm... done
Creating module file for Storm... done
Creating configuration files for Storm... done
Updating images... done
Initializing worker services for Storm (on DataNodes)... done
Initializing Nimbus services for Storm... done
Executing validation test... done
Installation successfully completed
Finished

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup

The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'
Stopping/removing services... done
Removing module file... done
Removing additional Storm directories... done
Updating images... done
Removal successfully completed
Finished

6.7.3 Using Storm

The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar \
storm.starter.WordCountTopology WordCount2

470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2
[root@hadoopdev ~]# storm list
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15
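A running topology can also be stopped from the command line, instead of from the Storm UI. The storm kill command takes the topology name, as shown by storm list:

Example

[root@bright70 ~]# storm kill WordCount2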



622 Hive Removal With cm-hive-setupcm-hive-setup should also be used to remove the Hive instance Dataand metadata will not be removed

Example

[rootbright70 ~] cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Hive directories done

Updating images done

Removal successfully completed

Finished

623 BeelineThe latest Hive releases include HiveServer2 which supports Beeline com-mand shell Beeline is a JDBC client based on the SQLLine CLI (httpsqllinesourceforgenet) In the following example Beelineconnects to HiveServer2

Example

[rootbright70 ~] beeline -u jdbchive2node005cmcluster10000 -d orgapachehivejdbcHiveDriver -e rsquoSHOW TABLESrsquo

Connecting to jdbchive2node005cmcluster10000

Connected to Apache Hive (version 110)

Driver Hive JDBC (version 110)

Transaction isolation TRANSACTION_REPEATABLE_READ

+-----------+--+

| tab_name |

+-----------+--+

| test |

| test2 |

+-----------+--+

2 rows selected (0243 seconds)

Beeline version 110 by Apache Hive

Closing 0 jdbchive2node005cmcluster10000

63 KafkaApache Kafka is a distributed publish-subscribe messaging system Amongother usages Kafka is used as a replacement for message broker forwebsite activity tracking for log aggregation The Apache Kafka tar-ball should be downloaded from httpkafkaapacheorg wheredifferent pre-built tarballs are available depeding on the preferred Scalaversion

631 Kafka Installation With cm-kafka-setupBright Cluster Manager provides cm-kafka-setup to carry out Kafkainstallation

Prerequisites For Kafka Installation And What Kafka Installation DoesThe following applies to using cm-kafka-setup

copy Bright Computing Inc

64 Pig 37

bull A Hadoop instance with ZooKeeper must already be installed

bull cm-kafka-setup installs Kafka only on the ZooKeeper nodes

bull The script assigns no roles to nodes

bull Kafka is copied by the script to a subdirectory under cmsharedhadoop

bull Kafka configuration files are copied by the script to under etchadoop

An Example Run With cm-kafka-setup

Example

[rootbright70 ~] cm-kafka-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpkafka_211-0821tgz

Kafka release rsquo0821rsquo for Scala rsquo211rsquo

Found Hadoop instance rsquohdfs1rsquo release 121

Kafka being installed done

Creating directories for Kafka done

Creating module file for Kafka done

Creating configuration files for Kafka done

Updating images done

Initializing services for Kafka (on ZooKeeper nodes) done

Executing validation test done

Installation successfully completed

Finished

632 Kafka Removal With cm-kafka-setupcm-kafka-setup should also be used to remove the Kafka instance

Example

[rootbright70 ~] cm-kafka-setup -u hdfs1

Requested removal of Kafka for Hadoop instance hdfs1

Stoppingremoving services done

Removing module file done

Removing additional Kafka directories done

Updating images done

Removal successfully completed

Finished

64 PigApache Pig is a platform for analyzing large data sets Pig consists of ahigh-level language for expressing data analysis programs coupled withinfrastructure for evaluating these programs Pig programs are intendedby language design to fit well with ldquoembarrassingly parallelrdquo problemsthat deal with large data sets The Apache Pig tarball should be down-loaded from one of the locations specified in Section 12 depending onthe chosen distribution

641 Pig Installation With cm-pig-setupBright Cluster Manager provides cm-pig-setup to carry out Pig instal-lation

copy Bright Computing Inc

38 Hadoop-related Projects

Prerequisites For Pig Installation And What Pig Installation DoesThe following applies to using cm-pig-setup

bull A Hadoop instance must already be installed

bull cm-pig-setup installs Pig only on the active head node

bull The script assigns no roles to nodes

bull Pig is copied by the script to a subdirectory under cmsharedhadoop

bull Pig configuration files are copied by the script to under etchadoop

An Example Run With cm-pig-setup

Example

[rootbright70 ~] cm-pig-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmppig-0140targz

Pig release rsquo0140rsquo

Pig being installed done

Creating directories for Pig done

Creating module file for Pig done

Creating configuration files for Pig done

Waiting for NameNode to be ready

Waiting for NameNode to be ready done

Validating Pig setup

Validating Pig setup done

Installation successfully completed

Finished

642 Pig Removal With cm-pig-setupcm-pig-setup should also be used to remove the Pig instance

Example

[rootbright70 ~] cm-pig-setup -u hdfs1

Requested removal of Pig for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Pig directories done

Updating images done

Removal successfully completed

Finished

643 Using PigPig consists of an executable pig that can be run after the user loadsthe corresponding module Pig runs by default in ldquoMapReduce Moderdquothat is it uses the corresponding HDFS installation to store and deal withthe elaborate processing of data More thorough documentation for Pigcan be found at httppigapacheorgdocsr0140starthtml

Pig can be used in interactive mode using the Grunt shell

[rootbright70 ~] module load hadoophdfs1

[rootbright70 ~] module load pighdfs1

[rootbright70 ~] pig

copy Bright Computing Inc

65 Spark 39

140826 115741 INFO pigExecTypeProvider Trying ExecType LOCAL

140826 115741 INFO pigExecTypeProvider Trying ExecType MAPREDUCE

140826 115741 INFO pigExecTypeProvider Picked MAPREDUCE as the ExecType

gruntgt

or in batch mode using a Pig Latin script

[rootbright70 ~] module load hadoophdfs1

[rootbright70 ~] module load pighdfs1

[rootbright70 ~] pig -v -f tmpsmokepig

In both cases Pig runs in ldquoMapReduce moderdquo thus working on thecorresponding HDFS instance

65 SparkApache Spark is an engine for processing Hadoop data It can carry outgeneral data processing similar to MapReduce but typically faster

Spark can also carry out the following with the associated high-leveltools

bull stream feed processing with Spark Streaming

bull SQL queries on structured distributed data with Spark SQL

bull processing with machine learning algorithms using MLlib

bull graph computation for arbitrarily-connected networks with graphX

The Apache Spark tarball should be downloaded from httpsparkapacheorg where different pre-built tarballs are available for Hadoop1x for CDH 4 and for Hadoop 2x

651 Spark Installation With cm-spark-setupBright Cluster Manager provides cm-spark-setup to carry out Sparkinstallation

Prerequisites For Spark Installation And What Spark Installation DoesThe following applies to using cm-spark-setup

bull A Hadoop instance must already be installed

bull Spark can be installed in two different deployment modes Stan-dalone or YARN

ndash Standalone mode This is the default for Apache Hadoop 1xCloudera CDH 4x and Hortonworks HDP 13x

It is possible to force the Standalone mode deployment byusing the additional option--standalone

When installing in standalone mode the script installs Sparkon the active head node and on the DataNodes of the cho-sen Hadoop instance

copy Bright Computing Inc

40 Hadoop-related Projects

The Spark Master service runs on the active head node bydefault but can be specified to run on another node byusing the option --masterSpark Worker services run on all DataNodes

ndash YARN mode This is the default for Apache Hadoop 2x Cloud-era CDH 5x Hortonworks 2x and Pivotal 2x The default canbe overridden by using the --standalone option

When installing in YARN mode the script installs Sparkonly on the active head node

bull The script assigns no roles to nodes

bull Spark is copied by the script to a subdirectory under cmsharedhadoop

bull Spark configuration files are copied by the script to under etchadoop

An Example Run With cm-spark-setup in YARN modeExample

[rootbright70 ~] cm-spark-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpspark-130-bin-hadoop24tgz

Spark release rsquo130-bin-hadoop24rsquo

Found Hadoop instance rsquohdfs1rsquo release 260

Spark will be installed in YARN (clientcluster) mode

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Waiting for NameNode to be ready done

Copying Spark assembly jar to HDFS done

Waiting for NameNode to be ready done

Validating Spark setup done

Installation successfully completed

Finished

An Example Run With cm-spark-setup in standalone modeExample

[rootbright70 ~] cm-spark-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpspark-130-bin-hadoop24tgz --standalone --master node005

Spark release rsquo130-bin-hadoop24rsquo

Found Hadoop instance rsquohdfs1rsquo release 260

Spark will be installed to work in Standalone mode

Spark Master service will be run on node node005

Spark will use all DataNodes as WorkerNodes

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Updating images done

Initializing Spark Master service done

Initializing Spark Worker service done

Validating Spark setup done

copy Bright Computing Inc

66 Sqoop 41

Installation successfully completed

Finished

652 Spark Removal With cm-spark-setupcm-spark-setup should also be used to remove the Spark instance

Example

[rootbright70 ~] cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Spark directories done

Removal successfully completed

Finished

653 Using SparkSpark supports two deploy modes to launch Spark applications on YARNConsidering the SparkPi example provided with Spark

bull In ldquoyarn-clientrdquo mode the Spark driver runs in the client processand the SparkPi application is run as a child thread of ApplicationMaster

[rootbright70 ~] module load sparkhdfs1

[rootbright70 ~] spark-submit --master yarn-client --class orgapachesparkexamplesSparkPi $SPARK_PREFIXlibspark-examples-jar

bull In ldquoyarn-clusterrdquo mode the Spark driver runs inside an ApplicationMaster process which is managed by YARN on the cluster

[rootbright70 ~] module load sparkhdfs1

[rootbright70 ~] spark-submit --master yarn-cluster --class orgapachesparkexamplesSparkPi $SPARK_PREFIXlibspark-examples-jar

66 SqoopApache Sqoop is a tool designed to transfer bulk data between Hadoopand an RDBMS Sqoop uses MapReduce to import and export data BrightCluster Manager supports transfers between Sqoop and MySQL

RHEL7 and SLES12 use MariaDB and are not yet supported by theavailable versions of Sqoop at the time of writing (April 2015) At presentthe latest Sqoop stable release is 145 while the latest Sqoop2 versionis 1995 Sqoop2 is incompatible with Sqoop it is not feature-completeand it is not yet intended for production use The Bright Computing uti-tility cm-sqoop-setup does not as yet support Sqoop2

661 Sqoop Installation With cm-sqoop-setupBright Cluster Manager provides cm-sqoop-setup to carry out Sqoopinstallation

copy Bright Computing Inc

42 Hadoop-related Projects

Prerequisites For Sqoop Installation And What Sqoop Installation DoesThe following requirements and conditions apply to running thecm-sqoop-setup script

bull A Hadoop instance must already be installed

bull Before running the script the version of themysql-connector-java package should be checkedSqoop works with releases 5134 or later of this package Ifmysql-connector-java provides a newer release then thefollowing must be done to ensure that Sqoop setup works

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz \
--conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005

Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar

Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'

Sqoop service will be run on node node005

Found Hadoop instance 'hdfs1', release 2.2.0

Sqoop being installed done

Creating directories for Sqoop done

Creating module file for Sqoop done

Creating configuration files for Sqoop done

Updating images done

Installation successfully completed

Finished
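
A bulk import from MySQL into HDFS can then be carried out with the sqoop client. A minimal sketch follows, in which the database wikidb, the table pages, and the user sqoop are hypothetical values used for illustration. The -P option prompts for the database password:

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://node005/wikidb \
--username sqoop -P --table pages --target-dir /user/root/pages -m 1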


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance 'hdfs1'

Stopping/removing services done

Removing module file done

Removing additional Sqoop directories done

Updating images done

Removal successfully completed

Finished

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemon to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance with ZooKeeper must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t apache-storm-0.9.4.tar.gz --nimbus node005

Storm release '0.9.4'

Storm Nimbus and UI services will be run on node node005

Found Hadoop instance 'hdfs1', release 2.2.0

Storm being installed done

Creating directories for Storm done

Creating module file for Storm done

Creating configuration files for Storm done

Updating images done

Initializing worker services for Storm (on DataNodes) done

Initializing Nimbus services for Storm done

Executing validation test done

Installation successfully completed

Finished

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance 'hdfs1'

Stopping/removing services done

Removing module file done

Removing additional Storm directories done

Updating images done

Removal successfully completed

Finished

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1

[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar \
storm.starter.WordCountTopology WordCount2

470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...

476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

[==================================================] 3248859 / 3248859

File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}

687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15
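
A running topology can also be killed from the command line, instead of from the Storm UI:

[root@bright70 ~]# storm kill WordCount2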


3.1.1 The HDFS Instance Overview Tab
The Overview tab pane (figure 3.2) arranges Hadoop-related items in sub-panes for convenient viewing.

Figure 3.2: Overview Tab View For A Hadoop Instance In cmgui

The following items can be viewed:

• Statistics: In the first sub-pane row are statistics associated with the Hadoop instance. These include node data on the number of applications that are running, as well as memory and capacity usage.

• Metrics: In the second sub-pane row, a metric can be selected for display as a graph.

• Roles: In the third sub-pane row, the roles associated with a node are displayed.

3.1.2 The HDFS Instance Settings Tab
The Settings tab pane (figure 3.3) displays the configuration of the Hadoop instance.


Figure 3.3: Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed:

• WebHDFS: Allows HDFS to be modified via the web (a sample REST call is sketched after this list).

• Description: An arbitrary, user-defined description.

• Directory settings for the Hadoop instance: Directories for configuration, temporary purposes, and logging.

• Topology awareness setting: Optimizes Hadoop for racks, switches, or nothing, when it comes to network topology.

• Replication factors: Set the default and maximum replication allowed.

• Balancing settings: Spread loads evenly across the instance.
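
With WebHDFS enabled, HDFS can be reached over plain HTTP via the NameNode REST interface. A minimal sketch of such a call, assuming a NameNode running on node001 and the default WebHDFS port of 50070:

[root@bright70 ~]# curl -i "http://node001:50070/webhdfs/v1/tmp?op=LISTSTATUS"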

3.1.3 The HDFS Instance Tasks Tab
The Tasks tab (figure 3.4) displays Hadoop-related tasks that the administrator can execute.


Figure 3.4: Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to:

• Restart, start, or stop all the Hadoop daemons.

• Start/Stop the Hadoop balancer.

• Format the Hadoop filesystem. A warning is given before formatting.

• Carry out a manual failover for the NameNode. This feature means making the other NameNode the active one, and is available only for Hadoop instances that support NameNode failover.

3.1.4 The HDFS Instance Notes Tab
The Notes tab is like a jotting pad, and can be used by the administrator to write notes for the associated Hadoop instance.

3.2 cmsh And hadoop Mode
The cmsh front end can be used instead of cmgui to display information on Hadoop-related values, and to carry out Hadoop-related tasks.

3.2.1 The show And overview Commands
The show Command
Within hadoop mode, the show command displays parameters that correspond mostly to cmgui's Settings tab in the Hadoop resource (section 3.1.2):

Example

[root@bright70 conf]# cmsh

[bright70]% hadoop

[bright70->hadoop]% list

Name (key) Hadoop version Hadoop distribution Configuration directory

---------- -------------- ------------------- -----------------------

Apache121  1.2.1          Apache              /etc/hadoop/Apache121

[bright70->hadoop]% show apache121

Parameter Value


------------------------------------- ------------------------------------

Automatic failover Disabled

Balancer Threshold 10

Balancer period 2

Balancing policy dataNode

Cluster ID

Configuration directory /etc/hadoop/Apache121

Configuration directory for HBase /etc/hadoop/Apache121/hbase

Configuration directory for ZooKeeper /etc/hadoop/Apache121/zookeeper

Creation time Fri, 07 Feb 2014 17:03:01 CET

Default replication factor 3

Enable HA for YARN no

Enable NFS gateway no

Enable Web HDFS no

HA enabled no

HA name service

HDFS Block size 67108864

HDFS Permission no

HDFS Umask 077

HDFS log directory /var/log/hadoop/Apache121

Hadoop distribution Apache

Hadoop root

Hadoop version 1.2.1

Installation directory for HBase /cm/shared/apps/hadoop/Apache/hbase-0+

Installation directory for ZooKeeper /cm/shared/apps/hadoop/Apache/zookeepe+

Installation directory /cm/shared/apps/hadoop/Apache121

Maximum replication factor 512

Network

Revision

Root directory for data /var/lib

Temporary directory /tmp/hadoop/Apache121

Topology Switch

Use HTTPS no

Use Lustre no

Use federation no

Use only HTTPS no

YARN automatic failover Hadoop

description installed from /tmp/hadoop-1.2.1.tar+

name Apache121

notes

The overview Command
Similarly, the overview command corresponds somewhat to cmgui's Overview tab in the Hadoop resource (section 3.1.1), and provides Hadoop-related information on the system resources that are used:

Example

[bright70->hadoop]% overview apache121

Parameter Value

----------------------------- ---------------------------

Name apache121

Capacity total 0B

Capacity used 0B

Capacity remaining 0B

Heap memory total 0B


Heap memory used 0B

Heap memory remaining 0B

Non-heap memory total 0B

Non-heap memory used 0B

Non-heap memory remaining 0B

Nodes available 0

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 0

Total blocks 0

Missing blocks 0

Under-replicated blocks 0

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 0

Applications running 0

Applications pending 0

Applications submitted 0

Applications completed 0

Applications failed 0

Federation setup no

Role Node

------------------------------ ---------------------------

DataNode, ZooKeeper bright70,node001,node002

3.2.2 The Tasks Commands
Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The services Commands For Hadoop Services
Hadoop services can be started, stopped, and restarted with:

• restartallservices

• startallservices

• stopallservices

Example

[bright70->hadoop]% restartallservices apache121

Will now stop all Hadoop services for instance 'apache121' done

Will now start all Hadoop services for instance 'apache121' done

[bright70->hadoop]%

The startbalancer And stopbalancer Commands For Hadoop
For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer

• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:


• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod

2

[bright70->hadoop]% set apache121 balancerperiod 3

[bright70->hadoop]% commit

[bright70->hadoop]% startbalancer apache121

Code: 0

Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out

Time Stamp Iteration Bytes Moved Bytes To Move Bytes Being Moved

The cluster is balanced. Exiting...

[bright70->hadoop]%

Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121. For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121

Will now format and set up HDFS for instance 'apache121'

Stopping datanodes done

Stopping namenodes done

Formatting HDFS done

Starting namenode (host 'bright70') done

Starting datanodes done

Waiting for datanodes to come up done

Setting up HDFS done

[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
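
For example, for an HA-enabled instance, the active status can be moved away from the NameNode on node001 as follows (the instance and node names here are illustrative):

Example

[bright70->hadoop]% manualfailover --from node001 hdfs1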

4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:


• Selecting a Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui
The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means it includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.

5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/

[root@bright70 hadoop]# cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# cm-hadoop-tests.sh apache220

================================================================

Press [CTRL+C] to stop

================================================================

================================================================

start cleaning directories

================================================================

================================================================

clean directories done

================================================================

================================================================

start doing gen_test

================================================================

14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2

14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2

Job Counters

Map-Reduce Framework


org.apache.hadoop.examples.terasort.TeraGen$Counters

14/03/24 15:07:03 INFO terasort.TeraSort: starting

14/03/24 15:09:12 INFO terasort.TeraSort: done

================================================================================

gen_test done

================================================================================

================================================================================

start doing PI test

================================================================================

Working Directory = /user/root/bbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220

Parameter Value

------------------------------ ---------------------------------

Name Apache220

Capacity total 27.56GB

Capacity used 72.46MB

Capacity remaining 16.41GB

Heap memory total 280.7MB

Heap memory used 152.1MB

Heap memory remaining 128.7MB

Non-heap memory total 258.1MB

Non-heap memory used 251.9MB

Non-heap memory remaining 6.155MB

Nodes available 3

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 72

Total blocks 31

Missing blocks 0


Under-replicated blocks 2

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 59666

Applications running 1

Applications pending 0

Applications submitted 7

Applications completed 6

Applications failed 0

High availability Yes (automatic failover disabled)

Federation setup no

Role Node

-------------------------------------------------------------- -------

DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper node001

DataNode, Journal, NameNode, YARNClient, ZooKeeper node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.

The user fred can then submit a run from a pi value estimator, from the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5

Job Finished in 19.732 seconds

Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
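
For example, a longer run, with 4 maps and 10000 samples per map, gives a closer estimate:

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 10000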

6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance with ZooKeeper must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use the Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are by default installed and run on the head node. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option -p <rootpass> is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option -s <heapsize> is not mandatory. If not set, a default value of 1GB is used.

The option --master <nodename> is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005

Accumulo release '1.6.2'

Accumulo GC, Master, Monitor, and Tracer services will be run on node node005

Found Hadoop instance 'hdfs1', release 2.6.0

Accumulo being installed done

Creating directories for Accumulo done

Creating module file for Accumulo done

Creating configuration files for Accumulo done

Updating images done

Setting up Accumulo directories in HDFS done

Executing 'accumulo init' done

Initializing services for Accumulo (on DataNodes) done

Initializing master services for Accumulo done

Waiting for NameNode to be ready done

Executing validation test done

Installation successfully completed

Finished

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1

Requested removal of Accumulo for Hadoop instance 'hdfs1'

Stopping/removing services done

Removing module file done

Removing additional Accumulo directories done

Updating images done

Removal successfully completed

Finished

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo

bash-4.1$ module load accumulo/hdfs1

bash-4.1$ cd $ACCUMULO_HOME

bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \


org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.
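
To give a flavor of HiveQL: the following statements define a tab-delimited table, load a local file into it, and run an aggregate query. The table and file names here are illustrative only:

CREATE TABLE pageviews (page STRING, hits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/tmp/pageviews.tsv' INTO TABLE pageviews;
SELECT page, SUM(hits) FROM pageviews GROUP BY page;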

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j

– cm-hive-setup is run with the --conn option to specify the connector version to use

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client (a concrete instance of these statements is sketched after this list):

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;

DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.

– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script using hive and beeline.
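
As a concrete instance of the MySQL statements listed earlier, assuming the metastore database metastore_hdfs1 (the name used in the example run below) and a hypothetical password hivepass:

mysql> GRANT ALL PRIVILEGES ON metastore_hdfs1.* TO 'hive'@'%' IDENTIFIED BY 'hivepass';
mysql> FLUSH PRIVILEGES;
mysql> DROP DATABASE IF EXISTS metastore_hdfs1;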

An Example Run With cm-hive-setup

Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <hivepass> \
--metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005

Hive release '1.1.0-bin'

Using MySQL server on active headnode

Hive service will be run on node node005

Using MySQL Connector/J installed in /usr/share/java/

Hive being installed done

Creating directories for Hive done

Creating module file for Hive done

Creating configuration files for Hive done

Initializing database 'metastore_hdfs1' in MySQL done

Waiting for NameNode to be ready done

Creating HDFS directories for Hive done

Updating images done

Waiting for NameNode to be ready done

Hive setup validation

-- testing 'hive' client

-- testing 'beeline' client

Hive setup validation done

Installation successfully completed

Finished


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance 'hdfs1'

Stopping/removing services done

Removing module file done

Removing additional Hive directories done

Updating images done

Removal successfully completed

Finished

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'

Connecting to jdbc:hive2://node005.cm.cluster:10000

Connected to: Apache Hive (version 1.1.0)

Driver: Hive JDBC (version 1.1.0)

Transaction isolation: TRANSACTION_REPEATABLE_READ

+-----------+--+

| tab_name |

+-----------+--+

| test |

| test2 |

+-----------+--+

2 rows selected (0.243 seconds)

Beeline version 1.1.0 by Apache Hive

Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance with ZooKeeper must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/kafka_2.11-0.8.2.1.tgz

Kafka release '0.8.2.1' for Scala '2.11'

Found Hadoop instance 'hdfs1', release 1.2.1

Kafka being installed done

Creating directories for Kafka done

Creating module file for Kafka done

Creating configuration files for Kafka done

Updating images done

Initializing services for Kafka (on ZooKeeper nodes) done

Executing validation test done

Installation successfully completed

Finished
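
The setup can then be exercised with the console producer and consumer scripts shipped with Kafka. A minimal sketch, assuming that node001 runs a ZooKeeper server on port 2181 and a Kafka broker on the default port 9092, and using a hypothetical topic test:

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 \
--replication-factor 1 --partitions 1 --topic test
[root@bright70 ~]# kafka-console-producer.sh --broker-list node001:9092 --topic test
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 --topic test --from-beginning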

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1

Requested removal of Kafka for Hadoop instance hdfs1

Stopping/removing services done

Removing module file done

Removing additional Kafka directories done

Updating images done

Removal successfully completed

Finished

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/pig-0.14.0.tar.gz

Pig release '0.14.0'

Pig being installed done

Creating directories for Pig done

Creating module file for Pig done

Creating configuration files for Pig done

Waiting for NameNode to be ready

Waiting for NameNode to be ready done

Validating Pig setup

Validating Pig setup done

Installation successfully completed

Finished

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1

Requested removal of Pig for Hadoop instance 'hdfs1'

Stopping/removing services done

Removing module file done

Removing additional Pig directories done

Updating images done

Removal successfully completed

Finished

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig


14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
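
The script itself contains ordinary Pig Latin statements. A minimal word-count sketch of what such a script could look like (the input path is illustrative, and this is not the actual contents of /tmp/smoke.pig):

lines = LOAD '/tmp/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;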

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release 2.6.0

Spark will be installed in YARN (client/cluster) mode

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Waiting for NameNode to be ready done

Copying Spark assembly jar to HDFS done

Waiting for NameNode to be ready done

Validating Spark setup done

Installation successfully completed

Finished

An Example Run With cm-spark-setup in standalone modeExample

[rootbright70 ~] cm-spark-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpspark-130-bin-hadoop24tgz --standalone --master node005

Spark release rsquo130-bin-hadoop24rsquo

Found Hadoop instance rsquohdfs1rsquo release 260

Spark will be installed to work in Standalone mode

Spark Master service will be run on node node005

Spark will use all DataNodes as WorkerNodes

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Updating images done

Initializing Spark Master service done

Initializing Spark Worker service done

Validating Spark setup done

copy Bright Computing Inc

66 Sqoop 41

Installation successfully completed

Finished

652 Spark Removal With cm-spark-setupcm-spark-setup should also be used to remove the Spark instance

Example

[rootbright70 ~] cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Spark directories done

Removal successfully completed

Finished

653 Using SparkSpark supports two deploy modes to launch Spark applications on YARNConsidering the SparkPi example provided with Spark

bull In ldquoyarn-clientrdquo mode the Spark driver runs in the client processand the SparkPi application is run as a child thread of ApplicationMaster

[rootbright70 ~] module load sparkhdfs1

[rootbright70 ~] spark-submit --master yarn-client --class orgapachesparkexamplesSparkPi $SPARK_PREFIXlibspark-examples-jar

bull In ldquoyarn-clusterrdquo mode the Spark driver runs inside an ApplicationMaster process which is managed by YARN on the cluster

[rootbright70 ~] module load sparkhdfs1

[rootbright70 ~] spark-submit --master yarn-cluster --class orgapachesparkexamplesSparkPi $SPARK_PREFIXlibspark-examples-jar

66 SqoopApache Sqoop is a tool designed to transfer bulk data between Hadoopand an RDBMS Sqoop uses MapReduce to import and export data BrightCluster Manager supports transfers between Sqoop and MySQL

RHEL7 and SLES12 use MariaDB and are not yet supported by theavailable versions of Sqoop at the time of writing (April 2015) At presentthe latest Sqoop stable release is 145 while the latest Sqoop2 versionis 1995 Sqoop2 is incompatible with Sqoop it is not feature-completeand it is not yet intended for production use The Bright Computing uti-tility cm-sqoop-setup does not as yet support Sqoop2

661 Sqoop Installation With cm-sqoop-setupBright Cluster Manager provides cm-sqoop-setup to carry out Sqoopinstallation

copy Bright Computing Inc

42 Hadoop-related Projects

Prerequisites For Sqoop Installation And What Sqoop Installation DoesThe following requirements and conditions apply to running thecm-sqoop-setup script

bull A Hadoop instance must already be installed

bull Before running the script the version of themysql-connector-java package should be checkedSqoop works with releases 5134 or later of this package Ifmysql-connector-java provides a newer release then thefollowing must be done to ensure that Sqoop setup works

ndash a suitable 5134 or later release of ConnectorJ is downloadedfrom httpdevmysqlcomdownloadsconnectorj

ndash cm-sqoop-setup is run with the --conn option in order tospecify the connector version to be used

Example

--conn tmpmysql-connector-java-5134-binjar

bull The cm-sqoop-setup script installs Sqoop only on the activehead node A different node can be specified by using the option--master

bull The script assigns no roles to nodes

bull Sqoop executables are copied by the script to a subdirectory undercmsharedhadoop

bull Sqoop configuration files are copied by the script and placed underetchadoop

bull The Metastore service is started up by the script

An Example Run With cm-sqoop-setupExample

[rootbright70 ~] cm-sqoop-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpsqoop-145bin__hadoop-204-alphatargz--conn tmpmysql-connector-java-5134-binjar --master node005

Using MySQL ConnectorJ from tmpmysql-connector-java-5134-binjar

Sqoop release rsquo145bin__hadoop-204-alpharsquo

Sqoop service will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 220

Sqoop being installed done

Creating directories for Sqoop done

Creating module file for Sqoop done

Creating configuration files for Sqoop done

Updating images done

Installation successfully completed

Finished

copy Bright Computing Inc

67 Storm 43

662 Sqoop Removal With cm-sqoop-setupcm-sqoop-setup should be used to remove the Sqoop instance

Example

[rootbright70 ~] cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Sqoop directories done

Updating images done

Removal successfully completed

Finished

67 StormApache Storm is a distributed realtime computation system While Hadoopis focused on batch processing Storm can process streams of data

Other parallels between Hadoop and Storm

bull users run ldquojobsrdquo in Hadoop and ldquotopologiesrdquo in Storm

bull the master node for Hadoop jobs runs the ldquoJobTrackerrdquo or ldquoRe-sourceManagerrdquo daemons to deal with resource management andscheduling while the master node for Storm runs an analogous dae-mon called ldquoNimbusrdquo

bull each worker node for Hadoop runs daemons called ldquoTaskTrackerrdquoor ldquoNodeManagerrdquo while the worker nodes for Storm runs an anal-ogous daemon called ldquoSupervisorrdquo

bull both Hadoop in the case of NameNode HA and Storm leverageldquoZooKeeperrdquo for coordination

671 Storm Installation With cm-storm-setupBright Cluster Manager provides cm-storm-setup to carry out Storminstallation

Prerequisites For Storm Installation And What Storm Installation DoesThe following applies to using cm-storm-setup

bull A Hadoop instance with ZooKeeper must already be installed

bull The cm-storm-setup script only installs Storm on the active headnode and on the DataNodes of the chosen Hadoop instance by de-fault A node other than master can be specified by using the option--master or its alias for this setup script --nimbus

bull The script assigns no roles to nodes

bull Storm executables are copied by the script to a subdirectory undercmsharedhadoop

bull Storm configuration files are copied by the script to under etchadoop This is done both on the active headnode and on thenecessary image(s)

bull Validation tests are carried out by the script

copy Bright Computing Inc

44 Hadoop-related Projects

An Example Run With cm-storm-setup

Example

[rootbright70 ~] cm-storm-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t apache-storm-094targz

--nimbus node005

Storm release rsquo094rsquo

Storm Nimbus and UI services will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 220

Storm being installed done

Creating directories for Storm done

Creating module file for Storm done

Creating configuration files for Storm done

Updating images done

Initializing worker services for Storm (on DataNodes) done

Initializing Nimbus services for Storm done

Executing validation test done

Installation successfully completed

Finished

The cm-storm-setup installation script submits a validation topol-ogy (topology in the Storm sense) called WordCount After a success-ful installation users can connect to the Storm UI on the host ltnim-busgt the Nimbus server at httpltnimbusgt10080 There theycan check the status of WordCount and can kill it

672 Storm Removal With cm-storm-setupThe cm-storm-setup script should also be used to remove the Storminstance

Example

[rootbright70 ~] cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Storm directories done

Updating images done

Removal successfully completed

Finished

673 Using StormThe following example shows how to submit a topology and then verifythat it has been submitted successfully (some lines elided)

[rootbright70 ~] module load stormhdfs1

[rootbright70 ~] storm jar cmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-jar stormstarterWordCountTopology WordCount2

470 [main] INFO backtypestormStormSubmitter - Jar not uploaded to m

copy Bright Computing Inc

67 Storm 45

aster yet Submitting jar

476 [main] INFO backtypestormStormSubmitter - Uploading topology jar cmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jar to assigned location tmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jar

Start uploading file rsquocmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jarrsquo to rsquotmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jarrsquo (3248859 bytes)

[==================================================] 3248859 3248859

File rsquocmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jarrsquo uploaded to rsquotmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jarrsquo(3248859 bytes)


Figure 3.3: Settings Tab View For A Hadoop Instance In cmgui

It allows the following Hadoop-related settings to be changed:

• WebHDFS: allows HDFS to be modified via the web, using the WebHDFS REST API (a curl sketch follows this list)

• Description: an arbitrary, user-defined description

• Directory settings for the Hadoop instance: directories for configuration, temporary purposes, and logging

• Topology awareness setting: optimizes Hadoop for racks, switches, or nothing, when it comes to network topology

• Replication factors: set the default and maximum replication allowed

• Balancing settings: spread loads evenly across the instance
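When the WebHDFS setting is enabled, the NameNode answers REST calls over plain HTTP. A minimal sketch of such a call, assuming a NameNode running on bright70 and the default WebHDFS port of 50070:

Example

[root@bright70 ~]# curl -i "http://bright70:50070/webhdfs/v1/user?op=LISTSTATUS"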

3.1.3 The HDFS Instance Tasks Tab
The Tasks tab (figure 3.4) displays Hadoop-related tasks that the administrator can execute.


Figure 3.4: Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to:

• Restart, start, or stop all the Hadoop daemons

• Start/stop the Hadoop balancer

• Format the Hadoop filesystem. A warning is given before formatting

• Carry out a manual failover for the NameNode. This feature means making the other NameNode the active one, and is available only for Hadoop instances that support NameNode failover

3.1.4 The HDFS Instance Notes Tab
The Notes tab is like a jotting pad, and can be used by the administrator to write notes for the associated Hadoop instance.

3.2 cmsh And hadoop Mode
The cmsh front end can be used instead of cmgui to display information on Hadoop-related values, and to carry out Hadoop-related tasks.

3.2.1 The show And overview Commands
The show Command
Within hadoop mode, the show command displays parameters that correspond mostly to cmgui's Settings tab in the Hadoop resource (section 3.1.2).

Example

[root@bright70 conf]# cmsh
[bright70]% hadoop
[bright70->hadoop]% list
Name (key)  Hadoop version  Hadoop distribution  Configuration directory
----------  --------------  -------------------  -----------------------
Apache121   1.2.1           Apache               /etc/hadoop/Apache121
[bright70->hadoop]% show apache121

Parameter                             Value
------------------------------------- ------------------------------------
Automatic failover                    Disabled
Balancer Threshold                    10
Balancer period                       2
Balancing policy                      dataNode
Cluster ID
Configuration directory               /etc/hadoop/Apache121
Configuration directory for HBase     /etc/hadoop/Apache121/hbase
Configuration directory for ZooKeeper /etc/hadoop/Apache121/zookeeper
Creation time                         Fri, 07 Feb 2014 17:03:01 CET
Default replication factor            3
Enable HA for YARN                    no
Enable NFS gateway                    no
Enable Web HDFS                       no
HA enabled                            no
HA name service
HDFS Block size                       67108864
HDFS Permission                       no
HDFS Umask                            077
HDFS log directory                    /var/log/hadoop/Apache121
Hadoop distribution                   Apache
Hadoop root
Hadoop version                        1.2.1
Installation directory for HBase      /cm/shared/apps/hadoop/Apache/hbase-0+
Installation directory for ZooKeeper  /cm/shared/apps/hadoop/Apache/zookeepe+
Installation directory                /cm/shared/apps/hadoop/Apache/1.2.1
Maximum replication factor            512
Network
Revision
Root directory for data               /var/lib
Temporary directory                   /tmp/hadoop/Apache121
Topology                              Switch
Use HTTPS                             no
Use Lustre                            no
Use federation                        no
Use only HTTPS                        no
YARN automatic failover               Hadoop
description                           installed from: /tmp/hadoop-1.2.1.tar+
name                                  Apache121
notes

The overview Command
Similarly, the overview command corresponds somewhat to cmgui's Overview tab in the Hadoop resource (section 3.1.1), and provides Hadoop-related information on the system resources that are used.

Example

[bright70->hadoop]% overview apache121
Parameter                      Value
------------------------------ ---------------------------
Name                           apache121
Capacity total                 0B
Capacity used                  0B
Capacity remaining             0B
Heap memory total              0B
Heap memory used               0B
Heap memory remaining          0B
Non-heap memory total          0B
Non-heap memory used           0B
Non-heap memory remaining      0B
Nodes available                0
Nodes dead                     0
Nodes decommissioned           0
Nodes decommission in progress 0
Total files                    0
Total blocks                   0
Missing blocks                 0
Under-replicated blocks        0
Scheduled replication blocks   0
Pending replication blocks     0
Block report average Time      0
Applications running           0
Applications pending           0
Applications submitted         0
Applications completed         0
Applications failed            0
Federation setup               no

Role                           Node
------------------------------ ---------------------------
DataNode, ZooKeeper            bright70,node001,node002

3.2.2 The Tasks Commands
Within hadoop mode, the following commands run tasks that correspond mostly to the Tasks tab (section 3.1.3) in the cmgui Hadoop resource.

The services Commands For Hadoop Services
Hadoop services can be started, stopped, and restarted with:

• restartallservices

• startallservices

• stopallservices

Example

[bright70->hadoop]% restartallservices apache121
Will now stop all Hadoop services for instance 'apache121'... done.
Will now start all Hadoop services for instance 'apache121'... done.
[bright70->hadoop]%

The startbalancer And stopbalancer Commands For Hadoop
For efficient access to HDFS, file block level usage across nodes should be reasonably balanced. Hadoop can start and stop balancing across the instance with the following commands:

• startbalancer

• stopbalancer

The balancer policy, threshold, and period can be retrieved or set for the instance using the parameters:


• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp  Iteration  Bytes Moved  Bytes To Move  Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%

Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121
For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
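For example, for a hypothetical HA-enabled instance hdfs1, with NameNodes running on node001 and node002, the active status could be handed over with a session like:

Example

[bright70->hadoop]% manualfailover --from node001 --to node002 hdfs1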

4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:


• Selecting A Hadoop instance: the name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: more than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: the Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui
The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means it includes the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.

5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220

================================================================

Press [CTRL+C] to stop

================================================================

================================================================

start cleaning directories

================================================================

================================================================

clean directories done

================================================================

================================================================

start doing gen_test

================================================================

14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2

Job Counters

Map-Reduce Framework


org.apache.hadoop.examples.terasort.TeraGen$Counters

14/03/24 15:07:03 INFO terasort.TeraSort: starting
14/03/24 15:09:12 INFO terasort.TeraSort: done

================================================================================

gen_test done

================================================================================

================================================================================

start doing PI test

================================================================================

Working Directory = /user/root/bbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                      Value
------------------------------ ---------------------------------
Name                           Apache220
Capacity total                 2756GB
Capacity used                  7246MB
Capacity remaining             1641GB
Heap memory total              2807MB
Heap memory used               1521MB
Heap memory remaining          1287MB
Non-heap memory total          2581MB
Non-heap memory used           2519MB
Non-heap memory remaining      6155MB
Nodes available                3
Nodes dead                     0
Nodes decommissioned           0
Nodes decommission in progress 0
Total files                    72
Total blocks                   31
Missing blocks                 0
Under-replicated blocks        2
Scheduled replication blocks   0
Pending replication blocks     0
Block report average Time      59666
Applications running           1
Applications pending           0
Applications submitted         7
Applications completed         6
Applications failed            0
High availability              Yes (automatic failover disabled)
Federation setup               no

Role                                                            Node
--------------------------------------------------------------  -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper  node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper              node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.

The user fred can then submit a run of a pi value estimator from the example jar file as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
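The following sketch, with illustrative values of 16 maps and 100000 samples per map, should therefore come out much closer to π:

Example

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 16 100000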

6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores: keys in Accumulo consist of several elements (a row ID, column family, column qualifier, visibility label, and timestamp).

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation, And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use the Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are by default installed and run on the head node. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option -p <rootpass> is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option -s <heapsize> is not mandatory. If not set, a default value of 1GB is used.

The option --master <nodename> is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run as the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

6.2 Hive
Apache Hive is a data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.
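As a taste of HiveQL, the following sketch creates a table, and then queries it. The table and column names here are hypothetical, and not part of any Bright Cluster Manager setup:

Example

hive> CREATE TABLE words (word STRING, freq INT);
hive> SELECT word, freq FROM words WHERE freq > 10;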

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation, And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use:

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;

DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the Metastore database to be used by Hive. The same name is used later by cm-hive-setup.

– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– the DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node.

It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script using hive and beeline.

An Example Run With cm-hive-setup
Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz \
--master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client
-- testing 'beeline' client
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.
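Once Kafka is installed (section 6.3.1), a basic publish-subscribe round trip can be tried out with the console tools shipped in the Kafka tarball. A minimal sketch, assuming a kafka/hdfs1 module exists for the instance and puts the tools on the PATH, and that a broker and ZooKeeper run on node001 with their default ports (9092 and 2181):

Example

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 \
--replication-factor 1 --partitions 1 --topic test
[root@bright70 ~]# echo "hello" | kafka-console-producer.sh \
--broker-list node001:9092 --topic test
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 \
--topic test --from-beginning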

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation, And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation, And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
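For illustration, a minimal Pig Latin word count, of the kind that could be saved as /tmp/smoke.pig and run in batch mode as above. The HDFS input path used here is hypothetical:

Example

-- load lines from a hypothetical HDFS file, split them into words, count each word
lines = LOAD '/user/root/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;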

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with graphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation, And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: this is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.

The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: this is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup In Standalone Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode
Spark Master service will be run on node node005
Spark will use all DataNodes as WorkerNodes
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar
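Interactive use is also possible with the spark-shell command shipped with Spark. A minimal sketch, assuming the same module environment as in the preceding examples:

Example

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-shell --master yarn-client
scala> sc.parallelize(1 to 1000).reduce(_ + _)
res0: Int = 500500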

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.
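As an illustration of such a transfer, a typical import of a MySQL table into HDFS is sketched below. The host, database, table, and module names here are hypothetical:

Example

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://dbhost/employees \
--username sqoopuser -P --table staff --target-dir /user/root/staff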

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop: it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation, And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used:

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup
Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz \
--conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while each worker node for Storm runs an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation, And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.
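Killing the validation topology can also be done from the command line with the storm client, instead of from the Storm UI. A minimal sketch, assuming the Storm module for the instance is loaded:

Example

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm kill WordCount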

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2
470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2
[root@hadoopdev ~]# storm list
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 23: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

18 Viewing And Managing HDFS From CMDaemon

Figure 34 Tasks Tab View For A Hadoop Instance In cmgui

The Tasks tab allows the administrator to

bull Restart start or stop all the Hadoop daemons

bull StartStop the Hadoop balancer

bull Format the Hadoop filesystem A warning is given before format-ting

bull Carry out a manual failover for the NameNode This feature meansmaking the other NameNode the active one and is available onlyfor Hadoop instances that support NameNode failover

314 The HDFS Instance Notes TabThe Notes tab is like a jotting pad and can be used by the administratorto write notes for the associated Hadoop instance

32 cmsh And hadoop ModeThe cmsh front end can be used instead of cmgui to display informationon Hadoop-related values and to carry out Hadoop-related tasks

321 The show And overview CommandsThe show CommandWithin hadoop mode the show command displays parameters that cor-respond mostly to cmguirsquos Settings tab in the Hadoop resource (sec-tion 312)

Example

[rootbright70 conf] cmsh

[bright70] hadoop

[bright70-gthadoop] list

Name (key) Hadoop version Hadoop distribution Configuration directory

---------- -------------- ------------------- -----------------------

Apache121 121 Apache etchadoopApache121

[bright70-gthadoop] show apache121

Parameter Value

copy Bright Computing Inc

32 cmsh And hadoop Mode 19

------------------------------------- ------------------------------------

Automatic failover Disabled

Balancer Threshold 10

Balancer period 2

Balancing policy dataNode

Cluster ID

Configuration directory etchadoopApache121

Configuration directory for HBase etchadoopApache121hbase

Configuration directory for ZooKeeper etchadoopApache121zookeeper

Creation time Fri 07 Feb 2014 170301 CET

Default replication factor 3

Enable HA for YARN no

Enable NFS gateway no

Enable Web HDFS no

HA enabled no

HA name service

HDFS Block size 67108864

HDFS Permission no

HDFS Umask 077

HDFS log directory varloghadoopApache121

Hadoop distribution Apache

Hadoop root

Hadoop version 121

Installation directory for HBase cmsharedappshadoopApachehbase-0+

Installation directory for ZooKeeper cmsharedappshadoopApachezookeepe+

Installation directory cmsharedappshadoopApache121

Maximum replication factor 512

Network

Revision

Root directory for data varlib

Temporary directory tmphadoopApache121

Topology Switch

Use HTTPS no

Use Lustre no

Use federation no

Use only HTTPS no

YARN automatic failover Hadoop

description installed from tmphadoop-121tar+

name Apache121

notes

The overview CommandSimilarly the overview command corresponds somewhat to thecmguirsquos Overview tab in the Hadoop resource (section 311) and pro-vides hadoop-related information on the system resources that are used

Example

[bright70-gthadoop] overview apache121

Parameter Value

----------------------------- ---------------------------

Name apache121

Capacity total 0B

Capacity used 0B

Capacity remaining 0B

Heap memory total 0B

copy Bright Computing Inc

20 Viewing And Managing HDFS From CMDaemon

Heap memory used 0B

Heap memory remaining 0B

Non-heap memory total 0B

Non-heap memory used 0B

Non-heap memory remaining 0B

Nodes available 0

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 0

Total blocks 0

Missing blocks 0

Under-replicated blocks 0

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 0

Applications running 0

Applications pending 0

Applications submitted 0

Applications completed 0

Applications failed 0

Federation setup no

Role Node

------------------------------ ---------------------------

DataNode ZooKeeper bright70node001node002

322 The Tasks CommandsWithin hadoopmode the following commands run tasks that correspondmostly to the Tasks tab (section 313) in the cmgui Hadoop resource

The services Commands For Hadoop ServicesHadoop services can be started stopped and restarted with

bull restartallservices

bull startallservices

bull stopallservices

Example

[bright70-gthadoop] restartallservices apache121

Will now stop all Hadoop services for instance rsquoapache121rsquo done

Will now start all Hadoop services for instance rsquoapache121rsquo done

[bright70-gthadoop]

The startbalancer And stopbalancer Commands For HadoopFor efficient access to HDFS file block level usage across nodes shouldbe reasonably balanced Hadoop can start and stop balancing across theinstance with the following commands

bull startbalancer

bull stopbalancer

The balancer policy threshold and period can be retrieved or set forthe instance using the parameters

copy Bright Computing Inc

32 cmsh And hadoop Mode 21

bull balancerperiod

bull balancerthreshold

bull balancingpolicy

Example

[bright70-gthadoop] get apache121 balancerperiod

2

[bright70-gthadoop] set apache121 balancerperiod 3

[bright70-gthadoop] commit

[bright70-gthadoop] startbalancer apache121

Code 0

Starting Hadoop balancer daemon (hadoop-apache121-balancer)starting balancer logging to varloghadoopapache121hdfshadoop-hdfs-balancer-bright70out

Time Stamp Iteration Bytes Moved Bytes To Move Bytes Being Moved

The cluster is balanced Exiting

[bright70-gthadoop]

Thu Mar 20 152702 2014 [notice] bright70 Started balancer for apache121 For details type events details 152727

The formathdfs CommandUsage

formathdfs ltHDFSgtThe formathdfs command formats an instance so that it can be reused

Existing Hadoop services for the instance are stopped first before format-ting HDFS and started again after formatting is complete

Example

[bright70-gthadoop] formathdfs apache121

Will now format and set up HDFS for instance rsquoapache121rsquo

Stopping datanodes done

Stopping namenodes done

Formatting HDFS done

Starting namenode (host rsquobright70rsquo) done

Starting datanodes done

Waiting for datanodes to come up done

Setting up HDFS done

[bright70-gthadoop]

The manualfailover CommandUsage

manualfailover [-f|--from ltNameNodegt] [-t|--to ltother Na-meNodegt] ltHDFSgt

The manualfailover command allows the active status of a Na-meNode to be moved to another NameNode in the Hadoop instance Thisis only available for Hadoop instances that support NameNode failoverThe active status can be move from andor to a NameNode

copy Bright Computing Inc

4Hadoop Services

41 Hadoop Services From cmgui

411 Configuring Hadoop Services From cmguiHadoop services can be configured on head or regular nodes

Configuration in cmgui can be carried out by selecting the node in-stance on which the service is to run and selecting the Hadoop tab ofthat node This then displays a list of possible Hadoop services withinsub-panes (figure 41)

Figure 41 Services In The Hadoop Tab Of A Node In cmgui

The services displayed are

bull Name Node Secondary Name Node

bull Data Node Zookeeper

bull Job Tracker Task Tracker

bull YARN Server YARN Client

bull HBase Server HBase Client

bull Journal

For each Hadoop service instance values can be edited via actions inthe associated pane

copy Bright Computing Inc

24 Hadoop Services

bull Selecting A Hadoop instance The name of the possible Hadoopinstances can be selected via a drop-down menu

bull Configuring multiple Hadoop instances More than one Hadoopinstance can run on the cluster Using the copy+ button adds an in-stance

bull Advanced options for instances The Advanced button allowsmany configuration options to be edited for each node instance foreach service This is illustrated by the example in figure 42 wherethe advanced configuration options of the Data Node service of thenode001 instance are shown

Figure 42 Advanced Configuration Options For The Data Node serviceIn cmgui

Typically care must be taken to use non-conflicting ports and direc-tories for each Hadoop instance

412 Monitoring And Modifying The Running Of HadoopServices From cmgui

The status of Hadoop services can be viewed and changed in the follow-ing locations

bull The Services tab of the node instance Within the Nodes or HeadNodes resource for a head or regular node instance the Servicestab can be selected The Hadoop services can then be viewed andaltered from within the display pane (figure 43)

Figure 43 Services In The Services Tab Of A Node In cmgui

copy Bright Computing Inc

41 Hadoop Services From cmgui 25

The options buttons act just like for any other service (section 311of the Administrator Manual) which means it includes the possibilityof acting on multiple service selections

bull The Tasks tab of the Hadoop instance Within the HadoopHDFS resource for a specific Hadoop instance the Tasks tab (sec-tion 313) conveniently allows all Hadoop daemons to be (re)startedand stopped directly via buttons


5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220
================================================================
Press [CTRL+C] to stop...
================================================================
...
================================================================
start cleaning directories...
================================================================
...
================================================================
clean directories done
================================================================
...
================================================================
start doing gen_test...
================================================================
...
14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2
...
Job Counters
...
Map-Reduce Framework
...
org.apache.hadoop.examples.terasort.TeraGen$Counters
...
14/03/24 15:07:03 INFO terasort.TeraSort: starting
...
14/03/24 15:09:12 INFO terasort.TeraSort: done
================================================================================
gen_test done
================================================================================
...
================================================================================
start doing PI test...
================================================================================
Working Directory = /user/root/bbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh, the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                       Value
------------------------------  ---------------------------------
Name                            Apache220
Capacity total                  27.56GB
Capacity used                   72.46MB
Capacity remaining              16.41GB
Heap memory total               280.7MB
Heap memory used                152.1MB
Heap memory remaining           128.7MB
Non-heap memory total           258.1MB
Non-heap memory used            251.9MB
Non-heap memory remaining       6.155MB
Nodes available                 3
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     72
Total blocks                    31
Missing blocks                  0
Under-replicated blocks         2
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       59666
Applications running            1
Applications pending            0
Applications submitted          7
Applications completed          6
Applications failed             0
High availability               Yes (automatic failover disabled)
Federation setup                no

Role                                                            Node
--------------------------------------------------------------  -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper  node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper              node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.
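A sketch of unsetting it in cmsh, assuming that an empty value clears the property as described above:

Example

[bright70->user[fred]]% set hadoophdfsaccess ""; commit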

The user fred can then submit a run from a pi value estimator from the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
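For instance, a run with more maps and samples (the values here are purely illustrative) trades a longer runtime for a better estimate:

Example

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 1000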


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance with ZooKeeper must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use the Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are by default installed and run on the head node. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option -p <rootpass> is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option -s <heapsize> is not mandatory. If not set, a default value of 1GB is used.

The option --master <nodename> is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run as the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10
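The ingested table can then be inspected interactively with the Accumulo shell; a minimal sketch, reusing the root password secret from the run above:

Example

bash-4.1$ accumulo shell -u root -p secret
root@hdfs1> tables
root@hdfs1> scan -t sort
root@hdfs1> quit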

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with release 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use:

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly, by the administrator, using a MySQL client (a concrete session is sketched after this list):

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;
DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the Metastore database to be used by Hive. The same name is used later by cm-hive-setup.

– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script, using hive and beeline.
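As a concrete sketch of the MySQL statements listed earlier, with the hypothetical values metastore_hdfs1 and hivepass substituted in:

Example

[root@bright70 ~]# mysql -u root -p
mysql> GRANT ALL PRIVILEGES ON metastore_hdfs1.* TO 'hive'@'%' IDENTIFIED BY 'hivepass';
mysql> FLUSH PRIVILEGES;
mysql> DROP DATABASE IF EXISTS metastore_hdfs1;
mysql> quit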

An Example Run With cm-hive-setup

Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode.
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client...
-- testing 'beeline' client...
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000
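Other HiveQL statements can be run the same way; a sketch creating a table (the table name test3 is hypothetical), after which the earlier SHOW TABLES query would also list it:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'CREATE TABLE test3 (id INT, name STRING);'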

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance with ZooKeeper must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
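After installation, a quick smoke test can be run with the standard Kafka command-line tools; a minimal sketch, assuming a kafka/hdfs1 module and a ZooKeeper service on node001 (both names are illustrative assumptions):

Example

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 \
--replication-factor 1 --partitions 1 --topic smoketest
[root@bright70 ~]# kafka-topics.sh --list --zookeeper node001:2181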

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig


14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
...
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
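A Pig Latin script such as the /tmp/smoke.pig of the preceding session might look like the following minimal word-count sketch (the input path and field layout are illustrative assumptions):

-- smoke.pig: word count over a text file already stored in HDFS
lines = LOAD '/user/root/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;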

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x. It is possible to force Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance. The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup in Standalone mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode.
Spark Master service will be run on node node005.
Spark will use all DataNodes as WorkerNodes.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with release 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used:

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
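Once installed, a typical use is a bulk import from MySQL into HDFS; a minimal sketch, in which the module name, database, table, and target directory are illustrative assumptions:

Example

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://node005.cm.cluster/testdb \
--username fred -P --table employees --target-dir /user/fred/employees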


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance with ZooKeeper must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2

470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list
...
Topology_name  Status  Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2     ACTIVE  28         3            15
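A running topology can be stopped with the storm kill command; a brief sketch for the topology submitted above:

Example

[root@bright70 ~]# storm kill WordCount2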



Starting Hadoop balancer daemon (hadoop-apache121-balancer)starting balancer logging to varloghadoopapache121hdfshadoop-hdfs-balancer-bright70out

Time Stamp Iteration Bytes Moved Bytes To Move Bytes Being Moved

The cluster is balanced Exiting

[bright70-gthadoop]

Thu Mar 20 152702 2014 [notice] bright70 Started balancer for apache121 For details type events details 152727

The formathdfs CommandUsage

formathdfs ltHDFSgtThe formathdfs command formats an instance so that it can be reused

Existing Hadoop services for the instance are stopped first before format-ting HDFS and started again after formatting is complete

Example

[bright70-gthadoop] formathdfs apache121

Will now format and set up HDFS for instance rsquoapache121rsquo

Stopping datanodes done

Stopping namenodes done

Formatting HDFS done

Starting namenode (host rsquobright70rsquo) done

Starting datanodes done

Waiting for datanodes to come up done

Setting up HDFS done

[bright70-gthadoop]

The manualfailover CommandUsage

manualfailover [-f|--from ltNameNodegt] [-t|--to ltother Na-meNodegt] ltHDFSgt

The manualfailover command allows the active status of a Na-meNode to be moved to another NameNode in the Hadoop instance Thisis only available for Hadoop instances that support NameNode failoverThe active status can be move from andor to a NameNode

copy Bright Computing Inc

4Hadoop Services

41 Hadoop Services From cmgui

411 Configuring Hadoop Services From cmguiHadoop services can be configured on head or regular nodes

Configuration in cmgui can be carried out by selecting the node in-stance on which the service is to run and selecting the Hadoop tab ofthat node This then displays a list of possible Hadoop services withinsub-panes (figure 41)

Figure 41 Services In The Hadoop Tab Of A Node In cmgui

The services displayed are

bull Name Node Secondary Name Node

bull Data Node Zookeeper

bull Job Tracker Task Tracker

bull YARN Server YARN Client

bull HBase Server HBase Client

bull Journal

For each Hadoop service instance values can be edited via actions inthe associated pane

copy Bright Computing Inc

24 Hadoop Services

bull Selecting A Hadoop instance The name of the possible Hadoopinstances can be selected via a drop-down menu

bull Configuring multiple Hadoop instances More than one Hadoopinstance can run on the cluster Using the copy+ button adds an in-stance

bull Advanced options for instances The Advanced button allowsmany configuration options to be edited for each node instance foreach service This is illustrated by the example in figure 42 wherethe advanced configuration options of the Data Node service of thenode001 instance are shown

Figure 42 Advanced Configuration Options For The Data Node serviceIn cmgui

Typically care must be taken to use non-conflicting ports and direc-tories for each Hadoop instance

412 Monitoring And Modifying The Running Of HadoopServices From cmgui

The status of Hadoop services can be viewed and changed in the follow-ing locations

bull The Services tab of the node instance Within the Nodes or HeadNodes resource for a head or regular node instance the Servicestab can be selected The Hadoop services can then be viewed andaltered from within the display pane (figure 43)

Figure 43 Services In The Services Tab Of A Node In cmgui

copy Bright Computing Inc

41 Hadoop Services From cmgui 25

The options buttons act just like for any other service (section 311of the Administrator Manual) which means it includes the possibilityof acting on multiple service selections

bull The Tasks tab of the Hadoop instance Within the HadoopHDFS resource for a specific Hadoop instance the Tasks tab (sec-tion 313) conveniently allows all Hadoop daemons to be (re)startedand stopped directly via buttons

copy Bright Computing Inc

5Running Hadoop Jobs

51 Shakedown RunsThe cm-hadoop-testssh script is provided as part of Bright Clus-ter Managerrsquos cluster-tools package The administrator can use thescript to conveniently submit example jar files in the Hadoop installationto a job client of a Hadoop instance

[rootbright70 ~] cd cmlocalappscluster-toolshadoop

[rootbright70 hadoop] cm-hadoop-testssh ltinstancegt

The script runs endlessly and runs several Hadoop test scripts Ifmost lines in the run output are elided for brevity then the structure ofthe truncated output looks something like this in overview

Example

[rootbright70 hadoop] cm-hadoop-testssh apache220

================================================================

Press [CTRL+C] to stop

================================================================

================================================================

start cleaning directories

================================================================

================================================================

clean directories done

================================================================

================================================================

start doing gen_test

================================================================

140324 150537 INFO terasortTeraSort Generating 10000 using 2

140324 150538 INFO mapreduceJobSubmitter number of splits2

Job Counters

Map-Reduce Framework

copy Bright Computing Inc

28 Running Hadoop Jobs

orgapachehadoopexamplesterasortTeraGen$Counters

140324 150703 INFO terasortTeraSort starting

140324 150912 INFO terasortTeraSort done

================================================================================

gen_test done

================================================================================

================================================================================

start doing PI test

================================================================================

Working Directory = userrootbbp

During the run the Overview tab in cmgui (introduced in section 311)for the Hadoop instance should show activity as it refreshes its overviewevery three minutes (figure 51)

Figure 51 Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh the overview command shows the most recent values thatcan be retrieved when the command is run

[mk-hadoop-centos6-gthadoop] overview apache220

Parameter Value

------------------------------ ---------------------------------

Name Apache220

Capacity total 2756GB

Capacity used 7246MB

Capacity remaining 1641GB

Heap memory total 2807MB

Heap memory used 1521MB

Heap memory remaining 1287MB

Non-heap memory total 2581MB

Non-heap memory used 2519MB

Non-heap memory remaining 6155MB

Nodes available 3

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 72

Total blocks 31

Missing blocks 0

copy Bright Computing Inc

52 Example End User Job Run 29

Under-replicated blocks 2

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 59666

Applications running 1

Applications pending 0

Applications submitted 7

Applications completed 6

Applications failed 0

High availability Yes (automatic failover disabled)

Federation setup no

Role Node

-------------------------------------------------------------- -------

DataNode Journal NameNode YARNClient YARNServer ZooKeeper node001

DataNode Journal NameNode YARNClient ZooKeeper node002

52 Example End User Job RunRunning a job from a jar file individually can be done by an end user

An end user fred can be created and issued a password by the ad-ministrator (Chapter 6 of the Administrator Manual) The user must thenbe granted HDFS access for the Hadoop instance by the administrator

Example

[bright70-gtuser[fred]] set hadoophdfsaccess apache220 commit

The possible instance options are shown as tab-completion suggestionsThe access can be unset by leaving a blank for the instance option

The user fred can then submit a run from a pi value estimator fromthe example jar file as follows (some output elided)

Example

[fredbright70 ~]$ module add hadoopApache220Apache220

[fredbright70 ~]$ hadoop jar $HADOOP_PREFIXsharehadoopmapreducehadoop-mapreduce-examples-220jar pi 1 5

Job Finished in 19732 seconds

Estimated value of Pi is 400000000000000000000

The module add line is not needed if the user has the module loadedby default (section 223 of the Administrator Manual)

The input takes the number of maps and number of samples as optionsmdash1 and 5 in the example The result can be improved with greater valuesfor both

copy Bright Computing Inc

6Hadoop-related Projects

Several projects use the Hadoop framework These projects may befocused on data warehousing data-flow programming or other data-processing tasks which Hadoop can handle well Bright Cluster Managerprovides tools to help install the following projects

bull Accumulo (section 61)

bull Hive (section 62)

bull Kafka (section 63)

bull Pig (section 64)

bull Spark (section 65)

bull Sqoop (section 66)

bull Storm (section 67)

61 AccumuloApache Accumulo is a highly-scalable structured distributed key-valuestore based on Googlersquos BigTable Accumulo works on top of Hadoop andZooKeeper Accumulo stores data in HDFS and uses a richer model thanregular key-value stores Keys in Accumulo consist of several elements

An Accumulo instance includes the following main components

bull Tablet Server which manages subsets of all tables

bull Garbage Collector to delete files no longer needed

bull Master responsible of coordination

bull Tracer collection traces about Accumulo operations

bull Monitor web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo ap-plications

The Apache Accumulo tarball can be downloaded from httpaccumuloapacheorg For Hortonworks HDP 21x the Accumulotarball can be downloaded from the Hortonworks website (section 12)

copy Bright Computing Inc

32 Hadoop-related Projects

611 Accumulo Installation With cm-accumulo-setupBright Cluster Manager provides cm-accumulo-setup to carry out theinstallation of Accumulo

Prerequisites For Accumulo Installation And What Accumulo Installa-tion DoesThe following applies to using cm-accumulo-setup

bull A Hadoop instance with ZooKeeper must already be installed

bull Hadoop can be configured with a single NameNode or NameNodeHA but not with NameNode federation

bull The cm-accumulo-setup script only installs Accumulo on the ac-tive head node and on the DataNodes of the chosen Hadoop in-stance

bull The script assigns no roles to nodes

bull Accumulo executables are copied by the script to a subdirectory un-der cmsharedhadoop

bull Accumulo configuration files are copied by the script to underetchadoop This is done both on the active headnode andon the necessary image(s)

bull By default Accumulo Tablet Servers are set to use 1GB of memoryA different value can be set via cm-accumulo-setup

bull The secret string for the instance is a random string created bycm-accumulo-setup

bull A password for the root user must be specified

bull The Tracer service will use Accumulo user root to connect to Ac-cumulo

bull The services for Garbage Collector Master Tracer and Monitor areby default installed and run on the headnode They can be installedand run on another node instead as shown in the next exampleusing the --master option

bull A Tablet Server will be started on each DataNode

bull cm-accumulo-setup tries to build the native map library

bull Validation tests are carried out by the script

The options for cm-accumulo-setup are listed on runningcm-accumulo-setup -h

An Example Run With cm-accumulo-setup

The option-p ltrootpassgt

is mandatory The specified password will also be used by the Tracerservice to connect to Accumulo The password will be stored inaccumulo-sitexml with read and write permissions assigned toroot only

copy Bright Computing Inc

61 Accumulo 33

The option-s ltheapsizegt

is not mandatory If not set a default value of 1GB is used

The option--master ltnodenamegt

is not mandatory It is used to set the node on which the Garbage Collec-tor Master Tracer and Monitor services run If not set then these servicesare run on the head node by default

Example

[rootbright70 ~] cm-accumulo-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -p ltrootpassgt -s ltheapsizegt -t tmpaccumulo-162-bintargz --master node005

Accumulo release rsquo162rsquo

Accumulo GC Master Monitor and Tracer services will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 260

Accumulo being installed done

Creating directories for Accumulo done

Creating module file for Accumulo done

Creating configuration files for Accumulo done

Updating images done

Setting up Accumulo directories in HDFS done

Executing rsquoaccumulo initrsquo done

Initializing services for Accumulo (on DataNodes) done

Initializing master services for Accumulo done

Waiting for NameNode to be ready done

Executing validation test done

Installation successfully completed

Finished

612 Accumulo Removal With cm-accumulo-setupcm-accumulo-setup should also be used to remove the Accumulo in-stance Data and metadata will not be removed

Example

[rootbright70 ~] cm-accumulo-setup -u hdfs1

Requested removal of Accumulo for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Accumulo directories done

Updating images done

Removal successfully completed

Finished

613 Accumulo MapReduce ExampleAccumulo jobs must be run using accumulo system user

Example

[rootbright70 ~] su - accumulo

bash-41$ module load accumulohdfs1

bash-41$ cd $ACCUMULO_HOME

bash-41$ bintoolsh libaccumulo-examples-simplejar

copy Bright Computing Inc

34 Hadoop-related Projects

orgapacheaccumuloexamplessimplemapreduceTeraSortIngest -i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 --maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

62 HiveApache Hive is a data warehouse software It stores its data using HDFSand can query it via the SQL-like HiveQL language Metadata values forits tables and partitions are kept in the Hive Metastore which is an SQLdatabase typically MySQL or Postgres Data can be exposed to clientsusing the following client-server mechanisms

bull Metastore accessed with the hive client

bull HiveServer2 accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locationsspecified in Section 12 depending on the chosen distribution

621 Hive Installation With cm-hive-setupBright Cluster Manager provides cm-hive-setup to carry out Hive in-stallation

Prerequisites For Hive Installation And What Hive Installation DoesThe following applies to using cm-hive-setup

bull A Hadoop instance must already be installed

bull Before running the script the version of themysql-connector-java package should be checked Hiveworks with releases 5118 or earlier of this package Ifmysql-connector-java provides a newer release then thefollowing must be done to ensure that Hive setup works

ndash a suitable 5118 or earlier release of ConnectorJ isdownloaded from httpdevmysqlcomdownloadsconnectorj

ndash cm-hive-setup is run with the --conn option to specify theconnector version to use

Example

--conn tmpmysql-connector-java-5118-binjar

bull Before running the script the following statements must be exe-cuted explicitly by the administrator using a MySQL client

GRANT ALL PRIVILEGES ON ltmetastoredbgt TO rsquohiversquorsquorsquoIDENTIFIED BY rsquolthivepassgtrsquoFLUSH PRIVILEGES

DROP DATABASE IF EXISTS ltmetastoredbgt

In the preceding statements

ndash ltmetastoredbgt is the name of metastore database to be used byHive The same name is used later by cm-hive-setup

copy Bright Computing Inc

62 Hive 35

ndash lthivepassgt is the password for hive user The same passwordis used later by cm-hive-setup

ndash The DROP line is needed only if a database with that namealready exists

bull The cm-hive-setup script installs Hive by default on the activehead node

It can be installed on another node instead as shown in the nextexample with the use of the --master option In that case Con-nectorJ should be installed in the software image of the node

bull The script assigns no roles to nodes

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script using hive and beeline.
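The required database privileges can be granted in a short MySQL session beforehand. The following is a minimal sketch, with metastore_hdfs1 used as an illustrative metastore database name; <hivepass> is the password that is passed later to cm-hive-setup:

Example

[root@bright70 ~]# mysql -u root -p
mysql> GRANT ALL PRIVILEGES ON metastore_hdfs1.* TO 'hive'@'%' IDENTIFIED BY '<hivepass>';
mysql> FLUSH PRIVILEGES;
mysql> DROP DATABASE IF EXISTS metastore_hdfs1;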

An Example Run With cm-hive-setup

Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-p <hivepass> --metastoredb <metastoredb> \
-t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005

Hive release '1.1.0-bin'
Using MySQL server on active headnode.
Hive service will be run on node node005.
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client...
-- testing 'beeline' client...
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup

cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline

The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net). In the following example, Beeline connects to HiveServer2.

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'

Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000
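Beeline can run any HiveQL statement in the same way. For instance, a row count of the test table listed above could be retrieved as follows (the query itself is illustrative only):

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SELECT COUNT(*) FROM test;'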

6.3 Kafka

Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup

Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does

The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/kafka_2.11-0.8.2.1.tgz

Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
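A quick functional check can be carried out with the console clients shipped with Kafka. The following sketch is not part of the setup script; it assumes that node001 is one of the ZooKeeper nodes, that ZooKeeper listens on its default port 2181, and that the Kafka broker listens on its default port 9092:

Example

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 \
--replication-factor 1 --partitions 1 --topic smoketest
[root@bright70 ~]# echo hello | kafka-console-producer.sh \
--broker-list node001:9092 --topic smoketest
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 \
--topic smoketest --from-beginning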

6.3.2 Kafka Removal With cm-kafka-setup

cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1

Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig

Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup

Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation And What Pig Installation Does

The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/pig-0.14.0.tar.gz

Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup

cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1

Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig

Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and process data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig


14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
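The contents of /tmp/smoke.pig are not shown in the preceding run. A Pig Latin script of that kind might look like the following hypothetical word-count sketch, which assumes an input file /user/root/input.txt already exists in HDFS:

Example

[root@bright70 ~]# cat > /tmp/smoke.pig <<'EOF'
-- tokenize each line and count the occurrences of each word
lines = LOAD '/user/root/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;
EOF
[root@bright70 ~]# pig -v -f /tmp/smoke.pig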

6.5 Spark

Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup

Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does

The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


The Spark Master service runs on the active head node by default, but it can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz

Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup in standalone mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005

Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode.
Spark Master service will be run on node node005.
Spark will use all DataNodes as WorkerNodes.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.
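After a Standalone mode installation such as the preceding one, applications can be submitted directly to the Spark Master. The following sketch assumes the Master runs on node005, listening on Spark's default master port 7077:

Example

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master spark://node005:7077 \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar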

6.5.2 Spark Removal With cm-spark-setup

cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark

Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar
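In "yarn-cluster" mode the driver output, including the value of Pi printed by SparkPi, goes to the application logs rather than to the console. Assuming YARN log aggregation is enabled, it can be retrieved afterwards, with a placeholder application ID:

Example

[root@bright70 ~]# yarn logs -applicationId <application id> | grep "Pi is roughly"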

6.6 Sqoop

Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup

Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation And What Sqoop Installation Does

The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz \
--conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005

Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005.
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
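Once Sqoop is installed, a table can be imported from MySQL into HDFS. The following sketch is illustrative only: the MySQL host <mysqlhost>, the database mydb, the table employees, and the user sqoopuser are hypothetical, and -P prompts interactively for the password:

Example

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://<mysqlhost>/mydb \
--username sqoopuser -P --table employees --target-dir /user/root/employees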


6.6.2 Sqoop Removal With cm-sqoop-setup

cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm

Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemon to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while each worker node for Storm runs an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup

Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does

The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t apache-storm-0.9.4.tar.gz --nimbus node005

Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005.
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup

The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm

The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar \
storm.starter.WordCountTopology WordCount2

470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list

Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15
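Besides using the Storm UI, a running topology can also be stopped from the command line. For example, the topology submitted above can be removed with:

Example

[root@hadoopdev ~]# storm kill WordCount2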


20 Viewing And Managing HDFS From CMDaemon

Heap memory used 0B

Heap memory remaining 0B

Non-heap memory total 0B

Non-heap memory used 0B

Non-heap memory remaining 0B

Nodes available 0

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 0

Total blocks 0

Missing blocks 0

Under-replicated blocks 0

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 0

Applications running 0

Applications pending 0

Applications submitted 0

Applications completed 0

Applications failed 0

Federation setup no

Role Node

------------------------------ ---------------------------

DataNode ZooKeeper bright70node001node002

322 The Tasks CommandsWithin hadoopmode the following commands run tasks that correspondmostly to the Tasks tab (section 313) in the cmgui Hadoop resource

The services Commands For Hadoop ServicesHadoop services can be started stopped and restarted with

bull restartallservices

bull startallservices

bull stopallservices

Example

[bright70-gthadoop] restartallservices apache121

Will now stop all Hadoop services for instance rsquoapache121rsquo done

Will now start all Hadoop services for instance rsquoapache121rsquo done

[bright70-gthadoop]

The startbalancer And stopbalancer Commands For HadoopFor efficient access to HDFS file block level usage across nodes shouldbe reasonably balanced Hadoop can start and stop balancing across theinstance with the following commands

bull startbalancer

bull stopbalancer

The balancer policy threshold and period can be retrieved or set forthe instance using the parameters

copy Bright Computing Inc

32 cmsh And hadoop Mode 21

bull balancerperiod

bull balancerthreshold

bull balancingpolicy

Example

[bright70-gthadoop] get apache121 balancerperiod

2

[bright70-gthadoop] set apache121 balancerperiod 3

[bright70-gthadoop] commit

[bright70-gthadoop] startbalancer apache121

Code 0

Starting Hadoop balancer daemon (hadoop-apache121-balancer)starting balancer logging to varloghadoopapache121hdfshadoop-hdfs-balancer-bright70out

Time Stamp Iteration Bytes Moved Bytes To Move Bytes Being Moved

The cluster is balanced Exiting

[bright70-gthadoop]

Thu Mar 20 152702 2014 [notice] bright70 Started balancer for apache121 For details type events details 152727

The formathdfs CommandUsage

formathdfs ltHDFSgtThe formathdfs command formats an instance so that it can be reused

Existing Hadoop services for the instance are stopped first before format-ting HDFS and started again after formatting is complete

Example

[bright70-gthadoop] formathdfs apache121

Will now format and set up HDFS for instance rsquoapache121rsquo

Stopping datanodes done

Stopping namenodes done

Formatting HDFS done

Starting namenode (host rsquobright70rsquo) done

Starting datanodes done

Waiting for datanodes to come up done

Setting up HDFS done

[bright70-gthadoop]

The manualfailover CommandUsage

manualfailover [-f|--from ltNameNodegt] [-t|--to ltother Na-meNodegt] ltHDFSgt

The manualfailover command allows the active status of a Na-meNode to be moved to another NameNode in the Hadoop instance Thisis only available for Hadoop instances that support NameNode failoverThe active status can be move from andor to a NameNode

copy Bright Computing Inc

4Hadoop Services

41 Hadoop Services From cmgui

411 Configuring Hadoop Services From cmguiHadoop services can be configured on head or regular nodes

Configuration in cmgui can be carried out by selecting the node in-stance on which the service is to run and selecting the Hadoop tab ofthat node This then displays a list of possible Hadoop services withinsub-panes (figure 41)

Figure 41 Services In The Hadoop Tab Of A Node In cmgui

The services displayed are

bull Name Node Secondary Name Node

bull Data Node Zookeeper

bull Job Tracker Task Tracker

bull YARN Server YARN Client

bull HBase Server HBase Client

bull Journal

For each Hadoop service instance values can be edited via actions inthe associated pane

copy Bright Computing Inc

24 Hadoop Services

bull Selecting A Hadoop instance The name of the possible Hadoopinstances can be selected via a drop-down menu

bull Configuring multiple Hadoop instances More than one Hadoopinstance can run on the cluster Using the copy+ button adds an in-stance

bull Advanced options for instances The Advanced button allowsmany configuration options to be edited for each node instance foreach service This is illustrated by the example in figure 42 wherethe advanced configuration options of the Data Node service of thenode001 instance are shown

Figure 42 Advanced Configuration Options For The Data Node serviceIn cmgui

Typically care must be taken to use non-conflicting ports and direc-tories for each Hadoop instance

412 Monitoring And Modifying The Running Of HadoopServices From cmgui

The status of Hadoop services can be viewed and changed in the follow-ing locations

bull The Services tab of the node instance Within the Nodes or HeadNodes resource for a head or regular node instance the Servicestab can be selected The Hadoop services can then be viewed andaltered from within the display pane (figure 43)

Figure 43 Services In The Services Tab Of A Node In cmgui

copy Bright Computing Inc

41 Hadoop Services From cmgui 25

The options buttons act just like for any other service (section 311of the Administrator Manual) which means it includes the possibilityof acting on multiple service selections

bull The Tasks tab of the Hadoop instance Within the HadoopHDFS resource for a specific Hadoop instance the Tasks tab (sec-tion 313) conveniently allows all Hadoop daemons to be (re)startedand stopped directly via buttons

copy Bright Computing Inc

5Running Hadoop Jobs

51 Shakedown RunsThe cm-hadoop-testssh script is provided as part of Bright Clus-ter Managerrsquos cluster-tools package The administrator can use thescript to conveniently submit example jar files in the Hadoop installationto a job client of a Hadoop instance

[rootbright70 ~] cd cmlocalappscluster-toolshadoop

[rootbright70 hadoop] cm-hadoop-testssh ltinstancegt

The script runs endlessly and runs several Hadoop test scripts Ifmost lines in the run output are elided for brevity then the structure ofthe truncated output looks something like this in overview

Example

[rootbright70 hadoop] cm-hadoop-testssh apache220

================================================================

Press [CTRL+C] to stop

================================================================

================================================================

start cleaning directories

================================================================

================================================================

clean directories done

================================================================

================================================================

start doing gen_test

================================================================

140324 150537 INFO terasortTeraSort Generating 10000 using 2

140324 150538 INFO mapreduceJobSubmitter number of splits2

Job Counters

Map-Reduce Framework

copy Bright Computing Inc

28 Running Hadoop Jobs

orgapachehadoopexamplesterasortTeraGen$Counters

140324 150703 INFO terasortTeraSort starting

140324 150912 INFO terasortTeraSort done

================================================================================

gen_test done

================================================================================

================================================================================

start doing PI test

================================================================================

Working Directory = userrootbbp

During the run the Overview tab in cmgui (introduced in section 311)for the Hadoop instance should show activity as it refreshes its overviewevery three minutes (figure 51)

Figure 51 Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh the overview command shows the most recent values thatcan be retrieved when the command is run

[mk-hadoop-centos6-gthadoop] overview apache220

Parameter Value

------------------------------ ---------------------------------

Name Apache220

Capacity total 2756GB

Capacity used 7246MB

Capacity remaining 1641GB

Heap memory total 2807MB

Heap memory used 1521MB

Heap memory remaining 1287MB

Non-heap memory total 2581MB

Non-heap memory used 2519MB

Non-heap memory remaining 6155MB

Nodes available 3

Nodes dead 0

Nodes decommissioned 0

Nodes decommission in progress 0

Total files 72

Total blocks 31

Missing blocks 0

copy Bright Computing Inc

52 Example End User Job Run 29

Under-replicated blocks 2

Scheduled replication blocks 0

Pending replication blocks 0

Block report average Time 59666

Applications running 1

Applications pending 0

Applications submitted 7

Applications completed 6

Applications failed 0

High availability Yes (automatic failover disabled)

Federation setup no

Role Node

-------------------------------------------------------------- -------

DataNode Journal NameNode YARNClient YARNServer ZooKeeper node001

DataNode Journal NameNode YARNClient ZooKeeper node002

52 Example End User Job RunRunning a job from a jar file individually can be done by an end user

An end user fred can be created and issued a password by the ad-ministrator (Chapter 6 of the Administrator Manual) The user must thenbe granted HDFS access for the Hadoop instance by the administrator

Example

[bright70-gtuser[fred]] set hadoophdfsaccess apache220 commit

The possible instance options are shown as tab-completion suggestionsThe access can be unset by leaving a blank for the instance option

The user fred can then submit a run from a pi value estimator fromthe example jar file as follows (some output elided)

Example

[fredbright70 ~]$ module add hadoopApache220Apache220

[fredbright70 ~]$ hadoop jar $HADOOP_PREFIXsharehadoopmapreducehadoop-mapreduce-examples-220jar pi 1 5

Job Finished in 19732 seconds

Estimated value of Pi is 400000000000000000000

The module add line is not needed if the user has the module loadedby default (section 223 of the Administrator Manual)

The input takes the number of maps and number of samples as optionsmdash1 and 5 in the example The result can be improved with greater valuesfor both

copy Bright Computing Inc

6Hadoop-related Projects

Several projects use the Hadoop framework These projects may befocused on data warehousing data-flow programming or other data-processing tasks which Hadoop can handle well Bright Cluster Managerprovides tools to help install the following projects

bull Accumulo (section 61)

bull Hive (section 62)

bull Kafka (section 63)

bull Pig (section 64)

bull Spark (section 65)

bull Sqoop (section 66)

bull Storm (section 67)

61 AccumuloApache Accumulo is a highly-scalable structured distributed key-valuestore based on Googlersquos BigTable Accumulo works on top of Hadoop andZooKeeper Accumulo stores data in HDFS and uses a richer model thanregular key-value stores Keys in Accumulo consist of several elements

An Accumulo instance includes the following main components

bull Tablet Server which manages subsets of all tables

bull Garbage Collector to delete files no longer needed

bull Master responsible of coordination

bull Tracer collection traces about Accumulo operations

bull Monitor web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo ap-plications

The Apache Accumulo tarball can be downloaded from httpaccumuloapacheorg For Hortonworks HDP 21x the Accumulotarball can be downloaded from the Hortonworks website (section 12)

copy Bright Computing Inc

32 Hadoop-related Projects

611 Accumulo Installation With cm-accumulo-setupBright Cluster Manager provides cm-accumulo-setup to carry out theinstallation of Accumulo

Prerequisites For Accumulo Installation And What Accumulo Installa-tion DoesThe following applies to using cm-accumulo-setup

bull A Hadoop instance with ZooKeeper must already be installed

bull Hadoop can be configured with a single NameNode or NameNodeHA but not with NameNode federation

bull The cm-accumulo-setup script only installs Accumulo on the ac-tive head node and on the DataNodes of the chosen Hadoop in-stance

bull The script assigns no roles to nodes

bull Accumulo executables are copied by the script to a subdirectory un-der cmsharedhadoop

bull Accumulo configuration files are copied by the script to underetchadoop This is done both on the active headnode andon the necessary image(s)

bull By default Accumulo Tablet Servers are set to use 1GB of memoryA different value can be set via cm-accumulo-setup

bull The secret string for the instance is a random string created bycm-accumulo-setup

bull A password for the root user must be specified

bull The Tracer service will use Accumulo user root to connect to Ac-cumulo

bull The services for Garbage Collector Master Tracer and Monitor areby default installed and run on the headnode They can be installedand run on another node instead as shown in the next exampleusing the --master option

bull A Tablet Server will be started on each DataNode

bull cm-accumulo-setup tries to build the native map library

bull Validation tests are carried out by the script

The options for cm-accumulo-setup are listed on runningcm-accumulo-setup -h

An Example Run With cm-accumulo-setup

The option-p ltrootpassgt

is mandatory The specified password will also be used by the Tracerservice to connect to Accumulo The password will be stored inaccumulo-sitexml with read and write permissions assigned toroot only

copy Bright Computing Inc

61 Accumulo 33

The option-s ltheapsizegt

is not mandatory If not set a default value of 1GB is used

The option--master ltnodenamegt

is not mandatory It is used to set the node on which the Garbage Collec-tor Master Tracer and Monitor services run If not set then these servicesare run on the head node by default

Example

[rootbright70 ~] cm-accumulo-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -p ltrootpassgt -s ltheapsizegt -t tmpaccumulo-162-bintargz --master node005

Accumulo release rsquo162rsquo

Accumulo GC Master Monitor and Tracer services will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 260

Accumulo being installed done

Creating directories for Accumulo done

Creating module file for Accumulo done

Creating configuration files for Accumulo done

Updating images done


Example

[rootbright70 ~] cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Storm directories done

Updating images done

Removal successfully completed

Finished

673 Using StormThe following example shows how to submit a topology and then verifythat it has been submitted successfully (some lines elided)

[rootbright70 ~] module load stormhdfs1

[rootbright70 ~] storm jar cmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-jar stormstarterWordCountTopology WordCount2

470 [main] INFO backtypestormStormSubmitter - Jar not uploaded to m

copy Bright Computing Inc

67 Storm 45

aster yet Submitting jar

476 [main] INFO backtypestormStormSubmitter - Uploading topology jar cmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jar to assigned location tmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jar

Start uploading file rsquocmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jarrsquo to rsquotmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jarrsquo (3248859 bytes)

[==================================================] 3248859 3248859

File rsquocmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jarrsquo uploaded to rsquotmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jarrsquo(3248859 bytes)

508 [main] INFO backtypestormStormSubmitter - Successfully uploadedtopology jar to assigned location tmpstorm-hdfs1-localnimbusinbox

stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jar

508 [main] INFO backtypestormStormSubmitter - Submitting topology WordCount2 in distributed mode with conf topologyworkers3topologydebugtrue

687 [main] INFO backtypestormStormSubmitter - Finished submitting topology WordCount2

[roothadoopdev ~] storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 26: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

32 cmsh And hadoop Mode 21

• balancerperiod

• balancerthreshold

• balancingpolicy

Example

[bright70->hadoop]% get apache121 balancerperiod
2
[bright70->hadoop]% set apache121 balancerperiod 3
[bright70->hadoop]% commit
[bright70->hadoop]% startbalancer apache121
Code: 0
Starting Hadoop balancer daemon (hadoop-apache121-balancer): starting balancer, logging to /var/log/hadoop/apache121/hdfs/hadoop-hdfs-balancer-bright70.out
Time Stamp    Iteration#    Bytes Moved    Bytes To Move    Bytes Being Moved
The cluster is balanced. Exiting...
[bright70->hadoop]%
Thu Mar 20 15:27:02 2014 [notice] bright70: Started balancer for apache121. For details type: events details 152727

The formathdfs Command
Usage: formathdfs <HDFS>

The formathdfs command formats an instance so that it can be reused. Existing Hadoop services for the instance are stopped first before formatting HDFS, and started again after formatting is complete.

Example

[bright70->hadoop]% formathdfs apache121
Will now format and set up HDFS for instance 'apache121'.
Stopping datanodes... done.
Stopping namenodes... done.
Formatting HDFS... done.
Starting namenode (host 'bright70')... done.
Starting datanodes... done.
Waiting for datanodes to come up... done.
Setting up HDFS... done.
[bright70->hadoop]%

The manualfailover Command
Usage: manualfailover [-f|--from <NameNode>] [-t|--to <other NameNode>] <HDFS>

The manualfailover command allows the active status of a NameNode to be moved to another NameNode in the Hadoop instance. This is only available for Hadoop instances that support NameNode failover. The active status can be moved from, and/or to, a NameNode.
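
A minimal invocation sketch, for a hypothetical HA-enabled instance hdfs1 with NameNodes on node001 and node002 (the node names are illustrative):

[bright70->hadoop]% manualfailover --from node001 --to node002 hdfs1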


4 Hadoop Services

4.1 Hadoop Services From cmgui

4.1.1 Configuring Hadoop Services From cmgui
Hadoop services can be configured on head or regular nodes.

Configuration in cmgui can be carried out by selecting the node instance on which the service is to run, and selecting the Hadoop tab of that node. This then displays a list of possible Hadoop services within sub-panes (figure 4.1).

Figure 4.1: Services In The Hadoop Tab Of A Node In cmgui

The services displayed are:

• Name Node, Secondary Name Node

• Data Node, Zookeeper

• Job Tracker, Task Tracker

• YARN Server, YARN Client

• HBase Server, HBase Client

• Journal

For each Hadoop service, instance values can be edited via actions in the associated pane:


• Selecting A Hadoop instance: The name of the possible Hadoop instances can be selected via a drop-down menu.

• Configuring multiple Hadoop instances: More than one Hadoop instance can run on the cluster. Using the + button adds an instance.

• Advanced options for instances: The Advanced button allows many configuration options to be edited for each node instance, for each service. This is illustrated by the example in figure 4.2, where the advanced configuration options of the Data Node service of the node001 instance are shown.

Figure 4.2: Advanced Configuration Options For The Data Node Service In cmgui

Typically, care must be taken to use non-conflicting ports and directories for each Hadoop instance.

4.1.2 Monitoring And Modifying The Running Of Hadoop Services From cmgui
The status of Hadoop services can be viewed and changed in the following locations:

• The Services tab of the node instance: Within the Nodes or Head Nodes resource, for a head or regular node instance, the Services tab can be selected. The Hadoop services can then be viewed and altered from within the display pane (figure 4.3).

Figure 4.3: Services In The Services Tab Of A Node In cmgui


The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means that acting on multiple service selections is also possible.

• The Tasks tab of the Hadoop instance: Within the Hadoop HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.


5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop/
[root@bright70 hadoop]# ./cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# ./cm-hadoop-tests.sh apache220

================================================================

Press [CTRL+C] to stop

================================================================

================================================================

start cleaning directories

================================================================

================================================================

clean directories done

================================================================

================================================================

start doing gen_test

================================================================

14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2

14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2

Job Counters

Map-Reduce Framework


org.apache.hadoop.examples.terasort.TeraGen$Counters

14/03/24 15:07:03 INFO terasort.TeraSort: starting

14/03/24 15:09:12 INFO terasort.TeraSort: done

================================================================================

gen_test done

================================================================================

================================================================================

start doing PI test

================================================================================

Working Directory = /user/root/bbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220

Parameter                        Value
------------------------------   ---------------------------------
Name                             Apache220
Capacity total                   27.56GB
Capacity used                    72.46MB
Capacity remaining               16.41GB
Heap memory total                280.7MB
Heap memory used                 152.1MB
Heap memory remaining            128.7MB
Non-heap memory total            258.1MB
Non-heap memory used             251.9MB
Non-heap memory remaining        6.155MB
Nodes available                  3
Nodes dead                       0
Nodes decommissioned             0
Nodes decommission in progress   0
Total files                      72
Total blocks                     31
Missing blocks                   0
Under-replicated blocks          2
Scheduled replication blocks     0
Pending replication blocks       0
Block report average Time        59666
Applications running             1
Applications pending             0
Applications submitted           7
Applications completed           6
Applications failed              0
High availability                Yes (automatic failover disabled)
Federation setup                 no

Role                                                           Node
-------------------------------------------------------------- -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper             node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.

The user fred can then submit a run from a pi value estimator, from the example jar file, as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
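
For instance, a run with more maps and samples (the values here are illustrative) yields a closer estimate:

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 10000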


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use the Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are by default installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option -p <rootpass> is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option -s <heapsize> is not mandatory. If not set, a default value of 1GB is used.

The option --master <nodename> is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME

bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest \
-i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 \
--minKeySize 10 --maxKeySize 10 --minValueSize 78 --maxValueSize 78 \
--table sort --splits 10
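
The ingested rows can then be inspected from the Accumulo shell. A minimal check, assuming the root password secret used above (the table name sort comes from the ingest command):

bash-4.1$ $ACCUMULO_HOME/bin/accumulo shell -u root -p secret -e 'scan -t sort'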

6.2 Hive
Apache Hive is a data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use.

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;

DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.


– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse/.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script, using hive and beeline.

An Example Run With cm-hive-setup
Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client
-- testing 'beeline' client
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000
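
Beeline can also execute HiveQL statements with the same -e option. A small sketch, assuming the HiveServer2 endpoint above; the table name demo is illustrative:

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'CREATE TABLE demo (id INT, name STRING);'

The new table then shows up in a subsequent SHOW TABLES, as in the preceding example.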

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
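
A quick functional check can be carried out with the console scripts shipped in the Kafka tarball. The following sketch is illustrative only: it assumes the module name kafka/hdfs1, following the pattern of the other setup scripts, that a broker listens on node001:9092, and that ZooKeeper runs on node001:2181:

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 \
--replication-factor 1 --partitions 1 --topic test
[root@bright70 ~]# kafka-console-producer.sh --broker-list node001:9092 --topic test
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 --topic test --from-beginning

Messages typed into the producer should then be echoed by the consumer.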

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig


14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
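
A script passed with -f contains ordinary Pig Latin statements. A minimal word-count sketch, with illustrative input and output paths (this is not the content of any script shipped with Bright Cluster Manager):

-- wordcount.pig: count word occurrences in a text file stored in HDFS
lines  = LOAD '/user/root/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/user/root/wordcount_out';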

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup In YARN Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup In Standalone Mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode
Spark Master service will be run on node node005
Spark will use all DataNodes as WorkerNodes
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.


Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar
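
In yarn-cluster mode the driver output, including the computed value of pi, goes to the YARN container logs rather than to the terminal. Assuming log aggregation is enabled, it can be retrieved afterwards with the yarn logs command; the application ID below is illustrative, and is printed by spark-submit during the run:

[root@bright70 ~]# yarn logs -applicationId application_1428497227446_0001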

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop: it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup
Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
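
Once installed, a table can be imported from MySQL into HDFS with the sqoop client. A minimal sketch, assuming the module name sqoop/hdfs1, and an illustrative MySQL database mydb with a table employees on the head node:

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://bright70/mydb \
--username sqoopuser -P --table employees \
--target-dir /user/root/employees -m 1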


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/\
apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar \
storm.starter.WordCountTopology WordCount2

470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...

476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list
...
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15
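
Besides using the Storm UI, a running topology can also be stopped from the command line with Storm's kill command. For the topology submitted above:

[root@bright70 ~]# storm kill WordCount2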

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 27: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

4Hadoop Services

41 Hadoop Services From cmgui

411 Configuring Hadoop Services From cmguiHadoop services can be configured on head or regular nodes

Configuration in cmgui can be carried out by selecting the node in-stance on which the service is to run and selecting the Hadoop tab ofthat node This then displays a list of possible Hadoop services withinsub-panes (figure 41)

Figure 41 Services In The Hadoop Tab Of A Node In cmgui

The services displayed are

bull Name Node Secondary Name Node

bull Data Node Zookeeper

bull Job Tracker Task Tracker

bull YARN Server YARN Client

bull HBase Server HBase Client

bull Journal

For each Hadoop service instance values can be edited via actions inthe associated pane

copy Bright Computing Inc

24 Hadoop Services

bull Selecting A Hadoop instance The name of the possible Hadoopinstances can be selected via a drop-down menu

bull Configuring multiple Hadoop instances More than one Hadoopinstance can run on the cluster Using the copy+ button adds an in-stance

bull Advanced options for instances The Advanced button allowsmany configuration options to be edited for each node instance foreach service This is illustrated by the example in figure 42 wherethe advanced configuration options of the Data Node service of thenode001 instance are shown

Figure 42 Advanced Configuration Options For The Data Node serviceIn cmgui

Typically care must be taken to use non-conflicting ports and direc-tories for each Hadoop instance

412 Monitoring And Modifying The Running Of HadoopServices From cmgui

The status of Hadoop services can be viewed and changed in the follow-ing locations

bull The Services tab of the node instance Within the Nodes or HeadNodes resource for a head or regular node instance the Servicestab can be selected The Hadoop services can then be viewed andaltered from within the display pane (figure 43)

Figure 43 Services In The Services Tab Of A Node In cmgui

copy Bright Computing Inc

41 Hadoop Services From cmgui 25

The options buttons act just like for any other service (section 311of the Administrator Manual) which means it includes the possibilityof acting on multiple service selections

bull The Tasks tab of the Hadoop instance Within the HadoopHDFS resource for a specific Hadoop instance the Tasks tab (sec-tion 313) conveniently allows all Hadoop daemons to be (re)startedand stopped directly via buttons

copy Bright Computing Inc

5Running Hadoop Jobs

51 Shakedown RunsThe cm-hadoop-testssh script is provided as part of Bright Clus-ter Managerrsquos cluster-tools package The administrator can use thescript to conveniently submit example jar files in the Hadoop installationto a job client of a Hadoop instance

[rootbright70 ~] cd cmlocalappscluster-toolshadoop

[rootbright70 hadoop] cm-hadoop-testssh ltinstancegt

The script runs endlessly and runs several Hadoop test scripts Ifmost lines in the run output are elided for brevity then the structure ofthe truncated output looks something like this in overview

Example

[rootbright70 hadoop] cm-hadoop-testssh apache220

================================================================

Press [CTRL+C] to stop

================================================================

================================================================

start cleaning directories

================================================================

================================================================

clean directories done

================================================================

================================================================

start doing gen_test

================================================================

140324 150537 INFO terasortTeraSort Generating 10000 using 2

140324 150538 INFO mapreduceJobSubmitter number of splits2

Job Counters

Map-Reduce Framework

copy Bright Computing Inc

28 Running Hadoop Jobs

orgapachehadoopexamplesterasortTeraGen$Counters

140324 150703 INFO terasortTeraSort starting

140324 150912 INFO terasortTeraSort done

================================================================================

gen_test done

================================================================================

================================================================================

start doing PI test

================================================================================

Working Directory = userrootbbp

During the run the Overview tab in cmgui (introduced in section 311)for the Hadoop instance should show activity as it refreshes its overviewevery three minutes (figure 51)

Figure 51 Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh the overview command shows the most recent values thatcan be retrieved when the command is run

[mk-hadoop-centos6-gthadoop] overview apache220

Parameter                       Value
------------------------------  ---------------------------------
Name                            Apache220
Capacity total                  2756GB
Capacity used                   7246MB
Capacity remaining              1641GB
Heap memory total               2807MB
Heap memory used                1521MB
Heap memory remaining           1287MB
Non-heap memory total           2581MB
Non-heap memory used            2519MB
Non-heap memory remaining       6155MB
Nodes available                 3
Nodes dead                      0
Nodes decommissioned            0
Nodes decommission in progress  0
Total files                     72
Total blocks                    31
Missing blocks                  0
Under-replicated blocks         2
Scheduled replication blocks    0
Pending replication blocks      0
Block report average Time       59666
Applications running            1
Applications pending            0
Applications submitted          7
Applications completed          6
Applications failed             0
High availability               Yes (automatic failover disabled)
Federation setup                no

Role                                                            Node
--------------------------------------------------------------  -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper  node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper              node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.
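A minimal cmsh sketch of clearing the access again for fred follows; the exact blank-value syntax is an assumption here, and may vary between cmsh versions:

[bright70->user[fred]]% set hadoophdfsaccess ""; commit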

The user fred can then submit a run of a pi value estimator from the example jar file as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5

Job Finished in 19.732 seconds

Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options, 1 and 5 in the example. The result can be improved with greater values for both.
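For instance, a run over the same example jar with more maps and samples gives a closer estimate. The values 16 and 100000 here are merely illustrative, and the job takes correspondingly longer:

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 16 100000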


6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, which collects traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are by default installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option

-p <rootpass>

is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.


The option

-s <heapsize>

is not mandatory. If not set, a default value of 1GB is used.

The option

--master <nodename>

is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 \
-j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <rootpass> \
-s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005

Accumulo release '1.6.2'

Accumulo GC, Master, Monitor, Tracer services will be run on node node005

Found Hadoop instance 'hdfs1', release 2.6.0

Accumulo being installed... done.

Creating directories for Accumulo... done.

Creating module file for Accumulo... done.

Creating configuration files for Accumulo... done.

Updating images... done.

Setting up Accumulo directories in HDFS... done.

Executing 'accumulo init'... done.

Initializing services for Accumulo (on DataNodes)... done.

Initializing master services for Accumulo... done.

Waiting for NameNode to be ready... done.

Executing validation test... done.

Installation successfully completed.

Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1

Requested removal of Accumulo for Hadoop instance 'hdfs1'

Stopping/removing services... done.

Removing module file... done.

Removing additional Accumulo directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run as the accumulo system user.

Example

[root@bright70 ~]# su - accumulo

bash-4.1$ module load accumulo/hdfs1

bash-4.1$ cd $ACCUMULO_HOME

bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest \
-i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 \
--minKeySize 10 --maxKeySize 10 --minValueSize 78 --maxValueSize 78 \
--table sort --splits 10
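The ingested table can then be inspected from the Accumulo shell. The following is a sketch only, reusing the root password secret and the table name sort from the ingest command above:

bash-4.1$ bin/accumulo shell -u root -p secret -e 'scan -t sort'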

6.2 Hive
Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to use.

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client (a full sample session is sketched after this list):

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;

DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.


– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node.

It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script using hive and beeline.
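As an illustration of the MySQL statements required before the script is run, a client session for a metastore database named metastore_hdfs1 (the name also used in the example run below) might look like the following sketch:

[root@bright70 ~]# mysql -u root -p
mysql> GRANT ALL PRIVILEGES ON metastore_hdfs1.* TO 'hive'@'%'
    -> IDENTIFIED BY '<hivepass>';
mysql> FLUSH PRIVILEGES;
mysql> DROP DATABASE IF EXISTS metastore_hdfs1;
mysql> quit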

An Example Run With cm-hive-setup

Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 \
-j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -p <hivepass> \
--metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz \
--master node005

Hive release '1.1.0-bin'

Using MySQL server on active headnode

Hive service will be run on node node005

Using MySQL Connector/J installed in /usr/share/java/

Hive being installed... done.

Creating directories for Hive... done.

Creating module file for Hive... done.

Creating configuration files for Hive... done.

Initializing database 'metastore_hdfs1' in MySQL... done.

Waiting for NameNode to be ready... done.

Creating HDFS directories for Hive... done.

Updating images... done.

Waiting for NameNode to be ready... done.

Hive setup validation...

-- testing 'hive' client

-- testing 'beeline' client

Hive setup validation... done.

Installation successfully completed.

Finished.


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance 'hdfs1'

Stopping/removing services... done.

Removing module file... done.

Removing additional Hive directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'

Connecting to jdbc:hive2://node005.cm.cluster:10000

Connected to: Apache Hive (version 1.1.0)

Driver: Hive JDBC (version 1.1.0)

Transaction isolation: TRANSACTION_REPEATABLE_READ

+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+

2 rows selected (0.243 seconds)

Beeline version 1.1.0 by Apache Hive

Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 \
-j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/kafka_2.11-0.8.2.1.tgz

Kafka release '0.8.2.1' for Scala '2.11'

Found Hadoop instance 'hdfs1', release 1.2.1

Kafka being installed... done.

Creating directories for Kafka... done.

Creating module file for Kafka... done.

Creating configuration files for Kafka... done.

Updating images... done.

Initializing services for Kafka (on ZooKeeper nodes)... done.

Executing validation test... done.

Installation successfully completed.

Finished.
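After installation, basic operation can be checked with the command-line tools shipped inside the Kafka tarball. The following is a sketch only: the module name kafka/hdfs1 and the ZooKeeper endpoint node001:2181 are assumptions that depend on the instance:

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 \
--replication-factor 1 --partitions 1 --topic test
[root@bright70 ~]# kafka-topics.sh --list --zookeeper node001:2181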

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1

Requested removal of Kafka for Hadoop instance 'hdfs1'

Stopping/removing services... done.

Removing module file... done.

Removing additional Kafka directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 \
-j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/pig-0.14.0.tar.gz

Pig release '0.14.0'

Pig being installed... done.

Creating directories for Pig... done.

Creating module file for Pig... done.

Creating configuration files for Pig... done.

Waiting for NameNode to be ready...

Waiting for NameNode to be ready... done.

Validating Pig setup...

Validating Pig setup... done.

Installation successfully completed.

Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1

Requested removal of Pig for Hadoop instance 'hdfs1'

Stopping/removing services... done.

Removing module file... done.

Removing additional Pig directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig


14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
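The setup does not fix the contents of such a script. As an illustration, a minimal word-count script along the lines of the /tmp/smoke.pig used above could look like the following sketch, where the input path /user/root/input.txt is a hypothetical HDFS file:

[root@bright70 ~]# cat /tmp/smoke.pig
lines = LOAD '/user/root/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;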

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with graphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 \
-j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release 2.6.0

Spark will be installed in YARN (client/cluster) mode

Spark being installed... done.

Creating directories for Spark... done.

Creating module file for Spark... done.

Creating configuration files for Spark... done.

Waiting for NameNode to be ready... done.

Copying Spark assembly jar to HDFS... done.

Waiting for NameNode to be ready... done.

Validating Spark setup... done.

Installation successfully completed.

Finished.

An Example Run With cm-spark-setup in standalone mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 \
-j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release 2.6.0

Spark will be installed to work in Standalone mode

Spark Master service will be run on node node005

Spark will use all DataNodes as WorkerNodes

Spark being installed... done.

Creating directories for Spark... done.

Creating module file for Spark... done.

Creating configuration files for Spark... done.

Updating images... done.

Initializing Spark Master service... done.

Initializing Spark Worker service... done.

Validating Spark setup... done.


Installation successfully completed.

Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance 'hdfs1'

Stopping/removing services... done.

Removing module file... done.

Removing additional Spark directories... done.

Removal successfully completed.

Finished.

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of Application Master:

[root@bright70 ~]# module load spark/hdfs1

[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process which is managed by YARN on the cluster (a sketch for retrieving the job output follows this list):

[root@bright70 ~]# module load spark/hdfs1

[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar
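In yarn-cluster mode the driver output, including the computed value of pi, ends up in the YARN container logs rather than on the terminal. One way of retrieving it is sketched below, assuming that YARN log aggregation is enabled and that <application_id> is the application ID reported by spark-submit:

[root@bright70 ~]# yarn logs -applicationId <application_id>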

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 \
-j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz \
--conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005

Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar

Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'

Sqoop service will be run on node node005

Found Hadoop instance 'hdfs1', release 2.2.0

Sqoop being installed... done.

Creating directories for Sqoop... done.

Creating module file for Sqoop... done.

Creating configuration files for Sqoop... done.

Updating images... done.

Installation successfully completed.

Finished.
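Once installed, a typical use of Sqoop is a bulk import of a MySQL table into HDFS. The following is a sketch only: the module name sqoop/hdfs1, the database mydb, its table mytable, and the MySQL user fred are illustrative assumptions:

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import \
--connect jdbc:mysql://bright70.cm.cluster:3306/mydb \
--username fred -P --table mytable --target-dir /user/root/mytable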


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance 'hdfs1'

Stopping/removing services... done.

Removing module file... done.

Removing additional Sqoop directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 \
-j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t apache-storm-0.9.4.tar.gz \
--nimbus node005

Storm release '0.9.4'

Storm Nimbus and UI services will be run on node node005

Found Hadoop instance 'hdfs1', release 2.2.0

Storm being installed... done.

Creating directories for Storm... done.

Creating module file for Storm... done.

Creating configuration files for Storm... done.

Updating images... done.

Initializing worker services for Storm (on DataNodes)... done.

Initializing Nimbus services for Storm... done.

Executing validation test... done.

Installation successfully completed.

Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.
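The validation topology can also be killed from the command line instead of from the Storm UI. A sketch, assuming the module name storm/hdfs1:

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm kill WordCount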

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance 'hdfs1'

Stopping/removing services... done.

Removing module file... done.

Removing additional Storm directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1

[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/\
apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar \
storm.starter.WordCountTopology WordCount2

470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...

476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

[==================================================] 3248859 / 3248859

File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}

687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list

Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15


RHEL7 and SLES12 use MariaDB and are not yet supported by theavailable versions of Sqoop at the time of writing (April 2015) At presentthe latest Sqoop stable release is 145 while the latest Sqoop2 versionis 1995 Sqoop2 is incompatible with Sqoop it is not feature-completeand it is not yet intended for production use The Bright Computing uti-tility cm-sqoop-setup does not as yet support Sqoop2

661 Sqoop Installation With cm-sqoop-setupBright Cluster Manager provides cm-sqoop-setup to carry out Sqoopinstallation

copy Bright Computing Inc

42 Hadoop-related Projects

Prerequisites For Sqoop Installation And What Sqoop Installation DoesThe following requirements and conditions apply to running thecm-sqoop-setup script

bull A Hadoop instance must already be installed

bull Before running the script the version of themysql-connector-java package should be checkedSqoop works with releases 5134 or later of this package Ifmysql-connector-java provides a newer release then thefollowing must be done to ensure that Sqoop setup works

ndash a suitable 5134 or later release of ConnectorJ is downloadedfrom httpdevmysqlcomdownloadsconnectorj

ndash cm-sqoop-setup is run with the --conn option in order tospecify the connector version to be used

Example

--conn tmpmysql-connector-java-5134-binjar

bull The cm-sqoop-setup script installs Sqoop only on the activehead node A different node can be specified by using the option--master

bull The script assigns no roles to nodes

bull Sqoop executables are copied by the script to a subdirectory undercmsharedhadoop

bull Sqoop configuration files are copied by the script and placed underetchadoop

bull The Metastore service is started up by the script

An Example Run With cm-sqoop-setupExample

[rootbright70 ~] cm-sqoop-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpsqoop-145bin__hadoop-204-alphatargz--conn tmpmysql-connector-java-5134-binjar --master node005

Using MySQL ConnectorJ from tmpmysql-connector-java-5134-binjar

Sqoop release rsquo145bin__hadoop-204-alpharsquo

Sqoop service will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 220

Sqoop being installed done

Creating directories for Sqoop done

Creating module file for Sqoop done

Creating configuration files for Sqoop done

Updating images done

Installation successfully completed

Finished

copy Bright Computing Inc

67 Storm 43

662 Sqoop Removal With cm-sqoop-setupcm-sqoop-setup should be used to remove the Sqoop instance

Example

[rootbright70 ~] cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Sqoop directories done

Updating images done

Removal successfully completed

Finished

67 StormApache Storm is a distributed realtime computation system While Hadoopis focused on batch processing Storm can process streams of data

Other parallels between Hadoop and Storm

bull users run ldquojobsrdquo in Hadoop and ldquotopologiesrdquo in Storm

bull the master node for Hadoop jobs runs the ldquoJobTrackerrdquo or ldquoRe-sourceManagerrdquo daemons to deal with resource management andscheduling while the master node for Storm runs an analogous dae-mon called ldquoNimbusrdquo

bull each worker node for Hadoop runs daemons called ldquoTaskTrackerrdquoor ldquoNodeManagerrdquo while the worker nodes for Storm runs an anal-ogous daemon called ldquoSupervisorrdquo

bull both Hadoop in the case of NameNode HA and Storm leverageldquoZooKeeperrdquo for coordination

671 Storm Installation With cm-storm-setupBright Cluster Manager provides cm-storm-setup to carry out Storminstallation

Prerequisites For Storm Installation And What Storm Installation DoesThe following applies to using cm-storm-setup

bull A Hadoop instance with ZooKeeper must already be installed

bull The cm-storm-setup script only installs Storm on the active headnode and on the DataNodes of the chosen Hadoop instance by de-fault A node other than master can be specified by using the option--master or its alias for this setup script --nimbus

bull The script assigns no roles to nodes

bull Storm executables are copied by the script to a subdirectory undercmsharedhadoop

bull Storm configuration files are copied by the script to under etchadoop This is done both on the active headnode and on thenecessary image(s)

bull Validation tests are carried out by the script

copy Bright Computing Inc

44 Hadoop-related Projects

An Example Run With cm-storm-setup

Example

[rootbright70 ~] cm-storm-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t apache-storm-094targz

--nimbus node005

Storm release rsquo094rsquo

Storm Nimbus and UI services will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 220

Storm being installed done

Creating directories for Storm done

Creating module file for Storm done

Creating configuration files for Storm done

Updating images done

Initializing worker services for Storm (on DataNodes) done

Initializing Nimbus services for Storm done

Executing validation test done

Installation successfully completed

Finished

The cm-storm-setup installation script submits a validation topol-ogy (topology in the Storm sense) called WordCount After a success-ful installation users can connect to the Storm UI on the host ltnim-busgt the Nimbus server at httpltnimbusgt10080 There theycan check the status of WordCount and can kill it

672 Storm Removal With cm-storm-setupThe cm-storm-setup script should also be used to remove the Storminstance

Example

[rootbright70 ~] cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Storm directories done

Updating images done

Removal successfully completed

Finished

673 Using StormThe following example shows how to submit a topology and then verifythat it has been submitted successfully (some lines elided)

[rootbright70 ~] module load stormhdfs1

[rootbright70 ~] storm jar cmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-jar stormstarterWordCountTopology WordCount2

470 [main] INFO backtypestormStormSubmitter - Jar not uploaded to m

copy Bright Computing Inc

67 Storm 45

aster yet Submitting jar

476 [main] INFO backtypestormStormSubmitter - Uploading topology jar cmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jar to assigned location tmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jar

Start uploading file rsquocmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jarrsquo to rsquotmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jarrsquo (3248859 bytes)

[==================================================] 3248859 3248859

File rsquocmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jarrsquo uploaded to rsquotmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jarrsquo(3248859 bytes)

508 [main] INFO backtypestormStormSubmitter - Successfully uploadedtopology jar to assigned location tmpstorm-hdfs1-localnimbusinbox

stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jar

508 [main] INFO backtypestormStormSubmitter - Submitting topology WordCount2 in distributed mode with conf topologyworkers3topologydebugtrue

687 [main] INFO backtypestormStormSubmitter - Finished submitting topology WordCount2

[roothadoopdev ~] storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 29: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

4.1 Hadoop Services From cmgui

The options buttons act just like for any other service (section 3.11 of the Administrator Manual), which means they include the possibility of acting on multiple service selections.

• The Tasks tab of the Hadoop instance: Within the Hadoop/HDFS resource, for a specific Hadoop instance, the Tasks tab (section 3.1.3) conveniently allows all Hadoop daemons to be (re)started and stopped directly via buttons.

5 Running Hadoop Jobs

5.1 Shakedown Runs
The cm-hadoop-tests.sh script is provided as part of Bright Cluster Manager's cluster-tools package. The administrator can use the script to conveniently submit example jar files in the Hadoop installation to a job client of a Hadoop instance:

[root@bright70 ~]# cd /cm/local/apps/cluster-tools/hadoop
[root@bright70 hadoop]# cm-hadoop-tests.sh <instance>

The script runs endlessly, and runs several Hadoop test scripts. If most lines in the run output are elided for brevity, then the structure of the truncated output looks something like this in overview:

Example

[root@bright70 hadoop]# cm-hadoop-tests.sh apache220
================================================================
Press [CTRL+C] to stop
================================================================
================================================================
start cleaning directories
================================================================
================================================================
clean directories done
================================================================
================================================================
start doing gen_test
================================================================
14/03/24 15:05:37 INFO terasort.TeraSort: Generating 10000 using 2
14/03/24 15:05:38 INFO mapreduce.JobSubmitter: number of splits:2
Job Counters
Map-Reduce Framework
org.apache.hadoop.examples.terasort.TeraGen$Counters
14/03/24 15:07:03 INFO terasort.TeraSort: starting
14/03/24 15:09:12 INFO terasort.TeraSort: done
================================================================================
gen_test done
================================================================================
================================================================================
start doing PI test
================================================================================
Working Directory = /user/root/bbp

During the run, the Overview tab in cmgui (introduced in section 3.1.1) for the Hadoop instance should show activity as it refreshes its overview every three minutes (figure 5.1).

Figure 5.1: Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh the overview command shows the most recent values that can be retrieved when the command is run:

[mk-hadoop-centos6->hadoop]% overview apache220
Parameter                        Value
------------------------------   ---------------------------------
Name                             Apache220
Capacity total                   2756GB
Capacity used                    7246MB
Capacity remaining               1641GB
Heap memory total                2807MB
Heap memory used                 1521MB
Heap memory remaining            1287MB
Non-heap memory total            2581MB
Non-heap memory used             2519MB
Non-heap memory remaining        6155MB
Nodes available                  3
Nodes dead                       0
Nodes decommissioned             0
Nodes decommission in progress   0
Total files                      72
Total blocks                     31
Missing blocks                   0
Under-replicated blocks          2
Scheduled replication blocks     0
Pending replication blocks       0
Block report average Time        59666
Applications running             1
Applications pending             0
Applications submitted           7
Applications completed           6
Applications failed              0
High availability                Yes (automatic failover disabled)
Federation setup                 no

Role                                                            Node
--------------------------------------------------------------  -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper  node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper              node002

5.2 Example End User Job Run
Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.

The user fred can then submit a run from a pi value estimator from the example jar file as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
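For instance, a run with more maps and samples should give an estimate closer to the true value of pi. The following invocation is a sketch only: the argument values 4 and 10000 are illustrative choices, not taken from the example run above:

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 10000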

6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)
• Hive (section 6.2)
• Kafka (section 6.3)
• Pig (section 6.4)
• Spark (section 6.5)
• Sqoop (section 6.6)
• Storm (section 6.7)

6.1 Accumulo
Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables
• Garbage Collector, to delete files no longer needed
• Master, responsible for coordination
• Tracer, which collects traces about Accumulo operations
• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).

6.1.1 Accumulo Installation With cm-accumulo-setup
Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation And What Accumulo Installation Does
The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.
• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.
• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.
• The script assigns no roles to nodes.
• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).
• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.
• The secret string for the instance is a random string created by cm-accumulo-setup.
• A password for the root user must be specified.
• The Tracer service will use the Accumulo user root to connect to Accumulo.
• The services for Garbage Collector, Master, Tracer, and Monitor are by default installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.
• A Tablet Server will be started on each DataNode.
• cm-accumulo-setup tries to build the native map library.
• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option
-p <rootpass>
is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.

The option
-s <heapsize>
is not mandatory. If not set, a default value of 1GB is used.

The option
--master <nodename>
is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup
cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example
Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10
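After the ingest completes, the sort table can be inspected from the Accumulo shell. A minimal check is sketched below, reusing the root credentials from the command above; it assumes the shell's -e option for executing a single command:

bash-4.1$ accumulo shell -u root -p secret -e "scan -t sort"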

6.2 Hive
Apache Hive is a data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client
• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup
Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation And What Hive Installation Does
The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.
• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:
  – a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j
  – cm-hive-setup is run with the --conn option to specify the connector version to use.
Example
--conn /tmp/mysql-connector-java-5.1.18-bin.jar
• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client (a sample session is sketched after this list):
GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%' IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;
DROP DATABASE IF EXISTS <metastoredb>;
In the preceding statements:
  – <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.
  – <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.
  – The DROP line is needed only if a database with that name already exists.
• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.
• The script assigns no roles to nodes.
• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Hive configuration files are copied by the script to under /etc/hadoop/.
• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.
• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.
• The Metastore and HiveServer2 services are started up by the script.
• Validation tests are carried out by the script using hive and beeline.
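For concreteness, a sketch of such a MySQL client session follows. The metastore database name metastore_hdfs1 matches the example run below, and the password shown is a placeholder to be replaced with the real <hivepass>:

[root@bright70 ~]# mysql -u root -p
mysql> GRANT ALL PRIVILEGES ON metastore_hdfs1.* TO 'hive'@'%' IDENTIFIED BY 'hivepass';
mysql> FLUSH PRIVILEGES;
mysql> DROP DATABASE IF EXISTS metastore_hdfs1;
mysql> \q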

An Example Run With cm-hive-setup
Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client
-- testing 'beeline' client
Hive setup validation... done.
Installation successfully completed.
Finished.

6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 -d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000
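Other HiveQL statements can be passed with -e in the same way. For example, a throwaway table could be created and then listed as follows (a sketch only; the pokes table is illustrative and not part of the validation set):

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 -d org.apache.hive.jdbc.HiveDriver -e 'CREATE TABLE pokes (foo INT, bar STRING); SHOW TABLES;'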

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does
The following applies to using cm-kafka-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.
• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.
• The script assigns no roles to nodes.
• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.
• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
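A quick functional check can then be carried out with the command-line tools shipped with Kafka 0.8.x. The following is a sketch only: it assumes that ZooKeeper runs on node001 with the default client port 2181, that a Kafka broker listens on the default port 9092, and that Kafka's bin directory is on the PATH after the module load; the topic name is arbitrary:

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 --replication-factor 1 --partitions 1 --topic smoketest
[root@bright70 ~]# echo hello | kafka-console-producer.sh --broker-list node001:9092 --topic smoketest
[root@bright70 ~]# kafka-console-consumer.sh --zookeeper node001:2181 --topic smoketest --from-beginning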

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.

Prerequisites For Pig Installation And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.
• cm-pig-setup installs Pig only on the active head node.
• The script assigns no roles to nodes.
• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.
• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
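As an illustration of batch mode, a script like /tmp/smoke.pig might contain a small word count in Pig Latin. This is a sketch only: the HDFS input path /user/root/input.txt is a hypothetical file that would have to be uploaded first:

[root@bright70 ~]# cat > /tmp/smoke.pig <<'EOF'
-- word count over a text file already stored in HDFS
lines  = LOAD '/user/root/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
EOF
[root@bright70 ~]# pig -v -f /tmp/smoke.pig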

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing with Spark Streaming
• SQL queries on structured distributed data with Spark SQL
• processing with machine learning algorithms, using MLlib
• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.
• Spark can be installed in two different deployment modes: Standalone or YARN.
  – Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x. It is possible to force the Standalone mode deployment by using the additional option --standalone.
    When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance. The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.
  – YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option. When installing in YARN mode, the script installs Spark only on the active head node.
• The script assigns no roles to nodes.
• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.
• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup in Standalone mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode
Spark Master service will be run on node node005
Spark will use all DataNodes as WorkerNodes
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster; retrieving its output is sketched after this list.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar
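In yarn-cluster mode the value printed by SparkPi ends up in the driver's log inside the Application Master, rather than on the submitting terminal. Assuming YARN log aggregation is enabled, it can be retrieved afterwards with yarn logs; the application ID below is a hypothetical placeholder for the one reported by spark-submit:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# yarn logs -applicationId application_1428485006695_0001 | grep "Pi is"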

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.

Prerequisites For Sqoop Installation And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.
• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:
  – a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j
  – cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.
Example
--conn /tmp/mysql-connector-java-5.1.34-bin.jar
• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.
• The script assigns no roles to nodes.
• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Sqoop configuration files are copied by the script and placed under /etc/hadoop/.
• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup
Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
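Once installed, Sqoop can pull a MySQL table into HDFS with sqoop import. The following sketch uses a hypothetical database wikidb, table pages, and MySQL user sqoopuser reachable from the head node; none of these names come from the manual, and -P prompts for the password interactively:

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://bright70.cm.cluster/wikidb --username sqoopuser -P --table pages --target-dir /user/root/pages -m 1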

6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop and "topologies" in Storm
• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"
• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"
• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.
• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than master can be specified by using the option --master, or its alias for this setup script, --nimbus.
• The script assigns no roles to nodes.
• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).
• Validation tests are carried out by the script.

An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2
...
470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list
...
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15
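When the test topology is no longer needed, it can be taken down from the command line as well, which mirrors the kill action available in the Storm UI:

[root@bright70 ~]# storm kill WordCount2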

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 30: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

5Running Hadoop Jobs

51 Shakedown RunsThe cm-hadoop-testssh script is provided as part of Bright Clus-ter Managerrsquos cluster-tools package The administrator can use thescript to conveniently submit example jar files in the Hadoop installationto a job client of a Hadoop instance

[rootbright70 ~] cd cmlocalappscluster-toolshadoop

[rootbright70 hadoop] cm-hadoop-testssh ltinstancegt

The script runs endlessly and runs several Hadoop test scripts Ifmost lines in the run output are elided for brevity then the structure ofthe truncated output looks something like this in overview

Example

[rootbright70 hadoop] cm-hadoop-testssh apache220

================================================================

Press [CTRL+C] to stop

================================================================

================================================================

start cleaning directories

================================================================

================================================================

clean directories done

================================================================

================================================================

start doing gen_test

================================================================

140324 150537 INFO terasortTeraSort Generating 10000 using 2

140324 150538 INFO mapreduceJobSubmitter number of splits2

Job Counters

Map-Reduce Framework

copy Bright Computing Inc

28 Running Hadoop Jobs

orgapachehadoopexamplesterasortTeraGen$Counters

140324 150703 INFO terasortTeraSort starting

140324 150912 INFO terasortTeraSort done

================================================================================

gen_test done

================================================================================

================================================================================

start doing PI test

================================================================================

Working Directory = userrootbbp

During the run the Overview tab in cmgui (introduced in section 311)for the Hadoop instance should show activity as it refreshes its overviewevery three minutes (figure 51)

Figure 51 Overview Of Activity Seen For A Hadoop Instance In cmgui

In cmsh the overview command shows the most recent values thatcan be retrieved when the command is run

[mk-hadoop-centos6->hadoop] overview apache220
Parameter                      Value
------------------------------ ---------------------------------
Name                           Apache220
Capacity total                 2756GB
Capacity used                  7246MB
Capacity remaining             1641GB
Heap memory total              2807MB
Heap memory used               1521MB
Heap memory remaining          1287MB
Non-heap memory total          2581MB
Non-heap memory used           2519MB
Non-heap memory remaining      6155MB
Nodes available                3
Nodes dead                     0
Nodes decommissioned           0
Nodes decommission in progress 0
Total files                    72
Total blocks                   31
Missing blocks                 0
Under-replicated blocks        2
Scheduled replication blocks   0
Pending replication blocks     0
Block report average Time      59666
Applications running           1
Applications pending           0
Applications submitted         7
Applications completed         6
Applications failed            0
High availability              Yes (automatic failover disabled)
Federation setup               no

Role                                                           Node
-------------------------------------------------------------- -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper             node002

5.2 Example End User Job Run

Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option, as sketched below.
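A minimal sketch of unsetting the access, assuming that cmsh accepts an empty value in place of the instance name (the exact form may vary):

Example

[bright70->user[fred]]% set hadoophdfsaccess ""; commit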

The user fred can then submit a run of a pi value estimator from the example jar file as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 1 5

Job Finished in 19.732 seconds

Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
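For instance, a longer run over more maps and samples could look like the following sketch (no output is shown, since the runtime and the exact estimate depend on the cluster):

Example

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 10 1000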

copy Bright Computing Inc

6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)

• Hive (section 6.2)

• Kafka (section 6.3)

• Pig (section 6.4)

• Spark (section 6.5)

• Sqoop (section 6.6)

• Storm (section 6.7)

6.1 Accumulo

Apache Accumulo is a highly-scalable, structured, distributed, key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements, as sketched below.
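Schematically, and as general Accumulo background rather than anything Bright-specific, a key combines a row ID, a column (split into family, qualifier, and visibility label), and a timestamp, with each key mapping to a value:

(row ID, column family, column qualifier, visibility, timestamp) -> value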

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables

• Garbage Collector, to delete files no longer needed

• Master, responsible for coordination

• Tracer, collecting traces about Accumulo operations

• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).


6.1.1 Accumulo Installation With cm-accumulo-setup

Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation And What Accumulo Installation Does

The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.

• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.

• The script assigns no roles to nodes.

• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.

• The secret string for the instance is a random string created by cm-accumulo-setup.

• A password for the root user must be specified.

• The Tracer service will use the Accumulo user root to connect to Accumulo.

• The services for Garbage Collector, Master, Tracer, and Monitor are by default installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.

• A Tablet Server will be started on each DataNode.

• cm-accumulo-setup tries to build the native map library.

• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option

-p <rootpass>

is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.

copy Bright Computing Inc

61 Accumulo 33

The option

-s <heapsize>

is not mandatory. If not set, a default value of 1GB is used.

The option

--master <nodename>

is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup

cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example

Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest -i hdfs1 \
-z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 \
--maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

6.2 Hive

Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client

• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup

Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation And What Hive Installation Does

The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:

– a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-hive-setup is run with the --conn option to specify the connector version to be used:

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;

DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.

copy Bright Computing Inc

62 Hive 35

– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.

• The cm-hive-setup script installs Hive by default on the active head node.

It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script using hive and beeline.

An Example Run With cm-hive-setup

Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client...
-- testing 'beeline' client...
Hive setup validation... done.
Installation successfully completed.
Finished.


6.2.2 Hive Removal With cm-hive-setup

cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline

The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000
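The Metastore can similarly be queried with the hive client. A minimal sketch (the module name is assumed to follow the same instance convention as the other projects in this chapter):

Example

[root@bright70 ~]# module load hive/hdfs1
[root@bright70 ~]# hive -e 'SHOW TABLES;'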

6.3 Kafka

Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup

Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does

The following applies to using cm-kafka-setup:


• A Hadoop instance, with ZooKeeper, must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
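After installation, a quick functional check can be run from the head node. The following is a sketch only: the topic name is illustrative, the ZooKeeper address assumes ZooKeeper runs on node001 on its default port 2181, and kafka-topics.sh is the standard tool shipped in the Kafka tarball installed above:

Example

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 \
--replication-factor 1 --partitions 1 --topic smoketest
[root@bright70 ~]# kafka-topics.sh --list --zookeeper node001:2181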

6.3.2 Kafka Removal With cm-kafka-setup

cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance hdfs1.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig

Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup

Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation And What Pig Installation Does

The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup

cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig

Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig


14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
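The contents of /tmp/smoke.pig are not shown in this manual. As an illustration only, a hypothetical word-count script could be created and run as follows:

Example

[root@bright70 ~]# cat > /tmp/smoke.pig << 'EOF'
-- hypothetical smoke test: count word occurrences in an HDFS file
lines  = LOAD '/user/root/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
DUMP counts;
EOF
[root@bright70 ~]# pig -v -f /tmp/smoke.pig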

6.5 Spark

Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup

Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does

The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode. This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.

The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode. This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup in standalone mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode
Spark Master service will be run on node node005
Spark will use all DataNodes as WorkerNodes
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup

cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark

Spark supports two deploy modes for launching Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

6.6 Sqoop

Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop: it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.
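As an illustration of such a transfer, a hypothetical import of a MySQL table into HDFS might look like the following sketch (the database host, database, credentials, and table are all illustrative, and the module name is assumed to follow the instance convention used elsewhere in this chapter):

Example

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://node005/testdb \
--username fred -P --table pets --target-dir /user/fred/pets -m 2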

6.6.1 Sqoop Installation With cm-sqoop-setup

Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation And What Sqoop Installation Does

The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an earlier release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used:

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz \
--conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.

copy Bright Computing Inc

67 Storm 43

6.6.2 Sqoop Removal With cm-sqoop-setup

cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm

Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup

Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does

The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080/. There they can check the status of WordCount, and can kill it.
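Killing it can also be done from the command line with the standard Storm client (a sketch, assuming the storm module for the instance is loaded as in section 6.7.3):

Example

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm kill WordCount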

6.7.2 Storm Removal With cm-storm-setup

The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm

The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/\
apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar \
storm.starter.WordCountTopology WordCount2

470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...

476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15


bull The script assigns no roles to nodes

bull Storm executables are copied by the script to a subdirectory undercmsharedhadoop

bull Storm configuration files are copied by the script to under etchadoop This is done both on the active headnode and on thenecessary image(s)

bull Validation tests are carried out by the script

copy Bright Computing Inc

44 Hadoop-related Projects

An Example Run With cm-storm-setup

Example

[rootbright70 ~] cm-storm-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t apache-storm-094targz

--nimbus node005

Storm release rsquo094rsquo

Storm Nimbus and UI services will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 220

Storm being installed done

Creating directories for Storm done

Creating module file for Storm done

Creating configuration files for Storm done

Updating images done

Initializing worker services for Storm (on DataNodes) done

Initializing Nimbus services for Storm done

Executing validation test done

Installation successfully completed

Finished

The cm-storm-setup installation script submits a validation topol-ogy (topology in the Storm sense) called WordCount After a success-ful installation users can connect to the Storm UI on the host ltnim-busgt the Nimbus server at httpltnimbusgt10080 There theycan check the status of WordCount and can kill it

672 Storm Removal With cm-storm-setupThe cm-storm-setup script should also be used to remove the Storminstance

Example

[rootbright70 ~] cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Storm directories done

Updating images done

Removal successfully completed

Finished

673 Using StormThe following example shows how to submit a topology and then verifythat it has been submitted successfully (some lines elided)

[rootbright70 ~] module load stormhdfs1

[rootbright70 ~] storm jar cmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-jar stormstarterWordCountTopology WordCount2

470 [main] INFO backtypestormStormSubmitter - Jar not uploaded to m

copy Bright Computing Inc

67 Storm 45

aster yet Submitting jar

476 [main] INFO backtypestormStormSubmitter - Uploading topology jar cmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jar to assigned location tmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jar

Start uploading file rsquocmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jarrsquo to rsquotmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jarrsquo (3248859 bytes)

[==================================================] 3248859 3248859

File rsquocmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jarrsquo uploaded to rsquotmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jarrsquo(3248859 bytes)

508 [main] INFO backtypestormStormSubmitter - Successfully uploadedtopology jar to assigned location tmpstorm-hdfs1-localnimbusinbox

stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jar

508 [main] INFO backtypestormStormSubmitter - Submitting topology WordCount2 in distributed mode with conf topologyworkers3topologydebugtrue

687 [main] INFO backtypestormStormSubmitter - Finished submitting topology WordCount2

[roothadoopdev ~] storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15

copy Bright Computing Inc

Under-replicated blocks          2
Scheduled replication blocks     0
Pending replication blocks       0
Block report average Time        59666

Applications running             1
Applications pending             0
Applications submitted           7
Applications completed           6
Applications failed              0

High availability                Yes (automatic failover disabled)
Federation setup                 no

Role                                                           Node
-------------------------------------------------------------- -------
DataNode, Journal, NameNode, YARNClient, YARNServer, ZooKeeper node001
DataNode, Journal, NameNode, YARNClient, ZooKeeper             node002

5.2 Example End User Job Run

Running a job from a jar file individually can be done by an end user.

An end user fred can be created and issued a password by the administrator (Chapter 6 of the Administrator Manual). The user must then be granted HDFS access for the Hadoop instance by the administrator:

Example

[bright70->user[fred]]% set hadoophdfsaccess apache220; commit

The possible instance options are shown as tab-completion suggestions. The access can be unset by leaving a blank for the instance option.
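Unsetting the access again could look like the following sketch, on the assumption that committing a blank value clears the property, as just described:

[bright70->user[fred]]% set hadoophdfsaccess ""; commit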

The user fred can then submit a run from a pi value estimator from the example jar file as follows (some output elided):

Example

[fred@bright70 ~]$ module add hadoop/Apache220/Apache220
[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/\
hadoop-mapreduce-examples-2.2.0.jar pi 1 5
...
Job Finished in 19.732 seconds
Estimated value of Pi is 4.00000000000000000000

The module add line is not needed if the user has the module loaded by default (section 2.2.3 of the Administrator Manual).

The input takes the number of maps and the number of samples as options: 1 and 5 in the example. The result can be improved with greater values for both.
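For instance, a longer, hypothetical run with 4 maps and 10000 samples per map uses the same jar, and returns a correspondingly closer estimate:

[fred@bright70 ~]$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/\
hadoop-mapreduce-examples-2.2.0.jar pi 4 10000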

6 Hadoop-related Projects

Several projects use the Hadoop framework. These projects may be focused on data warehousing, data-flow programming, or other data-processing tasks which Hadoop can handle well. Bright Cluster Manager provides tools to help install the following projects:

• Accumulo (section 6.1)
• Hive (section 6.2)
• Kafka (section 6.3)
• Pig (section 6.4)
• Spark (section 6.5)
• Sqoop (section 6.6)
• Storm (section 6.7)

6.1 Accumulo

Apache Accumulo is a highly-scalable, structured, distributed key-value store based on Google's BigTable. Accumulo works on top of Hadoop and ZooKeeper. Accumulo stores data in HDFS, and uses a richer model than regular key-value stores. Keys in Accumulo consist of several elements.

An Accumulo instance includes the following main components:

• Tablet Server, which manages subsets of all tables
• Garbage Collector, to delete files no longer needed
• Master, responsible for coordination
• Tracer, collecting traces about Accumulo operations
• Monitor, a web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo applications.

The Apache Accumulo tarball can be downloaded from http://accumulo.apache.org/. For Hortonworks HDP 2.1.x, the Accumulo tarball can be downloaded from the Hortonworks website (section 1.2).

6.1.1 Accumulo Installation With cm-accumulo-setup

Bright Cluster Manager provides cm-accumulo-setup to carry out the installation of Accumulo.

Prerequisites For Accumulo Installation And What Accumulo Installation Does

The following applies to using cm-accumulo-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.
• Hadoop can be configured with a single NameNode or NameNode HA, but not with NameNode federation.
• The cm-accumulo-setup script only installs Accumulo on the active head node and on the DataNodes of the chosen Hadoop instance.
• The script assigns no roles to nodes.
• Accumulo executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Accumulo configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).
• By default, Accumulo Tablet Servers are set to use 1GB of memory. A different value can be set via cm-accumulo-setup.
• The secret string for the instance is a random string created by cm-accumulo-setup.
• A password for the root user must be specified.
• The Tracer service will use the Accumulo user root to connect to Accumulo.
• The services for Garbage Collector, Master, Tracer, and Monitor are by default installed and run on the headnode. They can be installed and run on another node instead, as shown in the next example, using the --master option.
• A Tablet Server will be started on each DataNode.
• cm-accumulo-setup tries to build the native map library.
• Validation tests are carried out by the script.

The options for cm-accumulo-setup are listed on running cm-accumulo-setup -h.

An Example Run With cm-accumulo-setup

The option -p <rootpass> is mandatory. The specified password will also be used by the Tracer service to connect to Accumulo. The password will be stored in accumulo-site.xml, with read and write permissions assigned to root only.

The option -s <heapsize> is not mandatory. If not set, a default value of 1GB is used.

The option --master <nodename> is not mandatory. It is used to set the node on which the Garbage Collector, Master, Tracer, and Monitor services run. If not set, then these services are run on the head node by default.

Example

[root@bright70 ~]# cm-accumulo-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-p <rootpass> -s <heapsize> -t /tmp/accumulo-1.6.2-bin.tar.gz --master node005
Accumulo release '1.6.2'
Accumulo GC, Master, Monitor, and Tracer services will be run on node node005
Found Hadoop instance 'hdfs1', release 2.6.0
Accumulo being installed... done.
Creating directories for Accumulo... done.
Creating module file for Accumulo... done.
Creating configuration files for Accumulo... done.
Updating images... done.
Setting up Accumulo directories in HDFS... done.
Executing 'accumulo init'... done.
Initializing services for Accumulo (on DataNodes)... done.
Initializing master services for Accumulo... done.
Waiting for NameNode to be ready... done.
Executing validation test... done.
Installation successfully completed.
Finished.

6.1.2 Accumulo Removal With cm-accumulo-setup

cm-accumulo-setup should also be used to remove the Accumulo instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-accumulo-setup -u hdfs1
Requested removal of Accumulo for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Accumulo directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.1.3 Accumulo MapReduce Example

Accumulo jobs must be run using the accumulo system user.

Example

[root@bright70 ~]# su - accumulo
bash-4.1$ module load accumulo/hdfs1
bash-4.1$ cd $ACCUMULO_HOME
bash-4.1$ bin/tool.sh lib/accumulo-examples-simple.jar \
org.apache.accumulo.examples.simple.mapreduce.TeraSortIngest \
-i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret \
--count 10 --minKeySize 10 --maxKeySize 10 \
--minValueSize 78 --maxValueSize 78 --table sort --splits 10
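After the ingest completes, the new table can be checked for from the Accumulo shell. The following one-liner is a sketch that reuses the root password secret from the example above; the sort table should show up in the listing:

bash-4.1$ accumulo shell -u root -p secret -e "tables"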

6.2 Hive

Apache Hive is data warehouse software. It stores its data using HDFS, and can query it via the SQL-like HiveQL language. Metadata values for its tables and partitions are kept in the Hive Metastore, which is an SQL database, typically MySQL or Postgres. Data can be exposed to clients using the following client-server mechanisms:

• Metastore, accessed with the hive client
• HiveServer2, accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.2.1 Hive Installation With cm-hive-setup

Bright Cluster Manager provides cm-hive-setup to carry out Hive installation.

Prerequisites For Hive Installation And What Hive Installation Does

The following applies to using cm-hive-setup:

• A Hadoop instance must already be installed.
• Before running the script, the version of the mysql-connector-java package should be checked. Hive works with releases 5.1.18 or earlier of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Hive setup works:
  – a suitable 5.1.18 or earlier release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j
  – cm-hive-setup is run with the --conn option to specify the connector version to use

Example

--conn /tmp/mysql-connector-java-5.1.18-bin.jar

• Before running the script, the following statements must be executed explicitly by the administrator, using a MySQL client:

GRANT ALL PRIVILEGES ON <metastoredb>.* TO 'hive'@'%'
IDENTIFIED BY '<hivepass>';
FLUSH PRIVILEGES;
DROP DATABASE IF EXISTS <metastoredb>;

In the preceding statements:

– <metastoredb> is the name of the metastore database to be used by Hive. The same name is used later by cm-hive-setup.

– <hivepass> is the password for the hive user. The same password is used later by cm-hive-setup.

– The DROP line is needed only if a database with that name already exists.
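As an illustration, for the metastore database name metastore_hdfs1 that appears in the example run further on, and an arbitrary example password hivepass123, the statements could be issued from a shell on the MySQL host as follows:

[root@bright70 ~]# mysql -u root -p -e "GRANT ALL PRIVILEGES ON \
metastore_hdfs1.* TO 'hive'@'%' IDENTIFIED BY 'hivepass123'; \
FLUSH PRIVILEGES;"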

• The cm-hive-setup script installs Hive by default on the active head node. It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case, Connector/J should be installed in the software image of the node (a way to do this is sketched right after this list).
• The script assigns no roles to nodes.
• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Hive configuration files are copied by the script to under /etc/hadoop/.
• The instance of MySQL on the head node is initialized as the Metastore database for the Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.
• The data warehouse is created by the script in HDFS, in /user/hive/warehouse.
• The Metastore and HiveServer2 services are started up by the script.
• Validation tests are carried out by the script using hive and beeline.
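Installing Connector/J into a software image can be carried out from the head node with yum. The following is a sketch only: it assumes a RHEL-based image named default-image under /cm/images, and that the distribution repositories provide the mysql-connector-java package:

[root@bright70 ~]# yum --installroot=/cm/images/default-image \
install mysql-connector-java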

An Example Run With cm-hive-setup

Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-p <hivepass> --metastoredb <metastoredb> -t /tmp/apache-hive-1.1.0-bin.tar.gz \
--master node005
Hive release '1.1.0-bin'
Using MySQL server on active headnode
Hive service will be run on node node005
Using MySQL Connector/J installed in /usr/share/java/
Hive being installed... done.
Creating directories for Hive... done.
Creating module file for Hive... done.
Creating configuration files for Hive... done.
Initializing database 'metastore_hdfs1' in MySQL... done.
Waiting for NameNode to be ready... done.
Creating HDFS directories for Hive... done.
Updating images... done.
Waiting for NameNode to be ready... done.
Hive setup validation...
-- testing 'hive' client...
-- testing 'beeline' client...
Hive setup validation... done.
Installation successfully completed.
Finished.

6.2.2 Hive Removal With cm-hive-setup

cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1
Requested removal of Hive for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Hive directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.2.3 Beeline

The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). In the following example, Beeline connects to HiveServer2:

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'
Connecting to jdbc:hive2://node005.cm.cluster:10000
Connected to: Apache Hive (version 1.1.0)
Driver: Hive JDBC (version 1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
| test2     |
+-----------+--+
2 rows selected (0.243 seconds)
Beeline version 1.1.0 by Apache Hive
Closing: 0: jdbc:hive2://node005.cm.cluster:10000

6.3 Kafka

Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org/, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup

Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does

The following applies to using cm-kafka-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.
• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.
• The script assigns no roles to nodes.
• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.
• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
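A quick functional check after installation is to create and list a topic with the kafka-topics.sh script shipped with Kafka. The session below is a sketch: it assumes that a kafka/hdfs1 module exists, following the naming pattern of the other projects, that it puts the Kafka bin/ directory on the PATH, and that a ZooKeeper server of the instance listens on node001:2181:

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 \
--replication-factor 1 --partitions 1 --topic smoketest
[root@bright70 ~]# kafka-topics.sh --list --zookeeper node001:2181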

6.3.2 Kafka Removal With cm-kafka-setup

cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig

Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup

Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.

Prerequisites For Pig Installation And What Pig Installation Does

The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.
• cm-pig-setup installs Pig only on the active head node.
• The script assigns no roles to nodes.
• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.
• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup

cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig

Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
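The contents of /tmp/smoke.pig are up to the user. The following sketch shows a minimal batch job that counts the lines of a file first copied into HDFS; the file and path names are arbitrary illustrations:

[root@bright70 ~]# hadoop fs -put /etc/passwd /user/root/passwd
[root@bright70 ~]# cat > /tmp/smoke.pig << 'EOF'
-- load each line of the file as a single chararray field
lines = LOAD '/user/root/passwd' AS (line:chararray);
-- group all lines into one bag, then count its members
grouped = GROUP lines ALL;
counted = FOREACH grouped GENERATE COUNT(lines);
DUMP counted;
EOF
[root@bright70 ~]# pig -v -f /tmp/smoke.pig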

6.5 Spark

Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming
• SQL queries on structured distributed data, with Spark SQL
• processing with machine learning algorithms, using MLlib
• graph computation for arbitrarily-connected networks, with graphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup

Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does

The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.
• Spark can be installed in two different deployment modes: Standalone or YARN.
  – Standalone mode. This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.
    It is possible to force the Standalone mode deployment by using the additional option --standalone.
    When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.

    The Spark Master service runs on the active head node by default, but it can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.
  – YARN mode. This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.
    When installing in YARN mode, the script installs Spark only on the active head node.
• The script assigns no roles to nodes.
• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.
• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release 2.6.0
Spark will be installed in YARN (client/cluster) mode
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup in standalone mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release 2.6.0
Spark will be installed to work in Standalone mode
Spark Master service will be run on node node005
Spark will use all DataNodes as WorkerNodes
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup

cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark

Spark supports two deploy modes for launching Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
$SPARK_PREFIX/lib/spark-examples-*.jar
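In yarn-cluster mode the Pi estimate is printed by the driver inside its YARN container, not to the terminal. Assuming log aggregation is enabled for the instance, the result can be retrieved afterwards with the standard YARN log tool, where <application ID> is the ID that spark-submit reports on submission:

[root@bright70 ~]# yarn logs -applicationId <application ID> | grep "Pi is"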

6.6 Sqoop

Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop: it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup

Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.

Prerequisites For Sqoop Installation And What Sqoop Installation Does

The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.
• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides a newer release, then the following must be done to ensure that Sqoop setup works:
  – a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j
  – cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.
• The script assigns no roles to nodes.
• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Sqoop configuration files are copied by the script and placed under /etc/hadoop/.
• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz \
--conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
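Once installed, a quick way to check that Sqoop can reach a MySQL server is to list its databases. The session below is a sketch: it assumes that a sqoop/hdfs1 module exists, following the naming pattern of the other projects, and <mysqlhost> and <user> stand for the actual MySQL host and an account allowed to connect:

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop list-databases \
--connect jdbc:mysql://<mysqlhost>:3306/ --username <user> -P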

6.6.2 Sqoop Removal With cm-sqoop-setup

cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm

Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop and "topologies" in Storm
• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"
• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"
• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup

Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does

The following applies to using cm-storm-setup:

• A Hadoop instance, with ZooKeeper, must already be installed.
• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than master can be specified by using the option --master, or its alias for this setup script, --nimbus.
• The script assigns no roles to nodes.
• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.
• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode, and on the necessary image(s).
• Validation tests are carried out by the script.

An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ \
-t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup

The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm

The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar \
/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar \
storm.starter.WordCountTopology WordCount2
...
470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list
...
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15
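Besides being killed from the Storm UI mentioned earlier, a running topology can also be stopped with the storm client's kill command:

[root@bright70 ~]# storm kill WordCount2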

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 33: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

6Hadoop-related Projects

Several projects use the Hadoop framework These projects may befocused on data warehousing data-flow programming or other data-processing tasks which Hadoop can handle well Bright Cluster Managerprovides tools to help install the following projects

bull Accumulo (section 61)

bull Hive (section 62)

bull Kafka (section 63)

bull Pig (section 64)

bull Spark (section 65)

bull Sqoop (section 66)

bull Storm (section 67)

61 AccumuloApache Accumulo is a highly-scalable structured distributed key-valuestore based on Googlersquos BigTable Accumulo works on top of Hadoop andZooKeeper Accumulo stores data in HDFS and uses a richer model thanregular key-value stores Keys in Accumulo consist of several elements

An Accumulo instance includes the following main components

bull Tablet Server which manages subsets of all tables

bull Garbage Collector to delete files no longer needed

bull Master responsible of coordination

bull Tracer collection traces about Accumulo operations

bull Monitor web application showing information about the instance

Also a part of the instance is a client library linked to Accumulo ap-plications

The Apache Accumulo tarball can be downloaded from httpaccumuloapacheorg For Hortonworks HDP 21x the Accumulotarball can be downloaded from the Hortonworks website (section 12)

copy Bright Computing Inc

32 Hadoop-related Projects

611 Accumulo Installation With cm-accumulo-setupBright Cluster Manager provides cm-accumulo-setup to carry out theinstallation of Accumulo

Prerequisites For Accumulo Installation And What Accumulo Installa-tion DoesThe following applies to using cm-accumulo-setup

bull A Hadoop instance with ZooKeeper must already be installed

bull Hadoop can be configured with a single NameNode or NameNodeHA but not with NameNode federation

bull The cm-accumulo-setup script only installs Accumulo on the ac-tive head node and on the DataNodes of the chosen Hadoop in-stance

bull The script assigns no roles to nodes

bull Accumulo executables are copied by the script to a subdirectory un-der cmsharedhadoop

bull Accumulo configuration files are copied by the script to underetchadoop This is done both on the active headnode andon the necessary image(s)

bull By default Accumulo Tablet Servers are set to use 1GB of memoryA different value can be set via cm-accumulo-setup

bull The secret string for the instance is a random string created bycm-accumulo-setup

bull A password for the root user must be specified

bull The Tracer service will use Accumulo user root to connect to Ac-cumulo

bull The services for Garbage Collector Master Tracer and Monitor areby default installed and run on the headnode They can be installedand run on another node instead as shown in the next exampleusing the --master option

bull A Tablet Server will be started on each DataNode

bull cm-accumulo-setup tries to build the native map library

bull Validation tests are carried out by the script

The options for cm-accumulo-setup are listed on runningcm-accumulo-setup -h

An Example Run With cm-accumulo-setup

The option-p ltrootpassgt

is mandatory The specified password will also be used by the Tracerservice to connect to Accumulo The password will be stored inaccumulo-sitexml with read and write permissions assigned toroot only

copy Bright Computing Inc

61 Accumulo 33

The option-s ltheapsizegt

is not mandatory If not set a default value of 1GB is used

The option--master ltnodenamegt

is not mandatory It is used to set the node on which the Garbage Collec-tor Master Tracer and Monitor services run If not set then these servicesare run on the head node by default

Example

[rootbright70 ~] cm-accumulo-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -p ltrootpassgt -s ltheapsizegt -t tmpaccumulo-162-bintargz --master node005

Accumulo release rsquo162rsquo

Accumulo GC Master Monitor and Tracer services will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 260

Accumulo being installed done

Creating directories for Accumulo done

Creating module file for Accumulo done

Creating configuration files for Accumulo done

Updating images done

Setting up Accumulo directories in HDFS done

Executing rsquoaccumulo initrsquo done

Initializing services for Accumulo (on DataNodes) done

Initializing master services for Accumulo done

Waiting for NameNode to be ready done

Executing validation test done

Installation successfully completed

Finished

612 Accumulo Removal With cm-accumulo-setupcm-accumulo-setup should also be used to remove the Accumulo in-stance Data and metadata will not be removed

Example

[rootbright70 ~] cm-accumulo-setup -u hdfs1

Requested removal of Accumulo for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Accumulo directories done

Updating images done

Removal successfully completed

Finished

613 Accumulo MapReduce ExampleAccumulo jobs must be run using accumulo system user

Example

[rootbright70 ~] su - accumulo

bash-41$ module load accumulohdfs1

bash-41$ cd $ACCUMULO_HOME

bash-41$ bintoolsh libaccumulo-examples-simplejar

copy Bright Computing Inc

34 Hadoop-related Projects

orgapacheaccumuloexamplessimplemapreduceTeraSortIngest -i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 --maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

62 HiveApache Hive is a data warehouse software It stores its data using HDFSand can query it via the SQL-like HiveQL language Metadata values forits tables and partitions are kept in the Hive Metastore which is an SQLdatabase typically MySQL or Postgres Data can be exposed to clientsusing the following client-server mechanisms

bull Metastore accessed with the hive client

bull HiveServer2 accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locationsspecified in Section 12 depending on the chosen distribution

621 Hive Installation With cm-hive-setupBright Cluster Manager provides cm-hive-setup to carry out Hive in-stallation

Prerequisites For Hive Installation And What Hive Installation DoesThe following applies to using cm-hive-setup

bull A Hadoop instance must already be installed

bull Before running the script the version of themysql-connector-java package should be checked Hiveworks with releases 5118 or earlier of this package Ifmysql-connector-java provides a newer release then thefollowing must be done to ensure that Hive setup works

ndash a suitable 5118 or earlier release of ConnectorJ isdownloaded from httpdevmysqlcomdownloadsconnectorj

ndash cm-hive-setup is run with the --conn option to specify theconnector version to use

Example

--conn tmpmysql-connector-java-5118-binjar

bull Before running the script the following statements must be exe-cuted explicitly by the administrator using a MySQL client

GRANT ALL PRIVILEGES ON ltmetastoredbgt TO rsquohiversquorsquorsquoIDENTIFIED BY rsquolthivepassgtrsquoFLUSH PRIVILEGES

DROP DATABASE IF EXISTS ltmetastoredbgt

In the preceding statements

ndash ltmetastoredbgt is the name of metastore database to be used byHive The same name is used later by cm-hive-setup

copy Bright Computing Inc

62 Hive 35

ndash lthivepassgt is the password for hive user The same passwordis used later by cm-hive-setup

ndash The DROP line is needed only if a database with that namealready exists

bull The cm-hive-setup script installs Hive by default on the activehead node

It can be installed on another node instead as shown in the nextexample with the use of the --master option In that case Con-nectorJ should be installed in the software image of the node

bull The script assigns no roles to nodes

bull Hive executables are copied by the script to a subdirectory undercmsharedhadoop

bull Hive configuration files are copied by the script to under etchadoop

bull The instance of MySQL on the head node is initialized as theMetastore database for the Bright Cluster Manager by the scriptA different MySQL server can be specified by using the options--mysqlserver and --mysqlport

bull The data warehouse is created by the script in HDFS in userhivewarehouse

bull The Metastore and HiveServer2 services are started up by the script

bull Validation tests are carried out by the script using hive andbeeline

An Example Run With cm-hive-setupExample

[rootbright70 ~] cm-hive-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -p lthivepassgt --metastoredb ltmetastoredbgt -t tmpapache-hive-110-bintargz --master node005

Hive release rsquo110-binrsquo

Using MySQL server on active headnode

Hive service will be run on node node005

Using MySQL ConnectorJ installed in usrsharejava

Hive being installed done

Creating directories for Hive done

Creating module file for Hive done

Creating configuration files for Hive done

Initializing database rsquometastore_hdfs1rsquo in MySQL done

Waiting for NameNode to be ready done

Creating HDFS directories for Hive done

Updating images done

Waiting for NameNode to be ready done

Hive setup validation

-- testing rsquohiversquo client

-- testing rsquobeelinersquo client

Hive setup validation done

Installation successfully completed

Finished

copy Bright Computing Inc

36 Hadoop-related Projects

622 Hive Removal With cm-hive-setupcm-hive-setup should also be used to remove the Hive instance Dataand metadata will not be removed

Example

[rootbright70 ~] cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Hive directories done

Updating images done

Removal successfully completed

Finished

623 BeelineThe latest Hive releases include HiveServer2 which supports Beeline com-mand shell Beeline is a JDBC client based on the SQLLine CLI (httpsqllinesourceforgenet) In the following example Beelineconnects to HiveServer2

Example

[rootbright70 ~] beeline -u jdbchive2node005cmcluster10000 -d orgapachehivejdbcHiveDriver -e rsquoSHOW TABLESrsquo

Connecting to jdbchive2node005cmcluster10000

Connected to Apache Hive (version 110)

Driver Hive JDBC (version 110)

Transaction isolation TRANSACTION_REPEATABLE_READ

+-----------+--+

| tab_name |

+-----------+--+

| test |

| test2 |

+-----------+--+

2 rows selected (0243 seconds)

Beeline version 110 by Apache Hive

Closing 0 jdbchive2node005cmcluster10000

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does
The following applies to using cm-kafka-setup:

• A Hadoop instance with ZooKeeper must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/kafka_2.11-0.8.2.1.tgz
Kafka release '0.8.2.1' for Scala '2.11'
Found Hadoop instance 'hdfs1', release: 1.2.1
Kafka being installed... done.
Creating directories for Kafka... done.
Creating module file for Kafka... done.
Creating configuration files for Kafka... done.
Updating images... done.
Initializing services for Kafka (on ZooKeeper nodes)... done.
Executing validation test... done.
Installation successfully completed.
Finished.
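A minimal check by hand is possible once the installation has finished. The sketch below assumes a kafka/<instance> module analogous to those created by the other setup scripts in this chapter, with the Kafka bin directory on the PATH, and that node001 is one of the ZooKeeper nodes, listening on ZooKeeper's default port 2181; the topic name smoketest is hypothetical:

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001.cm.cluster:2181 --replication-factor 1 --partitions 1 --topic smoketest
[root@bright70 ~]# kafka-topics.sh --list --zookeeper node001.cm.cluster:2181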

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1
Requested removal of Kafka for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Kafka directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with “embarrassingly parallel” problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.

Prerequisites For Pig Installation And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/pig-0.14.0.tar.gz
Pig release '0.14.0'
Pig being installed... done.
Creating directories for Pig... done.
Creating module file for Pig... done.
Creating configuration files for Pig... done.
Waiting for NameNode to be ready...
Waiting for NameNode to be ready... done.
Validating Pig setup...
Validating Pig setup... done.
Installation successfully completed.
Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1
Requested removal of Pig for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Pig directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in “MapReduce Mode”, that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1
[root@bright70 ~]# module load pig/hdfs1
[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in “MapReduce mode”, thus working on the corresponding HDFS instance.
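The contents of a script such as /tmp/smoke.pig above are not prescribed by Bright Cluster Manager. As a hypothetical illustration, a smoke-test script that word-counts a small text file already uploaded to HDFS could look like the following:

[root@bright70 ~]# cat /tmp/smoke.pig
-- hypothetical smoke test: word count over a file in HDFS
lines = LOAD '/user/root/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
DUMP counts;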

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with graphX

The Apache Spark tarball should be downloaded from http://spark.apache.org, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

  – Standalone mode. This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

    It is possible to force the Standalone mode deployment by using the additional option --standalone.

    When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.

    The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

  – YARN mode. This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

    When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed in YARN (client/cluster) mode.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Waiting for NameNode to be ready... done.
Copying Spark assembly jar to HDFS... done.
Waiting for NameNode to be ready... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

An Example Run With cm-spark-setup in standalone mode

Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005
Spark release '1.3.0-bin-hadoop2.4'
Found Hadoop instance 'hdfs1', release: 2.6.0
Spark will be installed to work in Standalone mode.
Spark Master service will be run on node node005.
Spark will use all DataNodes as WorkerNodes.
Spark being installed... done.
Creating directories for Spark... done.
Creating module file for Spark... done.
Creating configuration files for Spark... done.
Updating images... done.
Initializing Spark Master service... done.
Initializing Spark Worker service... done.
Validating Spark setup... done.
Installation successfully completed.
Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1
Requested removal of Spark for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Spark directories... done.
Removal successfully completed.
Finished.

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In “yarn-client” mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of Application Master.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In “yarn-cluster” mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar
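For a deployment in Standalone mode, applications are submitted to the Spark Master instead of to YARN. A sketch, assuming the Master runs on node005 as in the earlier standalone example run, and that it listens on Spark's usual default port 7077 (an assumption, since the setup output does not show the port):

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master spark://node005:7077 --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar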

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.

Prerequisites For Sqoop Installation And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an earlier release, then the following must be done to ensure that Sqoop setup works:

  – a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j

  – cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup

Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005
Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar
Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'
Sqoop service will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Sqoop being installed... done.
Creating directories for Sqoop... done.
Creating module file for Sqoop... done.
Creating configuration files for Sqoop... done.
Updating images... done.
Installation successfully completed.
Finished.
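After installation, bulk data can be imported from MySQL into HDFS with the sqoop client. The following sketch assumes a sqoop/<instance> module analogous to the other modules in this chapter; the database mydb, table mytable, and user dbuser are hypothetical, and -P prompts for the MySQL password interactively:

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://bright70.cm.cluster:3306/mydb --username dbuser -P --table mytable --target-dir /user/root/mytable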


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1
Requested removal of Sqoop for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Sqoop directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run “jobs” in Hadoop and “topologies” in Storm

• the master node for Hadoop jobs runs the “JobTracker” or “ResourceManager” daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called “Nimbus”

• each worker node for Hadoop runs daemons called “TaskTracker” or “NodeManager”, while the worker nodes for Storm run an analogous daemon called “Supervisor”

• both Hadoop, in the case of NameNode HA, and Storm leverage “ZooKeeper” for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance with ZooKeeper must already be installed.

• By default, the cm-storm-setup script installs Storm only on the active head node and on the DataNodes of the chosen Hadoop instance. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active headnode and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 -t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar storm.starter.WordCountTopology WordCount2
470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2
[root@hadoopdev ~]# storm list
Topology_name        Status    Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE    28         3            15
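A submitted topology keeps running until it is stopped explicitly. Besides the Storm UI mentioned earlier, it can also be stopped from the command line with storm kill, which takes the topology name reported by storm list:

[root@bright70 ~]# storm kill WordCount2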

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 34: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

32 Hadoop-related Projects

611 Accumulo Installation With cm-accumulo-setupBright Cluster Manager provides cm-accumulo-setup to carry out theinstallation of Accumulo

Prerequisites For Accumulo Installation And What Accumulo Installa-tion DoesThe following applies to using cm-accumulo-setup

bull A Hadoop instance with ZooKeeper must already be installed

bull Hadoop can be configured with a single NameNode or NameNodeHA but not with NameNode federation

bull The cm-accumulo-setup script only installs Accumulo on the ac-tive head node and on the DataNodes of the chosen Hadoop in-stance

bull The script assigns no roles to nodes

bull Accumulo executables are copied by the script to a subdirectory un-der cmsharedhadoop

bull Accumulo configuration files are copied by the script to underetchadoop This is done both on the active headnode andon the necessary image(s)

bull By default Accumulo Tablet Servers are set to use 1GB of memoryA different value can be set via cm-accumulo-setup

bull The secret string for the instance is a random string created bycm-accumulo-setup

bull A password for the root user must be specified

bull The Tracer service will use Accumulo user root to connect to Ac-cumulo

bull The services for Garbage Collector Master Tracer and Monitor areby default installed and run on the headnode They can be installedand run on another node instead as shown in the next exampleusing the --master option

bull A Tablet Server will be started on each DataNode

bull cm-accumulo-setup tries to build the native map library

bull Validation tests are carried out by the script

The options for cm-accumulo-setup are listed on runningcm-accumulo-setup -h

An Example Run With cm-accumulo-setup

The option-p ltrootpassgt

is mandatory The specified password will also be used by the Tracerservice to connect to Accumulo The password will be stored inaccumulo-sitexml with read and write permissions assigned toroot only

copy Bright Computing Inc

61 Accumulo 33

The option-s ltheapsizegt

is not mandatory If not set a default value of 1GB is used

The option--master ltnodenamegt

is not mandatory It is used to set the node on which the Garbage Collec-tor Master Tracer and Monitor services run If not set then these servicesare run on the head node by default

Example

[rootbright70 ~] cm-accumulo-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -p ltrootpassgt -s ltheapsizegt -t tmpaccumulo-162-bintargz --master node005

Accumulo release rsquo162rsquo

Accumulo GC Master Monitor and Tracer services will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 260

Accumulo being installed done

Creating directories for Accumulo done

Creating module file for Accumulo done

Creating configuration files for Accumulo done

Updating images done

Setting up Accumulo directories in HDFS done

Executing rsquoaccumulo initrsquo done

Initializing services for Accumulo (on DataNodes) done

Initializing master services for Accumulo done

Waiting for NameNode to be ready done

Executing validation test done

Installation successfully completed

Finished

612 Accumulo Removal With cm-accumulo-setupcm-accumulo-setup should also be used to remove the Accumulo in-stance Data and metadata will not be removed

Example

[rootbright70 ~] cm-accumulo-setup -u hdfs1

Requested removal of Accumulo for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Accumulo directories done

Updating images done

Removal successfully completed

Finished

613 Accumulo MapReduce ExampleAccumulo jobs must be run using accumulo system user

Example

[rootbright70 ~] su - accumulo

bash-41$ module load accumulohdfs1

bash-41$ cd $ACCUMULO_HOME

bash-41$ bintoolsh libaccumulo-examples-simplejar

copy Bright Computing Inc

34 Hadoop-related Projects

orgapacheaccumuloexamplessimplemapreduceTeraSortIngest -i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 --maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

62 HiveApache Hive is a data warehouse software It stores its data using HDFSand can query it via the SQL-like HiveQL language Metadata values forits tables and partitions are kept in the Hive Metastore which is an SQLdatabase typically MySQL or Postgres Data can be exposed to clientsusing the following client-server mechanisms

bull Metastore accessed with the hive client

bull HiveServer2 accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locationsspecified in Section 12 depending on the chosen distribution

621 Hive Installation With cm-hive-setupBright Cluster Manager provides cm-hive-setup to carry out Hive in-stallation

Prerequisites For Hive Installation And What Hive Installation DoesThe following applies to using cm-hive-setup

bull A Hadoop instance must already be installed

bull Before running the script the version of themysql-connector-java package should be checked Hiveworks with releases 5118 or earlier of this package Ifmysql-connector-java provides a newer release then thefollowing must be done to ensure that Hive setup works

ndash a suitable 5118 or earlier release of ConnectorJ isdownloaded from httpdevmysqlcomdownloadsconnectorj

ndash cm-hive-setup is run with the --conn option to specify theconnector version to use

Example

--conn tmpmysql-connector-java-5118-binjar

bull Before running the script the following statements must be exe-cuted explicitly by the administrator using a MySQL client

GRANT ALL PRIVILEGES ON ltmetastoredbgt TO rsquohiversquorsquorsquoIDENTIFIED BY rsquolthivepassgtrsquoFLUSH PRIVILEGES

DROP DATABASE IF EXISTS ltmetastoredbgt

In the preceding statements

ndash ltmetastoredbgt is the name of metastore database to be used byHive The same name is used later by cm-hive-setup

copy Bright Computing Inc

62 Hive 35

ndash lthivepassgt is the password for hive user The same passwordis used later by cm-hive-setup

ndash The DROP line is needed only if a database with that namealready exists

bull The cm-hive-setup script installs Hive by default on the activehead node

It can be installed on another node instead as shown in the nextexample with the use of the --master option In that case Con-nectorJ should be installed in the software image of the node

bull The script assigns no roles to nodes

bull Hive executables are copied by the script to a subdirectory undercmsharedhadoop

bull Hive configuration files are copied by the script to under etchadoop

bull The instance of MySQL on the head node is initialized as theMetastore database for the Bright Cluster Manager by the scriptA different MySQL server can be specified by using the options--mysqlserver and --mysqlport

bull The data warehouse is created by the script in HDFS in userhivewarehouse

bull The Metastore and HiveServer2 services are started up by the script

bull Validation tests are carried out by the script using hive andbeeline

An Example Run With cm-hive-setupExample

[rootbright70 ~] cm-hive-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -p lthivepassgt --metastoredb ltmetastoredbgt -t tmpapache-hive-110-bintargz --master node005

Hive release rsquo110-binrsquo

Using MySQL server on active headnode

Hive service will be run on node node005

Using MySQL ConnectorJ installed in usrsharejava

Hive being installed done

Creating directories for Hive done

Creating module file for Hive done

Creating configuration files for Hive done

Initializing database rsquometastore_hdfs1rsquo in MySQL done

Waiting for NameNode to be ready done

Creating HDFS directories for Hive done

Updating images done

Waiting for NameNode to be ready done

Hive setup validation

-- testing rsquohiversquo client

-- testing rsquobeelinersquo client

Hive setup validation done

Installation successfully completed

Finished

copy Bright Computing Inc

36 Hadoop-related Projects

622 Hive Removal With cm-hive-setupcm-hive-setup should also be used to remove the Hive instance Dataand metadata will not be removed

Example

[rootbright70 ~] cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Hive directories done

Updating images done

Removal successfully completed

Finished

623 BeelineThe latest Hive releases include HiveServer2 which supports Beeline com-mand shell Beeline is a JDBC client based on the SQLLine CLI (httpsqllinesourceforgenet) In the following example Beelineconnects to HiveServer2

Example

[rootbright70 ~] beeline -u jdbchive2node005cmcluster10000 -d orgapachehivejdbcHiveDriver -e rsquoSHOW TABLESrsquo

Connecting to jdbchive2node005cmcluster10000

Connected to Apache Hive (version 110)

Driver Hive JDBC (version 110)

Transaction isolation TRANSACTION_REPEATABLE_READ

+-----------+--+

| tab_name |

+-----------+--+

| test |

| test2 |

+-----------+--+

2 rows selected (0243 seconds)

Beeline version 110 by Apache Hive

Closing 0 jdbchive2node005cmcluster10000

63 KafkaApache Kafka is a distributed publish-subscribe messaging system Amongother usages Kafka is used as a replacement for message broker forwebsite activity tracking for log aggregation The Apache Kafka tar-ball should be downloaded from httpkafkaapacheorg wheredifferent pre-built tarballs are available depeding on the preferred Scalaversion

631 Kafka Installation With cm-kafka-setupBright Cluster Manager provides cm-kafka-setup to carry out Kafkainstallation

Prerequisites For Kafka Installation And What Kafka Installation DoesThe following applies to using cm-kafka-setup

copy Bright Computing Inc

64 Pig 37

bull A Hadoop instance with ZooKeeper must already be installed

bull cm-kafka-setup installs Kafka only on the ZooKeeper nodes

bull The script assigns no roles to nodes

bull Kafka is copied by the script to a subdirectory under cmsharedhadoop

bull Kafka configuration files are copied by the script to under etchadoop

An Example Run With cm-kafka-setup

Example

[rootbright70 ~] cm-kafka-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpkafka_211-0821tgz

Kafka release rsquo0821rsquo for Scala rsquo211rsquo

Found Hadoop instance rsquohdfs1rsquo release 121

Kafka being installed done

Creating directories for Kafka done

Creating module file for Kafka done

Creating configuration files for Kafka done

Updating images done

Initializing services for Kafka (on ZooKeeper nodes) done

Executing validation test done

Installation successfully completed

Finished

632 Kafka Removal With cm-kafka-setupcm-kafka-setup should also be used to remove the Kafka instance

Example

[rootbright70 ~] cm-kafka-setup -u hdfs1

Requested removal of Kafka for Hadoop instance hdfs1

Stoppingremoving services done

Removing module file done

Removing additional Kafka directories done

Updating images done

Removal successfully completed

Finished

64 PigApache Pig is a platform for analyzing large data sets Pig consists of ahigh-level language for expressing data analysis programs coupled withinfrastructure for evaluating these programs Pig programs are intendedby language design to fit well with ldquoembarrassingly parallelrdquo problemsthat deal with large data sets The Apache Pig tarball should be down-loaded from one of the locations specified in Section 12 depending onthe chosen distribution

641 Pig Installation With cm-pig-setupBright Cluster Manager provides cm-pig-setup to carry out Pig instal-lation

copy Bright Computing Inc

38 Hadoop-related Projects

Prerequisites For Pig Installation And What Pig Installation DoesThe following applies to using cm-pig-setup

bull A Hadoop instance must already be installed

bull cm-pig-setup installs Pig only on the active head node

bull The script assigns no roles to nodes

bull Pig is copied by the script to a subdirectory under cmsharedhadoop

bull Pig configuration files are copied by the script to under etchadoop

An Example Run With cm-pig-setup

Example

[rootbright70 ~] cm-pig-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmppig-0140targz

Pig release rsquo0140rsquo

Pig being installed done

Creating directories for Pig done

Creating module file for Pig done

Creating configuration files for Pig done

Waiting for NameNode to be ready

Waiting for NameNode to be ready done

Validating Pig setup

Validating Pig setup done

Installation successfully completed

Finished

642 Pig Removal With cm-pig-setupcm-pig-setup should also be used to remove the Pig instance

Example

[rootbright70 ~] cm-pig-setup -u hdfs1

Requested removal of Pig for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Pig directories done

Updating images done

Removal successfully completed

Finished

643 Using PigPig consists of an executable pig that can be run after the user loadsthe corresponding module Pig runs by default in ldquoMapReduce Moderdquothat is it uses the corresponding HDFS installation to store and deal withthe elaborate processing of data More thorough documentation for Pigcan be found at httppigapacheorgdocsr0140starthtml

Pig can be used in interactive mode using the Grunt shell

[rootbright70 ~] module load hadoophdfs1

[rootbright70 ~] module load pighdfs1

[rootbright70 ~] pig

copy Bright Computing Inc

65 Spark 39

140826 115741 INFO pigExecTypeProvider Trying ExecType LOCAL

140826 115741 INFO pigExecTypeProvider Trying ExecType MAPREDUCE

140826 115741 INFO pigExecTypeProvider Picked MAPREDUCE as the ExecType

gruntgt

or in batch mode using a Pig Latin script

[rootbright70 ~] module load hadoophdfs1

[rootbright70 ~] module load pighdfs1

[rootbright70 ~] pig -v -f tmpsmokepig

In both cases Pig runs in ldquoMapReduce moderdquo thus working on thecorresponding HDFS instance

65 SparkApache Spark is an engine for processing Hadoop data It can carry outgeneral data processing similar to MapReduce but typically faster

Spark can also carry out the following with the associated high-leveltools

bull stream feed processing with Spark Streaming

bull SQL queries on structured distributed data with Spark SQL

bull processing with machine learning algorithms using MLlib

bull graph computation for arbitrarily-connected networks with graphX

The Apache Spark tarball should be downloaded from httpsparkapacheorg where different pre-built tarballs are available for Hadoop1x for CDH 4 and for Hadoop 2x

651 Spark Installation With cm-spark-setupBright Cluster Manager provides cm-spark-setup to carry out Sparkinstallation

Prerequisites For Spark Installation And What Spark Installation DoesThe following applies to using cm-spark-setup

bull A Hadoop instance must already be installed

bull Spark can be installed in two different deployment modes Stan-dalone or YARN

ndash Standalone mode This is the default for Apache Hadoop 1xCloudera CDH 4x and Hortonworks HDP 13x

It is possible to force the Standalone mode deployment byusing the additional option--standalone

When installing in standalone mode the script installs Sparkon the active head node and on the DataNodes of the cho-sen Hadoop instance

copy Bright Computing Inc

40 Hadoop-related Projects

The Spark Master service runs on the active head node bydefault but can be specified to run on another node byusing the option --masterSpark Worker services run on all DataNodes

ndash YARN mode This is the default for Apache Hadoop 2x Cloud-era CDH 5x Hortonworks 2x and Pivotal 2x The default canbe overridden by using the --standalone option

When installing in YARN mode the script installs Sparkonly on the active head node

bull The script assigns no roles to nodes

bull Spark is copied by the script to a subdirectory under cmsharedhadoop

bull Spark configuration files are copied by the script to under etchadoop

An Example Run With cm-spark-setup in YARN modeExample

[rootbright70 ~] cm-spark-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpspark-130-bin-hadoop24tgz

Spark release rsquo130-bin-hadoop24rsquo

Found Hadoop instance rsquohdfs1rsquo release 260

Spark will be installed in YARN (clientcluster) mode

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Waiting for NameNode to be ready done

Copying Spark assembly jar to HDFS done

Waiting for NameNode to be ready done

Validating Spark setup done

Installation successfully completed

Finished

An Example Run With cm-spark-setup in standalone modeExample

[rootbright70 ~] cm-spark-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpspark-130-bin-hadoop24tgz --standalone --master node005

Spark release rsquo130-bin-hadoop24rsquo

Found Hadoop instance rsquohdfs1rsquo release 260

Spark will be installed to work in Standalone mode

Spark Master service will be run on node node005

Spark will use all DataNodes as WorkerNodes

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Updating images done

Initializing Spark Master service done

Initializing Spark Worker service done

Validating Spark setup done

copy Bright Computing Inc

66 Sqoop 41

Installation successfully completed

Finished

652 Spark Removal With cm-spark-setupcm-spark-setup should also be used to remove the Spark instance

Example

[rootbright70 ~] cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Spark directories done

Removal successfully completed

Finished

653 Using SparkSpark supports two deploy modes to launch Spark applications on YARNConsidering the SparkPi example provided with Spark

bull In ldquoyarn-clientrdquo mode the Spark driver runs in the client processand the SparkPi application is run as a child thread of ApplicationMaster

[rootbright70 ~] module load sparkhdfs1

[rootbright70 ~] spark-submit --master yarn-client --class orgapachesparkexamplesSparkPi $SPARK_PREFIXlibspark-examples-jar

bull In ldquoyarn-clusterrdquo mode the Spark driver runs inside an ApplicationMaster process which is managed by YARN on the cluster

[rootbright70 ~] module load sparkhdfs1

[rootbright70 ~] spark-submit --master yarn-cluster --class orgapachesparkexamplesSparkPi $SPARK_PREFIXlibspark-examples-jar

66 SqoopApache Sqoop is a tool designed to transfer bulk data between Hadoopand an RDBMS Sqoop uses MapReduce to import and export data BrightCluster Manager supports transfers between Sqoop and MySQL

RHEL7 and SLES12 use MariaDB and are not yet supported by theavailable versions of Sqoop at the time of writing (April 2015) At presentthe latest Sqoop stable release is 145 while the latest Sqoop2 versionis 1995 Sqoop2 is incompatible with Sqoop it is not feature-completeand it is not yet intended for production use The Bright Computing uti-tility cm-sqoop-setup does not as yet support Sqoop2

661 Sqoop Installation With cm-sqoop-setupBright Cluster Manager provides cm-sqoop-setup to carry out Sqoopinstallation

copy Bright Computing Inc

42 Hadoop-related Projects

Prerequisites For Sqoop Installation And What Sqoop Installation DoesThe following requirements and conditions apply to running thecm-sqoop-setup script

bull A Hadoop instance must already be installed

bull Before running the script the version of themysql-connector-java package should be checkedSqoop works with releases 5134 or later of this package Ifmysql-connector-java provides a newer release then thefollowing must be done to ensure that Sqoop setup works

ndash a suitable 5134 or later release of ConnectorJ is downloadedfrom httpdevmysqlcomdownloadsconnectorj

ndash cm-sqoop-setup is run with the --conn option in order tospecify the connector version to be used

Example

--conn tmpmysql-connector-java-5134-binjar

bull The cm-sqoop-setup script installs Sqoop only on the activehead node A different node can be specified by using the option--master

bull The script assigns no roles to nodes

bull Sqoop executables are copied by the script to a subdirectory undercmsharedhadoop

bull Sqoop configuration files are copied by the script and placed underetchadoop

bull The Metastore service is started up by the script

An Example Run With cm-sqoop-setupExample

[rootbright70 ~] cm-sqoop-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpsqoop-145bin__hadoop-204-alphatargz--conn tmpmysql-connector-java-5134-binjar --master node005

Using MySQL ConnectorJ from tmpmysql-connector-java-5134-binjar

Sqoop release rsquo145bin__hadoop-204-alpharsquo

Sqoop service will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 220

Sqoop being installed done

Creating directories for Sqoop done

Creating module file for Sqoop done

Creating configuration files for Sqoop done

Updating images done

Installation successfully completed

Finished

copy Bright Computing Inc

67 Storm 43

662 Sqoop Removal With cm-sqoop-setupcm-sqoop-setup should be used to remove the Sqoop instance

Example

[rootbright70 ~] cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Sqoop directories done

Updating images done

Removal successfully completed

Finished

67 StormApache Storm is a distributed realtime computation system While Hadoopis focused on batch processing Storm can process streams of data

Other parallels between Hadoop and Storm

bull users run ldquojobsrdquo in Hadoop and ldquotopologiesrdquo in Storm

bull the master node for Hadoop jobs runs the ldquoJobTrackerrdquo or ldquoRe-sourceManagerrdquo daemons to deal with resource management andscheduling while the master node for Storm runs an analogous dae-mon called ldquoNimbusrdquo

bull each worker node for Hadoop runs daemons called ldquoTaskTrackerrdquoor ldquoNodeManagerrdquo while the worker nodes for Storm runs an anal-ogous daemon called ldquoSupervisorrdquo

bull both Hadoop in the case of NameNode HA and Storm leverageldquoZooKeeperrdquo for coordination

671 Storm Installation With cm-storm-setupBright Cluster Manager provides cm-storm-setup to carry out Storminstallation

Prerequisites For Storm Installation And What Storm Installation DoesThe following applies to using cm-storm-setup

bull A Hadoop instance with ZooKeeper must already be installed

bull The cm-storm-setup script only installs Storm on the active headnode and on the DataNodes of the chosen Hadoop instance by de-fault A node other than master can be specified by using the option--master or its alias for this setup script --nimbus

bull The script assigns no roles to nodes

bull Storm executables are copied by the script to a subdirectory undercmsharedhadoop

bull Storm configuration files are copied by the script to under etchadoop This is done both on the active headnode and on thenecessary image(s)

bull Validation tests are carried out by the script

copy Bright Computing Inc

44 Hadoop-related Projects

An Example Run With cm-storm-setup

Example

[rootbright70 ~] cm-storm-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t apache-storm-094targz

--nimbus node005

Storm release rsquo094rsquo

Storm Nimbus and UI services will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 220

Storm being installed done

Creating directories for Storm done

Creating module file for Storm done

Creating configuration files for Storm done

Updating images done

Initializing worker services for Storm (on DataNodes) done

Initializing Nimbus services for Storm done

Executing validation test done

Installation successfully completed

Finished

The cm-storm-setup installation script submits a validation topol-ogy (topology in the Storm sense) called WordCount After a success-ful installation users can connect to the Storm UI on the host ltnim-busgt the Nimbus server at httpltnimbusgt10080 There theycan check the status of WordCount and can kill it

672 Storm Removal With cm-storm-setupThe cm-storm-setup script should also be used to remove the Storminstance

Example

[rootbright70 ~] cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Storm directories done

Updating images done

Removal successfully completed

Finished

673 Using StormThe following example shows how to submit a topology and then verifythat it has been submitted successfully (some lines elided)

[rootbright70 ~] module load stormhdfs1

[rootbright70 ~] storm jar cmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-jar stormstarterWordCountTopology WordCount2

470 [main] INFO backtypestormStormSubmitter - Jar not uploaded to m

copy Bright Computing Inc

67 Storm 45

aster yet Submitting jar

476 [main] INFO backtypestormStormSubmitter - Uploading topology jar cmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jar to assigned location tmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jar

Start uploading file rsquocmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jarrsquo to rsquotmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jarrsquo (3248859 bytes)

[==================================================] 3248859 3248859

File rsquocmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jarrsquo uploaded to rsquotmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jarrsquo(3248859 bytes)

508 [main] INFO backtypestormStormSubmitter - Successfully uploadedtopology jar to assigned location tmpstorm-hdfs1-localnimbusinbox

stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jar

508 [main] INFO backtypestormStormSubmitter - Submitting topology WordCount2 in distributed mode with conf topologyworkers3topologydebugtrue

687 [main] INFO backtypestormStormSubmitter - Finished submitting topology WordCount2

[roothadoopdev ~] storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 35: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

61 Accumulo 33

The option-s ltheapsizegt

is not mandatory If not set a default value of 1GB is used

The option--master ltnodenamegt

is not mandatory It is used to set the node on which the Garbage Collec-tor Master Tracer and Monitor services run If not set then these servicesare run on the head node by default

Example

[rootbright70 ~] cm-accumulo-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -p ltrootpassgt -s ltheapsizegt -t tmpaccumulo-162-bintargz --master node005

Accumulo release rsquo162rsquo

Accumulo GC Master Monitor and Tracer services will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 260

Accumulo being installed done

Creating directories for Accumulo done

Creating module file for Accumulo done

Creating configuration files for Accumulo done

Updating images done

Setting up Accumulo directories in HDFS done

Executing rsquoaccumulo initrsquo done

Initializing services for Accumulo (on DataNodes) done

Initializing master services for Accumulo done

Waiting for NameNode to be ready done

Executing validation test done

Installation successfully completed

Finished

612 Accumulo Removal With cm-accumulo-setupcm-accumulo-setup should also be used to remove the Accumulo in-stance Data and metadata will not be removed

Example

[rootbright70 ~] cm-accumulo-setup -u hdfs1

Requested removal of Accumulo for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Accumulo directories done

Updating images done

Removal successfully completed

Finished

613 Accumulo MapReduce ExampleAccumulo jobs must be run using accumulo system user

Example

[rootbright70 ~] su - accumulo

bash-41$ module load accumulohdfs1

bash-41$ cd $ACCUMULO_HOME

bash-41$ bintoolsh libaccumulo-examples-simplejar

copy Bright Computing Inc

34 Hadoop-related Projects

orgapacheaccumuloexamplessimplemapreduceTeraSortIngest -i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 --maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

62 HiveApache Hive is a data warehouse software It stores its data using HDFSand can query it via the SQL-like HiveQL language Metadata values forits tables and partitions are kept in the Hive Metastore which is an SQLdatabase typically MySQL or Postgres Data can be exposed to clientsusing the following client-server mechanisms

bull Metastore accessed with the hive client

bull HiveServer2 accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locationsspecified in Section 12 depending on the chosen distribution

621 Hive Installation With cm-hive-setupBright Cluster Manager provides cm-hive-setup to carry out Hive in-stallation

Prerequisites For Hive Installation And What Hive Installation DoesThe following applies to using cm-hive-setup

bull A Hadoop instance must already be installed

bull Before running the script the version of themysql-connector-java package should be checked Hiveworks with releases 5118 or earlier of this package Ifmysql-connector-java provides a newer release then thefollowing must be done to ensure that Hive setup works

ndash a suitable 5118 or earlier release of ConnectorJ isdownloaded from httpdevmysqlcomdownloadsconnectorj

ndash cm-hive-setup is run with the --conn option to specify theconnector version to use

Example

--conn tmpmysql-connector-java-5118-binjar

bull Before running the script the following statements must be exe-cuted explicitly by the administrator using a MySQL client

GRANT ALL PRIVILEGES ON ltmetastoredbgt TO rsquohiversquorsquorsquoIDENTIFIED BY rsquolthivepassgtrsquoFLUSH PRIVILEGES

DROP DATABASE IF EXISTS ltmetastoredbgt

In the preceding statements

ndash ltmetastoredbgt is the name of metastore database to be used byHive The same name is used later by cm-hive-setup

copy Bright Computing Inc

62 Hive 35

ndash lthivepassgt is the password for hive user The same passwordis used later by cm-hive-setup

ndash The DROP line is needed only if a database with that namealready exists

• The cm-hive-setup script installs Hive by default on the active head node.

It can be installed on another node instead, as shown in the next example, with the use of the --master option. In that case Connector/J should be installed in the software image of the node.

• The script assigns no roles to nodes.

• Hive executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Hive configuration files are copied by the script to under /etc/hadoop/.

• The instance of MySQL on the head node is initialized as the Metastore database for Bright Cluster Manager by the script. A different MySQL server can be specified by using the options --mysqlserver and --mysqlport.

• The data warehouse is created by the script in HDFS, in /user/hive/warehouse/.

• The Metastore and HiveServer2 services are started up by the script.

• Validation tests are carried out by the script, using hive and beeline.
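The database preparation described in the prerequisites can be carried out with a session like the following sketch. The metastore database name metastore_hdfs1 and the password shown are placeholder values chosen for illustration:

[root@bright70 ~]# mysql -u root -p
mysql> GRANT ALL PRIVILEGES ON metastore_hdfs1.* TO 'hive'@'%'
    -> IDENTIFIED BY 'hivepass123';
mysql> FLUSH PRIVILEGES;
mysql> quit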

An Example Run With cm-hive-setup
Example

[root@bright70 ~]# cm-hive-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-p <hivepass> --metastoredb <metastoredb> \
-t /tmp/apache-hive-1.1.0-bin.tar.gz --master node005

Hive release '1.1.0-bin'

Using MySQL server on active headnode

Hive service will be run on node node005

Using MySQL Connector/J installed in /usr/share/java/

Hive being installed done

Creating directories for Hive done

Creating module file for Hive done

Creating configuration files for Hive done

Initializing database 'metastore_hdfs1' in MySQL done

Waiting for NameNode to be ready done

Creating HDFS directories for Hive done

Updating images done

Waiting for NameNode to be ready done

Hive setup validation

-- testing 'hive' client

-- testing 'beeline' client

Hive setup validation done

Installation successfully completed

Finished


6.2.2 Hive Removal With cm-hive-setup
cm-hive-setup should also be used to remove the Hive instance. Data and metadata will not be removed.

Example

[root@bright70 ~]# cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance 'hdfs1'

Stoppingremoving services done

Removing module file done

Removing additional Hive directories done

Updating images done

Removal successfully completed

Finished

6.2.3 Beeline
The latest Hive releases include HiveServer2, which supports the Beeline command shell. Beeline is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net). In the following example, Beeline connects to HiveServer2.

Example

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -e 'SHOW TABLES;'

Connecting to jdbc:hive2://node005.cm.cluster:10000

Connected to: Apache Hive (version 1.1.0)

Driver: Hive JDBC (version 1.1.0)

Transaction isolation: TRANSACTION_REPEATABLE_READ

+-----------+--+

| tab_name |

+-----------+--+

| test |

| test2 |

+-----------+--+

2 rows selected (0.243 seconds)

Beeline version 1.1.0 by Apache Hive

Closing: 0: jdbc:hive2://node005.cm.cluster:10000
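Beeline can also execute HiveQL statements from a script file, using its -f option. The following is a sketch only, with /tmp/query.sql standing in for a script prepared by the administrator:

[root@bright70 ~]# beeline -u jdbc:hive2://node005.cm.cluster:10000 \
-d org.apache.hive.jdbc.HiveDriver -f /tmp/query.sql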

6.3 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. Among other usages, Kafka is used as a replacement for a message broker, for website activity tracking, and for log aggregation. The Apache Kafka tarball should be downloaded from http://kafka.apache.org, where different pre-built tarballs are available, depending on the preferred Scala version.

6.3.1 Kafka Installation With cm-kafka-setup
Bright Cluster Manager provides cm-kafka-setup to carry out Kafka installation.

Prerequisites For Kafka Installation And What Kafka Installation Does
The following applies to using cm-kafka-setup:


• A Hadoop instance with ZooKeeper must already be installed.

• cm-kafka-setup installs Kafka only on the ZooKeeper nodes.

• The script assigns no roles to nodes.

• Kafka is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Kafka configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-kafka-setup

Example

[root@bright70 ~]# cm-kafka-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/kafka_2.11-0.8.2.1.tgz

Kafka release '0.8.2.1' for Scala '2.11'

Found Hadoop instance 'hdfs1', release 1.2.1

Kafka being installed done

Creating directories for Kafka done

Creating module file for Kafka done

Creating configuration files for Kafka done

Updating images done

Initializing services for Kafka (on ZooKeeper nodes) done

Executing validation test done

Installation successfully completed

Finished
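A quick functional check can be carried out with the command-line tools shipped in the Kafka tarball. The following is a sketch only: it assumes a kafka module named according to the usual Bright Cluster Manager convention, and a ZooKeeper service reachable on node001 at the default port 2181:

[root@bright70 ~]# module load kafka/hdfs1
[root@bright70 ~]# kafka-topics.sh --create --zookeeper node001:2181 \
--replication-factor 1 --partitions 1 --topic smoketest
[root@bright70 ~]# kafka-topics.sh --list --zookeeper node001:2181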

6.3.2 Kafka Removal With cm-kafka-setup
cm-kafka-setup should also be used to remove the Kafka instance.

Example

[root@bright70 ~]# cm-kafka-setup -u hdfs1

Requested removal of Kafka for Hadoop instance hdfs1

Stoppingremoving services done

Removing module file done

Removing additional Kafka directories done

Updating images done

Removal successfully completed

Finished

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/pig-0.14.0.tar.gz

Pig release '0.14.0'

Pig being installed done

Creating directories for Pig done

Creating module file for Pig done

Creating configuration files for Pig done

Waiting for NameNode to be ready

Waiting for NameNode to be ready done

Validating Pig setup

Validating Pig setup done

Installation successfully completed

Finished

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1

Requested removal of Pig for Hadoop instance 'hdfs1'

Stoppingremoving services done

Removing module file done

Removing additional Pig directories done

Updating images done

Removal successfully completed

Finished

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and process data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig


14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: LOCAL

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType: MAPREDUCE

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

gruntgt

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
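The contents of /tmp/smoke.pig are not shown in the preceding run. A minimal Pig Latin script along the following lines could serve as such a smoke test; the script name and input path are illustrative only, and the input file must first be copied into HDFS:

[root@bright70 ~]# hdfs dfs -put /etc/passwd /user/root/passwd
[root@bright70 ~]# cat > /tmp/smoke.pig <<'EOF'
-- load the copied password file, split fields on ':', keep the user names
A = LOAD '/user/root/passwd' USING PigStorage(':') AS (user:chararray);
B = FOREACH A GENERATE user;
DUMP B;
EOF
[root@bright70 ~]# pig -v -f /tmp/smoke.pig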

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option --standalone.

When installing in Standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release 2.6.0

Spark will be installed in YARN (client/cluster) mode

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Waiting for NameNode to be ready done

Copying Spark assembly jar to HDFS done

Waiting for NameNode to be ready done

Validating Spark setup done

Installation successfully completed

Finished

An Example Run With cm-spark-setup in standalone mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release 2.6.0

Spark will be installed to work in Standalone mode

Spark Master service will be run on node node005

Spark will use all DataNodes as WorkerNodes

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Updating images done

Initializing Spark Master service done

Initializing Spark Worker service done

Validating Spark setup done


Installation successfully completed

Finished

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance 'hdfs1'

Stoppingremoving services done

Removing module file done

Removing additional Spark directories done

Removal successfully completed

Finished

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of the Application Master:

[root@bright70 ~]# module load spark/hdfs1

[root@bright70 ~]# spark-submit --master yarn-client \
--class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster:

[root@bright70 ~]# module load spark/hdfs1

[root@bright70 ~]# spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar
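For a Standalone mode installation, applications are submitted to the Spark Master instead of to YARN. The following is a sketch only, assuming the Master was installed on node005 and listens on the default port 7077:

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master spark://node005.cm.cluster:7077 \
--class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar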

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked (an example check is sketched after this list). Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j

– cm-sqoop-setup is run with the --conn option in order to specify the connector version to be used.

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script and placed under /etc/hadoop/.

• The Metastore service is started up by the script.
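On RPM-based distributions the installed release of the connector package can be checked beforehand, for example:

[root@bright70 ~]# rpm -q mysql-connector-java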

An Example Run With cm-sqoop-setup
Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz \
--conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005

Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar

Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'

Sqoop service will be run on node node005

Found Hadoop instance 'hdfs1', release 2.2.0

Sqoop being installed done

Creating directories for Sqoop done

Creating module file for Sqoop done

Creating configuration files for Sqoop done

Updating images done

Installation successfully completed

Finished
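Once Sqoop is installed, a MySQL table can be imported into HDFS along the lines of the following sketch. The module name follows the usual Bright Cluster Manager convention, and the database, credentials, and table names are placeholder values:

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://node005.cm.cluster/testdb \
--username dbuser -P --table employees --target-dir /user/root/employees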


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance 'hdfs1'

Stoppingremoving services done

Removing module file done

Removing additional Sqoop directories done

Updating images done

Removal successfully completed

Finished

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop and "topologies" in Storm

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemon to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus"

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor"

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance with ZooKeeper must already be installed.

• The cm-storm-setup script installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than the master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t apache-storm-0.9.4.tar.gz --nimbus node005

Storm release '0.9.4'

Storm Nimbus and UI services will be run on node node005

Found Hadoop instance 'hdfs1', release 2.2.0

Storm being installed done

Creating directories for Storm done

Creating module file for Storm done

Creating configuration files for Storm done

Updating images done

Initializing worker services for Storm (on DataNodes) done

Initializing Nimbus services for Storm done

Executing validation test done

Installation successfully completed

Finished

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.
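The validation topology can also be checked and, if desired, killed from the command line, as in the following sketch:

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm list
[root@bright70 ~]# storm kill WordCount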

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance 'hdfs1'

Stoppingremoving services done

Removing module file done

Removing additional Storm directories done

Updating images done

Removal successfully completed

Finished

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1

[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2

470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...

476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

[==================================================] 3248859 / 3248859

File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}

687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list

Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 36: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

34 Hadoop-related Projects

orgapacheaccumuloexamplessimplemapreduceTeraSortIngest -i hdfs1 -z $ACCUMULO_ZOOKEEPERS -u root -p secret --count 10 --minKeySize 10 --maxKeySize 10 --minValueSize 78 --maxValueSize 78 --table sort --splits 10

62 HiveApache Hive is a data warehouse software It stores its data using HDFSand can query it via the SQL-like HiveQL language Metadata values forits tables and partitions are kept in the Hive Metastore which is an SQLdatabase typically MySQL or Postgres Data can be exposed to clientsusing the following client-server mechanisms

bull Metastore accessed with the hive client

bull HiveServer2 accessed with the beeline client

The Apache Hive tarball should be downloaded from one of the locationsspecified in Section 12 depending on the chosen distribution

621 Hive Installation With cm-hive-setupBright Cluster Manager provides cm-hive-setup to carry out Hive in-stallation

Prerequisites For Hive Installation And What Hive Installation DoesThe following applies to using cm-hive-setup

bull A Hadoop instance must already be installed

bull Before running the script the version of themysql-connector-java package should be checked Hiveworks with releases 5118 or earlier of this package Ifmysql-connector-java provides a newer release then thefollowing must be done to ensure that Hive setup works

ndash a suitable 5118 or earlier release of ConnectorJ isdownloaded from httpdevmysqlcomdownloadsconnectorj

ndash cm-hive-setup is run with the --conn option to specify theconnector version to use

Example

--conn tmpmysql-connector-java-5118-binjar

bull Before running the script the following statements must be exe-cuted explicitly by the administrator using a MySQL client

GRANT ALL PRIVILEGES ON ltmetastoredbgt TO rsquohiversquorsquorsquoIDENTIFIED BY rsquolthivepassgtrsquoFLUSH PRIVILEGES

DROP DATABASE IF EXISTS ltmetastoredbgt

In the preceding statements

ndash ltmetastoredbgt is the name of metastore database to be used byHive The same name is used later by cm-hive-setup

copy Bright Computing Inc

62 Hive 35

ndash lthivepassgt is the password for hive user The same passwordis used later by cm-hive-setup

ndash The DROP line is needed only if a database with that namealready exists

bull The cm-hive-setup script installs Hive by default on the activehead node

It can be installed on another node instead as shown in the nextexample with the use of the --master option In that case Con-nectorJ should be installed in the software image of the node

bull The script assigns no roles to nodes

bull Hive executables are copied by the script to a subdirectory undercmsharedhadoop

bull Hive configuration files are copied by the script to under etchadoop

bull The instance of MySQL on the head node is initialized as theMetastore database for the Bright Cluster Manager by the scriptA different MySQL server can be specified by using the options--mysqlserver and --mysqlport

bull The data warehouse is created by the script in HDFS in userhivewarehouse

bull The Metastore and HiveServer2 services are started up by the script

bull Validation tests are carried out by the script using hive andbeeline

An Example Run With cm-hive-setupExample

[rootbright70 ~] cm-hive-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -p lthivepassgt --metastoredb ltmetastoredbgt -t tmpapache-hive-110-bintargz --master node005

Hive release rsquo110-binrsquo

Using MySQL server on active headnode

Hive service will be run on node node005

Using MySQL ConnectorJ installed in usrsharejava

Hive being installed done

Creating directories for Hive done

Creating module file for Hive done

Creating configuration files for Hive done

Initializing database rsquometastore_hdfs1rsquo in MySQL done

Waiting for NameNode to be ready done

Creating HDFS directories for Hive done

Updating images done

Waiting for NameNode to be ready done

Hive setup validation

-- testing rsquohiversquo client

-- testing rsquobeelinersquo client

Hive setup validation done

Installation successfully completed

Finished

copy Bright Computing Inc

36 Hadoop-related Projects

622 Hive Removal With cm-hive-setupcm-hive-setup should also be used to remove the Hive instance Dataand metadata will not be removed

Example

[rootbright70 ~] cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Hive directories done

Updating images done

Removal successfully completed

Finished

623 BeelineThe latest Hive releases include HiveServer2 which supports Beeline com-mand shell Beeline is a JDBC client based on the SQLLine CLI (httpsqllinesourceforgenet) In the following example Beelineconnects to HiveServer2

Example

[rootbright70 ~] beeline -u jdbchive2node005cmcluster10000 -d orgapachehivejdbcHiveDriver -e rsquoSHOW TABLESrsquo

Connecting to jdbchive2node005cmcluster10000

Connected to Apache Hive (version 110)

Driver Hive JDBC (version 110)

Transaction isolation TRANSACTION_REPEATABLE_READ

+-----------+--+

| tab_name |

+-----------+--+

| test |

| test2 |

+-----------+--+

2 rows selected (0243 seconds)

Beeline version 110 by Apache Hive

Closing 0 jdbchive2node005cmcluster10000

63 KafkaApache Kafka is a distributed publish-subscribe messaging system Amongother usages Kafka is used as a replacement for message broker forwebsite activity tracking for log aggregation The Apache Kafka tar-ball should be downloaded from httpkafkaapacheorg wheredifferent pre-built tarballs are available depeding on the preferred Scalaversion

631 Kafka Installation With cm-kafka-setupBright Cluster Manager provides cm-kafka-setup to carry out Kafkainstallation

Prerequisites For Kafka Installation And What Kafka Installation DoesThe following applies to using cm-kafka-setup

copy Bright Computing Inc

64 Pig 37

bull A Hadoop instance with ZooKeeper must already be installed

bull cm-kafka-setup installs Kafka only on the ZooKeeper nodes

bull The script assigns no roles to nodes

bull Kafka is copied by the script to a subdirectory under cmsharedhadoop

bull Kafka configuration files are copied by the script to under etchadoop

An Example Run With cm-kafka-setup

Example

[rootbright70 ~] cm-kafka-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpkafka_211-0821tgz

Kafka release rsquo0821rsquo for Scala rsquo211rsquo

Found Hadoop instance rsquohdfs1rsquo release 121

Kafka being installed done

Creating directories for Kafka done

Creating module file for Kafka done

Creating configuration files for Kafka done

Updating images done

Initializing services for Kafka (on ZooKeeper nodes) done

Executing validation test done

Installation successfully completed

Finished

632 Kafka Removal With cm-kafka-setupcm-kafka-setup should also be used to remove the Kafka instance

Example

[rootbright70 ~] cm-kafka-setup -u hdfs1

Requested removal of Kafka for Hadoop instance hdfs1

Stoppingremoving services done

Removing module file done

Removing additional Kafka directories done

Updating images done

Removal successfully completed

Finished

64 PigApache Pig is a platform for analyzing large data sets Pig consists of ahigh-level language for expressing data analysis programs coupled withinfrastructure for evaluating these programs Pig programs are intendedby language design to fit well with ldquoembarrassingly parallelrdquo problemsthat deal with large data sets The Apache Pig tarball should be down-loaded from one of the locations specified in Section 12 depending onthe chosen distribution

641 Pig Installation With cm-pig-setupBright Cluster Manager provides cm-pig-setup to carry out Pig instal-lation

copy Bright Computing Inc

38 Hadoop-related Projects

Prerequisites For Pig Installation And What Pig Installation DoesThe following applies to using cm-pig-setup

bull A Hadoop instance must already be installed

bull cm-pig-setup installs Pig only on the active head node

bull The script assigns no roles to nodes

bull Pig is copied by the script to a subdirectory under cmsharedhadoop

bull Pig configuration files are copied by the script to under etchadoop

An Example Run With cm-pig-setup

Example

[rootbright70 ~] cm-pig-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmppig-0140targz

Pig release rsquo0140rsquo

Pig being installed done

Creating directories for Pig done

Creating module file for Pig done

Creating configuration files for Pig done

Waiting for NameNode to be ready

Waiting for NameNode to be ready done

Validating Pig setup

Validating Pig setup done

Installation successfully completed

Finished

642 Pig Removal With cm-pig-setupcm-pig-setup should also be used to remove the Pig instance

Example

[rootbright70 ~] cm-pig-setup -u hdfs1

Requested removal of Pig for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Pig directories done

Updating images done

Removal successfully completed

Finished

643 Using PigPig consists of an executable pig that can be run after the user loadsthe corresponding module Pig runs by default in ldquoMapReduce Moderdquothat is it uses the corresponding HDFS installation to store and deal withthe elaborate processing of data More thorough documentation for Pigcan be found at httppigapacheorgdocsr0140starthtml

Pig can be used in interactive mode using the Grunt shell

[rootbright70 ~] module load hadoophdfs1

[rootbright70 ~] module load pighdfs1

[rootbright70 ~] pig

copy Bright Computing Inc

65 Spark 39

140826 115741 INFO pigExecTypeProvider Trying ExecType LOCAL

140826 115741 INFO pigExecTypeProvider Trying ExecType MAPREDUCE

140826 115741 INFO pigExecTypeProvider Picked MAPREDUCE as the ExecType

gruntgt

or in batch mode using a Pig Latin script

[rootbright70 ~] module load hadoophdfs1

[rootbright70 ~] module load pighdfs1

[rootbright70 ~] pig -v -f tmpsmokepig

In both cases Pig runs in ldquoMapReduce moderdquo thus working on thecorresponding HDFS instance

65 SparkApache Spark is an engine for processing Hadoop data It can carry outgeneral data processing similar to MapReduce but typically faster

Spark can also carry out the following with the associated high-leveltools

bull stream feed processing with Spark Streaming

bull SQL queries on structured distributed data with Spark SQL

bull processing with machine learning algorithms using MLlib

bull graph computation for arbitrarily-connected networks with graphX

The Apache Spark tarball should be downloaded from httpsparkapacheorg where different pre-built tarballs are available for Hadoop1x for CDH 4 and for Hadoop 2x

651 Spark Installation With cm-spark-setupBright Cluster Manager provides cm-spark-setup to carry out Sparkinstallation

Prerequisites For Spark Installation And What Spark Installation DoesThe following applies to using cm-spark-setup

bull A Hadoop instance must already be installed

bull Spark can be installed in two different deployment modes Stan-dalone or YARN

ndash Standalone mode This is the default for Apache Hadoop 1xCloudera CDH 4x and Hortonworks HDP 13x

It is possible to force the Standalone mode deployment byusing the additional option--standalone

When installing in standalone mode the script installs Sparkon the active head node and on the DataNodes of the cho-sen Hadoop instance

copy Bright Computing Inc

40 Hadoop-related Projects

The Spark Master service runs on the active head node bydefault but can be specified to run on another node byusing the option --masterSpark Worker services run on all DataNodes

ndash YARN mode This is the default for Apache Hadoop 2x Cloud-era CDH 5x Hortonworks 2x and Pivotal 2x The default canbe overridden by using the --standalone option

When installing in YARN mode the script installs Sparkonly on the active head node

bull The script assigns no roles to nodes

bull Spark is copied by the script to a subdirectory under cmsharedhadoop

bull Spark configuration files are copied by the script to under etchadoop

An Example Run With cm-spark-setup in YARN modeExample

[rootbright70 ~] cm-spark-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpspark-130-bin-hadoop24tgz

Spark release rsquo130-bin-hadoop24rsquo

Found Hadoop instance rsquohdfs1rsquo release 260

Spark will be installed in YARN (clientcluster) mode

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Waiting for NameNode to be ready done

Copying Spark assembly jar to HDFS done

Waiting for NameNode to be ready done

Validating Spark setup done

Installation successfully completed

Finished

An Example Run With cm-spark-setup in standalone modeExample

[rootbright70 ~] cm-spark-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpspark-130-bin-hadoop24tgz --standalone --master node005

Spark release rsquo130-bin-hadoop24rsquo

Found Hadoop instance rsquohdfs1rsquo release 260

Spark will be installed to work in Standalone mode

Spark Master service will be run on node node005

Spark will use all DataNodes as WorkerNodes

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Updating images done

Initializing Spark Master service done

Initializing Spark Worker service done

Validating Spark setup done

copy Bright Computing Inc

66 Sqoop 41

Installation successfully completed

Finished

652 Spark Removal With cm-spark-setupcm-spark-setup should also be used to remove the Spark instance

Example

[rootbright70 ~] cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Spark directories done

Removal successfully completed

Finished

653 Using SparkSpark supports two deploy modes to launch Spark applications on YARNConsidering the SparkPi example provided with Spark

bull In ldquoyarn-clientrdquo mode the Spark driver runs in the client processand the SparkPi application is run as a child thread of ApplicationMaster

[rootbright70 ~] module load sparkhdfs1

[rootbright70 ~] spark-submit --master yarn-client --class orgapachesparkexamplesSparkPi $SPARK_PREFIXlibspark-examples-jar

bull In ldquoyarn-clusterrdquo mode the Spark driver runs inside an ApplicationMaster process which is managed by YARN on the cluster

[rootbright70 ~] module load sparkhdfs1

[rootbright70 ~] spark-submit --master yarn-cluster --class orgapachesparkexamplesSparkPi $SPARK_PREFIXlibspark-examples-jar

66 SqoopApache Sqoop is a tool designed to transfer bulk data between Hadoopand an RDBMS Sqoop uses MapReduce to import and export data BrightCluster Manager supports transfers between Sqoop and MySQL

RHEL7 and SLES12 use MariaDB and are not yet supported by theavailable versions of Sqoop at the time of writing (April 2015) At presentthe latest Sqoop stable release is 145 while the latest Sqoop2 versionis 1995 Sqoop2 is incompatible with Sqoop it is not feature-completeand it is not yet intended for production use The Bright Computing uti-tility cm-sqoop-setup does not as yet support Sqoop2

661 Sqoop Installation With cm-sqoop-setupBright Cluster Manager provides cm-sqoop-setup to carry out Sqoopinstallation

copy Bright Computing Inc

42 Hadoop-related Projects

Prerequisites For Sqoop Installation And What Sqoop Installation DoesThe following requirements and conditions apply to running thecm-sqoop-setup script

bull A Hadoop instance must already be installed

bull Before running the script the version of themysql-connector-java package should be checkedSqoop works with releases 5134 or later of this package Ifmysql-connector-java provides a newer release then thefollowing must be done to ensure that Sqoop setup works

ndash a suitable 5134 or later release of ConnectorJ is downloadedfrom httpdevmysqlcomdownloadsconnectorj

ndash cm-sqoop-setup is run with the --conn option in order tospecify the connector version to be used

Example

--conn tmpmysql-connector-java-5134-binjar

bull The cm-sqoop-setup script installs Sqoop only on the activehead node A different node can be specified by using the option--master

bull The script assigns no roles to nodes

bull Sqoop executables are copied by the script to a subdirectory undercmsharedhadoop

bull Sqoop configuration files are copied by the script and placed underetchadoop

bull The Metastore service is started up by the script

An Example Run With cm-sqoop-setupExample

[rootbright70 ~] cm-sqoop-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpsqoop-145bin__hadoop-204-alphatargz--conn tmpmysql-connector-java-5134-binjar --master node005

Using MySQL ConnectorJ from tmpmysql-connector-java-5134-binjar

Sqoop release rsquo145bin__hadoop-204-alpharsquo

Sqoop service will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 220

Sqoop being installed done

Creating directories for Sqoop done

Creating module file for Sqoop done

Creating configuration files for Sqoop done

Updating images done

Installation successfully completed

Finished

copy Bright Computing Inc

67 Storm 43

662 Sqoop Removal With cm-sqoop-setupcm-sqoop-setup should be used to remove the Sqoop instance

Example

[rootbright70 ~] cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Sqoop directories done

Updating images done

Removal successfully completed

Finished

67 StormApache Storm is a distributed realtime computation system While Hadoopis focused on batch processing Storm can process streams of data

Other parallels between Hadoop and Storm

bull users run ldquojobsrdquo in Hadoop and ldquotopologiesrdquo in Storm

bull the master node for Hadoop jobs runs the ldquoJobTrackerrdquo or ldquoRe-sourceManagerrdquo daemons to deal with resource management andscheduling while the master node for Storm runs an analogous dae-mon called ldquoNimbusrdquo

bull each worker node for Hadoop runs daemons called ldquoTaskTrackerrdquoor ldquoNodeManagerrdquo while the worker nodes for Storm runs an anal-ogous daemon called ldquoSupervisorrdquo

bull both Hadoop in the case of NameNode HA and Storm leverageldquoZooKeeperrdquo for coordination

671 Storm Installation With cm-storm-setupBright Cluster Manager provides cm-storm-setup to carry out Storminstallation

Prerequisites For Storm Installation And What Storm Installation DoesThe following applies to using cm-storm-setup

bull A Hadoop instance with ZooKeeper must already be installed

bull The cm-storm-setup script only installs Storm on the active headnode and on the DataNodes of the chosen Hadoop instance by de-fault A node other than master can be specified by using the option--master or its alias for this setup script --nimbus

bull The script assigns no roles to nodes

bull Storm executables are copied by the script to a subdirectory undercmsharedhadoop

bull Storm configuration files are copied by the script to under etchadoop This is done both on the active headnode and on thenecessary image(s)

bull Validation tests are carried out by the script

copy Bright Computing Inc

44 Hadoop-related Projects

An Example Run With cm-storm-setup

Example

[rootbright70 ~] cm-storm-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t apache-storm-094targz

--nimbus node005

Storm release rsquo094rsquo

Storm Nimbus and UI services will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 220

Storm being installed done

Creating directories for Storm done

Creating module file for Storm done

Creating configuration files for Storm done

Updating images done

Initializing worker services for Storm (on DataNodes) done

Initializing Nimbus services for Storm done

Executing validation test done

Installation successfully completed

Finished

The cm-storm-setup installation script submits a validation topol-ogy (topology in the Storm sense) called WordCount After a success-ful installation users can connect to the Storm UI on the host ltnim-busgt the Nimbus server at httpltnimbusgt10080 There theycan check the status of WordCount and can kill it

672 Storm Removal With cm-storm-setupThe cm-storm-setup script should also be used to remove the Storminstance

Example

[rootbright70 ~] cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Storm directories done

Updating images done

Removal successfully completed

Finished

673 Using StormThe following example shows how to submit a topology and then verifythat it has been submitted successfully (some lines elided)

[rootbright70 ~] module load stormhdfs1

[rootbright70 ~] storm jar cmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-jar stormstarterWordCountTopology WordCount2

470 [main] INFO backtypestormStormSubmitter - Jar not uploaded to m

copy Bright Computing Inc

67 Storm 45

aster yet Submitting jar

476 [main] INFO backtypestormStormSubmitter - Uploading topology jar cmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jar to assigned location tmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jar

Start uploading file rsquocmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jarrsquo to rsquotmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jarrsquo (3248859 bytes)

[==================================================] 3248859 3248859

File rsquocmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jarrsquo uploaded to rsquotmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jarrsquo(3248859 bytes)

508 [main] INFO backtypestormStormSubmitter - Successfully uploadedtopology jar to assigned location tmpstorm-hdfs1-localnimbusinbox

stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jar

508 [main] INFO backtypestormStormSubmitter - Submitting topology WordCount2 in distributed mode with conf topologyworkers3topologydebugtrue

687 [main] INFO backtypestormStormSubmitter - Finished submitting topology WordCount2

[roothadoopdev ~] storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 37: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

62 Hive 35

ndash lthivepassgt is the password for hive user The same passwordis used later by cm-hive-setup

ndash The DROP line is needed only if a database with that namealready exists

bull The cm-hive-setup script installs Hive by default on the activehead node

It can be installed on another node instead as shown in the nextexample with the use of the --master option In that case Con-nectorJ should be installed in the software image of the node

bull The script assigns no roles to nodes

bull Hive executables are copied by the script to a subdirectory undercmsharedhadoop

bull Hive configuration files are copied by the script to under etchadoop

bull The instance of MySQL on the head node is initialized as theMetastore database for the Bright Cluster Manager by the scriptA different MySQL server can be specified by using the options--mysqlserver and --mysqlport

bull The data warehouse is created by the script in HDFS in userhivewarehouse

bull The Metastore and HiveServer2 services are started up by the script

bull Validation tests are carried out by the script using hive andbeeline

An Example Run With cm-hive-setupExample

[rootbright70 ~] cm-hive-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -p lthivepassgt --metastoredb ltmetastoredbgt -t tmpapache-hive-110-bintargz --master node005

Hive release rsquo110-binrsquo

Using MySQL server on active headnode

Hive service will be run on node node005

Using MySQL ConnectorJ installed in usrsharejava

Hive being installed done

Creating directories for Hive done

Creating module file for Hive done

Creating configuration files for Hive done

Initializing database rsquometastore_hdfs1rsquo in MySQL done

Waiting for NameNode to be ready done

Creating HDFS directories for Hive done

Updating images done

Waiting for NameNode to be ready done

Hive setup validation

-- testing rsquohiversquo client

-- testing rsquobeelinersquo client

Hive setup validation done

Installation successfully completed

Finished

copy Bright Computing Inc

36 Hadoop-related Projects

622 Hive Removal With cm-hive-setupcm-hive-setup should also be used to remove the Hive instance Dataand metadata will not be removed

Example

[rootbright70 ~] cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Hive directories done

Updating images done

Removal successfully completed

Finished

623 BeelineThe latest Hive releases include HiveServer2 which supports Beeline com-mand shell Beeline is a JDBC client based on the SQLLine CLI (httpsqllinesourceforgenet) In the following example Beelineconnects to HiveServer2

Example

[rootbright70 ~] beeline -u jdbchive2node005cmcluster10000 -d orgapachehivejdbcHiveDriver -e rsquoSHOW TABLESrsquo

Connecting to jdbchive2node005cmcluster10000

Connected to Apache Hive (version 110)

Driver Hive JDBC (version 110)

Transaction isolation TRANSACTION_REPEATABLE_READ

+-----------+--+

| tab_name |

+-----------+--+

| test |

| test2 |

+-----------+--+

2 rows selected (0243 seconds)

Beeline version 110 by Apache Hive

Closing 0 jdbchive2node005cmcluster10000

63 KafkaApache Kafka is a distributed publish-subscribe messaging system Amongother usages Kafka is used as a replacement for message broker forwebsite activity tracking for log aggregation The Apache Kafka tar-ball should be downloaded from httpkafkaapacheorg wheredifferent pre-built tarballs are available depeding on the preferred Scalaversion

631 Kafka Installation With cm-kafka-setupBright Cluster Manager provides cm-kafka-setup to carry out Kafkainstallation

Prerequisites For Kafka Installation And What Kafka Installation DoesThe following applies to using cm-kafka-setup

copy Bright Computing Inc

64 Pig 37

bull A Hadoop instance with ZooKeeper must already be installed

bull cm-kafka-setup installs Kafka only on the ZooKeeper nodes

bull The script assigns no roles to nodes

bull Kafka is copied by the script to a subdirectory under cmsharedhadoop

bull Kafka configuration files are copied by the script to under etchadoop

An Example Run With cm-kafka-setup

Example

[rootbright70 ~] cm-kafka-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpkafka_211-0821tgz

Kafka release rsquo0821rsquo for Scala rsquo211rsquo

Found Hadoop instance rsquohdfs1rsquo release 121

Kafka being installed done

Creating directories for Kafka done

Creating module file for Kafka done

Creating configuration files for Kafka done

Updating images done

Initializing services for Kafka (on ZooKeeper nodes) done

Executing validation test done

Installation successfully completed

Finished

632 Kafka Removal With cm-kafka-setupcm-kafka-setup should also be used to remove the Kafka instance

Example

[rootbright70 ~] cm-kafka-setup -u hdfs1

Requested removal of Kafka for Hadoop instance 'hdfs1'.

Stopping/removing services... done.

Removing module file... done.

Removing additional Kafka directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.4 Pig
Apache Pig is a platform for analyzing large data sets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig programs are intended by language design to fit well with "embarrassingly parallel" problems that deal with large data sets. The Apache Pig tarball should be downloaded from one of the locations specified in Section 1.2, depending on the chosen distribution.

6.4.1 Pig Installation With cm-pig-setup
Bright Cluster Manager provides cm-pig-setup to carry out Pig installation.


Prerequisites For Pig Installation And What Pig Installation Does
The following applies to using cm-pig-setup:

• A Hadoop instance must already be installed.

• cm-pig-setup installs Pig only on the active head node.

• The script assigns no roles to nodes.

• Pig is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Pig configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-pig-setup

Example

[root@bright70 ~]# cm-pig-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/pig-0.14.0.tar.gz

Pig release '0.14.0'

Pig being installed... done.

Creating directories for Pig... done.

Creating module file for Pig... done.

Creating configuration files for Pig... done.

Waiting for NameNode to be ready...

Waiting for NameNode to be ready... done.

Validating Pig setup...

Validating Pig setup... done.

Installation successfully completed.

Finished.

6.4.2 Pig Removal With cm-pig-setup
cm-pig-setup should also be used to remove the Pig instance.

Example

[root@bright70 ~]# cm-pig-setup -u hdfs1

Requested removal of Pig for Hadoop instance 'hdfs1'.

Stopping/removing services... done.

Removing module file... done.

Removing additional Pig directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.4.3 Using Pig
Pig consists of an executable, pig, that can be run after the user loads the corresponding module. Pig runs by default in "MapReduce Mode", that is, it uses the corresponding HDFS installation to store and deal with the elaborate processing of data. More thorough documentation for Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html.

Pig can be used in interactive mode, using the Grunt shell:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig


14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE

14/08/26 11:57:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

grunt>

or in batch mode, using a Pig Latin script:

[root@bright70 ~]# module load hadoop/hdfs1

[root@bright70 ~]# module load pig/hdfs1

[root@bright70 ~]# pig -v -f /tmp/smoke.pig

In both cases, Pig runs in "MapReduce mode", thus working on the corresponding HDFS instance.
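
The contents of /tmp/smoke.pig are not shown in the run above. A minimal word-count script in Pig Latin, of the kind that could serve as such a smoke test, might look like the following sketch (the input path is a placeholder):

-- load a text file from HDFS, one record per line
lines = LOAD '/user/root/input.txt' AS (line:chararray);
-- split each line into words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- group identical words together and count each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
-- print the (word, count) tuples to the console
DUMP counts;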

6.5 Spark
Apache Spark is an engine for processing Hadoop data. It can carry out general data processing, similar to MapReduce, but typically faster.

Spark can also carry out the following, with the associated high-level tools:

• stream feed processing, with Spark Streaming

• SQL queries on structured distributed data, with Spark SQL

• processing with machine learning algorithms, using MLlib

• graph computation for arbitrarily-connected networks, with GraphX

The Apache Spark tarball should be downloaded from http://spark.apache.org/, where different pre-built tarballs are available: for Hadoop 1.x, for CDH 4, and for Hadoop 2.x.

6.5.1 Spark Installation With cm-spark-setup
Bright Cluster Manager provides cm-spark-setup to carry out Spark installation.

Prerequisites For Spark Installation And What Spark Installation Does
The following applies to using cm-spark-setup:

• A Hadoop instance must already be installed.

• Spark can be installed in two different deployment modes: Standalone or YARN.

– Standalone mode: This is the default for Apache Hadoop 1.x, Cloudera CDH 4.x, and Hortonworks HDP 1.3.x.

It is possible to force the Standalone mode deployment by using the additional option --standalone.

When installing in standalone mode, the script installs Spark on the active head node and on the DataNodes of the chosen Hadoop instance.


The Spark Master service runs on the active head node by default, but can be specified to run on another node by using the option --master. Spark Worker services run on all DataNodes.

– YARN mode: This is the default for Apache Hadoop 2.x, Cloudera CDH 5.x, Hortonworks 2.x, and Pivotal 2.x. The default can be overridden by using the --standalone option.

When installing in YARN mode, the script installs Spark only on the active head node.

• The script assigns no roles to nodes.

• Spark is copied by the script to a subdirectory under /cm/shared/hadoop/.

• Spark configuration files are copied by the script to under /etc/hadoop/.

An Example Run With cm-spark-setup in YARN mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release: 2.6.0

Spark will be installed in YARN (client/cluster) mode.

Spark being installed... done.

Creating directories for Spark... done.

Creating module file for Spark... done.

Creating configuration files for Spark... done.

Waiting for NameNode to be ready... done.

Copying Spark assembly jar to HDFS... done.

Waiting for NameNode to be ready... done.

Validating Spark setup... done.

Installation successfully completed.

Finished.

An Example Run With cm-spark-setup in standalone mode
Example

[root@bright70 ~]# cm-spark-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/spark-1.3.0-bin-hadoop2.4.tgz --standalone --master node005

Spark release '1.3.0-bin-hadoop2.4'

Found Hadoop instance 'hdfs1', release: 2.6.0

Spark will be installed to work in Standalone mode.

Spark Master service will be run on node node005.

Spark will use all DataNodes as WorkerNodes.

Spark being installed... done.

Creating directories for Spark... done.

Creating module file for Spark... done.

Creating configuration files for Spark... done.

Updating images... done.

Initializing Spark Master service... done.

Initializing Spark Worker service... done.

Validating Spark setup... done.


Installation successfully completed.

Finished.

6.5.2 Spark Removal With cm-spark-setup
cm-spark-setup should also be used to remove the Spark instance.

Example

[root@bright70 ~]# cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance 'hdfs1'.

Stopping/removing services... done.

Removing module file... done.

Removing additional Spark directories... done.

Removal successfully completed.

Finished.

6.5.3 Using Spark
Spark supports two deploy modes to launch Spark applications on YARN. Considering the SparkPi example provided with Spark:

• In "yarn-client" mode, the Spark driver runs in the client process, and the SparkPi application is run as a child thread of Application Master.

[root@bright70 ~]# module load spark/hdfs1

[root@bright70 ~]# spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar

• In "yarn-cluster" mode, the Spark driver runs inside an Application Master process, which is managed by YARN on the cluster.

[root@bright70 ~]# module load spark/hdfs1

[root@bright70 ~]# spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar
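
For an installation done in Standalone mode (section 6.5.1), applications are instead submitted to the Spark Master. The following is a sketch, assuming the Master was set up on node005 and listens on the default Spark Master port, 7077:

[root@bright70 ~]# module load spark/hdfs1
[root@bright70 ~]# spark-submit --master spark://node005:7077 --class org.apache.spark.examples.SparkPi $SPARK_PREFIX/lib/spark-examples-*.jar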

6.6 Sqoop
Apache Sqoop is a tool designed to transfer bulk data between Hadoop and an RDBMS. Sqoop uses MapReduce to import and export data. Bright Cluster Manager supports transfers between Sqoop and MySQL.

RHEL7 and SLES12 use MariaDB, and are not yet supported by the available versions of Sqoop at the time of writing (April 2015). At present, the latest Sqoop stable release is 1.4.5, while the latest Sqoop2 version is 1.99.5. Sqoop2 is incompatible with Sqoop, it is not feature-complete, and it is not yet intended for production use. The Bright Computing utility cm-sqoop-setup does not as yet support Sqoop2.

6.6.1 Sqoop Installation With cm-sqoop-setup
Bright Cluster Manager provides cm-sqoop-setup to carry out Sqoop installation.


Prerequisites For Sqoop Installation And What Sqoop Installation Does
The following requirements and conditions apply to running the cm-sqoop-setup script:

• A Hadoop instance must already be installed.

• Before running the script, the version of the mysql-connector-java package should be checked. Sqoop works with releases 5.1.34 or later of this package. If mysql-connector-java provides an older release, then the following must be done to ensure that the Sqoop setup works:

– a suitable 5.1.34 or later release of Connector/J is downloaded from http://dev.mysql.com/downloads/connector/j/

– cm-sqoop-setup is run with the --conn option, in order to specify the connector version to be used:

Example

--conn /tmp/mysql-connector-java-5.1.34-bin.jar

• The cm-sqoop-setup script installs Sqoop only on the active head node. A different node can be specified by using the option --master.

• The script assigns no roles to nodes.

• Sqoop executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Sqoop configuration files are copied by the script, and placed under /etc/hadoop/.

• The Metastore service is started up by the script.

An Example Run With cm-sqoop-setup
Example

[root@bright70 ~]# cm-sqoop-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t /tmp/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz --conn /tmp/mysql-connector-java-5.1.34-bin.jar --master node005

Using MySQL Connector/J from /tmp/mysql-connector-java-5.1.34-bin.jar

Sqoop release '1.4.5.bin__hadoop-2.0.4-alpha'

Sqoop service will be run on node node005.

Found Hadoop instance 'hdfs1', release: 2.2.0

Sqoop being installed... done.

Creating directories for Sqoop... done.

Creating module file for Sqoop... done.

Creating configuration files for Sqoop... done.

Updating images... done.

Installation successfully completed.

Finished.
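
Once installed, Sqoop can be exercised with an import from MySQL into HDFS. The following is an illustrative sketch only: it assumes a sqoop module analogous to the other modules in this chapter, and the database host, database name, table, and user are placeholders:

[root@bright70 ~]# module load sqoop/hdfs1
[root@bright70 ~]# sqoop import --connect jdbc:mysql://dbhost/testdb --username dbuser -P --table employees --target-dir /user/root/employees -m 1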


6.6.2 Sqoop Removal With cm-sqoop-setup
cm-sqoop-setup should be used to remove the Sqoop instance.

Example

[root@bright70 ~]# cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance 'hdfs1'.

Stopping/removing services... done.

Removing module file... done.

Removing additional Sqoop directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.7 Storm
Apache Storm is a distributed realtime computation system. While Hadoop is focused on batch processing, Storm can process streams of data.

Other parallels between Hadoop and Storm:

• users run "jobs" in Hadoop, and "topologies" in Storm;

• the master node for Hadoop jobs runs the "JobTracker" or "ResourceManager" daemons to deal with resource management and scheduling, while the master node for Storm runs an analogous daemon called "Nimbus";

• each worker node for Hadoop runs daemons called "TaskTracker" or "NodeManager", while the worker nodes for Storm run an analogous daemon called "Supervisor";

• both Hadoop, in the case of NameNode HA, and Storm leverage "ZooKeeper" for coordination.

6.7.1 Storm Installation With cm-storm-setup
Bright Cluster Manager provides cm-storm-setup to carry out Storm installation.

Prerequisites For Storm Installation And What Storm Installation Does
The following applies to using cm-storm-setup:

• A Hadoop instance with ZooKeeper must already be installed.

• The cm-storm-setup script only installs Storm on the active head node and on the DataNodes of the chosen Hadoop instance by default. A node other than master can be specified by using the option --master, or its alias for this setup script, --nimbus.

• The script assigns no roles to nodes.

• Storm executables are copied by the script to a subdirectory under /cm/shared/hadoop/.

• Storm configuration files are copied by the script to under /etc/hadoop/. This is done both on the active head node and on the necessary image(s).

• Validation tests are carried out by the script.


An Example Run With cm-storm-setup

Example

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/ -t apache-storm-0.9.4.tar.gz \
--nimbus node005

Storm release '0.9.4'

Storm Nimbus and UI services will be run on node node005.

Found Hadoop instance 'hdfs1', release: 2.2.0

Storm being installed... done.

Creating directories for Storm... done.

Creating module file for Storm... done.

Creating configuration files for Storm... done.

Updating images... done.

Initializing worker services for Storm (on DataNodes)... done.

Initializing Nimbus services for Storm... done.

Executing validation test... done.

Installation successfully completed.

Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.
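
The validation topology can also be checked and removed from the command line. A sketch, using the storm client in the same way as section 6.7.3 below:

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm list
[root@bright70 ~]# storm kill WordCount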

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example

[root@bright70 ~]# cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance 'hdfs1'.

Stopping/removing services... done.

Removing module file... done.

Removing additional Storm directories... done.

Updating images... done.

Removal successfully completed.

Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1

[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-*.jar storm.starter.WordCountTopology WordCount2

470 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...

476 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

[==================================================] 3248859 / 3248859

File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)

508 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar

508 [main] INFO backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}

687 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 38: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

36 Hadoop-related Projects

622 Hive Removal With cm-hive-setupcm-hive-setup should also be used to remove the Hive instance Dataand metadata will not be removed

Example

[rootbright70 ~] cm-hive-setup -u hdfs1

Requested removal of Hive for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Hive directories done

Updating images done

Removal successfully completed

Finished

623 BeelineThe latest Hive releases include HiveServer2 which supports Beeline com-mand shell Beeline is a JDBC client based on the SQLLine CLI (httpsqllinesourceforgenet) In the following example Beelineconnects to HiveServer2

Example

[rootbright70 ~] beeline -u jdbchive2node005cmcluster10000 -d orgapachehivejdbcHiveDriver -e rsquoSHOW TABLESrsquo

Connecting to jdbchive2node005cmcluster10000

Connected to Apache Hive (version 110)

Driver Hive JDBC (version 110)

Transaction isolation TRANSACTION_REPEATABLE_READ

+-----------+--+

| tab_name |

+-----------+--+

| test |

| test2 |

+-----------+--+

2 rows selected (0243 seconds)

Beeline version 110 by Apache Hive

Closing 0 jdbchive2node005cmcluster10000

63 KafkaApache Kafka is a distributed publish-subscribe messaging system Amongother usages Kafka is used as a replacement for message broker forwebsite activity tracking for log aggregation The Apache Kafka tar-ball should be downloaded from httpkafkaapacheorg wheredifferent pre-built tarballs are available depeding on the preferred Scalaversion

631 Kafka Installation With cm-kafka-setupBright Cluster Manager provides cm-kafka-setup to carry out Kafkainstallation

Prerequisites For Kafka Installation And What Kafka Installation DoesThe following applies to using cm-kafka-setup

copy Bright Computing Inc

64 Pig 37

bull A Hadoop instance with ZooKeeper must already be installed

bull cm-kafka-setup installs Kafka only on the ZooKeeper nodes

bull The script assigns no roles to nodes

bull Kafka is copied by the script to a subdirectory under cmsharedhadoop

bull Kafka configuration files are copied by the script to under etchadoop

An Example Run With cm-kafka-setup

Example

[rootbright70 ~] cm-kafka-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpkafka_211-0821tgz

Kafka release rsquo0821rsquo for Scala rsquo211rsquo

Found Hadoop instance rsquohdfs1rsquo release 121

Kafka being installed done

Creating directories for Kafka done

Creating module file for Kafka done

Creating configuration files for Kafka done

Updating images done

Initializing services for Kafka (on ZooKeeper nodes) done

Executing validation test done

Installation successfully completed

Finished

632 Kafka Removal With cm-kafka-setupcm-kafka-setup should also be used to remove the Kafka instance

Example

[rootbright70 ~] cm-kafka-setup -u hdfs1

Requested removal of Kafka for Hadoop instance hdfs1

Stoppingremoving services done

Removing module file done

Removing additional Kafka directories done

Updating images done

Removal successfully completed

Finished

64 PigApache Pig is a platform for analyzing large data sets Pig consists of ahigh-level language for expressing data analysis programs coupled withinfrastructure for evaluating these programs Pig programs are intendedby language design to fit well with ldquoembarrassingly parallelrdquo problemsthat deal with large data sets The Apache Pig tarball should be down-loaded from one of the locations specified in Section 12 depending onthe chosen distribution

641 Pig Installation With cm-pig-setupBright Cluster Manager provides cm-pig-setup to carry out Pig instal-lation

copy Bright Computing Inc

38 Hadoop-related Projects

Prerequisites For Pig Installation And What Pig Installation DoesThe following applies to using cm-pig-setup

bull A Hadoop instance must already be installed

bull cm-pig-setup installs Pig only on the active head node

bull The script assigns no roles to nodes

bull Pig is copied by the script to a subdirectory under cmsharedhadoop

bull Pig configuration files are copied by the script to under etchadoop

An Example Run With cm-pig-setup

Example

[rootbright70 ~] cm-pig-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmppig-0140targz

Pig release rsquo0140rsquo

Pig being installed done

Creating directories for Pig done

Creating module file for Pig done

Creating configuration files for Pig done

Waiting for NameNode to be ready

Waiting for NameNode to be ready done

Validating Pig setup

Validating Pig setup done

Installation successfully completed

Finished

642 Pig Removal With cm-pig-setupcm-pig-setup should also be used to remove the Pig instance

Example

[rootbright70 ~] cm-pig-setup -u hdfs1

Requested removal of Pig for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Pig directories done

Updating images done

Removal successfully completed

Finished

643 Using PigPig consists of an executable pig that can be run after the user loadsthe corresponding module Pig runs by default in ldquoMapReduce Moderdquothat is it uses the corresponding HDFS installation to store and deal withthe elaborate processing of data More thorough documentation for Pigcan be found at httppigapacheorgdocsr0140starthtml

Pig can be used in interactive mode using the Grunt shell

[rootbright70 ~] module load hadoophdfs1

[rootbright70 ~] module load pighdfs1

[rootbright70 ~] pig

copy Bright Computing Inc

65 Spark 39

140826 115741 INFO pigExecTypeProvider Trying ExecType LOCAL

140826 115741 INFO pigExecTypeProvider Trying ExecType MAPREDUCE

140826 115741 INFO pigExecTypeProvider Picked MAPREDUCE as the ExecType

gruntgt

or in batch mode using a Pig Latin script

[rootbright70 ~] module load hadoophdfs1

[rootbright70 ~] module load pighdfs1

[rootbright70 ~] pig -v -f tmpsmokepig

In both cases Pig runs in ldquoMapReduce moderdquo thus working on thecorresponding HDFS instance

65 SparkApache Spark is an engine for processing Hadoop data It can carry outgeneral data processing similar to MapReduce but typically faster

Spark can also carry out the following with the associated high-leveltools

bull stream feed processing with Spark Streaming

bull SQL queries on structured distributed data with Spark SQL

bull processing with machine learning algorithms using MLlib

bull graph computation for arbitrarily-connected networks with graphX

The Apache Spark tarball should be downloaded from httpsparkapacheorg where different pre-built tarballs are available for Hadoop1x for CDH 4 and for Hadoop 2x

651 Spark Installation With cm-spark-setupBright Cluster Manager provides cm-spark-setup to carry out Sparkinstallation

Prerequisites For Spark Installation And What Spark Installation DoesThe following applies to using cm-spark-setup

bull A Hadoop instance must already be installed

bull Spark can be installed in two different deployment modes Stan-dalone or YARN

ndash Standalone mode This is the default for Apache Hadoop 1xCloudera CDH 4x and Hortonworks HDP 13x

It is possible to force the Standalone mode deployment byusing the additional option--standalone

When installing in standalone mode the script installs Sparkon the active head node and on the DataNodes of the cho-sen Hadoop instance

copy Bright Computing Inc

40 Hadoop-related Projects

The Spark Master service runs on the active head node bydefault but can be specified to run on another node byusing the option --masterSpark Worker services run on all DataNodes

ndash YARN mode This is the default for Apache Hadoop 2x Cloud-era CDH 5x Hortonworks 2x and Pivotal 2x The default canbe overridden by using the --standalone option

When installing in YARN mode the script installs Sparkonly on the active head node

bull The script assigns no roles to nodes

bull Spark is copied by the script to a subdirectory under cmsharedhadoop

bull Spark configuration files are copied by the script to under etchadoop

An Example Run With cm-spark-setup in YARN modeExample

[rootbright70 ~] cm-spark-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpspark-130-bin-hadoop24tgz

Spark release rsquo130-bin-hadoop24rsquo

Found Hadoop instance rsquohdfs1rsquo release 260

Spark will be installed in YARN (clientcluster) mode

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Waiting for NameNode to be ready done

Copying Spark assembly jar to HDFS done

Waiting for NameNode to be ready done

Validating Spark setup done

Installation successfully completed

Finished

An Example Run With cm-spark-setup in standalone modeExample

[rootbright70 ~] cm-spark-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpspark-130-bin-hadoop24tgz --standalone --master node005

Spark release rsquo130-bin-hadoop24rsquo

Found Hadoop instance rsquohdfs1rsquo release 260

Spark will be installed to work in Standalone mode

Spark Master service will be run on node node005

Spark will use all DataNodes as WorkerNodes

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Updating images done

Initializing Spark Master service done

Initializing Spark Worker service done

Validating Spark setup done

copy Bright Computing Inc

66 Sqoop 41

Installation successfully completed

Finished

652 Spark Removal With cm-spark-setupcm-spark-setup should also be used to remove the Spark instance

Example

[rootbright70 ~] cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Spark directories done

Removal successfully completed

Finished

653 Using SparkSpark supports two deploy modes to launch Spark applications on YARNConsidering the SparkPi example provided with Spark

bull In ldquoyarn-clientrdquo mode the Spark driver runs in the client processand the SparkPi application is run as a child thread of ApplicationMaster

[rootbright70 ~] module load sparkhdfs1

[rootbright70 ~] spark-submit --master yarn-client --class orgapachesparkexamplesSparkPi $SPARK_PREFIXlibspark-examples-jar

bull In ldquoyarn-clusterrdquo mode the Spark driver runs inside an ApplicationMaster process which is managed by YARN on the cluster

[rootbright70 ~] module load sparkhdfs1

[rootbright70 ~] spark-submit --master yarn-cluster --class orgapachesparkexamplesSparkPi $SPARK_PREFIXlibspark-examples-jar

66 SqoopApache Sqoop is a tool designed to transfer bulk data between Hadoopand an RDBMS Sqoop uses MapReduce to import and export data BrightCluster Manager supports transfers between Sqoop and MySQL

RHEL7 and SLES12 use MariaDB and are not yet supported by theavailable versions of Sqoop at the time of writing (April 2015) At presentthe latest Sqoop stable release is 145 while the latest Sqoop2 versionis 1995 Sqoop2 is incompatible with Sqoop it is not feature-completeand it is not yet intended for production use The Bright Computing uti-tility cm-sqoop-setup does not as yet support Sqoop2

661 Sqoop Installation With cm-sqoop-setupBright Cluster Manager provides cm-sqoop-setup to carry out Sqoopinstallation

copy Bright Computing Inc

42 Hadoop-related Projects

Prerequisites For Sqoop Installation And What Sqoop Installation DoesThe following requirements and conditions apply to running thecm-sqoop-setup script

bull A Hadoop instance must already be installed

bull Before running the script the version of themysql-connector-java package should be checkedSqoop works with releases 5134 or later of this package Ifmysql-connector-java provides a newer release then thefollowing must be done to ensure that Sqoop setup works

ndash a suitable 5134 or later release of ConnectorJ is downloadedfrom httpdevmysqlcomdownloadsconnectorj

ndash cm-sqoop-setup is run with the --conn option in order tospecify the connector version to be used

Example

--conn tmpmysql-connector-java-5134-binjar

bull The cm-sqoop-setup script installs Sqoop only on the activehead node A different node can be specified by using the option--master

bull The script assigns no roles to nodes

bull Sqoop executables are copied by the script to a subdirectory undercmsharedhadoop

bull Sqoop configuration files are copied by the script and placed underetchadoop

bull The Metastore service is started up by the script

An Example Run With cm-sqoop-setupExample

[rootbright70 ~] cm-sqoop-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpsqoop-145bin__hadoop-204-alphatargz--conn tmpmysql-connector-java-5134-binjar --master node005

Using MySQL ConnectorJ from tmpmysql-connector-java-5134-binjar

Sqoop release rsquo145bin__hadoop-204-alpharsquo

Sqoop service will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 220

Sqoop being installed done

Creating directories for Sqoop done

Creating module file for Sqoop done

Creating configuration files for Sqoop done

Updating images done

Installation successfully completed

Finished

copy Bright Computing Inc

67 Storm 43

662 Sqoop Removal With cm-sqoop-setupcm-sqoop-setup should be used to remove the Sqoop instance

Example

[rootbright70 ~] cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Sqoop directories done

Updating images done

Removal successfully completed

Finished

67 StormApache Storm is a distributed realtime computation system While Hadoopis focused on batch processing Storm can process streams of data

Other parallels between Hadoop and Storm

bull users run ldquojobsrdquo in Hadoop and ldquotopologiesrdquo in Storm

bull the master node for Hadoop jobs runs the ldquoJobTrackerrdquo or ldquoRe-sourceManagerrdquo daemons to deal with resource management andscheduling while the master node for Storm runs an analogous dae-mon called ldquoNimbusrdquo

bull each worker node for Hadoop runs daemons called ldquoTaskTrackerrdquoor ldquoNodeManagerrdquo while the worker nodes for Storm runs an anal-ogous daemon called ldquoSupervisorrdquo

bull both Hadoop in the case of NameNode HA and Storm leverageldquoZooKeeperrdquo for coordination

671 Storm Installation With cm-storm-setupBright Cluster Manager provides cm-storm-setup to carry out Storminstallation

Prerequisites For Storm Installation And What Storm Installation DoesThe following applies to using cm-storm-setup

bull A Hadoop instance with ZooKeeper must already be installed

bull The cm-storm-setup script only installs Storm on the active headnode and on the DataNodes of the chosen Hadoop instance by de-fault A node other than master can be specified by using the option--master or its alias for this setup script --nimbus

bull The script assigns no roles to nodes

bull Storm executables are copied by the script to a subdirectory undercmsharedhadoop

bull Storm configuration files are copied by the script to under etchadoop This is done both on the active headnode and on thenecessary image(s)

bull Validation tests are carried out by the script

copy Bright Computing Inc

44 Hadoop-related Projects

An Example Run With cm-storm-setup

Example

[rootbright70 ~] cm-storm-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t apache-storm-094targz

--nimbus node005

Storm release rsquo094rsquo

Storm Nimbus and UI services will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 220

Storm being installed done

Creating directories for Storm done

Creating module file for Storm done

Creating configuration files for Storm done

Updating images done

Initializing worker services for Storm (on DataNodes) done

Initializing Nimbus services for Storm done

Executing validation test done

Installation successfully completed

Finished

The cm-storm-setup installation script submits a validation topol-ogy (topology in the Storm sense) called WordCount After a success-ful installation users can connect to the Storm UI on the host ltnim-busgt the Nimbus server at httpltnimbusgt10080 There theycan check the status of WordCount and can kill it

672 Storm Removal With cm-storm-setupThe cm-storm-setup script should also be used to remove the Storminstance

Example

[rootbright70 ~] cm-storm-setup -u hdfs1

Requested removal of Storm for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Storm directories done

Updating images done

Removal successfully completed

Finished

673 Using StormThe following example shows how to submit a topology and then verifythat it has been submitted successfully (some lines elided)

[rootbright70 ~] module load stormhdfs1

[rootbright70 ~] storm jar cmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-jar stormstarterWordCountTopology WordCount2

470 [main] INFO backtypestormStormSubmitter - Jar not uploaded to m

copy Bright Computing Inc

67 Storm 45

aster yet Submitting jar

476 [main] INFO backtypestormStormSubmitter - Uploading topology jar cmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jar to assigned location tmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jar

Start uploading file rsquocmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jarrsquo to rsquotmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jarrsquo (3248859 bytes)

[==================================================] 3248859 3248859

File rsquocmsharedappshadoopApacheapache-storm-093examplesstorm-starterstorm-starter-topologies-093jarrsquo uploaded to rsquotmpstorm-hdfs1-localnimbusinboxstormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jarrsquo(3248859 bytes)

508 [main] INFO backtypestormStormSubmitter - Successfully uploadedtopology jar to assigned location tmpstorm-hdfs1-localnimbusinbox

stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3jar

508 [main] INFO backtypestormStormSubmitter - Submitting topology WordCount2 in distributed mode with conf topologyworkers3topologydebugtrue

687 [main] INFO backtypestormStormSubmitter - Finished submitting topology WordCount2

[roothadoopdev ~] storm list

Topology_name Status Num_tasks Num_workers Uptime_secs

-------------------------------------------------------------------

WordCount2 ACTIVE 28 3 15

copy Bright Computing Inc

  • Table of Contents
  • 01 About This Manual
  • 02 About The Manuals In General
  • 03 Getting Administrator-Level Support
  • 1 Introduction
    • 11 What Is Hadoop About
    • 12 Available Hadoop Implementations
    • 13 Further Documentation
      • 2 Installing Hadoop
        • 21 Command-line Installation Of Hadoop Using cm-hadoop-setup -c ltfilenamegt
          • 211 Usage
          • 212 An Install Run
            • 22 Ncurses Installation Of Hadoop Using cm-hadoop-setup
            • 23 Avoiding Misconfigurations During Hadoop Installation
              • 231 NameNode Configuration Choices
                • 24 Installing Hadoop With Lustre
                  • 241 Lustre Internal Server Installation
                  • 242 Lustre External Server Installation
                  • 243 Lustre Client Installation
                  • 244 Lustre Hadoop Configuration
                    • 25 Hadoop Installation In A Cloud
                      • 3 Viewing And Managing HDFS From CMDaemon
                        • 31 cmgui And The HDFS Resource
                          • 311 The HDFS Instance Overview Tab
                          • 312 The HDFS Instance Settings Tab
                          • 313 The HDFS Instance Tasks Tab
                          • 314 The HDFS Instance Notes Tab
                            • 32 cmsh And hadoop Mode
                              • 321 The show And overview Commands
                              • 322 The Tasks Commands
                                  • 4 Hadoop Services
                                    • 41 Hadoop Services From cmgui
                                      • 411 Configuring Hadoop Services From cmgui
                                      • 412 Monitoring And Modifying The Running Of Hadoop Services From cmgui
                                          • 5 Running Hadoop Jobs
                                            • 51 Shakedown Runs
                                            • 52 Example End User Job Run
                                              • 6 Hadoop-related Projects
                                                • 61 Accumulo
                                                  • 611 Accumulo Installation With cm-accumulo-setup
                                                  • 612 Accumulo Removal With cm-accumulo-setup
                                                  • 613 Accumulo MapReduce Example
                                                    • 62 Hive
                                                      • 621 Hive Installation With cm-hive-setup
                                                      • 622 Hive Removal With cm-hive-setup
                                                      • 623 Beeline
                                                        • 63 Kafka
                                                          • 631 Kafka Installation With cm-kafka-setup
                                                          • 632 Kafka Removal With cm-kafka-setup
                                                            • 64 Pig
                                                              • 641 Pig Installation With cm-pig-setup
                                                              • 642 Pig Removal With cm-pig-setup
                                                              • 643 Using Pig
                                                                • 65 Spark
                                                                  • 651 Spark Installation With cm-spark-setup
                                                                  • 652 Spark Removal With cm-spark-setup
                                                                  • 653 Using Spark
                                                                    • 66 Sqoop
                                                                      • 661 Sqoop Installation With cm-sqoop-setup
                                                                      • 662 Sqoop Removal With cm-sqoop-setup
                                                                        • 67 Storm
                                                                          • 671 Storm Installation With cm-storm-setup
                                                                          • 672 Storm Removal With cm-storm-setup
                                                                          • 673 Using Storm
Page 39: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster

64 Pig 37

bull A Hadoop instance with ZooKeeper must already be installed

bull cm-kafka-setup installs Kafka only on the ZooKeeper nodes

bull The script assigns no roles to nodes

bull Kafka is copied by the script to a subdirectory under cmsharedhadoop

bull Kafka configuration files are copied by the script to under etchadoop

An Example Run With cm-kafka-setup

Example

[rootbright70 ~] cm-kafka-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpkafka_211-0821tgz

Kafka release rsquo0821rsquo for Scala rsquo211rsquo

Found Hadoop instance rsquohdfs1rsquo release 121

Kafka being installed done

Creating directories for Kafka done

Creating module file for Kafka done

Creating configuration files for Kafka done

Updating images done

Initializing services for Kafka (on ZooKeeper nodes) done

Executing validation test done

Installation successfully completed

Finished

632 Kafka Removal With cm-kafka-setupcm-kafka-setup should also be used to remove the Kafka instance

Example

[rootbright70 ~] cm-kafka-setup -u hdfs1

Requested removal of Kafka for Hadoop instance hdfs1

Stoppingremoving services done

Removing module file done

Removing additional Kafka directories done

Updating images done

Removal successfully completed

Finished

64 PigApache Pig is a platform for analyzing large data sets Pig consists of ahigh-level language for expressing data analysis programs coupled withinfrastructure for evaluating these programs Pig programs are intendedby language design to fit well with ldquoembarrassingly parallelrdquo problemsthat deal with large data sets The Apache Pig tarball should be down-loaded from one of the locations specified in Section 12 depending onthe chosen distribution

641 Pig Installation With cm-pig-setupBright Cluster Manager provides cm-pig-setup to carry out Pig instal-lation

copy Bright Computing Inc

38 Hadoop-related Projects

Prerequisites For Pig Installation And What Pig Installation DoesThe following applies to using cm-pig-setup

bull A Hadoop instance must already be installed

bull cm-pig-setup installs Pig only on the active head node

bull The script assigns no roles to nodes

bull Pig is copied by the script to a subdirectory under cmsharedhadoop

bull Pig configuration files are copied by the script to under etchadoop

An Example Run With cm-pig-setup

Example

[rootbright70 ~] cm-pig-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmppig-0140targz

Pig release rsquo0140rsquo

Pig being installed done

Creating directories for Pig done

Creating module file for Pig done

Creating configuration files for Pig done

Waiting for NameNode to be ready

Waiting for NameNode to be ready done

Validating Pig setup

Validating Pig setup done

Installation successfully completed

Finished

642 Pig Removal With cm-pig-setupcm-pig-setup should also be used to remove the Pig instance

Example

[rootbright70 ~] cm-pig-setup -u hdfs1

Requested removal of Pig for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Pig directories done

Updating images done

Removal successfully completed

Finished

643 Using PigPig consists of an executable pig that can be run after the user loadsthe corresponding module Pig runs by default in ldquoMapReduce Moderdquothat is it uses the corresponding HDFS installation to store and deal withthe elaborate processing of data More thorough documentation for Pigcan be found at httppigapacheorgdocsr0140starthtml

Pig can be used in interactive mode using the Grunt shell

[rootbright70 ~] module load hadoophdfs1

[rootbright70 ~] module load pighdfs1

[rootbright70 ~] pig

copy Bright Computing Inc

65 Spark 39

140826 115741 INFO pigExecTypeProvider Trying ExecType LOCAL

140826 115741 INFO pigExecTypeProvider Trying ExecType MAPREDUCE

140826 115741 INFO pigExecTypeProvider Picked MAPREDUCE as the ExecType

gruntgt

or in batch mode using a Pig Latin script

[rootbright70 ~] module load hadoophdfs1

[rootbright70 ~] module load pighdfs1

[rootbright70 ~] pig -v -f tmpsmokepig

In both cases Pig runs in ldquoMapReduce moderdquo thus working on thecorresponding HDFS instance

65 SparkApache Spark is an engine for processing Hadoop data It can carry outgeneral data processing similar to MapReduce but typically faster

Spark can also carry out the following with the associated high-leveltools

bull stream feed processing with Spark Streaming

bull SQL queries on structured distributed data with Spark SQL

bull processing with machine learning algorithms using MLlib

bull graph computation for arbitrarily-connected networks with graphX

The Apache Spark tarball should be downloaded from httpsparkapacheorg where different pre-built tarballs are available for Hadoop1x for CDH 4 and for Hadoop 2x

651 Spark Installation With cm-spark-setupBright Cluster Manager provides cm-spark-setup to carry out Sparkinstallation

Prerequisites For Spark Installation And What Spark Installation DoesThe following applies to using cm-spark-setup

bull A Hadoop instance must already be installed

bull Spark can be installed in two different deployment modes Stan-dalone or YARN

ndash Standalone mode This is the default for Apache Hadoop 1xCloudera CDH 4x and Hortonworks HDP 13x

It is possible to force the Standalone mode deployment byusing the additional option--standalone

When installing in standalone mode the script installs Sparkon the active head node and on the DataNodes of the cho-sen Hadoop instance

copy Bright Computing Inc

40 Hadoop-related Projects

The Spark Master service runs on the active head node bydefault but can be specified to run on another node byusing the option --masterSpark Worker services run on all DataNodes

ndash YARN mode This is the default for Apache Hadoop 2x Cloud-era CDH 5x Hortonworks 2x and Pivotal 2x The default canbe overridden by using the --standalone option

When installing in YARN mode the script installs Sparkonly on the active head node

bull The script assigns no roles to nodes

bull Spark is copied by the script to a subdirectory under cmsharedhadoop

bull Spark configuration files are copied by the script to under etchadoop

An Example Run With cm-spark-setup in YARN modeExample

[rootbright70 ~] cm-spark-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpspark-130-bin-hadoop24tgz

Spark release rsquo130-bin-hadoop24rsquo

Found Hadoop instance rsquohdfs1rsquo release 260

Spark will be installed in YARN (clientcluster) mode

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Waiting for NameNode to be ready done

Copying Spark assembly jar to HDFS done

Waiting for NameNode to be ready done

Validating Spark setup done

Installation successfully completed

Finished

An Example Run With cm-spark-setup in standalone modeExample

[rootbright70 ~] cm-spark-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpspark-130-bin-hadoop24tgz --standalone --master node005

Spark release rsquo130-bin-hadoop24rsquo

Found Hadoop instance rsquohdfs1rsquo release 260

Spark will be installed to work in Standalone mode

Spark Master service will be run on node node005

Spark will use all DataNodes as WorkerNodes

Spark being installed done

Creating directories for Spark done

Creating module file for Spark done

Creating configuration files for Spark done

Updating images done

Initializing Spark Master service done

Initializing Spark Worker service done

Validating Spark setup done

copy Bright Computing Inc

66 Sqoop 41

Installation successfully completed

Finished

652 Spark Removal With cm-spark-setupcm-spark-setup should also be used to remove the Spark instance

Example

[rootbright70 ~] cm-spark-setup -u hdfs1

Requested removal of Spark for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Spark directories done

Removal successfully completed

Finished

653 Using SparkSpark supports two deploy modes to launch Spark applications on YARNConsidering the SparkPi example provided with Spark

bull In ldquoyarn-clientrdquo mode the Spark driver runs in the client processand the SparkPi application is run as a child thread of ApplicationMaster

[rootbright70 ~] module load sparkhdfs1

[rootbright70 ~] spark-submit --master yarn-client --class orgapachesparkexamplesSparkPi $SPARK_PREFIXlibspark-examples-jar

bull In ldquoyarn-clusterrdquo mode the Spark driver runs inside an ApplicationMaster process which is managed by YARN on the cluster

[rootbright70 ~] module load sparkhdfs1

[rootbright70 ~] spark-submit --master yarn-cluster --class orgapachesparkexamplesSparkPi $SPARK_PREFIXlibspark-examples-jar

66 SqoopApache Sqoop is a tool designed to transfer bulk data between Hadoopand an RDBMS Sqoop uses MapReduce to import and export data BrightCluster Manager supports transfers between Sqoop and MySQL

RHEL7 and SLES12 use MariaDB and are not yet supported by theavailable versions of Sqoop at the time of writing (April 2015) At presentthe latest Sqoop stable release is 145 while the latest Sqoop2 versionis 1995 Sqoop2 is incompatible with Sqoop it is not feature-completeand it is not yet intended for production use The Bright Computing uti-tility cm-sqoop-setup does not as yet support Sqoop2

661 Sqoop Installation With cm-sqoop-setupBright Cluster Manager provides cm-sqoop-setup to carry out Sqoopinstallation

copy Bright Computing Inc

42 Hadoop-related Projects

Prerequisites For Sqoop Installation And What Sqoop Installation DoesThe following requirements and conditions apply to running thecm-sqoop-setup script

bull A Hadoop instance must already be installed

bull Before running the script the version of themysql-connector-java package should be checkedSqoop works with releases 5134 or later of this package Ifmysql-connector-java provides a newer release then thefollowing must be done to ensure that Sqoop setup works

ndash a suitable 5134 or later release of ConnectorJ is downloadedfrom httpdevmysqlcomdownloadsconnectorj

ndash cm-sqoop-setup is run with the --conn option in order tospecify the connector version to be used

Example

--conn tmpmysql-connector-java-5134-binjar

bull The cm-sqoop-setup script installs Sqoop only on the activehead node A different node can be specified by using the option--master

bull The script assigns no roles to nodes

bull Sqoop executables are copied by the script to a subdirectory undercmsharedhadoop

bull Sqoop configuration files are copied by the script and placed underetchadoop

bull The Metastore service is started up by the script

An Example Run With cm-sqoop-setupExample

[rootbright70 ~] cm-sqoop-setup -i hdfs1 -j usrlibjvmjre-170-openjdkx86_64 -t tmpsqoop-145bin__hadoop-204-alphatargz--conn tmpmysql-connector-java-5134-binjar --master node005

Using MySQL ConnectorJ from tmpmysql-connector-java-5134-binjar

Sqoop release rsquo145bin__hadoop-204-alpharsquo

Sqoop service will be run on node node005

Found Hadoop instance rsquohdfs1rsquo release 220

Sqoop being installed done

Creating directories for Sqoop done

Creating module file for Sqoop done

Creating configuration files for Sqoop done

Updating images done

Installation successfully completed

Finished

copy Bright Computing Inc

67 Storm 43

662 Sqoop Removal With cm-sqoop-setupcm-sqoop-setup should be used to remove the Sqoop instance

Example

[rootbright70 ~] cm-sqoop-setup -u hdfs1

Requested removal of Sqoop for Hadoop instance rsquohdfs1rsquo

Stoppingremoving services done

Removing module file done

Removing additional Sqoop directories done

Updating images done

Removal successfully completed

Finished

67 StormApache Storm is a distributed realtime computation system While Hadoopis focused on batch processing Storm can process streams of data

Other parallels between Hadoop and Storm

bull users run ldquojobsrdquo in Hadoop and ldquotopologiesrdquo in Storm

bull the master node for Hadoop jobs runs the ldquoJobTrackerrdquo or ldquoRe-sourceManagerrdquo daemons to deal with resource management andscheduling while the master node for Storm runs an analogous dae-mon called ldquoNimbusrdquo

bull each worker node for Hadoop runs daemons called ldquoTaskTrackerrdquoor ldquoNodeManagerrdquo while the worker nodes for Storm runs an anal-ogous daemon called ldquoSupervisorrdquo

bull both Hadoop in the case of NameNode HA and Storm leverageldquoZooKeeperrdquo for coordination

671 Storm Installation With cm-storm-setupBright Cluster Manager provides cm-storm-setup to carry out Storminstallation

Prerequisites For Storm Installation And What Storm Installation DoesThe following applies to using cm-storm-setup

bull A Hadoop instance with ZooKeeper must already be installed

bull The cm-storm-setup script only installs Storm on the active headnode and on the DataNodes of the chosen Hadoop instance by de-fault A node other than master can be specified by using the option--master or its alias for this setup script --nimbus

bull The script assigns no roles to nodes

bull Storm executables are copied by the script to a subdirectory undercmsharedhadoop

bull Storm configuration files are copied by the script to under etchadoop This is done both on the active headnode and on thenecessary image(s)

bull Validation tests are carried out by the script

copy Bright Computing Inc

44 Hadoop-related Projects

An Example Run With cm-storm-setup
Example:

[root@bright70 ~]# cm-storm-setup -i hdfs1 -j /usr/lib/jvm/jre-1.7.0-openjdk.x86_64 \
-t apache-storm-0.9.4.tar.gz --nimbus node005
Storm release '0.9.4'
Storm Nimbus and UI services will be run on node node005
Found Hadoop instance 'hdfs1', release: 2.2.0
Storm being installed... done.
Creating directories for Storm... done.
Creating module file for Storm... done.
Creating configuration files for Storm... done.
Updating images... done.
Initializing worker services for Storm (on DataNodes)... done.
Initializing Nimbus services for Storm... done.
Executing validation test... done.
Installation successfully completed.
Finished.

The cm-storm-setup installation script submits a validation topology (topology in the Storm sense) called WordCount. After a successful installation, users can connect to the Storm UI on the host <nimbus>, the Nimbus server, at http://<nimbus>:10080. There they can check the status of WordCount, and can kill it.

6.7.2 Storm Removal With cm-storm-setup
The cm-storm-setup script should also be used to remove the Storm instance.

Example:

[root@bright70 ~]# cm-storm-setup -u hdfs1
Requested removal of Storm for Hadoop instance 'hdfs1'.
Stopping/removing services... done.
Removing module file... done.
Removing additional Storm directories... done.
Updating images... done.
Removal successfully completed.
Finished.

6.7.3 Using Storm
The following example shows how to submit a topology, and then verify that it has been submitted successfully (some lines elided):

[root@bright70 ~]# module load storm/hdfs1
[root@bright70 ~]# storm jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/\
examples/storm-starter/storm-starter-topologies-*.jar \
storm.starter.WordCountTopology WordCount2
470  [main] INFO  backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
476  [main] INFO  backtype.storm.StormSubmitter - Uploading topology jar /cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
Start uploading file '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
[==================================================] 3248859 / 3248859
File '/cm/shared/apps/hadoop/Apache/apache-storm-0.9.3/examples/storm-starter/storm-starter-topologies-0.9.3.jar' uploaded to '/tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar' (3248859 bytes)
508  [main] INFO  backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /tmp/storm-hdfs1-local/nimbus/inbox/stormjar-bf1abdd0-f31a-41ff-b808-4daad1dfdaa3.jar
508  [main] INFO  backtype.storm.StormSubmitter - Submitting topology WordCount2 in distributed mode with conf {"topology.workers":3,"topology.debug":true}
687  [main] INFO  backtype.storm.StormSubmitter - Finished submitting topology: WordCount2

[root@hadoopdev ~]# storm list
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
WordCount2           ACTIVE     28         3            15
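Once the topology is no longer needed, it can be removed with the standard Storm client command. As a sketch, the WordCount2 topology submitted above could be killed from the command line with:

[root@bright70 ~]# storm kill WordCount2

Nimbus then deactivates the topology and, after a wait period, shuts down its workers, after which the topology no longer appears in the storm list output.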

Page 43: Hadoop Deployment Manual - Bright Computingsupport.brightcomputing.com/manuals/7.0/hadoop-deployment-manu… · Preface Welcome to the Hadoop Deployment Manual for Bright Cluster
