July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Setting Up Your First Hadoop Cluster, Chad Vawter, TriHUG Meeting: July 20, 2010


Slides from Chad Vawter's presentation at July 2010 Triangle Hadoop Users Group

Transcript of July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Page 1: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Setting Up Your First Hadoop Cluster

Chad Vawter

TriHUG Meeting: July 20, 2010

Page 2: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Speaker Background

Netlib and Parallel Virtual Machine (PVM)
High-volume messaging, complex event processing (CEP), and predictive data mining
SOA/ESB at the U.S. Department of Homeland Security
Banking: BPM, ETL, Reporting and Analytics
Interests: Mahout and R/Hadoop, Functional and OO languages for the JVM (Clojure, Scala, etc.)

Page 3: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Goals

High-level overview of the prerequisites to Hadoop cluster installation and operation
High-level overview of the Hadoop configuration files

Page 4: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Hadoop Prerequisites

Supported Operating Systems
  Linux
  Mac OS X
  BSD
  OpenSolaris
  Windows
    Need Cygwin (especially OpenSSH)
    Java Service Wrapper from Tanuki Software
Supported Java (JRE) versions
  Java 6 or later

Page 5: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Let’s use Linux…

Page 6: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Hadoop Distributions

Apache Hadoop
Cloudera
  Cloudera’s Distribution for Hadoop (CDH)
    Flume - streaming data collection (e.g., log files)
    Oozie - Yahoo!’s workflow engine for complex Hadoop jobs and data pipelines
    Sqoop - SQL-to-Hadoop database import and export tool
    Hadoop User Environment (Hue) - UI framework and SDK for visual Hadoop applications
  Cloudera Enterprise
    CDH + management and monitoring tools and production support services
Yahoo! Distribution of Hadoop
  Code patches for performance and stability
  Security
  Oozie

Page 7: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Install the Apache Hadoop Distribution

Create a user and group for ownership and permissions, e.g., hadoop:hadoop
Download Hadoop from the Apache Hadoop releases page:
  http://hadoop.apache.org/common/releases.html
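The two installation steps above might look roughly like the following sketch. It assumes a Linux machine with groupadd, useradd, and tar, and a release tarball already downloaded from the releases page. The version number (0.20.2), the tarball name, and the install prefix (/opt) are illustrative assumptions, not values from the slides. By default the script only prints the commands; set DRY_RUN=0 and run it as root to execute them.

```shell
#!/bin/sh
# Sketch only: version, tarball name, and install prefix are assumptions.
HADOOP_VERSION="${HADOOP_VERSION:-0.20.2}"
HADOOP_TARBALL="${HADOOP_TARBALL:-hadoop-${HADOOP_VERSION}.tar.gz}"
INSTALL_PREFIX="${INSTALL_PREFIX:-/opt}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

install_hadoop() {
    # Dedicated group and user to own the Hadoop files and run the daemons.
    run groupadd hadoop
    run useradd -g hadoop -m hadoop
    # Unpack the release (assumed to extract into hadoop-<version>/)
    # and hand ownership to hadoop:hadoop.
    run tar -xzf "$HADOOP_TARBALL" -C "$INSTALL_PREFIX"
    run chown -R hadoop:hadoop "$INSTALL_PREFIX/hadoop-$HADOOP_VERSION"
}

install_hadoop
```

Running the same steps on every machine keeps the install path identical across the cluster, which simplifies sharing configuration later.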

Page 8: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Hadoop Configuration

SSH configuration
  Hadoop control scripts communicate with machines in a Hadoop cluster via SSH.
Hadoop environment configuration
  Configure the environment in which the Hadoop daemons run.
Configuration parameters for the Hadoop daemons
  NameNode / DataNode
  JobTracker / TaskTracker

Page 9: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

SSH Configuration

Hadoop control scripts use SSH for cluster-wide operations, so…

In the hadoop user account’s home directory, generate a public/private key pair:

  ssh-keygen -t rsa -f ~/.ssh/id_rsa

The private key will be in the ~/.ssh/id_rsa file.
The public key will be in the ~/.ssh/id_rsa.pub file.

Page 10: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

SSH Configuration (continued)

The public key must be in the ~/.ssh/authorized_keys file on each machine in the Hadoop cluster:

  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Use ssh-agent to avoid having to type the passphrase of the private key when connecting from one machine in the Hadoop cluster to another.

Run ssh-add to store the passphrase.
We now have secure, encrypted passwordless logins.
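A minimal sketch of the authorized_keys step, written so that it can be tried without touching a real account: a temporary directory stands in for the hadoop user's home directory, and the key text is a placeholder rather than a real key. The helper name authorize_key is hypothetical. The ssh-agent and ssh-add step is shown only in comments because it prompts for the passphrase.

```shell
#!/bin/sh
# Sketch only: DEMO_HOME stands in for the hadoop user's home directory,
# and the key text below is a placeholder, not a real key.
DEMO_HOME="$(mktemp -d)"
SSH_DIR="$DEMO_HOME/.ssh"
mkdir -p "$SSH_DIR"
echo "ssh-rsa AAAAB3PLACEHOLDER hadoop@master" > "$SSH_DIR/id_rsa.pub"

# Append the public key only if it is not already present, so the
# step can be repeated safely on every machine in the cluster.
authorize_key() {
    pub="$1"
    auth="$2"
    touch "$auth"
    if ! grep -qxF "$(cat "$pub")" "$auth"; then
        cat "$pub" >> "$auth"
    fi
    # With its default StrictModes setting, sshd ignores the key file
    # when these paths are writable by other users.
    chmod 700 "$(dirname "$auth")"
    chmod 600 "$auth"
}

authorize_key "$SSH_DIR/id_rsa.pub" "$SSH_DIR/authorized_keys"
authorize_key "$SSH_DIR/id_rsa.pub" "$SSH_DIR/authorized_keys"  # adds no duplicate

# On a real cluster, load the private key into an agent once per session:
#   eval "$(ssh-agent -s)"
#   ssh-add ~/.ssh/id_rsa
```

The idempotent append matters when the same script is pushed to every machine more than once; a plain `>>` would add a duplicate entry on each run.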

Page 11: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

The Hadoop “Environment”

Each machine in a Hadoop cluster has a configuration script for environment settings.

Edit the hadoop-env.sh Bash script on each machine or have a mechanism for sharing environment settings; e.g., rsync.

Values for many environment variables can be identical for all machines in the cluster. Not all machines will have the same hardware profile, though. Configure each machine’s Hadoop environment so that it best uses its resources.

Page 12: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

hadoop-env.sh

JAVA_HOME
HADOOP_HOME
HADOOP_LOG_DIR
HADOOP_PID_DIR
HADOOP_NAMENODE_OPTS
HADOOP_DATANODE_OPTS
HADOOP_SECONDARYNAMENODE_OPTS
HADOOP_JOBTRACKER_OPTS
HADOOP_TASKTRACKER_OPTS
HADOOP_HEAPSIZE
HADOOP_SLAVES
HADOOP_SSH_OPTS
HADOOP_MASTER and HADOOP_SLAVE_SLEEP
…
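As an illustration of how a few of these variables might be set, here is a hypothetical hadoop-env.sh fragment. Every path and value in it is an assumption chosen for the example, not a recommendation from the slides; the right values depend on each machine's hardware and layout.

```shell
# conf/hadoop-env.sh (fragment). Every path and value here is an
# illustrative assumption; adjust for each machine's hardware.

# Required: the JRE that runs the Hadoop daemons.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Where the distribution is installed, and where logs and PID files go.
export HADOOP_HOME=/opt/hadoop
export HADOOP_LOG_DIR="$HADOOP_HOME/logs"
export HADOOP_PID_DIR=/var/run/hadoop

# Maximum heap for each daemon, in MB.
export HADOOP_HEAPSIZE=1000

# Extra JVM options for one daemon type (here: enable JMX on the NameNode).
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"

# File listing the slave hosts, and options the control scripts pass to ssh.
export HADOOP_SLAVES="$HADOOP_HOME/conf/slaves"
export HADOOP_SSH_OPTS="-o ConnectTimeout=1"

# Pause between issuing commands to successive slaves, in seconds.
export HADOOP_SLAVE_SLEEP=0.1
```

A machine with less memory than its peers would typically differ only in lines such as HADOOP_HEAPSIZE, which is why sharing one file via rsync and overriding a few values per host can work well.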

Page 13: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Read-Only Default Configuration Files

src/core/core-default.xml
src/hdfs/hdfs-default.xml
src/mapred/mapred-default.xml

Page 14: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Site-Specific Configuration Files

Override the values provided in the default configuration files:
  conf/core-site.xml
  conf/hdfs-site.xml
  conf/mapred-site.xml
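To make the override mechanism concrete, the sketch below writes a minimal version of each site file. The property names are the ones used by Hadoop releases of this period (0.20.x); the host names, ports, replication value, and the CONF_DIR location are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch only: host names, ports, values, and CONF_DIR are assumptions.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

# core-site.xml: where clients and daemons find the NameNode.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: override the default block replication factor.
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
EOF

# mapred-site.xml: where TaskTrackers and clients find the JobTracker.
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>
EOF

ls "$CONF_DIR"
```

Any property not listed in a site file keeps the value from the corresponding read-only default file.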

Page 15: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Other Configuration Files

slaves
  This file defines which machines will run DataNodes and/or TaskTrackers.
  Note: We don’t need to specify which machine(s) will run a NameNode and/or a JobTracker. The Hadoop control scripts start the NameNode and JobTracker on the machine where the scripts are run.
hadoop-metrics.properties
log4j.properties
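A small sketch of what a slaves file can look like and, roughly, how a control script might walk it. The host names are hypothetical, the helper list_slaves is invented for this example, and the loop only prints the ssh command it would issue instead of connecting anywhere.

```shell
#!/bin/sh
# Sketch only: host names are hypothetical.
SLAVES_FILE="$(mktemp)"
cat > "$SLAVES_FILE" <<'EOF'
# Worker machines: one host name per line
slave01.example.com
slave02.example.com

slave03.example.com
EOF

# Strip comments and blank lines, leaving one host per line.
list_slaves() {
    sed -e 's/#.*$//' -e '/^[[:space:]]*$/d' "$1"
}

# Print the ssh command that would be issued for each slave.
for host in $(list_slaves "$SLAVES_FILE"); do
    echo "ssh $host <command>"
done
```

Because the file is read on the machine where a control script runs, only the NameNode and JobTracker machines strictly need an up-to-date copy.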

Page 16: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Hadoop Startup

Format a new distributed file system:
  bin/hadoop namenode -format

Start the HDFS on the designated NameNode:
  bin/start-dfs.sh
The start-dfs.sh script consults the conf/slaves file on the NameNode and starts a DataNode daemon on each of the listed slaves.

Start MapReduce on the designated JobTracker:
  bin/start-mapred.sh
The start-mapred.sh script consults the conf/slaves file on the JobTracker and starts a TaskTracker daemon on each of the listed slaves.
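The startup sequence above can be sketched as a small wrapper. HADOOP_HOME, the FORMAT_HDFS switch, and the function names are assumptions made for this example. The wrapper prints the commands by default and runs them only when DRY_RUN=0, and it formats the file system only when explicitly asked, since formatting is meant for a new file system and discards the metadata of an existing one.

```shell
#!/bin/sh
# Sketch only: HADOOP_HOME is an assumed install location.
HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

start_cluster() {
    # One-time step for a NEW cluster only.
    if [ "${FORMAT_HDFS:-0}" = "1" ]; then
        run "$HADOOP_HOME/bin/hadoop" namenode -format
    fi
    # HDFS first (run on the NameNode machine), then MapReduce
    # (run on the JobTracker machine).
    run "$HADOOP_HOME/bin/start-dfs.sh"
    run "$HADOOP_HOME/bin/start-mapred.sh"
}

start_cluster
```

If the NameNode and JobTracker live on different machines, the two start commands would be run separately on their respective hosts rather than from one wrapper.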

Page 17: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Hadoop Shutdown

Stop the HDFS on the designated NameNode:
  bin/stop-dfs.sh
The stop-dfs.sh script consults the conf/slaves file on the NameNode and stops the DataNode daemon on each of the listed slaves.

Stop MapReduce on the designated JobTracker:
  bin/stop-mapred.sh
The stop-mapred.sh script consults the conf/slaves file on the JobTracker and stops the TaskTracker daemon on each of the listed slaves.

Page 18: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Other Hadoop Installation Options

Cloud Computing with Hadoop
  Amazon EC2
    Xen open-source virtual machine monitor (hypervisor)
  Amazon Elastic MapReduce
  VMware vCloud
  Windows Azure?
  …

Page 19: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

TriHUG Meeting Suggestions?

Hadoop Performance-Tuning with Advanced Configuration
Data Warehousing and Large-Scale Extraction, Transformation and Loading (ETL) with Hadoop
High-Volume Reporting with Hadoop
Hadoop and Object-Functional Languages for the JVM
Others?

Page 20: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Resources - Hadoop

Apache Hadoop
  http://hadoop.apache.org/
Hadoop: The Definitive Guide
  http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979/ref=sr_1_1?ie=UTF8&s=books&qid=1279640275&sr=8-1
Downloading and Installing Hadoop
  http://wiki.apache.org/hadoop/GettingStartedWithHadoop
Cloudera’s Hadoop Distribution
  http://www.cloudera.com/
Yahoo’s Hadoop Distribution
  http://developer.yahoo.com/hadoop/

Page 21: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Resources - Hadoop (continued)

Supported Java Versions
  http://wiki.apache.org/hadoop/HadoopJavaVersions
Hadoop on Windows with Eclipse
  http://ebiquity.umbc.edu/Tutorials/Hadoop/00%20-%20Intro.html

Page 22: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Resources - Amazon EC2

Amazon Elastic Compute Cloud (EC2)
  http://aws.amazon.com/ec2/
Amazon Elastic MapReduce
  http://aws.amazon.com/elasticmapreduce/
EC2 Starter’s Guide for Ubuntu
  https://help.ubuntu.com/community/EC2StartersGuide

Page 23: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Resources - Miscellaneous

Xen Open-Source Virtual Machine Monitor
  http://www.xen.org/
Virtualization - Comparison
  http://www.virtualbox.org/wiki/VBox_vs_Others

Page 24: July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Keep in Touch

[email protected]@gmail.com