Hadoop HP Day2
Description: Transcript of Hadoop HP Day2
-
hPot-Tech
1 Map reduce Programming
The Configuration API
Components in Hadoop are configured using Hadoop's own configuration API. org.apache.hadoop.conf.Configuration represents a collection of configuration properties
and their values. Each property is named by a String, and the type of a value may be one of several types:
o Java primitives such as boolean, int, long, and float
o other useful types such as String, Class, and java.io.File, and collections of Strings
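A minimal sketch of reading typed values through the Configuration API (the resource name and property names below are illustrative, not from the slides):

import org.apache.hadoop.conf.Configuration;

public class ConfigurationExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("configuration-1.xml");          // hypothetical resource on the classpath
    String color = conf.get("color", "white");        // String, with a default value
    int size = conf.getInt("size", 0);                // Java primitive
    boolean debug = conf.getBoolean("debug", false);  // Java primitive
    System.out.println(color + " " + size + " " + debug);
  }
}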
-
hPot-Tech
2 Map reduce Programming
The Configuration API..
-
hPot-Tech
3 Map reduce Programming
Tool implementation :
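The transcript omits the slide's code; the following is a minimal sketch of a typical Tool implementation, run through ToolRunner so that standard options such as -conf and -D are parsed for you (the class name and job setup are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf(); // already populated from -conf/-D options
    // ... create, configure, and submit the job here ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyDriver(), args));
  }
}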
-
hPot-Tech
4 Map reduce Programming
Packaging a Job
A job's classes must be packaged into a job JAR file to send to the cluster. Any dependent JAR files can be packaged in a lib subdirectory in the job JAR file.
The client classpath
The user's client-side classpath set by hadoop jar is made up of:
o The job JAR file
o Any JAR files in the lib directory of the job JAR file, and the classes directory
o The classpath defined by HADOOP_CLASSPATH, if set
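For illustration, a job JAR with bundled dependencies might be laid out as follows (class and JAR names are hypothetical):

job.jar
  com/example/MyDriver.class
  com/example/MyMapper.class
  com/example/MyReducer.class
  classes/                 (additional classes, if any)
  lib/
    dependency-1.jar
    dependency-2.jar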
-
hPot-Tech
5 Map reduce Programming
Launching a Job
To launch the job, we need to run the driver, specifying the cluster that we want to run the job on with the -conf option
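For example (driver class, configuration file, and paths are illustrative):

% hadoop jar job.jar com.example.MyDriver -conf conf/hadoop-cluster.xml input output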
-
hPot-Tech
6 Map reduce Programming
The Job output..
-
hPot-Tech
7 Map reduce Programming
The Job output..
-
hPot-Tech
8 Map reduce Programming
The MapReduce Web UI.
Hadoop provides a web UI for viewing information about your jobs. It is useful for:
o following a job's progress while it is running
o finding job statistics and logs after the job has completed
The jobtracker UI is available at http://jobtracker-host:50030/.
-
hPot-Tech
9 Map reduce Programming
The jobtracker page
-
hPot-Tech
10 Map reduce Programming
The jobtracker page
-
hPot-Tech
11 Map reduce Programming
The job page
-
hPot-Tech
12 Map reduce Programming
The job page
-
hPot-Tech
13 Map reduce Programming
Map Reduce Programming
-
hPot-Tech
14 Map reduce Programming
The MapReduce Approach
Shared memory approach (OpenMP, MPI, ...):
o Developer needs to take care of (almost) everything
o Synchronization, concurrency
o Resource allocation
MapReduce: a shared-nothing approach:
o Most of the above issues are taken care of
o Problem decomposition and sharing partial results need particular attention
o Optimizations (memory and network consumption) are tricky
-
hPot-Tech
15 Map reduce Programming
Functional Programming Roots
Key feature: higher-order functions
o Functions that accept other functions as arguments
o Map and Fold
Figure: Illustration of map and fold.
-
hPot-Tech
16 Map reduce Programming
Functional Programming Roots
map phase: Given a list, map takes as an argument a function f (that takes a single argument) and applies it to all elements in the list.
fold phase: Given a list, fold takes as arguments a function g (that takes two arguments) and an initial value:
o g is first applied to the initial value and the first item in the list
o The result is stored in an intermediate variable, which is used as an input together with the next item to a second application of g
o The process is repeated until all items in the list have been consumed
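As a small illustration of the two primitives (plain Java streams, not MapReduce code; names are illustrative): map squares every element in isolation, and fold sums the results by repeatedly applying g to an accumulator and the next item.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapFoldExample {
  public static void main(String[] args) {
    List<Integer> xs = Arrays.asList(1, 2, 3, 4);
    // map: apply f (x -> x * x) to each element independently
    List<Integer> squares = xs.stream().map(x -> x * x).collect(Collectors.toList());
    // fold: start from 0, apply g (acc, x) -> acc + x until the list is consumed
    int sum = squares.stream().reduce(0, (acc, x) -> acc + x);
    System.out.println(squares + " -> " + sum); // [1, 4, 9, 16] -> 30
  }
}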
-
hPot-Tech
17 Map reduce Programming
Functional Programming Roots
We can view map as a transformation over a dataset:
o This transformation is specified by the function f
o Each functional application happens in isolation
o The application of f to each element of a dataset can be parallelized in a straightforward manner
We can view fold as an aggregation operation:
o The aggregation is defined by the function g
o Data locality: elements in the list must be brought together
o If we can group elements of the list, the fold phase can also proceed in parallel
Associative and commutative operations allow performance gains through local aggregation and reordering.
-
hPot-Tech
18 Map reduce Programming
Functional Programming and MapReduce
Equivalence of MapReduce and Functional Programming:
o The map of MapReduce corresponds to the map operation
o The reduce of MapReduce corresponds to the fold operation
The framework coordinates the map and reduce phases:
o How intermediate results are grouped for the reduce to happen in parallel
In practice:
o User-specified computation is applied (in parallel) to all input records of a dataset
o Intermediate results are aggregated by another user-specified computation
-
hPot-Tech
19 Map reduce Programming
Mappers and Reducers
-
hPot-Tech
20 Map reduce Programming
Data Structures
Key-value pairs are the basic data structure in MapReduce
Keys and values can be: integers, floats, strings, raw bytes
o They can also be arbitrary data structures
The design of MapReduce algorithms involves imposing the key-value structure on arbitrary datasets
o E.g.: for a collection of Web pages, input keys may be URLs and values may be the HTML content
In some algorithms, input keys are not used; in others they uniquely identify a record
Keys can be combined in complex ways to design various algorithms
-
hPot-Tech
21 Map reduce Programming
A MapReduce job
The programmer defines a mapper and a reducer as follows:
o map: (k1, v1) → [(k2, v2)]
o reduce: (k2, [v2]) → [(k3, v3)]
A MapReduce job consists of:
o A dataset stored on the underlying distributed filesystem, which is split in a number of files across machines
o The mapper is applied to every input key-value pair to generate intermediate key-value pairs
o The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs
-
hPot-Tech
22 Map reduce Programming
Where the magic happens
Implicit between the map and reduce phases is a distributed group-by operation on intermediate keys
o Intermediate data arrive at each reducer in order, sorted by the key
o No ordering is guaranteed across reducers
Output keys from reducers are written back to the distributed filesystem
o The output may consist of r distinct files, where r is the number of reducers
o Such output may be the input to a subsequent MapReduce phase
Intermediate keys are transient:
o They are not stored on the distributed filesystem
o They are spilled to the local disk of each machine in the cluster
-
hPot-Tech
23 Map reduce Programming
A Simplified view of MapReduce
Figure: Mappers are applied to all input key-value pairs, to generate an arbitrary number of intermediate pairs. Reducers are applied to all intermediate values associated with the same intermediate key. Between the map and reduce phase lies a barrier that involves a large distributed sort and group by
-
hPot-Tech
24 Map reduce Programming
-
hPot-Tech
25 Map reduce Programming
Hello World in MapReduce
Input: key-value pairs (docid, doc) stored on the distributed filesystem
o docid: a unique identifier for a document
o doc: the text of the document itself
Mapper:
o Takes an input key-value pair and tokenizes the document
o Emits intermediate key-value pairs: the word is the key and the integer 1 is the value
The framework:
o Guarantees that all values associated with the same key (the word) are brought to the same reducer
The reducer:
o Receives all values associated with some keys
o Sums the values and writes output key-value pairs: the key is the word and the value is the number of occurrences
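A minimal word count mapper and reducer matching the description above, using the org.apache.hadoop.mapreduce API (this is the standard example, sketched from memory rather than taken from the slides):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString()); // tokenize the document
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get(); // sum all counts for this word
      context.write(key, new IntWritable(sum));    // emit (word, total occurrences)
    }
  }
}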
-
hPot-Tech
26 Map reduce Programming
Implementation and Execution Details
The partitioner is in charge of assigning intermediate keys (words) to reducers
Note that the partitioner can be customized
How many map and reduce tasks?
The framework essentially takes care of map tasks The designer/developer takes care of reduce tasks
-
hPot-Tech
27 Map reduce Programming
Restrictions
Using external resources:
o E.g.: data stores other than the distributed filesystem
o Concurrent access by many map/reduce tasks
Side effects:
o Not allowed in functional programming
o E.g.: preserving state across multiple inputs
o State is kept internal
I/O and execution:
o External side effects using distributed data stores (e.g. BigTable)
o A job may have no input (e.g. computing π) or no reducers, but never no mappers
-
hPot-Tech
28 Map reduce Programming
The Execution Framework
-
hPot-Tech
29 Map reduce Programming
The Execution Framework
A MapReduce program, a.k.a. a job, consists of:
o Code for mappers and reducers
o Code for combiners and partitioners (optional)
o Configuration parameters
o All packaged together
A MapReduce job is submitted to the cluster
The framework takes care of everything else
-
hPot-Tech
30 Map reduce Programming
Tutorial: Map Reduce
-
hPot-Tech
31 Map reduce Programming
-
hPot-Tech
32 Map reduce Programming
Debugging a Job
Options: the web UI (via debug statements that log to standard error) and custom counters
-
hPot-Tech
33 Map reduce Programming
Add debugging to the mapper:
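The slide's code is not in the transcript; a sketch of what adding debugging typically looks like, combining a standard-error log line with a custom counter (the enum name and the parse() helper are hypothetical):

// Inside a Mapper subclass:
enum Temperature { OVER_100 }

@Override
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  int airTemperature = parse(value); // hypothetical parsing helper
  if (airTemperature > 1000) {
    // Goes to the task's stderr log, viewable from the web UI:
    System.err.println("Temperature over 100 degrees for input: " + value);
    context.setStatus("Detected possibly corrupt record: see logs.");
    context.getCounter(Temperature.OVER_100).increment(1); // custom counter
  }
  // ... normal map logic ...
}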
-
hPot-Tech
34 Map reduce Programming
The tasks page
-
hPot-Tech
35 Map reduce Programming
The task details page
-
hPot-Tech
36 Map reduce Programming
Hadoop Logs
-
hPot-Tech
37 Map reduce Programming
Anything written to standard output or standard error is directed to the relevant log file.
-
hPot-Tech
38 Map reduce Programming
Remote Debugging
debugger is hard to arrange when running the job on a cluster options :
o Reproduce the failure locally o Use JVM debugging options o Use task profiling o Use IsolationRunner
set keep.failed.task.files to true to keep a failed tasks files.
-
hPot-Tech
39 Map reduce Programming
Tuning a Job
-
hPot-Tech
40 Map reduce Programming
Tuning a Job
-
hPot-Tech
41 Map reduce Programming
Job Submission
JobClient class
The runJob() method creates a new instance of JobClient, then calls submitJob() on it
Simple verifications on the job:
o Is there an output directory?
o Are there any input splits?
o Can the JAR of the job be copied to HDFS?
NOTE: the JAR of the job is replicated 10 times
-
hPot-Tech
42 Map reduce Programming
MapReduce Workflows
o When the processing gets more complex: as a rule of thumb, think about adding more jobs, rather than adding complexity to jobs.
o For more complex problems, consider a higher-level language than MapReduce, such as Pig, Hive, Cascading, Cascalog, or Crunch.
o One immediate benefit is that it frees you from the translation into MapReduce jobs, allowing you to concentrate on the analysis you are performing.
-
hPot-Tech
43 Map reduce Programming
JobControl:
When there is more than one job in a MapReduce workflow: for a linear chain, the simplest approach is to run each job one after another.
For anything more complex than a linear chain, use org.apache.hadoop.mapreduce.jobcontrol.JobControl:
o represents a graph of jobs to be run
o add the job configurations
o tell the JobControl instance the dependencies between jobs
o run the JobControl in a thread, and it runs the jobs in dependency order
o you can poll for progress, and when the jobs have finished, you can query for all the jobs' statuses and the associated errors for any failures
o if a job fails, JobControl won't run the jobs that depend on it
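A sketch of wiring two dependent jobs together (using the newer org.apache.hadoop.mapreduce.lib.jobcontrol package; job1 and job2 are assumed to be fully configured Job instances, with job2 reading job1's output):

import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

ControlledJob cj1 = new ControlledJob(job1.getConfiguration());
ControlledJob cj2 = new ControlledJob(job2.getConfiguration());
cj2.addDependingJob(cj1); // cj2 runs only after cj1 succeeds

JobControl control = new JobControl("workflow");
control.addJob(cj1);
control.addJob(cj2);

new Thread(control).start();     // JobControl implements Runnable
while (!control.allFinished()) { // poll for progress
  Thread.sleep(5000);
}
System.out.println("Failed jobs: " + control.getFailedJobList());
control.stop();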
-
hPot-Tech
44 Map reduce Programming
Advanced MapReduce: How MapReduce Works
-
hPot-Tech
45 Map reduce Programming
Classic MapReduce
-
hPot-Tech
46 Map reduce Programming
Failures
A major benefit of using Hadoop is its ability to handle failures and allow the job to complete.
Task failure:
o When user code in the map or reduce task throws a runtime exception, the error ultimately makes it into the user logs.
o Hanging tasks are dealt with differently: mapred.task.timeout
o When the jobtracker is notified of a task attempt that has failed (by the tasktracker's heartbeat call), it will reschedule execution of the task.
o The jobtracker will try to avoid rescheduling the task on a tasktracker where it has previously failed.
-
hPot-Tech
47 Map reduce Programming
Failures
Tasktracker failure:
o The jobtracker will notice a tasktracker that has stopped sending heartbeats if it hasn't received one for 10 minutes (configured via the mapred.tasktracker.expiry.interval property, in milliseconds)
o and will remove it from its pool of tasktrackers to schedule tasks on.
Jobtracker failure:
o Failure of the jobtracker is the most serious failure mode. Hadoop has no mechanism for dealing with jobtracker failure; it is a single point of failure, so in this case all running jobs fail.
o After restarting a jobtracker, any jobs that were running at the time it was stopped will need to be resubmitted.
-
hPot-Tech
48 Map reduce Programming
Partitioners and Combiners
-
hPot-Tech
49 Map reduce Programming
Partitioners
Partitioners are responsible for:
o Dividing up the intermediate key space
o Assigning intermediate key-value pairs to reducers
o Specifying the reduce task to which an intermediate key-value pair must be copied
Hash-based partitioner:
o Computes the hash of the key modulo the number of reducers r
o This ensures a roughly even partitioning of the key space
o However, it ignores values: this can cause imbalance in the data processed by each reducer
When dealing with complex keys, even the base partitioner may need customization
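A sketch of such a customization in the new API (the first-letter scheme is purely illustrative; the default HashPartitioner effectively computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: route keys by their first character instead of the full hash.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) return 0;
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}

// In the driver: job.setPartitionerClass(FirstLetterPartitioner.class);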
-
hPot-Tech
50 Map reduce Programming
Combiners
Combiners are an (optional) optimization:
o Allow local aggregation before the shuffle and sort phase
o Each combiner operates in isolation
Essentially, combiners are used to save bandwidth
o E.g.: the word count program
Combiners can be implemented using local data structures
o E.g., an associative array keeps intermediate computations and aggregations thereof; the map function then emits only once all input records (even all input splits) have been processed
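In the word count case the reducer itself can serve as the combiner, because addition is associative and commutative; a sketch of enabling it in the driver (assuming job is a configured org.apache.hadoop.mapreduce.Job and the classes are the word count sketch from earlier):

// The combiner runs on map output locally, before the shuffle, to save bandwidth.
job.setMapperClass(WordCount.TokenizerMapper.class);
job.setCombinerClass(WordCount.IntSumReducer.class);
job.setReducerClass(WordCount.IntSumReducer.class);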
-
hPot-Tech
51 Map reduce Programming
Partitioners and Combiners, an Illustration
Note: in Hadoop, partitioners are executed before combiners
-
hPot-Tech
52 Map reduce Programming
-
hPot-Tech
53 Map reduce Programming
Lab : Combiner & Partitioners
-
hPot-Tech
54 Map reduce Programming
MRUnit: MapReduce unit testing. The map and reduce functions in MapReduce are easy to test in isolation.
MRUnit:
o a testing library that makes it easy to pass known inputs to a mapper or a reducer and check that the outputs are as expected
o used in conjunction with a standard test execution framework, such as JUnit.
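A sketch of an MRUnit test for the word count mapper (MRUnit's mapreduce MapDriver with JUnit; the test class and input line are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {
  @Test
  public void emitsOnePerToken() throws Exception {
    MapDriver.newMapDriver(new WordCount.TokenizerMapper())
        .withInput(new LongWritable(0), new Text("hello hello world"))
        .withOutput(new Text("hello"), new IntWritable(1))  // expected outputs, in order
        .withOutput(new Text("hello"), new IntWritable(1))
        .withOutput(new Text("world"), new IntWritable(1))
        .runTest();
  }
}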
-
hPot-Tech
55 Map reduce Programming
Mapper
-
hPot-Tech
56 Map reduce Programming
Reducer
-
hPot-Tech
57 Map reduce Programming
Tutorial : MRUnit.
-
hPot-Tech
58 Map reduce Programming
-
hPot-Tech
59 Map reduce Programming
Hadoop MapReduce Types and Formats
-
hPot-Tech
60 Map reduce Programming
MapReduce Types
Input / output to mappers and reducers:
a. map: (k1, v1) → [(k2, v2)]
b. reduce: (k2, [v2]) → [(k3, v3)]
In Hadoop, a mapper is created as follows:
a. void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
Types:
a. K types implement WritableComparable
b. V types implement Writable
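A sketch of declaring these types with the old JobConf API (class names and paths are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyDriver.class);
conf.setMapperClass(MyMapper.class);            // map: (K1, V1) → [(K2, V2)]
conf.setReducerClass(MyReducer.class);          // reduce: (K2, [V2]) → [(K3, V3)]
conf.setMapOutputKeyClass(Text.class);          // K2: must implement WritableComparable
conf.setMapOutputValueClass(IntWritable.class); // V2: must implement Writable
conf.setOutputKeyClass(Text.class);             // K3
conf.setOutputValueClass(IntWritable.class);    // V3
FileInputFormat.addInputPath(conf, new Path("in"));
FileOutputFormat.setOutputPath(conf, new Path("out"));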
-
hPot-Tech
61 Map reduce Programming
What is a Writable
Hadoop defines its own classes for strings (Text), integers (IntWritable), etc.
All keys are instances of WritableComparable
o Why comparable? Keys are sorted between the map and reduce phases
All values are instances of Writable
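A quick illustration of these types in use (plain Java, nothing cluster-specific):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

IntWritable count = new IntWritable(42); // Writable wrapper around an int
Text word = new Text("hadoop");          // WritableComparable string
Text other = new Text("hive");
// compareTo() is what the framework relies on to sort keys during the shuffle:
System.out.println(word.compareTo(other)); // negative: "hadoop" sorts before "hive"
System.out.println(count.get());           // unwrap back to a Java int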
-
hPot-Tech
62 Map reduce Programming
-
hPot-Tech
63 Map reduce Programming
Reading Data
Datasets are specified by InputFormats
o InputFormats define input data (e.g. a file, a directory)
o An InputFormat is a factory for RecordReader objects to extract key-value records from the input source
InputFormats identify partitions of the data that form an InputSplit:
o An InputSplit is a (reference to a) chunk of the input processed by a single map
o The largest split is processed first
Each split is divided into records, and the map processes each record (a key-value pair) in turn
Splits and records are logical; they are not physically bound to a file
-
hPot-Tech
64 Map reduce Programming
The relationship between InputSplit and HDFS blocks
-
hPot-Tech
65 Map reduce Programming
FileInputFormat and Friends
TextInputFormat: treats each newline-terminated line of a file as a value
KeyValueTextInputFormat: maps newline-terminated text lines of "key SEPARATOR value"
SequenceFileInputFormat: binary file of key-value pairs with some additional metadata
SequenceFileAsTextInputFormat: same as before, but maps (k.toString(), v.toString())
-
hPot-Tech
66 Map reduce Programming
Filtering File Inputs
FileInputFormat reads all files in a specified directory and sends them to the mapper
It delegates filtering of this file list to a method that subclasses may override
o Example: create your own XyzFileInputFormat to read *.xyz files from a directory list
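Rather than a full subclass, the same filtering can be done with a PathFilter hook in the new API (XyzFilter is illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical filter: accept only *.xyz files from the input listing.
public class XyzFilter implements PathFilter {
  @Override
  public boolean accept(Path path) {
    return path.getName().endsWith(".xyz");
  }
}

// In the driver: FileInputFormat.setInputPathFilter(job, XyzFilter.class);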
-
hPot-Tech
67 Map reduce Programming
Record Readers
Each InputFormat provides its own RecordReader implementation
LineRecordReader
Reads a line from a text file
KeyValueRecordReader
Used by KeyValueTextInputFormat
-
hPot-Tech
68 Map reduce Programming
Input Split Size
-
hPot-Tech
69 Map reduce Programming
Sending Data to Reducers
The map function receives an OutputCollector object
o OutputCollector.collect() receives key-value elements
Any (WritableComparable, Writable) pair can be used
By default, the mapper output types are assumed to be the same as the reducer output types
-
hPot-Tech
70 Map reduce Programming
WritableComparator
Compares WritableComparable data
o Calls the WritableComparable.compareTo() method
o Can provide a fast path for serialized data
Configured through JobConf.setOutputValueGroupingComparator()
-
hPot-Tech
71 Map reduce Programming
Partitioner
int getPartition(key, value, numPartitions)
o Outputs the partition number for a given key
o One partition == all values sent to a single reduce task
HashPartitioner is used by default
o Uses key.hashCode() to return the partition number
JobConf is used to set the Partitioner implementation
-
hPot-Tech
72 Map reduce Programming
The Reducer
void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
Keys and values sent to one partition all go to the same reduce task
Calls are sorted by key: early keys are reduced and output before late keys
-
hPot-Tech
73 Map reduce Programming
Writing the Output
-
hPot-Tech
74 Map reduce Programming
Writing the Output
Analogous to InputFormat
TextOutputFormat writes "key value" strings to the output file
SequenceFileOutputFormat uses a binary format to pack key-value pairs
NullOutputFormat discards output
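Selecting an OutputFormat in a new-API driver (a sketch; the output path is illustrative and job is assumed to be a configured Job):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Default is TextOutputFormat (tab-separated key/value lines); switch to a
// binary SequenceFile, e.g. when the output feeds a follow-on MapReduce job:
job.setOutputFormatClass(SequenceFileOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("out"));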
-
hPot-Tech
75 Map reduce Programming
Lab :- Input and Output
-
hPot-Tech
76 Map reduce Programming
Map Side and Reduce Side Joins
-
hPot-Tech
77 Map reduce Programming
Joins
MapReduce can perform joins between large datasets
-
hPot-Tech
78 Map reduce Programming
Join:
o performed by the mapper: a map-side join
o performed by the reducer: a reduce-side join
-
hPot-Tech
79 Map reduce Programming
Map-Side Joins
A map-side join between large inputs works by performing the join before the data reaches the map function.
o The inputs to each map must be partitioned and sorted in a particular way.
o Each input dataset must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source.
o All the records for a particular key must reside in the same partition.
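With the old (org.apache.hadoop.mapred) API, inputs that satisfy these constraints can be joined with CompositeInputFormat; a sketch under those assumptions (the paths and the choice of KeyValueTextInputFormat are illustrative):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

JobConf conf = new JobConf();
conf.setInputFormat(CompositeInputFormat.class);
// Inner join over two datasets partitioned and sorted identically on the join key:
conf.set("mapred.join.expr",
    CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
        "/data/left", "/data/right"));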
-
hPot-Tech
80 Map reduce Programming
Reduce-Side Joins
A reduce-side join is more general than a map-side join:
o the input datasets don't have to be structured in any particular way
o the mapper tags each record with its source and uses the join key as the map output key, so that the records with the same key are brought together in the reducer.
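A sketch of the tagging idea in a reduce-side join mapper (CSV layout, tag format, and class names are all illustrative; real implementations often use a composite key type instead of a string prefix):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper for source "A"; an analogous mapper tags the other source with "B".
public class SourceAJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(","); // assume CSV: joinKey,rest...
    String joinKey = fields[0];
    // Tag each record with its source so the reducer can tell the sides apart;
    // records sharing a join key all arrive at the same reducer.
    context.write(new Text(joinKey), new Text("A|" + value));
  }
}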
-
hPot-Tech
81 Map reduce Programming
Lab : Map Side Join.
-
Managing a Hadoop Cluster
-
Hadoop Cluster Component
NameNode: Manages the namespace, file system metadata, and access control. There is exactly one NameNode in each cluster.
SecondaryNameNode: Downloads periodic checkpoints from the NameNode for fault-tolerance. There is exactly one SecondaryNameNode in each cluster.
-
Hadoop Cluster Component
JobTracker: Hands out tasks to the slave nodes. There is exactly one JobTracker in each cluster.
DataNode: Holds file system data; each data node manages its own locally-attached storage (i.e., the node's hard disk) and stores a copy of some or all blocks in the file system. There are one or more DataNodes in each cluster. If your cluster has only one DataNode then file system data cannot be replicated.
-
Hadoop Cluster Component
TaskTracker: Slaves that carry out map and reduce tasks. There are one or more TaskTrackers in each cluster.
-
HDFS Architecture
Figure: HDFS architecture. A client performs metadata operations (name, replicas, e.g. /home/foo/data) against the NameNode, and reads and writes blocks directly from and to DataNodes; blocks are replicated across racks (Rack 1, Rack 2).
-
Platform requirements for Hadoop
Java Requirements
Hadoop is a Java-based system. Recent versions of Hadoop require Sun Java 1.6.
Operating System
As Hadoop is written in Java, it is mostly portable between different operating systems.
Downloading and Installing Hadoop
-
Topology of a typical Hadoop cluster .
-
Installation Steps
Install Java
ssh and sshd
gunzip hadoop-0.18.0.tar.gz
or tar vxf hadoop-0.18.0.tar
Set JAVA_HOME in conf/hadoop-env.sh
Modify hadoop-site.xml
-
Hadoop Installation Flavors
Standalone
Pseudo-distributed
Hadoop clusters of multiple nodes
-
Additional Configuration
conf/masters
o contains the hostname of the SecondaryNameNode
o should be a fully-qualified domain name
conf/slaves
o the hostname of every machine in the cluster which should start TaskTracker and DataNode daemons
o Ex: slave01
slave02
slave03
-
Advanced Configuration
Enable passwordless ssh:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
The ~/.ssh/id_dsa.pub and authorized_keys files should be replicated on all machines in the cluster.
-
Advanced Configuration
Various directories should be created on each node.
The NameNode requires the NameNode metadata directory:
$ mkdir -p /home/hadoop/dfs/name
Every node needs the Hadoop tmp directory and DataNode directory created.
-
Advanced Configuration..
bin/slaves.sh allows a command to be executed on all nodes in the slaves file:
$ export HADOOP_CONF_DIR=${HADOOP_HOME}/conf
$ export HADOOP_SLAVES=${HADOOP_CONF_DIR}/slaves
$ ${HADOOP_HOME}/bin/slaves.sh "mkdir -p /tmp/hadoop"
$ ${HADOOP_HOME}/bin/slaves.sh "mkdir -p /home/hadoop/dfs/data"
Format HDFS:
$ bin/hadoop namenode -format
Start the cluster:
$ bin/start-all.sh
-
Important Directories
Directory: description; default location; suggested location
HADOOP_LOG_DIR: output location for log files from daemons; default ${HADOOP_HOME}/logs; suggested /var/log/hadoop
hadoop.tmp.dir: a base for other temporary directories; default /tmp/hadoop-${user.name}; suggested /tmp/hadoop
dfs.name.dir: where the NameNode metadata should be stored; default ${hadoop.tmp.dir}/dfs/name; suggested /home/hadoop/dfs/name
dfs.data.dir: where DataNodes store their blocks; default ${hadoop.tmp.dir}/dfs/data; suggested /home/hadoop/dfs/data
mapred.system.dir: the in-HDFS path to shared MapReduce system files; default ${hadoop.tmp.dir}/mapred/system; suggested /hadoop/mapred/system
-
Recommended configuration
dfs.name.dir and dfs.data.dir should be moved out from under hadoop.tmp.dir
Adjust mapred.system.dir
-
Selecting Machines
Hadoop is designed to take advantage of whatever hardware is available.
Hadoop jobs written in Java can consume between 1 and 2 GB of RAM per core.
If you use Hadoop Streaming to write your jobs in a scripting language such as Python, more memory may be advisable.
-
Cluster Configurations
Small Clusters: 2-10 Nodes
Medium Clusters: 10-40 Nodes
Large Clusters: Multiple Racks
-
Small Clusters: 2-10 Nodes
With two nodes:
o one node runs the NameNode/JobTracker and a DataNode/TaskTracker
o the other node runs a DataNode/TaskTracker
Clusters of three or more machines typically use a dedicated NameNode/JobTracker, and all other nodes are workers.
-
configuration in conf/hadoop-site.xml
<property>
  <name>mapred.job.tracker</name>
  <value>head.server.node.com:9001</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://head.server.node.com:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop</value>
  <final>true</final>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/mapred/system</value>
  <final>true</final>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/dfs/data</value>
  <final>true</final>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name</value>
  <final>true</final>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
-
Medium Clusters: 10-40 Nodes
The single point of failure in a Hadoop cluster is the NameNode.
Hence, back up the NameNode metadata.
One machine in the cluster should be designated as the NameNode's backup:
o it does not run the normal Hadoop daemons
o it exposes a directory via NFS which is only mounted on the NameNode
-
NameNode's backup
The cluster's hadoop-site.xml file should then instruct the NameNode to write to this directory as well:
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name,/mnt/namenode-backup</value>
  <final>true</final>
</property>
-
Backup NameNode
The backup machine can also serve as the SecondaryNameNode:
o this is not a failover NameNode process
o it takes periodic snapshots of the NameNode's metadata
-
conf/hadoop-site.xml
Nodes must be decommissioned on a schedule that permits replication of the blocks being decommissioned.
conf/hadoop-site.xml:
<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/excludes</value>
  <final>true</final>
</property>
<property>
  <name>mapred.hosts.exclude</name>
  <value>/home/hadoop/excludes</value>
  <final>true</final>
</property>
Create an empty file with this name: $ touch /home/hadoop/excludes
-
Replication Setting
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
-
Disk & heap
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>1073741824</value>
  <final>true</final>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
-
Using multiple drives per machine
DataNodes can be configured to write blocks out to multiple disks via the dfs.data.dir property:
<property>
  <name>dfs.data.dir</name>
  <value>/d1/dfs/data,/d2/dfs/data,/d3/dfs/data,/d4/dfs/data</value>
  <final>true</final>
</property>
-
Using multiple drives per machine..
<property>
  <name>mapred.local.dir</name>
  <value>/d1/mapred/local,/d2/mapred/local,/d3/mapred/local,/d4/mapred/local</value>
  <final>true</final>
</property>
-
Tutorial
Configure Hadoop Cluster in two nodes.
Tutorial-Installed Hadoop in Cluster.docx
-
Large Clusters: Multiple Racks
The possibility of rack failure now exists:
o operational racks should be able to continue even if entire other racks are disabled
o the amount of metadata under the care of the NameNode increases
-
Large Clusters: Multiple Racks
The NameNode is responsible for managing the metadata associated with each block in the HDFS.
The amount of information in the rack scales into the 10's or 100's of TB.
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
-
Large Clusters: Multiple Racks
The NFS-mounted write-through backup should be placed in a different rack from the NameNode.
The SecondaryNameNode should be instantiated on a separate rack.
-
Large Clusters: Multiple Racks
<property>
  <name>dfs.namenode.handler.count</name>
  <value>40</value>
</property>
<property>
  <name>mapred.job.tracker.handler.count</name>
  <value>40</value>
</property>
-
Large Clusters: Multiple Racks
Property (suggested range): description
io.file.buffer.size (32768-131072): Read/write buffer size used in SequenceFiles (should be a multiple of the hardware page size)
io.sort.factor (50-200): Number of streams to merge concurrently when sorting files during shuffling
io.sort.mb (50-200): Amount of memory to use while sorting data
mapred.reduce.parallel.copies (20-50): Number of concurrent connections a reducer should use when fetching its input from mappers
tasktracker.http.threads (40-50): Number of threads each TaskTracker uses to provide intermediate map output to reducers
mapred.tasktracker.map.tasks.maximum (1/2 * cores/node to 2 * cores/node): Number of map tasks to deploy on each machine
mapred.tasktracker.reduce.tasks.maximum (1/2 * cores/node to 2 * cores/node): Number of reduce tasks to deploy on each machine