Hadoop Training in Hyderabad
Author: Kelly Technologies
Category: Education
Slide 1
Hadoop Technical Introduction
Presented by Kelly Technologies
1
Terminology
Google calls it:    Hadoop equivalent:
MapReduce           Hadoop
GFS                 HDFS
Bigtable            HBase
Chubby              ZooKeeper
2
Some MapReduce Terminology
Job: a full program, an execution of a Mapper and Reducer across a data set
Task: an execution of a Mapper or a Reducer on a slice of data, a.k.a. a Task-In-Progress (TIP)
Task Attempt: a particular instance of an attempt to execute a task on a machine
3
Task Attempts
A particular task will be attempted at least once, possibly more times if it crashes
If the same input causes crashes over and over, that input will eventually be abandoned
Multiple attempts at one task may occur in parallel with speculative execution turned on (see the configuration sketch below)
The Task ID from TaskInProgress is not a unique identifier; don't use it that way
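Speculative execution is a per-job setting. A minimal sketch, assuming the old org.apache.hadoop.mapred.JobConf API used later in these slides:

    JobConf conf = new JobConf(WordCount.class);
    // allow duplicate (speculative) attempts for slow map tasks, but not for reduces
    conf.setMapSpeculativeExecution(true);
    conf.setReduceSpeculativeExecution(false);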
4
MapReduce: High Level
In our case: circe.rc.usf.edu
5
Nodes, Trackers, Tasks
The master node runs a JobTracker instance, which accepts job requests from clients
TaskTracker instances run on slave nodes
TaskTracker forks a separate Java process for task instances
6
Job Distribution
MapReduce programs are contained in a Java jar file plus an XML file containing serialized program configuration options
Running a MapReduce job places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code
Where's the data distribution?
7
Data Distribution
Implicit in the design of MapReduce!
All mappers are equivalent, so each node maps whatever data is local to it in HDFS
If lots of data does happen to pile up on the same node, nearby nodes will map instead
Data transfer is handled implicitly by HDFS
8
What Happens In Hadoop? (Depth First)
9
Job Launch Process: Client
The client program creates a JobConf
Identify classes implementing the Mapper and Reducer interfaces: JobConf.setMapperClass(), setReducerClass()
Specify inputs and outputs: FileInputFormat.setInputPaths(), FileOutputFormat.setOutputPath()
Optionally, other options too: JobConf.setNumReduceTasks(), JobConf.setOutputFormat()
10
Job Launch Process: JobClient
Pass the JobConf to JobClient.runJob() or submitJob()
runJob() blocks; submitJob() does not
JobClient determines the proper division of input into InputSplits
Sends job data to the master JobTracker server
11
Job Launch Process: JobTracker
JobTracker inserts the jar and JobConf (serialized to XML) in a shared location
Posts a JobInProgress to its run queue
12
Job Launch Process: TaskTracker
TaskTrackers running on slave nodes periodically query the JobTracker for work
Retrieve the job-specific jar and config
Launch the task in a separate instance of Java
main() is provided by Hadoop
13
Job Launch Process: Task
TaskTracker.Child.main():
Sets up the child TaskInProgress attempt
Reads the XML configuration
Connects back to the necessary MapReduce components via RPC
Uses TaskRunner to launch the user process
14
Job Launch Process: TaskRunner
TaskRunner, MapTaskRunner, and MapRunner work in a daisy chain to launch your Mapper
The Task knows ahead of time which InputSplits it should be mapping
Calls the Mapper once for each record retrieved from the InputSplit
Running the Reducer is much the same
15
Creating the Mapper
You provide the instance of Mapper
Should extend MapReduceBase
One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
Exists in a separate process from all other instances of Mapper, so no data sharing!
16
Mapper
void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
The K types implement WritableComparable; the V types implement Writable
17
What is Writable?
Hadoop defines its own "box" classes for strings (Text), integers (IntWritable), etc.
All values are instances of Writable
All keys are instances of WritableComparable
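As an illustration of the box classes (a small sketch, not from the original slides), Writables are mutable wrappers around Java primitives that can be reused from record to record:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class BoxClassDemo {
        public static void main(String[] args) {
            Text word = new Text("hadoop");           // wraps a Java String
            IntWritable count = new IntWritable(1);   // wraps a Java int
            word.set("mapreduce");                    // mutable: one instance can be reused per record
            System.out.println(word + "\t" + count.get());
        }
    }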
18
Getting Data To The Mapper
19
Reading Data
Data sets are specified by InputFormats
Defines the input data (e.g., a directory)
Identifies the partitions of the data that form an InputSplit
Factory for RecordReader objects to extract (k, v) records from the input source
20
FileInputFormat and Friends
TextInputFormat: treats each \n-terminated line of a file as a value
KeyValueTextInputFormat: maps \n-terminated text lines of "k SEP v"
SequenceFileInputFormat: binary file of (k, v) pairs with some additional metadata
SequenceFileAsTextInputFormat: same, but maps (k.toString(), v.toString())
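For illustration, the input format is chosen on the JobConf. This is a sketch against the old org.apache.hadoop.mapred API used elsewhere in these slides, with a placeholder input directory:

    JobConf conf = new JobConf(WordCount.class);
    // parse each input line as "key SEP value" instead of treating the whole line as the value
    conf.setInputFormat(KeyValueTextInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("in-dir"));   // placeholder path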
21
Filtering File Inputs
FileInputFormat will read all files out of a specified directory and send them to the mapper
Delegates filtering this file list to a method subclasses may override
e.g., create your own XyzFileInputFormat to read *.xyz from the directory list
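A minimal sketch of such a subclass (the class name and the .xyz extension are only the slide's example; assumes the 0.20-era mapred FileInputFormat, where listStatus() builds the file list):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class XyzFileInputFormat extends TextInputFormat {
        @Override
        protected FileStatus[] listStatus(JobConf job) throws IOException {
            // keep only *.xyz files from the listing produced by the parent class
            List<FileStatus> keep = new ArrayList<FileStatus>();
            for (FileStatus f : super.listStatus(job)) {
                if (f.getPath().getName().endsWith(".xyz")) {
                    keep.add(f);
                }
            }
            return keep.toArray(new FileStatus[keep.size()]);
        }
    }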
22
Record Readers
Each InputFormat provides its own RecordReader implementation
Provides (unused?) capability multiplexing
LineRecordReader: reads a line from a text file
KeyValueRecordReader: used by KeyValueTextInputFormat
23
Input Split Size
FileInputFormat will divide large files into chunks
Exact size controlled by mapred.min.split.size
RecordReaders receive the file, offset, and length of the chunk
Custom InputFormat implementations may override split size, e.g., NeverChunkFile
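A minimal sketch of the NeverChunkFile idea (the class name is only the slide's example, not a stock Hadoop class): overriding isSplitable() in the old mapred API makes every file exactly one InputSplit:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class NeverChunkFileInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;   // never split: one InputSplit (and one map task) per input file
        }
    }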
24
Sending Data To Reducers
The map function receives an OutputCollector object
OutputCollector.collect() takes (k, v) elements
Any (WritableComparable, Writable) pair can be used
By default, the mapper output type is assumed to be the same as the reducer output type
25
WritableComparator
Compares WritableComparable data
Will call WritableComparable.compareTo() by default
Can provide a fast path for serialized data
Registered via JobConf.setOutputValueGroupingComparator()
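As a brief sketch (assuming Text keys, as in the WordCount example later), comparators are registered on the JobConf; Text.Comparator is the stock raw comparator that can compare serialized Text without deserializing it:

    // sort map output keys using the raw (serialized) fast path
    conf.setOutputKeyComparatorClass(Text.Comparator.class);
    // group values for a reduce() call using the same comparison
    conf.setOutputValueGroupingComparator(Text.Comparator.class);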
26
Sending Data To The Client
The Reporter object sent to the Mapper allows simple asynchronous feedback
incrCounter(Enum key, long amount)
setStatus(String msg)
Allows self-identification of input: InputSplit getInputSplit()
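For illustration (the counter enum below is hypothetical, not part of Hadoop), a map() body might use the Reporter it is handed like this:

    // defined inside the Mapper class
    enum RecordCounters { MALFORMED }

    // inside map(), using the Reporter passed in by the framework
    reporter.setStatus("processing " + reporter.getInputSplit());   // self-identify the current split
    reporter.incrCounter(RecordCounters.MALFORMED, 1);              // bump an application-defined counter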
27
Partition And Shuffle
28
Partitioner
int getPartition(key, val, numPartitions)
Outputs the partition number for a given key
One partition == the values sent to one reduce task
HashPartitioner is used by default
Uses key.hashCode() to return the partition number
JobConf sets the Partitioner implementation
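A sketch of a custom Partitioner (the class name is hypothetical; assumes Text keys and IntWritable values, as in the WordCount example later in these slides):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
        public void configure(JobConf job) { }   // nothing to configure in this sketch

        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // route words by their first character so similar keys land in the same reduce task
            String s = key.toString();
            char first = s.isEmpty() ? 'a' : s.charAt(0);
            return first % numPartitions;
        }
    }

It would be registered with conf.setPartitionerClass(FirstLetterPartitioner.class).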
29
Reduction
reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
Keys and values sent to one partition all go to the same reduce task
Calls are sorted by key: "earlier" keys are reduced and output before "later" keys
30
Finally: Writing The Output
31
OutputFormat
Analogous to InputFormat
TextOutputFormat: writes "key val\n" strings to the output file
SequenceFileOutputFormat: uses a binary format to pack (k, v) pairs
NullOutputFormat: discards output; only useful if defining your own output methods within reduce()
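For illustration, a sketch of selecting an output format on the JobConf (same old mapred API as the WordCount main() below; the output directory is a placeholder):

    JobConf conf = new JobConf(WordCount.class);
    // pack (k, v) pairs into a binary sequence file instead of text lines
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(conf, new Path("out-dir"));   // part-* files will appear here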
32
Example Program: Wordcount
map(): receives a chunk of text; outputs a set of word/count pairs
reduce(): receives a key and all its associated values; outputs the key and the sum of the values

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {
33
Wordcount main()

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
34
Wordcount map()

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
35
Wordcount reduce()

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
}   // closes class WordCount
36
Hadoop Streaming
Allows you to create and run map/reduce jobs with any executable
Similar to Unix pipes, e.g.:
Format is: Input | Mapper | Reducer
echo "this sentence has five lines" | cat | wc
37
Hadoop Streaming
The Mapper and Reducer receive data from stdin and write output to stdout
Hadoop takes care of the transmission of data between the map/reduce tasks
It is still the programmer's responsibility to set the correct key/value
Default format: key \t value\n
Let's look at a Python example of a MapReduce word count program
38
Streaming_Mapper.py

#!/usr/bin/env python
import sys

# read in one line of input at a time from stdin
for line in sys.stdin:
    line = line.strip()     # string
    words = line.split()    # list of strings

    # write data on stdout
    for word in words:
        print '%s\t%i' % (word, 1)
39
Hadoop Streaming
What are we outputting?
Example output: "the    1" (tab-separated)
By default, "the" is the key and "1" is the value
Hadoop Streaming handles delivering this key/value pair to a Reducer
Able to send similar keys to the same Reducer or to an intermediary Combiner
40
Streaming_Reducer.py

#!/usr/bin/env python
import sys

wordcount = { }   # empty dictionary

# read in one line of input at a time from stdin
for line in sys.stdin:
    line = line.strip()     # string
    key, value = line.split()
    wordcount[key] = wordcount.get(key, 0) + int(value)   # value arrives as a string, so convert it

# write data on stdout
for word, count in sorted(wordcount.items()):
    print '%s\t%i' % (word, count)
41
Hadoop Streaming Gotcha
A Streaming Reducer receives single lines (which are key/value pairs) from stdin
A regular Reducer receives a collection of all the values for a particular key
It is still the case that all the values for a particular key will go to a single Reducer
42
Using the Hadoop Distributed File System (HDFS)
Can access HDFS through various shell commands (see the Further Resources slide for a link to the documentation)
hadoop fs -put <local file> <HDFS path>
hadoop fs -get <HDFS path> <local file>
hadoop fs -ls
hadoop fs -rm <file>
43
Configuring Number of Tasks
Normal method:
jobConf.setNumMapTasks(400)
jobConf.setNumReduceTasks(4)
Hadoop Streaming method:
-jobconf mapred.map.tasks=400
-jobconf mapred.reduce.tasks=4
Note: the number of map tasks is only a hint to the framework; the actual number depends on the number of InputSplits generated
44
Running a Hadoop Job
Place the input file into HDFS:
hadoop fs -put ./input-file input-file
Run either the normal or the streaming version:
hadoop jar Wordcount.jar org.myorg.WordCount input-file output-file
hadoop jar hadoop-streaming.jar \
    -input input-file \
    -output output-file \
    -file Streaming_Mapper.py \
    -mapper "python Streaming_Mapper.py" \
    -file Streaming_Reducer.py \
    -reducer "python Streaming_Reducer.py"
45
Submitting to RC's GridEngine
Add the appropriate modules:
module add apps/jdk/1.6.0_22.x86_64 apps/hadoop/0.20.2
Use the submit script posted in the Further Resources slide
The script calls internal functions hadoop_start and hadoop_end
Adjust the lines for transferring the input file to HDFS and starting the Hadoop job using the commands on the previous slide
Adjust the expected runtime (it is generally good practice to overshoot your estimate):
#$ -l h_rt=02:00:00
NOTICE: All jobs are required to have a hard run-time specification. Jobs that do not have this specification will have a default run-time of 10 minutes and will be stopped at that point.
46
Output Parsing
The output of the reduce tasks must be retrieved:
hadoop fs -get output-file hadoop-output
This creates a directory of output files, one per reduce task
Output files are numbered part-00000, part-00001, etc.
Sample output of Wordcount:
head -n5 part-00000
tis       1
come      2
coming    1
edwin     1
found     1
47
Extra Output
The stdout/stderr streams of Hadoop itself will be stored in an output file (whichever one is named in the startup script):
#$ -o output.$job_id

STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = svc-3024-8-10.rc.usf.edu/10.250.4.205
11/03/02 18:28:47 INFO mapred.FileInputFormat: Total input paths to process : 1
11/03/02 18:28:47 INFO mapred.JobClient: Running job: job_local_0001
11/03/02 18:28:48 INFO mapred.MapTask: numReduceTasks: 1
11/03/02 18:28:48 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/03/02 18:28:48 INFO mapred.Merger: Merging 1 sorted segments
11/03/02 18:28:48 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 43927 bytes
11/03/02 18:28:48 INFO mapred.JobClient: map 100% reduce 0%
11/03/02 18:28:49 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
11/03/02 18:28:49 INFO mapred.JobClient: Job complete: job_local_0001
48
Thank You
[Diagram labels from the slides above]
MapReduce, High Level: a MapReduce job submitted by a client computer goes to the JobTracker on the master node; TaskTracker instances on the slave nodes run the task instances.
Getting Data To The Mapper: input files are divided into InputSplits; the InputFormat provides a RecordReader for each split, which feeds records to a Mapper that produces intermediates.
Partition And Shuffle: each Mapper's intermediates pass through a Partitioner, and shuffling delivers them to the Reducers.
Writing The Output: each Reducer writes through a RecordWriter to its own output file, as defined by the OutputFormat.