
Transcript of Hadoop training-in-hyderabad

Slide 1: Hadoop Technical Introduction
Presented By

Slide 2: Terminology
Google calls it:    Hadoop equivalent:
MapReduce           Hadoop
GFS                 HDFS
Bigtable            HBase
Chubby              ZooKeeper

Slide 3: Some MapReduce Terminology
Job: a full program, i.e. an execution of a Mapper and Reducer across a data set
Task: an execution of a Mapper or a Reducer on a slice of data, a.k.a. Task-In-Progress (TIP)
Task Attempt: a particular instance of an attempt to execute a task on a machine

Slide 4: Task Attempts
A particular task will be attempted at least once, possibly more times if it crashes.
If the same input causes crashes over and over, that input will eventually be abandoned.
Multiple attempts at one task may occur in parallel with speculative execution turned on.
The task ID from TaskInProgress is not a unique identifier; don't use it that way.

Slide 5: MapReduce: High Level
In our case: circe.rc.usf.edu

Slide 6: Nodes, Trackers, Tasks
The master node runs a JobTracker instance, which accepts job requests from clients.
TaskTracker instances run on slave nodes.
A TaskTracker forks a separate Java process for each task instance.

Slide 7: Job Distribution
MapReduce programs are contained in a Java jar file plus an XML file containing serialized program configuration options.
Running a MapReduce job places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code.
Where's the data distribution?

Slide 8: Data Distribution
It is implicit in the design of MapReduce!
All mappers are equivalent, so each node maps whatever data is local to it in HDFS.
If lots of data does happen to pile up on the same node, nearby nodes will map it instead.
Data transfer is handled implicitly by HDFS.

Slide 9: What Happens In Hadoop? (Depth First)

Slide 10: Job Launch Process: Client
The client program creates a JobConf.
It identifies the classes implementing the Mapper and Reducer interfaces: JobConf.setMapperClass(), setReducerClass().
It specifies inputs and outputs: FileInputFormat.setInputPaths(), FileOutputFormat.setOutputPath().
Optionally, other options too: JobConf.setNumReduceTasks(), JobConf.setOutputFormat().

Slide 11: Job Launch Process: JobClient
Pass the JobConf to JobClient.runJob() or submitJob().
runJob() blocks; submitJob() does not.
JobClient:
Determines the proper division of input into InputSplits
Sends job data to the master JobTracker server
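The deck does not show it, but a non-blocking submission might look roughly like the sketch below, using the old org.apache.hadoop.mapred API; conf is assumed to be a fully configured JobConf such as the one built in the WordCount example later, and exception handling is omitted.

    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);     // returns immediately, unlike runJob()
    while (!job.isComplete()) {                  // poll the JobTracker for progress
      System.out.printf("map %3.0f%%  reduce %3.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);
    }
    System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");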

Slide 12: Job Launch Process: JobTracker
JobTracker:
Inserts the jar and JobConf (serialized to XML) in a shared location
Posts a JobInProgress to its run queue

Slide 13: Job Launch Process: TaskTracker
TaskTrackers running on slave nodes periodically query the JobTracker for work.
They retrieve the job-specific jar and config.
They launch the task in a separate instance of Java; main() is provided by Hadoop.

Slide 14: Job Launch Process: Task
TaskTracker.Child.main():
Sets up the child TaskInProgress attempt
Reads the XML configuration
Connects back to the necessary MapReduce components via RPC
Uses TaskRunner to launch the user process

Slide 15: Job Launch Process: TaskRunner
TaskRunner, MapTaskRunner, and MapRunner work in a daisy-chain to launch your Mapper.
The task knows ahead of time which InputSplits it should be mapping.
It calls the Mapper once for each record retrieved from the InputSplit.
Running the Reducer is much the same.

Slide 16: Creating the Mapper
You provide the instance of Mapper; it should extend MapReduceBase.
One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress.
It exists in a separate process from all other instances of Mapper: no data sharing!

Slide 17: Mapper
void map(K1 key, V1 value, OutputCollector output, Reporter reporter)
The K types implement WritableComparable; the V types implement Writable.

Slide 18: What is Writable?
Hadoop defines its own box classes for strings (Text), integers (IntWritable), etc.
All values are instances of Writable.
All keys are instances of WritableComparable.
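The deck does not include one, but a minimal custom key type might look like the sketch below; write() and readFields() handle serialization, compareTo() defines the sort order, and the class name is made up for illustration.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical custom key type; Text, IntWritable, LongWritable, etc. cover the common cases.
    public class YearMonthKey implements WritableComparable<YearMonthKey> {
      private int year;
      private int month;

      public void write(DataOutput out) throws IOException {    // serialize fields in a fixed order
        out.writeInt(year);
        out.writeInt(month);
      }

      public void readFields(DataInput in) throws IOException { // deserialize in the same order
        year = in.readInt();
        month = in.readInt();
      }

      public int compareTo(YearMonthKey o) {                    // sort order used during the shuffle
        if (year != o.year)   { return year  < o.year  ? -1 : 1; }
        if (month != o.month) { return month < o.month ? -1 : 1; }
        return 0;
      }

      public int hashCode() {                                    // used by the default HashPartitioner
        return year * 31 + month;                                // equals() omitted for brevity
      }
    }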

Slide 19: Getting Data To The Mapper

Slide 20: Reading Data
Data sets are specified by InputFormats.
An InputFormat defines the input data (e.g., a directory).
It identifies the partitions of the data that form an InputSplit.
It is a factory for RecordReader objects, which extract (k, v) records from the input source.

Slide 21: FileInputFormat and Friends
TextInputFormat: treats each \n-terminated line of a file as a value
KeyValueTextInputFormat: maps \n-terminated text lines of "k SEP v"
SequenceFileInputFormat: binary file of (k, v) pairs with some additional metadata
SequenceFileAsTextInputFormat: same, but maps (k.toString(), v.toString())
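Choosing one of these looks just like the WordCount driver later in the deck; a brief fragment (the input path is hypothetical, and conf is the job's JobConf):

    conf.setInputFormat(KeyValueTextInputFormat.class);                 // parse each line as key SEP value
    FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));  // hypothetical input directory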

Slide 22: Filtering File Inputs
FileInputFormat will read all files out of a specified directory and send them to the mapper.
It delegates filtering of this file list to a method that subclasses may override.
e.g., create your own xyzFileInputFormat to read *.xyz from the directory list (see the sketch below).
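One way to sketch that xyzFileInputFormat idea is with a PathFilter; this is not from the deck, and it assumes the FileInputFormat.setInputPathFilter() hook is available in the Hadoop version in use.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // Hypothetical filter that keeps only *.xyz files from the input directory listing.
    public class XyzPathFilter implements PathFilter {
      public boolean accept(Path path) {
        return path.getName().endsWith(".xyz");
      }
    }

    // Registration on the job, assuming the hook exists in this API version:
    //   FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);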

Slide 23: Record Readers
Each InputFormat provides its own RecordReader implementation.
This provides an (unused?) capability for multiplexing.
LineRecordReader: reads a line from a text file
KeyValueRecordReader: used by KeyValueTextInputFormat

Slide 24: Input Split Size
FileInputFormat will divide large files into chunks.
The exact size is controlled by mapred.min.split.size.
RecordReaders receive the file, offset, and length of the chunk.
Custom InputFormat implementations may override the split size, e.g., a NeverChunkFile format (sketched below).
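A minimal sketch of that NeverChunkFile idea, assuming the old org.apache.hadoop.mapred API where FileInputFormat.isSplitable() takes a FileSystem and a Path (class name is the slide's hypothetical one):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Never split: each whole file becomes exactly one InputSplit, and therefore one map task.
    public class NeverChunkFileInputFormat extends TextInputFormat {
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }
    }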

Slide 25: Sending Data To Reducers
The map function receives an OutputCollector object.
OutputCollector.collect() takes (k, v) elements.
Any (WritableComparable, Writable) pair can be used.
By default, the mapper output types are assumed to be the same as the reducer output types.
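When they differ, the map output types have to be declared explicitly; a brief fragment (conf is the job's JobConf, and the type choices here are only illustrative):

    conf.setOutputKeyClass(Text.class);              // final (reducer) output key type
    conf.setOutputValueClass(DoubleWritable.class);  // final (reducer) output value type
    conf.setMapOutputKeyClass(Text.class);           // intermediate key type emitted by map()
    conf.setMapOutputValueClass(IntWritable.class);  // intermediate value type emitted by map()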

Slide 26: WritableComparator
Compares WritableComparable data.
Will call WritableComparable.compareTo() by default.
Can provide a fast path for serialized data.
Registered via JobConf.setOutputValueGroupingComparator().
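The deck stops at the API name; a hypothetical grouping comparator (using the simpler deserialized compare path rather than the raw-byte fast path) might look like this, registered with conf.setOutputValueGroupingComparator(FirstWordComparator.class):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Hypothetical: group keys by their first word only, so "new york" and "new jersey"
    // values reach the same reduce() call.
    public class FirstWordComparator extends WritableComparator {
      public FirstWordComparator() {
        super(Text.class, true);   // true: instantiate keys so compare(a, b) gets real objects
      }

      public int compare(WritableComparable a, WritableComparable b) {
        String x = a.toString().split(" ")[0];
        String y = b.toString().split(" ")[0];
        return x.compareTo(y);
      }
    }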

Slide 27: Sending Data To The Client
The Reporter object sent to the Mapper allows simple asynchronous feedback:
incrCounter(Enum key, long amount)
setStatus(String msg)
It also allows self-identification of the input:
InputSplit getInputSplit()
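A hedged sketch of a mapper that uses those calls (the class name and counter enum are made up; the interfaces are the old org.apache.hadoop.mapred ones used throughout this deck):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CountingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      enum Records { EMPTY, NONEMPTY }   // made-up counter names

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        if (value.toString().trim().length() == 0) {
          reporter.incrCounter(Records.EMPTY, 1);      // counters surface in the job UI and logs
        } else {
          reporter.incrCounter(Records.NONEMPTY, 1);
          output.collect(value, new LongWritable(1L));
        }
        reporter.setStatus("processed offset " + key.get());   // free-form status string
        // reporter.getInputSplit() would identify which InputSplit this task is processing
      }
    }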

Slide 28: Partition And Shuffle

Slide 29: Partitioner
int getPartition(key, val, numPartitions)
Outputs the partition number for a given key.
One partition == the values sent to one reduce task.
HashPartitioner is used by default; it uses key.hashCode() to compute the partition number.
JobConf sets the Partitioner implementation.
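As an illustration (not from the deck), a custom partitioner under the old mapred API could look like this; it would be registered with conf.setPartitionerClass(FirstLetterPartitioner.class):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hypothetical: send all keys that start with the same character to the same reduce task.
    public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
      public void configure(JobConf job) { }   // no configuration needed

      public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        int c = s.length() == 0 ? 0 : s.charAt(0);
        return (c & Integer.MAX_VALUE) % numPartitions;   // non-negative partition number
      }
    }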

Slide 30: Reduction
reduce(K2 key, Iterator values, OutputCollector output, Reporter reporter)
Keys and values sent to one partition all go to the same reduce task.
Calls are sorted by key: earlier keys are reduced and output before later keys.

Slide 31: Finally: Writing The Output

Slide 32: OutputFormat
Analogous to InputFormat.
TextOutputFormat: writes "key val\n" strings to the output file
SequenceFileOutputFormat: uses a binary format to pack (k, v) pairs
NullOutputFormat: discards output; only useful if defining your own output methods within reduce()

Slide 33: Example Program - Wordcount
map(): receives a chunk of text; outputs a set of word/count pairs
reduce(): receives a key and all its associated values; outputs the key and the sum of the values

    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

Slide 34: Wordcount main()

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }

Slide 35: Wordcount map()

      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }

Slide 36: Wordcount reduce()

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
    }

Slide 37: Hadoop Streaming
Allows you to create and run map/reduce jobs with any executable.
Similar to Unix pipes, e.g.:
Format is: Input | Mapper | Reducer
echo "this sentence has five lines" | cat | wc

Slide 38: Hadoop Streaming
The Mapper and Reducer receive data from stdin and output to stdout.
Hadoop takes care of the transmission of data between the map/reduce tasks.
It is still the programmer's responsibility to set the correct key/value.
Default format: "key \t value\n"
Let's look at a Python example of a MapReduce word count program.

Slide 39: Streaming_Mapper.py

    import sys

    # read in one line of input at a time from stdin
    for line in sys.stdin:
        line = line.strip()    # string
        words = line.split()   # list of strings

        # write data on stdout
        for word in words:
            print '%s\t%i' % (word, 1)

Slide 40: Hadoop Streaming
What are we outputting?
Example output: "the\t1"
By default, "the" is the key and "1" is the value.
Hadoop Streaming handles delivering this key/value pair to a Reducer.
It is able to send similar keys to the same Reducer, or to an intermediary Combiner.

Slide 41: Streaming_Reducer.py

    import sys

    wordcount = {}   # empty dictionary

    # read in one line of input at a time from stdin
    for line in sys.stdin:
        line = line.strip()    # string
        key, value = line.split()
        wordcount[key] = wordcount.get(key, 0) + int(value)

    # write data on stdout
    for word, count in sorted(wordcount.items()):
        print '%s\t%i' % (word, count)
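These two scripts can be smoke-tested without Hadoop using an ordinary shell pipeline that stands in for the framework, e.g. cat input.txt | python Streaming_Mapper.py | sort | python Streaming_Reducer.py (assuming the scripts are saved under the names above and input.txt is any local text file).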

Slide 42: Hadoop Streaming Gotcha
A Streaming Reducer receives single lines (which are key/value pairs) from stdin.
A regular Reducer receives a collection of all the values for a particular key.
It is still the case that all the values for a particular key will go to a single Reducer.

Slide 43: Using Hadoop Distributed File System (HDFS)
You can access HDFS through various shell commands (see the Further Resources slide for a link to the documentation):
hadoop fs -put
hadoop fs -get
hadoop fs -ls
hadoop fs -rm file

Slide 44: Configuring Number of Tasks
Normal method:
jobConf.setNumMapTasks(400)
jobConf.setNumReduceTasks(4)
Hadoop Streaming method:
-jobconf mapred.map.tasks=400
-jobconf mapred.reduce.tasks=4
Note: the number of map tasks is only a hint to the framework; the actual number depends on the number of InputSplits generated.

Slide 45: Running a Hadoop Job
Place the input file into HDFS:
hadoop fs -put ./input-file input-file
Run either the normal or the streaming version:
hadoop jar Wordcount.jar org.myorg.WordCount input-file output-file

hadoop jar hadoop-streaming.jar \
    -input input-file \
    -output output-file \
    -file Streaming_Mapper.py \
    -mapper "python Streaming_Mapper.py" \
    -file Streaming_Reducer.py \
    -reducer "python Streaming_Reducer.py"

Slide 46: Submitting to RC's GridEngine
Add the appropriate modules:
module add apps/jdk/1.6.0_22.x86_64 apps/hadoop/0.20.2
Use the submit script posted on the Further Resources slide.
The script calls internal functions hadoop_start and hadoop_end.
Adjust the lines for transferring the input file to HDFS and starting the Hadoop job, using the commands on the previous slide.
Adjust the expected runtime (generally good practice to overshoot your estimate):
#$ -l h_rt=02:00:00
NOTICE: All jobs are required to have a hard run-time specification. Jobs that do not have this specification will have a default run-time of 10 minutes and will be stopped at that point.

Slide 47: Output Parsing
The output of the reduce tasks must be retrieved:
hadoop fs -get output-file hadoop-output
This creates a directory of output files, one per reduce task.
Output files are numbered part-00000, part-00001, etc.
Sample output of Wordcount:
head -n5 part-00000
'tis     1
come     2
coming   1
edwin    1
found    1

Slide 48: Extra Output
The stdout/stderr streams of Hadoop itself will be stored in an output file (whichever one is named in the startup script):
#$ -o output.$job_id

STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = svc-3024-8-10.rc.usf.edu/10.250.4.205
11/03/02 18:28:47 INFO mapred.FileInputFormat: Total input paths to process : 1
11/03/02 18:28:47 INFO mapred.JobClient: Running job: job_local_0001
11/03/02 18:28:48 INFO mapred.MapTask: numReduceTasks: 1
11/03/02 18:28:48 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/03/02 18:28:48 INFO mapred.Merger: Merging 1 sorted segments
11/03/02 18:28:48 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 43927 bytes
11/03/02 18:28:48 INFO mapred.JobClient:  map 100% reduce 0%
11/03/02 18:28:49 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
11/03/02 18:28:49 INFO mapred.JobClient: Job complete: job_local_0001

Slide 49: Thank You

Figure labels recovered from the slides' diagrams (summarized):
"MapReduce: High Level" (slide 5): a MapReduce job submitted by a client computer goes to the JobTracker on the master node; TaskTracker instances on the slave nodes each run task instances.
"Getting Data To The Mapper" (slide 19): an InputFormat divides the input files into InputSplits; each split is read by a RecordReader, which feeds records to a Mapper that emits intermediates.
"Partition And Shuffle" (slide 28): each Mapper's intermediates pass through a Partitioner and are shuffled to the Reducers.
"Finally: Writing The Output" (slide 31): each Reducer writes through a RecordWriter to an output file, as governed by the OutputFormat.